How fast can a computer read all of Wikipedia?
Wikipedia is consistently ranked among the most-visited websites in the world and contains a huge amount of content. Its pages can be edited by anyone, and it is maintained by a large network of contributors around the world.
When a website makes a large database of information available to all internet users, it is useful to be able to automate access to that information for further analysis. This can be done through an Application Programming Interface, or API.
In the case of Wikipedia, this information takes the form of text strings. The systematic analysis of strings to study their properties is sometimes called text analytics. It can be applied in a wide range of settings and makes it possible to study volumes of text far greater than a human could ever read in a lifetime.
This challenge looks at using a simple API to access Wikipedia content, and then running some basic text analytics functions on the resulting data.
Target:
Write one page of Python that will:
- Search for and select the ‘Hello World’ page, then print its ‘Summary’ to the terminal.
- Run frequency analysis (i.e. count the occurrences of each letter) on the content of the ‘Oxford’ page.
- Create a Wordcloud for the ‘Content’ data from a page of your choice. (A sketch covering all three tasks follows this list.)
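As a starting point, here is a minimal sketch of all three tasks. It assumes the wikipedia and wordcloud packages (plus matplotlib) are installed, and the page title used for the Wordcloud is only an illustrative choice.

```python
import wikipedia                  # pip install wikipedia
from collections import Counter
from wordcloud import WordCloud   # pip install wordcloud
import matplotlib.pyplot as plt

# 1. Search for the 'Hello World' page and print its summary.
#    We assume the first search result is the page we want.
results = wikipedia.search("Hello World")
hello = wikipedia.page(results[0], auto_suggest=False)
print(hello.summary)

# 2. Frequency analysis: count the occurrences of each letter
#    in the content of the 'Oxford' page.
oxford = wikipedia.page("Oxford", auto_suggest=False)
letters = Counter(c for c in oxford.content.lower() if c.isalpha())
for letter, count in sorted(letters.items()):
    print(letter, count)

# 3. Wordcloud of the content of a page of your choice
#    ('Python (programming language)' is just an example title).
choice = wikipedia.page("Python (programming language)", auto_suggest=False)
cloud = WordCloud().generate(choice.content)
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```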
Extensions:
- Starting at a page of your choice, choose a random link on the page, and print the title of the page selected.
- Repeat the above random link selection 20 times, printing the title of each page you visit.
- Estimate how long it would take to read all of Wikipedia by calling pages from this API. (A sketch covering these extensions follows this list.)
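The extensions might be sketched along the following lines. The walk skips disambiguation and missing pages by simply picking another link, and the total article count used in the final estimate is an assumed round figure that you should replace with a looked-up value.

```python
import random
import time
import wikipedia
from wikipedia.exceptions import DisambiguationError, PageError

# Start from a page of your choice and follow 20 random links,
# printing the title of each page visited along the way.
page = wikipedia.page("Oxford", auto_suggest=False)
start = time.time()
visited = 0
while visited < 20:
    try:
        page = wikipedia.page(random.choice(page.links), auto_suggest=False)
        print(page.title)
        visited += 1
    except (DisambiguationError, PageError):
        continue  # pick a different link from the current page

# Rough estimate: average seconds per fetched page, scaled up to the
# whole encyclopaedia. 6,000,000 is an assumed article count.
seconds_per_page = (time.time() - start) / visited
TOTAL_ARTICLES = 6_000_000
days = seconds_per_page * TOTAL_ARTICLES / 86400
print(f"Estimated time to fetch every page: {days:.0f} days")
```

Wrapping each fetch in try/except keeps the walk going when a randomly chosen link points to a disambiguation or non-existent page.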
Python Challenge 3 Hints and Tips
Packages
- In this challenge we use the wikipedia package, a Python wrapper that calls the Wikipedia API. As this is a Python wrapper, it is relatively straightforward to access its documentation using the help() command within Python (see the snippet after this list).
- To create the Wordcloud, we used the wordcloud package, although it would be possible to create your own Wordcloud using a powerful technique called regular expressions. However, we do not explicitly use regular expressions in this challenge.
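For example, from a Python prompt (this assumes the wikipedia package is installed):

```python
import wikipedia

help(wikipedia)        # overview of the whole package
help(wikipedia.page)   # documentation for a single function
```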
Hints
- Python has a wide range of built-in commands for analysing string data. Try to use this functionality rather than writing your own code, where possible.
- The API requires quite specific page names to retrieve the information. Using the results of wikipedia.search() might help you find the specific page name you are looking for, as shown below.
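For example, a loose query can be turned into an exact page title like this (the candidate list shown in the comment is only indicative):

```python
import wikipedia

# A loose query returns a list of candidate page titles.
candidates = wikipedia.search("Oxford Uni")
print(candidates)  # e.g. ['University of Oxford', ...]

# Pass an exact title from that list to wikipedia.page().
page = wikipedia.page(candidates[0], auto_suggest=False)
print(page.title)
```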
Notes
- The relative frequency of characters is probably the simplest analysis that can be performed on text data; search for the ‘Text Mining’ page to find out more about this topic.
- You are likely to find that it would take a very long time for a computer to ‘read’ all of Wikipedia using the API. What is the bottleneck?