Here's the assignment:
Download this raw statistics dump from Wikipedia (360 MB unzipped):
http://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-10/pagecounts-20141029-230000.gz
Write a simple script in your favourite programming language that:
- Gets all page views for the English Wikipedia (these lines are prefixed with "en ")
- Limits those articles to the ones with at least 500 views
- Sorts them by number of views, highest first, and prints the first ten articles
- Measures the time this takes and prints it as well
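The steps above could be sketched in Python roughly as follows. This is only one possible approach, not a reference solution. It assumes each line of the dump has the form "<project> <title> <views> <bytes>" separated by single spaces; the names top_articles and main are made up for this example.

```python
import gzip
import sys
import time


def top_articles(lines, min_views=500, limit=10):
    """Return up to `limit` (title, views) pairs for English Wikipedia,
    keeping only articles with at least `min_views` views."""
    articles = []
    for line in lines:
        # Assumed line format: "<project> <title> <views> <bytes>"
        if not line.startswith("en "):
            continue
        parts = line.split(" ")
        if len(parts) < 3:
            continue  # skip malformed lines
        try:
            views = int(parts[2])
        except ValueError:
            continue  # skip lines whose view count is not a number
        if views >= min_views:
            articles.append((parts[1], views))
    # Highest view counts first
    articles.sort(key=lambda item: item[1], reverse=True)
    return articles[:limit]


def main(path):
    start = time.perf_counter()
    # Accept either the .gz download or the unzipped file
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
        results = top_articles(handle)
    elapsed = time.perf_counter() - start
    print(f"Query took {elapsed:.2f} seconds")
    for title, views in results:
        print(f"{title} ({views})")


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Run it with the path to the downloaded file as the only argument. Filtering before sorting keeps the sort small, since only a small fraction of lines pass the 500-view threshold.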
Here are some measurements (average of five runs, with the individual runs in parentheses):
- Node.js: 7.10s (7.56, 7.18, 7.01, 6.89, 6.88)
- PHP: 8.42s (8.54, 8.31, 8.47, 8.41, 8.39)
- Python: 9.26s (8.35, 8.54, 10.43, 9.35, 9.62)
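The headline figure for each language matches the mean of the five runs listed in parentheses. A quick check, with the run times copied from the list above (the function name average_runs is made up for this example):

```python
from statistics import mean

# Run times in seconds, copied from the measurements above
runs = {
    "Node.js": [7.56, 7.18, 7.01, 6.89, 6.88],
    "PHP": [8.54, 8.31, 8.47, 8.41, 8.39],
    "Python": [8.35, 8.54, 10.43, 9.35, 9.62],
}


def average_runs(times):
    """Return the mean run time rounded to two decimals."""
    return round(mean(times), 2)


for name, times in runs.items():
    print(f"{name}: {average_runs(times):.2f}s")
```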
Your output should look like this:
Query took 7.56 seconds
Main_Page (394296)
Malware (51666)
Loan (45440)
Special:HideBanners (40771)
Y%C5%AB_Kobayashi (34596)
Special:Search (18672)
Glutamate_flavoring (17508)
Online_shopping (16310)
Chang_and_Eng_Bunker (14956)
Dance_Moms (8928)