Here's the assignment:
Download this raw statistics dump from Wikipedia (360 MB unzipped):
http://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-10/pagecounts-20141029-230000.gz
Write a simple script in your favourite programming language that:
- Gets all page views for the English Wikipedia (these lines are prefixed by "en ")
- Limits those articles to the ones with at least 500 views
- Sorts them by number of views, highest first, and prints the first ten articles
- Measures the time this takes and prints that out as well
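For reference, here is a minimal Python sketch of the steps above. It is one possible approach, not a tuned or official solution. It assumes the dump has already been unzipped and that every line is space-separated as "project title views bytes"; the names top_articles and main are made up for this example.

```python
import sys
import time


def top_articles(lines, project="en", min_views=500, limit=10):
    """Return up to `limit` (title, views) pairs for `project`, most viewed first."""
    prefix = project + " "  # the trailing space keeps out "en.b", "en.d", ...
    rows = []
    for line in lines:
        if not line.startswith(prefix):
            continue
        # Assumed layout: "<project> <title> <views> <bytes>"
        parts = line.split(" ")
        if len(parts) < 3:
            continue  # skip malformed lines
        try:
            views = int(parts[2])
        except ValueError:
            continue  # skip lines whose view count is not a number
        if views >= min_views:
            rows.append((parts[1], views))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:limit]


def main(path):
    start = time.perf_counter()
    # errors="replace" is a precaution in case the dump has invalid bytes
    with open(path, encoding="utf-8", errors="replace") as handle:
        result = top_articles(handle)
    elapsed = time.perf_counter() - start
    print(f"Query took {elapsed:.2f} seconds")
    for title, views in result:
        print(f"{title} ({views})")


# Only runs when a file path is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Save it under any name and pass the unzipped file as the only argument. The timer here covers reading, filtering and sorting but not the printing; move the perf_counter calls if you want to measure something else.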
Currently I've got these results (average of 5 runs, in seconds):
- Python: 11.35 (12.29, 10.99, 10.99, 10.85, 11.63)
- PHP: 14.23 (14.079, 13.937, 14.295, 13.889, 14.931)
- Node.JS: 8.74 (8.86, 8.20, 8.23, 9.65, 8.75)
Your output should look like this:
Query took 8.20 seconds
Main_Page (394296)
Malware (51666)
Loan (45440)
Y%C5%AB_Kobayashi (34596)
Glutamate_flavoring (17508)
Online_shopping (16310)
Chang_and_Eng_Bunker (14956)
Dance_Moms (8928)
Browser_game (8676)
Nikola_Tesla (8272)