@c-rack
Forked from hay/WMStats.java
Last active August 29, 2015 14:08

Here's the assignment:

Download this raw statistics dump from Wikipedia (360 MB unzipped):

http://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-10/pagecounts-20141029-230000.gz

Write a simple script in your favourite programming language that:

  • Gets all page views for the English Wikipedia (lines prefixed with "en ")
  • Limits those articles to the ones with at least 500 views
  • Sorts by number of views, highest first, and prints the first ten articles
  • Measures the time this takes and prints that as well
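
The steps above could be sketched as one pass over the lines. This is only an illustration under stated assumptions, not the script from this gist: the name `top_articles` and its parameters are made up for the example, each line is assumed to hold four space-separated fields (project, article, views, size), and `heapq.nlargest` is swapped in for a full sort since only ten rows are printed.

```python
import heapq
import time


def top_articles(lines, prefix="en ", minviews=500, limit=10):
    """Return the `limit` most-viewed (article, views) pairs for one project."""
    candidates = []
    for line in lines:
        if not line.startswith(prefix):
            continue
        fields = line.split(" ")
        if len(fields) != 4:
            continue  # assumption: silently skip malformed lines
        views = int(fields[2])
        if views >= minviews:  # "at least 500 views"
            candidates.append((fields[1], views))
    # Pick the top `limit` entries by view count, highest first.
    return heapq.nlargest(limit, candidates, key=lambda item: item[1])


if __name__ == "__main__":
    # Invented sample lines, standing in for the real dump file.
    sample = [
        "en Example_A 900 12345\n",
        "de Example_B 5000 12345\n",
        "en Example_C 499 12345\n",
        "en Example_D 500 12345\n",
    ]
    start = time.time()
    top = top_articles(sample)
    elapsed = time.time() - start
    print("Query took %.2f seconds" % elapsed)
    for article, views in top:
        print("%s (%d)" % (article, views))
```

To run it against the real dump, pass an open file object instead of `sample`; iterating over the file avoids holding every line in memory at once.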

Currently I've got these results (average of 5 runs in seconds, followed by the individual runs):

  • Python: 11.35 (12.29, 10.99, 10.99, 10.85, 11.63)
  • PHP: 14.23 (14.079, 13.937, 14.295, 13.889, 14.931)
  • Node.JS: 8.74 (8.86, 8.20, 8.23, 9.65, 8.75)
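
Figures in this form (a mean followed by the individual runs) could be collected with a small timing helper. The sketch below is an assumption about how one might do it, not how the numbers above were produced; `benchmark`, `format_result`, and `workload` are hypothetical names, and `workload` is a stand-in for the real query.

```python
import time


def benchmark(func, runs=5):
    """Call func() `runs` times; return (mean seconds, per-run seconds)."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings), timings


def format_result(label, mean, timings):
    """Format as 'Label: mean (run1, run2, ...)' with two decimals."""
    runs = ", ".join("%.2f" % t for t in timings)
    return "%s: %.2f (%s)" % (label, mean, runs)


def workload():
    # Placeholder work; replace with the actual query being measured.
    sorted(range(100000), reverse=True)


if __name__ == "__main__":
    mean, timings = benchmark(workload)
    print(format_result("Python", mean, timings))
```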

Your output should look like this:

Query took 8.20 seconds
Main_Page (394296)
Malware (51666)
Loan (45440)
Y%C5%AB_Kobayashi (34596)
Glutamate_flavoring (17508)
Online_shopping (16310)
Chang_and_Eng_Bunker (14956)
Dance_Moms (8928)
Browser_game (8676)
Nikola_Tesla (8272)
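
import time

filename = "pagecounts-20141029-230000"
minviews = 500
prefix = "en "

# Read the whole dump up front; the timer below covers only the query itself.
# errors="replace" keeps any stray non-UTF-8 bytes from aborting the read.
with open(filename, "r", encoding="utf-8", errors="replace") as f:
    lines = f.readlines()

start = time.time()
count = []
for line in lines:
    if prefix and not line.startswith(prefix):
        continue
    # Each line is "<project> <article> <views> <size>".
    (lang, article, views, size) = line.split(" ")
    views = int(views)
    # "At least 500 views" means >=, not >.
    if views >= minviews:
        count.append((article, views))

# Most-viewed articles first.
count.sort(key=lambda x: x[1], reverse=True)
elapsed = time.time() - start

print("Query took %.2f seconds" % elapsed)
for item in count[0:10]:
    print("%s (%s)" % item)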