@c-rack
Forked from hay/WMStats.java
Last active August 29, 2015 14:08
Here's the assignment:

Download this raw statistics dump from Wikipedia (360 MB unzipped):

http://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-10/pagecounts-20141029-230000.gz

Write a simple script in your favourite programming language that:

  • Gets all page views for the English Wikipedia (these lines are prefixed by "en ")
  • Keeps only the articles with at least 500 views
  • Sorts them by number of views, highest first, and prints the first ten articles
  • Measures the time this takes and prints that as well
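
The steps above can be sketched on a few in-memory lines before touching the full dump. This is a minimal sketch, assuming each line has the layout `project title views bytes` separated by single spaces. The sample lines and the `top_articles` helper are hypothetical, and `heapq.nlargest` is swapped in for a full sort because only the first ten entries are needed.

```python
import heapq

# Hypothetical sample in the assumed layout: "project title views bytes".
SAMPLE = [
    "de Hauptseite 9000 123456\n",
    "en Main_Page 394296 987654321\n",
    "en Malware 51666 1234567\n",
    "en Rare_Page 12 3456\n",
    "en.b Cookbook 700 9999\n",
]

def top_articles(lines, prefix="en ", minviews=500, limit=10):
    """Return up to `limit` (title, views) pairs, most viewed first."""
    hits = []
    for line in lines:
        # The trailing space in the prefix keeps out projects such as "en.b".
        if not line.startswith(prefix):
            continue
        _project, title, views, _size = line.split(" ")
        views = int(views)
        if views >= minviews:
            hits.append((title, views))
    # Pick the largest entries without sorting the whole list.
    return heapq.nlargest(limit, hits, key=lambda item: item[1])

for title, views in top_articles(SAMPLE):
    print("%s (%d)" % (title, views))
```

On the real file the same function could be fed an open file object instead of `SAMPLE`, since it only iterates over lines.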

Here are some measurements (mean of five runs, with the individual runs in parentheses):

  • Node.js: 7.10s (7.56, 7.18, 7.01, 6.89, 6.88)
  • PHP: 8.42s (8.54, 8.31, 8.47, 8.41, 8.39)
  • Python: 9.26s (8.35, 8.54, 10.43, 9.35, 9.62)
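
Each figure above matches the mean of the five values listed after it. A minimal sketch of how such a line could be produced is shown here; the `time_runs` helper and the stand-in workload are hypothetical, and whether the original numbers were collected this way is an assumption.

```python
import statistics
import time

def time_runs(task, runs=5):
    """Call task() `runs` times; return (mean_seconds, per_run_seconds)."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        task()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), samples

# Stand-in workload; replace the lambda with the real query.
mean, samples = time_runs(lambda: sorted(range(100000), reverse=True))
print("%.2fs (%s)" % (mean, ", ".join("%.2f" % s for s in samples)))
```

Timing in-process like this excludes interpreter start-up, which is also what the scripts in this gist measure.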

Your output should look like this:

Query took 7.56 seconds
Main_Page (394296)
Malware (51666)
Loan (45440)
Special:HideBanners (40771)
Y%C5%AB_Kobayashi (34596)
Special:Search (18672)
Glutamate_flavoring (17508)
Online_shopping (16310)
Chang_and_Eng_Bunker (14956)
Dance_Moms (8928)
var fs = require('fs');
var readline = require('readline');

var filename = 'pagecounts-20141029-230000';
var minviews = 500;
var prefix = 'en ';

// Read the dump line by line instead of loading it into memory.
var rd = readline.createInterface({
    input: fs.createReadStream(filename),
    output: process.stdout,
    terminal: false
});

var start = Date.now();
var count = [];

rd.on('line', function(line) {
    // Skip every project except the English Wikipedia.
    if (line.indexOf(prefix) !== 0) {
        return;
    }

    // Fields used here: [0] project, [1] article title, [2] view count.
    var parts = line.split(" ");
    var article = parts[1];
    var views = parseInt(parts[2], 10);

    // "At least 500 views" includes exactly 500.
    if (views >= minviews) {
        count.push([article, views]);
    }
});

rd.on('close', function() {
    // Highest view counts first.
    count.sort(function(a, b) {
        return b[1] - a[1];
    });

    var elapsed = (Date.now() - start) / 1000;
    console.log("Query took " + elapsed.toFixed(2) + " seconds");

    count.slice(0, 10).forEach(function(item) {
        console.log(item[0] + " (" + item[1] + ")");
    });
});
<?php
$filename = "pagecounts-20141029-230000";
$minviews = 500;
$prefix = "en ";
$len = strlen($prefix);

$f = fopen($filename, "r");
$start = microtime(true);
$count = [];

while (($line = fgets($f)) !== false) {
    // Skip every project except the English Wikipedia.
    if (substr($line, 0, $len) !== $prefix) {
        continue;
    }

    // Fields used here: [0] project, [1] article title, [2] view count.
    $pieces = explode(" ", $line);
    $article = $pieces[1];
    $views = (int) $pieces[2];

    // "At least 500 views" includes exactly 500.
    if ($views >= $minviews) {
        $count[] = [$article, $views];
    }
}
fclose($f);

// Highest view counts first.
usort($count, function($a, $b) {
    return $b[1] - $a[1];
});

$elapsed = round(microtime(true) - $start, 2);
echo "Query took $elapsed seconds\n";

$count = array_slice($count, 0, 10);
foreach ($count as $item) {
    printf("%s (%d)\n", $item[0], $item[1]);
}
import time

filename = "pagecounts-20141029-230000"
minviews = 500
prefix = "en "

# Start the clock before reading so file I/O is part of the measurement.
start = time.time()
count = []

with open(filename, "r") as f:
    for line in f:
        if prefix and not line.startswith(prefix):
            continue
        (lang, article, views, size) = line.split(" ")
        views = int(views)
        # "At least 500 views" includes exactly 500.
        if views >= minviews:
            count.append((article, views))

count.sort(key=lambda x: x[1], reverse=True)
elapsed = time.time() - start

# Single-argument print(...) works on both Python 2 and Python 3.
print("Query took %s seconds" % round(elapsed, 2))
for item in count[0:10]:
    print("%s (%s)" % item)