Created
October 3, 2014 16:39
alexstorer created this gist
Oct 3, 2014
{
 "metadata": {
  "name": "",
  "signature": "sha256:49da2f337895f3f82db401d063148d8101a5a7a881b7a256ec368b47f70d8b47"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The Python \"Pipeline\"\n",
      "-----------------\n",
      "\n",
      "The 'pipeline' is the way we go from raw input to processed output. In many cases, the raw input is a series of files on your hard drive, and the output is a single csv file.\n",
      "\n",
      "The two most common ways to query for files from within Python are `glob.glob` and `os.walk`. `glob.glob` tries to emulate the Unix `ls` command: it returns the paths that match a wildcard pattern. `os.walk` crawls a directory and all of its subdirectories, listing every file it finds.\n",
      "\n",
      "My favorite way to write a csv from Python is the `csv.DictWriter` tool. It takes a Python dictionary and treats it as a row of a csv file. Suppose the dictionary is called `row` and the columns are `c0`, `c1`, etc. In Python, we store data as `row['c0'] = exampledata`. To make a `DictWriter`, we need to provide an open file to write to, as well as the names of the columns of the csv file.\n",
      "\n",
      "---------\n",
      "\n",
      "Now, let's look at a full example that reads in a set of web logs, uses a simple regular expression to find the HTTP status codes, and writes the counts out to a csv file."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import glob\n",
      "import csv\n",
      "import re\n",
      "\n",
      "# raw string so the backslashes reach the regex engine unchanged;\n",
      "# the single group captures the HTTP status code\n",
      "status_re = re.compile(r'HTTP/\\d+\\.\\d+\" (\\d+)')\n",
      "\n",
      "fout = open('output.csv','w')\n",
      "fieldnames = ['file','200','404']\n",
      "dw = csv.DictWriter(fout,fieldnames)\n",
      "dw.writeheader()\n",
      "\n",
      "weblogs = glob.glob('/Users/astorer/Teaching/2015_programming/Data/weblogs/*.log*')\n",
      "\n",
      "\n",
      "for w in weblogs:\n",
      "    f = open(w,'r')\n",
      "    # load the entire contents of the file into memory\n",
      "    logdata = f.read()\n",
      "    results = status_re.findall(logdata)\n",
      "    f.close()\n",
      "\n",
      "    # count how many times each status code appears in this file\n",
      "    countdict = dict()\n",
      "    for r in results:\n",
      "        if r in countdict:\n",
      "            countdict[r]+=1\n",
      "        else:\n",
      "            countdict[r] = 1\n",
      "\n",
      "    # one csv row per log file; a code that never appears counts as 0\n",
      "    row = dict()\n",
      "    row['file'] = w\n",
      "    row['200'] = countdict.get('200', 0)\n",
      "    row['404'] = countdict.get('404', 0)\n",
      "    dw.writerow(row)\n",
      "fout.close()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 32
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "countdict"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 41,
       "text": [
        "{'103': 2,\n",
        " '200': 98655,\n",
        " '206': 2124,\n",
        " '301': 4923,\n",
        " '302': 1828,\n",
        " '304': 1563,\n",
        " '403': 990,\n",
        " '404': 13708,\n",
        " '500': 46}"
       ]
      }
     ],
     "prompt_number": 41
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "fin = open('output.csv','r')\n",
      "for line in fin:\n",
      "    print(line)\n",
      "fin.close()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "file,200,404\r\n",
        "\n",
        "/Users/astorer/Teaching/2015_programming/Data/weblogs/access.log-20140915,88775,10082\r\n",
        "\n",
        "/Users/astorer/Teaching/2015_programming/Data/weblogs/access.log-20140916,99968,15044\r\n",
        "\n",
        "/Users/astorer/Teaching/2015_programming/Data/weblogs/access.log-20140917,100206,14359\r\n",
        "\n",
        "/Users/astorer/Teaching/2015_programming/Data/weblogs/access.log-20140918,96831,13989\r\n",
"\n", "/Users/astorer/Teaching/2015_programming/Data/weblogs/access.log-20140919,96998,13430\r\n", "\n", "/Users/astorer/Teaching/2015_programming/Data/weblogs/access.log-20140920,96628,12062\r\n", "\n", "/Users/astorer/Teaching/2015_programming/Data/weblogs/access.log-20140921,86208,5210\r\n", "\n", "/Users/astorer/Teaching/2015_programming/Data/weblogs/access.log-20140922,76706,9033\r\n", "\n", "/Users/astorer/Teaching/2015_programming/Data/weblogs/access.log-20140923,89607,11692\r\n", "\n", "/Users/astorer/Teaching/2015_programming/Data/weblogs/access.log-20140924,98655,13708\r\n", "\n" ] } ], "prompt_number": 33 } ], "metadata": {} } ] }