@alexstorer
Created October 3, 2014 16:39
Revisions

  1. alexstorer created this gist Oct 3, 2014.
    146 changes: 146 additions & 0 deletions pipeline.ipynb
    @@ -0,0 +1,146 @@
    {
    "metadata": {
    "name": "",
    "signature": "sha256:49da2f337895f3f82db401d063148d8101a5a7a881b7a256ec368b47f70d8b47"
    },
    "nbformat": 3,
    "nbformat_minor": 0,
    "worksheets": [
    {
    "cells": [
    {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
    "The Python \"Pipeline\"\n",
    "-----------------\n",
    "\n",
    "The 'pipeline' is the way we go from raw input to processed output. In many cases, the raw input is in a series of files on your hard drive, and the output will be a csv file.\n",
    "\n",
    "The two most common ways to query for files from within python are `glob.glob` and `os.walk`. `glob` tries to emulate the Unix `ls` command, while `os.walk` craws a directory for all subfiles.\n",
    "\n",
    "My favorite way to write a csv from Python is using the `csv.DictWriter` tool. It takes a python dictionary and treats it as a row of a csv file. Pretend that a dictionary is called `row` and the columns are `c0`, `c1`, etc. In python, we will store data as `row[c0] = exampledata`. To make a `DictWriter`, we will need to proide the location of the file, as well as the columns of the csv file.\n",
    "\n",
    "---------\n",
    "\n",
    "Now, let's look at a full example that reads things in, finds a simple regular expression, and writes out the data."
    ]
    },
    {
    "cell_type": "code",
    "collapsed": false,
    "input": [
    "import glob\n",
    "import csv\n",
    "import re\n",
    "\n",
    "fout = open('output.csv','w')\n",
    "fieldnames = ['file','200','404']\n",
    "dw = csv.DictWriter(fout,fieldnames)\n",
    "dw.writeheader()\n",
    "\n",
    "weblogs = glob.glob('/Users/astorer/Teaching/2015_programming/Data/weblogs/*.log*')\n",
    "\n",
    "\n",
    "for w in weblogs:\n",
    " f = open(w,'r')\n",
    " # load the entire contents of the file into memory\n",
    " logdata = f.read()\n",
    " results = re.findall('HTTP/\\d+\\.\\d+\\\" (\\d+)',logdata)\n",
    " f.close()\n",
    " \n",
    " countdict = dict()\n",
    " for r in results:\n",
    " if r in countdict:\n",
    " countdict[r]+=1\n",
    " else:\n",
    " countdict[r] = 1\n",
    "\n",
    " row = dict()\n",
    " row['file'] = w\n",
    " row['200'] = countdict['200']\n",
    " row['404'] = countdict['404']\n",
    " dw.writerow(row)\n",
    "fout.close()"
    ],
    "language": "python",
    "metadata": {},
    "outputs": [],
    "prompt_number": 32
    },
    {
    "cell_type": "code",
    "collapsed": false,
    "input": [
    "countdict"
    ],
    "language": "python",
    "metadata": {},
    "outputs": [
    {
    "metadata": {},
    "output_type": "pyout",
    "prompt_number": 41,
    "text": [
    "{'103': 2,\n",
    " '200': 98655,\n",
    " '206': 2124,\n",
    " '301': 4923,\n",
    " '302': 1828,\n",
    " '304': 1563,\n",
    " '403': 990,\n",
    " '404': 13708,\n",
    " '500': 46}"
    ]
    }
    ],
    "prompt_number": 41
    },
    {
    "cell_type": "code",
    "collapsed": false,
    "input": [
    "fout = open('output.csv','r')\n",
    "for line in fout:\n",
    " print line\n",
    "fout.close()"
    ],
    "language": "python",
    "metadata": {},
    "outputs": [
    {
    "output_type": "stream",
    "stream": "stdout",
    "text": [
    "file,200,404\r\n",
    "\n",
    "/Users/astorer/Teaching/2015_programming/Data/weblogs/access.log-20140915,88775,10082\r\n",
    "\n",
    "/Users/astorer/Teaching/2015_programming/Data/weblogs/access.log-20140916,99968,15044\r\n",
    "\n",
    "/Users/astorer/Teaching/2015_programming/Data/weblogs/access.log-20140917,100206,14359\r\n",
    "\n",
    "/Users/astorer/Teaching/2015_programming/Data/weblogs/access.log-20140918,96831,13989\r\n",
    "\n",
    "/Users/astorer/Teaching/2015_programming/Data/weblogs/access.log-20140919,96998,13430\r\n",
    "\n",
    "/Users/astorer/Teaching/2015_programming/Data/weblogs/access.log-20140920,96628,12062\r\n",
    "\n",
    "/Users/astorer/Teaching/2015_programming/Data/weblogs/access.log-20140921,86208,5210\r\n",
    "\n",
    "/Users/astorer/Teaching/2015_programming/Data/weblogs/access.log-20140922,76706,9033\r\n",
    "\n",
    "/Users/astorer/Teaching/2015_programming/Data/weblogs/access.log-20140923,89607,11692\r\n",
    "\n",
    "/Users/astorer/Teaching/2015_programming/Data/weblogs/access.log-20140924,98655,13708\r\n",
    "\n"
    ]
    }
    ],
    "prompt_number": 33
    }
    ],
    "metadata": {}
    }
    ]
    }
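
The pipeline described in the notebook above (find the log files, pull out the status codes with a regular expression, write one csv row per file) can also be sketched with `os.walk` in place of `glob.glob`. This is only a minimal sketch, written for Python 3 rather than the Python 2 used in the notebook, and it is not the notebook's own code: the helper names `find_logs`, `count_status_codes`, and `write_counts` are hypothetical, and `collections.Counter` is swapped in for the hand-built counting dictionary.

```python
import csv
import os
import re
from collections import Counter

# Same pattern as the notebook: the status code that follows e.g. HTTP/1.1"
STATUS_RE = re.compile(r'HTTP/\d+\.\d+" (\d+)')


def find_logs(root):
    """Yield every path under root whose file name contains '.log'.

    os.walk recurses into subdirectories, unlike a single glob pattern.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # sorting in place makes the traversal order stable
        for name in sorted(filenames):
            if ".log" in name:
                yield os.path.join(dirpath, name)


def count_status_codes(text):
    """Return a Counter mapping status-code strings to their number of matches."""
    return Counter(STATUS_RE.findall(text))


def write_counts(log_paths, out_path, codes=("200", "404")):
    """Write one csv row per log file, with one column per code in codes."""
    fieldnames = ["file"] + list(codes)
    # newline="" lets the csv module control line endings (Python 3)
    with open(out_path, "w", newline="") as fout:
        writer = csv.DictWriter(fout, fieldnames)
        writer.writeheader()
        for path in log_paths:
            with open(path, "r", errors="replace") as f:
                counts = count_status_codes(f.read())
            row = {"file": path}
            for code in codes:
                row[code] = counts[code]  # a Counter returns 0 for unseen codes
            writer.writerow(row)
```

A call such as `write_counts(find_logs('weblogs'), 'output.csv')` would then produce the same three columns as the notebook, where `weblogs` stands in for whatever directory holds the logs. Because a `Counter` returns 0 for a missing key, a log file with no 404 responses gets a 0 in that column.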