
@Smerity
Last active May 1, 2017 19:45

Revisions

  1. Smerity revised this gist Jul 30, 2014. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion gistfile1.md
    @@ -1,6 +1,6 @@
    ## Data Location

    The Common Crawl dataset lives on Amazon S3. As part of the [Amazon Public Datasets](http://aws.amazon.com/public-data-sets/) program, downloading them for use is free from any instance on Amazon EC2. The files are also available freely as HTTP downloads.
    The Common Crawl dataset lives on Amazon S3 as part of the [Amazon Public Datasets](http://aws.amazon.com/public-data-sets/) program. Downloading them is free from any instance on Amazon EC2, both via S3 and HTTP.

    As the Common Crawl Foundation has evolved over the years, so has the format and metadata that accompany the crawls themselves.

  2. Smerity revised this gist Jul 30, 2014. 1 changed file with 5 additions and 1 deletion.
    6 changes: 5 additions & 1 deletion gistfile1.md
    @@ -1,6 +1,6 @@
    ## Data Location

    The Common Crawl dataset lives on Amazon S3. As part of the Amazon Public Datasets program, downloading them for use is free from any instance on Amazon EC2. The files are also available freely as HTTP downloads.
    The Common Crawl dataset lives on Amazon S3. As part of the [Amazon Public Datasets](http://aws.amazon.com/public-data-sets/) program, downloading them for use is free from any instance on Amazon EC2. The files are also available freely as HTTP downloads.

    As the Common Crawl Foundation has evolved over the years, so has the format and metadata that accompany the crawls themselves.

    @@ -10,8 +10,12 @@ As the Common Crawl Foundation has evolved over the years, so has the format and
    + [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20/
    + [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-48/
    + [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-10/
    + [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-15/

    All crawls since 2013 are stored in the WARC file format and also include metadata and text data extracts.
    Starting with the 2014-15 crawl, we also provide file path lists for the segments, WARC, WAT, and WET files.

    By replacing s3://aws-publicdatasets/ with https://aws-publicdatasets.s3.amazonaws.com/ on each line, you can obtain the HTTP path for any of the files stored on S3.
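    For example, a minimal Java sketch of that substitution (the path below is one of the crawl prefixes listed above; the method name is ours):

    ```java
    public class S3ToHttp {
        private static final String S3_PREFIX = "s3://aws-publicdatasets/";
        private static final String HTTP_PREFIX = "https://aws-publicdatasets.s3.amazonaws.com/";

        // Rewrites one line of a path listing into its HTTP download URL.
        static String toHttpUrl(String s3Path) {
            if (!s3Path.startsWith(S3_PREFIX)) {
                throw new IllegalArgumentException("Unexpected path: " + s3Path);
            }
            return HTTP_PREFIX + s3Path.substring(S3_PREFIX.length());
        }

        public static void main(String[] args) {
            System.out.println(toHttpUrl("s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-15/"));
            // -> https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2014-15/
        }
    }
    ```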

    ## Data Format

  3. Smerity revised this gist Jul 30, 2014. 1 changed file with 22 additions and 5 deletions.
    27 changes: 22 additions & 5 deletions gistfile1.md
    @@ -1,5 +1,22 @@
    ## Data Location

    The Common Crawl dataset lives on Amazon S3. As part of the Amazon Public Datasets program, downloading them for use is free from any instance on Amazon EC2. The files are also available freely as HTTP downloads.

    As the Common Crawl Foundation has evolved over the years, so has the format and metadata that accompany the crawls themselves.

    + [ARC] Archived Crawl #1 - s3://aws-publicdatasets/common-crawl/crawl-001/ - crawl data from 2008/2010
    + [ARC] Archived Crawl #2 - s3://aws-publicdatasets/common-crawl/crawl-002/ - crawl data from 2009/2010
    + [ARC] Archived Crawl #3 - s3://aws-publicdatasets/common-crawl/parse-output/ - crawl data from 2012
    + [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20/
    + [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-48/
    + [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-10/

    All crawls since 2013 are stored in the WARC file format and also include metadata and text data extracts.

    ## Data Format

    Common Crawl currently stores the crawl data using the Web ARChive (WARC) format.
    Before that point, the crawl was stored in [the ARC file format](http://archive.org/web/researcher/ArcFileFormat.php).
    Before that point, the crawl was stored in the [ARC file format](http://archive.org/web/researcher/ArcFileFormat.php).
    The WARC format allows for more efficient storage and processing of Common Crawl's free multi-billion page web archives, which can be hundreds of terabytes in size.
    This document aims to give you an introduction to working with the new format, specifically the difference between:

    @@ -10,7 +27,7 @@ This document aims to give you an introduction to working with the new format, s
    If you want all the nitty gritty details, the best source is the ISO standard, for which the final draft is available.
    If you're more interested in diving into code, we've provided three introductory examples in Java that use the Hadoop framework to process WAT, WET and WARC.

    ## WARC Format
    ### WARC Format

    The WARC format is the raw data from the crawl, providing a direct mapping to the crawl process. Not only does the format store the HTTP response from the websites it contacts (WARC-Type: response), it also stores information about how that information was requested (WARC-Type: request) and metadata on the crawl process itself (WARC-Type: metadata).

    @@ -44,7 +61,7 @@ In the example below, we can see the crawler contacted http://102jamzorlando.cbs

    ...HTML Content...

    ## WAT Response Format
    ### WAT Response Format

    WAT files contain important metadata about the records stored in the WARC format above. This metadata is computed for each of the three types of records (metadata, request, and response). If the information crawled is HTML, the computed metadata includes the HTTP headers returned and the links (including the type of link) listed on the page.

    @@ -65,7 +82,7 @@ The HTTP response metadata is most likely to be of interest to Common Crawl user
    + Links
    + Container

    ## WET Response Format
    ### WET Response Format

    As many tasks only require textual information, the Common Crawl dataset provides WET files that only contain extracted plaintext. The way in which this textual data is stored in the WET format is quite simple. The WARC metadata contains various details, including the URL and the length of the plaintext data, with the plaintext data following immediately afterwards.

    @@ -82,7 +99,7 @@ As many tasks only require textual information, the Common Crawl dataset provide

    ...Text Content...

    ## Processing the file format
    ### Processing the file format

    We've provided [three introductory examples in Java](https://github.com/Smerity/cc-warc-examples) for the Hadoop framework. The code also contains [wrapper tools](https://github.com/Smerity/cc-warc-examples/tree/master/src/org/commoncrawl/warc) that make working with the Web Archive Commons library in Hadoop easier.
    These introductory examples include:
  4. Smerity revised this gist Jul 30, 2014. 1 changed file with 11 additions and 12 deletions.
    23 changes: 11 additions & 12 deletions gistfile1.md
    @@ -52,19 +52,18 @@ This information is stored as JSON. To keep the file sizes as small as possible,

    The HTTP response metadata is most likely to be of interest to Common Crawl users. The skeleton of the JSON format is outlined below.

    + Envelope
    + WARC-Header-Metadata
    + Payload-Metadata
    + HTTP-Response-Metadata
    + Headers
    + HTML-Metadata
    + Head
    + Title
    + Scripts
    + Metas
    + Links
    + WARC-Header-Metadata
    + Payload-Metadata
    + HTTP-Response-Metadata
    + Headers
    + HTML-Metadata
    + Head
    + Title
    + Scripts
    + Metas
    + Links
    + Container
    + Links
    + Container

    ## WET Response Format

  5. Smerity revised this gist Jul 30, 2014. 1 changed file with 4 additions and 4 deletions.
    8 changes: 4 additions & 4 deletions gistfile1.md
    @@ -59,10 +59,10 @@ The HTTP response metadata is most likely to be of interest to Common Crawl user
    + Headers
    + HTML-Metadata
    + Head
    + Title
    + Scripts
    + Metas
    + Links
    + Title
    + Scripts
    + Metas
    + Links
    + Links
    + Container

  6. Smerity revised this gist Jul 30, 2014. No changes.
  7. Smerity revised this gist Jul 30, 2014. 1 changed file with 13 additions and 33 deletions.
    46 changes: 13 additions & 33 deletions gistfile1.md
    @@ -52,39 +52,19 @@ This information is stored as JSON. To keep the file sizes as small as possible,

    The HTTP response metadata is most likely to be of interest to Common Crawl users. The skeleton of the JSON format is outlined below.

    <ul class="showIndent">
    <li>Envelope
    <ul>
    <li>WARC-Header-Metadata</li>
    <li>Payload-Metadata
    <ul>
    <li>HTTP-Response-Metadata
    <ul>
    <li>Headers
    <ul>
    <li>HTML-Metadata
    <ul>
    <li>Head
    <ul>
    <li>Title</li>
    <li>Scripts</li>
    <li>Metas</li>
    <li>Links</li>
    </ul>
    </li>
    <li>Links</li>
    </ul>
    </li>
    </ul>
    </li>
    </ul>
    </li>
    </ul>
    </li>
    <li>Container</li>
    </ul>
    </li>
    </ul>
    + Envelope
    + WARC-Header-Metadata
    + Payload-Metadata
    + HTTP-Response-Metadata
    + Headers
    + HTML-Metadata
    + Head
    + Title
    + Scripts
    + Metas
    + Links
    + Links
    + Container

    ## WET Response Format

  8. Smerity revised this gist Jul 30, 2014. 1 changed file with 83 additions and 3 deletions.
    86 changes: 83 additions & 3 deletions gistfile1.md
    @@ -1,6 +1,6 @@
    Common Crawl currently stores the crawl data using the Web ARChive (WARC) format.
    Before that point, the crawl was stored in [the ARC file format](http://archive.org/web/researcher/ArcFileFormat.php).
    The WARC format allows for more efficient storage and processing of CommonCrawl's free multi-billion page web archives, which can be hundreds of terabytes in size.
    The WARC format allows for more efficient storage and processing of Common Crawl's free multi-billion page web archives, which can be hundreds of terabytes in size.
    This document aims to give you an introduction to working with the new format, specifically the difference between:

    + WARC files which store the raw crawl data
    @@ -10,7 +10,7 @@ This document aims to give you an introduction to working with the new format, s
    If you want all the nitty gritty details, the best source is the ISO standard, for which the final draft is available.
    If you're more interested in diving into code, we've provided three introductory examples in Java that use the Hadoop framework to process WAT, WET and WARC.

    ## The WARC Format
    ## WARC Format

    The WARC format is the raw data from the crawl, providing a direct mapping to the crawl process. Not only does the format store the HTTP response from the websites it contacts (WARC-Type: response), it also stores information about how that information was requested (WARC-Type: request) and metadata on the crawl process itself (WARC-Type: metadata).
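    As a rough illustration of these three record types, here is a minimal sketch (ours, not part of the crawl tooling) that walks a WARC file with the IIPC Web Archive Commons library discussed later and tallies records by their WARC-Type header; the file name is hypothetical:

    ```java
    import java.io.File;
    import java.util.HashMap;
    import java.util.Map;

    import org.archive.io.ArchiveReader;
    import org.archive.io.ArchiveRecord;
    import org.archive.io.warc.WARCReaderFactory;

    public class WarcTypeTally {
        public static void main(String[] args) throws Exception {
            // Hypothetical local copy of one WARC file from a crawl segment.
            File warc = new File("CC-MAIN-example.warc.gz");
            Map<String, Integer> counts = new HashMap<String, Integer>();
            ArchiveReader reader = WARCReaderFactory.get(warc);
            for (ArchiveRecord record : reader) {
                // Each record declares itself as warcinfo, request, response or metadata.
                Object type = record.getHeader().getHeaderValue("WARC-Type");
                String key = (type == null) ? "unknown" : type.toString();
                Integer seen = counts.get(key);
                counts.put(key, (seen == null) ? 1 : seen + 1);
            }
            reader.close();
            System.out.println(counts);
        }
    }
    ```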

    @@ -42,4 +42,84 @@ In the example below, we can see the crawler contacted http://102jamzorlando.cbs
    Connection: close


    ...HTML Content...
    ...HTML Content...

    ## WAT Response Format

    WAT files contain important metadata about the records stored in the WARC format above. This metadata is computed for each of the three types of records (metadata, request, and response). If the information crawled is HTML, the computed metadata includes the HTTP headers returned and the links (including the type of link) listed on the page.

    This information is stored as JSON. To keep the file sizes as small as possible, the JSON is stored with all unnecessary whitespace stripped, resulting in a relatively unreadable format for humans. If you want to inspect the JSON file yourself, use one of the many JSON pretty print tools available.
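    For instance, a short Java sketch using the Gson library (one pretty-printer among many; the minified string below is a heavily truncated stand-in for a real WAT record):

    ```java
    import com.google.gson.Gson;
    import com.google.gson.GsonBuilder;
    import com.google.gson.JsonElement;
    import com.google.gson.JsonParser;

    public class PrettyPrintWat {
        public static void main(String[] args) {
            // Truncated stand-in for one minified WAT JSON record.
            String minified = "{\"Envelope\":{\"WARC-Header-Metadata\":{\"WARC-Type\":\"response\"},"
                    + "\"Payload-Metadata\":{\"HTTP-Response-Metadata\":{}}}}";
            JsonElement parsed = new JsonParser().parse(minified);
            Gson pretty = new GsonBuilder().setPrettyPrinting().create();
            System.out.println(pretty.toJson(parsed));
        }
    }
    ```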

    The HTTP response metadata is most likely to be of interest to Common Crawl users. The skeleton of the JSON format is outlined below.

    <ul class="showIndent">
    <li>Envelope
    <ul>
    <li>WARC-Header-Metadata</li>
    <li>Payload-Metadata
    <ul>
    <li>HTTP-Response-Metadata
    <ul>
    <li>Headers
    <ul>
    <li>HTML-Metadata
    <ul>
    <li>Head
    <ul>
    <li>Title</li>
    <li>Scripts</li>
    <li>Metas</li>
    <li>Links</li>
    </ul>
    </li>
    <li>Links</li>
    </ul>
    </li>
    </ul>
    </li>
    </ul>
    </li>
    </ul>
    </li>
    <li>Container</li>
    </ul>
    </li>
    </ul>

    ## WET Response Format

    As many tasks only require textual information, the Common Crawl dataset provides WET files that only contain extracted plaintext. The way in which this textual data is stored in the WET format is quite simple. The WARC metadata contains various details, including the URL and the length of the plaintext data, with the plaintext data following immediately afterwards.

    WARC/1.0
    WARC-Type: conversion
    WARC-Target-URI: http://advocatehealth.com/condell/emergencyservices3
    WARC-Date: 2013-12-04T15:30:35Z
    WARC-Record-ID:
    WARC-Refers-To:
    WARC-Block-Digest: sha1:3SJBHMFPOCUJEHJ7OMGVCRSHQTWLJUUS
    Content-Type: text/plain
    Content-Length: 5765


    ...Text Content...
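    Since each conversion record is just a WARC header followed by the plaintext, pulling out the URL and text is straightforward. Below is a sketch in the same style as the WARC example above (again using Web Archive Commons plus Commons IO, with a hypothetical file name):

    ```java
    import java.io.FileInputStream;

    import org.apache.commons.io.IOUtils;
    import org.archive.io.ArchiveReader;
    import org.archive.io.ArchiveRecord;
    import org.archive.io.warc.WARCReaderFactory;

    public class WetUrlsAndText {
        public static void main(String[] args) throws Exception {
            // Hypothetical local copy of a WET file; WET files use the WARC container format.
            String path = "CC-MAIN-example.warc.wet.gz";
            ArchiveReader reader = WARCReaderFactory.get(path, new FileInputStream(path), true);
            for (ArchiveRecord record : reader) {
                // Skip the leading warcinfo record; plaintext lives in conversion records.
                if (!"conversion".equals(record.getHeader().getHeaderValue("WARC-Type"))) {
                    continue;
                }
                String url = record.getHeader().getUrl();
                // The record body is the extracted plaintext itself.
                String text = IOUtils.toString(record, "UTF-8");
                System.out.println(url + ": " + text.length() + " characters of plaintext");
            }
            reader.close();
        }
    }
    ```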

    ## Processing the file format

    We've provided [three introductory examples in Java](https://github.com/Smerity/cc-warc-examples) for the Hadoop framework. The code also contains [wrapper tools](https://github.com/Smerity/cc-warc-examples/tree/master/src/org/commoncrawl/warc) that make working with the Web Archive Commons library in Hadoop easier.
    These introductory examples include:

    + Counting the number of times various tags are used across HTML on the internet using the WARC files
    + Counting the number of different server types found in the HTTP headers using the WAT files
    + Word count over the extracted plaintext found in the WET files


    If you're using a different language, there are a number of open source libraries that handle processing these WARC files and the content they contain. These include:

    + Common Crawl's [Example WARC](https://github.com/commoncrawl/example-warc-java) (Java & Clojure)
    + [WARC-Mapreduce WET/WARC processor](https://github.com/vadali/warc-mapreduce) (Java & Clojure)
    + Kevin Bullaughey's [WARC & WAT tools](https://github.com/kbullaughey/warc-tools) (Go)
    + Hanzo Archive's [Warc Tools](http://code.hanzoarchives.com/warc-tools) (Python)
    + IIPC's [Web Archive Commons library](https://github.com/iipc/webarchive-commons) for processing WARC & WAT (Java)
    + Internet Archive’s [Hadoop tools](https://github.com/internetarchive/ia-hadoop-tools) for bridging WARC to Pig (Java)

    If in doubt, the tools provided as part of the IIPC's Web Archive Commons library are the preferred implementation.
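    To give a flavour of the Hadoop side, below is a schematic word-count mapper over WET plaintext. It assumes some input format upstream (for example, the wrappers in cc-warc-examples) has already delivered each record's URL and extracted text as the key and value; only the counting logic is shown, and the key/value types are our assumption:

    ```java
    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Schematic only: assumes an input format that emits one WET record per map call,
    // with the record's URL as the key and its plaintext as the value.
    public class WetWordCountMapper extends Mapper<Text, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Text url, Text plaintext, Context context)
                throws IOException, InterruptedException {
            for (String token : plaintext.toString().toLowerCase().split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
    ```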
  9. Smerity renamed this gist Jul 30, 2014. 1 changed file with 0 additions and 0 deletions.
    File renamed without changes.
  10. Smerity created this gist Jul 30, 2014.
    45 changes: 45 additions & 0 deletions about_the_data.md
    @@ -0,0 +1,45 @@
    Common Crawl currently stores the crawl data using the Web ARChive (WARC) format.
    Before that point, the crawl was stored in [the ARC file format](http://archive.org/web/researcher/ArcFileFormat.php).
    The WARC format allows for more efficient storage and processing of CommonCrawl's free multi-billion page web archives, which can be hundreds of terabytes in size.
    This document aims to give you an introduction to working with the new format, specifically the difference between:

    + WARC files which store the raw crawl data
    + WAT files which store computed metadata for the data stored in the WARC
    + WET files which store extracted plaintext from the data stored in the WARC

    If you want all the nitty gritty details, the best source is the ISO standard, for which the final draft is available.
    If you're more interested in diving into code, we've provided three introductory examples in Java that use the Hadoop framework to process WAT, WET and WARC.

    ## The WARC Format

    The WARC format is the raw data from the crawl, providing a direct mapping to the crawl process. Not only does the format store the HTTP response from the websites it contacts (WARC-Type: response), it also stores information about how that information was requested (WARC-Type: request) and metadata on the crawl process itself (WARC-Type: metadata).

    For the HTTP responses themselves, the raw response is stored. This not only includes the response itself, what you would get if you downloaded the file, but also the HTTP header information, which can be used to glean a number of interesting insights.
    In the example below, we can see the crawler contacted http://102jamzorlando.cbslocal.com/tag/nba/page/2/ and received an HTML page in response. We can also see the page was served from the nginx web server and that a special header has been added, X-hacker, purely for the purposes of advertising to a very specific audience of programmers who might look at the HTTP headers!

    WARC/1.0
    WARC-Type: response
    WARC-Date: 2013-12-04T16:47:32Z
    WARC-Record-ID:
    Content-Length: 73873
    Content-Type: application/http; msgtype=response
    WARC-Warcinfo-ID:
    WARC-Concurrent-To:
    WARC-IP-Address: 23.0.160.82
    WARC-Target-URI: http://102jamzorlando.cbslocal.com/tag/nba/page/2/
    WARC-Payload-Digest: sha1:FXV2BZKHT6SQ4RZWNMIMP7KMFUNZMZFB
    WARC-Block-Digest: sha1:GMYFZYSACNBEGHVP3YFQNOSTV5LPXNAU

    HTTP/1.0 200 OK
    Server: nginx
    Content-Type: text/html; charset=UTF-8
    Vary: Accept-Encoding
    Vary: Cookie
    X-hacker: If you're reading this, you should visit automattic.com/jobs and apply to join the fun, mention this header.
    Content-Encoding: gzip
    Date: Wed, 04 Dec 2013 16:47:32 GMT
    Content-Length: 18953
    Connection: close


    ...HTML Content...