@cosme12
Forked from bradmontgomery/ShortIntroToScraping.rst
Last active January 5, 2018 03:31
Revisions

  1. cosme12 revised this gist Jan 5, 2018. 1 changed file with 3 additions and 1 deletion.
    4 changes: 3 additions & 1 deletion ShortIntroToScraping.rst
@@ -18,8 +18,10 @@ Lets grab the Free Book Samplers from O'Reilly: `http://oreilly.com/store/sample
     ::

     >>> import requests
    +#ADDED
    +>>> headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
     >>>
    ->>> result = requests.get("http://oreilly.com/store/samplers.html")
    +>>> result = requests.get("http://oreilly.com/store/samplers.html", headers=headers)

    Make sure we got a result.

  2. @bradmontgomery revised this gist Feb 21, 2012. 1 changed file with 6 additions and 1 deletion.
    7 changes: 6 additions & 1 deletion ShortIntroToScraping.rst
    @@ -16,13 +16,15 @@ Start Scraping!
    Let's grab the Free Book Samplers from O'Reilly: `http://oreilly.com/store/samplers.html <http://oreilly.com/store/samplers.html>`_.

    ::

    >>> import requests
    >>>
    >>> result = requests.get("http://oreilly.com/store/samplers.html")

    Make sure we got a result.

    ::

    >>> result.status_code
    200
    >>> result.headers
    @@ -31,11 +33,13 @@ Make sure we got a result.
    Store your content in an easy-to-type variable!

    ::

    >>> c = result.content

    Start parsing with Beautiful Soup. NOTE: If you installed with pip, you'll need to import from ``bs4``. If you download the source, you'll need to import from ``BeautifulSoup`` (which is what they do in the `online docs <http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#Quick%20Start>`_).

    ::

    >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup(c)
    >>> samples = soup.find_all("a", "item-title")
    @@ -47,6 +51,7 @@ Start parsing with Beautiful Soup. NOTE: If you installed with pip, you'll need
    Now, pick apart individual links.

    ::

    >>> data = {}
    >>> for a in samples:
    ... title = a.string.strip()
    @@ -55,4 +60,4 @@ Now, pick apart individual links.

    Check out the keys/values in the ``data`` dict. Rejoice!

    Now go scrape some stuff!
  3. @bradmontgomery created this gist Feb 21, 2012.
    58 changes: 58 additions & 0 deletions ShortIntroToScraping.rst
    @@ -0,0 +1,58 @@
    Web Scraping Workshop
    =====================

    Using `Requests <http://python-requests.org>`_ and `Beautiful Soup <http://www.crummy.com/software/BeautifulSoup/>`_, with the most recent `Beautiful Soup 4 docs <http://www.crummy.com/software/BeautifulSoup/bs4/doc/>`_.

    Getting Started
    ---------------
    Install our tools (preferably in a new virtualenv)::

    pip install beautifulsoup4
    pip install requests

    Start Scraping!
    ---------------

    Let's grab the Free Book Samplers from O'Reilly: `http://oreilly.com/store/samplers.html <http://oreilly.com/store/samplers.html>`_.

    ::
    >>> import requests
    >>>
    >>> result = requests.get("http://oreilly.com/store/samplers.html")

    Make sure we got a result.

    ::
    >>> result.status_code
    200
    >>> result.headers
    ...

    Store your content in an easy-to-type variable!

    ::
    >>> c = result.content

    Start parsing with Beautiful Soup. NOTE: If you installed with pip, you'll need to import from ``bs4``. If you download the source, you'll need to import from ``BeautifulSoup`` (which is what they do in the `online docs <http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#Quick%20Start>`_).

    ::
    >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup(c)
    >>> samples = soup.find_all("a", "item-title")
    >>> samples[0]
    <a class="item-title" href="http://cdn.oreilly.com/oreilly/booksamplers/9780596004927_sampler.pdf">
    Programming Perl
    </a>

    Now, pick apart individual links.

    ::
    >>> data = {}
    >>> for a in samples:
    ... title = a.string.strip()
    ... data[title] = a.attrs['href']


    Check out the keys/values in the ``data`` dict. Rejoice!
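The walkthrough above, including the ``User-Agent`` header added in the January 2018 revision, can be condensed into one script. This is a sketch, not the gist author's code: the ``item-title`` class and the sampler URL come from the 2012 original and may no longer exist on the live site, and the network fetch sits behind a ``__main__`` guard so the parsing helper can be exercised on its own.

```python
from bs4 import BeautifulSoup

# User-Agent added in the Jan 5, 2018 revision: some servers refuse requests
# that arrive with the default "python-requests" agent string.
HEADERS = {
    'User-Agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/39.0.2171.95 Safari/537.36')
}


def parse_samplers(html):
    """Map each sampler title to its PDF link, as the loop above does."""
    # Naming the parser explicitly avoids the "no parser specified" warning
    # that newer Beautiful Soup versions emit for bare BeautifulSoup(c).
    soup = BeautifulSoup(html, "html.parser")
    data = {}
    for a in soup.find_all("a", "item-title"):
        data[a.string.strip()] = a.attrs['href']
    return data


if __name__ == "__main__":
    import requests  # imported here so the parser can be reused without it
    result = requests.get("http://oreilly.com/store/samplers.html",
                          headers=HEADERS)
    result.raise_for_status()  # fail loudly on anything other than 200
    print(parse_samplers(result.content))
```

Splitting the fetch from the parse also makes the scraper testable against a saved HTML snippet, with no network round-trip.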