Forked from bradmontgomery/ShortIntroToScraping.rst
Last active: January 5, 2018 03:31
Revisions
cosme12 revised this gist
Jan 5, 2018. 1 changed file with 3 additions and 1 deletion.

Adds a browser-style ``User-Agent`` header to the request (some servers reject
the default ``python-requests`` agent). The snippet under "Lets grab the Free
Book Samplers from O'Reilly" now reads::

    >>> import requests #ADDED
    >>> headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    >>>
    >>> result = requests.get("http://oreilly.com/store/samplers.html", headers=headers)
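The header trick in this revision can be checked without touching the network,
assuming the ``requests`` library is installed: ``requests.Request(...).prepare()``
builds the outgoing request so its headers can be inspected before anything is
sent. A minimal sketch:

```python
# Sketch: verify the User-Agent header that would go on the wire,
# without actually sending the request.
import requests

headers = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/39.0.2171.95 Safari/537.36"
    ),
}

# Prepare (but do not send) the request; .prepare() only builds it.
req = requests.Request(
    "GET", "http://oreilly.com/store/samplers.html", headers=headers
).prepare()
print(req.headers["User-Agent"])
```

Once the headers look right, the actual fetch is the one-liner from the diff:
``requests.get(url, headers=headers)``.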
bradmontgomery revised this gist
Feb 21, 2012. 1 changed file with 6 additions and 1 deletion.

Inserts a ``::`` literal-block marker before each doctest snippet so they
render as code blocks, and extends the closing line to "Check out the
keys/values in the ``data`` dict. Rejoice! Now go scrape some stuff!"
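The tutorial's NOTE about the import path (``bs4`` for a pip install,
``BeautifulSoup`` for the old BS3 source download) can be handled defensively
with a try/except, a small sketch assuming neither install is guaranteed:

```python
# Import Beautiful Soup from whichever location is available:
# - bs4: installed via `pip install beautifulsoup4` (BS4)
# - BeautifulSoup: legacy BS3 source download
try:
    from bs4 import BeautifulSoup
except ImportError:
    try:
        from BeautifulSoup import BeautifulSoup
    except ImportError:
        BeautifulSoup = None  # neither variant installed in this environment

print(BeautifulSoup is not None)
```

In practice the pip-installed ``bs4`` path is the one to use today; BS3 is long
unmaintained.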
bradmontgomery created this gist
Feb 21, 2012

The initial version of the file:

Web Scraping Workshop
=====================

Using `Requests <http://python-requests.org>`_ and `Beautiful Soup
<http://www.crummy.com/software/BeautifulSoup/>`_, with the most recent
`Beautiful Soup 4 docs <http://www.crummy.com/software/BeautifulSoup/bs4/doc/>`_.

Getting Started
---------------

Install our tools (preferably in a new virtualenv)::

    pip install beautifulsoup4
    pip install requests

Start Scraping!
---------------

Let's grab the Free Book Samplers from O'Reilly:
`http://oreilly.com/store/samplers.html <http://oreilly.com/store/samplers.html>`_.

::

    >>> import requests
    >>>
    >>> result = requests.get("http://oreilly.com/store/samplers.html")

Make sure we got a result.

::

    >>> result.status_code
    200
    >>> result.headers
    ...

Store your content in an easy-to-type variable!

::

    >>> c = result.content

Start parsing with Beautiful Soup. NOTE: If you installed with pip, you'll
need to import from ``bs4``. If you downloaded the source, you'll need to
import from ``BeautifulSoup`` (which is what they do in the `online docs
<http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#Quick%20Start>`_).

::

    >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup(c)
    >>> samples = soup.find_all("a", "item-title")
    >>> samples[0]
    <a class="item-title" href="http://cdn.oreilly.com/oreilly/booksamplers/9780596004927_sampler.pdf">
    Programming Perl </a>

Now, pick apart individual links.

::

    >>> data = {}
    >>> for a in samples:
    ...     title = a.string.strip()
    ...     data[title] = a.attrs['href']

Check out the keys/values in the ``data`` dict. Rejoice!
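The parsing steps above can be exercised without hitting the network by feeding
Beautiful Soup a small HTML fragment shaped like the samplers page (the hrefs
below are made-up examples). Note that current bs4 releases want an explicit
parser argument, which plain ``BeautifulSoup(c)`` omits:

```python
from bs4 import BeautifulSoup

# Stand-in for result.content: two anchors shaped like O'Reilly's
# "item-title" links (the URLs are made-up examples).
html = """
<a class="item-title" href="http://example.com/perl_sampler.pdf">
Programming Perl </a>
<a class="item-title" href="http://example.com/python_sampler.pdf">
Learning Python </a>
"""

# An explicit parser ("html.parser") avoids the warning that
# BeautifulSoup(c) with no second argument raises in current bs4.
soup = BeautifulSoup(html, "html.parser")
samples = soup.find_all("a", "item-title")

# Same loop as the tutorial: map each link's text to its href.
data = {}
for a in samples:
    title = a.string.strip()
    data[title] = a.attrs["href"]

print(data)
```

Against the real page, only the ``html`` stand-in changes: fetch with
``requests.get`` and pass ``result.content`` to ``BeautifulSoup`` instead.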