{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# A primer for sports stats scraping\n", "\n", "Though I'm using an example for a random Green Bay Pheonix game from the school's website here, the principles generally hold true across the board. Some sites are easier to scrape than others, but most require at least minor parsing to get the exact format you want in the end.\n", "\n", "*Anyway*, this should serve as a decent barebones example to show just how easy it can be to scrape this information into a relatively usable format with little effort. [kenpompy](https://github.com/j-andrews7/kenpompy) uses very similar code and logic, though his site tends to require heavy manual parsing as well." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we just need to load up our packages. I use `mechanicalsoup` here, but you could do the same with `requests` easily enough. `mechanicalsoup` is useful if you have to login or otherwise interact with a page through a form though." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import mechanicalsoup\n", "import pandas as pd\n", "from bs4 import BeautifulSoup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we snag the actual HTML for the [page containing the boxscore](https://greenbayphoenix.com/sports/womens-basketball/stats/2019-20/mizzou/boxscore/4186). If a page dynamically generates/populates data via javascript, this entire process becomes moot, as you'll need something like [selenium](https://selenium-python.readthedocs.io/) to actually load the page." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# This just initializes a browser.\n", "browser = mechanicalsoup.StatefulBrowser()\n", "browser.open(\"https://greenbayphoenix.com/sports/womens-basketball/stats/2019-20/mizzou/boxscore/4186\")\n", "\n", "# This actually scrapes the page contents into a variable, HTML tags, javascript, and all. \n", "content = browser.get_current_page()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we grab all tables from the page, storing them in a list." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "tables = content.find_all('table')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So this site has 7 tables:\n", " - Scoring by quarter\n", " - Away team box score\n", " - Away team summary stats\n", " - Away team miscellaneous stats\n", " - Home team box score\n", " - Home team summary stats\n", " - Home team miscellaneous stats\n", " \n", "Kudos to the Green Bay website designers, their site has a better layout than most major programs. \n", "\n", "First, we'll grab the scoring by quarter table." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Team1234Total FRecords
0Mizzou Missouri9111430641-3,0-0 SEC
1GB Green Bay16152120722-1,0-0 Horizon
\n", "
" ], "text/plain": [ " Team 1 2 3 4 Total F Records\n", "0 Mizzou Missouri 9 11 14 30 64 1-3,0-0 SEC\n", "1 GB Green Bay 16 15 21 20 72 2-1,0-0 Horizon" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Parse the first table into a pandas dataframe.\n", "quarters_df = pd.read_html(str(tables[0]))\n", "\n", "# For some reason it always nests the result in a list.\n", "quarters_df = quarters_df[0]\n", "quarters_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Not bad, pretty usable right off the bat depending on what you want. Some folks might want to parse out conferences and conference records from the `Records` column, or split the `Team` column so that the team abbreviations are in their own column. But overall, pretty decent.\n", "\n", "Now we'll try the actual box score." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
##PlayerGSMINFG3PTFTORB-DRBREBPFATOBLKSTLPTS
01313 Schuchts,Hannah*304-53-42-22-683122013
12323 Smith,Amber*295-122-50-01-785230012
22222 Roundtree,Jordan*353-71-42-20-44201019
32424 Chavis,Jordan*180-40-30-00-11330010
41212 Brown,Elle*60-10-00-00-00111000
54343 Frank,HayleyNaN264-111-43-33-145230312
63333 Blackwell,AijhaNaN162-80-24-81-23524008
71111 Troup,HaleyNaN231-71-43-40-44112026
81010 Green,NadiaNaN141-30-02-23-03311014
94545 Garner,BrittanyNaN30-00-00-00-00200000
10TMTM TEAMNaN00-00-00-02-35000000
11NaNTotals-20020-588-2616-2112-28403013172864
\n", "
" ], "text/plain": [ " ## Player GS MIN FG 3PT FT ORB-DRB REB PF \\\n", "0 13 13 Schuchts,Hannah * 30 4-5 3-4 2-2 2-6 8 3 \n", "1 23 23 Smith,Amber * 29 5-12 2-5 0-0 1-7 8 5 \n", "2 22 22 Roundtree,Jordan * 35 3-7 1-4 2-2 0-4 4 2 \n", "3 24 24 Chavis,Jordan * 18 0-4 0-3 0-0 0-1 1 3 \n", "4 12 12 Brown,Elle * 6 0-1 0-0 0-0 0-0 0 1 \n", "5 43 43 Frank,Hayley NaN 26 4-11 1-4 3-3 3-1 4 5 \n", "6 33 33 Blackwell,Aijha NaN 16 2-8 0-2 4-8 1-2 3 5 \n", "7 11 11 Troup,Haley NaN 23 1-7 1-4 3-4 0-4 4 1 \n", "8 10 10 Green,Nadia NaN 14 1-3 0-0 2-2 3-0 3 3 \n", "9 45 45 Garner,Brittany NaN 3 0-0 0-0 0-0 0-0 0 2 \n", "10 TM TM TEAM NaN 0 0-0 0-0 0-0 2-3 5 0 \n", "11 NaN Totals - 200 20-58 8-26 16-21 12-28 40 30 \n", "\n", " A TO BLK STL PTS \n", "0 1 2 2 0 13 \n", "1 2 3 0 0 12 \n", "2 0 1 0 1 9 \n", "3 3 0 0 1 0 \n", "4 1 1 0 0 0 \n", "5 2 3 0 3 12 \n", "6 2 4 0 0 8 \n", "7 1 2 0 2 6 \n", "8 1 1 0 1 4 \n", "9 0 0 0 0 0 \n", "10 0 0 0 0 0 \n", "11 13 17 2 8 64 " ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Get the next element of our tables list.\n", "mizzou_box = pd.read_html(str(tables[1]))\n", "mizzou_box = mizzou_box[0]\n", "mizzou_box" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again, not bad. Would probably definitely tweak some things here, like dropping player numbers from the `Player` column, splitting the `ORB-DRB` column and creating a total rebounds column, splitting up made field goals and attempt, etc. \n", "\n", "But ya'll can figure that out on your own. `kenpompy` does plenty of that, if you need inspiration. \n", "\n", "On to the summary stats." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Team SummaryFG3PTFT
01st Quarter3-162-81-2
11st Quarter18.75 %25.00 %50.00 %
22nd Quarter3-101-44-5
32nd Quarter30.00 %25.00 %80.00 %
43rd Quarter5-141-63-6
53rd Quarter35.71 %16.67 %50.00 %
64th Quarter9-184-88-8
74th Quarter50.00 %50.00 %100.00 %
8Total20-588-2616-21
9NaN34.5 %30.8 %76.2 %
\n", "
" ], "text/plain": [ " Team Summary FG 3PT FT\n", "0 1st Quarter 3-16 2-8 1-2\n", "1 1st Quarter 18.75 % 25.00 % 50.00 %\n", "2 2nd Quarter 3-10 1-4 4-5\n", "3 2nd Quarter 30.00 % 25.00 % 80.00 %\n", "4 3rd Quarter 5-14 1-6 3-6\n", "5 3rd Quarter 35.71 % 16.67 % 50.00 %\n", "6 4th Quarter 9-18 4-8 8-8\n", "7 4th Quarter 50.00 % 50.00 % 100.00 %\n", "8 Total 20-58 8-26 16-21\n", "9 NaN 34.5 % 30.8 % 76.2 %" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mizzou_sum = pd.read_html(str(tables[2]))\n", "mizzou_sum = mizzou_sum[0]\n", "mizzou_sum" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alright, this one definitely needs a bit of work. Not too much, moving the percentage rows to their own columns would probably be enough to make plotting doable and clean up the row redundancy.\n", "\n", "Lastly, the miscellaneous table (timeouts, tech fouls, lead changes, etc)." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0Technical Fouls: noneSecond Chance Points: 10Scores Tied: 0 time(s)
1Points in the Paint: 20Fast Break Points: 15Lead Changed: 0 time(s)
2Points off Turnovers: 10Bench Points: 30NaN
\n", "
" ], "text/plain": [ " 0 1 2\n", "0 Technical Fouls: none Second Chance Points: 10 Scores Tied: 0 time(s)\n", "1 Points in the Paint: 20 Fast Break Points: 15 Lead Changed: 0 time(s)\n", "2 Points off Turnovers: 10 Bench Points: 30 NaN" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mizzou_misc = pd.read_html(str(tables[3]))\n", "mizzou_misc = mizzou_misc[0]\n", "mizzou_misc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Yikes, this one is pretty useless as is. I'd be more inclined to create a class out of this than actually parse it into a usable dataframe format. You can however, parse individual elements easily enough. For instance, if I just wanted to grab the points in the paint:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Points in the Paint: 20'" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# pandas indexing is row by column.\n", "mizzou_misc.iloc[1,0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And getting just the number is just another line" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'20'" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "paint_pts = mizzou_misc.iloc[1,0]\n", "paint_pts = paint_pts.split(\": \")[1]\n", "paint_pts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's all I've got. The last 3 tables in the `tables` list contain the same information for Green Bay.\n", "\n", "If you have questions, hit me up on [twitter](https://twitter.com/JMA_Data)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 4 }