# A primer for sports stats scraping

Though I'm using a random Green Bay Phoenix game from the school's website as the example here, the principles generally hold true across the board. Some sites are easier to scrape than others, but most require at least minor parsing to get the exact format you want in the end.

*Anyway*, this should serve as a decent barebones example of just how easy it can be to scrape this information into a relatively usable format with little effort. [kenpompy](https://github.com/j-andrews7/kenpompy) uses very similar code and logic, though Ken Pomeroy's site tends to require heavy manual parsing as well.

First, we just need to load up our packages. I use `mechanicalsoup` here, but you could do the same with `requests` easily enough. `mechanicalsoup` is useful, though, if you have to log in or otherwise interact with a page through a form.

In [4]:
import mechanicalsoup
import pandas as pd
from bs4 import BeautifulSoup

Then we snag the actual HTML for the [page containing the boxscore](https://greenbayphoenix.com/sports/womens-basketball/stats/2019-20/mizzou/boxscore/4186). If a page generates or populates its data dynamically via JavaScript, this approach won't work, because the data isn't in the HTML you download. In that case you'll need something like [selenium](https://selenium-python.readthedocs.io/) to actually load the page.

In [9]:
# This just initializes a browser.
browser = mechanicalsoup.StatefulBrowser()
browser.open("https://greenbayphoenix.com/sports/womens-basketball/stats/2019-20/mizzou/boxscore/4186")

# This grabs the parsed page contents (a BeautifulSoup object) into a variable, HTML tags, JavaScript, and all.
content = browser.get_current_page()

<Response [200]>

Now we grab all tables from the page, storing them in a list.

In [12]:
tables = content.find_all('table')

So this page has 7 tables:
 - Scoring by quarter
 - Away team box score
 - Away team summary stats
 - Away team miscellaneous stats
 - Home team box score
 - Home team summary stats
 - Home team miscellaneous stats
 
Kudos to the Green Bay website designers: their site has a better layout than most major programs.

First, we'll grab the scoring by quarter table.

In [23]:
# Parse the first table into a pandas dataframe.
quarters_df = pd.read_html(str(tables[0]))

# read_html returns a list of dataframes (one per table it finds), so take the first.
quarters_df = quarters_df[0]
quarters_df

Unnamed: 0,Team,1,2,3,4,Total F,Records
0,Mizzou Missouri,9,11,14,30,64,"1-3,0-0 SEC"
1,GB Green Bay,16,15,21,20,72,"2-1,0-0 Horizon"


Not bad, pretty usable right off the bat depending on what you want. Some folks might want to parse out conferences and conference records from the `Records` column, or split the `Team` column so that the team abbreviations are in their own column. But overall, pretty decent.
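If you do want those splits, here's a minimal sketch of one way to do it. To keep it runnable without hitting the site, I rebuild the table by hand from the output above; the new column names (`Abbrev`, `School`, `Overall`, `ConfRecord`, `Conference`) are just my picks, and the regex assumes every record looks like `1-3,0-0 SEC`.

```python
import pandas as pd

# Hand-built copy of the scoring table above (assumption: same column names
# that read_html produced; the index column is left out).
quarters_df = pd.DataFrame({
    "Team": ["Mizzou Missouri", "GB Green Bay"],
    "1": [9, 16],
    "2": [11, 15],
    "3": [14, 21],
    "4": [30, 20],
    "Total F": [64, 72],
    "Records": ["1-3,0-0 SEC", "2-1,0-0 Horizon"],
})

# Split at the first space only, so "GB Green Bay" -> "GB" + "Green Bay".
quarters_df[["Abbrev", "School"]] = quarters_df["Team"].str.split(" ", n=1, expand=True)

# Pull overall record, conference record, and conference name out of "1-3,0-0 SEC".
record_parts = quarters_df["Records"].str.extract(
    r"^(?P<Overall>\d+-\d+),(?P<ConfRecord>\d+-\d+)\s+(?P<Conference>.+)$"
)

# Attach the new columns and drop the two combined ones.
quarters_df = quarters_df.join(record_parts).drop(columns=["Team", "Records"])
```

A record that doesn't match the pattern comes back as missing values rather than raising an error, so it's worth eyeballing the result on other pages.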

Now we'll try the actual box score.

In [25]:
# Get the next element of our tables list.
mizzou_box = pd.read_html(str(tables[1]))
mizzou_box = mizzou_box[0]
mizzou_box

Unnamed: 0,##,Player,GS,MIN,FG,3PT,FT,ORB-DRB,REB,PF,A,TO,BLK,STL,PTS
0,13,"13 Schuchts,Hannah",*,30,4-5,3-4,2-2,2-6,8,3,1,2,2,0,13
1,23,"23 Smith,Amber",*,29,5-12,2-5,0-0,1-7,8,5,2,3,0,0,12
2,22,"22 Roundtree,Jordan",*,35,3-7,1-4,2-2,0-4,4,2,0,1,0,1,9
3,24,"24 Chavis,Jordan",*,18,0-4,0-3,0-0,0-1,1,3,3,0,0,1,0
4,12,"12 Brown,Elle",*,6,0-1,0-0,0-0,0-0,0,1,1,1,0,0,0
5,43,"43 Frank,Hayley",,26,4-11,1-4,3-3,3-1,4,5,2,3,0,3,12
6,33,"33 Blackwell,Aijha",,16,2-8,0-2,4-8,1-2,3,5,2,4,0,0,8
7,11,"11 Troup,Haley",,23,1-7,1-4,3-4,0-4,4,1,1,2,0,2,6
8,10,"10 Green,Nadia",,14,1-3,0-0,2-2,3-0,3,3,1,1,0,1,4
9,45,"45 Garner,Brittany",,3,0-0,0-0,0-0,0-0,0,2,0,0,0,0,0


Again, not bad. I'd probably tweak some things here, like dropping player numbers from the `Player` column, splitting the `ORB-DRB` column into separate offensive and defensive rebound columns (the `REB` column already holds the total), splitting made field goals from attempts, etc.

But y'all can figure that out on your own. `kenpompy` does plenty of that, if you need inspiration.
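If you'd like a head start, here's a rough sketch of a few of those tweaks. It uses three rows and a handful of columns copied from the output above rather than the live page, and `split_pair` plus the new column names (`FGM`, `FGA`, `ORB`, `DRB`) are hypothetical names of mine, not anything from `kenpompy`.

```python
import pandas as pd

# A few rows and columns copied by hand from the box score above.
mizzou_box = pd.DataFrame({
    "##": [13, 23, 43],
    "Player": ["13 Schuchts,Hannah", "23 Smith,Amber", "43 Frank,Hayley"],
    "FG": ["4-5", "5-12", "4-11"],
    "ORB-DRB": ["2-6", "1-7", "3-1"],
    "REB": [8, 8, 4],
})

# Drop the leading jersey number: "13 Schuchts,Hannah" -> "Schuchts,Hannah".
mizzou_box["Player"] = mizzou_box["Player"].str.replace(r"^\d+\s+", "", regex=True)


def split_pair(series, first, second):
    """Split "4-5" style strings into two integer columns (hypothetical helper)."""
    parts = series.str.split("-", expand=True).astype(int)
    parts.columns = [first, second]
    return parts


# Made/attempted field goals and offensive/defensive rebounds get their own columns.
mizzou_box = mizzou_box.join(split_pair(mizzou_box["FG"], "FGM", "FGA"))
mizzou_box = mizzou_box.join(split_pair(mizzou_box["ORB-DRB"], "ORB", "DRB"))
mizzou_box = mizzou_box.drop(columns=["FG", "ORB-DRB"])
```

The same helper should work for the `3PT` and `FT` columns, assuming every cell follows the `made-attempted` pattern.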

On to the summary stats.

In [26]:
mizzou_sum = pd.read_html(str(tables[2]))
mizzou_sum = mizzou_sum[0]
mizzou_sum

Unnamed: 0,Team Summary,FG,3PT,FT
0,1st Quarter,3-16,2-8,1-2
1,1st Quarter,18.75 %,25.00 %,50.00 %
2,2nd Quarter,3-10,1-4,4-5
3,2nd Quarter,30.00 %,25.00 %,80.00 %
4,3rd Quarter,5-14,1-6,3-6
5,3rd Quarter,35.71 %,16.67 %,50.00 %
6,4th Quarter,9-18,4-8,8-8
7,4th Quarter,50.00 %,50.00 %,100.00 %
8,Total,20-58,8-26,16-21
9,,34.5 %,30.8 %,76.2 %


Alright, this one definitely needs a bit of work. Not too much, though: moving the percentage rows to their own columns would probably be enough to make plotting doable and clean up the row redundancy.
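Here's a sketch of that reshaping, using a trimmed, hand-built copy of the table (first quarter and totals only). It assumes the rows always alternate between a counts row and its percentage row, which is true of the output above but isn't guaranteed elsewhere. The `_PCT` suffix and the `tidy_sum` name are my own.

```python
import pandas as pd

# Trimmed copy of the summary table above: first quarter plus the totals.
mizzou_sum = pd.DataFrame({
    "Team Summary": ["1st Quarter", "1st Quarter", "Total", None],
    "FG": ["3-16", "18.75 %", "20-58", "34.5 %"],
    "3PT": ["2-8", "25.00 %", "8-26", "30.8 %"],
    "FT": ["1-2", "50.00 %", "16-21", "76.2 %"],
})

# The last percentage row has no label, so borrow it from the row above.
mizzou_sum["Team Summary"] = mizzou_sum["Team Summary"].ffill()

# Even rows hold made-attempted counts, odd rows hold the matching percentages.
counts = mizzou_sum.iloc[0::2].set_index("Team Summary")
pcts = mizzou_sum.iloc[1::2].set_index("Team Summary")

# Turn "18.75 %" into the float 18.75 and mark those columns as percentages.
pcts = pcts.apply(lambda col: col.str.rstrip(" %").astype(float)).add_suffix("_PCT")

# One row per period, with counts and percentages side by side.
tidy_sum = counts.join(pcts)
```

With one row per period and numeric percentage columns, something like `tidy_sum["FG_PCT"].plot()` becomes a one-liner.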

Lastly, the miscellaneous table (timeouts, tech fouls, lead changes, etc.).

In [27]:
mizzou_misc = pd.read_html(str(tables[3]))
mizzou_misc = mizzou_misc[0]
mizzou_misc

Unnamed: 0,0,1,2
0,Technical Fouls: none,Second Chance Points: 10,Scores Tied: 0 time(s)
1,Points in the Paint: 20,Fast Break Points: 15,Lead Changed: 0 time(s)
2,Points off Turnovers: 10,Bench Points: 30,


Yikes, this one is pretty useless as is. I'd be more inclined to create a class out of this than actually parse it into a usable dataframe format. You can, however, parse individual elements easily enough. For instance, if I just wanted to grab the points in the paint:

In [29]:
# pandas indexing is row by column.
mizzou_misc.iloc[1,0]

'Points in the Paint: 20'

And getting just the number takes only one more line:

In [31]:
paint_pts = mizzou_misc.iloc[1,0]
paint_pts = paint_pts.split(": ")[1]
paint_pts

'20'
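Note that the result is still a string, so wrap it in `int()` if you need a number. If you want every entry rather than just one, a plain dictionary is probably the quickest route (swapped in here for the class idea, since it's less code). This sketch rebuilds the table by hand from the output above; `misc_stats` is a name I made up, and it assumes every cell follows the `label: value` pattern.

```python
import pandas as pd

# Hand-built copy of the miscellaneous table above, including the empty cell.
mizzou_misc = pd.DataFrame([
    ["Technical Fouls: none", "Second Chance Points: 10", "Scores Tied: 0 time(s)"],
    ["Points in the Paint: 20", "Fast Break Points: 15", "Lead Changed: 0 time(s)"],
    ["Points off Turnovers: 10", "Bench Points: 30", None],
])

misc_stats = {}
# Walk every cell in the grid, skipping the empty one.
for cell in mizzou_misc.to_numpy().ravel():
    if not isinstance(cell, str):
        continue
    label, value = cell.split(": ", 1)
    # Keep the leading number when there is one ("0 time(s)" -> 0),
    # otherwise keep the raw text ("none").
    first_word = value.split()[0]
    misc_stats[label] = int(first_word) if first_word.isdigit() else value
```

After that, `misc_stats["Points in the Paint"]` gives you the integer directly, with no row and column positions to remember.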

That's all I've got. The last 3 tables in the `tables` list contain the same information for Green Bay.

If you have questions, hit me up on [Twitter](https://twitter.com/JMA_Data).