Introduction

For as long as I can remember, mid march to early April has been a time of intrigue. This is when “March Madness” takes place, an annual event where 68 teams fly across the country to participate in a single-elimination basketball tournament to crown a national champion.

The reason it is so exciting is because it has an element of randomness that thousands of people try to predict every year, failing each and every time. The best teams are matched up with the worst teams to begin, which either results in a dominating performance from the better team, or an emotionally charged upset from the lower team. It’s popular for people to fill out brackets to predict what will happen, many in contests with prizes and pride on the line, which makes it even more of an emotional investment. Most other sports playoffs have less teams, less upsets, and more predictable winners. Many argue that this tournament is not the best way to determine a champion, but it certainly is the most fun.

I have been curious about the characteristics of teams that are selected by the tournament committee as well as the teams that do well once they get to the tournament. I want to know- are teams with high assist rates (meaning they pass the ball well) more successful? How important is it for tourney success to play a hard schedule during the regular season? Is 3 point shooting percentage correlated with more wins and Tournament appearances?

I will be answering these questions in a later post; I first needed to find, scrape, and clean my data so that these questions could be answered.

Data Collection

I explored a few different options to gather my data to answer my questions. ESPN has interesting data, but it didn’t have a collection that would let me get all the stats for all the teams from one place. I then tried an api website called sportsradar, but truthfully, it was confusing to me becuase I could not find a table that showed what the codes were for specific teams. I then went to a website I had used for fun a few times, basketball-reference. This site is awesome becuase they cover all different sports and have a wide variety of statistics and accurate coverage for almost all teams.

Once I landed on this site, I decided to use the 2020-2021 basketball season (one of my favorites, since I thought there were great teams and I went to the tournament in person). I decided to use the “basic stats,” provided in a (roughly) 355x40 table. I thought about using the advanced stats; I oped for basic stats this time since analysis isn’t my main goal in this project, but I may go back to the advanced stats later.

I was able to right click and select “inspect” so that I could figure out the div and id tags that I would need to use in my web scraping script to get the data I wanted. This is what it looked like:

Figure

Using what I learned in class, previous projects, and a bit of help from chat GPT, I was able to scrape the data as follows this:

from bs4 import BeautifulSoup
import requests
import pandas as pd

Here, I simply installed required packages. Beautiful soup and requests allows me to scrape and fetch the data from the webpage, and pandas helps format it in a data frame once I’ve got it.

#combine a lot of steps: get the beautiful soup function called on the webpage
soup = BeautifulSoup(
    requests.get(
        "https://www.sports-reference.com/cbb/seasons/men/2021-school-stats.html", 
        headers={"User-Agent": "Chrome"}).content,
    'html.parser')

Here, we combine a couple of steps to get our data from sports-reference.com. ‘requests.get()’ sends an HTTP get request to the url, and the ‘BeautifulSoup’ function takes in first the site to parse, and second, the parser. In this case, it is parsing content from requests.get(), and uses an ‘html.parser’ to do it.

# Find the table element by its class name
table = soup.find('table', {'class': 'per_match_toggle sortable stats_table', 'id': 'basic_school_stats'})

This is where we actually get the table- we know from inspecting the page earlier that the table starts in the “per_match_toggle sortable stats_table” class, with the ID being “basic_school_stats.” Our ‘table’ object now has all the data we want from the table, but in an unreadable format. We have to get our headers adn split up the rows.

# Extract the table headers
headers = []
for th in table.find_all('th'):
    headers.append(th.text.strip())

    #Since the headers occur every 20 teams, and have, there's like 1000. 
headers=headers[13:50]
subheaders=headers[0:12]

Each header is signified by ‘th’, so we are doing a simple loop to put the headers in a list. The table had subheaders and it showed the headers every 20 teams, so we selected the headers just the first time they appeared.

headers=headers[13:50]

Then, to loop through and put each row in our list:

# Extract the table rows
rows = []
for tr in table.find_all('tr')[1:]:
    row = []
    for td in tr.find_all('td'):
        row.append(td.text.strip())
    rows.append(row)

Data Cleaning

Lastly, we have to combine the rows and headers and format the table how we want it:

# Create a pandas dataframe from the headers and rows
df = pd.DataFrame(rows, columns=headers,)

# Print the dataframe, get rid of NAs and Make a new column for rank/team in alphabetical #order
df.dropna(axis = 0, how = 'all', inplace = True)
df.insert(0, 'Rank', range(1, 1+len(df)))

# Drop the index column
df = df.reset_index(drop=True)
df

This is the result:

Figure

Then, we want to add a column for whether a team made the tournament. The ‘school’ column has ‘NCAA’ after the team name if they made the tourney, so we had to make a function to make a new column with a “Yes” or “No” based on if a team made the tournament..

# define lambda function to check if "NCAA" is present in the string
is_ncaa = lambda x: "Yes" if "NCAA" in x else "No"

# apply the lambda function to each row in the "School" column and create a new "Tournament" column
df['Tournament'] = df['School'].apply(is_ncaa)

# insert the "Tournament" column at position 2 (index 1)
df.insert(1, 'Tournament', df.pop('Tournament'))

# print the dataframe
df

This is our final result:

Figure

We are now ready to start analyzing/graphing doing exploratory data analysis on the teams. I will definitely will make plots for assist and turnover rates, based on if they made the tournament. I still need to think about how to answer the rest of my questions, but now I have the data to start!

At first I thought it would be difficult to scrape data for myself. It took lots of trial and error, but now I feel confident and could do it again for a different project. I invite you to get after the projects that you’ve had in the back of your mind - they might be more possible than you think!