Introduction

Welcome back to Moneyball Moments. If you’ve been following along on the page, you’re ready to dive into some data visualization for NCAA basketball. If this is your first time, well, strap in!

Figure

Intent on learning the characteristics of teams who qualify for the 68-team “March Madness” tournament at the end of the year and which teams find in-tournament success, I scraped and cleaned data in sports-Reference.com from the 2021 College Basketball season.

Figure

With the data, I created a plethora of exploratory graphs to help determine what stats made a good tournament team. This blog is split up into sections based on the stats we are analyzing.

Getting To the Tournament

Assists

One of my initial curiosities was the relationship between assists and making the tournament. I decided that using an assist/game metric would be better than total season assists, so I plotted average ast/game for tournament vs non-tourney teams.

Figure

Tournament teams seem to have more assists per game which makes sense (better teams make the tourney, more assists make a team better) but the difference is only about 1 Assist, which isn’t a huge difference. I would imagine at this point that there are better predictive metrics out there.

Strength of Schedule and Win-Loss Record

Next, I decided to see if tournament teams could be easily identified by Win-Loss record (W-L%), strength of schedule (SOS), or both. I plotted W-L& record vs SOS and sorted the colors by whether a school was Tournament team.

Figure

The first thing that jumps out to me is that the 3 of the final four teams Baylor (champ), Gonzaga (runner-up), and Houston were all clustered together in the top right. This is preliminary evidence that on top of winning lots of games, it’s important to have a good strength of schedule. The three teams below them; Colgate, Winthrop, and Belmont, didn’t play strong teams and didn’t have any tournament wins. It’s important to note that not all teams have the same ability to have a higher SOS - the majority of teams a given school will play come from their conference, and as a general rule, teams in the same conference are at comparable strengths.

Side note- Gonzaga is comfortably at the top of the graph, so supposedly the best team, but they didn’t win the championship.

That leads into the next interesting observation - The top 5 hardest schedules (the ones along the top) were played by schools from only the Big 10 conference. Looking into the data, Michigan State only won 54% of their games, but since they played in the Big 10 (consequently facing a tough schedule) they received an invitation to the tournament (they end up losing in the first round).

This begs the question: is it better for middle-of-the-pack to decent team to be in a hard conference, where they don’t get a good record but face good teams, or be a team like Winthrop that wins all their games but plays poor-performing teams? It’s hard to say, but I believe that if a team is good enough, they will win regardless of their conference. Evidence of this is that right now in the 2023 tournament, we are down to 4 teams, and only half come from “power conferences.” This means that the two other teams, San Diego State and Florida Atlantic are in conferences where their strength of schedule isn’t very high, but they are still finding success! What’s funny too is that the power conference teams, UConn and Miami, have combined for 15 losses on the season. Essentially, anyone can be good!

I like the team labels on this graph, but was curious about the other schools that weren’t labeled, so I made this interactive plot:

I found the neatest way to use this was to hover over the red dots in the sea of blue to see which teams seemed to be worthy of the tournament based on these metrics but didn’t make it. (Can you find Louisville, Arizona, and Memphis?)

The blue teams that had poor records or SOS or both probably won their conference championships, which got them an automatic bid to the tournament.

Let me know in the comments which Strength of Schedule graph you prefer! Was it the one with the built in labels, or the interactive plot?

SRS (performance rating)

This last graph was insightful, but it failed to tell the stories of how teams won certain games. They could have won by one point in overtime every game or beat each team by 40, but the quality of their wins wouldn’t be well reflected in SOS v Winning %. So, we turned to the SRS metric, which is a rating that takes into account average point differential in games and strength of schedule. It’s denominated in points above/below areage, where zero is average. This, mapped against winning %, should give us an even clearer picture of who the good teams were.

As we hoped, this interactive plot does a pretty good job at showing which teams were truly the “best.” I mainly say this because 3 dots in the upper right-hand corner were the 3 of the top 4 teams in the tournament. As I mentioned in my last post , many don’t like the tournament because they don’t think it does a good job at determining the best teams because of all the “madness” that results in upsets, but this year, the best teams across the whole season (regular season and tournament), according to these metrics, are the teams that won! Interesting note, Gonzaga was actually considered one of the best teams of all time (according to analytics and commentary from basketball pundits) before they lost to Baylor in the championship game.

Figure

This is the same plot as above, just in a non-interactive way that shows the top 6 winning % teams and SRS teams (Gonzaga, Baylor, and Houston were top 6 in both categories). I like having both plots because seeing the top teams on the stationary graph all at the same time helps see their relationship quickly, whereas the interactive graph requires more time to find a pattern and it can be hard to know which dots to hover over to figure out the story of the graph. I realize the labels can get a little messy which is why I only included a few. Let me know in the comments which plot you like better!

For my fellow Cougar fans- this is how BYU stacked up:

Figure

Pretty good! They deserved the 6 seed they got in the tournament, it was a bummer that they had to play lower ranked 11 seed UCLA in the first round (who went on to the final four).

In-Tournament Performance

3 Point Shooting %

Now that I have looked into characteristics of teams that make the tournament, I want to know what teams who actually won games in the postseason look like.

3 point shooting percentage is a great stat to look at, because three points are worth more than two, so they have the ability to impact the game in a meaningful way. If there is an upset in the tournament, it’s often because the perhaps smaller or less athletic team hits a lot of threes. Whether a shot goes in is binary (it goes in for three points or misses for 0) and for any team or player, there is a probability associated with whether it goes in. Some games, a team could get great shots that just don’t fall, and shoot like 20% from three. Or maybe, a team gets hot and goes 12/20, shooting 60% and winning the game. The point is, it’s random from game to game, but over a season, we can tell who the best three points shooting teams are.

This graph shows a breakdown of teams who made the tournament and how far they advanced, with their 3 point shooting percentage over the season.

This is not surprising at all. Teams that made it farther in the tourney shot better than the group of teams that made it to the last round (the champion, Baylor, is included in each of these columns, Gonzaga, the runner-up, is included in all columns except the last one, etc.) Baylor shot over 40% on the season which is quite impressive. Based on these results, I would say that 3 point shooting is probably a good indicator of success in the tournament

Field Goal Shooting %

I tried something similar, but this time for field goal percentage.

We see pretty similar results, except for the runner-up bar is higher than the champ bar. That means that Gonzaga had a high fg percentage. This is probably thanks to one of their best players, Drew Timme, who shoots (and scores) from close up a lot, and hardly shot threes. Again- if you make shots at a higher rate than other teams, you will be successful in the tournament.

Turnovers

The last metric I wanted to explore was Turnovers. Do teams with less turnovers have more success in the tournament?

Based off what we see here, it doesn’t seem that turnovers were a huge indicator of success, since they all averaged about the same per game (range 10.5-12.9). If we explored all the teams and I imagine we might see a trend over all the 300+ teams, but when just looking at the tournament teams, it was not as clear cut.

Conclusion

This was a blast to work on. My favorite part was looking at the team specific data because I already had an idea in my head of how good these teams were, so it was neat to either validate what I thought about the teams with data, or have to rethink what I thought about them. For example - Seeing Gonzaga, Baylor, and Houston at the top of the SRS vs W-L% graph was satisfying.

I feel like I found some answers to my original questions about what makes a good basketball team (3point %!), but realized there’s so much more that could be done to answer them. If I were to further analyze this data, I would create a regression model and do inference to determine the most important stats that indicate a good team.

I hope that you have enjoyed looking through the graphs and perhaps have been intrigued to learn more about this awesome sport, or dive into data visualization yourself. I only used matplot lib, seaborn, and plotly.express, but I was able to produce a variety of graphs that told different stories with different flavors. Also, Chat GPT was an incredible tool because it provided the foundation for the more complicated graphs, so that I could spend my time fine-tuning instead of building from scratch.

Let me know what you liked generally, suggestions for improvement, questions about methods, or comments on the data!

Link to code/github repo