Players, Positions, and Probability in the NBA

Using Supervised Machine Learning to build an NBA Position Classifier

Jeremy Lee

Kobe Bryant shoots a free throw during a game against the Sacramento Kings. Photo by Ramiro Pianarosa on Unsplash

Let’s start this blog with an important question: Why would you need to use machine learning to classify NBA players by position? A casual basketball fan has likely watched enough basketball to identify which player plays which position. Even a person with no knowledge of the game could find some YouTube clips of the all-time greats to understand what each position does. So what extra information might a machine learning algorithm offer us that would justify its use?

Well, when it comes to classification, machine learning algorithms use a predicted probability to decide whether or not something belongs to a group. In the case of NBA positions, a machine learning algorithm is going to predict the probability that a player is a point guard, shooting guard, small forward, power forward, or center. The algorithm will give us a probability for each position for any given player, and it’s these probabilities that could prove useful to a front office or in a discussion among NBA analysts.

With each new generation of NBA talent, we have seen more hybridization among players’ skill sets. Centers are no longer strictly close-to-the-basket, low-post players who only come to the 3 point line to set a screen. Modern centers drift out to the 3 point line to take and make 3’s, and some have the requisite skills to effectively run a fast break. With positional probabilities assigned to a player, we can begin to understand that player’s versatility. If we know the positional probabilities of each player on a team, we can use them for a lineup analysis of our own team or an opponent.

Let’s also consider another use for these positional probabilities. Each year a panel of sportswriters and broadcasters votes for various awards like Most Valuable Player, Defensive Player of the Year, and the All-NBA teams. Per an article on hoopsrumors.com, players are eligible for a supermax contract (30% of a team’s salary cap) if they meet the following criteria:

Make an All-NBA team in the previous season or in 2 of the 3 previous seasons
Be named Defensive Player of the Year in the previous season or in 2 of the 3 previous seasons
Be named Most Valuable Player in at least one of the three previous seasons

Given that these media members’ votes can affect the future earnings of players, it is paramount that we have metrics that are equitable when evaluating and comparing players. The inputs to an NBA position classifier would be some group of recorded statistics that describe a player’s performance and impact on a game. The algorithm would then take those inputs, perform some kind of math, and provide us with the position probabilities. Assuming we choose our inputs sensibly and train the model appropriately, we can generate something like this:

If deciding between two potential MVP candidates, an award voter might use a tool like the one shown above and decide to go with the player who had greater positional diversity.

Data Collection

To create a model that can produce the probabilities shown above, we first have to collect data for each of the players. Per 100 possession stats and advanced stats were collected from basketball-reference.com. Player tracking data was collected from stats.nba.com via the nba_stats_tracking library for Python. The documentation for this library can be found here.

Only regular season data from the 2015–16 through 2018–19 seasons was used. I chose not to use the 2019–20 player data due to the abrupt suspension caused by COVID-19 and the inconsistent total number of regular season games played. Player tracking data from the NBA website is only available going back to the 2013–14 season. As we’ll soon see, the NBA has gone through a bit of an evolution when it comes to shot selection: there has been a significant increase in the number of 3 pointers attempted and made across all teams. By using 4 seasons of data that fall squarely within the 3 point era, my goal was to prevent the model from making incorrect assumptions about a player due to statistical differences from year to year.

The different sources were combined into a single Pandas dataframe using the name of the player and the season. Each player-season was treated as a single observation (i.e., 2015–16 LeBron James is different from 2017–18 LeBron James). I chose to treat each player-season as its own observation for three reasons:

To ensure that there was enough data for training and testing the model
Rosters are very rarely the same from year to year. The introduction of a new player, whether from the draft, a trade, or free agency, may force a coach to adjust lineups, and the players themselves may alter their play for the betterment of the team.
Just like everyone else, NBA players get older every year. Whether it be due to age or an injury, time does affect how players play. An explosive point guard who suffered a major injury might rely more on jump shooting after recovering and be better suited as a shooting guard while letting another player run the offense.
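The merge described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the column names (player, season, pos, fg3a_per100, time_of_poss) are hypothetical and every value is invented, so it shows the shape of the join rather than the actual project code.

```python
import pandas as pd

# Toy stand-in for the basketball-reference.com stats (all values invented).
per100 = pd.DataFrame({
    "player": ["LeBron James", "LeBron James", "Kevin Love"],
    "season": ["2015-16", "2017-18", "2015-16"],
    "pos": ["SF", "SF", "PF"],
    "fg3a_per100": [5.0, 6.5, 9.0],
})

# Toy stand-in for the stats.nba.com tracking data (all values invented).
tracking = pd.DataFrame({
    "player": ["LeBron James", "LeBron James", "Kevin Love"],
    "season": ["2015-16", "2017-18", "2015-16"],
    "time_of_poss": [6.0, 7.0, 1.5],
})

# One row per player-season: join on both keys, and let pandas raise an
# error if either source has duplicate (player, season) pairs.
player_seasons = per100.merge(
    tracking, on=["player", "season"], how="inner", validate="one_to_one"
)
print(player_seasons)
```

Joining on both keys is what keeps 2015–16 LeBron James and 2017–18 LeBron James as separate rows.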

Data Exploration

In order to build my final model, I ended up using 51 variables for each player. Obviously it would take an incredible amount of time to produce and analyze each of those statistics, so I’ll only include a few in this blog. I’ll leave a link to my Github repo below where you can see more analysis and visualizations.

I briefly mentioned that 3 point shooting has become an increasingly important skill in today’s NBA. Let’s look at 3 point shooting across 4 seasons by position:

Each position has taken an increasing number of 3 pointers per 100 possessions over the past 4 seasons, and we can also see that power forwards and centers have improved their percentage over time. Although power forwards and centers don’t make 3’s at the same rate as the perimeter positions, this is one metric that demonstrates that the increasing level of talent from players requires that

As the number of 3 point attempts increases, the number of 2 point attempts must decrease. There has been an overall positive trend across all positions for percentage made. The year over year increase in 2 point percentage is likely attributable to a higher frequency of 2 point shots taken near the basket as opposed to mid-range shots. Here is a good article to see how shot selection has changed over the years in the NBA.

Let’s also take a look at a player tracking metric such as time of possession.

What’s important to note here is that there seems to be a clear pecking order for who holds the ball the most. Statistics where we can see a clear delineation between positions are only going to help our classification model perform better. Similar to time of possession, we see some distinction when looking at the number of 3 point attempts per 100 possessions or 2 point percentage.
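The per-position comparisons in this section all come down to grouping player-seasons by season and position and averaging a statistic. Here is a rough sketch of that step. The column names are hypothetical and the numbers are invented purely to show the mechanics, so they are not real NBA figures.

```python
import pandas as pd

# Toy player-season rows (all values invented for illustration).
df = pd.DataFrame({
    "season": ["2015-16"] * 4 + ["2018-19"] * 4,
    "pos": ["PG", "PG", "C", "C"] * 2,
    "fg3a_per100": [6.0, 8.0, 0.5, 1.5, 9.0, 11.0, 2.0, 4.0],
})

# Average 3 point attempts per 100 possessions for each season and position,
# pivoted so that seasons are rows and positions are columns.
by_pos = df.groupby(["season", "pos"])["fg3a_per100"].mean().unstack("pos")
print(by_pos)
```

Swapping in another column, such as time of possession, gives the same kind of table for any of the other statistics.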

Model Selection

Before we get to the final model I used for this project, I’ll first discuss some of the other models that I tried and the reasons why they were not a good fit.

Logistic Regression

Why it’s not a good fit: Logistic regression works best when the input variables are not highly correlated with each other, and many of mine are.

As previously mentioned, I had 51 input variables for each player to determine their position. For example, I chose to include catch-and-shoot 3 point attempts as well as overall 3 point attempts. As catch-and-shoot 3 point attempts increase, so will the total number of 3 point attempts. I would still argue that the number of catch-and-shoot 3 pointers is an important metric when attempting to predict a player’s position. We might expect a greater percentage of shooting guards to take catch-and-shoot 3’s compared to centers.
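To see why those two inputs are a problem for logistic regression, here is a small sketch using synthetic data. It assumes total 3 point attempts are simply catch-and-shoot attempts plus pull-up attempts. Both the pull-up split and the ranges below are assumptions made for illustration, not values from the project.

```python
import numpy as np

rng = np.random.default_rng(0)
n_players = 200

# Synthetic per-100-possession attempt rates (invented ranges).
catch_shoot_3pa = rng.uniform(0, 8, n_players)
pull_up_3pa = rng.uniform(0, 4, n_players)

# Total attempts contain the catch-and-shoot attempts by construction.
total_3pa = catch_shoot_3pa + pull_up_3pa

# Pearson correlation between the two would-be model inputs.
corr = np.corrcoef(catch_shoot_3pa, total_3pa)[0, 1]
print(f"correlation: {corr:.2f}")
```

A correlation this strong means the two coefficients in a logistic regression would be hard to estimate and interpret separately, even though both statistics carry useful information.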

K Nearest Neighbors

Why it’s not a good fit: Accuracy scores were mediocre at best.

Starting with 3 neighbors, the model accuracy was only 64.5%. Repeating the process with more neighbors only led to an incremental increase in accuracy. 11 neighbors resulted in a model accuracy of 68.8%. Although we could continue to increase the number of neighbors in hopes that accuracy keeps improving, that might prove to be risky.

Why is it risky? Let’s recap how the K Nearest Neighbors algorithm works using a few NBA players.

Imagine we are trying to classify players just by the number of assists per game and rebounds per game. If we were to plot each player (assists on the x-axis and rebounds on the y-axis), players with high assist and rebound numbers would be clustered together and players with low assist and rebound numbers would be clustered together. LeBron James has averaged 7.4 assists per game (APG) and 7.4 rebounds per game (RPG) for his career, Russell Westbrook has averaged 8.3 assists and 7.1 rebounds per game, and Jason Kidd has averaged 8.7 APG and 6.3 RPG. Russell Westbrook and Jason Kidd are point guards, while LeBron James (at least for the majority of his career) is a small forward. The nearest neighbor algorithm would choose in this case to classify LeBron as a point guard. Obviously, we can make the case for LeBron being a point guard (or better yet a point forward) but that is a slightly different discussion.
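The LeBron example can be sketched in a few lines of plain Python. The Westbrook, Kidd, and LeBron averages are the ones quoted above. The two small forwards are hypothetical players with made-up averages, added only so that the vote has a second class to choose from.

```python
from collections import Counter
from math import dist

# (APG, RPG) pairs with a position label.
labeled = [
    ("Russell Westbrook", (8.3, 7.1), "PG"),  # averages quoted in the text
    ("Jason Kidd", (8.7, 6.3), "PG"),         # averages quoted in the text
    ("Forward A", (3.0, 6.0), "SF"),          # hypothetical, made-up averages
    ("Forward B", (2.5, 5.0), "SF"),          # hypothetical, made-up averages
]
lebron = (7.4, 7.4)


def knn_predict(point, labeled_points, k=3):
    """Return the majority label among the k closest labeled points."""
    nearest = sorted(labeled_points, key=lambda row: dist(point, row[1]))[:k]
    votes = Counter(label for _, _, label in nearest)
    return votes.most_common(1)[0][0]


prediction = knn_predict(lebron, labeled)
print(prediction)
```

With k=3, the two point guards are the closest neighbors and outvote the nearest small forward, so LeBron is labeled a point guard.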

Random Forests

Why it’s not a good fit: Serious problems with overfitting.

Using random forests and XGBoost, I was able to get training accuracies of 78.2% and 83.8%, respectively. Unfortunately, model accuracy on the test set was only 68.6% and 69.5%, respectively. A gap that large between training and test accuracy makes these tree-based algorithms a poor fit for this problem. When we think about how random forests work, we can develop some intuition as to why this algorithm might not work for this data. Essentially, a random forest is just a group of decision trees, and each of those decision trees establishes thresholds to split our data into groups. If one of the decision trees in the random forest created some threshold where all players with averages of at least 5 PPG, 5 APG, and 5 RPG were small forwards, we would quickly see that this particular tree would be wrong very often. There are definitely a number of players across all positions who have met or exceeded those averages. If we apply a random forest with decision trees like the one described above to an unseen test set, those thresholds might not be applicable.
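To put a number on the overfitting, the short sketch below subtracts test accuracy from training accuracy using only the figures reported in this post. The support vector machine numbers come from the next section, where the training figure is a cross-validated one.

```python
# (training accuracy %, test accuracy %) as reported in this post.
reported = {
    "random forest": (78.2, 68.6),
    "XGBoost": (83.8, 69.5),
    "SVM": (74, 73),
}

# Gap in percentage points; a large positive gap suggests overfitting.
gaps = {name: round(train - test, 1) for name, (train, test) in reported.items()}
print(gaps)
```

The tree-based models lose roughly 10 to 14 points between training and testing, while the support vector machine loses about 1.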

So what model did work?

The best performing model for this data was a support vector machine. I was able to get a cross-validated training accuracy of 74% and a testing accuracy of 73%. Those two accuracy scores are close enough that we can conclude the model generalizes reasonably well when classifying NBA players by position. For reference, below are the optimal parameters for this model found using GridSearchCV:

{'svc__C': 1,
 'svc__gamma': 'scale',
 'svc__kernel': 'linear',
 'svc__probability': True}
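Parameter names with an svc__ prefix are what scikit-learn produces when a grid search runs over a Pipeline whose SVC step is named svc. The sketch below shows one way such a search could be set up. The synthetic data, the scaling step, and the grid values are all assumptions for illustration and are not the project’s actual code.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the player data: 5 classes, 51 features.
X, y = make_classification(
    n_samples=300, n_features=51, n_informative=10,
    n_classes=5, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# The step name "svc" is what gives the parameters their "svc__" prefix.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("svc", SVC(probability=True, random_state=0)),  # enables predict_proba
])

# Illustrative grid only; the grid used in the project is not shown in the post.
param_grid = {
    "svc__C": [0.1, 1],
    "svc__kernel": ["linear", "rbf"],
    "svc__gamma": ["scale"],
}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)

cv_accuracy = search.best_score_               # mean cross-validated accuracy
test_accuracy = search.score(X_test, y_test)   # accuracy on held-out data
position_probs = search.predict_proba(X_test[:1])  # one probability per class

print(search.best_params_, cv_accuracy, test_accuracy)
```

Two details are worth knowing when reading the parameters above: gamma has no effect when the kernel is linear, and probability=True is what makes per-position probabilities available from an SVC.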

We know that there are many players in the NBA who are talented enough to play multiple positions. This means that we may be able to live with some misclassifications from the model. To better understand how the model performed, let’s discuss some numbers and look at a confusion matrix (read this article to understand how to interpret a multi-class confusion matrix).

Looking at the confusion matrix above, 24% of power forwards were classified as centers, 22% of small forwards were classified as shooting guards, and 20% of shooting guards were classified as small forwards. Considering that those positions may be interchangeable depending on the player (and team), we would need to look at specific player misclassifications to understand whether or not those would be acceptable.
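Percentages like the ones above come from dividing each row of the confusion matrix by the number of players who truly play that position. Here is a minimal sketch of that calculation with a handful of invented labels, so the resulting numbers are not the model’s actual results.

```python
import numpy as np

positions = ["PG", "SG", "SF", "PF", "C"]
index = {pos: i for i, pos in enumerate(positions)}

# Invented true and predicted positions for nine players.
y_true = ["PF", "PF", "PF", "PF", "C", "C", "SF", "SG", "PG"]
y_pred = ["PF", "PF", "PF", "C", "C", "C", "SG", "SG", "PG"]

# Rows are true positions, columns are predicted positions.
counts = np.zeros((len(positions), len(positions)), dtype=int)
for true_pos, pred_pos in zip(y_true, y_pred):
    counts[index[true_pos], index[pred_pos]] += 1

# Divide each row by its total so every row sums to 1.
rates = counts / counts.sum(axis=1, keepdims=True)

pf_as_c = rates[index["PF"], index["C"]]
print(f"{pf_as_c:.0%} of power forwards were classified as centers")
```

Reading across a row tells us where the players at one true position ended up, which is the view used in the discussion here.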

The worst misclassifications above are likely the one center who was predicted to be a small forward and the two shooting guards predicted to be power forwards. We might also look at the group of 14 power forwards that were misclassified as small forwards. It would not be a surprise in today’s NBA if some of those power forwards were to have the requisite shooting ability and ball handling skills to play the small forward position.

A Quick Flashback to the 2016 NBA Finals

In order to see the positional probabilities for a player I created a Player class with a method called .position_breakdown() that allows us to easily visualize how the model sees each player. Conveniently, basketball-reference.com has a similar feature which estimates the percentage of minutes spent at a certain position for a player. These estimates can provide a benchmark for the positional probabilities the model predicts.

We would instantiate a player object and call the .position_breakdown() method as follows to produce the pie plot seen earlier:

LeBron = Player('LeBron', 'James', '15-16')
LeBron.position_breakdown()

As I mentioned in the introduction, we can use the positional probabilities to evaluate different lineups and hopefully gain an understanding as to what advantages or disadvantages those lineups have when pitted against their opponents.

First up, the 2015–2016 Cleveland Cavaliers:

2015–2016 Golden State Warriors:

With the exception of Andrew Bogut, it’s reasonable to assume that the starting players from the 2016 NBA Finals are capable of playing more than one position. When paired with game film and scouting reports, the Cleveland coaching staff might use these positional probabilities to attack Andrew Bogut on offense by forcing him to switch onto a smaller, more mobile player and exploiting that mismatch. The coaching staff might also elect to sub in a wing player for Tristan Thompson and have Kevin Love shift to center and LeBron shift to power forward. Kevin Love’s ability to shoot from 3 might pull Bogut out of the paint and allow the Cavaliers to execute plays where LeBron has a clearer lane to the basket.

With further model tuning we might also look at these positional probabilities alongside a plus/minus statistic so that we could evaluate the production of different lineups.

Food for Thought: Year over Year Changes

Here are some more player position breakdowns that I thought were of interest! Basketball is a team game and your role/position on the court is going to be influenced by who else is on the court with you.

Here’s what it looks like for Kevin Durant with the Oklahoma City Thunder in 2015–16 versus Kevin Durant with the Golden State Warriors in 2016–17:

And here’s what the positional probabilities look like for Russell Westbrook in the same years:

Next Steps and Future Considerations

Now that we have a baseline model that can predict positional probabilities, we would certainly want to improve it! One possible way to improve the model would be to use Principal Component Analysis to try to reduce the number of inputs required to make a prediction. The support vector machine described above used 51 different variables, and by compressing them into a smaller set of components, or by dropping variables that contribute little, we might be able to improve model accuracy. Fewer inputs would also be more helpful when explaining model performance and the significance of various statistics to non-technical stakeholders (e.g. coaches, scouts, or the front office).
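As a rough sketch of what that PCA step might look like, the example below builds 51 synthetic features that are driven by 8 hidden factors and keeps just enough components to explain 95% of the variance. The data, the number of hidden factors, and the 95% threshold are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_players, n_features, n_latent = 400, 51, 8

# Synthetic stats: 51 correlated columns generated from 8 hidden factors.
latent = rng.normal(size=(n_players, n_latent))
loadings = rng.normal(size=(n_latent, n_features))
X = latent @ loadings + 0.1 * rng.normal(size=(n_players, n_features))

# PCA is sensitive to scale, so standardize each column first.
X_scaled = StandardScaler().fit_transform(X)

# A float n_components keeps the fewest components that explain
# at least that fraction of the variance.
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```

One trade-off to keep in mind: each principal component is a weighted mix of the original statistics, so a model with fewer components is not automatically easier to explain in basketball terms.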

A future NBA position classifier might also be useful in evaluating college draft prospects or unsigned free agents. Certain college players may be better fits at different positions with an NBA team as compared to their college team. Unfortunately, player tracking data isn’t as readily available for college or basketball leagues in other countries. I think it would also be beneficial to create these same models for the WNBA and women’s college basketball, but the adoption of player tracking systems is lagging behind in women’s basketball.

Building an NBA position classifier also raises another important question. With a model that is 73% accurate, are the traditional 5 positions still sufficient for describing the various styles of play? We saw that across three of the 5 positions, at least 25% of those players were misclassified. I’d argue that some of those players weren’t misclassified at all, but rather that they have a versatile enough skill set that they don’t fit into one of the five traditional positions. You may have heard of terms like combo guard, 3&D, stretch 4, or point forward. While these aren’t new terms and there is evidence of these types of players going back to earlier eras of NBA basketball, they have become more frequently used terms in the NBA lexicon. Fortunately, there is already research and analysis that tries to better describe the different roles of players on an NBA roster (definitely read this article if any of the above has been interesting to you).

If you made it this far, thanks so much for reading! It was a bit long, but I wanted to write a blog that would be insightful for basketball fans and data scientists alike. As always, any and all feedback is welcome. You can find my project in its entirety at my Github.

Feel free to reach out on LinkedIn or Twitter if you want to discuss more about data science or basketball analytics!