Here is an embedded dashboard that shows a number of statistical insights for NBA teams, their opponents, and individual players. You can compare multiple teams and players. Navigate through the different pages using the scrolling arrows below. (The data is based on the most recent season’s “per-game” numbers.)
(If you cannot see the dashboard embedded below for whatever reason, click here to be taken directly to the dashboard in a separate page.)
The Golden State Warriors have posed quite the conundrum for opposing teams. They are quick, have a spectacular ability to move the ball, and play suffocating defense. Their play in the playoffs thus far has showcased all of these strengths even more, to the point where they seem unbeatable.
I wanted to take a somewhat simplified approach and see if opposing teams are missing something. That is, is there some weakness in their play that opposing teams can exploit, a “weakness in Helm’s Deep”?
The most obvious place to start from a data science point-of-view seemed to me to look at every single shot the Warriors took as a team this season in each game and compile a grand ensemble shot chart. Using the data from Basketball-reference.com and some data scraping scripts I wrote in R, I obtained the following:
Certainly, on the surface, it seems that there is no discernible pattern between made shots and missed shots. This is where the machine learning comes in!
From here, I extracted the x and y coordinates of each shot and recorded a response variable of “made” or “missed” in a table, so that the coordinates were the predictor variables and the shot classification (made/missed) was the response variable. Altogether, there were 7104 observations. Splitting this dataset into a 70% training set and a 30% test set, I tried the following algorithms, recording the % of correctly classified observations:
% of Correctly Predicted Observations
Gradient Boosted Decision Trees
Neural Networks with Entropy Fitting
Naive Bayes Classification with Kernel Density Estimation
One sees that gradient boosted decision trees had the best performance, correctly classifying 62.62% of the test observations. Given how noisy the data is, this is not bad, and much better than expected. I should also mention that these numbers were obtained after tuning the parameters of these models using cross-validation.
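The evaluation protocol described here (shuffle, hold out 30%, score the % of correctly classified test shots) can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the shot data is simulated rather than scraped, every function name is hypothetical, and a toy distance-cutoff classifier stands in for the gradient boosted trees, so the accuracy it prints is not comparable to the numbers quoted above.

```python
import math
import random

def simulate_shots(n, rng):
    """Hypothetical stand-in for the scraped shot table: (x, y, made) rows."""
    shots = []
    for _ in range(n):
        x = rng.uniform(-25.0, 25.0)  # feet left/right of the basket (assumed)
        y = rng.uniform(0.0, 40.0)    # feet out from the baseline (assumed)
        p_make = max(0.25, 0.65 - 0.01 * math.hypot(x, y))  # made-up falloff
        shots.append((x, y, 1 if rng.random() < p_make else 0))
    return shots

def train_test_split(rows, train_fraction, rng):
    """Shuffle a copy of the rows and cut it into training and test parts."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

def accuracy(rows, predict):
    """Fraction of rows whose made/missed label is predicted correctly."""
    return sum(1 for x, y, made in rows if predict(x, y) == made) / len(rows)

def cutoff_model(cutoff):
    """Toy classifier: predict 'made' when the shot is closer than `cutoff` feet."""
    return lambda x, y: 1 if math.hypot(x, y) < cutoff else 0

def fit_cutoff(train):
    """Pick the cutoff (in whole feet) with the best training accuracy."""
    best = max(range(0, 48), key=lambda c: accuracy(train, cutoff_model(c)))
    return cutoff_model(best)

rng = random.Random(0)
shots = simulate_shots(7104, rng)
train, test = train_test_split(shots, 0.70, rng)
model = fit_cutoff(train)
print(f"train={len(train)} test={len(test)} accuracy={accuracy(test, model):.4f}")
```

The key design point is that the cutoff is chosen on the training rows only, and the reported accuracy comes from rows the model never saw.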
Using the gradient boosted decision tree model, we made a set of predictions for a vast number of (x,y)-coordinates on the basketball court. We obtained the following contour plot:
Overlaying this on top of the basketball court diagram, we got:
The contour plot levels denote the probabilities that the GSW will make a shot from a given (x,y) location on the court. As a sanity check, the lowest probabilities seem to be close to the half-court line and beyond the three-point line. The highest probabilities are, surprisingly, along very specific areas of the court: very close to the basket, along the line from the basket to the left corner (extending up slightly), and along a very narrow line extending from the basket to the right corner. Interestingly, the probabilities are low on the right side of the basket, specifically:
A map showing the probabilities more explicitly is as follows (upon uploading it, I realized it is a bit hard to read; I will re-upload a clearer version soon!):
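The grid-prediction step behind a contour plot like this can be sketched as follows. The `make_probability` function is a hypothetical placeholder for a fitted model (it is not the trained gradient boosted model), and the coordinate system, with the basket at the origin and distances in feet, is an assumption made for the example.

```python
import math

def make_probability(x, y):
    """Hypothetical placeholder for a fitted model's predicted make probability."""
    return 1.0 / (1.0 + math.exp(-(0.6 - 0.05 * math.hypot(x, y))))

def probability_grid(model, x_min, x_max, y_min, y_max, step):
    """Evaluate model(x, y) on a regular grid; returns {(x, y): probability}."""
    nx = int((x_max - x_min) / step) + 1
    ny = int((y_max - y_min) / step) + 1
    grid = {}
    for i in range(nx):
        for j in range(ny):
            x, y = x_min + i * step, y_min + j * step
            grid[(x, y)] = model(x, y)
    return grid

# Assumed half-court extent: 50 ft wide, 47 ft from baseline to half court.
grid = probability_grid(make_probability, -25.0, 25.0, 0.0, 47.0, 1.0)
best = max(grid, key=grid.get)
worst = min(grid, key=grid.get)
print(f"highest probability {grid[best]:.3f} at {best}")
print(f"lowest probability {grid[worst]:.3f} at {worst}")
```

A dictionary keyed by grid point keeps the sketch dependency-free; the same values could be handed to any contour-plotting routine.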
In conclusion, it seems that, at least according to a first look at the data, the Warriors do indeed have several “weak spots” in their offense that opponents should certainly look to exploit by designing defensive schemes that force them to take shots in the aforementioned low-probability zones. As for future improvements, I think it would be interesting to add as predictor variables things like geographic location, crowd sizes, team opponent strengths, etc… I will look into making these improvements in the near future.
In a previous article, I showed how one could use data in combination with advanced probability techniques to determine the optimal shot / court positions for LeBron James. I decided to use this algorithm on the Knicks’ starting 5, and obtained the following joint probability density contour plots:
One sees that the Knicks’ offensive strategy is optimal if and only if players get shots as close to the basket as possible. In that case, the players have a high probability of making shots even if defenders are playing them tightly. This means that the Knicks would be best served by driving into the paint, posting up, and Porzingis NOT attempting a multitude of three-point shots.
By the way, a lot of people are convinced nowadays that someone like Porzingis attempting 3’s is a sign of a good offense, as it is an optimal way to space the floor. I am not convinced of this. Spacing the floor geometrically translates to a multi-objective nonlinear optimization problem. In particular, let represent the (x-y)-coordinates of a player on the floor. Spreading the floor means one must maximize (simultaneously) each element of the following distance metric:
subject to . While a player attempting 3-point shots may be one way to solve this problem, I am not convinced that it is the unique solution to this optimization problem. In fact, I am convinced that there are multiple solutions.
This solution is slightly simpler if one realizes that the metric above is symmetric, so that there are only 11 independent components.
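To make the idea of spacing as an optimization target concrete, here is a minimal sketch. It assumes, as only one possible reading, that the metric is the set of pairwise Euclidean distances between the five players, and it collapses the multi-objective problem into a single maximin score (the smallest pairwise distance), which is a simplification of my own choosing. The positions and function names are hypothetical.

```python
import math
from itertools import combinations

def pairwise_distances(positions):
    """Return {(i, j): distance} for every unordered pair of players, i < j."""
    return {
        (i, j): math.dist(positions[i], positions[j])
        for i, j in combinations(range(len(positions)), 2)
    }

def spacing_score(positions):
    """Maximin scalarization: the smallest pairwise distance (larger is better)."""
    return min(pairwise_distances(positions).values())

# Hypothetical half-court positions in feet: x across (0-50), y from baseline (0-47).
five_out = [(3, 3), (47, 3), (10, 25), (40, 25), (25, 30)]
packed = [(22, 5), (28, 5), (25, 10), (20, 12), (30, 12)]

print(f"perimeter-heavy lineup: {spacing_score(five_out):.2f} ft")
print(f"paint-heavy lineup:     {spacing_score(packed):.2f} ft")
```

Because distance is symmetric, only pairs with i < j need to be stored. Many different arrangements can reach the same score, which is in line with the point that the solution need not be unique.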
Where is LeBron James most effective on the court?
Based on 2015-2016 data, we obtained from NBA.com the following data, which tracks LeBron’s FG% based on defender distance:
From Basketball-Reference.com, we then obtained data on LeBron’s FG% based on his shot distance from the basket:
Based on this data, we generated tens of thousands of sample data points to perform a Monte Carlo simulation to obtain relevant probability density functions. We found that the joint PDF was a very lengthy expression(!):
Graphically, this was:
A contour plot of the joint PDF was computed to be:
From this information, we can compute where/when LeBron has the highest probability of making a shot. Numerically, we found that the maximum probability occurs when LeBron’s defender is 0.829988 feet away, while LeBron is 1.59378 feet away from the basket. What is interesting is that this analysis shows that defending LeBron tightly doesn’t seem to be an effective strategy if his shot distance is within 5 feet of the basket. It is only an effective strategy further than 5 feet away from the basket. Therefore, opposing teams have the best chance of stopping LeBron from scoring by playing him tightly and forcing him as far away from the basket as possible.
In this article, I will use Data Science / Machine Learning methodologies to break down the real factors separating the playoff from non-playoff teams. In particular, I used the data from Basketball-Reference.com to associate 44 predictor variables with each team: “FG” “FGA” “FG.” “X3P” “X3PA” “X3P.” “X2P” “X2PA” “X2P.” “FT” “FTA” “FT.” “ORB” “DRB” “TRB” “AST” “STL” “BLK” “TOV” “PF” “PTS” “PS.G” “oFG” “oFGA” “oFG.” “o3P” “o3PA” “o3P.” “o2P” “o2PA” “o2P.” “oFT” “oFTA” “oFT.” “oORB” “oDRB” “oTRB” “oAST” “oSTL” “oBLK” “oTOV” “oPF” “oPTS” “oPS.G”, where the letter ‘o’ before the last 22 predictor variables indicates a defensive variable (‘o’ stands for opponent).
Using principal components analysis (PCA), I was able to project this 44-dimensional data set onto a 5-dimensional one. That is, the first 5 principal components were found to explain 85% of the variance.
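A rough sketch of how an explained-variance figure like this can be computed. It assumes the data is only mean-centered (whether each variable should also be rescaled is a separate choice not covered here) and uses a small simulated table in place of the 44 team statistics. Power iteration with deflation stands in for a library PCA routine, and all helper names are hypothetical.

```python
import random

def center(rows):
    """Subtract each column's mean."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
            for a in range(d)]

def top_eigenpair(matrix, rng, iterations=300):
    """Power iteration for the dominant eigenpair of a symmetric matrix."""
    d = len(matrix)
    vec = [rng.gauss(0, 1) for _ in range(d)]
    for _ in range(iterations):
        nxt = [sum(matrix[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = sum(v * v for v in nxt) ** 0.5
        vec = [v / norm for v in nxt]
    # Rayleigh quotient of the unit vector gives the eigenvalue estimate.
    value = sum(vec[a] * matrix[a][b] * vec[b] for a in range(d) for b in range(d))
    return value, vec

def explained_variance_ratios(rows, k, seed=0):
    """Share of total variance captured by each of the first k components."""
    rng = random.Random(seed)
    cov = covariance(center(rows))
    d = len(cov)
    total = sum(cov[a][a] for a in range(d))
    ratios = []
    for _ in range(k):
        value, vec = top_eigenpair(cov, rng)
        ratios.append(value / total)
        # Deflation: remove the component just found before the next pass.
        cov = [[cov[a][b] - value * vec[a] * vec[b] for b in range(d)]
               for a in range(d)]
    return ratios

# Simulated table: 30 "teams", 6 columns driven by two made-up latent factors.
rng = random.Random(1)

def noise():
    return rng.gauss(0, 0.1)

teams = []
for _ in range(30):
    offense, defense = rng.gauss(0, 3), rng.gauss(0, 1)
    teams.append([offense + noise(), 2 * offense + noise(),
                  offense - defense + noise(), defense + noise(),
                  -defense + noise(), 0.5 * defense + noise()])

ratios = explained_variance_ratios(teams, 3)
cumulative = [sum(ratios[:i + 1]) for i in range(len(ratios))]
print([round(c, 3) for c in cumulative])
```

Reading off the first index where the cumulative share passes a chosen threshold (85% in the text) gives the number of components to keep.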
Here are the various biplots:
In these plots, the teams are grouped according to whether they made the playoffs or not.
One sees from this biplot of the first two principal components that the dominant component along the first PC is 3 point attempts, while the dominant component along the second PC is opponent points. CLE and TOR have a high negative score along the second PC indicating a strong defensive performance. Indeed, one suspects that the final separating factor that led CLE to the championship was their defensive play as opposed to 3-point shooting which all-in-all didn’t do GSW any favours. This is in line with some of my previous analyses.
I was thinking about how one can use the NBA’s new SportVU system to figure out optimal positions for players on the court. One of the interesting things about the SportVU system is that it tracks player coordinates (x, y) on the court. Presumably, it also keeps track of whether a player located at (x, y) makes a shot or misses it. Let us denote a player making a shot by 1, and a player missing a shot by 0. Then, one essentially will have data in the form (x, y, 1) or (x, y, 0).
One can then use a logistic regression to determine the probability that a player at position (x, y) will make a shot:
The main idea is that the parameters uniquely characterize a given player’s probability of making a shot.
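For reference, the standard form of such a model, assuming the log-odds are linear in the two coordinates, is sketched below. The symbols $\beta_0, \beta_1, \beta_2$ are notation assumed for this sketch; any additional terms a fuller model might include are not shown.

```latex
\[
P(\text{make} \mid x, y) \;=\; \frac{1}{1 + e^{-(\beta_0 + \beta_1 x + \beta_2 y)}}
\]
```

Under this form, a player is summarized by just three numbers: a baseline $\beta_0$ and one slope for each court direction.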
As a coaching staff thinking from an offensive perspective, let us say we wish to position players so that they have a very high probability of making a shot, say 99% for demonstration purposes. This means we must solve the optimization problem:
(The constraints are determined here by the x-y dimensions of a standard NBA court).
This has the following solutions:
with the following conditions:
One can also have:
with the following conditions:
Another solution is:
with the following conditions:
The fourth possible solution is:
with the following conditions:
In practice, it should be noted that a player is unlikely to have a 99% probability of making a shot.
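As a worked step, assuming a logistic model whose log-odds are linear in the coordinates (with $\beta_0, \beta_1, \beta_2$ as assumed notation for its parameters), the 99% requirement can be inverted directly:

```latex
\[
\frac{1}{1 + e^{-(\beta_0 + \beta_1 x + \beta_2 y)}} = 0.99
\;\Longleftrightarrow\;
\beta_0 + \beta_1 x + \beta_2 y = \ln\frac{0.99}{0.01} = \ln 99 \approx 4.595 .
\]
```

Under that assumption, the 99% locations form a straight line in the (x, y) plane, which must then be intersected with the court boundaries.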
To put this example in more practical terms, I generated some random data (1000 points) for a player in terms of coordinates and whether he made a shot from that distance or not. The following scatter plot shows the result of this simulation:
In this plot, the red dots indicate a player has made a shot (a response of 1.0) from the coordinates given, while a purple dot indicates a player has missed a shot from the coordinates given (a response of 0.0).
Performing a logistic regression on this data, we obtain that .
Using the equations above, we see that this player has a maximum probability of of making a shot from a location of , and a minimum probability of of making a shot from a location of .
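The simulate-then-fit experiment can be sketched as follows. Everything numeric here is an assumption for illustration: the "true" coefficients are made up, the court is taken as 94 by 50 feet with (0, 0) at one corner, and plain gradient ascent stands in for a statistics package's fitting routine. Because a linear logit is monotone in each coordinate, the largest and smallest probabilities over a rectangular court sit at its corners, so only the four corners are checked.

```python
import math
import random

COURT_LENGTH, COURT_WIDTH = 94.0, 50.0  # feet; standard NBA court dimensions

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def simulate_shots(n, beta, rng):
    """Random (x, y, made) rows drawn from an assumed 'true' logistic model."""
    rows = []
    for _ in range(n):
        x, y = rng.uniform(0, COURT_LENGTH), rng.uniform(0, COURT_WIDTH)
        p = sigmoid(beta[0] + beta[1] * x + beta[2] * y)
        rows.append((x, y, 1.0 if rng.random() < p else 0.0))
    return rows

def fit_logistic(rows, steps=1000, rate=2.0):
    """Gradient ascent on the mean log-likelihood, with x and y rescaled to [0, 1]."""
    scaled = [(x / COURT_LENGTH, y / COURT_WIDTH, made) for x, y, made in rows]
    b0 = b1 = b2 = 0.0
    n = len(scaled)
    for _ in range(steps):
        g0 = g1 = g2 = 0.0
        for xs, ys, made in scaled:
            err = made - sigmoid(b0 + b1 * xs + b2 * ys)
            g0 += err
            g1 += err * xs
            g2 += err * ys
        b0 += rate * g0 / n
        b1 += rate * g1 / n
        b2 += rate * g2 / n
    # Undo the rescaling so the coefficients apply to coordinates in feet.
    return b0, b1 / COURT_LENGTH, b2 / COURT_WIDTH

def corner_probabilities(beta):
    """Predicted make probability at each of the four court corners."""
    corners = [(x, y) for x in (0.0, COURT_LENGTH) for y in (0.0, COURT_WIDTH)]
    return {c: sigmoid(beta[0] + beta[1] * c[0] + beta[2] * c[1]) for c in corners}

rng = random.Random(42)
true_beta = (0.5, -0.03, 0.03)  # made-up coefficients, for illustration only
shots = simulate_shots(1000, true_beta, rng)
beta = fit_logistic(shots)
probs = corner_probabilities(beta)
best, worst = max(probs, key=probs.get), min(probs, key=probs.get)
print(f"fitted coefficients: {beta}")
print(f"max probability {probs[best]:.3f} at {best}")
print(f"min probability {probs[worst]:.3f} at {worst}")
```

Rescaling the coordinates to [0, 1] before fitting is a design choice that keeps a fixed step size stable; the coefficients are converted back to feet at the end.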