How to Beat the Golden State Warriors

The Golden State Warriors have posed quite the conundrum for opposing teams. They are quick, have a spectacular ability to move the ball, and play suffocating defense. Their play in the playoffs thus far has made all of these strengths even more apparent, to the point where they seem unbeatable.

I wanted to take a somewhat simplified approach and see if opposing teams are missing something. That is, is there some weakness in their play that opposing teams can exploit, a “weakness in Helm’s Deep”?

The most obvious place to start from a data science point-of-view seemed to me to look at every single shot the Warriors took as a team this season in each game and compile a grand ensemble shot chart. Using the data from Basketball-reference.com and some data scraping scripts I wrote in R, I obtained the following:

Certainly, on the surface, it seems that there is no discernible pattern between made shots and missed shots. This is where the machine learning comes in!

From here, I extracted the x and y coordinates of each shot and recorded whether it was “made” or “missed” in a table, so that the coordinates were the predictor variables and the shot classification (made/missed) was the response variable. Altogether, there were 7104 observations. Splitting this dataset into a 70% training set and a 30% test set, I tried the following algorithms, recording the percentage of correctly classified test observations:

| Algorithm | % of Correctly Predicted Observations |
|---|---|
| Logistic Regression | 56.43 |
| Gradient Boosted Decision Trees | 62.62 |
| Random Forests | 58.54 |
| Neural Networks with Entropy Fitting | 62.47 |
| Naive Bayes Classification with Kernel Density Estimation | 57.32 |

One sees that gradient boosted decision trees had the best performance, correctly classifying 62.62% of the test observations. Given how noisy the data is, this is not bad, and much better than I expected. I should also mention that these numbers were obtained after tuning each model’s parameters using cross-validation.
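
To put an accuracy figure like this in context, it helps to compare it against a trivial baseline that always predicts the more common outcome. The following is a minimal Python sketch of the 70/30 split and the accuracy calculation, not the code actually used for this post; the function names and the placeholder shot list are assumptions made for illustration only.

```python
import random

def train_test_split(rows, train_frac=0.7, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_frac * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

def accuracy(predicted, actual):
    """Percentage of labels predicted correctly."""
    hits = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * hits / len(actual)

def majority_label(train):
    """Most common label in the training rows (ties go to the label seen first)."""
    counts = {}
    for _x, _y, label in train:
        counts[label] = counts.get(label, 0) + 1
    return max(counts, key=counts.get)

# Placeholder shots as (x, y, label) tuples. These are NOT real Warriors data.
shots = [(i % 50, i % 47, "made" if i % 5 < 2 else "missed") for i in range(1000)]

train, test = train_test_split(shots)
baseline = majority_label(train)
test_labels = [label for _x, _y, label in test]
baseline_acc = accuracy([baseline] * len(test_labels), test_labels)
print(f"always predict '{baseline}': {baseline_acc:.2f}% correct")
```

Any model trained on the real shot table would need to beat this baseline for its accuracy to say something about shot location.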

Using the gradient boosted decision tree model, I made a set of predictions for a vast number of (x,y)-coordinates across the basketball court. This gave the following contour plot:

Overlaying this on top of the basketball court diagram, we got:

The contour plot levels denote the probabilities that GSW will make a shot from a given (x,y) location on the court. As a sanity check, the lowest probabilities are close to the half-court line and beyond the three-point line. The highest probabilities, surprisingly, lie along very specific areas of the court: very close to the basket, along the line from the basket to the left corner (extending up slightly), and along a very narrow line extending from the basket to the right corner. Interestingly, the probabilities are low on the right side of the basket, specifically:

A map showing the probabilities more explicitly is as follows. (Upon uploading it, I realized it is a bit hard to read; I will re-upload a clearer version soon!)
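
The grid of predictions behind a contour plot like the one described above can be sketched as follows. This is only an illustration in Python: `toy_probability` is a hypothetical stand-in for the fitted model, and the coordinate system (basket at the origin, feet as units, a 50 ft by 47 ft half court) is an assumption, not necessarily the one used in the original data.

```python
import math

def grid_points(step=1):
    """All (x, y) grid points on a half court, in feet, with the basket at (0, 0)."""
    return [(x, y) for x in range(-25, 26, step) for y in range(0, 48, step)]

def toy_probability(x, y):
    """Hypothetical stand-in for the fitted model: make probability falls with distance."""
    return 0.65 - 0.008 * math.hypot(x, y)

def lowest_probability_cells(predict, points, k=5):
    """The k grid points where the model predicts the lowest make probability."""
    return sorted(points, key=lambda p: predict(*p))[:k]

points = grid_points()
for x, y in lowest_probability_cells(toy_probability, points):
    print(x, y, round(toy_probability(x, y), 3))
```

With a real fitted model in place of `toy_probability`, the same loop would list the candidate “weak spots” directly instead of reading them off the contour plot.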

In conclusion, it seems that, at least according to a first look at the data, the Warriors do indeed have several “weak spots” in their offense that opponents should certainly look to exploit by designing defensive schemes that force them to take shots in the aforementioned low-probability zones. As for future improvements, I think it would be interesting to add as predictor variables things like geographic location, crowd sizes, team opponent strengths, etc… I will look into making these improvements in the near future.

What are the factors behind Golden State’s and Cleveland’s Wins in The NBA Finals

As I write this, Cleveland just won the series 4-3. What was behind each team’s wins and losses in this series?

First, Golden State: A correlation plot of their per game predictor variables versus the binary win/loss outcome is as follows:

The key information is in the last column of this matrix:

Evidently, the factors most strongly associated with GSW’s wins were assists, number of field goals made, field goal percentage, and steals. The factors most strongly associated with GSW’s losses this series were the number of three-point attempts per game (imagine that!) and the number of personal fouls per game.
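
Each entry in the last column of such a matrix is a Pearson correlation between a per-game statistic and a 0/1 win indicator. A minimal Python sketch of that calculation is below; the assist totals are placeholder values invented for illustration and are NOT the real box scores, and with only seven games any such correlation should be read cautiously.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Placeholder per-game assists for a seven-game series (NOT real data).
assists = [29, 26, 19, 24, 18, 22, 20]
# 1 = win, 0 = loss, in game order.
won = [1, 1, 0, 1, 0, 0, 0]

r = pearson(assists, won)
print(f"correlation between assists and winning: {r:.3f}")
```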

Now, Cleveland: A correlation plot of their per game predictor variables versus the binary win/loss outcome is as follows:

The key information is in the last column of this matrix:

Evidently, the most important factor in CLE’s wins was their number of defensive rebounds. Following behind this were the number of three-point shots made and field goal percentage. There were some weak correlations between Cleveland’s losses and their number of offensive rebounds and turnovers.

Note that these results are essentially a summary analysis of previous blog postings which tracked individual games: for example, here, here, and a first attempt here.

Game 1 of CLE vs GSW Breakdown

Using my live tracking app combined with the relevant factors based on this previous work, here is my breakdown of what contributed to the Warriors’ win in Game 1 of the NBA Finals.

First, here is the time series graph of several predictor variables:

Breaking this down a bit further, we have:

Computing the correlations, we obtain:

For the graphically inclined:

One sees that the predictor variable correlated most positively with the Warriors’ lead was the number of fouls Cleveland committed. Therefore, evidently, the most important factor in GSW winning Game 1 was the rate and number of fouls committed by Cleveland during the game.

Breakdown of Game 7 between OKC and GSW

Here is the collection of time series of relevant predictor variables captured live during Game 7 of the Western Conference Finals between The Oklahoma City Thunder and The Golden State Warriors:

Another video animation:

Many commentators are making a point to mention how many three point shots The Warriors made, suggesting that that was the main reason why the Warriors won the game. However, the time series above show otherwise. As can be seen above, OKC’s loss of the lead in the game directly corresponds to GSW’s increase in 2PT %. This can be further confirmed by computing the correlations between OKC’s point difference and all of the other predictor variables plotted above:

One can see from these calculations that OKC’s point difference is strongly negatively correlated with the number of personal fouls they committed during the game, the number of personal fouls GSW committed during the game, and GSW’s 2PT% during the game.
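
For readers wondering how a live 2PT% series like the one discussed above can be built, one simple possibility is a running ratio of makes to attempts, updated after every shot. The Python sketch below assumes that construction; the tracking app itself is not shown in this post, and the shot sequence is a placeholder.

```python
def running_pct(made_flags):
    """Running shooting percentage after each attempt.

    made_flags: 1 for a made shot, 0 for a miss, in game order.
    """
    series = []
    made = 0
    for attempts, flag in enumerate(made_flags, start=1):
        made += flag
        series.append(100.0 * made / attempts)
    return series

# Placeholder sequence of two-point attempts (NOT real play-by-play data).
two_point_attempts = [1, 0, 1, 1, 0, 1]
print([round(p, 1) for p in running_pct(two_point_attempts)])
```

A series built this way swings sharply over the first few attempts, so correlations against the point difference are more trustworthy later in the game than early on.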

Metrics for GSW vs. OKC Game 6 Second Half

Continuing with the live metrics employed yesterday, here is an analysis of the second half of the Warriors-Thunder Game 6.

Here is a plot of the various time series of relevant statistical variables:

One can see from this plot, for example, the exact point in time when OKC loses control of the game.

Further, here are the correlation coefficients of the variables above:

One sees there is a tremendously strong anti-correlation between OKC’s lead and GSW 3PT%, while there is a somewhat strong correlation between OKC’s lead and their 2PT%. This perhaps means that for Game 7, OKC’s 3PT defense needs to greatly improve along with maintaining their 2PT%, which, as can be seen from the plot above, dropped off towards the end of the game.

Analyzing Stephen Curry’s Play

As a long-time Golden State Warriors fan (go Tim Hardaway and Chris Mullin!), I have been watching the Warriors this season with great interest.

Stephen Curry has been getting a lot of attention. It is somewhat of a foregone conclusion that he will be the MVP this season, but I am not completely convinced: watching him play, he gets many open looks over the course of a game.

I was therefore interested in analyzing his FG% as a function of his shot distance from the basket and the distance of the closest defender on the court.

The NBA has made such an analysis somewhat easier with its new analytics tools like Shot Tracking, but answering this question has proven difficult because the trackers do not report FG% as a function of both variables jointly. Instead, they report it as a function of each variable individually. One therefore ends up with a table of data as follows:

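| | FG% | Distance from Basket (> 10 ft) | Closest Defender Distance (ft) |
|---|---|---|---|
| 1 | 56.5 | 10 | NA |
| 2 | 39.0 | 15 | NA |
| 3 | 46.9 | 20 | NA |
| 4 | 46.0 | 25 | NA |
| 5 | 60.0 | 30 | NA |
| 6 | 50.0 | 35 | NA |
| 7 | 36.4 | 40 | NA |
| 8 | 32.5 | NA | 0 |
| 9 | 42.4 | NA | 2 |
| 10 | 50.6 | NA | 4 |
| 11 | 47.8 | NA | 6 |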

The “NA” values are the missing values as a result of not having the complete 3D set of data available.

The only way I could see to alleviate this problem was to perform some type of interpolation.
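
As an illustration of what interpolation means for a table like this, here is a Python sketch of piecewise-linear interpolation along each variable separately, using the values from the table above. This is an assumption about one possible approach, not necessarily the scheme used here: combining the two one-dimensional curves into a full surface requires a further modelling choice that this sketch does not make.

```python
def interpolate(table, x):
    """Piecewise-linear interpolation through sorted (x, value) pairs."""
    if not table[0][0] <= x <= table[-1][0]:
        raise ValueError("x is outside the tabulated range")
    for (x0, v0), (x1, v1) in zip(table, table[1:]):
        if x0 <= x <= x1:
            t = (x - x0) / (x1 - x0)
            return v0 + t * (v1 - v0)

# (distance from basket in ft, FG%) from the table above
fg_by_distance = [(10, 56.5), (15, 39.0), (20, 46.9), (25, 46.0),
                  (30, 60.0), (35, 50.0), (40, 36.4)]
# (closest defender distance in ft, FG%) from the table above
fg_by_defender = [(0, 32.5), (2, 42.4), (4, 50.6), (6, 47.8)]

print(round(interpolate(fg_by_distance, 17), 2))  # estimate for a 17 ft shot
print(round(interpolate(fg_by_defender, 3), 2))   # estimate with a defender 3 ft away
```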

This way, I was able to perform the following surface regression:

This regression on the interpolated data points had an R^2 value of 0.99, so the fit was actually very good.

The actual function for this surface was found to be:

where $d$ denotes the closest defender distance, and $y$ denotes the distance from the basket for shots greater than 10 feet.
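
For reference, integrating the two gradient components quoted later in this post determines the surface up to an additive constant $C$, which cannot be recovered from the gradient alone:

$FG(d,y) = C - 6.813d - 0.3404d^2 + 0.3428d^3 - 0.6783y + 0.9175dy - 0.1534d^2y$

Differentiating this expression with respect to $d$ and $y$ reproduces the $\hat{d}$ and $\hat{y}$ components of $\nabla FG$ term by term.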

Using this function and tools from multivariable calculus, we are able to conclude that:

Min FG% = 38.164% at d = 1, y = 15

That is, Stephen Curry is expected to have his lowest field goal percentage when the closest defender is within 1 foot of him and he is shooting from about 15 feet from the basket. Looking at the plot above, we also see that his FG% generally increases as defenders get further away.

This can be also seen from the following contour plot obtained from computing the gradient of $FG(d,y)$ above:

What about trends? Well, computing the gradient of $FG(d,y)$, we find that:

$\nabla FG = (-6.813-0.6808d+1.0284d^2 + 0.9175 y - 0.3068 d y)\hat{d} + (-0.6783 + 0.9175d - 0.1534d^2)\hat{y}$

The charm of this is that we can now use methods of dynamical systems theory to obtain information about the trends! The vector field $\nabla FG$ is defined on the manifold $\mathbb{R}^2$ in the sense that it is a mapping: $\mathbb{R}^2 \to T\mathbb{R}^2$ that assigns to each point $m \in \mathbb{R}^2$ a vector in $T_{m} \mathbb{R}^2$. We can also interpret this vector field as the right-hand side of a system of first-order autonomous differential equations.

Motivated by this, we see that the fixed points are thus found to be:

$(d_1,y_1) = (0.864142, 10.1679)$ and $(d_2,y_2) = (5.11695,25.4915)$

Evaluating the Jacobian matrix at $(d_1,y_1)$, we find that the eigenvalues corresponding to this point are: $\lambda_1 = -2.21508, \lambda_{2} = 0.192138$. Since they have opposite signs, the first point is a saddle point. Similarly, the eigenvalues at the second point are found to be: $\lambda_1 = 2.21509, \lambda_{2} = -0.192137$, which implies that this point is also a saddle point.
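
The fixed points and eigenvalues above can be checked directly from the quoted gradient. The Python sketch below is a verification aid with function names of my own choosing, not the original calculation. It uses the fact that the $\hat{y}$ component depends only on $d$, so it is a quadratic in $d$, after which the $\hat{d}$ component is linear in $y$.

```python
import math

def gradient(d, y):
    """Components of grad FG, with coefficients as quoted in this post."""
    fd = -6.813 - 0.6808 * d + 1.0284 * d**2 + 0.9175 * y - 0.3068 * d * y
    fy = -0.6783 + 0.9175 * d - 0.1534 * d**2
    return fd, fy

def fixed_points():
    """Solve grad FG = 0: quadratic in d from the y-component, then linear in y."""
    a, b, c = -0.1534, 0.9175, -0.6783
    disc = math.sqrt(b * b - 4 * a * c)
    roots = sorted([(-b + disc) / (2 * a), (-b - disc) / (2 * a)])
    return [(d, (6.813 + 0.6808 * d - 1.0284 * d**2) / (0.9175 - 0.3068 * d))
            for d in roots]

def jacobian_eigenvalues(d, y):
    """Eigenvalues of the Jacobian of grad FG (the Hessian of FG) at (d, y)."""
    fdd = -0.6808 + 2.0568 * d - 0.3068 * y   # d/dd of the d-component
    fdy = 0.9175 - 0.3068 * d                 # mixed partial; the yy entry is 0
    # Symmetric matrix [[fdd, fdy], [fdy, 0]]: lambda^2 - fdd*lambda - fdy^2 = 0
    root = math.sqrt(fdd * fdd + 4 * fdy * fdy)
    return (fdd - root) / 2, (fdd + root) / 2

for d, y in fixed_points():
    low, high = jacobian_eigenvalues(d, y)
    print(f"d = {d:.4f}, y = {y:.4f}, eigenvalues = {low:.4f}, {high:.4f}")
```

Because the gradient's $\hat{y}$ component does not depend on $y$, the determinant of this matrix is $-(0.9175 - 0.3068d)^2$, which is negative wherever that mixed partial is nonzero. Any interior critical point of this surface is therefore a saddle, so a minimum such as the one quoted earlier has to sit on the boundary of the region considered.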

So, in terms of trends, there certainly exist orbits where Stephen Curry tends to shoot away from defenders while also keeping a distance of more than 25 feet from the basket. There also exist orbits where he does the opposite. However, the following vector field plot is very illuminating in terms of displaying Steph Curry’s flow during the game:

One sees that there is a tendency for his shots to converge where the defender is at least three feet away at a minimum distance of 25 feet away from the basket. The saddle point behaviour is very evident in the lower left and upper right corners of the vector field plot.

Stephen Curry and Mahmoud Abdul-Rauf?

As usual, Phil Jackson made another interesting tweet today:

And, as usual, he received many criticisms from “experts” who just looked at the raw numbers for each player and saw no way such a statement could be justified. But it is not that simple!

When you compare two players (or two objects) who have very different data feature values, it is not that they can’t be compared; rather, you must normalize the data somehow to make the sets comparable.
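
One common way to normalize is to convert each statistic to z-scores, so that every feature is measured in standard deviations from its own mean rather than in raw units. The Python sketch below illustrates that idea under my own assumptions; it is not necessarily the normalization used for the plots in this post, and the season values are placeholders rather than real statistics.

```python
import math

def zscores(values):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

# Placeholder six-season series for one statistic (NOT real player data).
player_a = [14.0, 16.5, 18.0, 19.0, 16.0, 19.5]
player_b = [17.5, 18.5, 23.0, 24.0, 23.5, 30.0]

# After standardizing, both series are on the same scale even though
# their raw levels differ.
print([round(z, 2) for z in zscores(player_a)])
print([round(z, 2) for z in zscores(player_b)])
```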

In this case, I used the data from Basketball-Reference.com to compare Chris Jackson’s 6 seasons in Denver to Stephen Curry’s last 6 seasons (including this one) and took into account 45 different statistical measures, and came up with the following correlation matrix/similarity matrix plot:

Dark blue circles indicate a strong correlation, while dark red circles indicate a weak correlation between two sets of features.

What would be of interest in an analysis like this is to examine the diagonal of this matrix, which offers a direct comparison between the two players:

One can see that there are many features that have strong correlation coefficients.

Therefore, it is true that Stephen Curry and Chris Jackson do in fact share many strong similarities!