The Most Optimal Strategy for the Knicks

In a previous article, I showed how one could use data in combination with advanced probability techniques to determine the optimal shot / court positions for LeBron James. I decided to use this algorithm on the Knicks’ starting 5, and obtained the following joint probability density contour plots:

One sees that the Knicks offensive strategy is optimal if and only if players gets shots as close to the basket as possible. If this is the case, the players have a high probability of making shots even if defenders are playing them tightly. This means that the Knicks would be served best by driving in the paint, posting up, and Porzingis NOT attempting a multitude of three point shots.

By the way, a lot of people are convinced nowadays that someone like Porzingis attempting 3’s is a sign of a good offense, as it is an optimal way to space the floor. I am not convinced of this. Spacing the floor geometrically translates to a multi-objective nonlinear optimization problem. In particular, let (x_i, y_i) represent the (x-y)-coordinates of a player on the floor. Spreading the floor means one must maximize (simultaneously) each element of the following distance metric:

distancematrix

subject to -14 \leq x_i \leq 14, 0 \leq y_i \leq 23.75. While a player attempting 3-point shots may be one way to solve this problem, I am not convinced that it is a unique solution to this optimization problem. In fact, I am convinced that there are a multiple of solutions to this optimization problem.

This solution is slightly simpler if one realizes that the metric above is symmetric, so that there are only 11 independent components.

Advertisements

Analyzing Lebron James’ Offensive Play

Where is Lebron James most effective on the court?

Based on 2015-2016 data, we obtained from NBA.com the following data which tracks Lebron’s FG% based on defender distance:

lebrondef

From Basketball-Reference.com, we then obtained data of Lebron’s FG% based on his shot distance from the basket:

lebronshotdist

Based on this data, we generated tens of thousands of sample data points to perform a Monte Carlo simulation to obtain relevant probability density functions. We found that the joint PDF was a very lengthy expression(!):

 

Graphically, this was:

lebronjointplot

A contour plot of the joint PDF was computed to be:

lebroncontour

From this information, we can compute where/when LeBron has the highest probability of making a shot. Numerically, we found that the maximum probability occurs when Lebron’s defender is 0.829988 feet away, while Lebron is 1.59378 feet away from the basket. What is interesting is that this analysis shows that defending Lebron tightly doesn’t seem to be an effective strategy if his shot distance is within 5 feet of the basket. It is only an effective strategy further than 5 feet away from the basket. Therefore, opposing teams have the best chance at stopping Lebron from scoring by playing him tightly and forcing him as far away from the basket as possible.

Breaking Down the 2015-2016 NBA Season

In this article, I will use Data Science / Machine Learning methodologies to break down the real factors separating the playoff from non-playoff teams. In particular, I used the data from Basketball-Reference.com to associate 44 predictor variables which each team: “FG” “FGA” “FG.” “X3P” “X3PA” “X3P.” “X2P” “X2PA” “X2P.” “FT” “FTA” “FT.” “ORB” “DRB” “TRB” “AST”   “STL” “BLK” “TOV” “PF” “PTS” “PS.G” “oFG” “oFGA” “oFG.” “o3P” “o3PA” “o3P.” “o2P” “o2PA” “o2P.” “oFT”   “oFTA” “oFT.” “oORB” “oDRB” “oTRB” “oAST” “oSTL” “oBLK” “oTOV” “oPF” “oPTS” “oPS.G”

, where a letter ‘o’ before the last 22 predictor variables indicates a defensive variable. (‘o’ stands for opponent. )

Using principal components analysis (PCA), I was able to project this 44-dimensional data set to a 5-D dimensional data set. That is, the first 5 principal components were found to explain 85% of the variance. 

Here are the various biplots: 


In these plots, the teams are grouped according to whether they made the playoffs or not. 

One sees from this biplot of the first two principal components that the dominant component along the first PC is 3 point attempts, while the dominant component along the second PC is opponent points. CLE and TOR have a high negative score along the second PC indicating a strong defensive performance. Indeed, one suspects that the final separating factor that led CLE to the championship was their defensive play as opposed to 3-point shooting which all-in-all didn’t do GSW any favours. This is in line with some of my previous analyses

Optimal Positions for NBA Players

I was thinking about how one can use the NBA’s new SportVU system to figure out optimal positions for players on the court. One of the interesting things about the SportVU system is that it tracks player (x,y) coordinates on the court. Presumably, it also keeps track of whether or not a player located at (x,y) makes a shot or misses it. Let us denote a player making a shot by 1, and a player missing a shot by 0. Then, one essentially will have data in the form (x,y, \text{1/0}).

One can then use a logistic regression to determine the probability that a player at position (x,y) will make a shot:

p(x,y) = \frac{\exp\left(\beta_0 + \beta_1 x + \beta_2 y\right)}{1 +\exp\left(\beta_0 + \beta_1 x + \beta_2 y\right)}

The main idea is that the parameters \beta_0, \beta_1, \beta_2 uniquely characterize a given player’s probability of making a shot.

As a coaching staff from an offensive perspective, let us say we wish to position players as to say they have a very high probability of making a shot, let us say, for demonstration purposes 99%. This means we must solve the optimization problem:

\frac{\exp\left(\beta_0 + \beta_1 x + \beta_2 y\right)}{1 +\exp\left(\beta_0 + \beta_1 x + \beta_2 y\right)} = 0.99

\text{s.t. } 0 \leq x \leq 28, \quad 0 \leq y \leq 47

(The constraints are determined here by the x-y dimensions of a standard NBA court).

This has the following solutions:

x = \frac{-1. \beta _0-1. \beta _2 y+4.59512}{\beta _1}, \quad \frac{-1. \beta _0-28. \beta _1+4.59512}{\beta _2} \leq y

with the following conditions:

constraints1

One can also have:

x = \frac{-1. \beta _0-1. \beta _2 y+4.59512}{\beta _1}, \quad y \leq 47

with the following conditions:

constraints2

Another solution is:

x = \frac{-1. \beta _0-1. \beta _2 y+4.59512}{\beta _1}

with the following conditions:

constraints3

The fourth possible solution is:

x = \frac{-1. \beta _0-1. \beta _2 y+4.59512}{\beta _1}

with the following conditions:

constraints4

In practice, it should be noted, that it is typically unlikely to have a player that has a 99% probability of making a shot.

To put this example in more practical terms, I generated some random data (1000 points) for a player in terms of (x,y) coordinates and whether he made a shot from that distance or not. The following scatter plot shows the result of this simulation:

bballoptim5

In this plot, the red dots indicate a player has made a shot (a response of 1.0) from the (x,y) coordinates given, while a purple dot indicates a player has missed a shot from the (x,y) coordinates given (a response of 0.0).

Performing a logistic regression on this data, we obtain that \beta_0 = 0, \beta_1 = 0.00066876, \beta_2 = -0.00210949.

Using the equations above, we see that this player has a maximum probability of 58.7149 \% of making a shot from a location of (x,y) = (0,23), and a minimum probability of 38.45 \% of making a shot from a location of (x,y) = (28,0).

Basketball Paper Update

Everyone by now knows about this paper I wrote a few months ago: http://arxiv.org/abs/1604.05266

Using data science / machine learning methodologies, it basically showed that the most important factors in characterizing a team’s playoff eligibility are the opponent field goal percentage and the opponent points per game. This seems to suggest that defensive factors as opposed to offensive factors are the most important characteristics shared among NBA playoff teams. It was also shown that championship teams must be able to have very strong defensive characteristics, in particular, strong perimeter defense characteristics in combination with an effective half-court offense that generates high-percentage two-point shots. A key part of this offensive strategy must also be the ability to draw fouls. 

Some people have commented that despite this, teams who frequently attempt three point shots still can be considered to have an efficient offense as doing so leads to better rebounding, floor spacing, and higher percentage shots. We show below that this is not true. Looking at the last 16 years of all NBA teams (using the same data we used in the paper), we performed a correlation analysis of an individual NBA team’s 3-point attempts per game and other relevant variables, and discovered: 


One sees that there is very little correlation between a team’s 3-point attempts per game and 2-point percentage, free throws, free throw attempts, and offensive rebounds. In fact, at best, there is a somewhat “medium” anti-correlation between 3-point attempts per game and a team’s 2-point attempts per game. 

The Mathematics of “Filling the Triangle”

I’ve been fascinated by the triangle offense for a long time. I think it is a beautiful way to play basketball, and the right way to play basketball, in the half-court, a “system-based” way to play. For those of you that are interested, I highly recommend Tex Winter’s classic book on the topic.

There is this brief video as well where Tex Winter explains how the triangle offense and a basketball are grounded in geometric principles:

 

I don’t think people recognize though how deep of a geometry problem this is actually. Looking at when the triangle is filled, as in the video above, we have the following situation:

trianglesetup
The 3 triangles that form when one triangle is filled involving all 5 players. The letters a,b,c,d,e,f,g,h,i denote the angles within the triangles. We are assuming NBA court dimensions where the 1/2 court is 47′ long and the team bench area which roughly corresponds to the top of the three-point line is 28′ from the baseline.

The problem I wanted to study was given 5 players’ random positions on the court, could a series of equations be solved yielding (x,y) coordinates that would yield where players should “go” to fill the triangle? 

Using simple geometry, from the diagram above, we see that each player’s position in the triangle offense is governed by the following system of nonlinear equations:

\left(x_4-x_2\right) \left(x_4-x_5\right)+\left(y_4-y_2\right) \left(y_4-y_5\right)=\cos (a) \sqrt{\left(x_2-x_4\right){}^2+\left(y_2-y_4\right){}^2} \sqrt{\left(x_4-x_5\right){}^2+\left(y_4-y_5\right){}^2}

\left(x_4-x_2\right) \left(x_2-x_5\right)+\left(y_4-y_2\right) \left(y_2-y_5\right)=\cos (b) \sqrt{\left(x_2-x_4\right){}^2+\left(y_2-y_4\right){}^2} \sqrt{\left(x_2-x_5\right){}^2+\left(y_2-y_5\right){}^2}

\left(x_2-x_5\right) \left(x_4-x_5\right)+\left(y_2-y_5\right) \left(y_4-y_5\right)=\cos (c) \sqrt{\left(x_2-x_5\right){}^2+\left(y_2-y_5\right){}^2} \sqrt{\left(x_4-x_5\right){}^2+\left(y_4-y_5\right){}^2}

\left(x_2-x_1\right) \left(x_2-x_5\right)+\left(y_2-y_1\right) \left(y_2-y_5\right)=\cos (d) \sqrt{\left(x_1-x_2\right){}^2+\left(y_1-y_2\right){}^2} \sqrt{\left(x_2-x_5\right){}^2+\left(y_2-y_5\right){}^2}

\left(x_2-x_1\right) \left(x_1-x_5\right)+\left(y_2-y_1\right) \left(y_1-y_5\right)=\cos (e) \sqrt{\left(x_1-x_2\right){}^2+\left(y_1-y_2\right){}^2} \sqrt{\left(x_1-x_5\right){}^2+\left(y_1-y_5\right){}^2}

\left(x_1-x_5\right) \left(x_2-x_5\right)+\left(y_1-y_5\right) \left(y_2-y_5\right)=\cos (f) \sqrt{\left(x_1-x_5\right){}^2+\left(y_1-y_5\right){}^2} \sqrt{\left(x_2-x_5\right){}^2+\left(y_2-y_5\right){}^2}

\left(x_1-x_3\right) \left(x_1-x_5\right)+\left(y_1-y_3\right) \left(y_1-y_5\right)=\cos (h) \sqrt{\left(x_1-x_3\right){}^2+\left(y_1-y_3\right){}^2} \sqrt{\left(x_1-x_5\right){}^2+\left(y_1-y_5\right){}^2}

\left(x_1-x_3\right) \left(x_3-x_5\right)+\left(y_1-y_3\right) \left(y_3-y_5\right)=\cos (i) \sqrt{\left(x_1-x_3\right){}^2+\left(y_1-y_3\right){}^2} \sqrt{\left(x_3-x_5\right){}^2+\left(y_3-y_5\right){}^2}

\left(x_1-x_5\right) \left(x_3-x_5\right)+\left(y_1-y_5\right) \left(y_3-y_5\right)=\cos (g) \sqrt{\left(x_1-x_5\right){}^2+\left(y_1-y_5\right){}^2} \sqrt{\left(x_3-x_5\right){}^2+\left(y_3-y_5\right){}^2}

Further, the angles obviously must satisfy the following constraints:

a + b + c = \pi, \quad d + e + f = \pi, \quad g + h + i = \pi

Finally, we require that each player be about 15-20 feet apart in the triangle offense (because the offense is predicated on spacing), and thus have some additional constraints:

15\leq \sqrt{\left(x_2-x_4\right){}^2+\left(y_2-y_4\right){}^2}\leq 20

15\leq \sqrt{\left(x_4-x_5\right){}^2+\left(y_4-y_5\right){}^2}\leq 20

15\leq \sqrt{\left(x_2-x_5\right){}^2+\left(y_2-y_5\right){}^2}\leq 20

15\leq \sqrt{\left(x_1-x_2\right){}^2+\left(y_1-y_2\right){}^2}\leq 20

15\leq \sqrt{\left(x_1-x_5\right){}^2+\left(y_1-y_5\right){}^2}\leq 20

15\leq \sqrt{\left(x_1-x_3\right){}^2+\left(y_1-y_3\right){}^2}\leq 20

15\leq \sqrt{\left(x_3-x_5\right){}^2+\left(y_3-y_5\right){}^2}\leq 20

Solving this highly nonlinear system of equations with constraints is not a trivial problem! It fact, because of the high degree of nonlinearity and dimension of the problem, it is safe to assume that no closed-form solution exists, and therefore, must be solved numerically.

For this task, we used MATLAB, and experimented with the lsqnonlin() and fsolve() commands. The only issue is that (as with all such numerical algorithms) convergence depends very highly on the choice of initial conditions. It is very difficult to choose a priori this many initial conditions, so I wrote a script that randomized initial conditions. I then ran several numerical experiments and obtained the following results:

In the plot above, I have labeled the plots that converged to the triangle formation with the title “this one”. In addition, the five black circles denote the initial positions of the players on the court before they fill the triangles in the offense. One sees just by the diagram above, how difficult such a problem is to solve mathematically, even through a numerical approach. Running more trials would perhaps yield better results, but, it works! I am truly fascinated by this. In the coming days, I will work on optimizing the numerical algorithm, and post my updates as they come.

Here is an animation of one of the scenarios above when the algorithm converges correctly:

In this animation above, the black dots represent the positions of the players on the court. They begin at initial (random) positions and attempt to fill the triangles as described above.

Thanks for reading!

Breakdown of Game 7 between OKC and GSW

Here is the collection of time series of relevant predictor variables captured live during Game 7 of the Western Conference Finals between The Oklahoma City Thunder and The Golden State Warriors:

Another video animation:

Many commentators are making a point to mention how many three point shots The Warriors made, suggesting that that was the main reason why the Warriors won the game. However, the time series above show otherwise. As can be seen above, OKC’s loss of the lead in the game directly corresponds to GSW’s increase in 2PT %. This can be further confirmed by computing the correlations between OKC’s point difference and all of the other predictor variables plotted above:


One can see from these calculations that OKC’s point difference is strongly negatively correlated with the amount of personal fouls they committed during the game, the amount of personal fouls GSW committed during the game, and GSW 2PT% during the game.