## Breaking Down the 2015-2016 NBA Season

In this article, I use data science / machine learning methodologies to break down the real factors separating playoff from non-playoff teams. In particular, I used the data from Basketball-Reference.com to associate 44 predictor variables with each team: “FG”, “FGA”, “FG.”, “X3P”, “X3PA”, “X3P.”, “X2P”, “X2PA”, “X2P.”, “FT”, “FTA”, “FT.”, “ORB”, “DRB”, “TRB”, “AST”, “STL”, “BLK”, “TOV”, “PF”, “PTS”, “PS.G”, “oFG”, “oFGA”, “oFG.”, “o3P”, “o3PA”, “o3P.”, “o2P”, “o2PA”, “o2P.”, “oFT”, “oFTA”, “oFT.”, “oORB”, “oDRB”, “oTRB”, “oAST”, “oSTL”, “oBLK”, “oTOV”, “oPF”, “oPTS”, “oPS.G”.

Here, the letter ‘o’ before each of the last 22 predictor variables indicates a defensive variable (‘o’ stands for opponent).

Using principal components analysis (PCA), I was able to project this 44-dimensional data set onto a 5-dimensional one. That is, the first 5 principal components were found to explain 85% of the variance.
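As a rough illustration of how the number of retained components can be read off from the explained variance, here is a minimal sketch. It assumes the columns are standardized before PCA (the post does not say whether this was done), and the matrix `X` is random filler standing in for the team-by-variable table, not real NBA data. The function name `explained_variance_ratio` is my own.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of total variance explained by each principal component."""
    # Assumption: standardize columns so counts and percentages are comparable.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Squared singular values of the standardized matrix give the PC variances.
    singular_values = np.linalg.svd(Z, compute_uv=False)
    variances = singular_values**2 / (Z.shape[0] - 1)
    return variances / variances.sum()

# Hypothetical stand-in for a 30-team x 44-variable table (random numbers).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 44))

ratio = explained_variance_ratio(X)
cumulative = np.cumsum(ratio)
# Smallest number of components whose cumulative explained variance reaches 85%.
n_components = int(np.searchsorted(cumulative, 0.85) + 1)
print(n_components, cumulative[:5])
```

On real, strongly correlated box-score data, the cumulative curve rises much faster than it does for random filler, which is what makes a cut-off as low as 5 components plausible.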

Here are the various biplots. In these plots, the teams are grouped according to whether they made the playoffs or not.

One sees from the biplot of the first two principal components that the dominant component along the first PC is 3-point attempts, while the dominant component along the second PC is opponent points. CLE and TOR have a high negative score along the second PC, indicating a strong defensive performance. Indeed, one suspects that the final separating factor that led CLE to the championship was their defensive play, as opposed to 3-point shooting, which all-in-all didn’t do GSW any favours. This is in line with some of my previous analyses.

## Optimal Positions for NBA Players

I was thinking about how one can use the NBA’s new SportVU system to figure out optimal positions for players on the court. One of the interesting things about the SportVU system is that it tracks player $(x,y)$ coordinates on the court. Presumably, it also keeps track of whether or not a player located at $(x,y)$ makes a shot or misses it. Let us denote a player making a shot by $1$, and a player missing a shot by $0$. Then, one essentially will have data in the form $(x,y, \text{1/0})$.

One can then use a logistic regression to determine the probability that a player at position $(x,y)$ will make a shot: $p(x,y) = \frac{\exp\left(\beta_0 + \beta_1 x + \beta_2 y\right)}{1 +\exp\left(\beta_0 + \beta_1 x + \beta_2 y\right)}$

The main idea is that the parameters $\beta_0, \beta_1, \beta_2$ uniquely characterize a given player’s probability of making a shot.
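To make the model concrete, here is a minimal sketch that simulates $(x, y, \text{1/0})$ shot records and recovers $\beta_0, \beta_1, \beta_2$ by plain gradient ascent on the log-likelihood. The "true" coefficients, the sample size, and the helper names `shot_probability` and `fit_logistic` are all assumptions made for illustration. Real SportVU data and a library fitter would replace them in practice.

```python
import math
import random

def shot_probability(x, y, b0, b1, b2):
    """Logistic model p(x, y) from the formula above."""
    z = b0 + b1 * x + b2 * y
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(shots, steps=1000, lr=1.0):
    """Fit (b0, b1, b2) by gradient ascent on the mean log-likelihood.

    Coordinates are rescaled to [0, 1] internally to keep the steps stable;
    the coefficients are converted back to per-foot units at the end.
    """
    x_max, y_max = 28.0, 47.0
    b0 = b1 = b2 = 0.0
    n = len(shots)
    for _ in range(steps):
        g0 = g1 = g2 = 0.0
        for x, y, made in shots:
            xs, ys = x / x_max, y / y_max
            err = made - shot_probability(xs, ys, b0, b1, b2)
            g0 += err
            g1 += err * xs
            g2 += err * ys
        b0 += lr * g0 / n
        b1 += lr * g1 / n
        b2 += lr * g2 / n
    return b0, b1 / x_max, b2 / y_max

# Hypothetical player: accuracy falls off as y grows (coefficients are invented).
TRUE = (1.0, 0.0, -0.08)
rng = random.Random(0)
shots = []
for _ in range(1000):
    x, y = rng.uniform(0, 28), rng.uniform(0, 47)
    made = 1 if rng.random() < shot_probability(x, y, *TRUE) else 0
    shots.append((x, y, made))

b0, b1, b2 = fit_logistic(shots)
print(b0, b1, b2)
print(shot_probability(14, 5, b0, b1, b2), shot_probability(14, 40, b0, b1, b2))
```

The fitted triple plays the role of the player's "signature": two players with different coefficients have different probability surfaces over the same court.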

From an offensive perspective, suppose a coaching staff wishes to position players so that they have a very high probability of making a shot, say 99% for demonstration purposes. This means we must solve the following equation subject to constraints: $\frac{\exp\left(\beta_0 + \beta_1 x + \beta_2 y\right)}{1 +\exp\left(\beta_0 + \beta_1 x + \beta_2 y\right)} = 0.99$ $\text{s.t. } 0 \leq x \leq 28, \quad 0 \leq y \leq 47$

(The constraints are determined here by the x-y dimensions of a standard NBA court).

This has the following solutions, each of which holds under its own set of conditions (not reproduced here). The first is: $x = \frac{-\beta_0 - \beta_2 y + 4.59512}{\beta_1}, \quad \frac{-\beta_0 - 28 \beta_1 + 4.59512}{\beta_2} \leq y$

One can also have: $x = \frac{-\beta_0 - \beta_2 y + 4.59512}{\beta_1}, \quad y \leq 47$

Another solution is: $x = \frac{-\beta_0 - \beta_2 y + 4.59512}{\beta_1}$

The fourth possible solution is: $x = \frac{-\beta_0 - \beta_2 y + 4.59512}{\beta_1}$

In practice, it should be noted that it is unlikely to have a player with a 99% probability of making a shot.
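The constant $4.59512$ in these solutions is the log-odds of a 99% make probability. The short derivation below uses only the logistic model defined earlier, with the shorthand $z$ (introduced here for convenience) for the linear predictor.

$z = \beta_0 + \beta_1 x + \beta_2 y, \qquad \frac{e^{z}}{1+e^{z}} = 0.99 \;\Longrightarrow\; e^{z} = \frac{0.99}{0.01} = 99 \;\Longrightarrow\; z = \ln 99 \approx 4.59512$

Solving $\beta_0 + \beta_1 x + \beta_2 y = 4.59512$ for $x$ (assuming $\beta_1 \neq 0$) gives $x = \frac{4.59512 - \beta_0 - \beta_2 y}{\beta_1}$, the line that appears in every solution. Setting $x = 28$ on this line gives $y = \frac{4.59512 - \beta_0 - 28\beta_1}{\beta_2}$ (assuming $\beta_2 \neq 0$), which is the bound on $y$ in the first solution.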

To put this example in more practical terms, I generated some random data (1000 points) for a player in terms of $(x,y)$ coordinates and whether he made a shot from that location or not. The following scatter plot shows the result of this simulation:

In this plot, a red dot indicates that the player made a shot (a response of 1.0) from the given $(x,y)$ coordinates, while a purple dot indicates that the player missed a shot (a response of 0.0) from the given $(x,y)$ coordinates.

Performing a logistic regression on this data, we obtain that $\beta_0 = 0, \beta_1 = 0.00066876, \beta_2 = -0.00210949$.

Using the equations above, we see that this player has a maximum probability of $58.7149 \%$ of making a shot from a location of $(x,y) = (0,23)$, and a minimum probability of $38.45 \%$ of making a shot from a location of $(x,y) = (28,0)$.

Using data science / machine learning methodologies, our paper showed that the most important factors in characterizing a team’s playoff eligibility are the opponent field goal percentage and the opponent points per game. This suggests that defensive factors, as opposed to offensive factors, are the most important characteristics shared among NBA playoff teams. It was also shown that championship teams must have very strong defensive characteristics, in particular strong perimeter defense, in combination with an effective half-court offense that generates high-percentage two-point shots. A key part of this offensive strategy must also be the ability to draw fouls.

Some people have commented that, despite this, teams who frequently attempt three-point shots can still be considered to have an efficient offense, as doing so leads to better rebounding, floor spacing, and higher-percentage shots. We show below that this is not true. Looking at the last 16 years of all NBA teams (using the same data we used in the paper), we performed a correlation analysis of an individual NBA team’s 3-point attempts per game and other relevant variables.

One sees that there is very little correlation between a team’s 3-point attempts per game and its 2-point percentage, free throws, free throw attempts, and offensive rebounds. In fact, at best, there is a “medium” anti-correlation between a team’s 3-point attempts per game and its 2-point attempts per game.

## The Mathematics of The Triangle Offense, Continued…

In a previous post, I showed how 5 players, starting from random positions on the court, could “fill” the triangle. The main geometric constraint is that 5 players can form 3 triangles on the court, and that due to spacing requirements, these triangles are “optimal” if they are equilateral triangles.

Given that we now know how to fill the triangle, the question this post tries to address is how players can actually move within the triangle. The key is symmetry. Players must all move in such a way that the equilateral triangles remain invariant. Equilateral triangles have associated with them the $D_{3}$ dihedral symmetry group. They are therefore invariant with respect to 0 degree, 120 degree, and 240 degree rotations, and three reflections.

There are therefore six elements of this group: $\left( \begin{array}{cc} 1 & 0 \\ 0 & 1 \\ \end{array} \right), \left( \begin{array}{cc} -\frac{1}{2} & -\frac{\sqrt{3}}{2} \\ \frac{\sqrt{3}}{2} & -\frac{1}{2} \\ \end{array} \right),\left( \begin{array}{cc} -\frac{1}{2} & \frac{\sqrt{3}}{2} \\ -\frac{\sqrt{3}}{2} & -\frac{1}{2} \\ \end{array} \right), \left( \begin{array}{cc} \frac{1}{2} & \frac{\sqrt{3}}{2} \\ \frac{\sqrt{3}}{2} & -\frac{1}{2} \\ \end{array} \right),\left( \begin{array}{cc} -1 & 0 \\ 0 & 1 \\ \end{array} \right),\left( \begin{array}{cc} \frac{1}{2} & -\frac{\sqrt{3}}{2} \\ -\frac{\sqrt{3}}{2} & -\frac{1}{2} \\ \end{array} \right).$
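A quick way to check that the six matrices above really form a group, and that the group is non-Abelian, is to build the multiplication table. This is a minimal sketch; the labels `e`, `r120`, `r240`, `s1`, `s2`, `s3` are names I have assigned to the matrices in the order they are listed.

```python
import math
from itertools import product

S = math.sqrt(3) / 2

# The six matrices listed above, as nested tuples (row-major).
D3 = {
    "e":    ((1.0, 0.0), (0.0, 1.0)),
    "r120": ((-0.5, -S), (S, -0.5)),
    "r240": ((-0.5, S), (-S, -0.5)),
    "s1":   ((0.5, S), (S, -0.5)),
    "s2":   ((-1.0, 0.0), (0.0, 1.0)),
    "s3":   ((0.5, -S), (-S, -0.5)),
}

def matmul(a, b):
    """Product of two 2x2 matrices."""
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )

def name_of(m, tol=1e-9):
    """Return the label of the group element equal to m, or None."""
    for name, g in D3.items():
        if all(abs(m[i][j] - g[i][j]) < tol for i in range(2) for j in range(2)):
            return name
    return None

# Closure: every product of two elements is again one of the six.
table = {(a, b): name_of(matmul(D3[a], D3[b])) for a, b in product(D3, repeat=2)}
closed = all(v is not None for v in table.values())

# Non-commutativity: the two orders of rotation and reflection differ.
print(closed, table[("r120", "s2")], table[("s2", "r120")])  # True s3 s1
```

Since `r120` applied twice gives `r240`, and the products of `r120` with `s2` give the other two reflections, the 120 degree rotation and a single reflection are enough to generate all six elements.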

In fact, the Cayley graph for this group is as follows:

For now, I will discuss how players can move under the action of 120 degree rotations. As in the previous posting, let the $(x,y)$-coordinates of player $i$ be represented by $(x^{i}, y^{i})$, where $i = 1,2,3,4,5$. Then, under a 120 degree rotation, the player’s coordinates get shifted according to: $\boxed{x^{i}_{t+1} = \frac{1}{2} \left(-x^{i}_{t} - \sqrt{3}y^{i}_{t}\right), \quad y^{i}_{t+1} = \frac{1}{2}\left(\sqrt{3}x^{i}_{t} - y^{i}_{t}\right)}$

This is a discrete dynamical system, and it can be solved explicitly. Let $x^i_{0}, y^{i}_{0}$ represent the initial coordinates of player $i$. Then, one solves the above discrete system to obtain the following, where the $i$ appearing in the exponents and as a coefficient is the imaginary unit, not the player index: $\boxed{x^i_t =\frac{1}{2} e^{\frac{1}{3} (-2) i \pi t} \left[\left(1+e^{\frac{4 i \pi t}{3}}\right) x^i_0+i \left(-1+e^{\frac{4 i \pi t}{3}}\right) y^i_0\right], \quad y^{i}_{t} =\frac{1}{2} e^{\frac{1}{3} (-2) i \pi t} \left[\left(1+e^{\frac{4 i \pi t}{3}}\right) y^i_0-i \left(-1+e^{\frac{4 i \pi t}{3}}\right) x^i_0\right]}$
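The update rule and its explicit solution can be checked against each other numerically. This is a minimal sketch: the starting point is arbitrary, and I am assuming coordinates are measured from the centre of rotation, which the post does not state explicitly. The closed form is written compactly as multiplication by $e^{2\pi i t/3}$, which expands to the same expressions as the boxed solution.

```python
import cmath
import math

def rotate_120(x, y):
    """One step of the boxed update rule: rotate (x, y) by 120 degrees about the origin."""
    return 0.5 * (-x - math.sqrt(3) * y), 0.5 * (math.sqrt(3) * x - y)

def closed_form(x0, y0, t):
    """Position after t steps via a complex exponential.

    Treating the point as z = x + i*y, a 120 degree rotation is multiplication
    by exp(2*pi*i/3), so z_t = exp(2*pi*i*t/3) * z_0.
    """
    z = cmath.exp(2j * math.pi * t / 3) * complex(x0, y0)
    return z.real, z.imag

# Hypothetical starting position (feet), measured from the centre of rotation.
x0, y0 = 10.0, 4.0
path = []
x, y = x0, y0
for t in range(1, 7):
    x, y = rotate_120(x, y)
    path.append((x, y))
    print(t, (x, y), closed_form(x0, y0, t))
```

Every third step returns the player to the starting position, so under this single symmetry each player simply cycles through three spots.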

Now, we can simulate this to see how players actually move within the triangle offense, forming equilateral triangles at every step:

The animation runs continuously, that is, endlessly. In future postings, I will update this to include the other symmetries of the dihedral $D_{3}$ group. However, the challenge is that this symmetry group is non-Abelian, so it will be interesting to implement pairs of consecutive symmetry operations in a simulation that would still result in invariant equilateral triangles.

Hopefully, this post also shows why teams cannot really run “parts” of the triangle, as one player’s movement necessarily affects everyone else’s. This is something that Charley Rosen also mentioned in an article of his own.

## What are the factors behind Golden State’s and Cleveland’s Wins in The NBA Finals

As I write this, Cleveland just won the series 4-3. What was behind each team’s wins and losses in this series?

First, Golden State. A correlation plot of their per-game predictor variables versus the binary win/loss outcome is as follows:

The key information is in the last column of this matrix:

Evidently, the most important factors in GSW’s wins were assists, field goals made, field goal percentage, and steals. The most important factors in GSW’s losses this series were three-point attempts per game (imagine that!) and personal fouls per game.

Now, Cleveland. A correlation plot of their per-game predictor variables versus the binary win/loss outcome is as follows:

The key information is in the last column of this matrix:

Evidently, the most important factor in CLE’s wins was their number of defensive rebounds. Following behind this were three-point shots made and field goal percentage. There were some weak correlations between Cleveland’s losses and their offensive rebounds and turnovers.
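The "last column" of such a correlation matrix is just the Pearson correlation of each per-game variable with the 0/1 win indicator. Here is a minimal sketch of that computation. The per-game numbers are invented for illustration and are not the actual Finals box scores, and the helper `pearson` is my own.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = math.sqrt(sum((a - mx) ** 2 for a in xs))
    sy = math.sqrt(sum((b - my) ** 2 for b in ys))
    return cov / (sx * sy)

# Made-up per-game values for a seven-game series (NOT real box scores).
games = {
    "AST": [29, 26, 15, 23, 18, 19, 17],
    "X3PA": [27, 33, 33, 36, 42, 41, 41],
    "PF": [18, 20, 22, 21, 23, 25, 23],
}
win = [1, 1, 0, 1, 0, 0, 0]  # 1 = win, 0 = loss

# Correlation of each predictor with the binary outcome, strongest first.
correlations = {name: pearson(values, win) for name, values in games.items()}
for name, r in sorted(correlations.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:5s} {r:+.2f}")
```

A positive value means the variable tends to be higher in wins, and a negative value means it tends to be higher in losses. With only seven games, such correlations are suggestive rather than conclusive.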

Note that these results are essentially a summary analysis of previous blog postings which tracked individual games. For example, here, here, and a first attempt here.

## The Mathematics of “Filling the Triangle”

I’ve been fascinated by the triangle offense for a long time. I think it is a beautiful way to play basketball, and the right way to play basketball, in the half-court, a “system-based” way to play. For those of you that are interested, I highly recommend Tex Winter’s classic book on the topic.

There is also this brief video in which Tex Winter explains how the triangle offense and basketball are grounded in geometric principles:

I don’t think people recognize, though, how deep a geometry problem this actually is. Looking at when the triangle is filled, as in the video above, we have the following situation:

The 3 triangles that form when the triangle is filled, involving all 5 players. The letters a,b,c,d,e,f,g,h,i denote the angles within the triangles. We are assuming NBA court dimensions where the 1/2 court is 47′ long and the team bench area, which roughly corresponds to the top of the three-point line, is 28′ from the baseline.

The problem I wanted to study was this: given 5 players’ random positions on the court, can a system of equations be solved to yield the (x,y) coordinates where the players should “go” to fill the triangle?

Using simple geometry, from the diagram above, we see that each player’s position in the triangle offense is governed by the following system of nonlinear equations, one for each interior angle:

$\left(x_4-x_2\right) \left(x_4-x_5\right)+\left(y_4-y_2\right) \left(y_4-y_5\right)=\cos (a) \sqrt{\left(x_2-x_4\right)^2+\left(y_2-y_4\right)^2} \sqrt{\left(x_4-x_5\right)^2+\left(y_4-y_5\right)^2}$

$\left(x_4-x_2\right) \left(x_5-x_2\right)+\left(y_4-y_2\right) \left(y_5-y_2\right)=\cos (b) \sqrt{\left(x_2-x_4\right)^2+\left(y_2-y_4\right)^2} \sqrt{\left(x_2-x_5\right)^2+\left(y_2-y_5\right)^2}$

$\left(x_2-x_5\right) \left(x_4-x_5\right)+\left(y_2-y_5\right) \left(y_4-y_5\right)=\cos (c) \sqrt{\left(x_2-x_5\right)^2+\left(y_2-y_5\right)^2} \sqrt{\left(x_4-x_5\right)^2+\left(y_4-y_5\right)^2}$

$\left(x_2-x_1\right) \left(x_2-x_5\right)+\left(y_2-y_1\right) \left(y_2-y_5\right)=\cos (d) \sqrt{\left(x_1-x_2\right)^2+\left(y_1-y_2\right)^2} \sqrt{\left(x_2-x_5\right)^2+\left(y_2-y_5\right)^2}$

$\left(x_2-x_1\right) \left(x_5-x_1\right)+\left(y_2-y_1\right) \left(y_5-y_1\right)=\cos (e) \sqrt{\left(x_1-x_2\right)^2+\left(y_1-y_2\right)^2} \sqrt{\left(x_1-x_5\right)^2+\left(y_1-y_5\right)^2}$

$\left(x_1-x_5\right) \left(x_2-x_5\right)+\left(y_1-y_5\right) \left(y_2-y_5\right)=\cos (f) \sqrt{\left(x_1-x_5\right)^2+\left(y_1-y_5\right)^2} \sqrt{\left(x_2-x_5\right)^2+\left(y_2-y_5\right)^2}$

$\left(x_1-x_3\right) \left(x_1-x_5\right)+\left(y_1-y_3\right) \left(y_1-y_5\right)=\cos (h) \sqrt{\left(x_1-x_3\right)^2+\left(y_1-y_3\right)^2} \sqrt{\left(x_1-x_5\right)^2+\left(y_1-y_5\right)^2}$

$\left(x_1-x_3\right) \left(x_5-x_3\right)+\left(y_1-y_3\right) \left(y_5-y_3\right)=\cos (i) \sqrt{\left(x_1-x_3\right)^2+\left(y_1-y_3\right)^2} \sqrt{\left(x_3-x_5\right)^2+\left(y_3-y_5\right)^2}$

$\left(x_1-x_5\right) \left(x_3-x_5\right)+\left(y_1-y_5\right) \left(y_3-y_5\right)=\cos (g) \sqrt{\left(x_1-x_5\right)^2+\left(y_1-y_5\right)^2} \sqrt{\left(x_3-x_5\right)^2+\left(y_3-y_5\right)^2}$

Further, the angles obviously must satisfy the following constraints: $a + b + c = \pi, \quad d + e + f = \pi, \quad g + h + i = \pi$

Finally, we require that each player be about 15-20 feet apart in the triangle offense (because the offense is predicated on spacing), and thus have some additional constraints: $15\leq \sqrt{\left(x_2-x_4\right){}^2+\left(y_2-y_4\right){}^2}\leq 20$ $15\leq \sqrt{\left(x_4-x_5\right){}^2+\left(y_4-y_5\right){}^2}\leq 20$ $15\leq \sqrt{\left(x_2-x_5\right){}^2+\left(y_2-y_5\right){}^2}\leq 20$ $15\leq \sqrt{\left(x_1-x_2\right){}^2+\left(y_1-y_2\right){}^2}\leq 20$ $15\leq \sqrt{\left(x_1-x_5\right){}^2+\left(y_1-y_5\right){}^2}\leq 20$ $15\leq \sqrt{\left(x_1-x_3\right){}^2+\left(y_1-y_3\right){}^2}\leq 20$ $15\leq \sqrt{\left(x_3-x_5\right){}^2+\left(y_3-y_5\right){}^2}\leq 20$
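To see what a configuration satisfying these angle and spacing constraints looks like, here is a minimal sketch that checks a hand-built formation. The positions are hypothetical: player 5 sits at a hub, and players 4, 2, 1, 3 sit on a half-hexagon of side 17 ft around him, which makes all three triangles equilateral. The triangle and edge lists are read off from the player indices in the equations and constraints above.

```python
import math

def interior_angle(vertex, p, q):
    """Angle at `vertex` between the segments vertex->p and vertex->q (radians)."""
    ux, uy = p[0] - vertex[0], p[1] - vertex[1]
    vx, vy = q[0] - vertex[0], q[1] - vertex[1]
    cos_angle = (ux * vx + uy * vy) / (math.hypot(ux, uy) * math.hypot(vx, vy))
    return math.acos(max(-1.0, min(1.0, cos_angle)))

def distance(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Hypothetical filled formation (feet); chosen for illustration only.
s = 17.0
h = s * math.sqrt(3) / 2
P = {
    5: (5.0, 23.5),
    4: (5.0, 23.5 - s),
    2: (5.0 + h, 23.5 - s / 2),
    1: (5.0 + h, 23.5 + s / 2),
    3: (5.0, 23.5 + s),
}
TRIANGLES = [(2, 4, 5), (1, 2, 5), (1, 3, 5)]
EDGES = [(2, 4), (4, 5), (2, 5), (1, 2), (1, 5), (1, 3), (3, 5)]

# Interior angles of each triangle; each triple should sum to pi.
angles = {}
for tri in TRIANGLES:
    angles[tri] = [
        interior_angle(P[tri[k]], P[tri[(k + 1) % 3]], P[tri[(k + 2) % 3]])
        for k in range(3)
    ]
    print(tri, [round(math.degrees(a), 3) for a in angles[tri]])

# The seven player-to-player spacings that appear in the constraints.
spacings = {edge: distance(P[edge[0]], P[edge[1]]) for edge in EDGES}
print(spacings)
```

The hard part of the problem is the reverse direction: starting from arbitrary positions and finding such a configuration.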

Solving this highly nonlinear system of equations with constraints is not a trivial problem! In fact, because of the high degree of nonlinearity and the dimension of the problem, it is safe to assume that no closed-form solution exists, and the system must therefore be solved numerically.

For this task, I used MATLAB, and experimented with the lsqnonlin() and fsolve() commands. The only issue is that (as with all such numerical algorithms) convergence depends very strongly on the choice of initial conditions. It is very difficult to choose this many initial conditions a priori, so I wrote a script that randomized them. I then ran several numerical experiments and obtained the following results:

In the plot above, I have labeled the panels that converged to the triangle formation with the title “this one”. In addition, the five black circles denote the initial positions of the players on the court before they fill the triangles in the offense. One sees just from the diagram above how difficult such a problem is to solve mathematically, even with a numerical approach. Running more trials would perhaps yield better results, but it works! I am truly fascinated by this. In the coming days, I will work on optimizing the numerical algorithm, and post my updates as they come.
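The random-restart idea can be sketched in Python with SciPy's `least_squares`, which plays a role similar to MATLAB's lsqnonlin(). This is not the original script. It assumes every interior angle is fixed at 60 degrees (equilateral triangles), encodes the 15-20 ft spacing as a hinge penalty, and keeps players inside a 28 x 47 box. The function names `residuals` and `fill_triangle` are my own.

```python
import numpy as np
from scipy.optimize import least_squares

TRIANGLES = [(2, 4, 5), (1, 2, 5), (1, 3, 5)]  # player indices from the equations
EDGES = [(2, 4), (4, 5), (2, 5), (1, 2), (1, 5), (1, 3), (3, 5)]

def residuals(flat):
    """Residuals that vanish when all triangles are equilateral and spaced 15-20 ft."""
    P = {i + 1: flat[2 * i: 2 * i + 2] for i in range(5)}
    res = []
    for tri in TRIANGLES:
        for k in range(3):
            v, p, q = tri[k], tri[(k + 1) % 3], tri[(k + 2) % 3]
            u, w = P[p] - P[v], P[q] - P[v]
            cos_angle = u @ w / (np.linalg.norm(u) * np.linalg.norm(w) + 1e-12)
            res.append(cos_angle - 0.5)  # cos(60 degrees) = 1/2
    for a, b in EDGES:
        d = np.linalg.norm(P[a] - P[b])
        res.append(max(0.0, 15.0 - d) + max(0.0, d - 20.0))  # zero inside [15, 20]
    return np.array(res)

def fill_triangle(n_starts=20, seed=0):
    """Run least squares from random starting positions; keep the best result."""
    rng = np.random.default_rng(seed)
    lower = np.zeros(10)
    upper = np.tile([28.0, 47.0], 5)
    best = None
    for _ in range(n_starts):
        start = rng.uniform(lower, upper)  # random initial positions on the court
        result = least_squares(residuals, start, bounds=(lower, upper))
        if best is None or result.cost < best.cost:
            best = result
    return best

best = fill_triangle()
print(best.cost)                # close to zero when a start converged
print(best.x.reshape(5, 2))     # rows are players 1..5, columns are (x, y)
```

As in the MATLAB experiments, not every random start converges to a filled triangle, which is why the sketch keeps the lowest-cost run rather than trusting any single one.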

Here is an animation of one of the scenarios above when the algorithm converges correctly:

In the animation above, the black dots represent the positions of the players on the court. They begin at initial (random) positions and attempt to fill the triangles as described above.

## Game 2 of CLE vs GSW Breakdown

As usual, here is the post-game breakdown of Game 2 of the NBA Finals between Cleveland and Golden State. Using my live-tracking app to track the relevant factors (as explained in previous posts), here are the live-captured time series:

Computing the correlations between each time series above and the Golden State Warriors point difference, we obtain:

One sees once again that the most relevant factors to GSW’s point difference in the game were CLE’s personal fouls, GSW’s personal fouls, and, not far behind, GSW’s 3-point percentage. What is interesting is that one can see the importance of these variables play out in real time by matching the two graphs above.

In fact, looking at the personal fouls vs. GSW point difference in real time (essentially taking a subset of the time series graph above), we obtain: 