Optimal Positions for NBA Players

I was thinking about how one can use the NBA’s new SportVU system to figure out optimal positions for players on the court. One of the interesting things about the SportVU system is that it tracks player (x,y) coordinates on the court. Presumably, it also keeps track of whether or not a player located at (x,y) makes a shot or misses it. Let us denote a player making a shot by 1, and a player missing a shot by 0. Then, one essentially will have data in the form (x,y, \text{1/0}).

One can then use a logistic regression to determine the probability that a player at position (x,y) will make a shot:

p(x,y) = \frac{\exp\left(\beta_0 + \beta_1 x + \beta_2 y\right)}{1 +\exp\left(\beta_0 + \beta_1 x + \beta_2 y\right)}

The main idea is that the parameters \beta_0, \beta_1, \beta_2 uniquely characterize a given player’s probability of making a shot.

From an offensive perspective, suppose a coaching staff wishes to position a player so that he has a very high probability of making a shot; for demonstration purposes, say 99%. This means we must solve the following equation, subject to the court constraints:

\frac{\exp\left(\beta_0 + \beta_1 x + \beta_2 y\right)}{1 +\exp\left(\beta_0 + \beta_1 x + \beta_2 y\right)} = 0.99

\text{s.t. } 0 \leq x \leq 28, \quad 0 \leq y \leq 47

(The constraints are determined here by the x-y dimensions of a standard NBA court).

This has the following solutions:

x = \frac{4.59512 - \beta_0 - \beta_2 y}{\beta_1}, \quad \frac{4.59512 - \beta_0 - 28 \beta_1}{\beta_2} \leq y

with the following conditions:

[Figure: constraints1]

One can also have:

x = \frac{4.59512 - \beta_0 - \beta_2 y}{\beta_1}, \quad y \leq 47

with the following conditions:

[Figure: constraints2]

Another solution is:

x = \frac{4.59512 - \beta_0 - \beta_2 y}{\beta_1}

with the following conditions:

[Figure: constraints3]

The fourth possible solution is:

x = \frac{4.59512 - \beta_0 - \beta_2 y}{\beta_1}

with the following conditions:

[Figure: constraints4]

In practice, it should be noted that it is unlikely for any player to have a 99% probability of making a shot.
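All four solutions above share the same line, obtained by taking the log-odds of both sides of the equation: \beta_0 + \beta_1 x + \beta_2 y = \ln(0.99/0.01) \approx 4.59512. The following is a minimal Python sketch of that rearrangement for an arbitrary target probability. The function names and the coefficient values are hypothetical, chosen only for illustration; the court bounds are the ones used in the constraints above.

```python
import math

# Court bounds as used in the constraints above.
X_MAX = 28.0
Y_MAX = 47.0


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def shot_probability(x, y, beta0, beta1, beta2):
    """Logistic model p(x, y) as defined in the post."""
    z = beta0 + beta1 * x + beta2 * y
    return 1.0 / (1.0 + math.exp(-z))


def x_for_target(y, beta0, beta1, beta2, target=0.99):
    """x at which p(x, y) equals target; assumes beta1 != 0."""
    return (logit(target) - beta0 - beta2 * y) / beta1


def on_court_solutions(beta0, beta1, beta2, target=0.99, samples=48):
    """Sample y over [0, Y_MAX] and keep the (x, y) pairs that lie on the court."""
    points = []
    for i in range(samples):
        y = Y_MAX * i / (samples - 1)
        x = x_for_target(y, beta0, beta1, beta2, target)
        if 0.0 <= x <= X_MAX:
            points.append((x, y))
    return points


# Hypothetical coefficients, picked only so that the line crosses the court.
b0, b1, b2 = 3.0, 0.1, -0.02
solutions = on_court_solutions(b0, b1, b2)
print(round(logit(0.99), 5))        # the constant appearing in the solutions
print(len(solutions), solutions[0])
```

If the returned list is empty, no location on the court reaches the target probability for that player, which is the typical situation for a target as high as 99%.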

To put this example in more practical terms, I generated some random data (1000 points) for a player in terms of (x,y) coordinates and whether he made a shot from that location or not. The following scatter plot shows the result of this simulation:

[Figure: bballoptim5]

In this plot, the red dots indicate a player has made a shot (a response of 1.0) from the (x,y) coordinates given, while a purple dot indicates a player has missed a shot from the (x,y) coordinates given (a response of 0.0).

Performing a logistic regression on this data, we obtain that \beta_0 = 0, \beta_1 = 0.00066876, \beta_2 = -0.00210949.

Using the equations above, we see that this player has a maximum probability of 58.7149 \% of making a shot from a location of (x,y) = (0,23), and a minimum probability of 38.45 \% of making a shot from a location of (x,y) = (28,0).
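The simulate-then-fit procedure described above can be sketched in plain Python. This is not the code used to produce the numbers in this post: the "true" coefficients, the random seed, and the Newton-Raphson fitting routine are all assumptions made for illustration, so the fitted values will differ from those quoted above.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def solve3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    n = 3
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / m[r][r]
    return x


def fit_logistic(points, iterations=25):
    """Newton-Raphson fit of p = sigmoid(b0 + b1*x + b2*y).

    points is a list of (x, y, made) tuples with made in {0, 1}.
    """
    beta = [0.0, 0.0, 0.0]
    for _ in range(iterations):
        grad = [0.0] * 3
        hess = [[0.0] * 3 for _ in range(3)]
        for x, y, made in points:
            f = (1.0, x, y)
            p = sigmoid(sum(b * v for b, v in zip(beta, f)))
            w = p * (1.0 - p)
            for i in range(3):
                grad[i] += (made - p) * f[i]
                for j in range(3):
                    hess[i][j] += w * f[i] * f[j]
        step = solve3(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
    return beta


# Simulate 1000 shots with hypothetical "true" coefficients.
rng = random.Random(0)
true_beta = (2.0, -0.05, -0.06)
shots = []
for _ in range(1000):
    x = rng.uniform(0.0, 28.0)
    y = rng.uniform(0.0, 47.0)
    p = sigmoid(true_beta[0] + true_beta[1] * x + true_beta[2] * y)
    shots.append((x, y, 1 if rng.random() < p else 0))

beta_hat = fit_logistic(shots)
print(beta_hat)
```

In practice one would more likely call an existing routine (for example a binomial GLM in R) than hand-roll the solver; it is written out here only to keep the sketch self-contained.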

Will Donald Trump’s Proposed Immigration Policies Curb Terrorism in The US?

In recent days, Donald Trump proposed yet another iteration of his immigration policy, which is focused on “Keeping America Safe” as part of his plan to “Make America Great Again!”. In this latest iteration, in addition to suspending visas from countries with terrorist ties, he is also proposing an ideological test for those entering the US. As you can see in the BBC article, he is also fond of holding up bar graphs showing the number of refugees entering the US over a period of time, and somehow relates that to terrorist activities in the US, or at least insinuates it.

Let’s look at the facts behind these proposals using the available data from 2005-2014. Specifically, we analyzed:

  1. The number of terrorist incidents per year from 2005-2014, taken from the Global Terrorism Database maintained by the University of Maryland.
  2. The Department of Homeland Security Yearbook of Immigration Statistics. Specifically, we looked at Persons Obtaining Lawful Permanent Resident Status by Region and Country of Birth (2005-2014) and Refugee Arrivals by Region and Country of Nationality (2005-2014).

Given these datasets, we focused on countries/regions labeled as terrorist safe havens and state sponsors of terror based on the criteria outlined here.

We found the following.

First, looking at naturalized citizens, these computations yielded:

| Country | Correlation | Fraction of Variance Explained (r²) |
|---|---|---|
| Afghanistan | 0.61169 | 0.37416 |
| Egypt | 0.26597 | 0.07074 |
| Indonesia | -0.66011 | 0.43574 |
| Iran | -0.31944 | 0.10204 |
| Iraq | 0.26692 | 0.07125 |
| Lebanon | -0.35645 | 0.12706 |
| Libya | 0.59748 | 0.35698 |
| Malaysia | 0.39481 | 0.15587 |
| Mali | 0.20195 | 0.04079 |
| Pakistan | 0.00513 | 0.00003 |
| Philippines | -0.79093 | 0.62557 |
| Somalia | -0.40675 | 0.16544 |
| Syria | 0.62556 | 0.39132 |
| Yemen | -0.11707 | 0.01371 |

In graphical form:

The highest correlations are 0.62556 and 0.61169, from Syria and Afghanistan respectively. The strongest anti-correlations are from Indonesia and the Philippines, at -0.66011 and -0.79093 respectively. None of the positive correlations exceeds 0.65, which indicates that there could be some relationship between the number of naturalized citizens from these particular countries and the number of terrorist incidents, but it is nowhere near conclusive. Further, looking at Syria, we see that the coefficient of determination (the fraction of variance explained) is 0.39132, which means that only about 39% of the variation in the number of terrorist incidents in the United States can be predicted from the number of naturalized citizens born in Syria.
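For readers who want to reproduce this kind of calculation, here is a minimal Python sketch of the Pearson correlation and the coefficient of determination (its square). The two yearly series below are made-up placeholders, not the DHS or Global Terrorism Database figures, and the function name is my own.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Placeholder yearly counts for 2005-2014 (illustrative only, not real data).
arrivals = [120, 135, 150, 160, 155, 170, 180, 175, 190, 210]
incidents = [9, 6, 8, 18, 11, 17, 10, 20, 19, 29]

r = pearson_r(arrivals, incidents)
r_squared = r ** 2  # fraction of variance explained by a linear fit
print(round(r, 5), round(r_squared, 5))
```

Squaring a tabulated correlation recovers the adjacent column up to rounding; for example, 0.62556 squared is approximately 0.3913.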

Second, looking at refugees, these computations yielded:

| Country | Correlation | Fraction of Variance Explained (r²) |
|---|---|---|
| Afghanistan | 0.59836 | 0.35803 |
| Egypt | 0.66657 | 0.44432 |
| Iran | -0.29401 | 0.08644 |
| Iraq | 0.49295 | 0.24300 |
| Pakistan | 0.60343 | 0.36413 |
| Somalia | 0.14914 | 0.02224 |
| Syria | 0.56384 | 0.31792 |
| Yemen | -0.35438 | 0.12558 |
| Other | 0.54109 | 0.29278 |

In graphical form:

We see that the highest correlations are from Egypt (0.66657), Pakistan (0.60343), and Afghanistan (0.59836). This indicates there is some mild correlation between refugees from these countries and the number of terrorist incidents in the United States, but it is nowhere near conclusive. Further, the coefficients of determination for Egypt and Syria are 0.44432 and 0.31792 respectively. This means that in the case of Syrian refugees, for example, only 31.792% of the variation in terrorist incidents in the United States can be predicted from the number of refugee arrivals from Syria.

In conclusion, it is therefore unlikely that Donald Trump’s proposals would do anything to significantly curb the number of terrorist incidents in The United States. Further, repeatedly showing pictures like this:

at his rallies is doing nothing to address the issue at hand and is perhaps only serving as yet another fear tactic as has become all too common in his campaign thus far.

(Thanks to Hargun Singh Kohli, Honours B.A., LL.B. for the initial data mining and processing of the various datasets listed above.)

Note, further to the results of this article, I was recently made aware of this excellent article from The WSJ, which I have summarized below:

Live Metrics for NBA Games

Yesterday for the first time, I took the playoff game between Cleveland and Toronto as an opportunity to test out a script I wrote in R that keeps track of key statistics during a game in real time (well, every 30 seconds). Based on previous work, it is evident that championship-calibre teams are the ones that have excellent 2PT-FG% and the ability to draw fouls, so I tracked these during the game, and I came up with the following plot of several time series:


One sees for example that while Toronto started off the game with a much higher 2PT FG%, towards the end Cleveland ended up winning that battle.

A video of this animation is as follows (set the YouTube player to 1080p + FullScreen for Max Quality!)

An interesting question to ask is how are these series correlated? Well, let’s see:

[Figure: corrplot]

In this correlation plot, “pd” indicates point difference, “PF” indicates personal fouls, and “2PFG.” indicates 2-point field goal percentage.

One sees immediately from the correlation plot above that there is a very strong correlation between Cleveland’s point difference and Toronto’s personal fouls, with some strong correlations attributed to Cleveland’s 2-point FG% as well. The equal and opposite is true for Toronto’s point difference. It seems that during a playoff game of this intensity, drawing fouls, combined with 2-point field goal percentage, is a very important factor in determining which team leads and eventually wins the game.

How close were The Knicks to making the Playoffs?

It is another New York Knicks season where fans have to wait until next year to see if the Knicks will make the playoffs or not.

Yesterday, there was a lot of buzz around the idea that Phil Jackson may want to keep Kurt Rambis on as head coach, and as usual, there were numerous people who were very vocal in their criticism.

However, in actuality, the Knicks were much closer to the playoffs than people realize. A previous post of mine used data science methodologies to describe in detail the criteria a team must meet to have a high probability of making the playoffs.

Using the decision tree generated in that post, I evaluated the Knicks’ playoff chances this season based on possible playoff criteria scenarios, and found the following:

[Figure: knicksplayoffs]

One sees that a big problem was the Knicks’ margin of victory, which was too negative. However, even in this case, there were possibilities that would have allowed the Knicks to make the playoffs. For example, a slight increase in their opponents’ field goal attempts or a very slight decrease in the Knicks’ own field goal attempts per game would have greatly impacted their playoff chances.

These metrics can easily be adjusted for the upcoming season which will likely require a more organized execution of the triangle offense and discipline on both ends of the floor. They really are almost there!

The Effect of Individual State Election Results on The National Election

A short post by me today. I wanted to look at which states are important in winning the national election. Looking at the last 14 presidential elections, I generated the following correlation plot:

  
For those not familiar with how correlation plots work, each circle shows the correlation between the state on the left side and the state at the top, with the last row and column indicating the national presidential election winner. The number bar on the right-hand side of the graph gives the correlation value for each colour. Dark blue circles, representing a correlation close to 1, indicate a strong relationship between the two variables; orange-to-red circles, representing a correlation close to -1, indicate a strong anti-correlation; and almost white circles indicate no correlation.

For example, one can see there is a very strong correlation between who wins Nevada and the winner of the national election. Indeed, Nevada has picked 13 of the last 14 U.S. Presidents. The plot also shows the correlation between winning pairs of states. For example, from the plot above, candidates who win Alabama have a good chance of winning Mississippi or Wyoming, but virtually no chance of winning California.

This could serve as a potential guide in determining which states are extremely important to win during the election season!

 

What Do NBA Playoff Teams Have in Common?

I’ve been interested for some time in figuring out an analytical way to determine what characterizes an NBA team as a playoff team. Looking at the previous six seasons, I pulled together almost 65 different statistics that characterize how a team plays, and then performed a classification tree analysis. I found the following result:

  
For the above tree, the misclassification error rate was 2.73%. Also, MOV stands for margin of victory, o3PA is the number of opponent three-point attempts per game, DRtg is defensive rating (the number of points a team allows per 100 possessions), and so on. The data itself was taken from Basketball-Reference.com.

We see that the following patterns emerge among NBA playoff teams over the past number of seasons.

  1. MOV > 2.695
  2. MOV < -0.54, MOV > -1.825, Opponent 3PA > 16.0732, Defensive Rating < 106.05
  3. MOV < -0.54, MOV > -1.825, Opponent 3PA > 16.0732, Defensive Rating > 106.05, FGA < 80.2195
  4. MOV < 2.695, Opponent FGA < 82.0671, MOV < 0.295, Opponent FT > 16.7866
  5. MOV < 2.695, Opponent FGA < 82.0671, MOV > 0.295
  6. MOV < 2.695, Opponent FGA > 82.0671,  Opponent DRB > 29.7683, FGA < 83.128
  7. MOV < 2.695, Opponent FGA > 82.0671,  Opponent DRB > 29.7683, FGA < 83.128, MOV < 2.17
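As a rough illustration, the seven patterns above can be encoded as simple predicates and checked against a team's season averages. This is only a sketch: the thresholds are transcribed literally from the list (so pattern 7 is a narrower version of pattern 6, exactly as written), the field names are my own shorthand, and the two example teams are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TeamSeason:
    mov: float   # margin of victory
    o3pa: float  # opponent three-point attempts per game
    drtg: float  # defensive rating (points allowed per 100 possessions)
    fga: float   # field goal attempts per game
    ofga: float  # opponent field goal attempts per game
    oft: float   # opponent FT, as listed in pattern 4
    odrb: float  # opponent DRB, as listed in patterns 6 and 7


# Each entry mirrors one numbered pattern from the list above.
RULES = {
    1: lambda t: t.mov > 2.695,
    2: lambda t: -1.825 < t.mov < -0.54 and t.o3pa > 16.0732 and t.drtg < 106.05,
    3: lambda t: (-1.825 < t.mov < -0.54 and t.o3pa > 16.0732
                  and t.drtg > 106.05 and t.fga < 80.2195),
    4: lambda t: t.mov < 2.695 and t.ofga < 82.0671 and t.mov < 0.295 and t.oft > 16.7866,
    5: lambda t: t.mov < 2.695 and t.ofga < 82.0671 and t.mov > 0.295,
    6: lambda t: t.mov < 2.695 and t.ofga > 82.0671 and t.odrb > 29.7683 and t.fga < 83.128,
    7: lambda t: (t.mov < 2.695 and t.ofga > 82.0671 and t.odrb > 29.7683
                  and t.fga < 83.128 and t.mov < 2.17),
}


def matching_rules(team):
    """Return the numbers of all patterns that the team satisfies."""
    return [number for number, rule in RULES.items() if rule(team)]


# Hypothetical teams, for illustration only.
strong_team = TeamSeason(mov=5.0, o3pa=20.0, drtg=102.0, fga=84.0,
                         ofga=81.0, oft=17.0, odrb=31.0)
bubble_team = TeamSeason(mov=1.2, o3pa=20.0, drtg=104.0, fga=84.0,
                         ofga=81.0, oft=17.0, odrb=31.0)

print(matching_rules(strong_team))
print(matching_rules(bubble_team))
```

A team that matches at least one pattern falls into a playoff leaf of the tree as listed; an empty result means none of the listed patterns applies.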