## NCAA March Madness 2017 Predictions

Update (March 18, 2017): In a stunning upset, Wisconsin just beat Villanova. It is easy to see why this happened based on the factor relevance diagram below. To win games, Villanova has relied heavily on moving the ball, while Wisconsin has relied heavily on opposing assists! Wisconsin had only 5 assists in the whole game today; great defense by them.

Original Article: March 16, 2017

So, I’m a bit late this year with these, but it’s only the first day of the tournament as I write this (teaching 2 courses in 1 semester tends to take up A LOT of one’s time!). Anyway, I tried to use machine learning methods, specifically neural networks, to predict who is going to win the NCAA tournament this year.

To do this, I trained a neural network model on the last 17 seasons of NCAA regular-season team data.

The first thing I looked at was which predictor variables are most relevant to a team’s NCAA championship success:

1. Free Throws Made : 99.99% relevance
2. Opponent Assists : 55.86% relevance
3. Opponent Field Goal Attempts : 31.44% relevance
4. Free Throws Attempted : -83.13% relevance
5. Opponent Field Goals Made: -69.2% relevance

It is interesting that the most important factor in deciding whether or not a team wins the NCAA tournament is actually free throw percentage. In other words, schools with a knack for shooting a high free throw percentage seem to have the highest probability of winning the NCAA tournament. (Points 1 and 4 in the list above together translate to a high free throw percentage: many free throws made relative to free throws attempted.) Of course, with a neural network the relationship between these predictors and the output is not necessarily linear, so other factors could play a strong role as well.

The neural network structure used looked like this:

Now, for the results:

| School Name | Probability of Winning Tournament |
| --- | --- |
| Villanova | 0.9294916774 |
| Gonzaga | 0.8076801 |
| Baylor | 0.716319 |
| Arizona | 0.5516670309 |
| Duke | 0.005617711 |
| Saint Mary’s | 0.0048923492 |
| Wichita St. | 0.001208123 |
| Purdue | 0.001180955 |
| SMU | 0.0008327729 |
| North Carolina | 0.0006080225 |
| UCLA | 0.0003794108 |
| S. Dakota St. | 0.0003186754 |
| Oregon | 0.0002288606 |
| Princeton | 0.0002107522 |
| Wisconsin | 0.000206285 |
| Northwestern | 0.0001878604 |
| Cincinnati | 0.0001875887 |
| Marquette | 0.0001828106 |
| Virginia | 0.0001532999 |
| Kent St. | 0.0001353252 |
| Miami | 0.0001338989 |
| Fla. Gulf Coast | 0.0001308963 |
| Vermont | 0.0001288239 |
| Notre Dame | 0.0001278009 |
| Minnesota | 0.0001277032 |
| New Mexico State | 0.0001276369 |
| USC | 0.0001274456 |
| Middle Tenn. | 0.0001268802 |
| Florida | 0.0001265646 |
| Texas Southern | 0.0001265547 |
| Xavier | 0.0001264269 |
| Vanderbilt | 0.0001262982 |
| Michigan | 0.0001261976 |
| East Tenn. St. | 0.0001261878 |
| Nevada | 0.0001261331 |
| Butler | 0.0001260504 |
| Louisville | 0.0001260042 |
| Troy | 0.0001259668 |
| Dayton | 0.0001259567 |
| Arkansas | 0.0001259387 |
| Michigan St. | 0.0001259298 |
| Oklahoma St. | 0.0001259287 |
| Winthrop | 0.0001259213 |
| Iona | 0.0001259197 |
| Jacksonville St. | 0.0001259174 |
| Creighton | 0.0001259092 |
| West Virginia | 0.0001259032 |
| North Carolina-Wilmington | 0.0001259012 |
| Northern Ky. | 0.0001259000 |
| Kansas | 0.0001258950 |
| Iowa St. | 0.0001258950 |
| Bucknell | 0.0001258945 |
| Florida St. | 0.0001258939 |
| Kentucky | 0.0001258939 |
| Virginia Tech | 0.0001258938 |
| Seton Hall | 0.0001258937 |
| Maryland | 0.0001258936 |
| North Dakota | 0.0001258936 |
| South Carolina | 0.0001258935 |
| Rhode Island | 0.0001258934 |
| Kansas St. | 0.0001258933 |
| Mount St. Mary’s | 0.0001258932 |
| VCU | 0.0001258931 |
| UC Davis | 0.0001258929 |

This neural network model predicts that the team with the highest probability of winning the NCAA tournament this year is Villanova with a 92.95% chance of winning, followed by Gonzaga with an 80.77% chance, Baylor with a 71.63% chance, and Arizona with a 55.17% chance.
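One thing worth noting about the table is that the four largest values alone add up to more than 3, so the outputs behave like separate per-team scores rather than a single probability distribution over champions. Here is a minimal sketch of one way to turn them into relative shares, assuming a simple rescaling by the total. That rescaling is an assumption made for illustration, not something the model did, and only the top four teams are included for brevity, so the shares are relative to those four only.

```python
# Top four raw scores copied from the table above.
scores = {
    "Villanova": 0.9294916774,
    "Gonzaga": 0.8076801,
    "Baylor": 0.716319,
    "Arizona": 0.5516670309,
}


def normalize(raw):
    """Rescale raw scores so they sum to 1 (an illustrative assumption)."""
    total = sum(raw.values())
    return {team: value / total for team, value in raw.items()}


total_raw = sum(scores.values())
shares = normalize(scores)

print(f"Sum of the top four raw scores: {total_raw:.4f}")
for team, share in shares.items():
    print(f"{team}: {share:.3f}")
```

The ranking of the teams is unchanged by this rescaling; only the scale of the numbers changes.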

## Analyzing LeBron James’ Offensive Play

Where is LeBron James most effective on the court?

From NBA.com, we obtained the following 2015-2016 data, which tracks LeBron’s FG% based on defender distance:

From Basketball-Reference.com, we then obtained data on LeBron’s FG% based on his shot distance from the basket:

Based on this data, we generated tens of thousands of sample data points to perform a Monte Carlo simulation to obtain relevant probability density functions. We found that the joint PDF was a very lengthy expression(!):

Graphically, this is:

A contour plot of the joint PDF was computed to be:

From this information, we can compute where and when LeBron has the highest probability of making a shot. Numerically, we found that the maximum probability occurs when LeBron’s defender is 0.829988 feet away and LeBron is 1.59378 feet from the basket. Interestingly, this analysis suggests that defending LeBron tightly is not an effective strategy when his shot distance is within 5 feet of the basket; it is only effective beyond 5 feet. Therefore, opposing teams have the best chance of stopping LeBron from scoring by playing him tightly and forcing him as far from the basket as possible.
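To make the "sample, estimate a joint density, find its peak" idea concrete, here is a minimal sketch that uses a plain 2D histogram as the density estimate. The samples are synthetic stand-ins drawn from made-up Gaussians, not LeBron’s data, and the bin width and variable names are arbitrary choices; the closed-form joint PDF mentioned above is not reproduced here.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch is reproducible

# Synthetic stand-in samples (NOT real tracking data): each pair is
# (defender distance in feet, shot distance in feet).
samples = [
    (abs(random.gauss(1.0, 0.5)), abs(random.gauss(2.0, 0.5)))
    for _ in range(20000)
]

BIN = 0.25  # bin width in feet (an arbitrary choice)


def bin_index(value):
    """Map a distance to the index of the histogram bin that contains it."""
    return int(value // BIN)


# Count samples per 2D bin; count / (N * BIN**2) approximates the joint PDF.
counts = Counter((bin_index(d), bin_index(s)) for d, s in samples)
(mode_d, mode_s), n = counts.most_common(1)[0]

# Report the centre of the densest bin as the estimated location of the peak.
est_defender = (mode_d + 0.5) * BIN
est_shot = (mode_s + 0.5) * BIN
density = n / (len(samples) * BIN ** 2)

print(f"Densest bin: defender ~{est_defender:.3f} ft, shot ~{est_shot:.3f} ft")
print(f"Estimated joint density there: {density:.3f} per square foot")
```

A histogram only locates the peak to within one bin width; a smoother estimator or a fitted closed-form PDF would be needed to quote a maximum to several decimal places.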

## Optimal Positions for NBA Players

I was thinking about how one can use the NBA’s new SportVU system to figure out optimal positions for players on the court. One of the interesting things about the SportVU system is that it tracks player $(x,y)$ coordinates on the court. Presumably, it also keeps track of whether or not a player located at $(x,y)$ makes a shot or misses it. Let us denote a player making a shot by $1$, and a player missing a shot by $0$. Then, one essentially will have data in the form $(x,y, \text{1/0})$.

One can then use a logistic regression to determine the probability that a player at position $(x,y)$ will make a shot:

$p(x,y) = \frac{\exp\left(\beta_0 + \beta_1 x + \beta_2 y\right)}{1 +\exp\left(\beta_0 + \beta_1 x + \beta_2 y\right)}$

The main idea is that the parameters $\beta_0, \beta_1, \beta_2$ uniquely characterize a given player’s probability of making a shot.
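As a small illustration, the model above can be written as a function of the position and the three parameters. The coefficient values in the example call are hypothetical and chosen only to show usage.

```python
import math


def shot_probability(x, y, beta0, beta1, beta2):
    """Logistic model p(x, y) for making a shot from court position (x, y)."""
    z = beta0 + beta1 * x + beta2 * y
    # 1 / (1 + exp(-z)) is algebraically the same as exp(z) / (1 + exp(z)).
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients, used only to demonstrate the call.
p = shot_probability(10.0, 20.0, beta0=0.2, beta1=0.01, beta2=-0.02)
print(f"p(10, 20) = {p:.4f}")
```

With all three parameters equal to zero the function returns 0.5 everywhere, which is a quick sanity check on the formula.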

From an offensive perspective, suppose a coaching staff wishes to position players so that they have a very high probability of making a shot; for demonstration purposes, say 99%. This means we must solve the following equation subject to constraints on the court position:

$\frac{\exp\left(\beta_0 + \beta_1 x + \beta_2 y\right)}{1 +\exp\left(\beta_0 + \beta_1 x + \beta_2 y\right)} = 0.99$

$\text{s.t. } 0 \leq x \leq 28, \quad 0 \leq y \leq 47$

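(The constraints bound the region of the court under consideration. Note that a regulation NBA court measures 94 ft by 50 ft, so the $28 \times 47$ rectangle used here is best read as an illustrative region rather than the full court.)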
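The constant that appears in the solutions comes from inverting the logistic function. As a short worked step, write $z = \beta_0 + \beta_1 x + \beta_2 y$ (the symbol $z$ is shorthand introduced here only for this derivation):

$\frac{e^{z}}{1 + e^{z}} = 0.99 \;\Longrightarrow\; e^{z} = \frac{0.99}{1 - 0.99} = 99 \;\Longrightarrow\; z = \ln 99 \approx 4.59512$

$\beta_0 + \beta_1 x + \beta_2 y = 4.59512 \;\Longrightarrow\; x = \frac{4.59512 - \beta_0 - \beta_2 y}{\beta_1}, \quad \beta_1 \neq 0$

So every solution lies on a straight line in the $(x,y)$ plane, and the different cases only differ in which part of that line falls inside the constraints.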

This has the following solutions:

$x = \frac{4.59512 - \beta_0 - \beta_2 y}{\beta_1}, \quad \frac{4.59512 - \beta_0 - 28 \beta_1}{\beta_2} \leq y$

with the following conditions:

One can also have:

$x = \frac{4.59512 - \beta_0 - \beta_2 y}{\beta_1}, \quad y \leq 47$

with the following conditions:

Another solution is:

$x = \frac{4.59512 - \beta_0 - \beta_2 y}{\beta_1}$

with the following conditions:

The fourth possible solution is:

$x = \frac{4.59512 - \beta_0 - \beta_2 y}{\beta_1}$

with the following conditions:

In practice, it should be noted that a player is unlikely to have a 99% probability of making a shot from any location.

To put this example in more practical terms, I generated some random data (1000 points) for a player in terms of $(x,y)$ coordinates and whether he made a shot from that location or not. The following scatter plot shows the result of this simulation:

In this plot, the red dots indicate a player has made a shot (a response of 1.0) from the $(x,y)$ coordinates given, while a purple dot indicates a player has missed a shot from the $(x,y)$ coordinates given (a response of 0.0).

Performing a logistic regression on this data, we obtain that $\beta_0 = 0, \beta_1 = 0.00066876, \beta_2 = -0.00210949$.
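For readers who want to try a fit like this themselves, here is a minimal sketch in plain Python. It assumes Newton's method on the log-likelihood and synthetic data generated from hypothetical "true" coefficients; the data, the coefficient values, and the function names are all illustrative and are not the data or software used for the numbers above.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def solve(a, b):
    """Solve the small linear system a @ x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_logistic(points, labels, iterations=25):
    """Fit (beta0, beta1, beta2) by Newton's method on the log-likelihood."""
    beta = [0.0, 0.0, 0.0]
    rows = [(1.0, x, y) for x, y in points]  # leading 1.0 is the intercept term
    for _ in range(iterations):
        grad = [0.0] * 3
        hess = [[0.0] * 3 for _ in range(3)]
        for row, label in zip(rows, labels):
            p = sigmoid(sum(b * v for b, v in zip(beta, row)))
            w = p * (1.0 - p)
            for i in range(3):
                grad[i] += (label - p) * row[i]
                for j in range(3):
                    hess[i][j] += w * row[i] * row[j]
        step = solve(hess, grad)  # Newton step: (X^T W X)^{-1} X^T (y - p)
        beta = [b + s for b, s in zip(beta, step)]
    return beta


random.seed(1)
TRUE_BETA = (0.0, 0.06, -0.04)  # hypothetical coefficients for the synthetic player
points = [(random.uniform(0, 28), random.uniform(0, 47)) for _ in range(1000)]
labels = [
    1.0 if random.random() < sigmoid(TRUE_BETA[0] + TRUE_BETA[1] * x + TRUE_BETA[2] * y) else 0.0
    for x, y in points
]

beta = fit_logistic(points, labels)
print("Fitted (beta0, beta1, beta2):", [round(b, 4) for b in beta])
```

With 1000 points the fitted values should land near the coefficients used to generate the data, though not exactly on them, since the responses are random.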

Using the equations above, we see that this player has a maximum probability of $58.7149 \%$ of making a shot from a location of $(x,y) = (0,23)$, and a minimum probability of $38.45 \%$ of making a shot from a location of $(x,y) = (28,0)$.
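Because the fitted logit is linear in $x$ and $y$, the extreme probabilities over a rectangle sit at its corners. Here is a minimal sketch that evaluates the four corners using the three coefficients quoted above. The helper names are illustrative, and the output depends entirely on those three coefficients, so it may not match the percentages reported above if they came from a different fit.

```python
import math
from itertools import product

# Coefficients as quoted in the text above.
BETA0, BETA1, BETA2 = 0.0, 0.00066876, -0.00210949

# Bounds of the rectangle used in the constraints.
X_BOUNDS = (0.0, 28.0)
Y_BOUNDS = (0.0, 47.0)


def shot_probability(x, y):
    """Logistic model evaluated with the quoted coefficients."""
    z = BETA0 + BETA1 * x + BETA2 * y
    return 1.0 / (1.0 + math.exp(-z))


# A function that is monotone in a linear combination of x and y attains
# its maximum and minimum over a rectangle at one of the four corners.
corners = list(product(X_BOUNDS, Y_BOUNDS))
best = max(corners, key=lambda c: shot_probability(*c))
worst = min(corners, key=lambda c: shot_probability(*c))

for x, y in corners:
    print(f"p({x:.0f}, {y:.0f}) = {shot_probability(x, y):.4f}")
print("Highest-probability corner:", best)
print("Lowest-probability corner:", worst)
```

The sign of each coefficient decides the corner: a positive $\beta_1$ pushes the maximum toward large $x$, and a negative $\beta_2$ pushes it toward small $y$.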