NCAA March Madness 2017 Predictions

By: Dr. Ikjyot Singh Kohli

Update: March 18, 2017: In a stunning upset, Wisconsin just beat Villanova. It is easy to see why this happened based on the factor relevance diagram below. To win games, Villanova has relied heavily on moving the ball, while Wisconsin has relied heavily on opposing assists! Wisconsin had a mere 5 assists in the whole game today; great defense by them.




Original Article: March 16, 2017

So, I’m a bit late with these this year, but it’s only the first day of the tournament as I write this (teaching 2 courses in 1 semester tends to take up A LOT of one’s time!). Anyway, I tried using machine learning methods, specifically neural networks, to predict who is going to win the NCAA tournament this year.

To do this, I trained a neural network model on the last 17 seasons of NCAA regular-season team data.
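For readers who want to see what training such a model involves, here is a minimal sketch in Python of a one-hidden-layer network with a sigmoid output. Everything in it is an assumption made for illustration: the two input features, the synthetic data, the layer size, and the learning rate are invented, and this is not the architecture or data behind the predictions in this article.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_network(n_inputs, n_hidden, rng):
    # Small random weights; the last entry of each weight list is the bias.
    hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
              for _ in range(n_hidden)]
    output = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    return hidden, output

def forward(net, x):
    hidden, output = net
    h = [sigmoid(sum(w * v for w, v in zip(row[:-1], x)) + row[-1])
         for row in hidden]
    p = sigmoid(sum(w * v for w, v in zip(output[:-1], h)) + output[-1])
    return h, p

def log_loss(net, data):
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for x, y in data:
        _, p = forward(net, x)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(data)

def train(net, data, lr=0.1, epochs=200):
    hidden, output = net
    for _ in range(epochs):
        for x, y in data:
            h, p = forward(net, x)
            delta_out = p - y  # gradient of the log loss at the output unit
            # Hidden-layer gradients use the output weights before they change.
            delta_hidden = [delta_out * output[j] * h[j] * (1 - h[j])
                            for j in range(len(h))]
            for j in range(len(h)):
                output[j] -= lr * delta_out * h[j]
            output[-1] -= lr * delta_out
            for j, row in enumerate(hidden):
                for i in range(len(x)):
                    row[i] -= lr * delta_hidden[j] * x[i]
                row[-1] -= lr * delta_hidden[j]
    return net

# SYNTHETIC data: two made-up, rescaled team statistics and a made-up label.
rng = random.Random(0)
data = []
for _ in range(200):
    ft = rng.uniform(-1.0, 1.0)        # hypothetical free-throw feature
    opp_ast = rng.uniform(-1.0, 1.0)   # hypothetical opponent-assists feature
    label = 1 if ft - opp_ast + rng.gauss(0.0, 0.3) > 0 else 0
    data.append(([ft, opp_ast], label))

net = init_network(n_inputs=2, n_hidden=3, rng=rng)
loss_before = log_loss(net, data)
train(net, data)
loss_after = log_loss(net, data)
print(f"log loss before: {loss_before:.3f}, after: {loss_after:.3f}")
```

The output of `forward` is a number between 0 and 1 for each team, which is the kind of score reported in the results table.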

The first thing I looked for was which predictor variables are most relevant to a team’s NCAA championship success:

  1. Free Throws Made: 99.99% relevance
  2. Opponent Assists: 55.86% relevance
  3. Opponent Field Goal Attempts: 31.44% relevance
  4. Free Throws Attempted: -83.13% relevance
  5. Opponent Field Goals Made: -69.2% relevance

It is interesting that the most important factor in deciding whether or not a team wins the NCAA tournament is actually free throw percentage. In other words, schools that have a knack for shooting a high free throw percentage seem to have the highest probability of winning the NCAA tournament. (Points 1 and 4 in the list above together translate to having a high free throw percentage: many free throws made relative to free throws attempted.) Of course, with a neural network the relationship between these predictors and the output is not necessarily linear, so other factors could play a strong role as well.

The neural network structure used looked like this:

Now, for the results:

School Name    Probability of Winning Tournament

Villanova 0.9294916774
Gonzaga 0.8076801
Baylor 0.716319
Arizona 0.5516670309
Duke 0.005617711
Saint Mary’s 0.0048923492
Wichita St. 0.001208123
Purdue 0.001180955
SMU 0.0008327729
North Carolina 0.0006080225
UCLA 0.0003794108
S. Dakota St. 0.0003186754
Oregon 0.0002288606
Princeton 0.0002107522
Wisconsin 0.000206285
Northwestern 0.0001878604
Cincinnati 0.0001875887
Marquette 0.0001828106
Virginia 0.0001532999
Kent St. 0.0001353252
Miami 0.0001338989
Fla. Gulf Coast 0.0001308963
Vermont 0.0001288239
Notre Dame 0.0001278009
Minnesota 0.0001277032
New Mexico State 0.0001276369
USC 0.0001274456
Middle Tenn. 0.0001268802
Florida 0.0001265646
Texas Southern 0.0001265547
Xavier 0.0001264269
Vanderbilt 0.0001262982
Michigan 0.0001261976
East Tenn. St. 0.0001261878
Nevada 0.0001261331
Butler 0.0001260504
Louisville 0.0001260042
Troy 0.0001259668
Dayton 0.0001259567
Arkansas 0.0001259387
Michigan St. 0.0001259298
Oklahoma St. 0.0001259287
Winthrop 0.0001259213
Iona 0.0001259197
Jacksonville St. 0.0001259174
Creighton 0.0001259092
West Virginia 0.0001259032
North Carolina-Wilmington 0.0001259012
Northern Ky. 0.0001259000
Kansas 0.0001258950
Iowa St. 0.0001258950
Bucknell 0.0001258945
Florida St. 0.0001258939
Kentucky 0.0001258939
Virginia Tech 0.0001258938
Seton Hall 0.0001258937
Maryland 0.0001258936
North Dakota 0.0001258936
South Carolina 0.0001258935
Rhode Island 0.0001258934
Kansas St. 0.0001258933
Mount St. Mary’s 0.0001258932
VCU 0.0001258931
UC Davis 0.0001258929

This neural network model predicts that the team with the highest probability of winning the NCAA tournament this year is Villanova with a 92.95% chance of winning, followed by Gonzaga with an 80.77% chance, Baylor with a 71.63% chance, and Arizona with a 55.17% chance.

Analyzing LeBron James’ Offensive Play

Where is LeBron James most effective on the court?

Based on 2015-2016 data, we obtained the following data, which tracks LeBron’s FG% as a function of defender distance:


We then obtained data on LeBron’s FG% as a function of his shot distance from the basket:


Based on this data, we generated tens of thousands of sample data points to perform a Monte Carlo simulation to obtain relevant probability density functions. We found that the joint PDF was a very lengthy expression(!):


Graphically, this is:


A contour plot of the joint PDF was computed to be:


From this information, we can compute where LeBron has the highest probability of making a shot. Numerically, we found that the maximum probability occurs when LeBron’s defender is 0.829988 feet away and LeBron is 1.59378 feet from the basket. Interestingly, this analysis suggests that defending LeBron tightly is not an effective strategy when his shot distance is within 5 feet of the basket; it only becomes effective farther than 5 feet from the basket. Therefore, opposing teams have the best chance of stopping LeBron from scoring by playing him tightly and forcing him as far away from the basket as possible.
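One way a simulation like this could be set up is sketched below in Python. The make-probability surface `make_prob` is entirely hypothetical (it is not LeBron’s data, and its shape is invented), as are the grid size and sample count. The point is only to show how simulated shots can be binned by defender distance and shot distance to locate the most favorable region.

```python
import random

def make_prob(defender_ft, shot_ft):
    # HYPOTHETICAL surface, invented for illustration; not real player data.
    base = 0.70 - 0.015 * shot_ft          # assumed: farther shots are harder
    bonus = 0.02 * min(defender_ft, 6.0)   # assumed: more space helps a little
    return max(0.05, min(0.95, base + bonus))

def simulate(n_samples, rng, cell_ft=2.0, max_def=8.0, max_shot=30.0):
    """Estimate FG% on a (defender distance, shot distance) grid by simulation."""
    made = {}
    tried = {}
    for _ in range(n_samples):
        d = rng.uniform(0.0, max_def)    # simulated defender distance (feet)
        s = rng.uniform(0.0, max_shot)   # simulated shot distance (feet)
        cell = (int(d // cell_ft), int(s // cell_ft))
        tried[cell] = tried.get(cell, 0) + 1
        if rng.random() < make_prob(d, s):
            made[cell] = made.get(cell, 0) + 1
    # Empirical make frequency in every cell that received at least one shot.
    return {cell: made.get(cell, 0) / n for cell, n in tried.items()}

rng = random.Random(0)
fg_pct = simulate(50_000, rng)
best_cell = max(fg_pct, key=fg_pct.get)
print("best (defender, shot) cell:", best_cell,
      "estimated FG%:", round(fg_pct[best_cell], 3))
```

With real tracking data, `make_prob` would be replaced by observed makes and misses, and a smoother density estimate could replace the coarse grid.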


Optimal Positions for NBA Players

I was thinking about how one can use the NBA’s new SportVU system to figure out optimal positions for players on the court. One of the interesting things about the SportVU system is that it tracks player (x,y) coordinates on the court. Presumably, it also keeps track of whether or not a player located at (x,y) makes a shot or misses it. Let us denote a player making a shot by 1, and a player missing a shot by 0. Then, one essentially will have data in the form (x,y, \text{1/0}).

One can then use a logistic regression to determine the probability that a player at position (x,y) will make a shot:

p(x,y) = \frac{\exp\left(\beta_0 + \beta_1 x + \beta_2 y\right)}{1 +\exp\left(\beta_0 + \beta_1 x + \beta_2 y\right)}

The main idea is that the parameters \beta_0, \beta_1, \beta_2 uniquely characterize a given player’s probability of making a shot.
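As a small illustration, the formula above can be evaluated directly. The sketch below is in Python; the function name `shot_probability` and the coefficient values are hypothetical, chosen only to show the shape of the computation.

```python
import math

def shot_probability(x, y, beta0, beta1, beta2):
    """Logistic model p(x, y) for making a shot from court position (x, y)."""
    z = beta0 + beta1 * x + beta2 * y
    # Evaluate the logistic function in a form that avoids overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# Hypothetical coefficients for an imaginary player (not fitted to real data).
p = shot_probability(10.0, 5.0, beta0=0.2, beta1=-0.01, beta2=-0.03)
print(f"P(make) at (10, 5): {p:.3f}")
```

Two players with different coefficient triples would give different probability surfaces over the same court, which is the sense in which the parameters characterize a player.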

From an offensive perspective, suppose a coaching staff wishes to position players so that they have a very high probability of making a shot, say 99% for demonstration purposes. This means we must solve the following constrained problem:

\frac{\exp\left(\beta_0 + \beta_1 x + \beta_2 y\right)}{1 +\exp\left(\beta_0 + \beta_1 x + \beta_2 y\right)} = 0.99

\text{s.t. } 0 \leq x \leq 28, \quad 0 \leq y \leq 47

(The constraints are determined here by the x-y dimensions of a standard NBA court).
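As a quick check on the algebra, taking the log-odds of both sides turns the 99% condition into a linear equation in x and y, which is where the constant 4.59512 in the solutions comes from:

\beta_0 + \beta_1 x + \beta_2 y = \ln\left(\frac{0.99}{1 - 0.99}\right) = \ln 99 \approx 4.59512

Every point with a 99% make probability therefore lies on a straight line, which can be solved for x provided \beta_1 \neq 0. The court constraints only determine which segment of that line, if any, is admissible.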

This has the following solutions:

x = \frac{4.59512 - \beta_0 - \beta_2 y}{\beta_1}, \quad \frac{4.59512 - \beta_0 - 28 \beta_1}{\beta_2} \leq y

with the following conditions:


One can also have:

x = \frac{4.59512 - \beta_0 - \beta_2 y}{\beta_1}, \quad y \leq 47

with the following conditions:


Another solution is:

x = \frac{4.59512 - \beta_0 - \beta_2 y}{\beta_1}

with the following conditions:


The fourth possible solution is:

x = \frac{4.59512 - \beta_0 - \beta_2 y}{\beta_1}

with the following conditions:


In practice, it should be noted that it is unlikely for any player to have a 99% probability of making a shot.

To put this example in more practical terms, I generated some random data (1000 points) for a player in terms of (x,y) coordinates and whether or not he made a shot from that location. The following scatter plot shows the result of this simulation:


In this plot, a red dot indicates that the player made a shot (a response of 1.0) from the given (x,y) coordinates, while a purple dot indicates that the player missed a shot (a response of 0.0) from the given (x,y) coordinates.

Performing a logistic regression on this data, we obtain that \beta_0 = 0, \beta_1 = 0.00066876, \beta_2 = -0.00210949.
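A fit of this kind can be sketched in a few lines of plain Python. The data below are synthetic and drawn from a hypothetical "true" model (`TRUE_BETA`), so the fitted coefficients will not match the values quoted above. The function name `fit_logistic`, the learning rate, and the iteration count are likewise illustrative choices, and a statistics package would normally be used instead of hand-written gradient descent.

```python
import math
import random

def sigmoid(z):
    # Logistic function written to avoid overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(points, lr=1.0, iters=1500, x_max=28.0, y_max=47.0):
    """Fit p(x, y) = sigmoid(b0 + b1*x + b2*y) by gradient descent on the mean log loss."""
    # Rescale coordinates to [0, 1] so a single step size works for both axes.
    scaled = [(x / x_max, y / y_max, made) for x, y, made in points]
    w0 = w1 = w2 = 0.0
    n = len(scaled)
    for _ in range(iters):
        g0 = g1 = g2 = 0.0
        for u, v, made in scaled:
            err = sigmoid(w0 + w1 * u + w2 * v) - made
            g0 += err
            g1 += err * u
            g2 += err * v
        w0 -= lr * g0 / n
        w1 -= lr * g1 / n
        w2 -= lr * g2 / n
    # Convert the slopes back to the original court units.
    return w0, w1 / x_max, w2 / y_max

# SYNTHETIC shots: 1000 random locations, outcomes drawn from a made-up model.
TRUE_BETA = (-0.2, 0.05, -0.03)  # hypothetical, for illustration only
rng = random.Random(42)
shots = []
for _ in range(1000):
    x = rng.uniform(0.0, 28.0)
    y = rng.uniform(0.0, 47.0)
    p = sigmoid(TRUE_BETA[0] + TRUE_BETA[1] * x + TRUE_BETA[2] * y)
    shots.append((x, y, 1 if rng.random() < p else 0))

b0, b1, b2 = fit_logistic(shots)
print(f"beta0 = {b0:.4f}, beta1 = {b1:.5f}, beta2 = {b2:.5f}")
```

Scaling x and y to [0, 1] before the descent keeps the step size well behaved; the coefficients are converted back to court units at the end.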

Using the equations above, we see that this player has a maximum probability of 58.7149 \% of making a shot from a location of (x,y) = (0,23), and a minimum probability of 38.45 \% of making a shot from a location of (x,y) = (28,0).