I have now made a significant update to my applied machine learning paper on predicting patterns among NBA playoff and championship teams, which can be accessed here: arXiv Link .
Analyzing LeBron James’ Offensive Play
Where is LeBron James most effective on the court?
Using 2015-2016 data from NBA.com, we obtained the following breakdown of LeBron’s FG% by defender distance:
From BasketballReference.com, we then obtained LeBron’s FG% by shot distance from the basket:
Based on this data, we generated tens of thousands of sample data points to perform a Monte Carlo simulation to obtain relevant probability density functions. We found that the joint PDF was a very lengthy expression(!):
Graphically, this is:
A contour plot of the joint PDF was computed to be:
From this information, we can compute where and when LeBron has the highest probability of making a shot. Numerically, we found that the maximum probability occurs when LeBron’s defender is 0.829988 feet away and LeBron is 1.59378 feet from the basket. Interestingly, this analysis shows that defending LeBron tightly does not seem to be an effective strategy when his shot distance is within 5 feet of the basket; it is effective only when he is more than 5 feet from the basket. Therefore, opposing teams have the best chance of stopping LeBron from scoring by playing him tightly and forcing him as far from the basket as possible.
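As a rough illustration of the last step, locating the maximum of a joint PDF numerically, here is a minimal Python sketch. The function joint_pdf is a hypothetical stand-in with a single peak, not the lengthy expression fitted in this post, and the grid bounds and step size are likewise assumptions chosen only for the example.

```python
import math

def joint_pdf(defender_ft, shot_ft):
    """Hypothetical stand-in for the fitted joint PDF (NOT the post's expression)."""
    # A smooth bump with a single peak at (1.0, 2.0), used only to exercise the search.
    return math.exp(-((defender_ft - 1.0) ** 2) / 2.0 - ((shot_ft - 2.0) ** 2) / 8.0)

def argmax_on_grid(pdf, x_max, y_max, step):
    """Return (x, y, value) for the largest pdf value on a regular grid."""
    best = (0.0, 0.0, pdf(0.0, 0.0))
    nx = int(round(x_max / step))
    ny = int(round(y_max / step))
    for i in range(nx + 1):
        for j in range(ny + 1):
            x, y = i * step, j * step
            value = pdf(x, y)
            if value > best[2]:
                best = (x, y, value)
    return best

# Assumed search window: defender within 10 ft, shot within 30 ft, 0.25 ft resolution.
x, y, value = argmax_on_grid(joint_pdf, x_max=10.0, y_max=30.0, step=0.25)
print(f"peak at defender distance {x:.2f} ft, shot distance {y:.2f} ft")
```

A grid search like this is crude but hard to get wrong; a finer step or a proper optimizer would be needed to resolve a peak to the number of decimals quoted above.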
The Relationship Between The Electoral College and Popular Vote
An interesting machine learning problem: Can one figure out the relationship between the popular vote margin, voter turnout, and the percentage of electoral college votes a candidate wins? Going back to the election of John Quincy Adams, the raw data looks like this:
President  Party  Popular Vote Margin  Voter Turnout  Percentage of EC Votes
John Quincy Adams  D.R.  -0.1044  0.27  0.3218 
Andrew Jackson  Dem.  0.1225  0.58  0.68 
Andrew Jackson  Dem.  0.1781  0.55  0.7657 
Martin Van Buren  Dem.  0.14  0.58  0.5782 
William Henry Harrison  Whig  0.0605  0.80  0.7959 
James Polk  Dem.  0.0145  0.79  0.6182 
Zachary Taylor  Whig  0.0479  0.73  0.5621 
Franklin Pierce  Dem.  0.0695  0.70  0.8581 
James Buchanan  Dem.  0.12  0.79  0.5878 
Abraham Lincoln  Rep.  0.1013  0.81  0.5941 
Abraham Lincoln  Rep.  0.1008  0.74  0.9099 
Ulysses Grant  Rep.  0.0532  0.78  0.7279 
Ulysses Grant  Rep.  0.12  0.71  0.8195 
Rutherford Hayes  Rep.  -0.03  0.82  0.5014 
James Garfield  Rep.  0.0009  0.79  0.5799 
Grover Cleveland  Dem.  0.0057  0.78  0.5461 
Benjamin Harrison  Rep.  -0.0083  0.79  0.58 
Grover Cleveland  Dem.  0.0301  0.75  0.6239 
William McKinley  Rep.  0.0431  0.79  0.6063 
William McKinley  Rep.  0.0612  0.73  0.6532 
Theodore Roosevelt  Rep.  0.1883  0.65  0.7059 
William Taft  Rep.  0.0853  0.65  0.6646 
Woodrow Wilson  Dem.  0.1444  0.59  0.8192 
Woodrow Wilson  Dem.  0.0312  0.62  0.5217 
Warren Harding  Rep.  0.2617  0.49  0.7608 
Calvin Coolidge  Rep.  0.2522  0.49  0.7194 
Herbert Hoover  Rep.  0.1741  0.57  0.8362 
Franklin Roosevelt  Dem.  0.1776  0.57  0.8889 
Franklin Roosevelt  Dem.  0.2426  0.61  0.9849 
Franklin Roosevelt  Dem.  0.0996  0.63  0.8456 
Franklin Roosevelt  Dem.  0.08  0.56  0.8136 
Harry Truman  Dem.  0.0448  0.53  0.5706 
Dwight Eisenhower  Rep.  0.1085  0.63  0.8324 
Dwight Eisenhower  Rep.  0.15  0.61  0.8606 
John Kennedy  Dem.  0.0017  0.6277  0.5642 
Lyndon Johnson  Dem.  0.2258  0.6192  0.9033 
Richard Nixon  Rep.  0.01  0.6084  0.5595 
Richard Nixon  Rep.  0.2315  0.5521  0.9665 
Jimmy Carter  Dem.  0.0206  0.5355  0.55 
Ronald Reagan  Rep.  0.0974  0.5256  0.9089 
Ronald Reagan  Rep.  0.1821  0.5311  0.9758 
George H. W. Bush  Rep.  0.0772  0.5015  0.7918 
Bill Clinton  Dem.  0.0556  0.5523  0.6877 
Bill Clinton  Dem.  0.0851  0.4908  0.7045 
George W. Bush  Rep.  -0.0051  0.51  0.5037 
George W. Bush  Rep.  0.0246  0.5527  0.5316 
Barack Obama  Dem.  0.0727  0.5823  0.6784 
Barack Obama  Dem.  0.0386  0.5487  0.6171 
Clearly, the percentage of electoral college votes a candidate receives depends nonlinearly on the voter turnout percentage and the popular vote margin, as this nonparametric regression shows:
We therefore chose to perform a nonlinear regression using neural networks, for which our structure was:
As it turns out, this simple neural network structure with one hidden layer gave the lowest test error, which was 0.002496419 in this case.
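To make the idea of a one-hidden-layer regression network concrete, here is a minimal pure-Python sketch trained on a handful of rows from the table above. It is not the model used in this post: the hidden-layer width, tanh activation, learning rate, number of epochs, and the choice of rows are all assumptions made for illustration, and it will not reproduce the test error quoted above.

```python
import math
import random

# (popular vote margin, voter turnout, share of EC votes): a few rows from the table above.
ROWS = [
    (0.1225, 0.58, 0.68),      # Andrew Jackson
    (0.0605, 0.80, 0.7959),    # William Henry Harrison
    (0.1013, 0.81, 0.5941),    # Abraham Lincoln
    (0.2426, 0.61, 0.9849),    # Franklin Roosevelt
    (0.0017, 0.6277, 0.5642),  # John Kennedy
    (0.2315, 0.5521, 0.9665),  # Richard Nixon
    (0.1821, 0.5311, 0.9758),  # Ronald Reagan
    (0.0386, 0.5487, 0.6171),  # Barack Obama
]

HIDDEN = 3  # hidden-layer width: an assumption, the post does not state it

def init_params(rng):
    """Small random weights, zero biases."""
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(HIDDEN)]
    b1 = [0.0] * HIDDEN
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(HIDDEN)]
    return w1, b1, w2, 0.0

def forward(params, x):
    """Return (hidden activations, prediction) for input x = (margin, turnout)."""
    w1, b1, w2, b2 = params
    hidden = [math.tanh(w1[k][0] * x[0] + w1[k][1] * x[1] + b1[k]) for k in range(HIDDEN)]
    return hidden, sum(w2[k] * hidden[k] for k in range(HIDDEN)) + b2

def mse(params, rows):
    """Mean squared error of the network over the given rows."""
    return sum((forward(params, r[:2])[1] - r[2]) ** 2 for r in rows) / len(rows)

def train(rows, epochs=2000, lr=0.05, seed=0):
    """Plain stochastic gradient descent on the squared error."""
    rng = random.Random(seed)
    w1, b1, w2, b2 = init_params(rng)
    for _ in range(epochs):
        for m, t, y in rows:
            hidden, pred = forward((w1, b1, w2, b2), (m, t))
            err = pred - y  # derivative of 0.5 * err**2 with respect to pred
            for k in range(HIDDEN):
                grad_h = err * w2[k] * (1.0 - hidden[k] ** 2)  # back through tanh
                w2[k] -= lr * err * hidden[k]
                w1[k][0] -= lr * grad_h * m
                w1[k][1] -= lr * grad_h * t
                b1[k] -= lr * grad_h
            b2 -= lr * err
    return w1, b1, w2, b2

untrained = init_params(random.Random(0))  # same starting weights as train(seed=0)
trained = train(ROWS)
print(f"MSE before training: {mse(untrained, ROWS):.4f}, after: {mse(trained, ROWS):.4f}")
print(f"prediction for margin 0.061, turnout 0.55: {forward(trained, (0.061, 0.55))[1]:.3f}")
```

With so few rows this is only a toy fit; a real analysis would use the full table and hold out a test set, as was done for the error quoted above.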
Now, looking at the most recent national polls for the upcoming election, we see that Hillary Clinton has a 6.1% lead in the popular vote. Our neural network model then predicts the following:
Simulation  Popular Vote Margin  Voter Turnout  Predicted Percentage of Electoral College Votes (+/- 0.04996417) 
1  0.061  0.30  0.6607371 
2  0.061  0.35  0.6647464 
3  0.061  0.40  0.6687115 
4  0.061  0.45  0.6726314 
5  0.061  0.50  0.6765048 
6  0.061  0.55  0.6803307 
7  0.061  0.60  0.6841083 
8  0.061  0.65  0.6878366 
9  0.061  0.70  0.6915149 
10  0.061  0.75  0.6951424 
One sees that even for an extremely low voter turnout (30%), at this point Hillary Clinton can expect to win between 61.08% and 71.07% of the Electoral College vote, or 328 to 382 electoral votes. Therefore, what seems like a relatively small lead in the popular vote (6.1%) translates, according to this neural network model, into a large margin of victory in the electoral college.
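The arithmetic behind that range can be checked with a few lines of Python. This is only a sketch: it takes the 30% turnout prediction and the +/- 0.04996417 band from the table above, assumes 538 total electoral votes, and assumes the vote counts are rounded down, since the rounding rule is not stated here.

```python
import math

TOTAL_ELECTORAL_VOTES = 538  # size of the Electoral College in 2016

def electoral_vote_range(predicted_share, half_width, total=TOTAL_ELECTORAL_VOTES):
    """Return (low share, high share, low votes, high votes) for a prediction band."""
    low = predicted_share - half_width
    high = predicted_share + half_width
    # Rounding down is an assumption; the post does not say how votes were rounded.
    return low, high, math.floor(low * total), math.floor(high * total)

# Prediction for 30% turnout and a 6.1% popular vote margin, from the table above.
low, high, votes_low, votes_high = electoral_vote_range(0.6607371, 0.04996417)
print(f"{low:.5f} to {high:.5f} -> {votes_low} to {votes_high} electoral votes")
```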
One can see that the predicted percentage of electoral college votes depends strongly on the popular vote margin and voter turnout. For example, if we reduce the popular vote margin to 1%, the results are less promising for the leading candidate:
Pop. Vote Margin  Voter Turnout  E.C. % Win  E.C. % Win Worst Case  E.C. % Win Best Case 
0.01  0.30  0.5182854  0.4675000  0.5690708 
0.01  0.35  0.5244157  0.4736303  0.5752011 
0.01  0.40  0.5305820  0.4797967  0.5813674 
0.01  0.45  0.5367790  0.4859937  0.5875644 
0.01  0.50  0.5430013  0.4922160  0.5937867 
0.01  0.55  0.5492434  0.4984580  0.6000287 
0.01  0.60  0.5554995  0.5047141  0.6062849 
0.01  0.65  0.5617642  0.5109788  0.6125496 
0.01  0.70  0.5680317  0.5172463  0.6188171 
0.01  0.75  0.5742963  0.5235109  0.6250817 
One sees that if the popular vote margin is just 1% for the leading candidate, that candidate is not in the clear unless voter turnout reaches about 60%, since only then does even the lower bound of the predicted electoral college share exceed 50%.
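The turnout at which the leading candidate becomes safe can be read off the table directly. The sketch below copies the smaller of the two bounds from each row of the table above and looks for the first turnout at which it exceeds 50%; the function name and the 0.5 threshold are choices made for this example.

```python
# (voter turnout, lower bound of predicted EC share) at a 1% popular vote margin,
# copied from the table above.
LOWER_BOUNDS = [
    (0.30, 0.4675000), (0.35, 0.4736303), (0.40, 0.4797967), (0.45, 0.4859937),
    (0.50, 0.4922160), (0.55, 0.4984580), (0.60, 0.5047141), (0.65, 0.5109788),
    (0.70, 0.5172463), (0.75, 0.5235109),
]

def first_safe_turnout(rows, threshold=0.5):
    """Smallest tabulated turnout whose lower bound exceeds the threshold, else None."""
    for turnout, lower in sorted(rows):
        if lower > threshold:
            return turnout
    return None

safe_turnout = first_safe_turnout(LOWER_BOUNDS)
print(safe_turnout)
```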
Breaking Down the 2015-2016 NBA Season
In this article, I will use Data Science / Machine Learning methodologies to break down the real factors separating the playoff teams from the non-playoff teams. In particular, I used the data from BasketballReference.com to associate 44 predictor variables with each team: “FG” “FGA” “FG.” “X3P” “X3PA” “X3P.” “X2P” “X2PA” “X2P.” “FT” “FTA” “FT.” “ORB” “DRB” “TRB” “AST” “STL” “BLK” “TOV” “PF” “PTS” “PS.G” “oFG” “oFGA” “oFG.” “o3P” “o3PA” “o3P.” “o2P” “o2PA” “o2P.” “oFT” “oFTA” “oFT.” “oORB” “oDRB” “oTRB” “oAST” “oSTL” “oBLK” “oTOV” “oPF” “oPTS” “oPS.G”, where the letter ‘o’ before each of the last 22 predictor variables indicates a defensive variable (‘o’ stands for opponent).
Using principal components analysis (PCA), I was able to project this 44-dimensional data set onto a 5-dimensional one. That is, the first 5 principal components were found to explain 85% of the variance.
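The rule used to pick the number of components (keep adding principal components until a target share of the variance is explained) can be sketched as follows. The variances in the example are hypothetical and far fewer than the 44 in this data set; they only demonstrate the bookkeeping, not the actual PCA results.

```python
def components_needed(variances, threshold=0.85):
    """Return (count, cumulative share) for the fewest components reaching the threshold."""
    total = sum(variances)
    cumulative = 0.0
    # Principal components are ordered by decreasing variance (eigenvalue).
    for count, variance in enumerate(sorted(variances, reverse=True), start=1):
        cumulative += variance
        if cumulative / total >= threshold:
            return count, cumulative / total
    return len(variances), 1.0

# Hypothetical component variances, NOT the values from this analysis.
count, share = components_needed([5.0, 3.0, 1.0, 0.5, 0.5], threshold=0.85)
print(f"{count} components explain {share:.0%} of the variance")
```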
Here are the various biplots:
In these plots, the teams are grouped according to whether they made the playoffs or not.
One sees from this biplot of the first two principal components that the dominant component along the first PC is 3-point attempts, while the dominant component along the second PC is opponent points. CLE and TOR have a high negative score along the second PC, indicating a strong defensive performance. Indeed, one suspects that the final separating factor that led CLE to the championship was their defensive play, as opposed to 3-point shooting, which all in all didn’t do GSW any favours. This is in line with some of my previous analyses.
Basketball Paper Update
Everyone by now knows about this paper I wrote a few months ago: http://arxiv.org/abs/1604.05266
Using data science / machine learning methodologies, it showed that the most important factors in characterizing a team’s playoff eligibility are the opponent field goal percentage and the opponent points per game. This suggests that defensive rather than offensive factors are the most important characteristics shared among NBA playoff teams. It also showed that championship teams must have very strong defensive characteristics, in particular strong perimeter defense, in combination with an effective half-court offense that generates high-percentage two-point shots. A key part of this offensive strategy must also be the ability to draw fouls.
Some people have commented that, despite this, teams that frequently attempt three-point shots can still be considered to have an efficient offense, since doing so leads to better rebounding, floor spacing, and higher-percentage shots. We show below that this is not true. Looking at the last 16 years of all NBA teams (using the same data we used in the paper), we performed a correlation analysis of an individual NBA team’s 3-point attempts per game and other relevant variables, and discovered:
One sees that there is very little correlation between a team’s 3-point attempts per game and its 2-point percentage, free throws, free throw attempts, and offensive rebounds. In fact, the strongest relationship is only a “medium” anticorrelation between 3-point attempts per game and a team’s 2-point attempts per game.
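For readers who want to run this kind of check on their own data, here is a minimal sketch of a Pearson correlation in plain Python. Pearson correlation is an assumption, since the post does not name the coefficient used, and the two short lists are hypothetical per-team averages rather than the 16 seasons analyzed here.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-team season averages, NOT the data used in this post.
three_point_attempts = [18.0, 22.0, 25.0, 28.0, 31.0]
two_point_attempts = [68.0, 65.0, 61.0, 60.0, 56.0]
r = pearson(three_point_attempts, two_point_attempts)
print(f"r = {r:.3f}")
```

A negative r means that teams taking more 3-point shots tend to take fewer 2-point shots; its square gives the share of variance explained.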
2016 Real-Time Election Predictions
Further to my original post on using physics to predict the outcome of the 2016 US Presidential elections, I have now written a cloud-based app using the powerful Wolfram Cloud to pull the most recent polling data on the web from The HuffPost Pollster, which “tracks thousands of public polls to give you the latest data on elections, political opinions and more”. This app works in real time and applies my PDE-solver / machine learning based algorithm to predict the probability of a candidate winning a state, assuming the election is held tomorrow.
The app can be accessed by clicking the image below: (Note: If you obtain some type of server error, it means Wolfram’s server is busy; a refresh usually works. Also, results are only computed for states for which reliable polling data exists.)
Will Donald Trump’s Proposed Immigration Policies Curb Terrorism in The US?
In recent days, Donald Trump proposed yet another iteration of his immigration policy, which is focused on “Keeping America Safe” as part of his plan to “Make America Great Again!”. In this latest iteration, in addition to suspending visas from countries with terrorist ties, he is also proposing an ideological test for those entering the US. As you can see in the BBC article, he is also fond of holding up bar graphs showing the number of refugees entering the US over a period of time, and somehow relates that to terrorist activities in the US, or at least insinuates it.
Let’s look at the facts behind these proposals using the available data from 2005-2014. Specifically, we analyzed:
 The number of terrorist incidents per year from 2005-2014 from here (The Global Terrorism Database maintained by The University of Maryland)
 The Department of Homeland Security Yearbook of Immigration Statistics, available here . Specifically, we looked at Persons Obtaining Lawful Permanent Resident Status by Region and Country of Birth (2005-2014) and Refugee Arrivals by Region and Country of Nationality (2005-2014).
Given these datasets, we focused on countries/regions labeled as terrorist safe havens and state sponsors of terror based on the criteria outlined here .
We found the following.
First, looking at naturalized citizens, these computations yielded:
Country  Correlation  Percent of Variance Explained
Afghanistan  0.61169  0.37416
Egypt  0.26597  0.07074
Indonesia  -0.66011  0.43574
Iran  0.31944  0.10204
Iraq  0.26692  0.07125
Lebanon  0.35645  0.12706
Libya  0.59748  0.35698
Malaysia  0.39481  0.15587
Mali  0.20195  0.04079
Pakistan  0.00513  0.00003
Philippines  -0.79093  0.62557
Somalia  0.40675  0.16544
Syria  0.62556  0.39132
Yemen  0.11707  0.01371
In graphical form:
The highest correlations are 0.62556 and 0.61169, from Syria and Afghanistan respectively. The strongest anticorrelations are from Indonesia and the Philippines, at -0.66011 and -0.79093 respectively. None of the positive correlations exceeds 0.65, which indicates that there could be some relationship between the number of naturalized citizens from these particular countries and the number of terrorist incidents, but it is nowhere near conclusive. Further, for Syria the percentage of variance explained (the coefficient of determination) is 0.39132, which means that only about 39% of the variation in the number of terrorist incidents in the United States can be predicted from the number of naturalized citizens born in Syria.
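The "percent of variance explained" column is simply the square of the correlation, which is easy to confirm. The sketch below squares a few of the correlation magnitudes from the table above and compares them with the reported values; the tolerance is an assumption that allows for the rounding of the published figures.

```python
# (country, correlation magnitude, reported share of variance explained),
# copied from the table above.
REPORTED = [
    ("Afghanistan", 0.61169, 0.37416),
    ("Libya", 0.59748, 0.35698),
    ("Syria", 0.62556, 0.39132),
    ("Yemen", 0.11707, 0.01371),
]

TOLERANCE = 1e-4  # assumed allowance for rounding in the published figures

def matches_reported(r, reported_r2, tolerance=TOLERANCE):
    """True when r squared agrees with the reported coefficient of determination."""
    return abs(r * r - reported_r2) < tolerance

for country, r, reported_r2 in REPORTED:
    print(f"{country}: r^2 = {r * r:.5f} (reported {reported_r2:.5f})")
```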
Second, looking at refugees, these computations yielded:
Country  Correlation  Percent of Variance Explained
Afghanistan  0.59836  0.35803
Egypt  0.66657  0.44432
Iran  0.29401  0.08644
Iraq  0.49295  0.24300
Pakistan  0.60343  0.36413
Somalia  0.14914  0.02224
Syria  0.56384  0.31792
Yemen  0.35438  0.12558
Other  0.54109  0.29278
In graphical form:
We see that the highest correlations are from Egypt (0.66657), Pakistan (0.60343), and Afghanistan (0.59836). This indicates there is some mild correlation between refugees from these countries and the number of terrorist incidents in the United States, but it is nowhere near conclusive. Further, the coefficients of determination for Egypt and Syria are 0.44432 and 0.31792 respectively. This means that in the case of Syrian refugees, for example, only 31.792% of the variation in terrorist incidents in the United States can be predicted from the number of refugees arriving from Syria.
In conclusion, it is therefore unlikely that Donald Trump’s proposals would do anything to significantly curb the number of terrorist incidents in The United States. Further, repeatedly showing pictures like this:
at his rallies is doing nothing to address the issue at hand and is perhaps only serving as yet another fear tactic as has become all too common in his campaign thus far.
(Thanks to Hargun Singh Kohli, Honours B.A., LL.B. for the initial data mining and processing of the various datasets listed above.)
Note, further to the results of this article, I was recently made aware of this excellent article from The WSJ, which I have summarized below: