Coronavirus Predictions

By: Dr. Ikjyot Singh Kohli

I wrote an extensive script in R that takes the most recent daily data on new/confirmed COVID-19 cases by location and uses statistical learning to compute the probability that a selected location will observe a new COVID-19 case (i.e., the probability of observing a non-zero daily percentage return). You can access the dashboard by clicking the image below. (Beneath the screenshot are further examples of possible selections.)

Here, we see a map of all current COVID-19 locations and the ability to select a specific location. Further, there are two calculations at the bottom of the screen: the first is the selected location(s)’ probability of observing a new case; the second is the current long-term trend of the daily growth rate of new cases for the selected location(s).
In this example, we have asked to return locations within the US that have more than an 87% probability of observing a new case. We can also see that for these locations, the long-term growth rate is trending towards 0.80.

Did Clyburn Help Biden in South Carolina?

By: Dr. Ikjyot Singh Kohli

The conventional wisdom among the political pundits/analysts seeking to explain Joe Biden’s massive win in the 2020 South Carolina primary is that Jim Clyburn’s endorsement was the sole reason Biden won. (Here is just one article describing this.)

I wanted to analyze the data behind this claim and actually measure the Clyburn effect. Clyburn formally endorsed Biden on February 26, 2020.

Using extensive polling data from RealClearPolitics, I looked at Biden’s margin of victory according to various polling samples before the Clyburn endorsement. I used Kernel Density Estimation to form the following probability density function of Biden’s predicted margin of victory (as a percentage/popular vote) in the 2020 South Carolina Primary:

Assuming this probability density function has the form p(x), we notice some interesting properties:

  • The Expected Margin of Victory for Biden is given by \int x p(x) dx. Using numerical integration, we find that \int x p(x) dx = 18.513\%. The variance of this prediction is var(x) = \int x^2 p(x) dx - \left(\int x p(x) dx\right)^2 = 107.79, which gives a standard deviation of \sqrt{107.79} \approx 10.382. This means that the predicted Biden margin of victory is 18.51 \pm 10.382, so the upper bound of this prediction is 28.89%. That is, according to the data before Clyburn’s endorsement, it was perfectly reasonable to expect that Biden’s margin of victory in South Carolina could be around 29%. Indeed, Biden’s final margin of victory in South Carolina was 28.5%, which is within the prediction interval. Therefore, it seems unlikely that Jim Clyburn’s endorsement boosted Biden’s victory in South Carolina.
  • Given the density function above, we can make some more interesting calculations:
  • P(Biden win > 5%) = 1 - \int_{-\infty}^{5} p(x) dx = 0.904 = 90.4%
  • P(Biden win > 10%) = 1 - \int_{-\infty}^{10} p(x) dx = 0.799 = 79.9%
  • P(Biden win > 15%) = 1 - \int_{-\infty}^{15} p(x) dx = 0.710 = 71.0%
  • P(Biden win > 20%) = 1 - \int_{-\infty}^{20} p(x) dx = 0.567 = 56.7%
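As a sketch of how these quantities can be computed, here is a Python version using scipy’s Gaussian KDE. The polling margins below are invented placeholders (the actual RealClearPolitics samples are not reproduced here), so the printed numbers will not match the post’s figures.

```python
# Sketch of the KDE calculations above. The polling margins are invented
# placeholders (the RealClearPolitics samples are not reproduced here), so the
# printed numbers will not match the post's 18.513% / 107.79 figures.
import numpy as np
from scipy.stats import gaussian_kde

poll_margins = np.array([12.0, 15.0, 19.0, 24.0, 28.0, 13.0, 20.0, 17.0])
kde = gaussian_kde(poll_margins)

# Expected margin and standard deviation via numerical integration on a grid
grid = np.linspace(-20, 60, 4001)
dx = grid[1] - grid[0]
p = kde(grid)
mean = np.sum(grid * p) * dx
var = np.sum(grid**2 * p) * dx - mean**2
print(f"Expected margin: {mean:.2f}% +/- {np.sqrt(var):.2f}")

# Tail probabilities P(margin > t) = 1 - integral of p(x) up to t
for t in (5, 10, 15, 20):
    print(f"P(margin > {t}%) = {1 - kde.integrate_box_1d(-np.inf, t):.3f}")
```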

What these calculations show is that, before Clyburn’s endorsement, the probability that Biden would win by more than 5% was 90.4%, by more than 10% was 79.9%, by more than 15% was 71.0%, and by more than 20% was 56.7%.

Given these calculations, it actually seems unlikely that Clyburn’s endorsement made a huge impact on Biden’s win in South Carolina. This analysis shows that Biden would likely have won by more than 15%-20% regardless.

Optimal Strategies for Winning The Democratic Primaries

By: Dr. Ikjyot Singh Kohli

Election season is upon us again, and a number of people from political analysts to campaign advisors are making a huge deal about winning the Iowa caucuses. This seems to be the standard “wisdom”. I decided to run some analysis on the data to see if it was true.

I looked at every Democratic primary since 1976 and tried to find which states are absolutely “must-win” for a candidate to become the Democratic presidential nominee. Because the data is scarce from a data science perspective, I ran Monte Carlo bootstrap sampling on the dataset to produce the results.
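The resampling loop can be sketched as follows. The dataset here is synthetic: random state-by-state wins with a planted rule that the nominee is whoever carries Illinois. It illustrates the bootstrap-plus-tree procedure, not the actual 1976-onward primary data.

```python
# Sketch of the bootstrap-plus-classification-tree procedure. The dataset is
# synthetic: random state-by-state wins with a planted rule that the nominee
# is whoever carries Illinois. It illustrates the resampling loop, not the
# actual 1976-onward primary results.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

rng = np.random.default_rng(0)
states = ["IA", "NH", "IL", "TX"]
X = rng.integers(0, 2, size=(40, len(states)))  # 1 = candidate won that state
y = X[:, 2]                                     # planted rule: nominee iff won IL

# Refit a shallow tree on repeated bootstrap resamples; record each root split
root_splits = []
for i in range(200):
    Xb, yb = resample(X, y, random_state=i)
    if len(set(yb)) < 2:
        continue  # degenerate resample with a single class
    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(Xb, yb)
    root_splits.append(states[tree.tree_.feature[0]])

# The most frequent root split is the "must-win" state in this toy setup
print(max(set(root_splits), key=root_splits.count))  # recovers "IL"
```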

Interestingly, irrespective of the number of bootstrap samples, three classification tree results kept coming up, which I now present:

Winning a certain state was encoded as a binary variable. “0” indicates a candidate losing the state, while “1” indicates a candidate won the state.

Very interestingly, one sees from the classification tree above that the most important state for a candidate to win, to ensure the highest probability of becoming the Democratic nominee, is actually Illinois.

The other result from bootstrap sampling was as follows:

(Encoding as in the first tree: “0” indicates a candidate lost the state, “1” indicates a candidate won it.)

Here we see that winning Texas is of paramount importance. In fact, all subsequent paths to the nomination stem from winning Texas.

There is also a third result that came from the bootstrap simulation:

(Encoding as in the first tree: “0” indicates a candidate lost the state, “1” indicates a candidate won it.)

We see that in this simulation, once again Illinois is of prime importance. However, even if a candidate does lose Illinois, evidently a path to the nomination is still possible if that candidate wins Maryland and Arizona.

Conclusion: We see from analyzing the data that Iowa and New Hampshire are actually not very important in winning the Democratic Party nomination. Rather, Illinois and Texas are much more important in giving a candidate a high probability of becoming the Democratic nominee.

The Probability of An Illegal Immigrant Committing a Crime In The United States

Trump has once again put the U.S. on the world stage, this time at the expense of innocent children whose families are seeking asylum. The Trump administration’s justification is that:


“They want to have illegal immigrants pouring into our country, bringing with them crime, tremendous amounts of crime.”


I decided to try to analyze this statement quantitatively. Indeed, one can calculate the probability that an illegal immigrant will commit a crime within The United States as follows. Let us denote crime (or criminal) by C, while denoting illegal immigrant by ii. Then, by Bayes’ theorem, we have:

\boxed{P(C \mid ii) = \frac{P(ii \mid C)\, P(C)}{P(ii)}}

It is quite easy to find data for the various factors in this formula. For example, one finds that:

  1. P(ii \mid C) = 0.21
  2. P(C) = 0.02
  3. P(ii) = 0.037

Putting all of this together, we find that:

P(C|ii) = 0.1135 = 11.35 \%
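A quick check of the arithmetic, plugging the quoted estimates into Bayes’ theorem:

```python
# Quick arithmetic check of the Bayes' theorem calculation, using the
# estimates quoted above.
p_ii_given_c = 0.21   # P(ii | C)
p_c = 0.02            # P(C)
p_ii = 0.037          # P(ii)

p_c_given_ii = p_ii_given_c * p_c / p_ii
print(f"P(C | ii) = {p_c_given_ii:.4f}")  # 0.1135, i.e. about 11.35%
```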

That is, the probability that an illegal immigrant will commit a crime (of any type) while in The United States is a very low 11.35%.


Therefore, Trump’s claim of “tremendous amounts of crime” being brought to The United States by illegal immigrants is incorrect.


Note that the numerical factors used above were obtained from:

What if Michael Jordan Played in Today’s NBA?

By: Dr. Ikjyot Singh Kohli

It seems that one cannot turn on ESPN or any YouTube channel nowadays without encountering the ongoing debate: is Michael Jordan better than Lebron? What would happen if Michael Jordan played in today’s NBA? And so on. However, I have not seen a single scientific approach to this question. Admittedly, it is sort of an impossible question to answer, but using data science I will try.

From a data science perspective, it only makes sense to look at Michael Jordan’s performance in a single season, and try to predict based on that season how he would perform in the most recent NBA season. That being said, let’s look at Michael Jordan’s game-to-game performance in the 1995-1996 NBA season when the Bulls went 72-10.

Using neural networks and Garson’s algorithm to regress against Michael Jordan’s per-game point total, we note the following:

In this plot, the “o” stands for opponent.


One can see from this variable importance plot that Michael’s points in a given game were most positively associated with opponents that committed a high number of turnovers, followed by opponents that made a lot of 3-point shots. Interestingly, there was no strong negative factor on Michael’s points in a given game.

Given this information and the per-game league averages of the 2016-2017 season, we used this neural network to predict how many points Michael would average in today’s game:

Michael Jordan: 2017 NBA Season Prediction: 32.91 Points / Game (+/- 6.9)

It is interesting to note that Michael averaged 30.4 Points/Game in the 1995-1996 NBA Season. We therefore conclude that the 1995-1996 Michael would average a higher points/game if he played in today’s NBA.
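For readers curious about the method, here is a sketch of Garson’s algorithm applied to a one-hidden-layer network. The feature names and training data are invented stand-ins, not Jordan’s actual game logs; the synthetic target depends mostly on opponent turnovers, so that feature should rank first in the importance output.

```python
# Sketch of Garson's algorithm on a one-hidden-layer network. The feature
# names and training data are invented stand-ins for Jordan's game logs; the
# target depends mostly on opponent turnovers, so that feature should rank first.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
features = ["opp_TOV", "opp_3PM", "opp_STL", "opp_DRB"]
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

net = MLPRegressor(hidden_layer_sizes=(5,), max_iter=5000, random_state=0).fit(X, y)
W_ih, W_ho = net.coefs_  # input->hidden (4x5) and hidden->output (5x1) weights

def garson(W_ih, W_ho):
    """Relative input importance from absolute weight magnitudes."""
    contrib = np.abs(W_ih) * np.abs(W_ho).ravel()  # per input, per hidden unit
    contrib /= np.abs(W_ih).sum(axis=0)            # normalise within each unit
    importance = contrib.sum(axis=1)
    return importance / importance.sum()

for name, imp in zip(features, garson(W_ih, W_ho)):
    print(f"{name}: {imp:.2f}")
```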

As an aside, a plot of the neural network used to generate these variable importance plots and predictions is as follows:


What about the reverse question? What if the 2016-2017 Lebron James played in the 1995-1996 NBA? What would happen to his per-game point average? Using the same methodology as above, we used neural networks in combination with Garson’s algorithm to obtain a variable importance plot for Lebron James’ per-game point totals:



One sees from this plot that Lebron’s points in a given game were most positively impacted by opponents that committed a lot of personal fouls, followed by opponents that grabbed a lot of offensive rebounds. There were no predominantly strong negative factors affecting Lebron’s ability to score.

Using this neural network model, we then tried to make a prediction on how many points per game Lebron would score if he played in the 1995-1996 NBA Season:

Lebron James: 1995-1996 NBA Season Prediction: 18.81 Points / Game (+/- 4.796)

This neural network model predicts that Lebron James would average 18.81 Points/Game if he played in the 1995-1996 NBA season, which is a drop from the 26.4 Points/Game he averaged this most recent NBA season.

Therefore, at least from this neural network model, one concludes that Lebron’s per game points would decrease if he played in the 1995-1996 Season, while Michael’s number would increase slightly if he played in the 2016-2017 Season.

The “Interference” of Phil Jackson

By: Dr. Ikjyot Singh Kohli

So, I came across this article today by Matt Moore on CBSSports, who once again has basically taken to the web to bash the Triangle Offense. Of course, much of what he claims (like much of the Knicks media) is flat-out wrong, based on very primitive and simplistic analysis, as I will point out below. Further, much of this article seems to be motivated by several comments Carmelo Anthony made recently expressing his dismay at Jeff Hornacek moving away from the “high-paced” offense the Knicks were running before the All-Star break:

“I think everybody was trying to figure everything out, what was going to work, what wasn’t going to work,’’ Anthony said in the locker room at the former Delta Center. “Early in the season, we were winning games, went on a little winning streak we had. We were playing a certain way. We went away from that, started playing another way. Everybody was trying to figure out: Should we go back to the way we were playing, or try to do something different?’’

Anthony suggested he liked the Hornacek way.

“I thought earlier we were playing faster and more free-flow throughout the course of the game,’’ Anthony said. “We kind of slowed down, started settling it down. Not as fast. The pace slowed down for us — something we had to make an adjustment on the fly with limited practice time, in the course of a game. Once you get into the season, it’s hard to readjust a whole system.’’

First, it is well-known that the Knicks have been implementing more of the triangle offense since the All-Star break. All-Star Weekend was February 17-19, 2017. The Knicks’ record before All-Star weekend was, amusingly, 23-34 (11 games below .500), a fact nowhere mentioned in any of these articles, and apparently not mentioned (realized?) by Carmelo.

Anyhow, the question is as follows: if Hornacek had been allowed to continue his non-triangle ways of pushing the ball at a higher pace (what Carmelo claims he liked), would the Knicks have made the playoffs? Probably not. I claim this based on a detailed machine-learning-based analysis of playoff-eligible teams that has been available for some time now. In fact, what is perhaps most important from this paper is the following classification tree, which determines whether a team is playoff-eligible:


So, these are the relevant factors in determining whether or not a team in a given season makes the playoffs. (Please see the paper linked above for details on the justification of these results.)

Consider these predictor variables for the Knicks up to the All-Star break:

  1. Opponent Assists/Game: 22.44
  2. Steals/Game: 7.26
  3. TOV/Game: 13.53
  4. DRB/Game: 33.65
  5. Opp.TOV/Game: 12.46

Since Opp. TOV/Game = 12.46 < 13.16, the Knicks would actually be predicted to miss the NBA playoffs. In fact, if the so-called “Hornacek trends” had been allowed to continue, one can compute the probability of the Knicks making the playoffs:


From this probability density function, we can calculate that the probability of the Knicks making the playoffs was 36.84%. The classification tree also predicted that the Knicks would miss the playoffs. So, what is being missed by Carmelo, Matt Moore, and the like is the complete lack of pressure defense and, hence, the insufficient number of opponent TOV/G. It is therefore completely incorrect to claim that the Knicks were somehow “destined for glory” under Hornacek’s way of doing things. This is exacerbated by the fact that the Knicks’ opponent AST/G before the All-Star break was already quite high at 22.44.
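For illustration, the two splits quoted in this post can be wrapped into a toy classifier. Note that the actual tree in the linked paper has more splits, and the ordering of these two is my assumption, not the paper’s.

```python
# Toy classifier built from the two splits quoted in this post (opponent
# AST/G < 20.75 and opponent TOV/G >= 13.16). The full tree in the linked
# paper has more splits, and the ordering of these two is my assumption.

def playoff_prediction(opp_ast_per_game, opp_tov_per_game):
    if opp_ast_per_game < 20.75:
        return "playoffs"
    # Teams allowing many assists must force turnovers to compensate
    return "playoffs" if opp_tov_per_game >= 13.16 else "miss"

# Knicks up to the All-Star break (values from the post)
print(playoff_prediction(opp_ast_per_game=22.44, opp_tov_per_game=12.46))   # miss
# Knicks after the All-Star break, with opponent AST/G down to 20.642
print(playoff_prediction(opp_ast_per_game=20.642, opp_tov_per_game=12.46))  # playoffs
```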

The question now is: how have the Knicks been doing since Phil Jackson’s supposed interference and since supposedly implementing the triangle in a more complete sense? (On a side note, I still don’t think you can partially implement the triangle; it needs a proper off-season implementation, as it is a complete system.)

Interestingly enough, the Knicks’ opponent assists per game (which, according to the machine-learning analysis, is the most relevant factor in determining whether a team makes the playoffs) from All-Star weekend to the present day is an impressive 20.642/game. By the classification tree above, this actually puts the Knicks safely in playoff territory, in the sense of being classified as a playoff team, but it is too little, too late.

The defense has actually improved significantly with respect to the key statistic of opponent AST/G since the Knicks started implementing the triangle more completely. (Note that, as will be shown in a future article, DRTG and ORTG are largely useless statistics for determining a team’s playoff eligibility, another point completely missed in Moore’s article.)

Based on this analysis, I would argue that Phil Jackson should actually have interfered earlier in the season. In fact, if the Knicks keep their opponent assists/game below 20.75 next season (which is now very likely if current trends continue), they would be predicted to make the playoffs by the above machine-learning analysis.

Finally, I will make one more point. It is interesting to look at Phil Jackson teams that were not packed with dominant players. The saying, unfortunately, goes: “Phil Jackson’s success had nothing to do with the triangle; it was because he had Shaq/Kobe, Jordan/Pippen, etc.”

Well, let’s first look at the 1994-1995 Chicago Bulls, a team that did not have Michael Jordan, but ran the triangle offense completely. Per the relevant statistics above:

  1. Opp. AST/G = 20.9
  2. STL/G = 9.7
  3. AST/G = 24.0
  4. Opp. TOV/G = 18.1

These are remarkable defensive numbers, which supports Phil’s idea that the triangle offense leads to good defense.



Analyzing Lebron James’ Offensive Play

Where is Lebron James most effective on the court?

Based on 2015-2016 data, we obtained the following data, which tracks Lebron’s FG% by defender distance:


We then obtained data on Lebron’s FG% by shot distance from the basket:


Based on these data, we generated tens of thousands of sample data points via Monte Carlo simulation to obtain the relevant probability density functions. We found that the joint PDF was a very lengthy expression(!):


Graphically, this was:


A contour plot of the joint PDF was computed to be:


From this information, we can compute where Lebron has the highest probability of making a shot. Numerically, we found that the maximum probability occurs when Lebron’s defender is 0.829988 feet away and Lebron is 1.59378 feet from the basket. What is interesting is that this analysis shows that defending Lebron tightly does not seem to be an effective strategy when his shot is within 5 feet of the basket; it is only effective further than 5 feet from the basket. Therefore, opposing teams have the best chance of stopping Lebron from scoring by playing him tightly and forcing him as far away from the basket as possible.
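The density-estimation-and-mode-finding step can be sketched as follows: fit a 2-D Gaussian KDE to sampled (defender distance, shot distance) points and take the grid argmax as the mode. The samples below are synthetic stand-ins; the post’s samples were generated from the tracking-data FG% tables.

```python
# Sketch of the density-estimation-and-mode-finding step: fit a 2-D Gaussian
# KDE to (defender distance, shot distance) samples and take the grid argmax
# as the mode. The samples here are synthetic stand-ins for the Monte Carlo
# data generated from the FG% tables.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)
defender_dist = rng.gamma(shape=2.0, scale=1.0, size=5000)  # feet, illustrative
shot_dist = rng.gamma(shape=1.5, scale=2.0, size=5000)      # feet, illustrative
kde = gaussian_kde(np.vstack([defender_dist, shot_dist]))

# Evaluate the joint PDF on a grid and take the argmax as the mode
d = np.linspace(0, 10, 100)
s = np.linspace(0, 15, 100)
D, S = np.meshgrid(d, s)
density = kde(np.vstack([D.ravel(), S.ravel()])).reshape(D.shape)
i, j = np.unravel_index(density.argmax(), density.shape)
print(f"mode: defender at {D[i, j]:.2f} ft, shot from {S[i, j]:.2f} ft")
```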