Coronavirus Predictions

By: Dr. Ikjyot Singh Kohli

I wrote an extensive R script that takes the most recent available data on new/confirmed COVID-19 cases per day by location and uses statistical learning to compute the probability that a selected location will observe a new COVID-19 case (that is, the probability of observing a non-zero daily percentage return). You can access the dashboard by clicking the image below. (Beneath the screenshot are further examples of possible selections.)

Here, we see a map of all current COVID-19 locations, with the ability to select a specific location. There are also two calculations at the bottom of the screen: the first is the probability that the selected location(s) will observe a new case, and the second is the current long-term trend of the daily growth rate of new cases for the selected location(s).
In this example, we have asked to return locations within the US that have more than an 87% probability of observing a new case. We can also see that for these locations, the long-term growth rate is trending towards 0.80.
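As a rough illustration of the quantity being estimated (not the R script behind the dashboard, whose statistical-learning model is not described here), the sketch below computes daily percentage returns from a cumulative case series and takes the plain empirical frequency of non-zero returns. The function names and the case counts are hypothetical, and a simple frequency estimate is a stand-in for whatever model the dashboard actually uses.

```python
def daily_pct_returns(cumulative_cases):
    """Day-over-day percentage change; days whose previous count is zero are skipped."""
    returns = []
    for prev, curr in zip(cumulative_cases, cumulative_cases[1:]):
        if prev > 0:
            returns.append((curr - prev) / prev)
    return returns


def prob_new_case(cumulative_cases):
    """Empirical share of days with a non-zero percentage return (None if no usable days)."""
    returns = daily_pct_returns(cumulative_cases)
    if not returns:
        return None
    return sum(1 for r in returns if r != 0) / len(returns)


# Hypothetical cumulative confirmed cases for one location (made-up numbers).
example_cases = [1, 1, 2, 2, 3, 5, 5, 8]
example_probability = prob_new_case(example_cases)
```

Here four of the seven day-over-day changes are non-zero, so the empirical estimate is 4/7. A fitted model would normally smooth or condition this estimate rather than use the raw frequency.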

Movie Sentiment Tracker

I wrote an extensive application using NLP and TensorFlow/Keras in Python that looks at all of the current and upcoming Hollywood releases for 2020 and tracks the online Twitter sentiment for each of them. The model output was then displayed in a PowerBI dashboard. In essence, we are predicting the classification probability Pr(Sentiment=Positive|Data).

You can access the dashboard by clicking on the screenshot below:

We have also included a new feature that gives a daily popularity score for movies. An algorithm was designed to rank movies according to daily positive sentiment. This can be found on Page 2 of the dashboard link.

You can select different titles by clicking the dropdown list. The left-side graph shows the sentiment distribution of all of the tweet data corresponding to a film. The right-side graph shows the median tweet sentiment on each day for the selected film. (Right now, we go back 30 days from the present day.) The dashboard is intended to be refreshed every day.
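The daily median shown in the right-side graph can be sketched as a simple group-by over scored tweets. This is only an illustration: it assumes each tweet has already been assigned a positive-sentiment probability by the model, and the dates, scores, and function name are hypothetical.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (date, Pr(Sentiment=Positive | tweet)) pairs for one film.
scored_tweets = [
    ("2020-03-01", 0.9),
    ("2020-03-01", 0.4),
    ("2020-03-01", 0.7),
    ("2020-03-02", 0.2),
    ("2020-03-02", 0.6),
]


def daily_median_sentiment(tweets):
    """Group sentiment scores by day and return the median score for each day."""
    by_day = defaultdict(list)
    for day, score in tweets:
        by_day[day].append(score)
    return {day: median(scores) for day, scores in sorted(by_day.items())}


daily_medians = daily_median_sentiment(scored_tweets)
```

The median is less sensitive than the mean to a handful of extreme tweets, which is presumably why it is the statistic plotted.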

Did Clyburn Help Biden in South Carolina?

By: Dr. Ikjyot Singh Kohli

The conventional wisdom among political pundits and analysts seeking to explain Joe Biden’s massive win in the 2020 South Carolina primary is that Jim Clyburn’s endorsement was the sole reason why Biden won. (Here is just one article describing this.)

I wanted to analyze the data behind this and actually measure the size of the “Clyburn effect”. Clyburn formally endorsed Biden on February 26, 2020.

Using extensive polling data from RealClearPolitics, I looked at Biden’s margin of victory according to various polling samples before the Clyburn endorsement. I used Kernel Density Estimation to form the following probability density function of Biden’s predicted margin of victory (as a percentage/popular vote) in the 2020 South Carolina Primary:

Assuming this probability density function has the form p(x), we notice some interesting properties:

  • The expected margin of victory for Biden is given by \int x p(x) dx. Using numerical integration, we find that \int x p(x) dx = 18.513 \%. The variance of this prediction is var(x) = \int x^2 p(x) dx - (\int x p(x) dx)^2 = 107.79, so its standard deviation is \sqrt{107.79} = 10.382. This means that the predicted Biden margin of victory is 18.51 \pm 10.38 percentage points, and the upper bound of this prediction is 28.89%. That is, according to the data before Clyburn’s endorsement, it was perfectly reasonable to expect that Biden’s margin of victory in South Carolina could have been around 29%. Indeed, Biden’s final margin of victory in South Carolina was 28.5%, which is within one standard deviation of the predicted mean. Therefore, it seems unlikely that Jim Clyburn’s endorsement boosted Biden’s victory in South Carolina.
  • Given the density function above, we can make some more interesting calculations:
  • P(Biden win > 5%) = 1 - \int_{-\infty}^{5} p(x) dx = 0.904 = 90.4%
  • P(Biden win > 10%) = 1 - \int_{-\infty}^{10} p(x) dx = 0.799 = 79.9%
  • P(Biden win > 15%) = 1 - \int_{-\infty}^{15} p(x) dx = 0.710 = 71.0%
  • P(Biden win > 20%) = 1 - \int_{-\infty}^{20} p(x) dx = 0.567 = 56.7%

What these calculations show is that the probability that Biden would have won by more than 5% before Clyburn’s endorsement was 90.4%. The probability that Biden would have won by more than 10% before Clyburn’s endorsement was 79.9%. The probability that Biden would have won by more than 20% before Clyburn’s endorsement was 56.7%, and so on.
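The calculations above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the original analysis: the poll margins are made-up placeholders (the RealClearPolitics samples are not reproduced here), the kernel is Gaussian, the bandwidth uses a normal-reference rule of thumb, and the integrals use a plain trapezoid rule. All function names are hypothetical.

```python
import math

# Hypothetical pre-endorsement poll margins (Biden lead, percentage points).
margins = [4.0, 12.0, 15.0, 18.0, 20.0, 24.0, 28.0, 35.0]


def rule_of_thumb_bandwidth(xs):
    """Normal-reference bandwidth: 1.06 * sample std * n^(-1/5)."""
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    return 1.06 * sd * n ** (-0.2)


def kde_pdf(x, xs, h):
    """Gaussian kernel density estimate p(x) with bandwidth h."""
    norm = len(xs) * h * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in xs) / norm


def integrate(f, a, b, steps=4000):
    """Composite trapezoid rule on [a, b]."""
    dx = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * dx)
    return total * dx


h = rule_of_thumb_bandwidth(margins)
# Integrate far enough into the tails that the truncated mass is negligible.
lower, upper = min(margins) - 10 * h, max(margins) + 10 * h


def p(x):
    return kde_pdf(x, margins, h)


expected_margin = integrate(lambda x: x * p(x), lower, upper)
variance = integrate(lambda x: x * x * p(x), lower, upper) - expected_margin ** 2
std_dev = math.sqrt(variance)


def prob_margin_exceeds(threshold):
    """P(margin > threshold), the upper-tail integral of p(x)."""
    return integrate(p, threshold, upper)
```

With real polling samples in place of `margins`, `expected_margin`, `std_dev`, and `prob_margin_exceeds(5)` correspond to the mean, the error bar, and the tail probabilities quoted above. The exact values depend on the kernel and bandwidth chosen.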

Given these calculations, it actually seems unlikely that Clyburn’s endorsement made a huge impact on Biden’s win in South Carolina. This analysis shows that Biden would likely have won by more than 15%-20% regardless.

Optimal Strategies for Winning The Democratic Primaries

By: Dr. Ikjyot Singh Kohli

Election season is upon us again, and a number of people from political analysts to campaign advisors are making a huge deal about winning the Iowa caucuses. This seems to be the standard “wisdom”. I decided to run some analysis on the data to see if it was true.

I looked at every Democratic primary since 1976 and tried to find which states are absolutely “must-win” for a candidate to become the Democratic presidential nominee. Because the data is scarce from a data science perspective, I ran Monte Carlo bootstrap sampling on the dataset to come up with the results.
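The bootstrap procedure can be sketched as follows. This is an illustrative simplification, not the original analysis: the records are made up, only four states are included, and instead of growing full classification trees it finds just the root split (the state that best separates nominees from non-nominees by Gini impurity) for each bootstrap sample and counts how often each state is chosen. All names are hypothetical.

```python
import random
from collections import Counter

STATES = ["Iowa", "New Hampshire", "Illinois", "Texas"]

# Hypothetical records: (wins per state in STATES order, became nominee), all 0/1.
records = [
    ((1, 0, 1, 1), 1),
    ((0, 1, 1, 1), 1),
    ((1, 1, 0, 0), 0),
    ((0, 0, 1, 0), 1),
    ((1, 0, 0, 1), 0),
    ((0, 1, 0, 0), 0),
    ((0, 0, 1, 1), 1),
    ((1, 1, 0, 1), 0),
]


def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 for an empty list)."""
    if not labels:
        return 0.0
    share = sum(labels) / len(labels)
    return 2.0 * share * (1.0 - share)


def split_impurity(sample, j):
    """Weighted Gini impurity after splitting the sample on state j."""
    lost = [y for x, y in sample if x[j] == 0]
    won = [y for x, y in sample if x[j] == 1]
    return (len(lost) * gini(lost) + len(won) * gini(won)) / len(sample)


def best_root_state(sample):
    """State whose win/loss split gives the lowest weighted impurity."""
    j = min(range(len(STATES)), key=lambda k: split_impurity(sample, k))
    return STATES[j]


def bootstrap_root_counts(data, n_boot=500, seed=0):
    """Resample with replacement and count which state becomes the root split."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_boot):
        sample = [rng.choice(data) for _ in data]
        counts[best_root_state(sample)] += 1
    return counts


root_counts = bootstrap_root_counts(records)
```

A state that is selected as the root in most bootstrap samples is a stable "must-win" candidate; a state that only occasionally appears is more likely an artifact of a particular resample.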

Interestingly, irrespective of the number of bootstrap samples, three classification tree results kept coming up, which I now present:

Winning a certain state was encoded as a binary variable. “0” indicates a candidate losing the state, while “1” indicates a candidate won the state.

Interestingly, the classification tree above shows that the most important state for a candidate to win, in order to have the highest probability of being the Democratic nominee, is Illinois.

A second result from bootstrap sampling was as follows:

Here we see that winning Texas is of paramount importance. In fact, all subsequent paths to the nomination stem from winning Texas.

There is also a third result that came from the bootstrap simulation:

We see that in this simulation, once again Illinois is of prime importance. However, even if a candidate does lose Illinois, evidently a path to the nomination is still possible if that candidate wins Maryland and Arizona.

Conclusion: Analyzing the data, we see that Iowa and New Hampshire are actually not very important for becoming the Democratic party nominee. Rather, Illinois and Texas are much more important in giving a candidate a high probability of being the Democratic nominee.

A Problem With Offensive Rating

Abstract: It is shown that the standard/common definition of team offensive rating/offensive efficiency implies that a team’s offensive rating increases as its opponent’s offensive rebounds increase, which, in principle, should not be the case.

Over the past several years, the advanced metric known as Offensive Rating has become the standard way of measuring a basketball team’s offensive efficiency. Broadly speaking, it is defined as points scored per 100 possessions. Specifically, for teams, it is defined as follows (see https://www.basketball-reference.com/about/ratings.html, https://www.nbastuffer.com/analytics101/possession/, and https://fansided.com/2015/12/21/nylon-calculus-101-possessions/):

[Equation image: team Offensive Rating formula]

There is a significant issue with this definition, as I now demonstrate. Computing the partial derivative of this expression with respect to OppORB, we easily obtain:

[Equation image: partial derivative of Offensive Rating with respect to OppORB]

As the denominator is always positive, we would like to examine the numerator. The numerator is always negative due to physical constraints (i.e., can’t have negative points or rebounds!) and if OppFG < OppFGA, which makes intuitive sense. It is only positive if OppFG > OppFGA, which logically cannot happen. Therefore, this numerator is always negative (except for the rare case when OppFG = OppFGA of course), which means that the entire partial derivative is positive.

This means that a team’s offensive rating / offensive efficiency increases as its opponent’s offensive rebounds increase. Intuitively, this shouldn’t be the case. If your opponent has a high number of offensive rebounds, this should give you fewer possessions and put pressure on you to score, thus resulting in fewer points overall. The more general definition of offensive efficiency is 100*(Points Scored)/(Possessions), which is obviously maximized when possessions are minimized. The problem, of course, is that the more detailed definition of possessions implies that this minimization of possessions occurs at the cost of maximizing opponent offensive rebounds, which intuitively should not be the case.
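The sign of this derivative can be checked numerically. The sketch below assumes the team possession estimate given in the Basketball-Reference glossary, which is one reading of the definition cited above and may differ in detail from the exact expression used in this post. The box-score numbers and dictionary keys are made up for illustration.

```python
def possessions(team, opp):
    """Possession estimate (assumed Basketball-Reference form), averaged over both teams."""
    def one_side(a, other_drb):
        orb_share = a["ORB"] / (a["ORB"] + other_drb)
        return (a["FGA"] + 0.4 * a["FTA"]
                - 1.07 * orb_share * (a["FGA"] - a["FG"])
                + a["TOV"])
    return 0.5 * (one_side(team, opp["DRB"]) + one_side(opp, team["DRB"]))


def offensive_rating(team, opp):
    """Points scored per 100 possessions."""
    return 100.0 * team["PTS"] / possessions(team, opp)


def d_ortg_d_opp_orb(team, opp):
    """Analytic partial derivative of offensive rating with respect to OppORB."""
    poss = possessions(team, opp)
    d_poss = (-0.5 * 1.07 * (opp["FGA"] - opp["FG"]) * team["DRB"]
              / (opp["ORB"] + team["DRB"]) ** 2)
    return -100.0 * team["PTS"] / poss ** 2 * d_poss


# Hypothetical single-game box scores.
team = {"PTS": 110, "FG": 41, "FGA": 88, "FTA": 22, "ORB": 10, "DRB": 34, "TOV": 13}
opponent = {"PTS": 104, "FG": 39, "FGA": 90, "FTA": 20, "ORB": 9, "DRB": 33, "TOV": 14}

baseline = offensive_rating(team, opponent)
# Same game, but the opponent grabs six more offensive rebounds.
with_more_opp_orb = offensive_rating(team, dict(opponent, ORB=15))
```

Under this assumed formula the derivative is proportional to (OppFGA - OppFG), so it is positive whenever the opponent misses at least one shot, and `with_more_opp_orb` comes out larger than `baseline` even though nothing about the team's own offense has changed.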

NBA Analytics Dashboard

Here is an embedded dashboard that shows a number of statistical insights for NBA teams, their opponents, and individual players as well. You can compare multiple teams and players. Navigate through the different pages by clicking the scrolling arrows below. (The data is based on the most recent season “per-game” numbers.)

(If you cannot see the dashboard embedded below for whatever reason, click here to be taken directly to the dashboard in a separate page.)

The Probability of An Illegal Immigrant Committing a Crime In The United States

Trump has once again put the U.S. on the world stage, this time at the expense of innocent children whose families are seeking asylum. The Trump administration’s justification is that:

“They want to have illegal immigrants pouring into our country, bringing with them crime, tremendous amounts of crime.”

I decided to try to analyze this statement quantitatively. Indeed, one can calculate the probability that an illegal immigrant will commit a crime within the United States as follows. Let us denote crime (or criminal) by C, and illegal immigrant by ii. Then, by Bayes’ theorem, we have:

\boxed{P(C | ii) = \frac{P(ii | C) P(C)}{P(ii)}}

It is quite easy to find data associated with the various factors in this formula. For example, one finds that

  1. P(ii | C) = 0.21
  2. P(C) = 0.02
  3. P(ii) = 0.037
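The arithmetic behind Bayes’ theorem with these three inputs can be sketched as below. The function and variable names are hypothetical; the input values are the ones listed above.

```python
def bayes_posterior(p_evidence_given_hypothesis, p_hypothesis, p_evidence):
    """P(H | E) = P(E | H) * P(H) / P(E)."""
    return p_evidence_given_hypothesis * p_hypothesis / p_evidence


p_ii_given_c = 0.21   # P(ii | C): share of criminals who are illegal immigrants
p_c = 0.02            # P(C): share of the population who are criminals
p_ii = 0.037          # P(ii): share of the population who are illegal immigrants

p_c_given_ii = bayes_posterior(p_ii_given_c, p_c, p_ii)
```

Note that the result is only as reliable as the three inputs, each of which comes from a different source and population definition.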

Putting all of this together, we find that:

P(C|ii) = 0.1135 = 11.35 \%

That is, the probability that an illegal immigrant will commit a crime (of any type) while in the United States is a low 11.35%.

Therefore, Trump’s claim of “tremendous amounts of crime” being brought to the United States by illegal immigrants is incorrect.

Note that the numerical factors used above were obtained from:

  1. https://www.justice.gov/opa/pr/departments-justice-and-homeland-security-release-data-incarcerated-aliens-94-percent-all
  2. https://www.washingtontimes.com/news/2017/aug/1/immigrants-22-percent-federal-prison-population/
  3. https://en.wikipedia.org/wiki/Incarceration_in_the_United_States