2016 Real-Time Election Predictions

Further to my original post on using physics to predict the outcome of the 2016 US Presidential election, I have now written a cloud-based app in the Wolfram Cloud that pulls the most recent polling data from The HuffPost Pollster, which “tracks thousands of public polls to give you the latest data on elections, political opinions and more”. The app works in real time and applies my PDE-solver / machine-learning-based algorithm to predict the probability of a candidate winning a state, assuming the election were held tomorrow.

The app can be accessed by clicking the image below. (Note: If you obtain some type of server error, it means Wolfram’s server is busy; a refresh usually works. Also, results are only computed for states for which reliable polling data exists.)

I have written an automatic script on my server that runs this app on each state, and the results can be found by clicking here.

 

Will Donald Trump’s Proposed Immigration Policies Curb Terrorism in The US?

In recent days, Donald Trump proposed yet another iteration of his immigration policy, which is focused on “Keeping America Safe” as part of his plan to “Make America Great Again!”. In this latest iteration, in addition to suspending visas from countries with terrorist ties, he is also proposing an ideological test for those entering the US. As you can see in the BBC article, he is also fond of holding up bar graphs showing the number of refugees entering the US over a period of time, and somehow relates those numbers to terrorist activities in the US, or at least insinuates it.

Let’s look at the facts behind these proposals using the available data from 2005-2014. Specifically, we analyzed:

  1. The number of terrorist incidents per year from 2005-2014, from here (The Global Terrorism Database maintained by The University of Maryland).
  2. The Department of Homeland Security Yearbook of Immigration Statistics, available here. Specifically, we looked at Persons Obtaining Lawful Permanent Resident Status by Region and Country of Birth (2005-2014) and Refugee Arrivals by Region and Country of Nationality (2005-2014).

Given these datasets, we focused on countries/regions labeled as terrorist safe havens and state sponsors of terror based on the criteria outlined here.

We found the following.

First, looking at naturalized citizens, these computations yielded:

| Country | Correlation | Fraction of Variance Explained (r^2) |
|---|---|---|
| Afghanistan | 0.61169 | 0.37416 |
| Egypt | 0.26597 | 0.07074 |
| Indonesia | -0.66011 | 0.43574 |
| Iran | -0.31944 | 0.10204 |
| Iraq | 0.26692 | 0.07125 |
| Lebanon | -0.35645 | 0.12706 |
| Libya | 0.59748 | 0.35698 |
| Malaysia | 0.39481 | 0.15587 |
| Mali | 0.20195 | 0.04079 |
| Pakistan | 0.00513 | 0.00003 |
| Philippines | -0.79093 | 0.62557 |
| Somalia | -0.40675 | 0.16544 |
| Syria | 0.62556 | 0.39132 |
| Yemen | -0.11707 | 0.01371 |

In graphical form:

The highest correlations are 0.62556 and 0.61169, from Syria and Afghanistan respectively. The strongest anti-correlations are from Indonesia and The Philippines, at -0.66011 and -0.79093 respectively. None of the positive correlations exceeds 0.65, which indicates that there could be some relationship between the number of naturalized citizens from these particular countries and the number of terrorist incidents, but it is nowhere near conclusive. Further, looking at Syria, we see that the percentage of variance explained / coefficient of determination is 0.39132, which means that only about 39% of the year-to-year variation in the number of terrorist incidents in The United States can be predicted from a linear relationship with the number of naturalized citizens born in Syria.
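As a rough illustration of how a correlation and its coefficient of determination are related, here is a minimal Python sketch. The two yearly series are made-up placeholders, not the DHS or Global Terrorism Database figures, and the function name is my own.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Placeholder yearly counts (NOT real data): ten years of arrivals and incidents.
arrivals = [120, 135, 150, 110, 160, 170, 140, 180, 175, 190]
incidents = [9, 12, 10, 8, 14, 11, 13, 15, 12, 16]

r = pearson_r(arrivals, incidents)
r_squared = r ** 2  # fraction of variance explained by a linear fit
print(f"r = {r:.5f}, r^2 = {r_squared:.5f}")
```

Squaring a tabulated correlation reproduces the second column of the table up to rounding; for example, 0.62556 squared is approximately 0.3913.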

Second, looking at refugees, these computations yielded:

| Country | Correlation | Fraction of Variance Explained (r^2) |
|---|---|---|
| Afghanistan | 0.59836 | 0.35803 |
| Egypt | 0.66657 | 0.44432 |
| Iran | -0.29401 | 0.08644 |
| Iraq | 0.49295 | 0.24300 |
| Pakistan | 0.60343 | 0.36413 |
| Somalia | 0.14914 | 0.02224 |
| Syria | 0.56384 | 0.31792 |
| Yemen | -0.35438 | 0.12558 |
| Other | 0.54109 | 0.29278 |

In graphical form:

We see that the highest correlations are from Egypt (0.66657), Pakistan (0.60343), and Afghanistan (0.59836). This indicates there is some mild correlation between refugees from these countries and the number of terrorist incidents in The United States, but it is nowhere near conclusive. Further, the coefficients of determination for Egypt and Syria are 0.44432 and 0.31792 respectively. This means that in the case of Syrian refugees, for example, only 31.792% of the variation in terrorist incidents in The United States can be predicted from a linear relationship with the number of refugees arriving from Syria.

In conclusion, it is therefore unlikely that Donald Trump’s proposals would do anything to significantly curb the number of terrorist incidents in The United States. Further, repeatedly showing pictures like this:

at his rallies is doing nothing to address the issue at hand and is perhaps only serving as yet another fear tactic as has become all too common in his campaign thus far.

(Thanks to Hargun Singh Kohli, Honours B.A., LL.B. for the initial data mining and processing of the various datasets listed above.)

Some Thoughts on The US GDP

Here are some thoughts on the US GDP based on some data I’ve been looking at recently, mostly motivated by some Donald Trump supporters that have been criticizing President Obama’s record on the GDP and the economy. 

First, analyzing the real GDP’s average growth per year, we obtain that (based on a least squares regression analysis)

According to these calculations, President Clinton’s economic policies led to the best average GDP growth rate at $436 Billion / year. President Reagan and President Obama have almost identical average GDP growth rates in the neighbourhood of $320 Billion / year. However, an obvious caveat is that President Obama’s GDP record is still missing two years of data, so I will need to revisit these calculations in two years! Also, it should be noted that, historically, the US GDP has grown at an average of about $184 Billion / year. 
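For readers who want to reproduce this kind of estimate, here is a minimal sketch of a least-squares slope, which is what "average growth per year" means in this context. The GDP levels are invented placeholders in billions of dollars, not the actual series, and the function name is my own.

```python
def least_squares_line(xs, ys):
    """Slope and intercept of the ordinary least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Placeholder real-GDP levels (billions of dollars), one value per year. NOT real data.
years = [2001, 2002, 2003, 2004, 2005]
gdp = [1000.0, 1310.0, 1590.0, 1920.0, 2200.0]

slope, intercept = least_squares_line(years, gdp)
print(f"average growth: {slope:.1f} billion dollars per year")
```

The slope is the fitted average change in GDP per year over the chosen window, so restricting `years` to one presidency gives the per-president figure.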

The second point I wanted to address concerns several Trump supporters who keep comparing the average real GDP annual percentage change between President Reagan and President Obama. Although they cite the averages, they do not mention the standard deviations! Computing these, we find that:


Looking at these calculations, we find that Presidents Clinton and Obama had the most stable growth in year-to-year real GDP %. Presidents Bush and Reagan had highly unstable GDP growth, with President Bush’s being far worse than President Reagan’s. Further, Trump supporters and most Republicans seem quick to point out the mean of 3.637% figure associated with President Reagan, but the point is this is +/- 2.55%, which indicates high volatility in the GDP under President Reagan, which has not been the case under President Obama. 
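The "mean +/- standard deviation" comparison can be sketched as follows. Both series are invented placeholders chosen only to contrast a volatile record with a steady one; they are not the actual annual figures for any president.

```python
import statistics

def summarize(percent_changes):
    """Return (mean, sample standard deviation) of annual percent changes."""
    return statistics.mean(percent_changes), statistics.stdev(percent_changes)

# Placeholder annual real-GDP percent changes. NOT real data.
volatile = [-2.0, 5.0, 7.0, 4.0, 3.0, 4.0]
steady = [2.0, 2.5, 1.5, 2.0, 2.5, 1.5]

for name, series in (("volatile", volatile), ("steady", steady)):
    mean, spread = summarize(series)
    print(f"{name}: {mean:.3f}% +/- {spread:.3f}%")
```

A larger spread around the mean is what the text above calls unstable growth.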

Another observation I would like to point out is that very few people have been mentioning the fact that the annual real US GDP % is in fact correlated to that of other countries. Based on data from the World Bank, one can compute the following correlations: 


One sees that the correlation between the annual growth % of the US real GDP and that of Canada is 0.826, while for Estonia and The UK it is close to 0.7. Therefore, evidently, any President who claims that his policies will increase the GDP is not being truthful, since it is quite likely that these numbers also depend on those for other countries, which I am not entirely convinced a US President has complete control over!

My final observation is with respect to the quarterly GDP numbers. I have seen some articles in recent days, in addition to several television segments, in which Trump supporters continuously cite how much better Reagan’s quarterly GDP numbers were compared to Obama’s. We now show that this is actually not the case.

The problem is that most of the “analysts” are just looking at the raw data, which at face value doesn’t tell you much since, as expected, it fluctuates. Below, we analyze the quarterly GDP % data during the tenure of both Presidents Reagan and Obama, from 1982-1988 and 2010-2016 respectively, comparing data from the same length of time.

For Reagan, we obtain: 


For Obama, we obtain:


The only way to reasonably compare these two data sets is to analyze the rate at which the GDP % has increased in time. Since the data is nonlinear in time, this means we must calculate the derivatives at instants of time / each quarter. We first performed cubic spline interpolation to fit curves to these data sets, which gave extremely good results: 


We then numerically computed the derivative of these curves at each quarter and obtained: 

The dashed curves in the above plot are plots of the derivatives of each curve at each quarter. In terms of numbers, these were found to be: 


Summarizing the table above in graphical format, we obtain: 


As can be calculated easily, Obama has higher GDP quarterly growth numbers for 15/26 (57.69%) quarters. Therefore, even looking at the quarterly real GDP numbers, overall, President Obama outperforms President Reagan. 
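The spline-and-differentiate comparison described above can be sketched in plain Python. This version assumes a natural cubic spline (zero second derivative at both ends), which may differ from the boundary conditions used for the plots here, and the two quarterly series are invented placeholders rather than the actual GDP % data.

```python
def natural_spline_second_derivatives(xs, ys):
    """Second derivatives M_i of the natural cubic spline through (xs, ys)."""
    n = len(xs) - 1  # number of intervals
    h = [xs[i + 1] - xs[i] for i in range(n)]
    sub = [0.0] * (n + 1)
    diag = [1.0] * (n + 1)  # first and last rows enforce M_0 = M_n = 0
    sup = [0.0] * (n + 1)
    rhs = [0.0] * (n + 1)
    for i in range(1, n):
        sub[i] = h[i - 1]
        diag[i] = 2.0 * (h[i - 1] + h[i])
        sup[i] = h[i]
        rhs[i] = 6.0 * ((ys[i + 1] - ys[i]) / h[i] - (ys[i] - ys[i - 1]) / h[i - 1])
    # Thomas algorithm: forward elimination, then back substitution.
    for i in range(1, n + 1):
        w = sub[i] / diag[i - 1]
        diag[i] -= w * sup[i - 1]
        rhs[i] -= w * rhs[i - 1]
    m = [0.0] * (n + 1)
    m[n] = rhs[n] / diag[n]
    for i in range(n - 1, -1, -1):
        m[i] = (rhs[i] - sup[i] * m[i + 1]) / diag[i]
    return m

def spline_derivatives_at_knots(xs, ys):
    """First derivative of the natural cubic spline at every data point."""
    m = natural_spline_second_derivatives(xs, ys)
    n = len(xs) - 1
    slopes = []
    for i in range(n):
        h = xs[i + 1] - xs[i]
        slopes.append((ys[i + 1] - ys[i]) / h - h * (2.0 * m[i] + m[i + 1]) / 6.0)
    h = xs[n] - xs[n - 1]
    slopes.append((ys[n] - ys[n - 1]) / h + h * (m[n - 1] + 2.0 * m[n]) / 6.0)
    return slopes

# Placeholder quarterly GDP % series for two presidents. NOT real data.
quarters = list(range(8))
series_a = [1.0, 2.5, 0.5, 3.0, 2.0, 1.5, 4.0, 2.5]
series_b = [2.0, 2.2, 1.8, 2.6, 2.4, 2.9, 2.7, 3.1]

slopes_a = spline_derivatives_at_knots(quarters, series_a)
slopes_b = spline_derivatives_at_knots(quarters, series_b)
b_ahead = sum(1 for a, b in zip(slopes_a, slopes_b) if b > a)
print(f"series_b has the larger derivative in {b_ahead} of {len(quarters)} quarters")
```

Counting the quarters in which one derivative exceeds the other is the same tally as the 15/26 figure quoted above.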

Thanks to Hargun Singh Kohli, B.A. Honours, LL.B. for the data collection and processing part of this analysis. 

Physics, Data, and The 2016 US Elections

In this post, I will explore whether it is possible to use the Fokker-Planck partial differential equation / Kolmogorov Forward equation to predict the likely outcome of the 2016 Elections. It should be noted that up to a certain point below, the work is fairly general, so it could be applied to a wide variety of similar scenarios!

The Fokker-Planck / forward Kolmogorov equation for constant \mu and \sigma is:

\frac{\partial}{\partial t} p(x,t) = -\mu \frac{\partial}{\partial x} p(x,t) + \frac{\sigma^2}{2} \frac{\partial^2}{\partial x^2} p(x,t).

We will solve the Dirichlet problem p(0,t) = p(1,t) = 0, with initial condition p(x,0) = \phi(x), which we will elaborate upon below.

We therefore seek separated solutions of the form p(x,t) = X(x) T(t). Indeed, for an arbitrary separation constant -\lambda, we obtain the following ODEs:

T'(t) = -\lambda T(t), which of course, has the solution

A \exp(-\lambda t),

where A is an arbitrary real constant.

The other ODE results in a  Sturm-Liouville problem corresponding to the boundary conditions X(0) = X(1) = 0:

-\mu X'(x) + \frac{\sigma^2}{2} X''(x) = -\lambda X(x).
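For reference, both ODEs come from substituting the separated form into the PDE and dividing by X(x) T(t); since one side depends only on t and the other only on x, both must equal a constant, written here as -\lambda:

```latex
p(x,t) = X(x)\,T(t)
\;\Longrightarrow\;
X(x)\,T'(t) = -\mu X'(x)\,T(t) + \frac{\sigma^2}{2} X''(x)\,T(t),
\qquad
\frac{T'(t)}{T(t)} = \frac{-\mu X'(x) + \frac{\sigma^2}{2} X''(x)}{X(x)} = -\lambda .
```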

It is easy to show that this ODE for the corresponding boundary conditions will have non-trivial solutions if \lambda is chosen such that

\lambda > \frac{\mu^2}{2\sigma^2}.

The roots of the characteristic polynomial corresponding to the S-L problem above have the form:

r_{1,2} = \frac{\mu }{\sigma ^2} \pm \frac{\sqrt{\mu ^2-2 \lambda  \sigma ^2}}{\sigma ^2}.

Therefore, as long as \lambda is chosen such that the condition above is met, these roots will be complex, and our eigenfunctions will be oscillatory, which is what we require for non-trivial solutions to our problem. In that case, the roots take the form:

r_{1,2} = \frac{\mu}{\sigma^2} \pm \frac{i}{\sigma^2} \sqrt{2\lambda \sigma^2 - \mu^2}.

Therefore, the eigenfunctions will have the form:

X(x) = \exp \left( \frac{\mu x}{\sigma^2} \right) \left[C \cos \left( \frac{\sqrt{2 \lambda \sigma^2 - \mu^2}}{\sigma^2} x \right) + D \sin\left( \frac{\sqrt{2 \lambda \sigma^2 - \mu^2}}{\sigma^2} x \right) \right].

Applying the boundary condition X(0) = 0, we observe that C = 0, which means that the boundary condition X(1) = 0 requires that:

\sin\left( \frac{\sqrt{2 \lambda \sigma^2 - \mu^2}}{\sigma^2} \right) = 0.

This, of course, means that

\frac{\sqrt{2\lambda \sigma^2 - \mu^2}}{\sigma^2} = m \pi,

where m is a positive integer.

Squaring and solving for \lambda, the eigenvalues are found to be:

\lambda_{m} = \frac{\mu^2 + \pi^2 m^2 \sigma^4}{2 \sigma^2},

which holds as long as \sigma > 0 and m \geq 1, m \in \mathbb{Z}. Each \lambda_m satisfies the restriction on \lambda above.

Our time-dependent probability distribution finally takes the form:

\boxed{p(x,t) = \sum_{m \geq 1} A_m \exp \left(-\frac{\mu^2 + \pi^2 m^2 \sigma^4}{2 \sigma^2} t \right) \exp \left( \frac{\mu x}{\sigma^2} \right) \sin(m \pi x)}.

The key thing now is to find the coefficients A_m in the above series, which depend solely on our choice of initial function \phi(x), as we mentioned from the onset.

We will assume a lognormal distribution. The justification is that all poll numbers are positive and, based on initial numerical probability distribution fits, the data fits a lognormal distribution quite well. The lognormal distribution (normalized from 0 to 1) has the form:

\frac{\sqrt{\frac{2}{\pi }} e^{-\frac{(\mu -\log (x))^2}{2 \sigma ^2}}}{\sigma x \text{erfc}\left(\frac{\mu }{\sqrt{2} \sigma }\right)}.

Setting t = 0 and multiplying both sides by e^{-\mu x/\sigma^2} leaves an ordinary Fourier sine series for \phi(x) e^{-\mu x/\sigma^2}. Therefore, using the theory of orthogonal functions, we find that the coefficients A_m are given by:

\boxed{A_m = 2\frac{\sqrt{\frac{2}{\pi }}}{\sigma \text{erfc}\left(\frac{\mu }{\sqrt{2} \sigma }\right)} \int_{0}^{1} \frac{e^{-\frac{2 \mu x +(\mu -\log (x))^2}{2 \sigma ^2}} \sin (\pi m x)}{x} dx},

which, as can be confirmed, has no closed-form expression. Further, p(x,t), being a probability distribution, must be normalized such that:

\int_{0}^{1} \int_{0}^{T} p(x,t) dt dx = 1,

where T indicates the time at which we would like to obtain the probability of a candidate’s poll numbers being within a certain range of values.
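Here is a minimal numerical sketch of the series solution. It assumes the eigenfunctions e^{\mu x/\sigma^2} \sin(m \pi x) and eigenvalues \lambda_m = \mu^2/(2\sigma^2) + m^2 \pi^2 \sigma^2/2 that follow from the characteristic roots r_{1,2} above together with X(0) = X(1) = 0. The function names, the midpoint-rule quadrature, and the demo parameters (mu = 0.5, sigma = 1.0) are illustrative assumptions, not the code or values used for the results in this post, and the normalization step is left out.

```python
import math

def eigenvalue(m, mu, sigma):
    """lambda_m = mu^2 / (2 sigma^2) + m^2 pi^2 sigma^2 / 2."""
    return mu ** 2 / (2.0 * sigma ** 2) + (m * math.pi * sigma) ** 2 / 2.0

def fourier_coefficients(phi, mu, sigma, n_terms, n_grid=2000):
    """A_m = 2 * int_0^1 phi(x) exp(-mu x / sigma^2) sin(m pi x) dx (midpoint rule)."""
    dx = 1.0 / n_grid
    coeffs = []
    for m in range(1, n_terms + 1):
        total = 0.0
        for k in range(n_grid):
            x = (k + 0.5) * dx  # midpoints avoid the endpoints x = 0 and x = 1
            total += phi(x) * math.exp(-mu * x / sigma ** 2) * math.sin(m * math.pi * x)
        coeffs.append(2.0 * total * dx)
    return coeffs

def density(x, t, coeffs, mu, sigma):
    """Evaluate the truncated series for p(x, t)."""
    series = 0.0
    for m, a_m in enumerate(coeffs, start=1):
        series += a_m * math.exp(-eigenvalue(m, mu, sigma) * t) * math.sin(m * math.pi * x)
    return math.exp(mu * x / sigma ** 2) * series

# Demo with illustrative parameters: the initial condition is the first eigenfunction,
# so the expansion should return A_1 close to 1 and all other coefficients close to 0.
MU, SIGMA = 0.5, 1.0

def phi_demo(x):
    return math.exp(MU * x / SIGMA ** 2) * math.sin(math.pi * x)

coeffs = fourier_coefficients(phi_demo, MU, SIGMA, n_terms=5)
print([round(c, 6) for c in coeffs])
```

Any initial density on (0, 1), such as the truncated lognormal above, can be passed as `phi`; sharply peaked densities need more terms and a finer grid.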

Now, for the actual numbers!

Based on log-normal fits to the polling data obtained from Huffington Post Pollster, we found that:

  1. Hillary Clinton: (\mu, \sigma) = (-0.8915, 0.0664),
  2. Donald Trump: (\mu, \sigma) = (-1.0025, 0.0840).

(Recall, that for log-normal distributions \mu and \sigma represent the log mean, and log standard deviation respectively. Thanks to Hargun Singh Kohli for doing these statistical calculations.)

For Hillary Clinton, based on integrating p(x,t) above, we found that the probability of her polling above 40% in the near future is 58.8369%. The probability of her polling between 40% and 50% in the near future is 26.13%.

For Donald Trump, based on integrating p(x,t) above, we found that the probability of him polling above 40% in the near future is 48.0882%. The probability of him polling between 40% and 50% in the near future is 19.8545%.

Below, we present animations of the predicted time evolution of both candidates’ probability density functions.

[Animation: trumpPlots_newest]

[Animation: ClintonPlots_newest]

 

Further, we are perhaps most interested in the expectation value of p(x,t). We find that for Hillary Clinton, it is 44.7%, while for Donald Trump it is 42.16%. This shows that based on this model, Hillary Clinton’s lead is not as large as most of the polling is suggesting thus far.

Just for the sake of completion, here are 3D plots showing the space-time evolution of the PDF of each candidate:

[Figures: dtrumppdf1, hillaryclintonpdf1]

Attempts at a General Einstein Equation for an Arbitrary FLRW Cosmology

I tried to derive a general Einstein field equation for an arbitrary FLRW cosmology. That is, one that can handle any of the possible spatial curvatures: hyperbolic, spherical, or flat. Deriving the equation was easy, solving it was not! It ends up being a nonlinear, second-order ODE, with singularities at a=0, which turns out to be the Big Bang singularity, which obviously is of physical significance. Anyways, here’s a log of my notebook, showing the attempts. More to follow! 

Breaking Down The Data of Victims of Police Violence

Abstract:

  1. Black victims are younger than those from other races.
  2. Black victims have a tendency to be unarmed / not known whether they are armed in confrontations with police.
  3. Black victims are more likely to be killed by police gunshots than victims from other races.
  4. Black and White victims seem to be on average, distributed amongst all 50 states.
  5. White victims also have a tendency to be unarmed or it is not known by police at the time whether they are armed/unarmed at the time of the confrontation with police.
  6. The majority of White victims are males.

With the recent shooting tragedies in The United States over this past week, a great deal of data analysis is being presented through various news outlets, blogs, etc… as can be seen from doing a simple Google News search. The problem that most people seem to agree on is that the amount of data concerning the race of the victims is surprisingly scarce, so it is somewhat difficult to draw any firm conclusions.

I therefore decided to look at this problem, and was fortunate to come across the website Mapping Police Violence, a nonpartisan site that has data concerning victims of police violence from 2013 to the present. The actual dataset is available here.

From this data, we applied a principal components analysis (PCA) as our main unsupervised learning method to uncover deep patterns within the dataset. We specifically looked at a victim’s age, gender, race, state, cause of death, and whether or not he/she was armed/unarmed. In particular, we had to assign numerical values to apply PCA. For simplicity in interpretation, we assigned the following boolean variables:

  1. Gender: Male = 1, Other = 0
  2. Race: Black = 1, Other = 0
  3. Cause: Gunshot = 1, Other = 0
  4. Unarmed: Unarmed/not known = 1, Other = 0

It turns out that these definitions will simplify the interpretation of our PCA analysis greatly.
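The 0/1 encoding above can be sketched as a small function. The dictionary keys and the category strings ("Male", "Black", "Gunshot", "Unarmed", "Unknown") are hypothetical stand-ins for whatever labels the dataset actually uses.

```python
def encode_victim(record):
    """Map one record (a dict of strings) to the 0/1 indicators listed above."""
    return {
        "gender": 1 if record["gender"] == "Male" else 0,
        "race": 1 if record["race"] == "Black" else 0,
        "cause": 1 if record["cause"] == "Gunshot" else 0,
        # "Unarmed" and "Unknown" are both coded as 1, matching the Unarmed/not known rule.
        "unarmed": 1 if record["armed"] in ("Unarmed", "Unknown") else 0,
    }

# Hypothetical example record, not taken from the dataset.
example = {"gender": "Male", "race": "White", "cause": "Taser", "armed": "Unknown"}
print(encode_victim(example))
```

With this coding, a positive loading on an indicator points toward the category coded as 1, which is what makes the principal components easier to read.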

Now, looking at the PCA algorithm output, we see that the first 5 principal components explain 90.28% of the variance:

[Figure: compimp4]
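A figure such as "the first 5 principal components explain 90.28% of the variance" is a cumulative ratio of PCA eigenvalues. The sketch below shows the arithmetic with invented eigenvalues for six standardized variables; they are not the values from this analysis.

```python
def cumulative_explained_variance(eigenvalues):
    """Cumulative fraction of total variance, taking components in decreasing order."""
    total = sum(eigenvalues)
    running = 0.0
    cumulative = []
    for value in sorted(eigenvalues, reverse=True):
        running += value
        cumulative.append(running / total)
    return cumulative

# Placeholder eigenvalues of a 6-variable correlation matrix (they sum to 6). NOT real output.
eigenvalues = [1.5, 1.2, 1.1, 0.9, 0.8, 0.5]
cumulative = cumulative_explained_variance(eigenvalues)
print(f"first 5 components explain {100 * cumulative[4]:.2f}% of the variance")
```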

The eigenvectors from the PCA were found to be:

[Figure: compimp5]

One sees from this, for example, that PC1 is largely a measure of a victim’s armed/unarmed factor and cause, PC2 is largely a measure of a victim’s age and race, PC3 is largely a measure of a victim’s state/location, PC4 is largely a measure of a victim’s gender, and PC5 is largely a measure of a victim’s race and age.

We can plot these PCs against each other to visualize the patterns in the dataset.

[Figure: boolplot1]

In fact, performing K-means clustering on these 5 principal components we obtain the following cluster plot:

[Figure: Rplot]

From our analysis, we saw that cluster 1 (the cluster of red points) contains only Black victims, while cluster 3 contains only non-Black victims (Asians, Hispanics, Native Americans, Pacific Islander and White). One sees that cluster 3 is largely skewed towards a higher score on PC1, which indicates that these victims were largely unarmed or it was not known whether they were armed at the time of being confronted by police. Black victims (in cluster 1) seem to have mostly positive scores along PC1, but skewed towards negative scores along PC2. This indicates that Black victims are younger than those from other races, and are more likely to be victims of gunshots from police as well compared to victims from other races. The positive scores along PC1 also suggest that Black victims have a tendency to be unarmed.
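For readers unfamiliar with K-means, here is a minimal pure-Python sketch of Lloyd's algorithm on 2D scores. Starting from the first k points is an assumption made for reproducibility, and the points are invented placeholders rather than actual principal component scores.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm with the first k points as starting centroids (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = [sum(coord) / len(members) for coord in zip(*members)]
    return labels, centroids

# Placeholder (PC1, PC2) scores forming two well-separated groups. NOT real output.
scores = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0), (1.0, 0.0), (11.0, 10.0)]
labels, centroids = kmeans(scores, k=2)
print(labels)
```

Once labels are assigned, checking which original categories fall in each cluster is how a statement like "cluster 1 contains only Black victims" would be verified.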

Now, comparing PC2 vs. PC3, we obtain the following biplot:

[Figure: pc2vspc3]

Looking at the K-means clustering once again, we obtain:

[Figure: pc2pc3cluster]

One sees from these two plots that Black victims seem to be, on average, equally distributed amongst all 50 states, as their data points are spread almost evenly along PC3. We see that the largest loading (eigenvector component) for PC3 is 0.949, which indicates that PC3 is in fact a measure of a victim’s state/geographic location.

We were not able to ascertain any further patterns involving Black victims.

We now apply PCA to the same dataset, but this time we switch the boolean variable for race to have a value of 1 for White victims, to ascertain any patterns among White victims.

The eigenvectors from the PCA are:

[Figure: whitevictimspca]

As before, we apply a k-means clustering on these 5 principal components. We discovered that out of our 5 clusters, cluster #4 had only White victims. Plotting PC1 vs. PC2, we obtain:

[Figure: whitepc1pc2]

As was alluded to before, there is a surprising number of white victims that are killed by police while being unarmed/not known whether they were armed at the time of the police confrontation, which is evidenced by the cluster of blue points having a high score along PC1.

We also plot PC1 vs PC4 and obtain:

[Figure: whitepc1pc4]

From this plot, we see that a majority of White victims have a high score along PC4. From the principal components table above, we see that this indicates that the majority of White victims are males.

Plotting PC3 vs. PC4, we obtain:

[Figure: whitepc3pc4]

One sees that the White victims are largely spread about evenly along PC3. We see from above that PC3 is largely a measure of a victim’s state. We therefore conclude that like Black victims, White victims seem to also be distributed about evenly amongst all 50 states.

We were unable to ascertain any further patterns for White victims.

We therefore conclude the following:

  1. Black victims are younger than those from other races.
  2. Black victims have a greater tendency to be unarmed / not known whether they are armed in confrontations with police.
  3. Black victims are more likely to be killed by police gunshots than victims from other races.
  4. Black and White victims seem to be on average, distributed amongst all 50 states.
  5. White victims are largely unarmed or it is not known by police at the time whether they are armed/unarmed at the time of the confrontation with police.
  6. The majority of White victims are males.

 

A Really Quick Derivation of The Cauchy-Riemann Equations

Here is a really quick derivation of the Cauchy-Riemann equations of complex analysis.

Consider a function of a complex variable, z, where z = x + iy, such that:

f(z) = u(z) + i v(z) = u(x+ iy) + i v(x+iy),

where u and v are real-valued functions.

An analytic function is one that is expressible as a power series in z.
That is,

f(z) = \sum_{n=0}^{\infty} a_{n} z^{n}, \quad a_{n} \in \mathbb{C}.

Then,

u(x+iy) + i v(x+iy) = \sum_{n=0}^{\infty} a_{n} (x+iy)^{n}.

We formally differentiate this equation as follows. First, differentiating with respect to x, we obtain

u_{x} + i v_{x} = \sum_{n=1}^{\infty} n a_{n} \left(x+iy\right)^{n-1}.

Differentiating with respect to y, we obtain

u_{y} + i v_{y} = i \sum_{n=1}^{\infty} n a_{n} \left(x + i y\right)^{n-1}.

Multiplying the latter equation by -i and equating to the first result, we obtain

-iu_{y} + v_{y} = \sum_{n=1}^{\infty} na_{n} \left(x+iy\right)^{n-1} = u_{x} + i v_{x}.

Comparing imaginary and real parts of these equations, we obtain

\boxed{u_{x} = v_{y}, \quad u_{y} = -v_{x}},

which are the famous Cauchy-Riemann equations.
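As a quick numerical sanity check of the result, the sketch below estimates the partial derivatives by central differences and evaluates u_x - v_y and u_y + v_x. The function name, step size, and sample point are my own choices; exp(z) is used as an analytic example and the complex conjugate as a non-analytic one.

```python
import cmath

def cauchy_riemann_residuals(f, x, y, h=1e-6):
    """Central-difference estimates of (u_x - v_y, u_y + v_x) at the point x + iy."""
    d_dx = (f(complex(x + h, y)) - f(complex(x - h, y))) / (2.0 * h)  # u_x + i v_x
    d_dy = (f(complex(x, y + h)) - f(complex(x, y - h))) / (2.0 * h)  # u_y + i v_y
    return d_dx.real - d_dy.imag, d_dy.real + d_dx.imag

# exp(z) is analytic, so both residuals should be close to zero.
print(cauchy_riemann_residuals(cmath.exp, 0.3, -0.7))

# The complex conjugate is not analytic: u = x, v = -y, so u_x - v_y = 2.
print(cauchy_riemann_residuals(lambda z: z.conjugate(), 0.3, -0.7))
```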