## Will Donald Trump’s Proposed Immigration Policies Curb Terrorism in The US?

In recent days, Donald Trump proposed yet another iteration of his immigration policy, which is focused on “Keeping America Safe” as part of his plan to “Make America Great Again!” In this latest iteration, in addition to suspending visas from countries with terrorist ties, he is also proposing an ideological test for those entering the US. As you can see in the BBC article, he is also fond of holding up bar graphs showing the number of refugees entering the US over a period of time, and somehow relates that to terrorist activities in the US, or at least insinuates it.

Let’s look at the facts behind these proposals using the available data from 2005-2014. Specifically, we analyzed:

1. The number of terrorist incidents per year from 2005-2014, taken from the Global Terrorism Database maintained by The University of Maryland.
2. The Department of Homeland Security Yearbook of Immigration Statistics, available here. Specifically, we looked at Persons Obtaining Lawful Permanent Resident Status by Region and Country of Birth (2005-2014) and Refugee Arrivals by Region and Country of Nationality (2005-2014).

Given these datasets, we focused on countries/regions labeled as terrorist safe havens and state sponsors of terror based on the criteria outlined here.

We found the following.

First, looking at naturalized citizens, these computations yielded:

| Country | Correlation | Fraction of Variance Explained (r²) |
|---|---|---|
| Afghanistan | 0.61169 | 0.37416 |
| Egypt | 0.26597 | 0.07074 |
| Indonesia | -0.66011 | 0.43574 |
| Iran | -0.31944 | 0.10204 |
| Iraq | 0.26692 | 0.07125 |
| Lebanon | -0.35645 | 0.12706 |
| Libya | 0.59748 | 0.35698 |
| Malaysia | 0.39481 | 0.15587 |
| Mali | 0.20195 | 0.04079 |
| Pakistan | 0.00513 | 0.00003 |
| Philippines | -0.79093 | 0.62557 |
| Somalia | -0.40675 | 0.16544 |
| Syria | 0.62556 | 0.39132 |
| Yemen | -0.11707 | 0.01371 |

In graphical form:

The highest correlations are 0.62556 and 0.61169, from Syria and Afghanistan respectively. The strongest anti-correlations were from Indonesia and The Philippines, at -0.66011 and -0.79093 respectively. None of the positive correlations exceeds 0.65, which indicates that there could be some relationship between the number of naturalized citizens from these particular countries and the number of terrorist incidents, but it is nowhere near conclusive. Further, looking at Syria, we see that the fraction of variance explained (the coefficient of determination) is 0.39132, which means that only about 39% of the variation in the number of terrorist incidents can be predicted from the relationship between where a naturalized citizen is born and the number of terrorist incidents in The United States.
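The link between the two columns of the table can be checked directly: for a simple linear relationship, the fraction of variance explained is the square of the correlation coefficient. The following is a minimal sketch that uses three rows copied from the table above; the function name `variance_explained` is only an illustrative choice, and small differences in the last digit are expected because the published correlations are already rounded.

```python
def variance_explained(r):
    """Coefficient of determination for a simple linear fit: r squared."""
    return r * r

# (correlation, reported fraction of variance explained), copied from the table
reported = {
    "Afghanistan": (0.61169, 0.37416),
    "Philippines": (-0.79093, 0.62557),
    "Syria": (0.62556, 0.39132),
}

for country, (r, reported_r2) in reported.items():
    # Print the recomputed value next to the one reported in the table
    print(f"{country}: r = {r:+.5f}, r^2 = {variance_explained(r):.5f}, "
          f"reported = {reported_r2:.5f}")
```

Note that the sign of the correlation is lost when squaring, which is why the strongly anti-correlated Philippines row has the largest fraction of variance explained.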

Second, looking at refugees, these computations yielded:

| Country | Correlation | Fraction of Variance Explained (r²) |
|---|---|---|
| Afghanistan | 0.59836 | 0.35803 |
| Egypt | 0.66657 | 0.44432 |
| Iran | -0.29401 | 0.08644 |
| Iraq | 0.49295 | 0.24300 |
| Pakistan | 0.60343 | 0.36413 |
| Somalia | 0.14914 | 0.02224 |
| Syria | 0.56384 | 0.31792 |
| Yemen | -0.35438 | 0.12558 |
| Other | 0.54109 | 0.29278 |

In graphical form:

We see that the highest correlations are from Egypt (0.66657), Pakistan (0.60343), and Afghanistan (0.59836). This indicates there is some mild correlation between refugees from these countries and the number of terrorist incidents in The United States, but it is nowhere near conclusive. Further, the coefficients of determination for Egypt and Syria are 0.44432 and 0.31792 respectively. This means that in the case of Syrian refugees, for example, only about 31.8% of the variation in terrorist incidents in The United States can be predicted from the relationship between a refugee’s country of origin and the number of terrorist incidents in The United States.

In conclusion, it is unlikely that Donald Trump’s proposals would do anything to significantly curb the number of terrorist incidents in The United States. Further, repeatedly showing pictures like this:

at his rallies is doing nothing to address the issue at hand and is perhaps only serving as yet another fear tactic as has become all too common in his campaign thus far.

(Thanks to Hargun Singh Kohli, Honours B.A., LL.B. for the initial data mining and processing of the various datasets listed above.)

Note, further to the results of this article, I was recently made aware of this excellent article from The WSJ, which I have summarized below:

## Some Thoughts on The US GDP

Here are some thoughts on the US GDP based on some data I’ve been looking at recently, mostly motivated by some Donald Trump supporters who have been criticizing President Obama’s record on the GDP and the economy.

First, analyzing the real GDP’s average growth per year, we obtain that (based on a least squares regression analysis)

According to these calculations, President Clinton’s economic policies led to the best average GDP growth rate at \$436 Billion / year. President Reagan and President Obama have almost identical average GDP growth rates in the neighbourhood of \$320 Billion / year. However, an obvious caveat is that President Obama’s GDP record is still missing two years of data, so I will need to revisit these calculations in two years! Also, it should be noted that, historically, the US GDP has grown at an average of about \$184 Billion / year.

The second point I wanted to address concerns several Trump supporters who keep comparing the average real GDP annual percentage change between President Reagan and President Obama. Although they cite the averages, they do not mention the standard deviations! Computing these, we find that:

Looking at these calculations, we find that Presidents Clinton and Obama had the most stable year-to-year growth in real GDP %. Presidents Bush and Reagan had highly unstable GDP growth, with President Bush’s being far worse than President Reagan’s. Further, Trump supporters and most Republicans seem quick to point out the mean of 3.637% associated with President Reagan, but this figure is +/- 2.55%, which indicates high volatility in the GDP under President Reagan. This has not been the case under President Obama.

Another observation I would like to point out is that very few people have mentioned that the annual real US GDP % is in fact correlated with that of other countries. Based on data from the World Bank, one can compute the following correlations:

One sees that the correlation between the annual growth % of the real GDP of the US and that of Canada is 0.826, while for Estonia and The UK it is roughly 0.7. Therefore, any President who claims that his policies alone will increase the GDP is not being truthful, since it is quite likely that these numbers also depend on those of other countries, over which I am not entirely convinced a US President has complete control!

My final observation is with respect to the quarterly GDP numbers. I have seen some articles in recent days, in addition to several television segments, in which Trump supporters continuously cite how much better Reagan’s quarterly GDP numbers were compared to Obama’s. We now show that in actuality this is not the case.

The problem is that most of the “analysts” are just looking at the raw data, which at face value doesn’t tell you much since, as expected, it fluctuates. Below, we analyze the quarterly GDP % data during the tenure of both Presidents Reagan and Obama, from 1982-1988 and 2010-2016 respectively, comparing data over the same length of time.

For Reagan, we obtain:

For Obama, we obtain:

The only way to reasonably compare these two data sets is to analyze the rate at which the GDP % has increased in time. Since the data is nonlinear in time, this means we must calculate the derivative at each instant of interest, that is, at each quarter. We first performed cubic spline interpolation to fit curves to these data sets, which gave extremely good results:

We then numerically computed the derivative of these curves at each quarter and obtained:

The dashed curves in the above plot are plots of the derivatives of each curve at each quarter. In terms of numbers, these were found to be:

Summarizing the table above in graphical format, we obtain:

As can be calculated easily, Obama has higher GDP quarterly growth numbers for 15/26 (57.69%) quarters. Therefore, even looking at the quarterly real GDP numbers, overall, President Obama outperforms President Reagan.
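The spline-and-derivative procedure described above can be sketched in a few lines. This is a minimal illustration, not the code used for the figures: it assumes a natural cubic spline (zero second derivative at both ends; the boundary condition actually used is not stated), the function name `natural_spline_knot_derivatives` is hypothetical, and the two short series are made-up placeholders rather than the real quarterly GDP data.

```python
def natural_spline_knot_derivatives(x, y):
    """Return S'(x_i) at every knot of the natural cubic spline through (x, y)."""
    n = len(x) - 1                      # number of intervals
    h = [x[i + 1] - x[i] for i in range(n)]
    slope = [(y[i + 1] - y[i]) / h[i] for i in range(n)]
    M = [0.0] * (n + 1)                 # second derivatives; ends stay zero
    if n >= 2:
        # Tridiagonal system for the interior second derivatives M_1..M_{n-1}
        b = [2.0 * (h[i - 1] + h[i]) for i in range(1, n)]
        c = [h[i] for i in range(1, n)]
        d = [6.0 * (slope[i] - slope[i - 1]) for i in range(1, n)]
        for k in range(1, n - 1):       # forward elimination (Thomas algorithm)
            w = h[k] / b[k - 1]
            b[k] -= w * c[k - 1]
            d[k] -= w * d[k - 1]
        m = [0.0] * (n - 1)
        m[-1] = d[-1] / b[-1]
        for k in range(n - 3, -1, -1):  # back substitution
            m[k] = (d[k] - c[k] * m[k + 1]) / b[k]
        M[1:n] = m
    # Derivative at the left end of each interval, then at the final knot
    derivs = [slope[i] - h[i] * (2.0 * M[i] + M[i + 1]) / 6.0 for i in range(n)]
    derivs.append(slope[n - 1] + h[n - 1] * (M[n - 1] + 2.0 * M[n]) / 6.0)
    return derivs

# Made-up placeholder series (NOT real GDP data), one value per quarter
quarters = [0, 1, 2, 3, 4, 5]
series_a = [1.0, 2.5, 1.8, 3.0, 2.2, 2.9]
series_b = [0.5, 1.9, 2.4, 2.1, 3.3, 3.0]

rates_a = natural_spline_knot_derivatives(quarters, series_a)
rates_b = natural_spline_knot_derivatives(quarters, series_b)

# Count the quarters in which series_b is changing faster than series_a
faster = sum(1 for ra, rb in zip(rates_a, rates_b) if rb > ra)
print(f"series_b has the higher rate of change in {faster} of {len(quarters)} quarters")
```

The final count is the same kind of tally as the 15/26 figure quoted above: a quarter-by-quarter comparison of the two derivative curves.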

Thanks to Hargun Singh Kohli, B.A. Honours, LL.B. for the data collection and processing part of this analysis.

## 2016 Michigan Primary Predictions

Using the Monte Carlo techniques I have described in earlier posts, I ran several simulations today to try to predict who will win the 2016 Michigan primaries. Here is what I found:

For the Republican primaries, I predict:

Trump: 89.64% chance of winning

Cruz: 5.01% chance of winning

Kasich: 3.29% chance of winning

Rubio: 2.06% chance of winning

The following plot is a histogram of the simulations:

## The Effect of Individual State Election Results on The National Election

A short post by me today. I wanted to look at which states are important in winning the national election. Looking at the last 14 presidential elections, I generated the following correlation plot:

For those not familiar with how correlation plots work, the number bar on the right-hand side of the graph indicates the correlation between a state on the left side and a state at the top, with the last row and column indicating the national presidential election winner. Dark blue circles, representing a correlation close to 1, indicate a strong relationship between the two variables. Orange-to-red circles, representing a correlation close to -1, indicate a strong anti-correlation. Almost white circles indicate no correlation between the two variables.

For example, one can see there is a very strong correlation between who wins Nevada and the winner of the national election. Indeed, Nevada has picked the winner in 13 of the last 14 presidential elections. The plot also shows the correlation between winning pairs of states. For example, candidates who win Alabama have a good chance of winning Mississippi or Wyoming, but virtually no chance of winning California.

This could serve as a potential guide in determining which states are extremely important to win during the election season!

## Do More Gun Laws Prevent Gun Violence?

Update: March 16, 2018: I have received quite a few comments about my critique of Volokh’s WaPo article. Here is a summary of my reply to those comments:

The main point that I made and demonstrate below is that a correlation coefficient is only useful as a measure of the linearity of the relationship between the two variables being compared. ALL of the correlations that Volokh computes are close to zero: 0.032 between the homicide rate including gun accidents and the Brady score, 0.065 between the intentional homicide rate and the Brady score, 0.0178 between the homicide rate including gun accidents and the National Journal score, and 0.0511 between the intentional homicide rate and the National Journal score. All of these numbers are completely *useless*. You cannot conclude anything from them, other than that the relationship between the homicide rate (including or not including gun accidents) and the Brady score, if there is one, is highly nonlinear. Since the relationship is nonlinear, I have investigated it using data science methodologies such as regression trees.

Article begins below:

Abstract:

1. The number and quality of gun-control laws a state has drastically affect the number of gun-related deaths.
2. Other factors like mean household income play a smaller role in the number of gun-related deaths.
3. Factors like the amount of money a state spends on mental-health care have a negligible effect on the number of gun-related deaths. This point is quite important, as a number of policy-makers consistently argue that the focus needs to be on the mentally ill and that this will curb the number of gun-related deaths.

Contents:

1. Critique of Recent Gun-Control Opposition Studies
2. A more correct way to look at the Gun Deaths data using data science methodologies.

A Critique of Recent Gun-Control Opposition Studies

In light of the recent tragedy in Oregon, which is part of a disturbing trend of increasing gun violence in The United States, we are once again in the aftermath where President Obama and most Democrats are advocating for more gun laws that they claim would help decrease gun violence, while their Republican counterparts are, as usual, arguing the precise opposite. Indeed, two very simplified “studies” presented in the media thus far have been cited frequently by gun advocates:

I have singled out these two examples, but most of the studies claiming to “do statistics” follow a similar methodology. It should be noted that these studies are extremely simplified: they compute correlations while looking at only two factors (the gun death rate and a state’s “Brady grade”). As we show below, answering the question of interest, and distinguishing causation from correlation, requires considering several state-dependent factors and hence deeper statistical learning methodologies, of which NONE of the Second Amendment advocates seem to be aware.

The reason one cannot deduce anything significant from correlations, as is done in Volokh’s article, is that correlation coefficients are good “summary statistics” but hardly tell you anything deep about the data you are working with. For example, in his article Volokh uses MS Excel to compute the correlation between a pair of variables, but Excel uses the Pearson correlation coefficient, which is essentially a measure of the linearity of the relationship between two variables. If the underlying data exhibits a nonlinear relationship, the correlation coefficient can return a small value, but this in no way means there is no relationship in the data; it just means the relationship is not linear. Similarly, other correlation coefficient computations make other assumptions about the data, such as that it comes from a normal distribution, which is strange to assume from the outset. (There is also the more technical issue that a state’s Brady grade is not exactly a random variable. Measuring the correlation between a supposed random variable (the number of homicides) and a non-random variable is not exactly a sound idea.)

A simple example of where the correlation calculation fails is the following set of data. Consider two variables, x and y, taking the values:

| x | y |
|---|---|
| -1.0000 | 0.2420 |
| -0.9000 | 0.2661 |
| -0.8000 | 0.2897 |
| -0.7000 | 0.3123 |
| -0.6000 | 0.3332 |
| -0.5000 | 0.3521 |
| -0.4000 | 0.3683 |
| -0.3000 | 0.3814 |
| -0.2000 | 0.3910 |
| -0.1000 | 0.3970 |
| 0 | 0.3989 |
| 0.1000 | 0.3970 |
| 0.2000 | 0.3910 |
| 0.3000 | 0.3814 |
| 0.4000 | 0.3683 |
| 0.5000 | 0.3521 |
| 0.6000 | 0.3332 |
| 0.7000 | 0.3123 |
| 0.8000 | 0.2897 |
| 0.9000 | 0.2661 |
| 1.0000 | 0.2420 |

If one computes the correlation between x and y, one obtains a correlation coefficient of zero! (Try it!) A naive conclusion would be that there is therefore no dependence between x and y. But if one now makes a scatter plot of x and y, one gets:

Despite the zero correlation, there is clearly a very strong relationship between x and y. In fact, after some analysis, one can show that they obey the following relationship:

$y = \frac{1}{\sqrt{2 \pi}} e^{-(x^2)/2}$,

that is, y is the standard normal probability density evaluated at x. So, in this example and in similar examples where there is a strong nonlinear relationship between the two variables, the correlation, and in particular the Pearson correlation, is meaningless. Strangely, despite this, Volokh uses a near-zero correlation in his data to argue that there is no relationship between a state’s gun score and the number of gun-related deaths, but this is not what his results show! He is misinterpreting his calculations.
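The “Try it!” above can be reproduced with a short script. This is a minimal sketch using only the Python standard library; `pearson_r` is an illustrative helper written out from the textbook definition of the Pearson coefficient, not Excel’s implementation. It rebuilds the x and y columns from the formula and also correlates y with x², which is an extra check added here to show that a different choice of variable reveals the dependence.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    syy = sum((b - mean_y) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)

# x from -1 to 1 in steps of 0.1, y the standard normal density at x
xs = [i / 10 for i in range(-10, 11)]
ys = [math.exp(-x * x / 2) / math.sqrt(2 * math.pi) for x in xs]

r_linear = pearson_r(xs, ys)                        # correlation of y with x
r_squared_input = pearson_r([x * x for x in xs], ys)  # correlation of y with x^2

print(f"corr(x, y)   = {r_linear:.6f}")
print(f"corr(x^2, y) = {r_squared_input:.6f}")
```

Because y is symmetric about x = 0, the positive and negative halves cancel in the covariance, so the first coefficient vanishes up to floating-point error even though y is completely determined by x.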

Indeed, looking at Volokh’s specific example of comparing the Brady score to the number of Homicides, one gets the following scatter plot:

Volokh then computes the Pearson correlation between the two variables and obtains a result of 0.0323, that is, quite close to zero, which leads him to conclude that there is no correlation between the two. But this is not what the result means. What it says in this case is that any relationship between the two is strongly nonlinear. As I’ve said above, and demonstrate below, looking at only two variables for a state is hardly useful, but for argument’s sake, even a very rough analysis shows a roughly sinusoidal relationship between the two variables:

In fact, this sum-of-sines curve is an 8-term sine function with an $R^2$ of 0.5322. So, it’s not great, but there is clearly at least some systematic relationship between the two variables. But, I will say again that, due to the clustering of points around zero on the x-axis above, there is simply NO function that fits all of the points, because several states share the same x-value while having different y-values, and this is problematic. So, looking at two variables is not useful at all, and what this calculation shows is that the relationship, if there is one, would be strongly nonlinear, so measuring the correlation doesn’t make any sense.

Therefore, one requires a much deeper analysis, which we attempt to provide below.

A more correct way to look at the Gun Homicide data using data science methodologies.

I wanted to use data science methodologies to analyze which side is correct. Due to limited time, I was only able to look at data from previous years (2010-2014), and compared the following state-by-state data:

1. # of Firearm deaths per 100,000 people (Data from: http://kff.org/other/state-indicator/firearms-death-rate-per-100000/)
2. Total State Population (Obtained from Wikipedia)
3. Population Density / Square Mile (Obtained from Wikipedia)
4. Median Household Income (Obtained from Wikipedia)
5. Gun Law Grade: This data was obtained from http://gunlawscorecard.org/, run by The Law Center to Prevent Gun Violence, which grades each state based on the number and quality of its gun laws using letter grades, i.e., A, A+, B+, F, etc. To use this data in the data science algorithms, I converted each letter grade to a numerical grade based on the following scale: A+: 90, A-: 90, A: 85, B: 73, B-: 70, B+: 77, C: 63, C-: 60, C+: 67, D: 53, D-: 50, D+: 57, F: 0.
6. State Mental Health Agency Per Capita Mental Health Services Expenditures (Obtained from: http://kff.org/other/state-indicator/smha-expenditures-per-capita/#table)
7. Some data was available for some years and not for others, so there are very slight percentage changes from year-to-year, but overall, this should have a negligible effect on the results.
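The letter-to-number conversion in item 5 amounts to a simple lookup table. The sketch below is illustrative only: the scale is copied verbatim from the list above, while the names `GRADE_SCALE` and `grade_to_score` and the example grades are hypothetical.

```python
# Numerical scale copied verbatim from item 5 above
GRADE_SCALE = {
    "A+": 90, "A-": 90, "A": 85,
    "B": 73, "B-": 70, "B+": 77,
    "C": 63, "C-": 60, "C+": 67,
    "D": 53, "D-": 50, "D+": 57,
    "F": 0,
}

def grade_to_score(letter):
    """Convert a letter grade such as 'B+' to its numerical score."""
    key = letter.strip().upper()
    if key not in GRADE_SCALE:
        raise ValueError(f"unknown grade: {letter!r}")
    return GRADE_SCALE[key]

# Hypothetical example grades, not taken from the scorecard itself
example_grades = ["A-", "C+", "F", "b"]
print([grade_to_score(g) for g in example_grades])
```

Raising an error on an unrecognised grade, rather than silently returning a default, keeps a typo in the input from being scored as if it were a real grade.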

This is what I found.

Using a boosted regression tree algorithm, I wanted to find which are the largest contributing factors to the number of firearm deaths per 100,000 people and found:

(The above numbers were calculated from a gradient boosted model with a Gaussian loss function. 5000 iterations were performed.)

One sees right away that the quality and number of gun laws a state has is the overwhelming factor in the number of gun-related deaths, with the amount of money a state spends on mental health services having a negligible effect.

Next, I created a regression tree to analyze this problem further. I found the following:

The numbers in the very last level of each tree indicate the number of gun-related deaths. One sees once again that where the individual state’s gun law grade is above 73.5, that is, higher than a “B”, the number of gun-related deaths is at its lowest, at a predicted 5.7 / 100,000 people. (Note that the sum of squares error for this regression was found to be 3.838.) Interestingly, the regression tree also predicts that the highest numbers of gun-related deaths all occur for states that score an “F”!
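To make the idea of a split such as “grade above 73.5” concrete, here is a minimal sketch of how a regression tree chooses a single split: it tries every threshold between adjacent distinct values of a predictor and keeps the one that minimises the sum of squared errors around the two leaf means. This is a one-level illustration only, not the tree fitted above; the function name `best_split` is hypothetical and the data are made-up numbers, not the state data used in this post.

```python
def best_split(xs, ys):
    """Return (sse, threshold, left_mean, right_mean) for the best single split.

    Returns None if all x-values are identical, so no split is possible.
    """
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot split between two identical x-values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        # Sum of squared errors when each side is predicted by its own mean
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best

# Made-up illustration data: numerical gun law grades and death rates
grades = [0, 0, 0, 53, 63, 73, 77, 85, 90]
rates = [17.0, 15.5, 16.2, 12.1, 11.0, 10.4, 6.1, 5.5, 5.2]

sse, threshold, low_side_mean, high_side_mean = best_split(grades, rates)
print(f"split at grade {threshold}: mean rate {low_side_mean:.2f} below, "
      f"{high_side_mean:.2f} above (SSE {sse:.2f})")
```

A full regression tree repeats this search recursively inside each side of the split and across all predictors, which is how the multi-level tree described above arises.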

In fact, using a Principal Component Analysis (PCA), and plotting the first two principal components, we find that:

One sees from this PCA that states that have a high gun-law grade have a low death rate.

Finally, using K-means clustering, I found the following:

One sees from the above results that the states that have a very low “Gun Law grade” are clustered together in having the highest firearms death rate. (See the fourth column in this matrix.) That is, zooming in: