**Update: March 16, 2018: I have received quite a few comments about my critique of Volokh’s WaPo article, and just as a summary of my reply back to those comments:**

The main point that I made and demonstrated below is that the concept of a correlation is only useful as a measure of linearity between the two variables you are comparing. ALL of Volokh’s correlations that he computes are close to zero: 0.032 for correlation between homicide rate, including gun accidents and the Brady score, 0.065 for correlation between intentional homicide rate and Brady score, 0.0178, correlation between the homicide rate including gun accidents and the National Journal score, and 0.0511, correlation between just the intentional homicide rate and National Journal score. All of these numbers are completely *useless*. You cannot conclude anything from these scores. All you can conclude is that the relationship between homicide rate (including or not including gun accidents) and the Brady score is highly nonlinear. Since they are nonlinear, I have investigated this nonlinear relationship using data science methodologies such as regression trees.

Article begins below:

**Abstract:**

- The number and quality of gun-control laws a state has drastically affects the number of gun-related deaths.
- Other factors like mean household income play a smaller role in the number of gun-related deaths.
- Factors like the amount of money a state spends on mental-health care have a negligible effect on the number of gun-related deaths. This point is quite important, as a number of policy-makers consistently argue that the focus needs to be on the mentally ill and that this will curb the number of gun-related deaths.

**Contents:**

- Critique of Recent Gun-Control Opposition Studies
- A more correct way to look at the Gun Deaths data using data science methodologies.

**A Critique of Recent Gun-Control Opposition Studies**

In light of the recent tragedy in Oregon, which is part of a disturbing trend of increasing gun violence in the United States, we are once again in the aftermath in which President Obama and most Democrats advocate for more gun laws that they claim would help decrease gun violence, while their Republican counterparts, as usual, argue the precise opposite. Indeed, two very simplified “studies” have been presented in the media thus far and cited frequently by gun advocates:

- Glenn Kessler’s so-called Fact-Checker Article
- Eugene Volokh’s opinion article in The Washington Post

I have singled out these two examples, but most of the studies claiming to “do statistics” follow a similar methodology. It should be noted that these studies are extremely simplified: they compute correlations while looking at only two factors (the gun death rate and a state’s “Brady grade”). As we show below, an answer to the question of interest, one that lets us assess both correlation and causation, must depend on several state-dependent factors and hence requires deeper statistical learning methodologies, of which NONE of the Second Amendment advocates seem to be aware.

The reason one cannot deduce anything significant from correlations, as is done in Volokh’s article, is that correlation coefficients are good “summary statistics” but hardly tell you anything deep about the data you are working with. For example, in his article Volokh uses MS Excel to compute the correlation between a pair of variables, but Excel itself uses the Pearson correlation coefficient, which is essentially a measure of the linearity between two variables. If the underlying data exhibits a nonlinear relationship, the correlation coefficient can return a small value, but this in no way means there is no relationship in the data; it just means the relationship is not linear. Similarly, other correlation coefficient computations make other assumptions about the data, such as that it comes from a normal distribution, which is strange to assume from the outset. (There is also the more technical issue that a state’s Brady grade is not exactly a *random* variable. So measuring the correlation between a supposed random variable (the number of homicides) and a non-random variable is not exactly a sound idea.)
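For reference, the Pearson coefficient mentioned above is usually defined as follows for $n$ paired observations $(x_i, y_i)$. The notation ($r$, $\bar{x}$, $\bar{y}$ for the coefficient and the sample means) is introduced here for illustration and is not taken from Volokh’s article:

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

The numerator only picks up how the two variables move together around their means along a straight line, which is why a symmetric, curved relationship can produce $r = 0$.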

A simple example of where the correlation calculation fails is the following set of data. Consider two variables, x and y, with the values:

| x | y |
|---|---|
| -1.0000 | 0.2420 |
| -0.9000 | 0.2661 |
| -0.8000 | 0.2897 |
| -0.7000 | 0.3123 |
| -0.6000 | 0.3332 |
| -0.5000 | 0.3521 |
| -0.4000 | 0.3683 |
| -0.3000 | 0.3814 |
| -0.2000 | 0.3910 |
| -0.1000 | 0.3970 |
| 0 | 0.3989 |
| 0.1000 | 0.3970 |
| 0.2000 | 0.3910 |
| 0.3000 | 0.3814 |
| 0.4000 | 0.3683 |
| 0.5000 | 0.3521 |
| 0.6000 | 0.3332 |
| 0.7000 | 0.3123 |
| 0.8000 | 0.2897 |
| 0.9000 | 0.2661 |
| 1.0000 | 0.2420 |

If one tries to compute the correlation between x and y, one will obtain that the correlation coefficient is zero! (Try it!) A simple conclusion would be that therefore there is no linear causation/dependence between x and y. But, if one now makes a scatter plot of x and y, one gets:

Despite having zero correlation, there is apparently a very strong relationship between x and y. In fact, after some analysis, one can show that they obey the following relationship:

$$y = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2},$$

that is, y is the probability density of the standard normal distribution evaluated at x. So, in this example and in similar examples where there is a strong nonlinear relationship between the two variables, the correlation, in particular the Pearson correlation, is meaningless. Strangely, despite this, Volokh uses a near-zero correlation in his data to argue that there is no relationship between a state’s gun score and the number of gun-related deaths, but this is not what his results show! He is misinterpreting his calculations.
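The example above can be checked numerically. The following is a minimal sketch in plain Python, assuming the y-values are regenerated from the standard normal density rather than typed in from the table; the helper name `pearson` is an illustrative choice, not something from the original analysis.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    return cov / math.sqrt(var_x * var_y)


# x runs from -1.0 to 1.0 in steps of 0.1, as in the table above.
xs = [i / 10 for i in range(-10, 11)]
# y is the standard normal density evaluated at each x.
ys = [math.exp(-x * x / 2) / math.sqrt(2 * math.pi) for x in xs]

# Correlation between x and y: zero up to floating-point rounding.
r_xy = pearson(xs, ys)
# Correlation between x squared and y: the same data, viewed through a
# nonlinear transformation, shows a strong (negative) linear association.
r_x2y = pearson([x * x for x in xs], ys)

print(f"corr(x, y)   = {r_xy:.3g}")
print(f"corr(x^2, y) = {r_x2y:.3g}")
```

The second number is only there to show that the dependence was present all along; the Pearson coefficient on the raw x simply cannot see it.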

Indeed, looking at Volokh’s specific example of comparing the Brady score to the number of Homicides, one gets the following scatter plot:

Volokh then computes the Pearson correlation between the two variables and obtains a result of 0.0323, that is, quite close to zero, which leads him to conclude that there is no correlation between the two. But this is *not* what this result means. What it says in this case is that the relationship between the two is not linear, so any relationship that exists must be strongly nonlinear. As I have said above and demonstrate below, looking at only two variables for a state is hardly useful, but for argument’s sake, even a very rough analysis suggests a roughly sinusoidal relationship between the two variables:

In fact, this fitted curve is an 8-term sum-of-sines function with an R^2 of 0.5322. So it’s not great, but there is clearly at least some causal behaviour between the two variables. But I will say again that, due to the clustering of points around zero on the x-axis above, there is simply NO function that passes through all the points, because the data are not single-valued: there are several different y-values at the same x-value, and this is problematic. So looking at two variables is not useful at all, and what this calculation shows is that the relationship, if there is one, would be strongly *nonlinear*, so measuring the correlation doesn’t make any sense.

**Therefore, one requires a much deeper analysis, which we attempt to provide below.**

**A more correct way to look at the Gun Homicide data using data science methodologies.**

I wanted to use data science methodologies to analyze which side is correct. Due to limited time, I was only able to look at data from previous years (2010-2014), and compared the following state-by-state data:

- # of Firearm deaths per 100,000 people (Data from: http://kff.org/other/state-indicator/firearms-death-rate-per-100000/)
- Total State Population (Obtained from Wikipedia)
- Population Density / Square Mile (Obtained from Wikipedia)
- Median Household Income (Obtained from Wikipedia)
- Gun Law Grade: This data was obtained from http://gunlawscorecard.org/, published by The Law Center to Prevent Gun Violence, which grades each state based on the number and quality of its gun laws using letter grades, i.e., A, A+, B+, F, etc. To use this data in the data science algorithms, I converted each letter grade to a numerical grade based on the following scale: A+: 90, A-: 90, A: 85, B: 73, B-: 70, B+: 77, C: 63, C-: 60, C+: 67, D: 53, D-: 50, D+: 57, F: 0.
- State Mental Health Agency Per Capita Mental Health Services Expenditures (Obtained from: http://kff.org/other/state-indicator/smha-expenditures-per-capita/#table)
- Some data was available for some years and not for others, so there are very slight percentage changes from year-to-year, but overall, this should have a negligible effect on the results.
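The letter-to-number conversion described in the list above can be sketched as a simple lookup. This is only an illustration: the names `GRADE_TO_SCORE` and `grade_to_score` are hypothetical, and the values are copied unchanged from the scale as written in the text (including A+ and A- both mapping to 90).

```python
# Scale copied as stated in the text; not re-derived or corrected here.
GRADE_TO_SCORE = {
    "A+": 90, "A-": 90, "A": 85,
    "B+": 77, "B": 73, "B-": 70,
    "C+": 67, "C": 63, "C-": 60,
    "D+": 57, "D": 53, "D-": 50,
    "F": 0,
}


def grade_to_score(grade):
    """Convert a letter grade such as 'B+' to its numerical score."""
    key = grade.strip().upper()
    if key not in GRADE_TO_SCORE:
        raise ValueError(f"Unknown gun-law grade: {grade!r}")
    return GRADE_TO_SCORE[key]


print(grade_to_score("B+"))
print(grade_to_score("F"))
```

Note that any method that works with distances between numbers (rather than with the ordering of categories) will treat the 50-point gap between D- and F differently from the 3-point gaps elsewhere on this scale.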

This is what I found.

Using a boosted regression tree algorithm, I wanted to find which are the largest contributing factors to the number of firearm deaths per 100,000 people and found:

(The above numbers were calculated from a gradient boosted model with a gaussian loss function. 5000 iterations were performed.)

One sees right away that the quality and number of gun laws a state has is the overwhelming factor in the number of gun-related deaths, with the amount of money a state spends on mental health services having a negligible effect.
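To make the idea of "relative influence" in a boosted model concrete, here is a minimal sketch of gradient boosting with squared-error (Gaussian) loss, written in plain Python with one-split trees (stumps). It is not the software or the settings used for the results above: the function names, the number of rounds, the learning rate, and above all the tiny data set are invented purely for illustration and are not the article's data.

```python
def best_stump(X, r):
    """Find the single split (feature, threshold) that most reduces the
    squared error when predicting the residuals r with two constants."""
    n = len(r)
    mean_r = sum(r) / n
    base_sse = sum((v - mean_r) ** 2 for v in r)
    best = (0.0, None, None, mean_r, mean_r)  # gain, feature, thr, left, right
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [r[i] for i in range(n) if X[i][j] <= thr]
            right = [r[i] for i in range(n) if X[i][j] > thr]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((v - lm) ** 2 for v in left)
                   + sum((v - rm) ** 2 for v in right))
            if base_sse - sse > best[0]:
                best = (base_sse - sse, j, thr, lm, rm)
    return best


def boost(X, y, n_rounds=200, learning_rate=0.1):
    """Boost stumps on residuals; return fitted values and the relative
    influence of each feature (share of total squared-error reduction)."""
    pred = [sum(y) / len(y)] * len(y)
    gains = [0.0] * len(X[0])
    for _ in range(n_rounds):
        # For squared-error loss the negative gradient is the residual.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        gain, j, thr, lm, rm = best_stump(X, residuals)
        if j is None:  # no split improves the fit any further
            break
        gains[j] += gain
        pred = [p + learning_rate * (lm if row[j] <= thr else rm)
                for p, row in zip(pred, X)]
    total = sum(gains) or 1.0
    return pred, [100 * g / total for g in gains]


# MADE-UP toy data, NOT the article's data: [gun-law score, spending].
X = [[0, 120], [0, 80], [50, 100], [53, 60],
     [60, 140], [73, 90], [85, 110], [90, 70]]
y = [15, 14, 12, 12, 10, 8, 6, 5]  # invented death rates per 100,000

fitted, influence = boost(X, y)
for name, share in zip(["gun-law score", "spending"], influence):
    print(f"{name}: {share:.1f}% relative influence")
```

Because the toy outcome was constructed to fall with the first feature, that feature collects most of the influence here; the sketch only shows how such percentages arise, and says nothing about the real data.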

Next, I created a regression tree to analyze this problem further. I found the following:

The numbers in the very last level of each tree indicate the number of gun-related deaths. One sees once again that where the individual state’s gun law grade is above 73.5%, that is, higher than a “B”, the number of gun-related deaths is at its lowest, at a predicted 5.7 / 100,000 people. (Note that the sum-of-squares error for this regression was found to be 3.838.) Interestingly, the regression tree also predicts that the highest numbers of gun-related deaths all occur for states that score an “F”!

In fact, using a Principal Component Analysis (PCA), and plotting the first two principal components, we find that:

One sees from this PCA that states that have a high gun-law grade have a low death rate.

Finally, using K-means clustering, I found the following:

One sees from the above results, the states that have a very low “Gun Law grade” are clustered together in having the highest firearms death rate. (See the fourth column in this matrix). That is, zooming in:

**What about Suicides?**

This question has been raised many times because the gun deaths number above includes the number of self-inflicted gun deaths. The argument has been that if we filter out this data from the gun deaths above, the arguments in this article fall apart. As I now show, this is, in fact, not the case. Using the state-by-state firearm suicide rate from (http://archinte.jamanetwork.com/article.aspx?articleid=1661390), I performed this filtering to obtain the following principal component analysis biplot:

One sees that the PCA puts approximately equal weight (loadings) onto population density, gun-law grade, and median household income. It is quite clear that states that have a very high gun-law grade have a low amount of gun murders, and vice-versa.

**One sees that the data shows that there is a very large anti-correlation between a state’s gun law grade and the death rate.** There is also a very small anti-correlation between how much a state spends on mental health care and the death rate.

Therefore, the conclusions one can draw immediately are:

- **The number and quality of gun-control laws a state has *drastically* affects the number of gun-related deaths.**
- **Other factors like mean household income play a smaller role in the number of gun-related deaths.**
- **Factors like the amount of money a state spends on mental-health care have a negligible effect on the number of gun-related deaths. This point is quite important, as a number of policy-makers consistently argue that the focus needs to be on the mentally ill and that this will curb the number of gun-related deaths.**
- It would be interesting to apply these methodologies to data from other years. I will perhaps pursue this at a later time.

I like this blog. Thanks for posting! 🙂

I’ve got some statistical background, but certainly not the background of a data-scientist. I think your analysis is very interesting, and is sparking discussion between a group of us at my office (who have similar levels of “some stats, but not deep stats”). I’ve got a question about the numerical scores you assign to the letter grades, especially with respect to the k-means clustering. From reading the wiki on k-means, it seems like the method would be sensitive to the actual number values (as opposed to an ordered list of categories/indicator variables). I’m curious if there’s an impact due to C and D being 10 points apart, but D and F are 50 points apart (I also noticed you gave A and A+ the same number, but wasn’t sure if that was a typo).

I’m partly curious due to the original subject, and partly because at work I deal in some numerical data (permeability), and some related categorical data (gravel, sand, silt, clay). It looks like k-means analysis might be relevant and useful at times at work, so it’s nice to find a new potential tool through an unexpected avenue!

Hi. Thanks for your message. No. K-means clustering is not impacted by the numerical difference between the grades. It just looks at the numerical grades themselves. So, all states with a numerical grade of 0 got clustered together, and these were also the states that had the highest gun murder rate. If I were to make the grades more evenly distributed, I conjecture that you would have more clusters (because there are more categories), but the highest murder rate clusters would still correspond to the states that have the lower gun grades.

I like your analysis, but it’s not a direct counter to the WAPO article. They compared gun laws to overall homicide rates, not just gun deaths. Would love to see your analysis for how gun laws affect the overall homicide rate.

Hi. Thanks for this great read. I agree with J, I would love to see an analysis of gun laws on overall homicide rate (which is the point of the WAPO article). You seem to suggest at the beginning that there is a relationship between gun laws and overall homicide rates, but, if I am reading the post correctly, your main analysis deals only with gun homicides. Another thing to think about might be non-lethal gun attacks. Thanks.

In your WaPo comments, you tell one commenter that considering race might be useful but is unnecessary. What does that mean?

In a two-variable comparison, the firearm homicide rate and the proportion of a state’s population that is African-American have a correlation in the 2010s around .87. Is this useful, unnecessary, or somehow different and irrelevant in a way that income, total population, and gun law grades are not?

The data used also includes suicides and accidental deaths. Try using data that is firearms homicide/murder only and I think you will be surprised at the results.

Why is a gun-related suicide or gun-related “accidental” death less significant than a gun-related homicide?

The question, “do gun laws affect the overall homicide rate?” is not more or less significant, but that is the question he was attempting to answer. To argue his results are wrong, wouldn’t you need to answer the same question? It would be interesting to know if, for example, there was no correlation between the laws and homicides but there was a correlation between the laws and suicides or vice versa.

A quote from your article says:

“A more correct way to look at the Gun Homicide data using data science methodologies.”

Accidental death and suicide are not homicide. Maybe I am word smithing 🙂

Hi. I think you are! I meant gun-related death data…

Although, I do agree with Jawohl’s statement above:

“To argue his results are wrong, wouldn’t you need to answer the same question?”

I am a mathematician myself and a big friend of statistical investigations. However, if you are fighting the anti-NRA war you need to be aware that politicians use other arguments, some of them downright cynical.

First of all, conservatives regard suicide as a private matter, or at least as a matter of health care only. Then, they like to argue that accidental firearm incidents (shooting one’s kids while cleaning the weapon) can be prevented by education on weapons, starting in schools. Moreover, they claim that more weapons prevent shootings by enabling people to defend themselves properly. These are the arguments you have to deal with.

Hi. Thanks for your comments. I very much agree with what you say, and have dealt with my fair share of these types of arguments, including an article by Eugene Volokh, which I tried to dispute at the beginning of this posting. They are truly blind!

I came across your post via the Volokh WP article, and I think your analysis is very well done. However, there is a problem. Gun laws vary within a state, sometimes drastically so, which generally makes state-by-state comparisons meaningless, unfortunately, because now you’re comparing averaged legislation. A better analysis, I believe, would be county-by-county comparisons, since you’re less likely to deal with averaged legislative effects within counties (some cities within a county may have stricter laws than the county itself, but then that can be looked at separately), and county laws generally better reflect the character of the populations in question, than state laws as a whole.

For example, CA has an overall grade of A-, but if you look at CA gun statistics, gun laws in the cities are generally much more strict than in most rural counties that have laxer laws. I haven’t done this analysis myself yet, but it’s worth examining. Cheers!

Excellent statistical work and I thank you for sharing. However, as others have mentioned, this does not address the WAPO article by Volokh as you count total firearm related deaths (including suicides). I am interested to know if Volokh’s point is valid or not when restricting consideration to just homicides and accidental deaths. Would you be willing to make another post tackling this issue? I think this deeper analysis is a valuable contribution to the conversation regardless of which way it points.

Hello. Yes, in the near future, when I have more time I will add this analysis, though, I suspect the conclusions will stay the same.

The WAPO article already conceded the point that there was a correlation between gun death and gun laws and argued that it was not actually worth looking at that statistic and gave reasons why. I see there has not been any action by you for over a year and a half so I can only conclude that you have looked at the actual question at hand and realized that there is not a correlation between gun laws and total homicide + accidental gun death. Sorry it didn’t work out for you.

Your comment indicates that either you don’t understand what correlation means, or you missed the entire point of my blog post / reply. The main point that I made and demonstrated is that the concept of a correlation is only useful as a measure of linearity between the two variables you are comparing. ALL of Volokh’s correlations that he computes are close to zero: 0.032 for correlation between homicide rate, including gun accidents and the Brady score, 0.065 for correlation between intentional homicide rate and Brady score, 0.0178, correlation between the homicide rate including gun accidents and the National Journal score, and 0.0511, correlation between just the intentional homicide rate and National Journal score. All of these numbers are completely *useless*. You cannot conclude anything from these scores. All you can conclude is that the relationship between homicide rate (including or not including gun accidents) and the Brady score is highly nonlinear. Since they are nonlinear, I have investigated this nonlinear relationship using data science methodologies such as regression trees.