Analyzing US Gun Violence Data
Nishank Arora
Project Manager - Capgemini | Ex-Svam | Microsoft Certified Data Analyst Associate
In this post, we will analyze US gun violence data using a data set available on the popular site Kaggle.
Before starting to analyze data, it’s very important to understand why we need to analyze it. In every data analysis use case, understanding the basic aim of the analysis is the most fundamental step.
Every year, the US witnesses a large number of gun violence incidents that are not terrorist activities but cases where people with an unstable mindset (for numerous reasons) shoot at innocent people. A large number of people are affected by these incidents, so it’s important to study the data to identify common points or patterns that the government can use to provide better security to its citizens. So let’s hit it!
First Step: Import the Data Set
All the data analysis will be done in Python, using pandas and numpy as well as modules like plotly for visualizations.
Following is the code to import the data set into a data frame ‘df’:
#Import pandas library
import pandas as pd
#Import data set
df = pd.read_csv('/Users/toughguy/Downloads/gun-violence-data_01-2013_03-2018.csv')
#Print the first few rows of the data
df.head()
Now that we have the data frame available, let’s see what information the imported data contains:
df.info()
The above line of code prints out the following:
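The exact listing depends on the file you downloaded. As a rough, self-contained illustration of what to look for, the sketch below builds a tiny made-up frame and checks it the same way. Only the 'state' column is known to exist in the real data set; 'n_killed' and all the values are assumptions used purely for illustration.

```python
import pandas as pd

# Hypothetical mini data set: only 'state' is confirmed by the post,
# 'n_killed' is an assumed column name and the values are made up.
sample = pd.DataFrame({
    'state': ['Illinois', 'Texas', None, 'Florida'],
    'n_killed': [1, 0, 2, None],
})

# Column names, non-null counts and dtypes, as df.info() reports them
sample.info()

# Number of missing values per column
missing = sample.isna().sum()
print(missing)
```

Columns with many missing values are worth noting early, because they limit what can safely be inferred later.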
Next Steps: Analyzing data frame
The columns found in the data frame can be used to infer a lot of information, but whatever we infer should be useful. So far, we only know the columns and the type of data we have. For a better understanding, let’s visualize the data on a US map to see the number of incidents reported in each state.
Number of incidents by state
We will use the current data frame ‘df’ to extract the number of gun violence incidents that happened in each US state. Following is the code:
# Get statewise incident counts (the result is a Series indexed by state)
statewise_numbers = df.groupby('state')['state'].count()
# Name the counts; a Series has no columns, so a columns-style rename does not apply
statewise_numbers = statewise_numbers.rename('Number_of_incidents')
statewise_numbers
This would generate the following results:
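The real counts depend on the downloaded file, so here is a minimal sketch of the same group-and-count step on a few made-up rows. It also shows one possible way to turn the resulting Series into a two-column frame; the names `incidents`, `counts` and `counts_df` are hypothetical.

```python
import pandas as pd

# Made-up incidents, one row per incident (not real data)
incidents = pd.DataFrame(
    {'state': ['Texas', 'Illinois', 'Texas', 'Florida', 'Texas']}
)

# groupby + count returns a Series indexed by state
counts = incidents.groupby('state')['state'].count()

# Rename the values and the index before reset_index(), otherwise both
# would be called 'state' and pandas would refuse to insert the column
counts_df = (
    counts.rename('Number_of_incidents')
          .rename_axis('State')
          .reset_index()
)
print(counts_df)
```

Keeping the counts as a Series is enough for the plotting step later; the frame form is just easier to read and export.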
So, we now have the statewise count of gun violence incidents. Let’s plot these on a US map for better visualization.
Visualizing and plotting data on US map
Following is the code to import the appropriate libraries and plot data points:
#For plotting
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline
# Import FIPS of all US states
import pandas as pd
df_states = pd.read_csv('/Users/admin/my-documents/fips_state.csv')
# Create new data frame with statewise numbers
df_new = pd.DataFrame({'state':statewise_numbers.index, 'Number_of_incidents':statewise_numbers.values})
# No rename is needed: df_new already has a 'state' column to merge on.
# df_states is assumed to have 'state' and 'post_code' columns, as used below.
# Merge df_new and df_states
result = pd.merge(df_new,df_states,on='state',how='inner')
We have now prepared the data for visualization, and the ‘result’ data frame looks like below:
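As a small, self-contained sketch of this merge step (with made-up numbers and an assumed lookup layout of full state name plus two-letter 'post_code'), the example below also shows how to check which states an inner merge would silently drop. The frames `counts` and `lookup` are hypothetical stand-ins for ‘df_new’ and ‘df_states’.

```python
import pandas as pd

# Made-up counts; 'District of Columbia' is deliberately missing from the lookup
counts = pd.DataFrame({
    'state': ['Texas', 'Illinois', 'District of Columbia'],
    'Number_of_incidents': [3, 1, 2],
})

# Assumed lookup layout: full state name plus two-letter postal code
lookup = pd.DataFrame({
    'state': ['Texas', 'Illinois'],
    'post_code': ['TX', 'IL'],
})

# A left merge with indicator=True flags rows that have no match in the lookup
check = pd.merge(counts, lookup, on='state', how='left', indicator=True)
unmatched = check.loc[check['_merge'] == 'left_only', 'state'].tolist()
print(unmatched)

# The inner merge keeps only the states present in both frames
merged = pd.merge(counts, lookup, on='state', how='inner')
print(merged)
```

If the unmatched list is not empty, those rows will not appear on the map, so it is worth checking before plotting.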
For visualization, use the code below:
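# plotly.plotly moved to the separate chart_studio package in plotly 4
try:
    import chart_studio.plotly as py  # plotly 4 and later
except ImportError:
    import plotly.plotly as py  # plotly 3 and earlier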
import plotly.graph_objs as go
import plotly.tools as tls
scl = [
    [0.0, 'rgb(240,230,255)'],
    [0.1, 'rgb(224,204,255)'],
    [0.2, 'rgb(209,179,255)'],
    [0.3, 'rgb(194,153,255)'],
    [0.4, 'rgb(179,128,255)'],
    [0.5, 'rgb(163,102,255)'],
    [0.6, 'rgb(148,77,255)'],
    [0.7, 'rgb(133,51,255)'],
    [0.8, 'rgb(117,26,255)'],
    [0.9, 'rgb(102,0,255)'],
    [1.0, 'rgb(92,0,230)']
]
result['text'] = result['state']
data = [go.Choropleth(
    colorscale = scl,
    autocolorscale = False,
    locations = result['post_code'],
    z = result['Number_of_incidents'].astype(float),
    locationmode = 'USA-states',
    text = result['text'],
    marker = go.choropleth.Marker(
        line = go.choropleth.marker.Line(
            color = 'rgb(255,255,255)',
            width = 2,
        )),
    colorbar = go.choropleth.ColorBar(
        title = "Number of incidents")
)]
layout = go.Layout(
    title = go.layout.Title(
        text = 'Gun Violence by State<br>(Hover for breakdown)'
    ),
    geo = go.layout.Geo(
        scope = 'usa',
        projection = go.layout.geo.Projection(type = 'albers usa'),
        showlakes = True,
        lakecolor = 'rgb(255, 255, 255)'),
)
fig = go.Figure(data = data, layout = layout)
py.iplot(fig, filename = 'd3-cloropleth-map')
The above code generates the visualization below:
From the above, we can infer that the highest numbers of incidents were reported in Illinois, California, Texas and Florida.
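The same ranking can also be read directly from the data frame instead of the map. The sketch below is a minimal illustration using made-up totals (not the real figures) in a hypothetical frame `totals`, shaped like ‘result’.

```python
import pandas as pd

# Made-up totals for illustration only, not the real figures
totals = pd.DataFrame({
    'state': ['Ohio', 'Texas', 'Illinois', 'Florida', 'California'],
    'Number_of_incidents': [10, 30, 50, 20, 40],
})

# Pick the four states with the highest incident counts
top4 = totals.nlargest(4, 'Number_of_incidents')
print(top4)
```

Keep in mind that raw counts favour populous states; dividing by population would give a per-capita view, which is outside the scope of this post.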
The above is just one example. Similarly, we can infer a lot more information from the data. That’s all, folks.
Thanks for reading, guys. I’ll try to post more frequently with similar data sets and new use cases :)