21 Days to Data Challenge - What I've Learned

Introduction

Data is everywhere.

It is used in every industry, every company, and in every facet of our lives.

Did you know that when you "google" a term or phrase, data is what powers and ranks your search results?

I have a background in mathematics and statistics, but it was only in the past few years that I really got to delve into analytics in my career. I use Excel almost daily, creating reports and analyzing KPIs, but I started to wonder if there were more tools and more things I could do with my data, beyond the spreadsheets many of us use in our daily work.

I started to learn more about data analysis and the different tools that can be used to help answer questions and solve problems, and in the midst of my learning, I came across Avery Smith's LinkedIn profile. I saw he ran a #21daystodata challenge that gives an introduction to data analysis while guiding participants through their very own project, and of course I had to sign up!

In 21 days, we got an introductory look at different tools that data analysts use, and every day we received a short lesson and a challenge to work on. We used a subset of New York City's open crime data covering Q1 of 2018; the data source can be found on Kaggle.

The Mission

The course project centers on a fictional Police Commissioner of New York asking us (the data analysts) to help him with the city's crime problem. We were tasked with analyzing the crime data and providing meaningful insights so that the Commissioner has an action plan to combat the city's crime.

In order to help the Commissioner with his problem, we needed to go through several steps. The process is outlined below.

STEP 1: Define the Problem

We want to provide insights to the Commissioner as to what is driving the high crime rates, and how the issue can be remedied.

STEP 2: Examine the Data

The data we are analyzing is the 2018 Q1 subset of New York City's open crime data. It consists of 35 columns and 109,546 rows.

Taking a look at the dataset, we can see columns representing victim ages, suspect ages, the times crimes were reported, the locations of the crimes, the types of offenses committed, and much more.
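
To follow along in Python, the dataset can be loaded with pandas and inspected directly. The filename below is hypothetical, so substitute whatever your Kaggle download is called:

import pandas as pd

# Hypothetical filename for the Kaggle 2018 Q1 subset
df = pd.read_csv('nypd_crime_2018_q1.csv')

print(df.shape)             # expect (109546, 35): rows by columns
print(df.columns.tolist())  # column names such as BORO_NM, OFNS_DESC, VIC_AGE_GROUP
print(df.head())            # first five rows for a quick sanity check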

STEP 3: Clean the Data

Once I got an overall picture of the data, I needed to clean the data and add some information to help with the analysis I wanted to do later on.

Using OpenRefine, I applied the Text Facet option to the 'VIC_AGE_GROUP' column to take a closer look at the ages. I noticed there were ages that did not make sense, which could have been due to typos or copy-paste errors when the data was collected.


Several age ranges were negative or in the triple digits and needed to be fixed.

In OpenRefine, I renamed these outliers to "UNKNOWN," folding them into the already existing UNKNOWN category, after which the counts and age ranges made much more sense.


In order to save this to our dataset, though, I needed to make the same changes in Google Sheets. With so many rows, it did take a little longer, but I was able to add a filter to the 'VIC_AGE_GROUP' column and make the same changes as in OpenRefine, changing all outlier ages to "UNKNOWN."
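
For anyone reproducing this cleaning step in code instead of OpenRefine or Sheets, here is a minimal pandas sketch. It assumes the valid labels are '<18', '18-24', '25-44', '45-64', '65+', and 'UNKNOWN', so adjust the set to match what your text facet shows:

# Assumed set of valid age buckets; anything else becomes UNKNOWN
valid_ages = {'<18', '18-24', '25-44', '45-64', '65+', 'UNKNOWN'}

df['VIC_AGE_GROUP'] = df['VIC_AGE_GROUP'].where(
    df['VIC_AGE_GROUP'].isin(valid_ages), 'UNKNOWN'
)

# Verify: the negative and triple-digit "ages" should now be gone
print(df['VIC_AGE_GROUP'].value_counts())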

I also wanted to add a few more columns to aid in my analysis later on, so I added columns for the 'Hour' the crime occurred, the 'Day of the Week' it occurred, and whether it occurred on a weekend or a weekday.

To find the 'Hour' of each incident, I used the formula below, where column G holds the time the crime occurred:

=LEFT(G2, FIND(":", G2)-1)

To find the 'Day of the Week' for each row, I used this formula, where column D holds the date and WEEKDAY returns 1 for Sunday through 7 for Saturday.

=WEEKDAY(D2)

And finally, to flag whether the crime occurred on a weekend, I used the formula below, which returns 1 for a weekend and 0 for a weekday:

=IF(OR(E2=7,E2=1),1,0)        
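
The same three columns can also be derived in pandas. The sketch below assumes the date and time columns are named CMPLNT_FR_DT and CMPLNT_FR_TM, as in the NYPD open data schema; note that pandas numbers weekdays Monday=0 through Sunday=6, unlike the Sheets WEEKDAY convention above:

# Column names assumed from the NYPD open data schema
dates = pd.to_datetime(df['CMPLNT_FR_DT'], errors='coerce')
times = pd.to_datetime(df['CMPLNT_FR_TM'], format='%H:%M:%S', errors='coerce')

df['Hour'] = times.dt.hour                    # equivalent of LEFT(G2, FIND(":", G2)-1)
df['Day of the Week'] = dates.dt.day_name()   # day names instead of 1-7 codes
# pandas counts Monday=0 ... Sunday=6, so the weekend is dayofweek >= 5
df['Weekend?'] = (dates.dt.dayofweek >= 5).astype(int)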

STEP 4: Visualize the Data

I started out with Flourish, where I created a fun horizontal bar graph animation, showing the number of crimes per borough over time.

You can see the animated graph here.


I then created a dashboard using Tableau Public that includes:

  • A bar graph of crime incidents per borough;
  • A tree map showing the number of criminal offenses by charge (felony, misdemeanor, or violation);
  • A pie chart comparing percentages of victim age groups;
  • A map showing locations where the different offenses occurred;
  • And a graph comparing suspect vs. victim sex, as well as suspect vs. victim race.


To view the full dashboard in detail, you can check out the link on Tableau Public.

STEP 5: Do More In-Depth Analysis

After seeing the visualizations and having an idea of what the data is showing us, I took it a step further and did some more analysis in SQL and Python.

Using the cloud-based "CSV SQL Live" website, I uploaded our cleaned dataset and ran some queries to find out what other information I could pull.

I wanted to find the top five borough and offense combinations, using the following query:

SELECT BORO_NM, OFNS_DESC, COUNT(*) AS incident_number
FROM 'crime'
GROUP BY BORO_NM, OFNS_DESC
ORDER BY incident_number DESC
LIMIT 5;
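
For anyone following along in pandas rather than SQL, a rough equivalent of this query (my own sketch, not part of the course) is:

# Count incidents per borough/offense pair and keep the five largest
top5 = (
    df.groupby(['BORO_NM', 'OFNS_DESC'])
      .size()
      .sort_values(ascending=False)
      .head(5)
      .rename('incident_number')
)
print(top5)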

The results showed that petit larceny seems to make up the majority of offenses.


Then I wanted to find out which hours of the day had the highest and lowest numbers of reported crimes, so I used the query:

SELECT Hour, COUNT(*) AS incident_number
FROM 'crime'
GROUP BY Hour
ORDER BY incident_number DESC;

The results showed that 3 p.m. (hour 15) had the most reported crimes, and 5 a.m. (hour 5) had the fewest.


I continued my analysis using Python in the Google Colab environment. I began by importing the Pandas library and determining which day of the week had the most reported crimes. Here is the code I used:

df['day_of_week'].value_counts()        

The output showed that Friday had the most reported crimes.
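
One caveat: if the column still holds the numeric 1-7 codes produced by the WEEKDAY formula, a quick mapping (hypothetical, mirroring the Sunday=1 through Saturday=7 convention used earlier) makes the counts easier to read:

# Map the Sheets WEEKDAY codes (1=Sunday ... 7=Saturday) to readable names
weekday_names = {1: 'Sunday', 2: 'Monday', 3: 'Tuesday', 4: 'Wednesday',
                 5: 'Thursday', 6: 'Friday', 7: 'Saturday'}
print(df['day_of_week'].map(weekday_names).value_counts())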

After that, I imported the Seaborn library and used it to create a few more data visualizations; a sketch of one such plot follows.

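The original screenshots aren't reproduced here, but as one example of the kind of plot Seaborn makes easy, a count plot of reported incidents per borough might look like the following sketch (the styling is my own, not the course's):

import matplotlib.pyplot as plt
import seaborn as sns

# Bar chart of reported incidents per borough, largest first
plt.figure(figsize=(10, 5))
sns.countplot(data=df, x='BORO_NM', order=df['BORO_NM'].value_counts().index)
plt.title('Reported Crimes per Borough, Q1 2018')
plt.xlabel('Borough')
plt.ylabel('Number of Incidents')
plt.show()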

STEP 6: Results

Analyzing this data for the Commissioner produced several takeaways:

  1. Most crime occurred on Fridays, and not on weekends as one might assume.
  2. Most crime occurred in the afternoons, around 3pm, and the least at 5am. This is another surprising result, as one would think most crime occurs in the evenings. Could this just be a result of more crimes being reported in the afternoons than at night, instead of reflecting the actual number of crimes that occurred? Something to think about.
  3. Most crime involved suspects and victims in the 25-44 age range.
  4. Black-on-black and male-on-male crime comprised the majority of reported incidents.
  5. Brooklyn had the highest reported crimes.
  6. Most reported offenses were misdemeanors (petit larceny, specifically).

As a result of our analyses, we can advise the Commissioner that assigning more police officers to the Brooklyn area during the afternoon hours may help curb the amount of crime.

What I Learned From Taking This Course

  1. There are so many different tools out there for a data analyst, beyond just Excel and Google Sheets.
  2. Excel and Google Sheets work well for small and moderate amounts of data, but once a dataset reaches tens or hundreds of thousands of rows, spreadsheets can become very slow at processing functions. For larger datasets, tools such as SQL, Python, and other programming languages are faster and more efficient for analysis.
  3. Data visualizations are extremely important in being able to show your results to your audience. Make sure the visualizations make sense, are user friendly, and help solve the problem in question.

Conclusion

This 21-day challenge was a great experience, and I learned a lot of new skills. I got to use Flourish, OpenRefine, SQL, Tableau, and Python, and was able to construct a full data analysis project to showcase those new skills. Though I am still in the early stages of learning programming languages like Python, I was able to see how powerful they can be for taking a deeper dive into a large amount of data. I found out how much I enjoy using SQL and creating data visualizations, and I will definitely continue improving my data analysis skills.

I can't wait to work on more projects and add some more data analysis tools to my skillset, so keep an eye out for more articles.

If you have any suggestions or feedback on this project, please connect with me on LinkedIn and share your thoughts!

