21 Days to Data Challenge - What I've Learned

Introduction

Data is everywhere.

It is used in every industry, every company, and in every facet of our lives.

Did you know that when you "google" a term or phrase, data is what powers and ranks your search results?

I have a background in mathematics and statistics, but it was only in the past few years that I really got to delve into analytics in my career. I use Excel almost daily, creating reports and analyzing KPIs, but I started to wonder if there were more tools and more things I could do with my data, beyond the spreadsheets many of us use in our daily work.

I started to learn more about data analysis and the different tools that can be used to help answer questions and solve problems, and in the midst of my learning, I came across Avery Smith's LinkedIn profile. I saw he ran a #21daystodata challenge that gives an introduction to data analysis while guiding participants through their very own project, and of course I had to sign up!

In 21 days, we got an introductory look at different tools that data analysts use, and every day we received a short lesson and a challenge to work on. We used a subset of New York City's open crime data covering Q1 of 2018; the data source can be found on Kaggle.

The Mission

The course project centers on a fictional Police Commissioner of New York asking us (the data analysts) to help him with the city's crime problem. We were tasked with analyzing the crime data and providing meaningful insights so that the Commissioner has an action plan to combat the city's crime.

In order to help the Commissioner with his problem, we needed to go through several steps. The process is outlined below.

STEP 1: Define the Problem

We want to provide insights to the Commissioner as to what is driving the high crime rates, and how the issue can be remedied.

STEP 2: Examine the Data

The data we are analyzing is the 2018 Q1 subset of New York City's open crime data. It consists of 35 columns and 109,546 rows.

Taking a look at the dataset, we can see columns representing victim ages, suspect ages, the times crimes were reported, the locations of the crimes, the types of offenses committed, and much more.
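
To follow along in Python, the dataset can be loaded with pandas and inspected directly. The filename below is hypothetical, so substitute whatever your Kaggle download is called:

import pandas as pd

# Hypothetical filename for the Kaggle 2018 Q1 subset
df = pd.read_csv('nypd_crime_2018_q1.csv')

print(df.shape)             # expect (109546, 35): rows by columns
print(df.columns.tolist())  # column names such as BORO_NM, OFNS_DESC, VIC_AGE_GROUP
print(df.head())            # first five rows for a quick sanity check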

STEP 3: Clean the Data

Once I got an overall picture of the data, I needed to clean the data and add some information to help with the analysis I wanted to do later on.

Using OpenRefine, I applied the Text Facet option to the 'VIC_AGE_GROUP' column to take a closer look at the ages. I noticed there were ages that did not make sense, which could have been due to typos or copy-paste errors when the data was collected.


Several age ranges were negative or in the triple digits and needed to be fixed.

In OpenRefine, I renamed these outliers to "UNKNOWN," folding them into the already existing UNKNOWN category, after which the counts and age ranges made much more sense.


In order to save this to our dataset, though, I needed to make the same changes in Google Sheets. With so many rows, it did take a little longer, but I was able to add a filter to the 'VIC_AGE_GROUP' column and make the same changes as in OpenRefine, changing all outlier ages to "UNKNOWN."
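
For anyone reproducing this cleaning step in code instead of OpenRefine or Sheets, here is a minimal pandas sketch. It assumes the valid labels are '<18', '18-24', '25-44', '45-64', '65+', and 'UNKNOWN', so adjust the set to match what your text facet shows:

# Assumed set of valid age buckets; anything else becomes UNKNOWN
valid_ages = {'<18', '18-24', '25-44', '45-64', '65+', 'UNKNOWN'}

df['VIC_AGE_GROUP'] = df['VIC_AGE_GROUP'].where(
    df['VIC_AGE_GROUP'].isin(valid_ages), 'UNKNOWN'
)

# Verify: the negative and triple-digit "ages" should now be gone
print(df['VIC_AGE_GROUP'].value_counts())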

I also wanted to add a few more columns to aid in my analysis later on, so I added columns for the 'Hour' the crime occurred, the 'Day of the Week' it occurred, and whether it occurred on a weekend or a weekday.

To find the 'Hour' of each incident, I used the formula below, where column G holds the time the crime occurred:

=LEFT(G2, FIND(":", G2)-1)

To find the 'Day of the Week' for each row, I used this formula, where column D holds the date and WEEKDAY returns 1 for Sunday through 7 for Saturday.

=WEEKDAY(D2)

And finally, to flag whether the crime occurred on a weekend, I used the formula below, which returns 1 for a weekend and 0 for a weekday:

=IF(OR(E2=7,E2=1),1,0)        
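
The same three columns can also be derived in pandas. The sketch below assumes the date and time columns are named CMPLNT_FR_DT and CMPLNT_FR_TM, as in the NYPD open data schema; note that pandas numbers weekdays Monday=0 through Sunday=6, unlike the Sheets WEEKDAY convention above:

# Column names assumed from the NYPD open data schema
dates = pd.to_datetime(df['CMPLNT_FR_DT'], errors='coerce')
times = pd.to_datetime(df['CMPLNT_FR_TM'], format='%H:%M:%S', errors='coerce')

df['Hour'] = times.dt.hour                    # equivalent of LEFT(G2, FIND(":", G2)-1)
df['Day of the Week'] = dates.dt.day_name()   # day names instead of 1-7 codes
# pandas counts Monday=0 ... Sunday=6, so the weekend is dayofweek >= 5
df['Weekend?'] = (dates.dt.dayofweek >= 5).astype(int)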

STEP 4: Visualize the Data

I started out with Flourish, where I created a fun horizontal bar graph animation, showing the number of crimes per borough over time.

You can see the animated graph here.


I then created a dashboard using Tableau Public that includes:

  • A bar graph of crime incidents per borough;
  • A tree map showing the number of criminal offenses by charge (felony, misdemeanor, or violation);
  • A pie chart comparing percentages of victim age groups;
  • A map showing locations where the different offenses occurred;
  • And a graph comparing suspect vs. victim sex, as well as suspect vs. victim race.


To view the full dashboard in detail, you can check out the link on Tableau Public.

STEP 5: Do More In-Depth Analysis

After seeing the visualizations and having an idea of what the data is showing us, I took it a step further and did some more analysis in SQL and Python.

Using the cloud-based "CSV SQL Live" website, I uploaded our cleaned dataset and ran some queries to find out what other information I could pull.

I wanted to find the top five borough and offense combinations, using the following query:

SELECT BORO_NM, OFNS_DESC, COUNT(*) AS incident_number
FROM 'crime'
GROUP BY BORO_NM, OFNS_DESC
ORDER BY incident_number DESC
LIMIT 5;
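
For anyone following along in pandas rather than SQL, a rough equivalent of this query (my own sketch, not part of the course) is:

# Count incidents per borough/offense pair and keep the five largest
top5 = (
    df.groupby(['BORO_NM', 'OFNS_DESC'])
      .size()
      .sort_values(ascending=False)
      .head(5)
      .rename('incident_number')
)
print(top5)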

The results showed that petit larceny seems to make up the majority of offenses.


Then I wanted to find out which hours of the day had the highest and lowest numbers of reported crimes, so I used the query:

SELECT Hour, COUNT(*) AS incident_number
FROM 'crime'
GROUP BY Hour
ORDER BY incident_number DESC;

The results showed that 3 p.m. (hour 15) had the most reported crimes, and 5 a.m. (hour 5) had the fewest.


I continued my analysis using Python in the Google Colab environment. I began by importing the Pandas library and determining which day of the week had the most reported crimes. Here is the code I used:

df['day_of_week'].value_counts()        

The output showed that Friday had the most reported crimes.
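
One caveat: if the column still holds the numeric 1-7 codes produced by the WEEKDAY formula, a quick mapping (hypothetical, mirroring the Sunday=1 through Saturday=7 convention used earlier) makes the counts easier to read:

# Map the Sheets WEEKDAY codes (1=Sunday ... 7=Saturday) to readable names
weekday_names = {1: 'Sunday', 2: 'Monday', 3: 'Tuesday', 4: 'Wednesday',
                 5: 'Thursday', 6: 'Friday', 7: 'Saturday'}
print(df['day_of_week'].map(weekday_names).value_counts())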

After that, I imported the Seaborn library and used it to create a few more data visualizations; a sketch of one such plot follows.

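The original screenshots aren't reproduced here, but as one example of the kind of plot Seaborn makes easy, a count plot of reported incidents per borough might look like the following sketch (the styling is my own, not the course's):

import matplotlib.pyplot as plt
import seaborn as sns

# Bar chart of reported incidents per borough, largest first
plt.figure(figsize=(10, 5))
sns.countplot(data=df, x='BORO_NM', order=df['BORO_NM'].value_counts().index)
plt.title('Reported Crimes per Borough, Q1 2018')
plt.xlabel('Borough')
plt.ylabel('Number of Incidents')
plt.show()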

STEP 6: Results

Analyzing this data for the Commissioner produced several takeaways:

  1. Most crime occurred on Fridays, and not on weekends as one might assume.
  2. Most crime occurred in the afternoons, around 3pm, and the least at 5am. This is another surprising result, as one would think most crime occurs in the evenings. Could this just be a result of more crimes being reported in the afternoons than at night, instead of reflecting the actual number of crimes that occurred? Something to think about.
  3. Most crime involved suspects and victims in the 25-44 age range.
  4. Black-on-black and male-on-male crime comprised the majority of reported incidents.
  5. Brooklyn had the highest reported crimes.
  6. Most reported offenses were misdemeanors (petit larceny, specifically).

As a result of our analyses, we can advise the Commissioner that assigning more police officers to the Brooklyn area during the afternoon hours may help curb the amount of crime.

What I Learned From Taking This Course

  1. There are so many different tools out there for a data analyst, beyond just Excel and Google Sheets.
  2. Excel and Google Sheets work well for small and moderate amounts of data, but once a dataset reaches tens or hundreds of thousands of rows, spreadsheets can become very slow at processing functions. For larger datasets, tools such as SQL, Python, and other programming languages are faster and more efficient for analysis.
  3. Data visualizations are extremely important in being able to show your results to your audience. Make sure the visualizations make sense, are user friendly, and help solve the problem in question.

Conclusion

This 21-day challenge was a great experience, and I learned a lot of new skills. I got to use Flourish, OpenRefine, SQL, Tableau, and Python, and was able to construct a full data analysis project to showcase those new skills. Though I am still in the early stages of learning programming languages like Python, I was able to see how powerful they can be for taking a deeper dive into a large amount of data. I found out how much I enjoy using SQL and creating data visualizations, and I will definitely continue improving my data analysis skills.

I can't wait to work on more projects and add some more data analysis tools to my skillset, so keep an eye out for more articles.

If you have any suggestions or feedback on this project, please connect with me on LinkedIn and share your thoughts!

