21 Days to Data Challenge - What I've Learned
Sarah Rajani, CTP
'Data with Sarah' -- Data Analyst || Proficient in SQL, Excel, Tableau, Power BI, Looker, and Looker Studio || Turning Complex Data into Actionable Insights
Introduction
Data is everywhere.
It is used in every industry, every company, and in every facet of our lives.
Did you know that when you "google" a term or phrase, data is powering your search results?
I have a background in mathematics and statistics, but it was only in the past few years that I really got to delve into analytics in my career. I use Excel almost daily, creating reports and analyzing KPIs, but I started to wonder if there were more tools and more things I could do with my data, beyond the spreadsheets many of us use in our daily work.
I started to learn more about data analysis and the different tools that can be used to help answer questions and solve problems, and in the midst of my learning, I came across Avery Smith's LinkedIn profile. I saw he had a #21daystodata challenge where he gives an introduction to data analysis while guiding us through our very own project, and of course I had to sign up!
In 21 days, we got an introductory look at different tools that data analysts use, and every day we got a short lesson and challenge to work on. We used a subset of New York City's open-source crime data for Q1 of 2018. The data source can be found on Kaggle:
The Mission
The course project is based around a fictional Police Commissioner of New York asking us (the data analysts) to help him with the city's crime problem. We were tasked with analyzing the crime data and providing some meaningful insights to the Commissioner so that he has an action plan to combat the city's crime.
In order to help the Commissioner with his problem, we needed to go through several steps. The process is outlined below.
STEP 1: Define the Problem
We want to provide insights to the Commissioner as to what is driving the high crime rates, and how the issue can be remedied.
STEP 2: Examine the Data
The data we are analyzing comes from a subset of the NYC open source crime data, of which we are using the 2018 Q1 subset. The data consists of 35 columns and 109,546 rows.
Taking a look at the dataset, we can see there are columns representing victim ages, suspect ages, the times crimes were reported, the locations of the crimes, the types of crimes committed, and much more.
STEP 3: Clean the Data
Once I got an overall picture of the data, I needed to clean the data and add some information to help with the analysis I wanted to do later on.
In OpenRefine, I applied the Text Facet option to the 'VIC_AGE_GROUP' column to take a closer look at the ages. I noticed there were values that do not make sense, which could have been due to typos or copy-paste errors when the data was collected.
Several age values were negative or in the triple digits and needed to be fixed.
In OpenRefine, I renamed the outliers as "UNKNOWN" to put them all into the already existing UNKNOWN category. The counts and age ranges now make more sense:
In order to save this to our dataset, though, I needed to make the same changes in Google Sheets. With so many rows, it did take a little longer, but I was able to add a filter to the 'VIC_AGE_GROUP' column and make the same changes as in OpenRefine, changing all outlier ages to "UNKNOWN."
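For readers working in Python rather than OpenRefine or Sheets, the same cleanup can be sketched with pandas. The mini-sample below is hypothetical, and the set of valid age buckets is assumed from the dataset's categories:

```python
import pandas as pd

# Hypothetical mini-sample of the 'VIC_AGE_GROUP' column; the real column
# uses buckets like '<18', '18-24', '25-44', '45-64', '65+', and 'UNKNOWN'.
df = pd.DataFrame({"VIC_AGE_GROUP": ["25-44", "-971", "18-24", "954", "UNKNOWN"]})

# Keep only the expected age buckets; every outlier becomes 'UNKNOWN',
# folding it into the already existing UNKNOWN category.
valid_groups = {"<18", "18-24", "25-44", "45-64", "65+", "UNKNOWN"}
df["VIC_AGE_GROUP"] = df["VIC_AGE_GROUP"].where(
    df["VIC_AGE_GROUP"].isin(valid_groups), "UNKNOWN"
)
```

This mirrors the faceted edit in OpenRefine: anything outside the known buckets is collapsed into one catch-all value rather than deleted.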
I also wanted to add a few more columns to aid in my analysis later on, so I added columns for the 'Hour' that the crime occurred, the 'Day of the Week' that the crime occurred, and whether it occurred on a weekend or a weekday.
To find the 'Hour' of each incident, I used the formula:
=LEFT(G2, FIND(":",G2)-1)
To find the 'Day of the Week' for each row, I used the following formula, where 1 is Sunday and 7 is Saturday:
=weekday(D2)
And finally, to flag whether the crime occurred on a weekend, I used the formula below, where 1 is a weekend and 0 is a weekday:
=IF(OR(E2=7,E2=1),1,0)
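The same three derived columns can be sketched in pandas, assuming the dataset's CMPLNT_FR_TM and CMPLNT_FR_DT fields hold the complaint time and date (the sample rows here are hypothetical):

```python
import pandas as pd

# Hypothetical rows standing in for the time and date columns.
df = pd.DataFrame({
    "CMPLNT_FR_TM": ["15:30:00", "05:10:00"],
    "CMPLNT_FR_DT": ["01/06/2018", "01/08/2018"],  # a Saturday and a Monday
})

# Hour: everything before the first ':' (mirrors LEFT/FIND in Sheets)
df["Hour"] = df["CMPLNT_FR_TM"].str.split(":").str[0].astype(int)

# Day of week, matching Sheets' WEEKDAY(): 1 = Sunday ... 7 = Saturday.
# pandas uses Monday = 0 ... Sunday = 6, so we shift and wrap.
dt = pd.to_datetime(df["CMPLNT_FR_DT"], format="%m/%d/%Y")
df["day_of_week"] = (dt.dt.dayofweek + 1) % 7 + 1

# Weekend flag: 1 for Saturday (7) or Sunday (1), 0 otherwise
df["weekend"] = df["day_of_week"].isin([1, 7]).astype(int)
```

Note the small translation step: pandas numbers weekdays from Monday = 0, so the modular shift is needed to reproduce the Sunday = 1 convention used in the Sheets formulas.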
STEP 4: Visualize the Data
I started out with Flourish, where I created a fun horizontal bar graph animation, showing the number of crimes per borough over time.
You can see the animated graph here.
I then created a dashboard using Tableau Public that includes:
To view the full dashboard in detail, you can check out the link in Tableau here:
STEP 5: Do More In-Depth Analysis
After seeing the visualizations and having an idea of what the data is showing us, I took it a step further and did some more analysis in SQL and Python.
Using the cloud-based "CSV SQL Live" website, I uploaded our cleaned dataset and ran some queries to find out what other information I could pull.
I wanted to find out the top 5 offenses and borough combinations, using the following query:
SELECT BORO_NM, OFNS_DESC, COUNT(*) AS incident_number
FROM 'crime'
GROUP BY BORO_NM, OFNS_DESC
ORDER BY incident_number DESC
LIMIT 5
The results are below, which show that Petit Larceny accounts for the largest share of offenses.
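The same aggregation can be reproduced in pandas as well; the tiny frame below is illustrative, not the real ~109k-row dataset:

```python
import pandas as pd

# Illustrative sample of the borough and offense columns.
df = pd.DataFrame({
    "BORO_NM":   ["BROOKLYN", "BROOKLYN", "MANHATTAN", "BROOKLYN"],
    "OFNS_DESC": ["PETIT LARCENY", "PETIT LARCENY", "HARRASSMENT 2", "ASSAULT 3"],
})

# Equivalent of the SQL query: group, count, sort descending, take the top 5
top5 = (
    df.groupby(["BORO_NM", "OFNS_DESC"])
      .size()
      .reset_index(name="incident_number")
      .sort_values("incident_number", ascending=False)
      .head(5)
)
```

Each clause of the query maps directly: GROUP BY becomes groupby, COUNT(*) becomes size, ORDER BY ... DESC becomes sort_values, and LIMIT 5 becomes head(5).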
Then I wanted to find out what hours in the day had the highest and lowest numbers of reported crimes, so I used the query:
SELECT Hour, COUNT(*)
FROM 'crime'
GROUP BY Hour
ORDER BY COUNT(*) DESC
The results showed that 3pm (i.e. hour 15) had the highest number of reported crimes, and 5am (hour 5) had the lowest.
I continued my analysis using Python in the Google Colab environment. I began by importing the Pandas library and determining which day of the week had the highest reported crimes. Here is the code I used:
import pandas as pd
df = pd.read_csv('crime.csv')  # the cleaned dataset (filename assumed)
df['day_of_week'].value_counts()
And here are the results:
After that, I imported the Seaborn library into Python and used it to create a few more data visualizations. Here are the results and the code used:
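As a sketch of the kind of Seaborn chart involved, a bar chart of reported crimes per day of the week could look like this (the sample data is hypothetical; the real plots ran on the full cleaned dataset):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical sample of the derived day_of_week column (1 = Sunday ... 7 = Saturday)
df = pd.DataFrame({"day_of_week": [2, 2, 3, 6, 6, 6, 7]})

# One bar per day, counting reported crimes
ax = sns.countplot(data=df, x="day_of_week")
ax.set(xlabel="Day of week", ylabel="Reported crimes")
plt.savefig("crimes_by_day.png")
```

countplot does the value_counts-style tally itself, so the raw column can be passed straight in without pre-aggregating.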
STEP 6: Results
There were several takeaways obtained from analyzing this data for the Commissioner:
As a result of our analyses, we can advise the Commissioner that adding more police officers to the Brooklyn area during the afternoon hours may help curb crime.
What I Learned From Taking This Course
Conclusion
This 21-day challenge was a great experience, and I learned a lot of new skills. I got to use Flourish, OpenRefine, SQL, Tableau, and Python, and was able to construct a full data analysis project to showcase my new skills. Though I am still in the learning stages of programming languages like Python, I was able to see how powerful they can be for taking a deeper dive into a large amount of data. I found out how much I enjoy using SQL and creating data visualizations, and will definitely continue improving upon my current data analysis skills.
I can't wait to work on more projects and add some more data analysis tools to my skillset, so keep an eye out for more articles.
If you have any suggestions or feedback on this project, please connect with me on LinkedIn and share your thoughts!