21 Days to Data Challenge: Complete?
Introduction
In every new role, it didn’t take long before I was tasked with creating or managing spreadsheets. While it wasn’t in my job description, I enjoyed the challenge, and others noticed. Though no one crowned me Queen of Excel, my knack for turning whiteboards into dashboards often sparked “Who made that?!” moments, and I strategically positioned myself as the go-to problem solver for data organization and visualization.
Recently, I participated in the 21 Days to Data Challenge to deepen my skills in analytics, a field where I’ve found both passion and purpose. The experience introduced me to tools like OpenRefine, Tableau, and Python’s seaborn library, while reinforcing my SQL and data wrangling capabilities. Avery Smith’s advice stuck with me:
"The best project is the one you finish..."
That quote evolved into a mindset that propelled me to dig deeper into the data and push past technical challenges.
The challenge provided daily lessons, exercises, and almost-daily LinkedIn shares to complete, and it was an incredible experience.
What I Learned
Key Analysis Takeaways
The Process
Step 1: Define the Problem or Question
Provide data-backed insights to help the Police Commissioner reduce crime in NYC’s “Data District.”
Step 2: Collect & Examine the Data
The subset of New York City's open-source reported crime data can be found here, and a data dictionary with further explanations of the terms can be found here.
The dataset contained over 109,000 rows of crime reports, including dates, times, locations, and details about victims and suspects.
Step 3: Clean the Data
After importing the dataset into OpenRefine, I took my time getting familiar with the data. This helps me identify errors and pinpoint anything that seems illogical.
With a Text Facet applied to the 'VIC_AGE_GROUP' column, the granular view of every distinct value is shown on the left-hand side.
Several outliers appear as negative or triple-digit numbers meant to represent the victim's age. I renamed them to fall into the UNKNOWN category; although they may have been typos, I can't assume, so the logical step is to mark them as UNKNOWN.
Next, I checked that the dates were properly formatted and, NOT to my surprise, the CMPLNT_TO_DT column had a data type of TEXT. Common mistake, simple fix.
In the column header, I selected the drop-down > Edit Cells > Common Transforms > To date. This took care of the entire column.
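For anyone who would rather make the same two fixes in Python, here is a minimal pandas sketch. The valid age buckets are my assumption based on the standard NYPD groupings (check them against your own facet results), and the file name mirrors the one used in the Colab section below.

import pandas as pd

df = pd.read_excel('data_district_crime.xlsx')

# Assumed valid NYPD age buckets; anything else (negative or triple-digit "ages") becomes UNKNOWN
valid_ages = {'<18', '18-24', '25-44', '45-64', '65+', 'UNKNOWN'}
df['VIC_AGE_GROUP'] = df['VIC_AGE_GROUP'].where(df['VIC_AGE_GROUP'].isin(valid_ages), 'UNKNOWN')

# Equivalent of OpenRefine's Common Transforms > To date: coerce the TEXT column to datetime
df['CMPLNT_TO_DT'] = pd.to_datetime(df['CMPLNT_TO_DT'], errors='coerce')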
Now I'm not going to sit here and give screenshots all day because cleaning is a PROCESS....but one more because, C'MON!!!!
Can you see what the problem is? If not, that's what you have me here for! [hint: Look at the years...EEEEK!]
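If the facet view doesn't make it jump out, a quick pandas check along these lines (assuming the same column names) surfaces any year that falls outside the reporting window:

import pandas as pd

df = pd.read_excel('data_district_crime.xlsx')

# Count reports per year; a year centuries outside the reporting window is a red flag
df['CMPLNT_FR_DT'] = pd.to_datetime(df['CMPLNT_FR_DT'], errors='coerce')
print(df['CMPLNT_FR_DT'].dt.year.value_counts().sort_index())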
I wanted to pull the hour of the day from the timestamps on the complaint reports. To do that, I imported the data into Google Sheets [I'm familiar with Microsoft Excel, and we're on an adventure here... let's explore].
I added a column to the right of the CMPLNT_FR_DT column and labeled it HOUR_OF_DAY.
=LEFT(F2,FIND(":",F2)-1)
After splitting and aggregating, this HOUR_OF_DAY column will eventually provide a count of incident occurrences by hour across the dataset's timeline.
Another interesting insight could come from the day of the week, including whether an incident fell on a weekend, so that required two more new columns and formulas; a pandas equivalent of all three derived columns follows after the formulas.
=WEEKDAY(D2)
=IF(OR(E2=7,E2=1),1,0)
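For reference, here is a rough pandas equivalent of the three derived columns. The CMPLNT_FR_TM time column is my assumption for where the timestamps live; note that Sheets' WEEKDAY numbers days 1 (Sunday) through 7 (Saturday), while pandas numbers them 0 (Monday) through 6 (Sunday).

import pandas as pd

df = pd.read_excel('data_district_crime.xlsx')
df['CMPLNT_FR_DT'] = pd.to_datetime(df['CMPLNT_FR_DT'], errors='coerce')

# =LEFT(F2,FIND(":",F2)-1): everything before the first colon is the hour
df['HOUR_OF_DAY'] = df['CMPLNT_FR_TM'].astype(str).str.split(':').str[0]

# =WEEKDAY(D2), shifted to pandas numbering (0 = Monday ... 6 = Sunday)
df['WEEKDAY'] = df['CMPLNT_FR_DT'].dt.dayofweek

# =IF(OR(E2=7,E2=1),1,0): flag Saturdays (5) and Sundays (6) as weekend
df['IS_WEEKEND'] = df['WEEKDAY'].isin([5, 6]).astype(int)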
Step 4: Analyze & Visualize & Analyze some more...
I started with a racing chart in Flourish to track the daily crime reports per borough. I then wanted a visual of the total incident counts per borough, so I threw the stats into a column chart.
Next, I used SQL queries on the CSV SQL LIVE cloud platform to find the specific parameters requested by the Commissioner. The priority was to retrieve the counts of the top 5 borough and offense combinations.
SELECT BORO_NM, OFNS_DESC, COUNT(*) AS incident_number
FROM "crime_csv"
GROUP BY BORO_NM, OFNS_DESC
ORDER BY incident_number DESC
LIMIT 5
Petit larceny is the most frequent offense, but the count in Manhattan is significantly higher than in Brooklyn, which ranked highest for overall reported incidents. There may be something to look further into there; I circle back to it in the Colab section below.
Next, I imported the dataset into Google Colab to manipulate it and see what I could see with Python [code snippets below - do not run all at once; break them up line by line to follow my methods].
import pandas as pd

df = pd.read_excel('data_district_crime.xlsx')
df.head()                                        # peek at the first five rows
df.describe()                                    # summary statistics for the numeric columns
df[['HOUR', 'BORO_NM']]                          # select just the hour and borough columns
df.loc[10:15]                                    # inspect rows 10 through 15
df['BORO_NM'].value_counts()                     # report counts per borough
df[df['BORO_NM'] == 'BROOKLYN']['HOUR'].count()  # number of Brooklyn reports with an hour recorded
df.columns                                       # list every column name
df.info()                                        # data types and non-null counts

# Coerce the date column to datetime first, since the .dt accessor only works on true datetimes
df['CMPLNT_TO_DT'] = pd.to_datetime(df['CMPLNT_TO_DT'], errors='coerce')
df['MONTH'] = df['CMPLNT_TO_DT'].dt.month        # pull the month number from each complaint date
df.head(10)
df['MONTH'].value_counts()                       # report counts per month
df['DAY_OF_WEEK'] = df['CMPLNT_TO_DT'].dt.day_name()
df['DAY_OF_WEEK'].value_counts()                 # report counts per weekday
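Since the data was already loaded here, this was also a natural place to chase the petit-larceny question from the SQL step; the offense label is my assumption based on the earlier query results.

import pandas as pd

df = pd.read_excel('data_district_crime.xlsx')  # same df as in the snippets above

# Petit larceny reports per borough (offense label assumed from the SQL results)
larceny = df[df['OFNS_DESC'] == 'PETIT LARCENY']
print(larceny['BORO_NM'].value_counts())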
I also incorporated some charts via the seaborn library.
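For example, a couple of quick count plots along these lines (the HOUR and BORO_NM columns match the snippets above; the styling choices are just mine):

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_excel('data_district_crime.xlsx')

# Incident counts by hour of day
sns.countplot(data=df, x='HOUR', color='steelblue')
plt.title('Reported Incidents by Hour of Day')
plt.show()

# Incident counts by borough, largest first
sns.countplot(data=df, y='BORO_NM', order=df['BORO_NM'].value_counts().index)
plt.title('Reported Incidents by Borough')
plt.show()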
Finally, I created an interactive dashboard in Tableau; the full, clickable version is available on my Tableau Public profile.
Step 5: Interpret and Apply [See Key Analysis Takeaways]
Conclusion
Overall, I learned a plethora of new skills and platforms to keep tinkering with, and I met a lot of great people who I now feel form a tiny data community that will hold one another accountable going forward.
The experience pushed me and guided me toward new things I may or may not have learned on my own. It was nice to have a motivated and knowledgeable guide.
Thank you, Avery Smith! Also, shout out to Catherine | Chris French | Sarah Rajani | Ashley Zacharias | Kimberly Ouimette | James Charest! We did it!