Cyclistic Bike-Share data using Excel, SQL and R Programming
In this Case study I acted as a Junior Data Analyst and worked along with some team members for a fictional company called Cyclistic. I have included the steps I followed to complete the entire data analysis process: Ask, Prepare, Process, Analyze and Share
Purpose
The bigger purpose of this project is to utilise and demonstrate my knowledge gained from the Google Data analytics Certification and in using Excel, SQL and R programming to uncover insights using the power of data. In a professional team dynamic, different objectives could be achieved using this data. For now, I will be following the prompts offered in the Case study
Scenario
I am a junior data analyst for the Chicago bike-share startup ‘Cyclistic,’ where I am a member of the marketing analyst team. The company's future prosperity, in the director of marketing's opinion, is on increasing the number of yearly memberships. As a result, my team is interested in seeing how yearly members and casual riders utilise cycling bikes differently. The marketing team will use my recommendations, based on the findings, to run the marketing plan to turn casual riders into subscribers or members. In order to get my proposals approved by Cyclistic executives, I will be supporting my recommendations with convincing data insights and expert data visualisations.
Stakeholders: Cyclistic(The company), Lily Moreno(Director of Marketing), Marketing Analytics Team and the Executive Team at Cyclistic.
Step- 1 Ask?
The following questions will guide the future marketing program:
1.? How do annual members and casual riders use Cyclistic bikes differently?
2.? Why would casual riders buy Cyclistic annual memberships?
3. ?How can Cyclistic use digital media to influence casual riders to become members?
Step- 2 Prepare?
Business Task
Analyse the most recent,? past 12 months, bikeshare data to discover trends, similarities and differences in bike uses by casual riders and member riders.
About the Data
The dataset I will be using is between April, 2023 to March, 2024. All data is publicly available data by the Motivate International Inc. here, under this licence
At first it was important to see if this data passes the ROCCC test? (Reliable, Original, Comprehensive, Current, and Cited)
Reliable: The dataset comes from the City of Chicago’s Divvy program, which is managed by a publicly traded company called Lyft Bikes and Scooters, LLC.
Original: The data is original and collected directly from the first party
Comprehensive: The data is comprehensive to achieve the business task, and answer the questions at hand.
Current: The data used is from April 2023 to March 2024, so it is current and relevant
Cited: The data has become available by Motivate International Inc, here, under this licence. The data is cited and credible.
Step- 3A Process in Excel
I first downloaded the data for only 1 month, March 2024 in Excel and wanted to look for some opportunities and use cases of this data. I did the following:?
Step- 3B Analysing Data in Excel
Basics in Excel: I started by calculating the Mean ride_length, using =Average(select entire column ride_length), Max ride_length, using ?=Max(select entire column ride_length), then found the Mode of day_of_week, using =mode(weekday(select entire column day_of_week, 1))
Trends:? I created these using Pivot tables and then creating Pivot charts for visualisations. Some trends I analysed in the month of April, 2024 in Excel included the following:?
Step- 4A Process in Google Bigquery
I used Google Bigquery to run my SQL commands and chose only the 1st Quarter of 2024 data to see some patterns and trends.?
I started by combining the 3 months of data: Jan 2024, Feb 2024, and March 2024 using UNION ALL, and while doing so I also created the following columns- ride_length column by using timestamp_diff() func & day_of_week column by using EXTRACT() function, I further used CAST() function to change it into a string.
The CAST() function in the previous step allowed me to change the values from numbers to actual days.(PS: Ignore that I left Friday, Saturday all caps)
?Once I got here I could not, not use the CAST() function and Case When on Month as well.
Step- 4A Process in Google Bigquery
I used Google Bigquery to run my SQL commands and chose only the 1st Quarter of 2024 data to see some patterns and trends.?
I started by combining the 3 months of data: Jan 2024, Feb 2024, and March 2024 using UNION ALL, and while doing so I also created the following columns- ride_length column by using timestamp_diff() func & day_of_week column by using EXTRACT() function, I further used CAST() function to change it into a string.
The CAST() function in the previous step allowed me to change the values from numbers to actual days.(PS: Ignore that I left Friday, Saturday all caps)
?Once I got here I could not, not use the CAST() function and CASE WHEN on Month as well.
Step- 4B Analysing Data in Google BigQuery
It was now time to analyse this data and get some insights using SQL. I started with the basics like Counting number of rides, Looking at ride length, Mean, then progressed into more detailed findings spread out over the Quarter, and Days of the week.
领英推荐
Step- 5A Process Data in R Console
I then wanted to analyse data for the entire year from April 2023 to March 2024 in the fastest way, with visualisations for which I had to move to R. I used R Desktop Console for this process
I started by installing the tidyverse package and relevant libraries
> install.packages('tidyverse')
> library(tidyverse)
> library(lubridate)
> library(dplyr)
> library(ggplot2)
> library(readr)
After installing packages and entering into the specific libraries I reviewed the working directory and had to set its path to the right folder. To see the current working directory path I used getwd() function
> getwd()
[1] "/Users/main"
The? path was by default? set to Main, so I changed it to the folder named Bikeshare Data on my Desktop, containing all the data sets using setwd() function
> setwd('/Users/main/Desktop/Bikeshare Data')
Then I checked if it worked using getwd() function
> getwd()
[1] "/Users/main/Desktop/Bikeshare Data"
Then I created each month as an objects with stored values which was sitting in my Bikeshare Data folder and, ? ? ?
> april_2023 <- read.csv('202304-divvy-tripdata 2.csv')
may_2023 <- read.csv('202305-divvy-tripdata.csv')
june_2023 <- read.csv('202306-divvy-tripdata.csv')
july_2023 <- read.csv('202307-divvy-tripdata 2.csv')
august_2023 <- read.csv('202308-divvy-tripdata.csv')
sep_2023 <- read.csv('202309-divvy-tripdata.csv')
oct_2023 <- read.csv('202310-divvy-tripdata.csv')
nov_2023 <- read.csv('202311-divvy-tripdata.csv')
dec_2023 <- read.csv('202312-divvy-tripdata.csv')
jan_2024 <- read.csv('202401-divvy-tripdata 2.csv')
feb_2024 <- read.csv('202402-divvy-tripdata 2.csv')
march_2024 <- read.csv('202403-divvy-tripdata 2.csv')
Compiled it all into full_data object using rbind() function,
> full_data <- rbind(april_2023, may_2023, june_2023, july_2023, august_2023, sep_2023, oct_2023, nov_2023, dec_2023, jan_2024, feb_2024, march_2024)
I then reviewed the structure of my data using str() function
> str(full_data)
It was time to now clean the data, by removing nulls using na.omit() and removing duplicates by piping %>%? down to distinct()
????????????????> new_data <- na.omit(full_data) %>% distinct()
Then I created some more columns to my dataset like day_of_week, month, and ride_length by mutating started_at for month and day_of_week, and using difftime(ended_at,started_at,units=’mins’) to get ride_length in? minutes
> new_data <- new_data %>% mutate(started_at = as.POSIXct(started_at, format = "%Y-%m-%d %H:%M:%S"),ride_length = difftime(ended_at, started_at, units = 'mins'),day_of_week = wday(started_at),month = format(started_at, "%m"))
Once this was done, I reviewed the structure of my data again using str() function, and I was ready to analyse the new_data
> str(new_data) ?
It was time to analyse the new_data, by starting with the basics like mean, counts, etc. So I checked to see the mean ride_length for casual vs member users over the year or 12 month period
> new_data %>% group_by(member_casual) %>% summarise(total_rides = n()) %>% arrange(member_casual) %>% print()
> new_data %>% group_by(member_casual, rideable_type) %>% summarise(biketype_total =?
n()) %>% print()?
> new_data %>% group_by(month, member_casual) %>% summarise(mean_ridelength = mean(ride_length)) %>%? arrange(member_casual) %>% ? print(n=24)
Then I printed the mean(ride_length) over days of the week, where 1 is Sunday, 2 is Monday and so on, ending at 7 being Saturday.
> new_data %>% group_by(day_of_week,member_casual) %>% summarise(mean_ridelength = mean(ride_length)) %>% arrange(member_casual) %>% print(n=14)
I followed the same process to get total_rides or sum of rides by riders over a month and over the week
> new_data %>% group_by(month, member_casual) %>% summarise(total_rides = n()) %>% arrange(member_casual) %>% print(n=24)
> new_data %>% group_by(day_of_week, member_casual) %>% summarise(total_rides = n()) %>% arrange(member_casual) %>% print(n=14)
Step- 5B Analysing Data in R Console
Now, I wanted to see some plots and use R’s real power and create plots to be able to visualise the data better and make recommendations
> new_data %>% group_by(member_casual, rideable_type) %>% summarise(biketype_total = n())? %>% ggplot(aes(x = rideable_type, y = biketype_total, fill = member_casual)) + geom_bar(stat = "identity", position = "dodge") + labs(x = "Bike Type", y = "Count", title = "Total Counts of Bike Types by Membership Type") + scale_fill_manual(values = c("member" = "blue", "casual" = "orange")) + theme_minimal()
> new_data %>% group_by(member_casual) %>% summarise(total_rides = n()) %>% arrange(member_casual) %>% ggplot(aes(x = member_casual, y = total_rides, fill = member_casual)) + geom_bar(stat = "identity", position = "dodge") + labs(x = "Membership Type", y = "Total Rides", title = "Total Rides by Membership Type") + scale_fill_manual(values = c("member" = "blue", "casual" = "orange")) + theme_minimal()
> new_data %>% group_by(month, member_casual) %>% summarise(total_rides = n()) %>% arrange(member_casual) %>% ggplot(aes(x = month, y = total_rides, fill = member_casual)) + geom_histogram(stat = "identity", position = "dodge") + labs(x = "Month", y = "Total Rides", title = "Total Rides by Month and Membership Type") + facet_wrap(~ member_casual, scales = "free_y", nrow = 1) + scale_fill_manual(values = c("member" = "blue", "casual" = "orange")) + theme_minimal()
> new_data %>% group_by(day_of_week, member_casual) %>% summarise(total_rides = n()) %>% arrange(member_casual) %>% ggplot(aes(x = day_of_week, y = total_rides, fill = member_casual)) + geom_histogram(stat = "identity", position = "dodge") + labs(x = "Day_of_week", y = "Total Rides", title = "Total Rides by days of the week and Membership Type") + facet_wrap(~ member_casual, scales = "free_y", nrow = 1) + scale_fill_manual(values = c("member" = "blue", "casual" = "orange")) + theme_minimal()
> new_data %>% group_by(day_of_week,member_casual) %>% summarise(mean_ridelength = mean(ride_length)) %>% arrange(member_casual) %>% ggplot(aes(x = day_of_week, y = mean_ridelength, fill = member_casual)) + geom_histogram(stat = "identity", position = "dodge") + labs(x = "Days of the week(1-Sun,2-Mon,3-Tue,4-Wed,5-Thu,6-Fri,7-Sat)", y = "Ride Length", title = "Average Ride Length over the week") + facet_wrap(~ member_casual, scales = "free_y", nrow = 1) + scale_fill_manual(values = c("member" = "blue", "casual" = "orange")) + theme_minimal()
> new_data %>% group_by(month,member_casual) %>% summarise(mean_ridelength = mean(ride_length)) %>% arrange(member_casual) %>% ggplot(aes(x = month, y = mean_ridelength, fill = member_casual)) + geom_histogram(stat = "identity", position = "dodge") + labs(x = "Months", y = "Ride Length", title = "Average Ride Length over the year") + facet_wrap(~ member_casual, scales = "free_y", nrow = 1) + scale_fill_manual(values = c("member" = "blue", "casual" = "orange")) + theme_minimal()
Step- 6 Share Findings and Insights
Some insights that I was able to generate from all the analysis are:
Step- 7 Act?
Some recommendations for the marketing team to convert casual riders into members:
Marketing l Growing l Connecting
11 个月Awesome!