Cyclistic Bike-Share data using Excel, SQL and R Programming

Cyclistic Bike-Share data using Excel, SQL and R Programming

In this Case study I acted as a Junior Data Analyst and worked along with some team members for a fictional company called Cyclistic. I have included the steps I followed to complete the entire data analysis process: Ask, Prepare, Process, Analyze and Share

Purpose

The bigger purpose of this project is to utilise and demonstrate my knowledge gained from the Google Data analytics Certification and in using Excel, SQL and R programming to uncover insights using the power of data. In a professional team dynamic, different objectives could be achieved using this data. For now, I will be following the prompts offered in the Case study

Scenario

I am a junior data analyst for the Chicago bike-share startup ‘Cyclistic,’ where I am a member of the marketing analyst team. The company's future prosperity, in the director of marketing's opinion, is on increasing the number of yearly memberships. As a result, my team is interested in seeing how yearly members and casual riders utilise cycling bikes differently. The marketing team will use my recommendations, based on the findings, to run the marketing plan to turn casual riders into subscribers or members. In order to get my proposals approved by Cyclistic executives, I will be supporting my recommendations with convincing data insights and expert data visualisations.

Stakeholders: Cyclistic(The company), Lily Moreno(Director of Marketing), Marketing Analytics Team and the Executive Team at Cyclistic.


Step- 1 Ask?

The following questions will guide the future marketing program:

1.? How do annual members and casual riders use Cyclistic bikes differently?

2.? Why would casual riders buy Cyclistic annual memberships?

3. ?How can Cyclistic use digital media to influence casual riders to become members?

Step- 2 Prepare?

Business Task

Analyse the most recent,? past 12 months, bikeshare data to discover trends, similarities and differences in bike uses by casual riders and member riders.

About the Data

The dataset I will be using is between April, 2023 to March, 2024. All data is publicly available data by the Motivate International Inc. here, under this licence

At first it was important to see if this data passes the ROCCC test? (Reliable, Original, Comprehensive, Current, and Cited)

Reliable: The dataset comes from the City of Chicago’s Divvy program, which is managed by a publicly traded company called Lyft Bikes and Scooters, LLC.

Original: The data is original and collected directly from the first party

Comprehensive: The data is comprehensive to achieve the business task, and answer the questions at hand.

Current: The data used is from April 2023 to March 2024, so it is current and relevant

Cited: The data has become available by Motivate International Inc, here, under this licence. The data is cited and credible.

Step- 3A Process in Excel

I first downloaded the data for only 1 month, March 2024 in Excel and wanted to look for some opportunities and use cases of this data. I did the following:?

  1. Data Cleaning: Removed Duplicates, and irrelevant columns like Station IDs, Station names, Latitude and Longitudes; Filtered for NULL values and hid or removed any irrelavant rows or fields; Trimmed any spaces in any columns; Replaced Null values with relevant fields if the information was available


Before cleaning (March 2024 dataset)
After cleaning (March 2024 dataset)

  1. Reviewing and Separating data type: Making sure all data is in the right data type(TEXT, DATE, TIME, etc) to avoid problems during calculations and Using text to columns to separate date and time columns

  1. Creating Calculated Fields: I added a ride_length column calculated by the following formula =INT(MINUTE(ended_at - started_at)), which gave me the ride length for each ride_id


With Calculated Field

  1. Adding a Column: day_of_week column was created by using =weekday(started_at), then used custom under data type, creating type ‘dddd,’ which gives the day

Before Changing data type(March 2024 dataset)
After Changing data type using custom,’dddd’ (March 2024 dataset)

Step- 3B Analysing Data in Excel

Basics in Excel: I started by calculating the Mean ride_length, using =Average(select entire column ride_length), Max ride_length, using ?=Max(select entire column ride_length), then found the Mode of day_of_week, using =mode(weekday(select entire column day_of_week, 1))

Trends:? I created these using Pivot tables and then creating Pivot charts for visualisations. Some trends I analysed in the month of April, 2024 in Excel included the following:?

Count of Ride_id in Values, Member_casual in Legends, day_of_week in Axis(March 2024 dataset)


Average ride_length in Values, Member_casual in Legends, day_of_week in Axis(March 2024 dataset)


Count of Ride_id in Values, Member_casual in Legends, rideable_type in Axis(March 2024 dataset)

Step- 4A Process in Google Bigquery

I used Google Bigquery to run my SQL commands and chose only the 1st Quarter of 2024 data to see some patterns and trends.?


I started by combining the 3 months of data: Jan 2024, Feb 2024, and March 2024 using UNION ALL, and while doing so I also created the following columns- ride_length column by using timestamp_diff() func & day_of_week column by using EXTRACT() function, I further used CAST() function to change it into a string.

  1. SELECT ride_id,rideable_type,started_at,ended_at,timestamp_diff(ended_at,started_at,MINUTE) AS ride_length, start_station_name, member_casual, CAST(EXTRACT(dayofweek FROM started_at) AS STRING)AS day_of_week FROM model-cirrus-421301.Cyclistic_bikeshare.2024_01 UNION ALL SELECT ride_id,rideable_type,started_at,ended_at,timestamp_diff(ended_at,started_at,MINUTE) AS ride_length, start_station_name, member_casual, CAST(EXTRACT(dayofweek FROM started_at) AS STRING)AS day_of_week FROM model-cirrus-421301.Cyclistic_bikeshare.2024_02 UNION ALL SELECT ride_id,rideable_type,started_at,ended_at,timestamp_diff(ended_at,started_at,MINUTE) AS ride_length, start_station_name, member_casual, CAST(EXTRACT(dayofweek FROM started_at) AS STRING)AS day_of_week FROM model-cirrus-421301.Cyclistic_bikeshare.2024_03


The CAST() function in the previous step allowed me to change the values from numbers to actual days.(PS: Ignore that I left Friday, Saturday all caps)


  1. SELECT ride_id,rideable_type,started_at, ended_at, ride_length,member_casual,CASE day_of_week WHEN '1' THEN 'Sunday' WHEN '2' THEN 'Monday' WHEN '3' THEN 'Tuesday' WHEN '4' THEN 'Wednesday' WHEN '5' THEN 'Thursday' WHEN '6' THEN 'FRIDAY' ELSE 'SATURDAY' END AS day_of_week_final FROM model-cirrus-421301.Cyclistic_bikeshare.2024_Q1_V0


?Once I got here I could not, not use the CAST() function and Case When on Month as well.


  1. SELECT ride_id,rideable_type,started_at, ended_at, ride_length,member_casual, CAST(EXTRACT(Month FROM started_at) AS STRING) AS Month FROM model-cirrus-421301.Cyclistic_bikeshare.2024_Q1_V0


  1. SELECT ride_id,rideable_type,started_at, ended_at, ride_length,member_casual,CASE Month WHEN '1' THEN 'January' WHEN '2' THEN 'February' ELSE 'March' END AS month_final FROM model-cirrus-421301.Cyclistic_bikeshare.2024_Q1_monthly

Step- 4A Process in Google Bigquery

I used Google Bigquery to run my SQL commands and chose only the 1st Quarter of 2024 data to see some patterns and trends.?


I started by combining the 3 months of data: Jan 2024, Feb 2024, and March 2024 using UNION ALL, and while doing so I also created the following columns- ride_length column by using timestamp_diff() func & day_of_week column by using EXTRACT() function, I further used CAST() function to change it into a string.

  1. SELECT ride_id,rideable_type,started_at,ended_at,timestamp_diff(ended_at,started_at,MINUTE) AS ride_length, start_station_name, member_casual, CAST(EXTRACT(dayofweek FROM started_at) AS STRING)AS day_of_week FROM model-cirrus-421301.Cyclistic_bikeshare.2024_01 UNION ALL SELECT ride_id,rideable_type,started_at,ended_at,timestamp_diff(ended_at,started_at,MINUTE) AS ride_length, start_station_name, member_casual, CAST(EXTRACT(dayofweek FROM started_at) AS STRING)AS day_of_week FROM model-cirrus-421301.Cyclistic_bikeshare.2024_02 UNION ALL SELECT ride_id,rideable_type,started_at,ended_at,timestamp_diff(ended_at,started_at,MINUTE) AS ride_length, start_station_name, member_casual, CAST(EXTRACT(dayofweek FROM started_at) AS STRING)AS day_of_week FROM model-cirrus-421301.Cyclistic_bikeshare.2024_03


The CAST() function in the previous step allowed me to change the values from numbers to actual days.(PS: Ignore that I left Friday, Saturday all caps)


  1. SELECT ride_id,rideable_type,started_at, ended_at, ride_length,member_casual,CASE day_of_week WHEN '1' THEN 'Sunday' WHEN '2' THEN 'Monday' WHEN '3' THEN 'Tuesday' WHEN '4' THEN 'Wednesday' WHEN '5' THEN 'Thursday' WHEN '6' THEN 'FRIDAY' ELSE 'SATURDAY' END AS day_of_week_final FROM model-cirrus-421301.Cyclistic_bikeshare.2024_Q1_V0


?Once I got here I could not, not use the CAST() function and CASE WHEN on Month as well.


  1. SELECT ride_id,rideable_type,started_at, ended_at, ride_length,member_casual, CAST(EXTRACT(Month FROM started_at) AS STRING) AS Month FROM model-cirrus-421301.Cyclistic_bikeshare.2024_Q1_V0


  1. SELECT ride_id,rideable_type,started_at, ended_at, ride_length,member_casual,CASE Month WHEN '1' THEN 'January' WHEN '2' THEN 'February' ELSE 'March' END AS month_final FROM model-cirrus-421301.Cyclistic_bikeshare.2024_Q1_monthly


Step- 4B Analysing Data in Google BigQuery

It was now time to analyse this data and get some insights using SQL. I started with the basics like Counting number of rides, Looking at ride length, Mean, then progressed into more detailed findings spread out over the Quarter, and Days of the week.


  1. SELECT COUNT(ride_id) AS number_of_rides, member_casual, rideable_type FROM model-cirrus-421301.Cyclistic_bikeshare.2024_Q1_V0 GROUP BY rideable_type, member_casual

  1. SELECT MAX(ride_length) AS Max_ride_length, member_casual, rideable_type FROM model-cirrus-421301.Cyclistic_bikeshare.2024_Q1_V0 GROUP BY rideable_type, member_casual



  1. SELECT member_casual, rideable_type, avg(ride_length) AS mean_ride_length FROM model-cirrus-421301.Cyclistic_bikeshare.2024_Q1_V01Month GROUP BY rideable_type, member_casual ORDER BY member_casual

  1. SELECT day_of_week_final, member_casual, count(ride_id) as number_of_rides, avg(ride_length)as avg_ride_length, max(ride_length) as max_ride_length FROM model-cirrus-421301.Cyclistic_bikeshare.2024_Q1_V01 Group by member_casual, day_of_week_final


  1. SELECT month_final, member_casual, count(ride_id) as number_of_rides, avg(ride_length)as avg_ride_length, max(ride_length) as max_ride_length FROM model-cirrus-421301.Cyclistic_bikeshare.2024_Q1_V01Month Group by member_casual, month_final


Step- 5A Process Data in R Console

I then wanted to analyse data for the entire year from April 2023 to March 2024 in the fastest way, with visualisations for which I had to move to R. I used R Desktop Console for this process


I started by installing the tidyverse package and relevant libraries

> install.packages('tidyverse')

> library(tidyverse)

> library(lubridate)

> library(dplyr)

> library(ggplot2)

> library(readr)

After installing packages and entering into the specific libraries I reviewed the working directory and had to set its path to the right folder. To see the current working directory path I used getwd() function


> getwd()

[1] "/Users/main"

The? path was by default? set to Main, so I changed it to the folder named Bikeshare Data on my Desktop, containing all the data sets using setwd() function


> setwd('/Users/main/Desktop/Bikeshare Data')

Then I checked if it worked using getwd() function


> getwd()

[1] "/Users/main/Desktop/Bikeshare Data"

Then I created each month as an objects with stored values which was sitting in my Bikeshare Data folder and, ? ? ?

> april_2023 <- read.csv('202304-divvy-tripdata 2.csv')

may_2023 <- read.csv('202305-divvy-tripdata.csv')

june_2023 <- read.csv('202306-divvy-tripdata.csv')

july_2023 <- read.csv('202307-divvy-tripdata 2.csv')

august_2023 <- read.csv('202308-divvy-tripdata.csv')

sep_2023 <- read.csv('202309-divvy-tripdata.csv')

oct_2023 <- read.csv('202310-divvy-tripdata.csv')

nov_2023 <- read.csv('202311-divvy-tripdata.csv')

dec_2023 <- read.csv('202312-divvy-tripdata.csv')

jan_2024 <- read.csv('202401-divvy-tripdata 2.csv')

feb_2024 <- read.csv('202402-divvy-tripdata 2.csv')

march_2024 <- read.csv('202403-divvy-tripdata 2.csv')


Compiled it all into full_data object using rbind() function,

> full_data <- rbind(april_2023, may_2023, june_2023, july_2023, august_2023, sep_2023, oct_2023, nov_2023, dec_2023, jan_2024, feb_2024, march_2024)

I then reviewed the structure of my data using str() function

> str(full_data)

It was time to now clean the data, by removing nulls using na.omit() and removing duplicates by piping %>%? down to distinct()


????????????????> new_data <- na.omit(full_data) %>% distinct()

Then I created some more columns to my dataset like day_of_week, month, and ride_length by mutating started_at for month and day_of_week, and using difftime(ended_at,started_at,units=’mins’) to get ride_length in? minutes

> new_data <- new_data %>% mutate(started_at = as.POSIXct(started_at, format = "%Y-%m-%d %H:%M:%S"),ride_length = difftime(ended_at, started_at, units = 'mins'),day_of_week = wday(started_at),month = format(started_at, "%m"))


Once this was done, I reviewed the structure of my data again using str() function, and I was ready to analyse the new_data

> str(new_data) ?

It was time to analyse the new_data, by starting with the basics like mean, counts, etc. So I checked to see the mean ride_length for casual vs member users over the year or 12 month period

> new_data %>% group_by(member_casual) %>% summarise(total_rides = n()) %>% arrange(member_casual) %>% print()

> new_data %>% group_by(member_casual, rideable_type) %>% summarise(biketype_total =?

n()) %>% print()?

> new_data %>% group_by(month, member_casual) %>% summarise(mean_ridelength = mean(ride_length)) %>%? arrange(member_casual) %>% ? print(n=24)

Then I printed the mean(ride_length) over days of the week, where 1 is Sunday, 2 is Monday and so on, ending at 7 being Saturday.


> new_data %>% group_by(day_of_week,member_casual) %>% summarise(mean_ridelength = mean(ride_length)) %>% arrange(member_casual) %>% print(n=14)

I followed the same process to get total_rides or sum of rides by riders over a month and over the week

> new_data %>% group_by(month, member_casual) %>% summarise(total_rides = n()) %>% arrange(member_casual) %>% print(n=24)

> new_data %>% group_by(day_of_week, member_casual) %>% summarise(total_rides = n()) %>% arrange(member_casual) %>% print(n=14)


Step- 5B Analysing Data in R Console

Now, I wanted to see some plots and use R’s real power and create plots to be able to visualise the data better and make recommendations

> new_data %>% group_by(member_casual, rideable_type) %>% summarise(biketype_total = n())? %>% ggplot(aes(x = rideable_type, y = biketype_total, fill = member_casual)) + geom_bar(stat = "identity", position = "dodge") + labs(x = "Bike Type", y = "Count", title = "Total Counts of Bike Types by Membership Type") + scale_fill_manual(values = c("member" = "blue", "casual" = "orange")) + theme_minimal()


> new_data %>% group_by(member_casual) %>% summarise(total_rides = n()) %>% arrange(member_casual) %>% ggplot(aes(x = member_casual, y = total_rides, fill = member_casual)) + geom_bar(stat = "identity", position = "dodge") + labs(x = "Membership Type", y = "Total Rides", title = "Total Rides by Membership Type") + scale_fill_manual(values = c("member" = "blue", "casual" = "orange")) + theme_minimal()

> new_data %>% group_by(month, member_casual) %>% summarise(total_rides = n()) %>% arrange(member_casual) %>% ggplot(aes(x = month, y = total_rides, fill = member_casual)) + geom_histogram(stat = "identity", position = "dodge") + labs(x = "Month", y = "Total Rides", title = "Total Rides by Month and Membership Type") + facet_wrap(~ member_casual, scales = "free_y", nrow = 1) + scale_fill_manual(values = c("member" = "blue", "casual" = "orange")) + theme_minimal()


> new_data %>% group_by(day_of_week, member_casual) %>% summarise(total_rides = n()) %>% arrange(member_casual) %>% ggplot(aes(x = day_of_week, y = total_rides, fill = member_casual)) + geom_histogram(stat = "identity", position = "dodge") + labs(x = "Day_of_week", y = "Total Rides", title = "Total Rides by days of the week and Membership Type") + facet_wrap(~ member_casual, scales = "free_y", nrow = 1) + scale_fill_manual(values = c("member" = "blue", "casual" = "orange")) + theme_minimal()

> new_data %>% group_by(day_of_week,member_casual) %>% summarise(mean_ridelength = mean(ride_length)) %>% arrange(member_casual) %>% ggplot(aes(x = day_of_week, y = mean_ridelength, fill = member_casual)) + geom_histogram(stat = "identity", position = "dodge") + labs(x = "Days of the week(1-Sun,2-Mon,3-Tue,4-Wed,5-Thu,6-Fri,7-Sat)", y = "Ride Length", title = "Average Ride Length over the week") + facet_wrap(~ member_casual, scales = "free_y", nrow = 1) + scale_fill_manual(values = c("member" = "blue", "casual" = "orange")) + theme_minimal()

> new_data %>% group_by(month,member_casual) %>% summarise(mean_ridelength = mean(ride_length)) %>% arrange(member_casual) %>% ggplot(aes(x = month, y = mean_ridelength, fill = member_casual)) + geom_histogram(stat = "identity", position = "dodge") + labs(x = "Months", y = "Ride Length", title = "Average Ride Length over the year") + facet_wrap(~ member_casual, scales = "free_y", nrow = 1) + scale_fill_manual(values = c("member" = "blue", "casual" = "orange")) + theme_minimal()


Step- 6 Share Findings and Insights

Some insights that I was able to generate from all the analysis are:

  1. Members use Electric and Classic bikes in a similar amount, and do not use docked bikes at all. Casuals use electric bikes more than classic bikes, but they also use some docked bikes.
  2. Casual riders take longer rides per ride when compared with members
  3. Member take more Rides than casuals, or members ride bikes more frequently than casuals
  4. Both members and casuals took most rides during the months of June, July and August. Where casuals rode most in July and Members in August. Both members and casuals rode the least in January.
  5. Casuals ride more on the weekends, while members ride more during the weekdays. Casuals drove most of Saturdays, where members drove most on Thursdays.
  6. Both members and casuals rode about the same length of time. Riding bikes for longest in one ride on Sundays and Saturdays
  7. Both members and casuals rode the most or went for long rides during the months of June, July and August. Casuals rode the least in January, and members rode the least in December.

Step- 7 Act?

Some recommendations for the marketing team to convert casual riders into members:

  1. Casual riders usually ride longer per ride than members, and this could be used as a method of attracting them to become member riders
  2. Casual riders do not ride as often, when compared to member riders. Offering a discount based on their specific frequency of rides on membership would help convert them into members
  3. Casual riders enjoy riding most in the month of July. Offering a summer discount could convert them into members


Lynch Litala

Marketing l Growing l Connecting

11 个月

Awesome!

要查看或添加评论,请登录

Vishal S.的更多文章

社区洞察

其他会员也浏览了