Cyclistic Bike Share : Case study with R
Image source : https://www.istockphoto.com/photos/divvy-bike-share-chicago

Cyclistic Bike Share : Case study with R

Introduction

This is a capstone project as a part of my?Google Data Analytics Professional Certificate?course. For the analysis I will be using R programming language and?RStudio?IDE for it's easy statistical analysis tools and data visualizations.

For this project following data analysis steps will be followed :

  • Ask
  • Prepare
  • Process
  • Analyze
  • Share
  • Act

Following Case Study Roadmap will be followed on each data analysis process

  • Code, when needed on the step.
  • Key tasks, as a checklist.
  • Deliverable, as a checklist.

Scenario

You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.

Ask

Three questions will guide the future marketing program:

  1. How do annual members and casual riders use Cyclistic bikes differently?
  2. Why would casual riders buy Cyclistic annual memberships?
  3. How can Cyclistic use digital media to influence casual riders to become members?

The director of marketing and your manager?Lily Moreno?has assigned you the first question to answer:?How do annual members and casual riders use Cyclistic bikes differently?

Key tasks

  • [x] Identify the business task
  • The main objective is to build the best marketing strategies to turn casual bike riders into annual members by analyzing how the 'Casual' and 'Annual' customers use Cyclistic bike share differently.
  • [x] Consider key stakeholders
  • Cyclistic executive team, Director of Marketing (Lily Moreno), Marketing Analytics team.

Deliverable

  • [x] A clear statement of the business task
  • Find the differences between casual and member riders.

Prepare

I will use Cyclistic’s historical trip data to analyze and identify trends. The data has been made available by?Motivate International Inc.?under this?license. Datasets are available?here.

Key tasks

  • [x] Download data and store it appropriately.
  • Data has been downloaded from?here?and copies have been stored securely on my computer and here on Kaggle.
  • [x] Identify how it’s organized.
  • All trip data is in comma-delimited (.CSV) format. Column names "ride_id", "rideable_type", "started_at", "ended_at", "start_station_name", "start_station_id", "end_station_name", "end_station_id", "start_lat", "start_lng", "end_lat", "end_lng", "member_casual" (Total 13 column)
  • [x] Sort and filter the data.
  • For this analysis I'm going to use last 12 months data of the year 2021, as it's the more current period to the business task.
  • [x] Determine the credibility of the data.
  • For the purposes of this case study, the datasets are appropriate and it will enable me to answer the business questions. But due data-privacy I cannot use rider's personally identification information, and this will prevent me from determining if a single user/rider taken several rides. All ride ids are unique in this data-set.

Deliverable

  • [x] A description of all data sources used
  • Main source of data provided by the?Cylistic company.

Collect and combine data into a single data frame (Code)

## install and load necessary packages
library(tidyverse)
library(janitor)
library(ggmap)
library(geosphere)
library(lubridate)

## import data in R studio
jan21 <- read_csv("../input/divvytripdata/202101-divvy-tripdata.csv")
feb21 <- read_csv("../input/divvytripdata/202102-divvy-tripdata.csv")
mar21 <- read_csv("../input/divvytripdata/202103-divvy-tripdata.csv")
apr21 <- read_csv("../input/divvytripdata/202104-divvy-tripdata.csv")
may21 <- read_csv("../input/divvytripdata/202105-divvy-tripdata.csv")
jun21 <- read_csv("../input/divvytripdata/202106-divvy-tripdata.csv")
jul21 <- read_csv("../input/divvytripdata/202107-divvy-tripdata.csv")
aug21 <- read_csv("../input/divvytripdata/202108-divvy-tripdata.csv")
sep21 <- read_csv("../input/divvytripdata/202109-divvy-tripdata.csv")
oct21 <- read_csv("../input/divvytripdata/202110-divvy-tripdata.csv")
nov21 <- read_csv("../input/divvytripdata/202111-divvy-tripdata.csv")
dec21 <- read_csv("../input/divvytripdata/202112-divvy-tripdata.csv")

After importing the datasets into RStudio environment I have checked each individual datasets for consistency.

## merge individual monthly data frames into one large data fram
tripdata <- bind_rows(jan21, feb21, mar21, apr21, may21, jun21, jul21, aug21, sep21, oct21, nov21, dec21)e        

Process

Cleaning and Preparation of data for analysis

Key tasks

  • [x] Check the data for errors.
  • [x] Choose your tools.
  • [x] Transform the data so you can work with it effectively.
  • [x] Document the cleaning process.

Deliverable

  • [x] Documentation of any cleaning or manipulation of data

Following code chunks were used for this 'Process' phase.

## checking merged data frame
colnames(tripdata)  #List of column names
head(tripdata)  #See the first 6 rows of data frame.  Also tail(tripdata)
str(tripdata)  #See list of columns and data types (numeric, character, etc)
summary(tripdata)  #Statistical summary of data. Mainly for numeric.

## Adding date, month, year, day of week columntripdata <- tripdata %>%   mutate(year = format(as.Date(started_at), "%Y")) %>% # extract year  mutate(month = format(as.Date(started_at), "%B")) %>% #extract month  mutate(date = format(as.Date(started_at), "%d")) %>% # extract date  mutate(day_of_week = format(as.Date(started_at), "%A")) %>% # extract day of week  mutate(ride_length = difftime(ended_at, started_at)) %>%   mutate(start_time = strftime(started_at, "%H"))# converting 'ride_length' to numeric for calculation on datatripdata <- tripdata %>%   mutate(ride_length = as.numeric(ride_length))is.numeric(tripdata$ride_length) # to check it is right formats

# adding ride distance in k
tripdata$ride_distance <- distGeo(matrix(c(tripdata$start_lng, tripdata$start_lat), ncol = 2), matrix(c(tripdata$end_lng, tripdata$end_lat), ncol = 2))

tripdata$ride_distance <- tripdata$ride_distance/1000 #distance in km

# Remove "bad" data
# The dataframe includes a few hundred entries when bikes were taken out of docks 
# and checked for quality by Divvy where ride_length was negative or 'zero'

tripdata_clean <- tripdata[!(tripdata$ride_length <= 0),]        

Analyze

Now all the required information are in one place and ready for exploration.

Key tasks

  • [x] Aggregate your data so it’s useful and accessible.
  • [x] Organize and format your data.
  • [x] Perform calculations.
  • [x] Identify trends and relationships.

Deliverable

  • [x] A summary of the analysis

Following code chunks were used for this 'Analyze' phase

Compare members and casual users :

# first lets check the cleaned data frame
str(tripdata_clean)

# lets check summarised details about the cleaned dataset
summary(tripdata_clean) 

Full detailed outputs are available on my kaggle notebook, link available at the bottom of this article.        

Conduct descriptive analysis :

## Conduct descriptive analysis
# descriptive analysis on 'ride_length'
# mean = straight average (total ride length / total rides)
# median = midpoint number of ride length array
# max = longest ride
# min = shortest ride

tripdata_clean %>% 
  summarise(average_ride_length = mean(ride_length), median_length = median(ride_length), 
            max_ride_length = max(ride_length), min_ride_length = min(ride_length))

        
No alt text provided for this image

  • This above data is about 'ride_length' depending on the whole year 2021. Minimum ride length (min_ride_length) and Maximum ride length (max_ride_length) has absurd values, due to the lack of scope it not possible to find out the problem behind it, but it need to be analyzed further.

Compare members and casual users

  • Members vs casual riders difference depending on total rides taken

# members vs casual riders difference depending on total rides taken
tripdata_clean %>% 
    group_by(member_casual) %>% 
    summarise(ride_count = length(ride_id), ride_percentage = (length(ride_id) / nrow(tripdata_clean)) * 100)

ggplot(tripdata_clean, aes(x = member_casual, fill=member_casual)) +
    geom_bar() +
    labs(x="Casuals vs Members", y="Number Of Rides", title= "Casuals vs Members distribution")        
No alt text provided for this image
No alt text provided for this image

  • We can see on the Casuals vs Members distribution chart, members possesing ~55%, and casual riders have ~45% of the dataset. So it is clearly visible that in the whole year 2021 members used ride share ~10% more than casual riders.

Comparison between Members Causal riders depending on ride length (mean, median, minimum, maximum)

tripdata_clean %>%
  group_by(member_casual) %>% 
  summarise(average_ride_length = mean(ride_length), median_length = median(ride_length), 
            max_ride_length = max(ride_length), min_ride_length = min(ride_length))        
No alt text provided for this image

  • From the above table we can conclude that casual riders took bike for longer rides than members, as the average trip duration / average ride length of member riders is lower than the average trip duration / average ride length of casual riders.

See total rides and average ride time by each day for members vs casual riders

# lets fix the days of the week order.
tripdata_clean$day_of_week <- ordered(tripdata_clean$day_of_week, 
                                    levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))

tripdata_clean %>% 
  group_by(member_casual, day_of_week) %>%  #groups by member_casual
  summarise(number_of_rides = n() #calculates the number of rides and average duration 
  ,average_ride_length = mean(ride_length),.groups="drop") %>% # calculates the average duration
  arrange(member_casual, day_of_week) #sort        
No alt text provided for this image

Visualize total rides data by type and day of week

tripdata_clean %>%  
  group_by(member_casual, day_of_week) %>% 
  summarise(number_of_rides = n(), .groups="drop") %>% 
  arrange(member_casual, day_of_week)  %>% 
  ggplot(aes(x = day_of_week, y = number_of_rides, fill = member_casual)) +
  labs(title ="Total rides by Members and Casual riders Vs. Day of the week") +
  geom_col(width=0.5, position = position_dodge(width=0.5)) +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))        
No alt text provided for this image

Visualize average ride time data by type and day of week

tripdata_clean %>%  
  group_by(member_casual, day_of_week) %>% 
  summarise(average_ride_length = mean(ride_length), .groups="drop") %>%
  ggplot(aes(x = day_of_week, y = average_ride_length, fill = member_casual)) +
  geom_col(width=0.5, position = position_dodge(width=0.5)) + 
  labs(title ="Average ride time by Members and Casual riders Vs. Day of the week")        
No alt text provided for this image

  • From the first chart above members took consistent trips throughout the week, but there is less rides in Sunday. For casual riders the most taken rides are in weekends, starting rise in Friday followed by Saturday and Sunday.
  • The average ride length for members are much much less than that of casual riders. Also it can be seen that weekend average ride length is much higher for casual riders along with total rides. So both of this facts can be correlated for casual riders. For members average ride length is about the same throughout the week (<1000 sec).

See total rides and average ride time by each month for members vs casual riders

# First lets fix the days of the week order.
tripdata_clean$month <- ordered(tripdata_clean$month, 
                            levels=c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"))

tripdata_clean %>% 
  group_by(member_casual, month) %>%  
  summarise(number_of_rides = n(), average_ride_length = mean(ride_length), .groups="drop") %>% 
  arrange(member_casual, month)        
No alt text provided for this image

Visualize total rides data by type and month

tripdata_clean %>%  
  group_by(member_casual, month) %>% 
  summarise(number_of_rides = n(),.groups="drop") %>% 
  arrange(member_casual, month)  %>% 
  ggplot(aes(x = month, y = number_of_rides, fill = member_casual)) +
  labs(title ="Total rides by Members and Casual riders Vs. Month", x = "Month", y= "Number Of Rides") +
  theme(axis.text.x = element_text(angle = 45)) +
  geom_col(width=0.5, position = position_dodge(width=0.5)) +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))        
No alt text provided for this image

Visualize average ride time data by type and month

tripdata_clean %>%  
  group_by(member_casual, month) %>% 
  summarise(average_ride_length = mean(ride_length),.groups="drop") %>%
  ggplot(aes(x = month, y = average_ride_length, fill = member_casual)) +
  geom_col(width=0.5, position = position_dodge(width=0.5)) + 
  labs(title ="Average ride length by Members and Casual riders Vs. Month") +
  theme(axis.text.x = element_text(angle = 30))        
No alt text provided for this image

  • The months June, July, August and September are the most busy time of the year among both members and casual riders. It is possible due to winter there is a significant drop in total rides in the months of November, December, January and February for both type of customers. But we can see that member's total rides are higher than casual riders throughout the year except from June, July and August.
  • Average ride length of members is about the same <1000 secs throughout the year. While casual riders average ride length is between 1000 - 2000 secs throughout the year. But in the month of February average right length is higher but total rides are lowest as compared to other months.

Comparison between Members and Casual riders depending on ride distance

tripdata_clean %>% 
  group_by(member_casual) %>% drop_na() %>%
  summarise(average_ride_distance = mean(ride_distance)) %>%
  ggplot() + 
  geom_col(mapping= aes(x= member_casual,y= average_ride_distance,fill=member_casual), show.legend = FALSE)+
  labs(title = "Mean travel distance by Members and Casual riders", x="Member and Casual riders", y="Average distance In Km")        
No alt text provided for this image

  • From the above chart we can see that both riders travel about the same average distance. This similarity could be possible due to that member take (same ride time) rides throughout the week, but casual riders took rides mostly in weekends with higher ride time.

Analysis and visualization on cyclistic's bike demand by hour in a day

tripdata_clean %>%
    ggplot(aes(start_time, fill= member_casual)) +
    labs(x="Hour of the day", title="Cyclistic's Bike demand by hour in a day") +
    geom_bar()        
No alt text provided for this image

  • From the above chart we can see more members between 7am and 11am and more casual riders between 3pm and 12am. Also there is bigger volume rise in the afternoon for both type of riders. This information needs to be checked on day basis.

Analysis and visualization on cyclistic's bike demand per hour by day of the week

tripdata_clean %>%
    ggplot(aes(start_time, fill=member_casual)) +
    geom_bar() +
    labs(x="Hour of the day", title="Cyclistic's bike demand per hour by day of the week") +
    facet_wrap(~ day_of_week)        
No alt text provided for this image

  • There is a lot of difference between the weekdays and weekends. There is a big increase of volume in the weekdays between 7am to 10am and another volume increase from 5pm to 7pm. We can hypothesize that members use the bikes as daily routine like going to work (same behaviour throughout the weekdays) and go back from work (5pm - 7pm). Weekends are completely different for members and casual riders, Friday, Saturday and Sunday there is huge peak in volume for casual riders, from this we can hypothesize that casual riders mostly use bike share for leisure activity in the weekends.

Analysis and visualization of Rideable type Vs. total rides by Members and casual riders

tripdata_clean %>%
    group_by(rideable_type) %>% 
    summarise(count = length(ride_id))

ggplot(tripdata_clean, aes(x=rideable_type, fill=member_casual)) +
    labs(x="Rideable type", title="Rideable type Vs. total rides by Members and casual riders") +
    geom_bar()        
No alt text provided for this image



No alt text provided for this image

  • From the above viz we can see that members mostly use classic bikes, followed by electric bikes Docked bikes mostly used by casual riders. Electric bikes are more favored by members.

Now analyze and visualize the dataset on coordinate basis

#Lets check the coordinates data of the rides.
#adding a new data frame only for the most popular routes >200 rides
coordinates_df <- tripdata_clean %>% 
filter(start_lng != end_lng & start_lat != end_lat) %>%
group_by(start_lng, start_lat, end_lng, end_lat, member_casual, rideable_type) %>%
summarise(total_rides = n(),.groups="drop") %>%
filter(total_rides > 200)

# now lets create two different data frames depending on rider type (member_casual)

casual_riders <- coordinates_df %>% filter(member_casual == "casual")
member_riders <- coordinates_df %>% filter(member_casual == "member")        

Lets setup ggmap and store map of Chicago (bbox, stamen map)

chicago <- c(left = -87.700424, bottom = 41.790769, right = -87.554855, top = 41.990119)

chicago_map <- get_stamenmap(bbox = chicago, zoom = 12, maptype = "terrain")        

Visualization on the map

# maps on casual riders
ggmap(chicago_map,darken = c(0.1, "white")) +
   geom_point(casual_riders, mapping = aes(x = start_lng, y = start_lat, color=rideable_type), size = 2) +
   coord_fixed(0.8) +
   labs(title = "Most used routes by Casual riders",x=NULL,y=NULL) +
   theme(legend.position="none")

#map on member riders
ggmap(chicago_map,darken = c(0.1, "white")) +
    geom_point(member_riders, mapping = aes(x = start_lng, y = start_lat, color=rideable_type), size = 2) +  
    coord_fixed(0.8) +
    labs(title = "Most used routes by Member riders",x=NULL,y=NULL) +
    theme(legend.position="none")        
No alt text provided for this image
No alt text provided for this image

  • We can clearly see the casual rides are mostly located around the center of the town (or the bay area), with all their trips located around that area points towards their bike usage pattern, which is for leisure, probably tourist or sightseeing related rides.
  • Members are mostly use bike all over the city including main city area and outside main center. This can be hypothesize as they travel for work purpose.

Share

This phase will be done by presentation, but here on?Kaggle?we can use Notebooks to share our analysis and visualizations.

Key tasks

  • [x] Determine the best way to share your findings.
  • [x] Create effective data visualizations.
  • [x] Present your findings.
  • [x] Ensure your work is accessible.

Deliverable

  • [x] Supporting visualizations and key findings

Main insights and finding conclusions

  • Members holds the biggest proportion of the total rides, ~10% bigger than casuals riders.
  • In all months we have more members than casual riders.
  • For casual riders the biggest volume of data is on the the weekend.
  • There is a bigger volume of bikers in the afternoon.

This could be possible that member use bikes for work purpose, this information can be backed by their bike usage in colder months, where there is significant drop in casual members in those months.

Now for how members differs from casuals:

  • Members have the bigger volume of data, except on saturday and sunday. On the weekend, casuals riders have the most data points.
  • Casuals riders have more ride length (ride duration) than members. Average ride time of member are mostly same slight increase in end of week.
  • We have more members during the morning, mainly between 7am and 10am. And more casuals between 3pm and 12am.
  • Members have a bigger preference for classic bikes, followed by electric bike.
  • Members have a more fixed use for bikes for routine activities. Where as casual rider's usage is different, mostly all activiy in the weekend.
  • Casual member spend time near the center of the city or the bay area, where as member are scattered throughout the city.

Act

Act phase will be done by the Cyclistic's executive team, Director of Marketing (Lily Moreno), Marketing Analytics team on the basis of my analysis. (Data-driven decision making)

Deliverable

  • [x] Your top three recommendations based on your analysis
  • Offer a weekend-only membership at a different price point than the full annual membership.
  • Coupons and discounts could be handed out along with the annual subscription / weekend-only membership for the usage of electric bikes targeting casual riders. Also increasing the number of electric bike while reducing classic bikes if electric bike costs more for the pass, this can be beneficial for the company. (As electric bike are already in trend and usage is good as per member and ride type data.
  • Create marketing campaigns which can be sent via email, or advertisement in the docking stations explaining why annual member is beneficial. Campaigns should be placed at the peak months of the year.

Note : All ride ids are unique so we cannot conclude if the same rider taken several rides. More rider data needed for further analysis

Additional data that could expand scope of analysis

  • Pricing details for members and casual riders - Based on this data, we might be to optimize cost structure for casual riders or provide discounts without affecting the profit margin.
  • Address/ neighborhood details of members to investigate if there are any location specific parameters that encourage membership.
  • Way to determine a recurring bike user using payment information or any personal identification.

Resources

Thanks for reading and I hope you like it.

Please give your valuable feedback.

Also for the Kaggle Notebook access and to check detailed analysis with proper output details of the codes please visit My Notebook

Yusuf Aliyu

Production Planning Control @ African Industries Group | Data Analyst

2 年

this is awesome i actually finds it interesting and educating

回复
Shahar Mazor

Operations Manager | Business and Data Analyst | Small Businesses Owner.

2 年

Nice work my friend, I've been looking at many different solutions for this case study and you are the first I saw that dug into the details as much as I tried, Even liked it more as you've seen few things I missed and I believe its vice versa as well, which why I think "hive mind" is so great when people collaborate. I know who I will ask questions next time I hit a wall of thought :) keep up the good work ! ?? ??

回复
Zalman Z.

Acquisitions ? Management ? Trade

3 年

Very nicely done! All done in R. Great insights.

要查看或添加评论,请登录

Sayantan Bagchi的更多文章

社区洞察

其他会员也浏览了