Google Data Analytics Professional Certificate Capstone Project: Cyclistic
Filipe Balseiro
Data Engineer | Snowflake SnowPro Core & dbt Developer Certified | Python | GCP BigQuery | CI/CD GitHub Actions. Let's elevate your data strategy!
Case Study: Help a bike-share company convert casual riders into annual members
In this article I showcase my approach to solving the case study from the Google Data Analytics Professional Certificate Capstone Project. The goal is to help a bike-share company increase its number of annual memberships by understanding how it can convert casual riders into members.
As I learned from the Google Data Analytics program, I will follow the six phases of the data analysis process: ask, prepare, process, analyze, share and act.
About the Company
In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.
The director of marketing believes the company's future success depends on maximizing the number of annual memberships. Therefore, the marketing team wants to understand how casual riders and annual members use Cyclistic bikes differently.
Business Task
Answer the question: How do annual members and casual riders use Cyclistic bikes differently?
Ask
Three questions need to be answered:
1. How do annual members and casual riders use Cyclistic bikes differently?
2. Why would casual riders buy Cyclistic annual memberships?
3. How can Cyclistic use digital media to influence casual riders to become members?
Prepare
In this phase, I collected the data organized in monthly .csv files from this source.
I downloaded the files from 2021 to perform this analysis.
Are there issues with bias or credibility in this data?
The data is from the most recent full year and is collected directly by the bike-share company. It includes every ride rather than a sample, so it is reasonable to conclude that there are no credibility or bias issues with this data.
How are you addressing licensing, privacy, security and accessibility?
The data is made available under this license. There are no privacy concerns, since the data does not contain any personal information.
Process
In this phase I processed the data and got it ready for the next phase, where I will look for insights that help me answer our stakeholders' questions and the business task.
I used R to perform this step, since the data is too big to merge and process in a spreadsheet (5,595,063 records).
I started by importing the packages needed for this task.
library(tidyverse)   # helps wrangle data
library(lubridate)   # helps wrangle date attributes
Next, I collected the data from the 12 months of 2021 and combined it into a single data frame.
m01_2021 <- read_csv("202101-divvy-tripdata.csv")
m02_2021 <- read_csv("202102-divvy-tripdata.csv")
m03_2021 <- read_csv("202103-divvy-tripdata.csv")
m04_2021 <- read_csv("202104-divvy-tripdata.csv")
m05_2021 <- read_csv("202105-divvy-tripdata.csv")
m06_2021 <- read_csv("202106-divvy-tripdata.csv")
m07_2021 <- read_csv("202107-divvy-tripdata.csv")
m08_2021 <- read_csv("202108-divvy-tripdata.csv")
m09_2021 <- read_csv("202109-divvy-tripdata.csv")
m10_2021 <- read_csv("202110-divvy-tripdata.csv")
m11_2021 <- read_csv("202111-divvy-tripdata.csv")
m12_2021 <- read_csv("202112-divvy-tripdata.csv")
# Stack the individual monthly data frames into one big data frame
all_trips <- bind_rows(m01_2021, m02_2021, m03_2021, m04_2021,
                       m05_2021, m06_2021, m07_2021, m08_2021,
                       m09_2021, m10_2021, m11_2021, m12_2021)
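As a side note, the twelve read_csv calls can be collapsed into a single step. Here is a minimal sketch using purrr's map_dfr (loaded with the tidyverse), assuming the twelve files sit in the working directory:
# Build the twelve monthly file names and read + stack them in one pass
files <- sprintf("2021%02d-divvy-tripdata.csv", 1:12)
all_trips <- map_dfr(files, read_csv)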
Let's start by observing the number of rows and columns of the combined dataset.
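A quick way to do this (dim() returns the row and column counts, glimpse() previews the columns and their types):
dim(all_trips)      # 5595063 rows, 13 columns
glimpse(all_trips)  # column names and types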
We have 13 columns that represent the following attributes: ride_id, rideable_type, started_at, ended_at, start_station_name, start_station_id, end_station_name, end_station_id, start_lat, start_lng, end_lat, end_lng and member_casual.
Let's remove the columns that won't be used in the data analysis process: latitude, longitude, start_station_id and end_station_id.
all_trips <- all_trips %>%
  select(-c(start_lat, start_lng, end_lat, end_lng, start_station_id, end_station_id))
Next, I renamed the column member_casual to a more suitable name (rider_type).
all_trips <- all_trips %>%
  rename(rider_type = member_casual)
I created new columns that list the date, year, month, day, day of the week and start hour of each ride, which will help in the analyze and share phases.
all_trips$date <- as.Date(all_trips$started_at)  # the default format is yyyy-mm-dd
all_trips$month <- format(as.Date(all_trips$date), "%m")
all_trips$day <- format(as.Date(all_trips$date), "%d")
all_trips$year <- format(as.Date(all_trips$date), "%Y")
# Map each date's weekday index (0 = Sunday) to the weekday name
all_trips$day_of_week <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday",
                           "Friday", "Saturday")[as.POSIXlt(all_trips$date)$wday + 1]
all_trips$hour_of_day <- format(as.POSIXct(all_trips$started_at), format = "%H")
I added a new column called ride_length with the duration of each trip (in minutes).
all_trips$ride_length <- difftime(all_trips$ended_at, all_trips$started_at, units = "mins")
I sorted the dataset based on ride_length to check if there are any errors, like negative values.
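For example, sorting by ride_length and counting the negative durations surfaces the bad records (a quick check):
all_trips %>%
  arrange(ride_length) %>%
  head()                         # shortest (possibly negative) rides first
sum(all_trips$ride_length < 0)   # number of rides with a negative duration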
As you can see, there are 147 negative values in the dataset, which represent bad data, so I removed these rows from the dataset.
all_trips_v2 <- all_trips[!(all_trips$ride_length < 0), ]
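To confirm the cleanup, the shortest remaining ride should no longer be negative:
min(all_trips_v2$ride_length)          # >= 0 after the bad rows are removed
nrow(all_trips) - nrow(all_trips_v2)   # 147, the number of rows dropped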
As we can see below, the dataset has a considerable amount of NA values in the start_station_name and end_station_name columns.
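Counting the missing values per column shows the scale of the problem:
sum(is.na(all_trips_v2$start_station_name))  # 690789
sum(is.na(all_trips_v2$end_station_name))    # 739149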
We have 690,789 NA values in start_station_name and 739,149 in end_station_name, each representing around 13% of the total records.
Instead of removing the records with NA values, I replaced all NA values with "Missing Data" so that the missing values can be analyzed as well.
all_trips_v2$start_station_name <-
  replace(all_trips_v2$start_station_name, is.na(all_trips_v2$start_station_name), "Missing Data")
all_trips_v2$end_station_name <-
  replace(all_trips_v2$end_station_name, is.na(all_trips_v2$end_station_name), "Missing Data")
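The same replacement can also be written more compactly with tidyr's replace_na (part of the tidyverse); a sketch:
# Equivalent tidyverse form: fill the NAs in both columns at once
all_trips_v2 <- all_trips_v2 %>%
  replace_na(list(start_station_name = "Missing Data",
                  end_station_name = "Missing Data"))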
Finally, I exported the dataset as a .csv file to perform the analyze, share and act phases in Tableau.
write.csv(all_trips_v2, file = "all_trips.csv")
Tableau
Here you can find my insights, conclusions and recommendations using Tableau.
You can find the code used to accomplish this task on GitHub.