Google Data Analytics Professional Certificate Capstone Project: Cyclistic

Case Study: Helping a bike-share company convert casual riders into annual members

In this article I showcase my approach to solving the case study from the Google Data Analytics Professional Certificate capstone project. The goal is to help a bike-share company increase its number of annual memberships by understanding how to convert casual riders into members.

As I learned in the Google Data Analytics program, I will follow the six phases of the data analysis process: ask, prepare, process, analyze, share, and act.

About the Company

In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.

The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, the marketing team wants to understand how casual riders and annual members use Cyclistic bikes differently.

Business Task

Answer the question: How do annual members and casual riders use Cyclistic bikes differently?

Ask

Three questions need to be answered:

1. How do annual members and casual riders use Cyclistic bikes differently?

2. Why would casual riders buy Cyclistic annual memberships?

3. How can Cyclistic use digital media to influence casual riders to become members?

Prepare

In this phase, I collected the data, which is organized in monthly .csv files, from this source.

I downloaded the files from 2021 to perform this analysis.

Are there issues with bias or credibility in this data?

The data is from last year and was collected directly by the bike-share company. It includes all rides rather than a sample, so it is reasonable to conclude that there are no credibility or bias issues with this data.

How are you addressing licensing, privacy, security and accessibility?

The data is made available under this license. There are no privacy concerns, since the data does not contain any personal information.

Process

In this phase I processed the data and got it ready for the next phase, where I will look for insights that help me answer our stakeholders' questions and the business task.

I used R to perform this step, since the data is too big to merge and process in a spreadsheet (5,595,063 records).

I started by importing the packages needed for this task.

library(tidyverse)  # helps wrangle data
library(lubridate)  # helps wrangle date attributes

Next, I collected the data from the 12 months of 2021 and combined it into a single data frame.

m01_2021 <- read_csv("202101-divvy-tripdata.csv")
m02_2021 <- read_csv("202102-divvy-tripdata.csv")
m03_2021 <- read_csv("202103-divvy-tripdata.csv")
m04_2021 <- read_csv("202104-divvy-tripdata.csv")
m05_2021 <- read_csv("202105-divvy-tripdata.csv")
m06_2021 <- read_csv("202106-divvy-tripdata.csv")
m07_2021 <- read_csv("202107-divvy-tripdata.csv")
m08_2021 <- read_csv("202108-divvy-tripdata.csv")
m09_2021 <- read_csv("202109-divvy-tripdata.csv")
m10_2021 <- read_csv("202110-divvy-tripdata.csv")
m11_2021 <- read_csv("202111-divvy-tripdata.csv")
m12_2021 <- read_csv("202112-divvy-tripdata.csv")

# Stack the individual monthly data frames into one big data frame
all_trips <- bind_rows(m01_2021, m02_2021, m03_2021, m04_2021,
                       m05_2021, m06_2021, m07_2021, m08_2021,
                       m09_2021, m10_2021, m11_2021, m12_2021)
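
As a side note, the twelve read_csv calls could be collapsed into a couple of lines; a minimal sketch, assuming the monthly files sit in the working directory (purrr is loaded with the tidyverse):

# Alternative: read every 2021 monthly file and stack the results in one step
files <- list.files(pattern = "^2021[0-9]{2}-divvy-tripdata\\.csv$")
all_trips <- map_dfr(files, read_csv)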

Let's start by checking the number of rows and columns.

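The output was originally shown as a screenshot; the same check can be reproduced with a few base R and dplyr calls:

dim(all_trips)      # 5595063 rows, 13 columns
colnames(all_trips) # the 13 attribute names listed below
glimpse(all_trips)  # column types and a preview of the values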

We have 13 columns that represent the following attributes:


  1. ride_id: Id for each bike trip
  2. rideable_type: Type of bike used
  3. started_at: Start date and time
  4. ended_at: End date and time
  5. start_station_name: Name of the starting station
  6. start_station_id: Id of the starting station
  7. end_station_name: Name of the ending station
  8. end_station_id: Id of the ending station
  9. start_lat: Latitude of the starting point
  10. start_lng: Longitude of the starting point
  11. end_lat: Latitude of the ending point
  12. end_lng: Longitude of the ending point
  13. member_casual: Membership status

Let's remove the columns that won't be used in the data analysis process: latitude, longitude, start_station_id and end_station_id.

all_trips <- all_trips %>%
  select(-c(start_lat, start_lng, end_lat, end_lng, start_station_id, end_station_id))

Next, I renamed the column member_casual to a more suitable name (rider_type).

all_trips <- all_trips %>%
  rename(rider_type = member_casual)

I created new columns for the date, year, month, day, and start hour of each ride, which will help in the analyze and share phases.

all_trips$date <- as.Date(all_trips$started_at) #The default format is yyyy-mm-dd
all_trips$month <- format(as.Date(all_trips$date), "%m")
all_trips$day <- format(as.Date(all_trips$date), "%d")
all_trips$year <- format(as.Date(all_trips$date), "%Y")
all_trips$day_of_week <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday",
                           "Friday", "Saturday")[as.POSIXlt(all_trips$date)$wday + 1]
all_trips$hour_of_day <- format(as.POSIXct(all_trips$started_at), format = "%H")        
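
Since lubridate is already loaded, the same columns could also be derived with its helpers; an equivalent sketch (these return labelled factors and integers rather than character strings, which can simplify sorting later):

all_trips$month <- month(all_trips$date, label = TRUE)                    # e.g. Jan
all_trips$day_of_week <- wday(all_trips$date, label = TRUE, abbr = FALSE) # e.g. Sunday
all_trips$hour_of_day <- hour(all_trips$started_at)                       # 0-23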

I added a new column called ride_length that calculates the duration of each trip (in minutes).

all_trips$ride_length <- difftime(all_trips$ended_at, all_trips$started_at, units = "mins")
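
difftime() returns a difftime object; for numeric summaries and plots later on, it can be convenient to coerce it to a plain number of minutes, an optional one-liner:

all_trips$ride_length <- as.numeric(all_trips$ride_length)  # difftime -> plain minutes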

I sorted the dataset by ride_length to check for errors, such as negative values.

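The sorted output was originally a screenshot; the same sanity check can be reproduced directly in R:

sum(all_trips$ride_length < 0)  # number of negative ride lengths: 147
min(all_trips$ride_length)      # the most negative duration, for reference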

As the check shows, there are 147 negative values in the dataset, which represent invalid data, so I removed these rows.

all_trips_v2 <- all_trips[!(all_trips$ride_length < 0), ]  # keep only rows with non-negative ride lengths

The dataset also has a considerable number of NA values in the start_station_name and end_station_name columns, as the check below shows.

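The NA counts were originally shown as a screenshot; one way to reproduce them:

colSums(is.na(all_trips_v2))                 # NA count per column
sum(is.na(all_trips_v2$start_station_name))  # 690789
sum(is.na(all_trips_v2$end_station_name))    # 739149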


We have 690,789 NA values in start_station_name and 739,149 in end_station_name, which each represent around 13% of the total records.




Instead of removing the records with NA values, I replaced every NA with "Missing Data" so that the missing values can be analyzed as well.


all_trips_v2$start_station_name <-
  replace(all_trips_v2$start_station_name, is.na(all_trips_v2$start_station_name), "Missing Data")
all_trips_v2$end_station_name <-
  replace(all_trips_v2$end_station_name, is.na(all_trips_v2$end_station_name), "Missing Data")
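
Since tidyr ships with the tidyverse, the same replacement can also be written more compactly; an equivalent sketch:

all_trips_v2 <- all_trips_v2 %>%
  mutate(across(c(start_station_name, end_station_name),
                ~ replace_na(.x, "Missing Data")))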

Finally, I exported the dataset as a .csv file to perform the analyze, share and act phases in Tableau.

write.csv(all_trips_v2, file = 'all_trips.csv')
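
With roughly 5.6 million rows, readr's write_csv (loaded with the tidyverse) is a faster drop-in that also omits row names by default:

write_csv(all_trips_v2, "all_trips.csv")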

Tableau

Here you can find my insights, conclusions and recommendations using Tableau.

You can find the code used to accomplish this task on GitHub.
