Google Data Analytics Professional Certificate Capstone Project: Cyclistic

Case Study: Helping a bike-share company convert casual riders into annual members

In this article I showcase my approach to solving the case study from the Google Data Analytics Professional Certificate capstone project. The goal is to help a bike-share company increase its number of annual memberships by understanding how to convert casual riders into members.

As I learned in the Google Data Analytics program, I will follow the six phases of the data analysis process: ask, prepare, process, analyze, share, and act.

About the Company

In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.

The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, the marketing team wants to understand how casual riders and annual members use Cyclistic bikes differently.

Business Task

Answer the question: How do annual members and casual riders use Cyclistic bikes differently?

Ask

Three questions need to be answered:

1. How do annual members and casual riders use Cyclistic bikes differently?

2. Why would casual riders buy Cyclistic annual memberships?

3. How can Cyclistic use digital media to influence casual riders to become members?

Prepare

In this phase, I collected the data, which is organized in monthly .csv files, from this source.

I downloaded the files from 2021 to perform this analysis.

Are there issues with bias or credibility in this data?

The data is from last year and was collected directly by the bike-share company. It includes all rides rather than a sample, so it is reasonable to conclude that there are no credibility or bias issues with this data.

How are you addressing licensing, privacy, security and accessibility?

The data is made available under this license. There are no privacy concerns, since the data does not contain any personal information.

Process

In this phase I processed the data and got it ready for the next phase, where I will look for insights that help me answer our stakeholders' questions and the business task.

I used R to perform this step, since the data is too big to merge and process in a spreadsheet (5,595,063 records).

I started by importing the packages needed for this task.

library(tidyverse)  # helps wrangle data
library(lubridate)  # helps wrangle date attributes

Next, I collected the data from the 12 months of 2021 and combined it into a single data frame.

m01_2021 <- read_csv("202101-divvy-tripdata.csv")
m02_2021 <- read_csv("202102-divvy-tripdata.csv")
m03_2021 <- read_csv("202103-divvy-tripdata.csv")
m04_2021 <- read_csv("202104-divvy-tripdata.csv")
m05_2021 <- read_csv("202105-divvy-tripdata.csv")
m06_2021 <- read_csv("202106-divvy-tripdata.csv")
m07_2021 <- read_csv("202107-divvy-tripdata.csv")
m08_2021 <- read_csv("202108-divvy-tripdata.csv")
m09_2021 <- read_csv("202109-divvy-tripdata.csv")
m10_2021 <- read_csv("202110-divvy-tripdata.csv")
m11_2021 <- read_csv("202111-divvy-tripdata.csv")
m12_2021 <- read_csv("202112-divvy-tripdata.csv")

# Stack the individual monthly data frames into one big data frame
all_trips <- bind_rows(m01_2021, m02_2021, m03_2021, m04_2021,
                       m05_2021, m06_2021, m07_2021, m08_2021,
                       m09_2021, m10_2021, m11_2021, m12_2021)
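
As a side note, the twelve read_csv calls could be collapsed into a couple of lines; a minimal sketch, assuming the monthly files sit in the working directory (purrr is loaded with the tidyverse):

# Alternative: read every 2021 monthly file and stack the results in one step
files <- list.files(pattern = "^2021[0-9]{2}-divvy-tripdata\\.csv$")
all_trips <- map_dfr(files, read_csv)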

Let's start by checking the number of rows and columns.

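The output was originally shown as a screenshot; the same check can be reproduced with a few base R and dplyr calls:

dim(all_trips)      # 5595063 rows, 13 columns
colnames(all_trips) # the 13 attribute names listed below
glimpse(all_trips)  # column types and a preview of the values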

We have 13 columns that represent the following attributes:


  1. ride_id: Id for each bike trip
  2. rideable_type: Type of bike used
  3. started_at: Start date and time
  4. ended_at: End date and time
  5. start_station_name: Name of the starting station
  6. start_station_id: Id of the starting station
  7. end_station_name: Name of the ending station
  8. end_station_id: Id of the ending station
  9. start_lat: Latitude of the starting point
  10. start_lng: Longitude of the starting point
  11. end_lat: Latitude of the ending point
  12. end_lng: Longitude of the ending point
  13. member_casual: Membership status

Let's remove the columns that won't be used in the data analysis process: latitude, longitude, start_station_id and end_station_id.

all_trips <- all_trips %>%
  select(-c(start_lat, start_lng, end_lat, end_lng, start_station_id, end_station_id))

Next, I renamed the column member_casual to a more suitable name (rider_type).

all_trips <- all_trips %>%
  rename(rider_type = member_casual)

I created new columns for the date, year, month, day, and start hour of each ride, which will help in the analyze and share phases.

all_trips$date <- as.Date(all_trips$started_at) #The default format is yyyy-mm-dd
all_trips$month <- format(as.Date(all_trips$date), "%m")
all_trips$day <- format(as.Date(all_trips$date), "%d")
all_trips$year <- format(as.Date(all_trips$date), "%Y")
all_trips$day_of_week <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday",
                           "Friday", "Saturday")[as.POSIXlt(all_trips$date)$wday + 1]
all_trips$hour_of_day <- format(as.POSIXct(all_trips$started_at), format = "%H")        
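
Since lubridate is already loaded, the same columns could also be derived with its helpers; an equivalent sketch (these return labelled factors and integers rather than character strings, which can simplify sorting later):

all_trips$month <- month(all_trips$date, label = TRUE)                    # e.g. Jan
all_trips$day_of_week <- wday(all_trips$date, label = TRUE, abbr = FALSE) # e.g. Sunday
all_trips$hour_of_day <- hour(all_trips$started_at)                       # 0-23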

I added a new column called ride_length that calculates the duration of each trip (in minutes).

all_trips$ride_length <- difftime(all_trips$ended_at, all_trips$started_at, units = "mins")
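
difftime() returns a difftime object; for numeric summaries and plots later on, it can be convenient to coerce it to a plain number of minutes, an optional one-liner:

all_trips$ride_length <- as.numeric(all_trips$ride_length)  # difftime -> plain minutes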

I sorted the dataset by ride_length to check for errors, such as negative values.

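The sorted output was originally a screenshot; the same sanity check can be reproduced directly in R:

sum(all_trips$ride_length < 0)  # number of negative ride lengths: 147
min(all_trips$ride_length)      # the most negative duration, for reference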

As the check shows, there are 147 negative values in the dataset, which represent invalid data, so I removed these rows.

all_trips_v2 <- all_trips[!(all_trips$ride_length < 0), ]  # keep only rows with non-negative ride lengths

The dataset also has a considerable number of NA values in the start_station_name and end_station_name columns, as the check below shows.

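The NA counts were originally shown as a screenshot; one way to reproduce them:

colSums(is.na(all_trips_v2))                 # NA count per column
sum(is.na(all_trips_v2$start_station_name))  # 690789
sum(is.na(all_trips_v2$end_station_name))    # 739149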


We have 690,789 NA values in start_station_name and 739,149 in end_station_name, which each represent around 13% of the total records.




Instead of removing the records with NA values, I replaced every NA with "Missing Data" so that the missing values can be analyzed as well.


all_trips_v2$start_station_name <-
  replace(all_trips_v2$start_station_name, is.na(all_trips_v2$start_station_name), "Missing Data")
all_trips_v2$end_station_name <-
  replace(all_trips_v2$end_station_name, is.na(all_trips_v2$end_station_name), "Missing Data")
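
Since tidyr ships with the tidyverse, the same replacement can also be written more compactly; an equivalent sketch:

all_trips_v2 <- all_trips_v2 %>%
  mutate(across(c(start_station_name, end_station_name),
                ~ replace_na(.x, "Missing Data")))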

Finally, I exported the dataset as a .csv file to perform the analyze, share and act phases in Tableau.

write.csv(all_trips_v2, file = 'all_trips.csv')
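
With roughly 5.6 million rows, readr's write_csv (loaded with the tidyverse) is a faster drop-in that also omits row names by default:

write_csv(all_trips_v2, "all_trips.csv")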

Tableau

Here you can find my insights, conclusions and recommendations using Tableau.

You can find the code used to accomplish this task on GitHub.
