Google Data Analytics Capstone Project

Brief Introduction

Welcome to this case study. In this data analysis project I’m working as a junior data analyst on the marketing analytics team at Bellabeat, a high-tech manufacturer of health-focused products for women. Bellabeat is a successful small company, but it has the potential to become a larger player in the smart fitness tracker industry.

We will work through this case study in six parts, one for each phase of data analysis: ask, prepare, process, analyze, share and act.

Products

Bellabeat offers several lines of fitness products. The products and services are described below.

Bellabeat app: The Bellabeat app provides users with health data related to their activity, sleep, stress, menstrual cycle, and mindfulness habits. This data can help users better understand their current habits and make healthy decisions. The Bellabeat app connects to their line of smart wellness products.

○ Leaf: Bellabeat’s classic wellness tracker can be worn as a bracelet, necklace, or clip. The Leaf tracker connects to the Bellabeat app to track activity, sleep, and stress.

○ Time: This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.

○ Spring: This is a water bottle that tracks daily water intake using smart technology to ensure that you are appropriately hydrated throughout the day. The Spring bottle connects to the Bellabeat app to track your hydration levels.

○ Bellabeat membership: Bellabeat also offers a subscription-based membership program for users. Membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and beauty, and mindfulness based on their lifestyle and goals.

Phase I: Ask

Major stakeholders

The major stakeholders in this case study are the founders, the chief executive officer and the marketing analytics team at Bellabeat. Each of the major stakeholders is described below.

○ Urška Sršen: Bellabeat’s cofounder and Chief Creative Officer

○ Sando Mur: Mathematician and Bellabeat’s cofounder; key member of the Bellabeat executive team

○ Bellabeat marketing analytics team: A team of data analysts responsible for collecting, analyzing, and reporting data that helps guide Bellabeat’s marketing strategy

Business task

The goal of this data analysis project is to uncover how consumers use non-Bellabeat smart devices, and then to apply those insights to one product from the Bellabeat line-up. The following questions guide the analysis:

1. What are some trends in smart device usage?

2. How could these trends apply to Bellabeat customers?

3. How could these trends help influence Bellabeat marketing strategy?

Phase II: Prepare

Sršen encourages me to use public data that explores smart device users’ daily habits. She points me to a specific data set:

FitBit Fitness Tracker Data (CC0: Public Domain, dataset made available through Mobius): This Kaggle data set contains personal fitness tracker data from thirty Fitbit users. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for

· Physical Activity

· Heart Rate

· Sleep Monitoring

It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.

Limitations of Data Set

· The data was collected 5 years ago, in 2016. Users’ daily activity, fitness and sleeping habits, diet and food consumption may have changed since then, so the data may not be timely or relevant.

· A sample of 30 FitBit users is not representative of the entire fitness population.

· As the data was collected in a survey, we are unable to ascertain its integrity or accuracy.

Data Integrity

The completeness of the data is measured by the Usability Score that Kaggle assigns on a scale of 1.0 to 10.0, where 10.0 is the highest. A score of 10.0 means the data is considered most complete, credible and compatible. The breakdown of the usability score for this dataset is shown below:

[Screenshot: Kaggle usability score breakdown for the dataset]

Reliability of the Data

A good data source is ROCCC, which stands for Reliable, Original, Comprehensive, Current, and Cited.

· Reliable - LOW - Not reliable, as it only has 30 respondents

· Original - LOW - Third-party provider (this is a FitBit dataset and not from Bellabeat)

· Comprehensive - MED - Parameters match most of Bellabeat products’ parameters

· Current - LOW - Data is 5 years old and may not be relevant

· Cited - LOW - Data collected from a third party, hence unknown

Overall, the dataset is considered low-quality data, and it is not recommended to base business recommendations on this data alone.

Phase III: Process

In this phase we choose a tool to clean and inspect the data. I’m choosing the R programming language in RStudio, because it lets us clean, analyse and visualise the data with just a few lines of code. The packages used for the analysis are listed below:

1. Tidyverse

2. Janitor

3. Lubridate

4. Skimr

Note: All lines written after a # are comments for personal reference only and are not active lines of code

#installing the package

#tidyverse, janitor, lubridate, skimr

install.packages('tidyverse')

install.packages('janitor')

install.packages('lubridate')

install.packages('skimr')
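Re-running install.packages() downloads each package again every time the script runs. As a minimal sketch, assuming it is acceptable to skip packages that are already present, the check below lists only the missing ones. The helper name missing_packages is hypothetical and not part of the original workflow.

```r
# Hypothetical helper (not part of the original workflow): return the packages
# from `pkgs` that are not yet installed on this machine.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("tidyverse", "janitor", "lubridate", "skimr")
to_install <- missing_packages(needed)
print(to_install)

# Uncomment to install only what is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

The install call is left commented out so that the sketch can be run without touching the network.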

Next, we use the library() function to load the installed packages in RStudio, as shown below

library(tidyverse)

library(janitor)

library(lubridate)

library(skimr)

Then we use the read.csv() function to import the chosen datasets into RStudio and start processing the data

daily_activity <- read.csv('dailyActivity_merged.csv')

daily_sleep <- read.csv('sleepday_merged2.csv')

weight_log <- read.csv('weightLogInfo_merged.csv')

Next, we inspect the data to check that the structure of each table is formatted correctly. For that we use the str() function:

str(daily_activity)

str(daily_sleep)

str(weight_log)

The output (not reproduced here) reveals some formatting issues:

1. The column names are in CamelCase

2. daily_activity$ActivityDate is stored as CHR, not as a date

3. daily_sleep$SleepDay is stored as CHR, not as a date

4. weight_log$Date is stored as CHR, not as a date

To clean the column names we use the clean_names() function:

daily_activity<-clean_names(daily_activity)

daily_sleep<-clean_names(daily_sleep)

weight_log<-clean_names(weight_log)
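To make the renaming concrete, here is a rough base-R approximation of the CamelCase to snake_case conversion. It is a simplified stand-in, not janitor's actual implementation, and the helper name to_snake_case is hypothetical. The sample headers are taken from the columns used in this case study.

```r
# Simplified base-R stand-in for what clean_names() does to these CamelCase
# headers; janitor itself handles many more edge cases.
to_snake_case <- function(x) {
  # Insert "_" between a lowercase letter or digit and the capital that follows
  x <- gsub("([a-z0-9])([A-Z])", "\\1_\\2", x)
  tolower(x)
}

raw_names <- c("Id", "ActivityDate", "TotalSteps", "VeryActiveMinutes", "BMI")
snake_names <- to_snake_case(raw_names)
print(snake_names)
```

This is why later code refers to columns such as activity_date and very_active_minutes rather than the original header names.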

Next, we convert the date columns from CHR (character) to a date format. The columns concerned, named here as they appear before clean_names(), are:

· daily_activity$ActivityDate

· daily_sleep$SleepDay

· weight_log$Date

We use the as.Date() and as.POSIXct() functions to convert them. The format string has to match how the dates are stored in the CSV; the one below assumes values such as 4/12/2016, that is, month/day/four-digit year separated by slashes:

daily_activity$activity_date<-as.Date(daily_activity$activity_date,'%m/%d/%Y')

daily_sleep$sleep_day<-as.Date(daily_sleep$sleep_day,'%m/%d/%Y')

To convert weight_log$date, we use parse_date_time() from lubridate instead of as.POSIXct():

weight_log$date<- parse_date_time(weight_log$date,'%m%d%y %H:%M:%S %p')
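Before converting a whole column, it can help to try the format string on a single value, because base R's as.Date() returns NA without an error when the format does not match. The sketch below uses sample strings that are assumptions about how the raw CSV files store dates; check a few real values with head() first.

```r
# Sample strings are assumptions about how the raw CSV files store dates.
activity_raw <- "4/12/2016"
sleep_raw    <- "4/12/2016 12:00:00 AM"
weight_raw   <- "5/2/2016 11:59:59 PM"

# A format that matches the separators and four-digit year parses cleanly.
activity_ok <- as.Date(activity_raw, format = "%m/%d/%Y")

# A format without the "/" separators cannot match and yields NA.
activity_bad <- as.Date(activity_raw, format = "%m%d%y")

# as.Date() ignores the trailing time once the date part has matched.
sleep_ok <- as.Date(sleep_raw, format = "%m/%d/%Y")

# With a 12-hour clock, %I pairs with %p (AM/PM; locale-dependent).
weight_ok <- as.POSIXct(weight_raw, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

print(activity_ok)
print(is.na(activity_bad))
print(sleep_ok)
print(weight_ok)
```

A quick sum(is.na(...)) on the converted column is a cheap way to confirm that nothing was silently lost.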

Next, we add day of the week, sedentary hours and total active hours columns to daily_activity for further analysis. A month column is not added, because the data is limited and was collected within about one month.

In daily_sleep, we add new columns that convert the recorded minutes to hours, rounded with the round() function, plus a new column named time_taken_to_sleep.

We also remove the fat column (weight_log$fat) from the weight_log dataset using select(-c()), because it has little to no bearing on the type of analysis we are conducting.

daily_activity$day_of_week <- wday(daily_activity$activity_date,label = T,abbr = T)

daily_activity$total_active_hours =round((daily_activity$very_active_minutes+daily_activity$fairly_active_minutes+daily_activity$lightly_active_minutes)/60,digits = 2)

daily_activity$sedentary_hours = round((daily_activity$sedentary_minutes)/60,digits = 2)

daily_sleep$hours_in_bed = round((daily_sleep$total_time_in_bed)/60,digits = 2)

daily_sleep$hours_asleep = round((daily_sleep$total_minutes_asleep)/60,digits = 2)

daily_sleep$time_taken_to_sleep = (daily_sleep$hours_in_bed - daily_sleep$hours_asleep)

weight_log<- weight_log %>% select(-c(fat))

weight_log <- weight_log %>%

  mutate(bmi2 = case_when(bmi > 24.9 ~ 'overweight', bmi < 18.5 ~ 'underweight', TRUE ~'healthy'))

The fat column is removed from the dataset, and a new column called bmi2 is added. Using the case_when() function, it labels each record as ‘underweight’, ‘overweight’ or ‘healthy’ depending on the person’s BMI.
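To see how the cut-offs behave at the boundaries, the same rule can be tried on a handful of values. This is a base-R sketch using nested ifelse() in place of case_when(), and the BMI values are hypothetical.

```r
# Hypothetical BMI values chosen to sit on and around the cut-offs.
bmi <- c(17.9, 18.5, 22.0, 24.9, 25.3)

# Same rule as the case_when() call: above 24.9 is overweight,
# below 18.5 is underweight, everything else is healthy.
bmi2 <- ifelse(bmi > 24.9, "overweight",
        ifelse(bmi < 18.5, "underweight", "healthy"))

print(data.frame(bmi, bmi2))
```

Values of exactly 18.5 and 24.9 land in the healthy group, which matches the 18.5–24.9 range quoted later in the analysis.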

We also need to remove rows that are not relevant to our analysis: those where total_active_hours or calories is 0. The data comes from Fitbit wearables, so when a user does not wear the device it records zeros instead of real activity. Removing these rows takes that clutter out of the dataset.

daily_activity_cleaned <- daily_activity[!(daily_activity$calories<=0),]

daily_activity_cleaned<- daily_activity_cleaned[!(daily_activity_cleaned$total_active_hours<=0.00),]
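It is worth checking how many rows such a filter removes. The sketch below applies an equivalent single-step filter to a few hypothetical rows; the column names mirror the cleaned dataset, but the values are made up.

```r
# Hypothetical toy rows; the column names mirror the cleaned data set.
toy_activity <- data.frame(
  id                 = c(1, 1, 2, 2, 3),
  calories           = c(1985, 0, 1797, 2100, 1500),
  total_active_hours = c(6.1, 0, 4.3, 0, 5.2)
)

# Keep only rows where the device recorded both calories and activity.
toy_cleaned <- toy_activity[toy_activity$calories > 0 &
                            toy_activity$total_active_hours > 0, ]

cat("rows before:", nrow(toy_activity), "\n")
cat("rows after: ", nrow(toy_cleaned), "\n")
```

Comparing nrow() before and after on the real data shows how much of the sample the zero-activity days account for.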

If we use external visualization software such as Power BI or Tableau, we need to export the cleaned and processed data before we begin the Analyze phase. To do that, we use the write.csv() function to export the dataset
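A minimal sketch of the export step, assuming a small stand-in data frame and an assumed output file name. The original post does not show the file name it used.

```r
# Hypothetical stand-in for daily_activity_cleaned.
toy_cleaned <- data.frame(
  id            = c(1, 2),
  activity_date = as.Date(c("2016-04-12", "2016-04-13")),
  total_steps   = c(13162, 10735)
)

# A temporary path is used here; replace it with a real file name such as
# "daily_activity_cleaned.csv" (an assumed name) in the project folder.
out_path <- file.path(tempdir(), "daily_activity_cleaned.csv")

# row.names = FALSE stops R from writing an extra, unnamed index column.
write.csv(toy_cleaned, out_path, row.names = FALSE)

# Read the file back to confirm the export round-trips.
check <- read.csv(out_path)
print(names(check))
print(nrow(check))
```

Leaving out the row names keeps the file tidy when it is loaded into Tableau or Power BI.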

Phase IV: Analyse

In this phase of the data analysis process, we use the cleaned and processed data to uncover trends and patterns. These insights support decisions that can help grow the business.

I will be using the R package ggplot2 for data visualization, and Tableau to create some interactive visuals that uncover facts and trends in the dataset.

To do this, we first revisit the questions we asked in Phase I, which guide what to plot and which relationships to explore in order to solve the business task

1. What are some trends in smart device usage?

2. How could these trends apply to Bellabeat customers?

3. How could these trends help influence Bellabeat marketing strategy?

After taking a brief look at the data and then analyzing it thoroughly, I decided to plot visuals and graphs revolving around:

1. The averages: steps taken, sedentary hours, very active minutes and total hours asleep.

2. Which days users are most active.

3. The relationship between total active hours, total steps taken, and sedentary hours against calories burned.

4. The relationship between weight, total active hours and steps taken.

5. The number of overweight users.

Let's have a quick look at the average steps taken, sedentary hours, very active minutes and total hours of sleep using summary()

1. summary(daily_activity_cleaned$total_steps)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      0    4920    8053    8319   11100   36019

2. summary(daily_activity_cleaned$sedentary_hours)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   0.00   12.02   17.00   15.87   19.80   23.98

3. summary(daily_activity_cleaned$very_active_minutes)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   0.00    0.00    7.00   23.21   36.00  210.00

4. summary(daily_sleep$hours_asleep)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.970   6.020   7.220   6.992   8.170  13.270

From this brief output we can see that:

· The average number of steps per day was 8319, slightly above the recommended 6000–8000 steps per day. However, the first quartile is 4920, so at least 25% of daily records fall short of that range.

· The average sedentary time was 15.87 hours per day, far above the recommended limit of 7–10 hours.

· The average of 23.21 very active minutes falls short of the recommended 30 minutes of vigorous exercise every day. Only around a quarter of daily records reach it (the third quartile is 36 minutes).

· The average time asleep (6.99 hours) sits right at the lower edge of the recommended 7–9 hours.
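Quartiles only bracket the share of days that miss a target. The share can also be computed directly, since the mean of a logical vector is the proportion of TRUE values. The sketch below uses hypothetical step counts, and the result describes daily records rather than individual users.

```r
# Hypothetical daily step counts, for illustration only.
total_steps <- c(3200, 4900, 6100, 7500, 8300, 9100, 11200, 12800)

# Quartiles show where the middle of the distribution sits ...
print(quantile(total_steps, probs = c(0.25, 0.5, 0.75)))

# ... while the mean of a logical vector gives the share of days
# that fall below a chosen threshold directly.
share_below_6000 <- mean(total_steps < 6000)
print(share_below_6000)
```

The same pattern works for the other thresholds, for example mean(very_active_minutes >= 30) on the real column.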

options(scipen=999)

ggplot(data = daily_activity_cleaned)+

  aes(x = day_of_week, y= total_steps)+

  geom_col(fill = 'blue')+

  labs(x = 'Day of Week', y = 'Total Steps', title = 'Total Steps Taken In a Week')

ggplot(data = daily_activity_cleaned)+

  aes(x = day_of_week, y = very_active_minutes) +

  geom_col(fill = 'red') +

  labs(x = 'Day of week', y = 'Total very active minutes', title = 'Total activity in a week')

ggplot(data = daily_activity_cleaned)+

  aes(x = day_of_week, y = calories) +

  geom_col(fill = 'brown') +

  labs(x = 'Day of week', y = 'Calories burned', title = 'Total calories burned in a week')

In the ‘Total Steps Taken In a Week’ bar graph there is a jump on Tuesday, after which the totals stay more or less the same on the following days. The lowest total is on Sunday. All three plots look similar, with Tuesday the highest and Sunday the lowest. This suggests that people are least active on Sunday, slightly more active on Monday and most active on Tuesday, after which activity stays at a similar level.
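One caveat worth checking: geom_col() stacks every record for a weekday, so a weekday with more records shows a taller bar even when the typical day is no more active. A minimal sketch with hypothetical values, comparing totals with per-day means:

```r
# Hypothetical records; Tuesday has more rows than Sunday on purpose.
toy <- data.frame(
  day_of_week = c("Sun", "Sun", "Tue", "Tue", "Tue", "Tue"),
  total_steps = c(6000, 8000, 7000, 7000, 7000, 7000)
)

# Totals, which is what geom_col() stacks up per weekday.
totals <- aggregate(total_steps ~ day_of_week, data = toy, FUN = sum)

# Means, which are not affected by how many rows each weekday has.
means <- aggregate(total_steps ~ day_of_week, data = toy, FUN = mean)

print(totals)
print(means)
```

In this toy data the Tuesday total is twice the Sunday total although both weekdays average the same number of steps, so checking the number of records per weekday in the real data helps confirm the Tuesday peak.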

Next, let's investigate the relationship between total active hours, total steps taken, and sedentary hours against calories burned by using the following:

ggplot(data = daily_activity_cleaned) +

  aes(x= total_active_hours, y = calories) +

  geom_point(color = 'red') +

  geom_smooth() +

  labs(x = 'Total active hours', y = 'Calories burned', title = 'Calories burned vs active hours')

ggplot(data = daily_activity_cleaned) +

  aes(x= total_steps, y = calories) +

  geom_point(color = 'orange') +

  geom_smooth() +

  labs(x = 'Total steps', y = 'Calories burned', title = 'Calories burned vs total steps')

ggplot(data = daily_activity_cleaned) +

  aes(x= sedentary_hours, y = calories) +

  geom_point(color = 'purple') +

  geom_smooth() +

  labs(x = 'Sedentary hours', y = 'Calories burned', title = 'Calories burned vs sedentary hours')

At first glance, the first two graphs show a positive correlation between calories burned and both active hours and steps taken: the more active hours or steps, the more calories are burned.

The third scatterplot, calories burned vs sedentary hours, is a little harder to read. The trend rises until about the 15-hour mark and then falls. Looking more closely at the upper range, the more hours users spend at rest, the fewer calories they burn, which is what we would expect.
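The visual impression from the scatterplots can be backed up with a correlation coefficient. The sketch below uses cor() on hypothetical values, so the numbers it prints say nothing about the real dataset.

```r
# Hypothetical values, for illustration only.
toy <- data.frame(
  total_active_hours = c(2.0, 3.5, 4.0, 5.5, 6.0, 7.5),
  sedentary_hours    = c(20.0, 18.5, 17.0, 15.0, 13.5, 12.0),
  calories           = c(1600, 1900, 2000, 2400, 2500, 2900)
)

# Pearson correlation: values near +1 or -1 indicate a strong linear
# relationship, values near 0 a weak one.
r_active    <- cor(toy$total_active_hours, toy$calories)
r_sedentary <- cor(toy$sedentary_hours, toy$calories)

print(round(r_active, 2))
print(round(r_sedentary, 2))
```

Pearson correlation only captures linear trends, so a curve that rises and then falls, like the sedentary-hours plot, can give a coefficient near zero even when a pattern is visible.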

I also carried out a descriptive analysis, using the distinct() function, to check how many users fall into the healthy, overweight and underweight categories.

Out of the 30 users, only 8 submitted weight data: 5 users are overweight, 0 are underweight and 3 fall in the healthy category, with a BMI in the range 18.5–24.9.
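The original post does not show the code for this count. One way it could be done, sketched here in base R with unique() standing in for distinct() and a hypothetical weight log, is to reduce the log to one row per user and category before counting.

```r
# Hypothetical weight log: user 1 logged twice, users 2 and 3 once each.
toy_weight <- data.frame(
  id   = c(1, 1, 2, 3),
  bmi2 = c("overweight", "overweight", "healthy", "overweight")
)

# Drop repeated id/category pairs so each user counts once per category.
users_by_category <- unique(toy_weight[, c("id", "bmi2")])

# Count users in each BMI category.
category_counts <- table(users_by_category$bmi2)
print(category_counts)
```

A user whose BMI category changes between logs would be counted in both categories with this approach, which is worth checking when the counts do not add up to the number of users.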

Phase V: Share (Tableau)

Previously we did some visualization in R using ggplot2. Now we visualize the data in Tableau, and these are some of my findings:

From the three graphs above, we can say that:

· The majority of users average 5,000 to 10,000 steps per day; beyond that point there is a sharp decline.

· In the second graph there is a big spike at 0 on the x-axis, followed by a sharp decline. The second-highest spike is at the 20-minute tick, which suggests that many users spend at least 20 minutes exercising.

· The third graph shows the distribution of time spent sedentary (at rest). It is a little confusing at first because there are two high spikes, which suggest that people spend roughly 12–20 hours at rest, which is a lot.

From the two heat maps above, we can say that:

· The most intense activity is on Tuesday, followed by the other days of the week. The least activity is on Sunday, which is a day off when people usually rest.

· The heat map of calories burned by weekday shows the same pattern: Tuesday has the most intense activity and therefore the most calories burned, followed by the other days.

*The thicker the line, the higher the recorded activity count

From the two line graphs above, we can say that:

· Most of the users are in the 55-85kg weight range; beyond that there is a sharp decline in the graph, suggesting no recorded count.

· The highest recorded physical activity counts are in the 55-60kg and 80-85kg ranges, and the highest total step counts fall in the same two ranges. This suggests that users around those weights are the most physically active.

The last chart shows BMI (Body Mass Index). Out of 30 Fitbit users, only 8 submitted weight records to the dataset; of these, 5 are overweight and only 3 have a healthy weight.

Phase VI: Act

In the Analyze and Share sections, we covered the first and second business questions:

1. What are some trends in smart device usage?

2. How could these trends apply to Bellabeat customers? (I believe that the trends shown already indicate how Bellabeat customers would behave.)

Based on the findings from my analysis, I would like to share my hypothesis on this matter.

In all the physical activity graphs by weekday, the biggest rise is on Tuesday. Since age is not included in the dataset, we can only assume that the data may come from working people. That would explain why physical activity rises from Monday, decreases slightly until Friday, and is lowest on Saturday and Sunday, which are rest days.

Now, to answer the final business question, I would like to share my recommendations, based on my findings, to help influence Bellabeat’s marketing strategy.

1. Bellabeat could host events limited to users enrolled in Bellabeat memberships, rewarding those who engage in a healthy lifestyle (e.g. 8k steps a day, fewer than 7 sedentary hours) with points. With enough points, users could purchase products that supplement a healthy lifestyle.

2. Bellabeat could partner with brands (Herbalife Nutrition, Healthify) to reward users who consistently engage in a healthy lifestyle with coupons or store discounts for protein powders or gym equipment.

3. Combining the two previous points, Bellabeat could select previously unhealthy individuals who are now healthy, interview them and publish motivational videos about how Bellabeat encouraged them to change their lifestyle.

Next, I would like to offer some general recommendations to further improve Bellabeat’s products:

1. Bellabeat could implement personalized milestones to encourage users to move gradually towards a healthier lifestyle. A simple way of doing this is to create some sort of AI companion in the app or product that becomes grumpy or sad if the user does not hit the milestone.

2. Bellabeat could implement a simple reminder that tells users they have been sedentary for too long by vibrating the device continuously until it picks up movement or an increase in heart rate, which would indicate that they have engaged in some sort of physical activity.

Author’s Note: This is the end of my Bellabeat case study. Thank you so much for reading it to the end. I hope it has given you some insight into people’s current fitness habits and how they might move to a healthier lifestyle. I was able to conduct this case study with the knowledge I acquired from the Google Data Analytics Professional Certificate, and I’m very grateful to have completed it successfully with this project as my capstone project. Thank you.

Tanmay Redkar

Data Analyst @IFB | Microsoft Certified BI Analyst(PL-300) | R Studio | SQL | Power BI | Ex-Commscope
