Google Data Analytics Case Study
Bellabeat
Austin Doerr
2023-06-17
Introduction of Case Analysis
Hello, my name is Austin Doerr and this is my capstone project for the Google Data Analytics Program. I will be examining data related to Bellabeat, a tech company that provides fitness-tracking devices designed for women. My task is to analyze the data to uncover insights and identify new growth opportunities for marketing their products.
Throughout this report, I will take you step by step through the thought process and strategies I used to complete my analysis of Bellabeat. The primary tool used was the R programming language, though I also used Excel, Tableau, and SQL depending on the parameters of the data set. I will also briefly describe the data sources I used, as well as the background information needed to understand the context of this case study.
Background Information: Bellabeat
Bellabeat is a company in the fitness industry that creates and markets fashionable, wearable health trackers for women. These products have a variety of uses, such as tracking energy, stress, and productivity levels. As of 2023, the products sold on their website that are relevant to this analysis are listed below:
- Ivy Health Tracker: A wearable tracker designed to show data about the user’s lifestyle based on activity levels. This is Bellabeat’s most popular product, but it also carries the highest price tag at $249.99 USD.
- Leaf Urban: A wearable tracker that is similar to the Ivy but is limited to tracking only activity and sleep levels. This, however, allows it to come in at a lower price of $99.00 USD.
This information was collected from www.bellabeat.com. For more information on the brand and other products, please follow the link.
Preparing the Data
Data Sources
Throughout the analysis, I used one primary data source.
- FitBit Fitness Tracker Data: This data set is public information that tracks 30 different users’ activity, heart rate, sleep, and other metrics. The data is split across multiple .csv files that contain everything from daily steps to second-by-second heart rate.
- For this analysis, we loaded and combined the following data sets into R:
# Load the tidyverse packages (dplyr, ggplot2, etc.) used throughout this analysis
library(tidyverse)

daily_steps <- read.csv("Data Sources/Fitabase Data 4.12.16-5.12.16/dailySteps_merged.csv")
daily_intensity <- read.csv("Data Sources/Fitabase Data 4.12.16-5.12.16/dailyIntensities_merged.csv")
daily_calories <- read.csv("Data Sources/Fitabase Data 4.12.16-5.12.16/dailyCalories_merged.csv")
heartrate_seconds <- read.csv("Data Sources/Fitabase Data 4.12.16-5.12.16/heartrate_seconds_merged.csv")
weight_info <- read.csv("Data Sources/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv")
First, I wanted to convert heartrate_seconds into a daily mean heart rate per user. This will allow me to merge this data with the other daily data sets.
daily_heartrate <- heartrate_seconds %>%
  mutate(ActivityDay = substr(Time, 1, 9)) %>%   # take the date portion of the timestamp
  select(Id, ActivityDay, Value) %>%
  group_by(ActivityDay, Id) %>%
  summarise(heartrate = mean(Value), .groups = NULL)   # mean heart rate per user per day
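One caveat: substr(Time, 1, 9) assumes the date portion of every timestamp is exactly nine characters, which holds for "4/12/2016" but leaves a trailing space for shorter dates such as "5/1/2016", and those values would likely fail to match the ActivityDay strings in the daily files. A more defensive sketch (an alternative, not the original author's code), assuming the Time values look like "4/12/2016 7:21:00 AM" as in the Fitabase exports:
# Alternative date extraction: drop everything after the first space in the timestamp
daily_heartrate <- heartrate_seconds %>%
  mutate(ActivityDay = sub(" .*$", "", Time)) %>%   # "4/12/2016 7:21:00 AM" -> "4/12/2016"
  group_by(ActivityDay, Id) %>%
  summarise(heartrate = mean(Value), .groups = "drop")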
Now that we have cleaned and prepared the heart rate data set, let’s take a look at the new data table.
summary(daily_heartrate)
## ActivityDay Id heartrate
## Length:334 Min. :2.022e+09 Min. : 59.38
## Class :character 1st Qu.:4.388e+09 1st Qu.: 70.47
## Mode :character Median :5.577e+09 Median : 77.49
## Mean :5.565e+09 Mean : 78.61
## 3rd Qu.:6.962e+09 3rd Qu.: 84.93
## Max. :8.878e+09 Max. :109.79
Based on the number of rows, we can now see a limitation of the data set: only a portion of the members provided heart rate data (33%). We will consider this in the analysis that follows.
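To make that limitation concrete, we can count how many distinct users appear in each table (a quick check, not part of the original write-up):
n_distinct(daily_heartrate$Id)   # users who logged heart rate data
n_distinct(daily_steps$Id)       # users who logged steps, for comparison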
Merge Data
Now I will merge the data so I can use variables from different data tables for my analysis:
merged_data_1 <- merge(daily_steps, daily_intensity, by = c("Id", "ActivityDay"))
merged_data_2 <- merge(merged_data_1, daily_calories, by = c("Id", "ActivityDay"))
# left_join keeps days with no heart rate data (heartrate becomes NA)
merged_data_3 <- left_join(merged_data_2, daily_heartrate, by = c("ActivityDay", "Id"))
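The next step joins in a table called new_weight_info, which is never defined in this excerpt. A minimal sketch of how it might be derived from the weight_info data loaded earlier, assuming the goal is one weight record per user so the join by Id stays one-to-one (the original preparation step may have differed):
# Hypothetical reconstruction: keep one weight record per user
new_weight_info <- weight_info %>%
  distinct(Id, .keep_all = TRUE)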
#Now we are ready to create our cleaned data set
complete_daily_data <- left_join(merged_data_3, new_weight_info, by = "Id") %>%
  subset(select = -c(IsManualReport, Fat, LogId, Date)) %>%   # drop weight-log columns we don't need
  filter(Calories > 0)   # so we do not include days where the Fitbit was not turned on
summary(complete_daily_data)
## Id ActivityDay StepTotal SedentaryMinutes
## Min. :1.504e+09 Length:936 Min. : 0 Min. : 0.0
## 1st Qu.:2.320e+09 Class :character 1st Qu.: 3818 1st Qu.: 729.0
## Median :4.445e+09 Mode :character Median : 7441 Median :1057.0
## Mean :4.850e+09 Mean : 7671 Mean : 989.3
## 3rd Qu.:6.962e+09 3rd Qu.:10734 3rd Qu.:1226.0
## Max. :8.878e+09 Max. :36019 Max. :1440.0
##
## LightlyActiveMinutes FairlyActiveMinutes VeryActiveMinutes
## Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.:128.0 1st Qu.: 0.00 1st Qu.: 0.00
## Median :199.0 Median : 7.00 Median : 4.00
## Mean :193.6 Mean : 13.62 Mean : 21.26
## 3rd Qu.:264.2 3rd Qu.: 19.00 3rd Qu.: 32.00
## Max. :518.0 Max. :143.00 Max. :210.00
##
## SedentaryActiveDistance LightActiveDistance ModeratelyActiveDistance
## Min. :0.000000 Min. : 0.000 Min. :0.00
## 1st Qu.:0.000000 1st Qu.: 1.960 1st Qu.:0.00
## Median :0.000000 Median : 3.380 Median :0.24
## Mean :0.001613 Mean : 3.355 Mean :0.57
## 3rd Qu.:0.000000 3rd Qu.: 4.790 3rd Qu.:0.80
## Max. :0.110000 Max. :10.710 Max. :6.48
##
## VeryActiveDistance Calories heartrate WeightKg
## Min. : 0.000 Min. : 52 Min. : 59.38 Min. : 52.60
## 1st Qu.: 0.000 1st Qu.:1834 1st Qu.: 70.57 1st Qu.: 62.50
## Median : 0.220 Median :2144 Median : 79.27 Median : 71.05
## Mean : 1.509 Mean :2313 Mean : 78.94 Mean : 78.04
## 3rd Qu.: 2.090 3rd Qu.:2794 3rd Qu.: 85.33 3rd Qu.: 85.80
## Max. :21.920 Max. :4900 Max. :109.79 Max. :133.50
## NA's :693 NA's :690
## WeightPounds BMI
## Min. :116.0 Min. :21.45
## 1st Qu.:137.8 1st Qu.:24.39
## Median :156.6 Median :26.46
## Mean :172.0 Mean :28.07
## 3rd Qu.:189.2 3rd Qu.:27.45
## Max. :294.3 Max. :47.54
## NA's :690 NA's :690
Analysis
For the analysis, I will group the participants into subgroups based on the calories they burned each day. The groups below are defined using the first quartile, mean, third quartile, and maximum of Calories from the summary above:
- 0-1834 = Not Active
- 1835-2313 = Somewhat Active
- 2314-2794 = Active
- 2794-4900 = Very Active
group_by_id <- complete_daily_data %>%
  mutate(total_distance = LightActiveDistance + ModeratelyActiveDistance + VeryActiveDistance) %>%
  group_by(Id) %>%
  summarise(avg_calories = mean(Calories),
            avg_steps = mean(StepTotal),
            avg_sedentary_mins = mean(SedentaryMinutes),
            avg_light_mins = mean(LightlyActiveMinutes),
            avg_fair_mins = mean(FairlyActiveMinutes),
            avg_very_mins = mean(VeryActiveMinutes),
            avg_total_distance = mean(total_distance),
            avg_heartrate = mean(heartrate, na.rm = TRUE),
            weight_pounds = mean(WeightPounds, na.rm = TRUE)) %>%
  mutate(
    activity_level = case_when(
      between(avg_calories, 0, 1834) ~ "Not Active",
      between(avg_calories, 1835, 2313) ~ "Somewhat Active",
      between(avg_calories, 2314, 2794) ~ "Active",
      between(avg_calories, 2794, 4900) ~ "Very Active"
    )
  )
group_by_act_lvl <- group_by_id %>%
  group_by(activity_level) %>%
  summarise(num_of_ids = n(),
            avg_calories = mean(avg_calories),
            avg_steps = mean(avg_steps),
            avg_sedentary_mins = mean(avg_sedentary_mins),
            avg_light_mins = mean(avg_light_mins),
            avg_fair_mins = mean(avg_fair_mins),
            avg_very_mins = mean(avg_very_mins),
            avg_total_distance = mean(avg_total_distance),
            avg_heartrate = mean(avg_heartrate, na.rm = TRUE),
            avg_weight_pounds = mean(weight_pounds, na.rm = TRUE)
  ) %>%
  arrange(avg_calories)
tibble(group_by_act_lvl)
## # A tibble: 4 × 11
## activity_level num_of_ids avg_calories avg_steps avg_sedentary_mins
## <chr> <int> <dbl> <dbl> <dbl>
## 1 Not Active 5 1567. 5918. 1016.
## 2 Somewhat Active 15 2018. 6509. 977.
## 3 Active 5 2540. 8037. 1088.
## 4 Very Active 8 3107. 10242. 970.
## # ℹ 6 more variables: avg_light_mins <dbl>, avg_fair_mins <dbl>,
## # avg_very_mins <dbl>, avg_total_distance <dbl>, avg_heartrate <dbl>,
## # avg_weight_pounds <dbl>
Now the data is organized into groups based on activity level. We can look for underlying trends within these categories to give us better direction for marketing strategies.
- Let’s take a look at which activity levels the users in the study fall into.
ggplot(data = group_by_act_lvl, aes(x = "", y = num_of_ids, fill = activity_level)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start = 0) +   # turn the stacked bar into a pie chart
  geom_text(aes(label = paste0(round(num_of_ids/33*100), "%")), position = position_stack(vjust = 0.5)) +   # 33 = total users in the study
  theme_void()   # removes background, theme, grids
Based on this chart, the largest share of the Fitbit users in the study are somewhat active, meaning they burn between 1835 and 2313 calories on average each day.
There is more to learn from the table grouped by activity level. The main observation I want to focus on is the characteristics of the largest group, “somewhat active”. If Bellabeat can understand the attributes of people who wear fitness trackers, then they can understand how to better target them. Based on the data below, there is a correlation between calories burned and average distance traveled, which suggests Bellabeat could target individuals who travel around 5 miles each day.
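The chart behind that observation is not reproduced here; the sketch below shows one way the relationship could be visualized from the per-user summary (an illustration, not the original figure):
ggplot(data = group_by_id, aes(x = avg_total_distance, y = avg_calories)) +
  geom_point() +                                   # one point per user
  geom_smooth(method = "lm", formula = y ~ x) +    # linear trend line
  labs(x = "Average Daily Total Distance", y = "Average Daily Calories Burned")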
Another attribute that can be advertised to consumers is the positive relationship between sedentary minutes and heart rate among the Fitbit users.
Many consumers are concerned about their heart rate, so this data can show how reducing sedentary time can lower an individual’s heart rate, which demonstrates the benefit of buying a Bellabeat product.
(Figure: scatter plot of sedentary minutes against average heart rate with a loess trend line; 19 rows with missing heart rate values were excluded.)
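The code that produced this figure is not included in the excerpt; a minimal sketch of what it might have looked like, assuming it was built from the per-user summary group_by_id defined above:
ggplot(data = group_by_id, aes(x = avg_sedentary_mins, y = avg_heartrate)) +
  geom_point() +                                     # users without heart rate data are dropped with a warning
  geom_smooth(method = "loess", formula = y ~ x) +   # loess trend line
  labs(x = "Average Sedentary Minutes per Day", y = "Average Heart Rate (bpm)")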
Based on this data, the more time users spend sedentary, the higher their average heart rate, so marketing could focus on the benefits of staying active with Bellabeat products.
Conclusion
Overall, the data offered many valuable insights that can inform marketing decisions. After my analysis, I have three main recommendations that could lead to growth for the company.
- Focus on individuals that burn, on average, 1835-2313 calories every day. This was the largest subgroup based on activity level, which could mean they are more inclined to buy a fitness tracker.
- Market the effect that physical activity has on heart rate, with advertisements showing that more sedentary minutes lead to a higher average heart rate.
- Target individuals that travel approximately 5 miles per day. This could be estimated from how far they commute to work or what hobbies they enjoy. However, further analysis is needed to support this theory.
Recommendations
- Collect additional data internally on activity level and heart rate to see whether there is a strong correlation between these variables (a quick check with the existing data is sketched after this list).
- Gather more personal information about the users, including gender, age, and other attributes, to make better-informed decisions about demographics.
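As a starting point before new data is collected, the strength of that relationship can be checked with the data already in hand (a quick sketch using the per-user summary defined above; it does not replace the additional internal data recommended):
# Pearson correlation between average sedentary minutes and average heart rate;
# users with missing heart rate values are dropped automatically by cor.test()
cor.test(group_by_id$avg_sedentary_mins, group_by_id$avg_heartrate)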