Google Data Analytics Case Study

Bellabeat

Austin Doerr

2023-06-17

Introduction of Case Analysis

Hello, my name is Austin Doerr and this is my capstone project for the Google Data Analytics Program. I will be analyzing data related to Bellabeat, a tech company that provides fitness-tracking devices designed for women. My task is to analyze the data to discover insights and identify new growth opportunities for marketing their products.

Throughout this report, I will take you step by step through the thought process and strategies I used to complete my analysis of Bellabeat. The primary tool was R, but I also used Excel, Tableau, and SQL depending on the parameters of the data set. I will also briefly describe the data sources I used, as well as the background information needed to understand the context of this case study.

Background Information: Bellabeat

Bellabeat is a company in the fitness industry that creates and markets fashionable, wearable health trackers for women. These products have a variety of uses, such as tracking energy, stress, and productivity levels. As of 2023, the products sold on their website that are important to this analysis are listed below:

  • Ivy Health Tracker: A wearable tracker designed to show data about the user’s lifestyle based on activity levels. This is Bellabeat’s most popular product but also has the highest price tag, at $249.99 USD.
  • Leaf Urban: A wearable tracker that is similar to the Ivy but only tracks activity and sleep levels. This makes it come in at a lower price of $99.00 USD.

This information was collected from www.bellabeat.com, which has more information on the brand and its other products.


Preparing the Data

Data Sources

Throughout the analysis, I used one primary data source.

  1. Fitbit Fitness Tracker Data: This data set is public information that tracked 30 different users’ activity, heart rate, sleep, and other metrics. The data is split across multiple .csv files that contain everything from daily steps to heart rate each second.

  • For this analysis, I loaded the following data sets into R:

library(dplyr)    # %>%, mutate(), group_by(), summarise(), left_join()
library(ggplot2)  # ggplot() for the charts

# Load the daily tables, the per-second heart rate log, and the weight log
daily_steps <- read.csv("Data Sources/Fitabase Data 4.12.16-5.12.16/dailySteps_merged.csv")
daily_intensity <- read.csv("Data Sources/Fitabase Data 4.12.16-5.12.16/dailyIntensities_merged.csv")
daily_calories <- read.csv("Data Sources/Fitabase Data 4.12.16-5.12.16/dailyCalories_merged.csv")
heartrate_seconds <- read.csv("Data Sources/Fitabase Data 4.12.16-5.12.16/heartrate_seconds_merged.csv")
weight_info <- read.csv("Data Sources/Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv")

First, I wanted to convert the heartrate_seconds to mean heartrate_daily. This will allow me to merge this data with the other data sets.

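daily_heartrate <- heartrate_seconds %>%
  # Keep the first 9 characters of Time as the date. This assumes dates shaped like
  # "4/12/2016"; a shorter date such as "5/1/2016" keeps a trailing space and may
  # not match ActivityDay in the daily tables.
  mutate(ActivityDay = substr(Time, 1, 9)) %>%
  select(Id, ActivityDay, Value) %>%
  group_by(ActivityDay, Id) %>%
  # Mean heart rate per user per day
  summarise(heartrate = mean(Value), .groups = NULL)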
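The date extraction above relies on a fixed character width. A minimal base R sketch of a width-independent alternative is shown here. The three sample rows are made up for illustration, and the "month/day/year time" layout of Time is an assumption about the Fitbit export rather than something confirmed in this report.

```r
# Hypothetical sample in the assumed layout of heartrate_seconds
hr <- data.frame(
  Id    = c(1, 1, 1),
  Time  = c("4/12/2016 7:21:00 AM", "4/12/2016 7:21:05 AM", "5/1/2016 8:00:00 AM"),
  Value = c(90, 100, 70)
)

# Drop everything from the first space onward, leaving only the date
hr$ActivityDay <- sub(" .*$", "", hr$Time)

# Mean heart rate per user per day
daily_hr <- aggregate(Value ~ Id + ActivityDay, data = hr, FUN = mean)
names(daily_hr)[names(daily_hr) == "Value"] <- "heartrate"
print(daily_hr)
```

Because the date is cut at the first space, single-digit and double-digit days both come out without trailing whitespace, which keeps the keys comparable with ActivityDay in the daily tables.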

Now that we have cleaned and prepared the heart rate data set, let’s take a look at the new data table.

summary(daily_heartrate)
##  ActivityDay              Id              heartrate     
##  Length:334         Min.   :2.022e+09   Min.   : 59.38  
##  Class :character   1st Qu.:4.388e+09   1st Qu.: 70.47  
##  Mode  :character   Median :5.577e+09   Median : 77.49  
##                     Mean   :5.565e+09   Mean   : 78.61  
##                     3rd Qu.:6.962e+09   3rd Qu.: 84.93  
##                     Max.   :8.878e+09   Max.   :109.79        

We can now see, based on the number of rows, that the data set has a limitation: only a portion of the participants provided heart rate data (33%). We will keep this in mind for the rest of the analysis.
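A coverage figure like this can be checked by counting distinct users rather than rows. The sketch below uses invented Id vectors as stand-ins for daily_steps$Id and daily_heartrate$Id, so the numbers it prints are not results from this study.

```r
# Hypothetical Id vectors standing in for daily_steps$Id and daily_heartrate$Id
all_ids <- c(101, 101, 102, 103, 103, 104)
hr_ids  <- c(101, 101, 103)

n_all <- length(unique(all_ids))   # participants overall
n_hr  <- length(unique(hr_ids))    # participants with heart rate data
share <- round(100 * n_hr / n_all) # share in whole percent

cat(n_hr, "of", n_all, "participants have heart rate data:", share, "%\n")
```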


Merge Data

Now I will merge the data so I can use variables from different data tables for my analysis:

merged_data_1 <- merge(daily_steps, daily_intensity, by = c("Id", "ActivityDay"))
merged_data_2 <- merge(merged_data_1, daily_calories, by = c("Id", "ActivityDay"))
merged_data_3 <- left_join(merged_data_2, daily_heartrate, by = c("ActivityDay", "Id"))
        

# Now we are ready to create our cleaned data set.
# new_weight_info is not created in the code shown; it is presumably derived from weight_info.

complete_daily_data <- left_join(merged_data_3, new_weight_info, by = "Id") %>%
  subset(select = -c(IsManualReport, Fat, LogId, Date)) %>%
  # Keep only days with calories recorded, so days when the Fitbit was not turned on are excluded.
  filter(Calories > 0)
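
The join above uses new_weight_info, whose construction is not shown. One possible way to build such a table, offered only as an assumption, is to reduce the weight log to a single row per user so that joining by Id does not duplicate daily rows. The column names follow the summary output; the sample values and the names weight_log, weight_per_user, daily, and joined are invented, and averaging may differ from what was originally done.

```r
# Hypothetical stand-in for weight_info (several log entries per user)
weight_log <- data.frame(
  Id           = c(1, 1, 2),
  WeightKg     = c(60, 62, 80),
  WeightPounds = c(132.3, 136.7, 176.4),
  BMI          = c(22.0, 22.8, 27.0)
)

# One row per Id: average every measurement column within each user
weight_per_user <- aggregate(cbind(WeightKg, WeightPounds, BMI) ~ Id,
                             data = weight_log, FUN = mean)

# Hypothetical daily table; user 3 never logged a weight
daily <- data.frame(Id = c(1, 1, 2, 3), Calories = c(2000, 2100, 1800, 2500))

# Left join keeps all daily rows and fills missing weights with NA
joined <- merge(daily, weight_per_user, by = "Id", all.x = TRUE)
print(joined)
```

With one weight row per user, the joined table has exactly as many rows as the daily table, and users without a weight entry show NA, which matches the large NA counts in the weight columns.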

summary(complete_daily_data)
##        Id            ActivityDay          StepTotal     SedentaryMinutes
##  Min.   :1.504e+09   Length:936         Min.   :    0   Min.   :   0.0  
##  1st Qu.:2.320e+09   Class :character   1st Qu.: 3818   1st Qu.: 729.0  
##  Median :4.445e+09   Mode  :character   Median : 7441   Median :1057.0  
##  Mean   :4.850e+09                      Mean   : 7671   Mean   : 989.3  
##  3rd Qu.:6.962e+09                      3rd Qu.:10734   3rd Qu.:1226.0  
##  Max.   :8.878e+09                      Max.   :36019   Max.   :1440.0  
##                                                                         
##  LightlyActiveMinutes FairlyActiveMinutes VeryActiveMinutes
##  Min.   :  0.0        Min.   :  0.00      Min.   :  0.00   
##  1st Qu.:128.0        1st Qu.:  0.00      1st Qu.:  0.00   
##  Median :199.0        Median :  7.00      Median :  4.00   
##  Mean   :193.6        Mean   : 13.62      Mean   : 21.26   
##  3rd Qu.:264.2        3rd Qu.: 19.00      3rd Qu.: 32.00   
##  Max.   :518.0        Max.   :143.00      Max.   :210.00   
##                                                            
##  SedentaryActiveDistance LightActiveDistance ModeratelyActiveDistance
##  Min.   :0.000000        Min.   : 0.000      Min.   :0.00            
##  1st Qu.:0.000000        1st Qu.: 1.960      1st Qu.:0.00            
##  Median :0.000000        Median : 3.380      Median :0.24            
##  Mean   :0.001613        Mean   : 3.355      Mean   :0.57            
##  3rd Qu.:0.000000        3rd Qu.: 4.790      3rd Qu.:0.80            
##  Max.   :0.110000        Max.   :10.710      Max.   :6.48            
##                                                                      
##  VeryActiveDistance    Calories      heartrate         WeightKg     
##  Min.   : 0.000     Min.   :  52   Min.   : 59.38   Min.   : 52.60  
##  1st Qu.: 0.000     1st Qu.:1834   1st Qu.: 70.57   1st Qu.: 62.50  
##  Median : 0.220     Median :2144   Median : 79.27   Median : 71.05  
##  Mean   : 1.509     Mean   :2313   Mean   : 78.94   Mean   : 78.04  
##  3rd Qu.: 2.090     3rd Qu.:2794   3rd Qu.: 85.33   3rd Qu.: 85.80  
##  Max.   :21.920     Max.   :4900   Max.   :109.79   Max.   :133.50  
##                                    NA's   :693      NA's   :690     
##   WeightPounds        BMI       
##  Min.   :116.0   Min.   :21.45  
##  1st Qu.:137.8   1st Qu.:24.39  
##  Median :156.6   Median :26.46  
##  Mean   :172.0   Mean   :28.07  
##  3rd Qu.:189.2   3rd Qu.:27.45  
##  Max.   :294.3   Max.   :47.54  
##  NA's   :690     NA's   :690        


Analysis

For the analysis, I will group the participants into subgroups based on the calories they burn each day. The cut points come from the Calories summary above (first quartile 1834, mean 2313, third quartile 2794, maximum 4900), and each participant is assigned a group by their mean(Calories):

  • 0-1834 = Not Active
  • 1835-2313 = Somewhat Active
  • 2314-2794 = Active
  • 2795-4900 = Very Active

group_by_id <- complete_daily_data %>%
  mutate(total_distance = LightActiveDistance + ModeratelyActiveDistance + VeryActiveDistance) %>%
  group_by(Id) %>%
  summarise(avg_calories = mean(Calories),
            avg_steps = mean(StepTotal),
            avg_sedentary_mins = mean(SedentaryMinutes),
            avg_light_mins = mean(LightlyActiveMinutes),
            avg_fair_mins = mean(FairlyActiveMinutes),
            avg_very_mins = mean(VeryActiveMinutes),
            avg_total_distance = mean(total_distance),
            avg_heartrate = mean(heartrate, na.rm = TRUE),
            weight_pounds = mean(WeightPounds, na.rm = TRUE)) %>%
  mutate(
    # Cumulative upper bounds, so a non-integer average (e.g. 1834.5) cannot fall between bins
    activity_level = case_when(
      avg_calories <= 1834 ~ "Not Active",
      avg_calories <= 2313 ~ "Somewhat Active",
      avg_calories <= 2794 ~ "Active",
      avg_calories <= 4900 ~ "Very Active"
    )
  )

# Average the per-user means within each activity level
group_by_act_lvl <- group_by_id %>%
  group_by(activity_level) %>%
  summarise(num_of_ids = n(),
            avg_calories = mean(avg_calories),
            avg_steps = mean(avg_steps),
            avg_sedentary_mins = mean(avg_sedentary_mins),
            avg_light_mins = mean(avg_light_mins),
            avg_fair_mins = mean(avg_fair_mins),
            avg_very_mins = mean(avg_very_mins),
            avg_total_distance = mean(avg_total_distance),
            avg_heartrate = mean(avg_heartrate, na.rm = TRUE),
            avg_weight_pounds = mean(weight_pounds, na.rm = TRUE)
  ) %>%
  arrange(avg_calories)

tibble(group_by_act_lvl)
## # A tibble: 4 × 11
##   activity_level  num_of_ids avg_calories avg_steps avg_sedentary_mins
##   <chr>                <int>        <dbl>     <dbl>              <dbl>
## 1 Not Active               5        1567.     5918.              1016.
## 2 Somewhat Active         15        2018.     6509.               977.
## 3 Active                   5        2540.     8037.              1088.
## 4 Very Active              8        3107.    10242.               970.
## # ℹ 6 more variables: avg_light_mins <dbl>, avg_fair_mins <dbl>,
## #   avg_very_mins <dbl>, avg_total_distance <dbl>, avg_heartrate <dbl>,
## #   avg_weight_pounds <dbl>
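
The same grouping can also be sketched with base R's cut(), which uses contiguous, right-closed bins, so a non-integer average such as 1834.5 still lands in exactly one group. The avg_calories values below are invented for illustration; only the break points and labels come from the list above.

```r
# Hypothetical per-user average calories
avg_calories <- c(1567.2, 1834.5, 2018.0, 2540.3, 3107.9)

# Right-closed bins: (0,1834], (1834,2313], (2313,2794], (2794,4900]
activity_level <- cut(
  avg_calories,
  breaks = c(0, 1834, 2313, 2794, 4900),
  labels = c("Not Active", "Somewhat Active", "Active", "Very Active")
)

# Count users per level
print(table(activity_level))
```

cut() returns an ordered set of factor levels, which also keeps the groups in a sensible order in tables and charts.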

Now the data is organized into groups based on activity level. We can look for underlying trends in these categories that can give us better direction for marketing strategies.

  1. Let’s look at which activity levels the users in the study fall into.

ggplot(data = group_by_act_lvl, aes(x = "", y = num_of_ids, fill = activity_level)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start = 0) +
  # Label each slice with its share of all users (total taken from the data, not hard-coded)
  geom_text(aes(label = paste0(round(num_of_ids / sum(num_of_ids) * 100), "%")), position = position_stack(vjust = 0.5)) +
  theme_void()  # Removes background, theme, grids.
[Figure: pie chart of users by activity level]


Based on this information, the largest share of the Fitbit users (15 of 33, about 45%) are somewhat active. This means that they burn between 1835 and 2313 calories on average each day.
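
The percentages on the chart come from dividing each group's count by the total number of users. A small sketch of that arithmetic, using the group counts from the table above:

```r
# Group sizes taken from the tibble printed earlier
num_of_ids <- c("Not Active" = 5, "Somewhat Active" = 15, "Active" = 5, "Very Active" = 8)

# Share of all users in each group, in whole percent
share_pct <- round(100 * num_of_ids / sum(num_of_ids))
print(share_pct)
```

Because each share is rounded separately, the labels do not have to add up to exactly 100.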

There is more data that was derived from the data table grouped by activity level. The main observation I want to focus on is the characteristics of the largest group, “somewhat active”. If Bellabeat can understand the attributes of people who wear fitness trackers, then it can understand how to better target them. Based on the data below, calories burned are correlated with average distance traveled. This means Bellabeat could target individuals who travel around 5 miles each day.

[Image: average calories burned and average distance traveled, by activity level]


Another attribute that can be advertised to consumers is the positive relationship between sedentary minutes and heart rate among the Fitbit users.

Many consumers are concerned with their heart rate, so this data can be used to show that less sedentary time is associated with a lower heart rate. This demonstrates the benefit of buying a Bellabeat product.

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 19 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 19 rows containing missing values (`geom_point()`).        
[Figure: scatter plot of sedentary minutes against average heart rate, with a loess smoothing curve]

Based on this data, the more time users spend still, the higher their average heart rate. So the marketing could focus on the benefits of staying active with Bellabeat products.
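
The code behind this plot is not included, so the following is only a sketch of how the strength of such a relationship could be quantified with a correlation coefficient. The data frame and its values are invented for illustration; the column names mirror group_by_id. Users without heart rate data are left out of the calculation, and a correlation on its own does not show that sitting less lowers heart rate.

```r
# Hypothetical per-user averages; NA marks a user without heart rate data
users <- data.frame(
  avg_sedentary_mins = c(700, 850, 900, 1000, 1100, 1200),
  avg_heartrate      = c(66, 70, NA, 78, 80, 85)
)

# Pearson correlation using only users with both values present
r <- cor(users$avg_sedentary_mins, users$avg_heartrate, use = "complete.obs")
print(round(r, 2))

# Number of users that actually entered the calculation
n_used <- sum(complete.cases(users))
print(n_used)
```

Reporting n_used next to the coefficient makes it clear how few users the relationship rests on when heart rate data is sparse.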


Conclusion

Overall, the data offered many insights that can inform marketing decisions. After my analysis, I have three main recommendations that could lead to growth within the company.

  1. Focus on individuals who burn, on average, 1835-2313 calories every day. This was the largest subgroup based on activity level, which could mean they are more inclined to buy a fitness tracker.
  2. Market the effect that physical activity has on someone’s heart rate, with advertisements showing that more sedentary minutes go along with a higher average heart rate.
  3. Target individuals who travel approximately 5 miles per day. This could even be inferred from how far their commute to work is or what hobbies they enjoy. However, this will need further data analysis to support it.

Recommendations

  • Collect additional data internally regarding activity level and heart rate to see if there is a strong correlation between these variables.
  • Gather more personal information of the users including gender, age, and other data to make better informed decisions about demographics.

