Bellabeat Case Study Using R
INTRODUCTION: About Bellabeat
Bellabeat is a health and wellness technology company that creates innovative products for women. They specialize in creating wearable technology devices and apps that track fitness, stress, sleep, and menstrual cycles. Their products include the Leaf Urban, a health tracker that can be worn as a necklace, bracelet, or clip, and the Bellabeat app, which allows users to track their fitness and health goals. Bellabeat is known for their focus on women's health and their use of stylish, high-quality materials in their products.
ASK Process:
We can use the SMART framework in the Ask phase to frame better questions for the analysis.
Questions:
Business Task:
For this case study, the business task is to analyze smart device usage data to gain insights into women's activity and sleep patterns, and to use these insights to provide personalized wellness recommendations that improve women's overall health and wellbeing.
Prepare Process:
Data Source:
The data source for this case study is a publicly available dataset called "FitBit Fitness Tracker Data", which contains fitness and health metrics collected from the Fitbit devices of about 30 users. The records analyzed here date from April and May 2016. The data covers physical activity, sleep patterns, heart rate, weight, and other health metrics. In this case study, the dataset is used to look for patterns and trends in activity and sleep that can inform Bellabeat's products and services for women's health.
Data Cleaning and Transformation in R
# load packages for data manipulation and visualization (install any missing ones with install.packages() first)
library(tidyverse)
library(ggplot2)
library(lubridate)
library(ggmap)
library(ggthemes)
library(janitor)
library(readr)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.0     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.1     ✔ tibble    3.1.8
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<https://conflicted.r-lib.org/>) to force all conflicts to become errors
ℹ Google's Terms of Service: <https://mapsplatform.google.com>
ℹ Please cite ggmap if you use it! Use `citation("ggmap")` for details.
Attaching package: ‘janitor’
The following objects are masked from ‘package:stats’:
chisq.test, fisher.test
Import Data into RStudio
daily_ac <- read_csv("/kaggle/input/bellabeat/daily_ac.csv")
hourly_c <- read_csv("/kaggle/input/bellabeat/hourly_c.csv")
hourly_i <- read_csv("/kaggle/input/bellabeat/hourly_i.csv")
sleep_d <- read_csv("/kaggle/input/bellabeat/sleep_d.csv")
w_log <- read_csv("/kaggle/input/bellabeat/w_log.csv")
Rows: 940 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): ActivityDate
dbl (14): Id, TotalSteps, TotalDistance, TrackerDistance, LoggedActivitiesDi...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 22099 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): ActivityHour
dbl (2): Id, Calories
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 22099 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): ActivityHour
dbl (3): Id, TotalIntensity, AverageIntensity
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 413 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): SleepDay
dbl (4): Id, TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 67 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Date
dbl (6): Id, WeightKg, WeightPounds, Fat, BMI, LogId
lgl (1): IsManualReport
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# To check imported files
head(daily_ac)
head(hourly_c)
head(hourly_i)
head(sleep_d)
head(w_log)
# Check the column names for inconsistencies, which can cause errors later.
# janitor's clean_names() function can also be used to standardize the column names.
names(daily_ac)
names(hourly_c)
names(hourly_i)
names(sleep_d)
names(w_log)
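As a rough illustration of what standardizing column names means, the sketch below converts CamelCase names like those in the daily activity file to snake_case using base R only. It is an approximation under stated assumptions, not janitor's actual implementation: `to_snake` is a hypothetical helper and the toy data frame is made up.

```r
# Hypothetical helper (not part of janitor): insert "_" between a
# lowercase letter or digit and the uppercase letter that follows it,
# then convert everything to lowercase.
to_snake <- function(x) {
  tolower(gsub("([a-z0-9])([A-Z])", "\\1_\\2", x, perl = TRUE))
}

# Toy data frame with column names in the style of daily_ac (made-up row)
toy <- data.frame(Id = 1, ActivityDate = "4/12/2016", TotalSteps = 13162)

names(toy) <- to_snake(names(toy))
names(toy)
```

Consistent lowercase names make later joins and `select()` calls less error-prone, which is the main reason to do this step early.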
Date Manipulation
# Date manipulation in all data frames, because the dates were read in as character strings.
# Note that "%Y" is the four-digit year (with century) and "%y" is the two-digit year (without century).
# For Daily Activity
names(daily_ac)
daily_ac$ActivityDate<-as.POSIXct(daily_ac$ActivityDate, format= "%m/%d/%Y", tz=Sys.timezone())
daily_ac$date<-format(daily_ac$ActivityDate, format="%m/%d/%y")
# For hourly Calories
names(hourly_c)
hourly_c$ActivityHour<-as.POSIXct(hourly_c$ActivityHour, format="%m/%d/%Y %I:%M:%S %p",tz=Sys.timezone())
hourly_c$date<-format(hourly_c$ActivityHour, format="%m/%d/%y")
hourly_c$time<-format(hourly_c$ActivityHour, format="%H:%M:%S")
# For hourly intensities
names(hourly_i)
hourly_i$ActivityHour<-as.POSIXct(hourly_i$ActivityHour, format="%m/%d/%Y %I:%M:%S %p",tz=Sys.timezone())
hourly_i$date<-format(hourly_i$ActivityHour, format="%m/%d/%y" )
hourly_i$time<-format(hourly_i$ActivityHour, format="%H:%M:%S")
#For sleep_d
names(sleep_d)
sleep_d$SleepDay<-as.POSIXct(sleep_d$SleepDay, format="%m/%d/%Y %I:%M:%S %p",tz=Sys.timezone())
sleep_d$date<-format(sleep_d$SleepDay, format="%m/%d/%y")
sleep_d$time<-format(sleep_d$SleepDay, format="%H:%M:%S")
# For weight log
names(w_log)
w_log$Date<-as.POSIXct(w_log$Date, format="%m/%d/%Y %I:%M:%S %p", tz=Sys.timezone())
w_log$date<-format(w_log$Date, format="%m/%d/%y")
w_log$time<-format(w_log$Date, format="%H:%M:%S")
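The format codes used above can be checked on a single value. The following is a minimal sketch that assumes an English or C locale (so that "PM" is recognized by `%p`) and uses a fixed `tz = "UTC"` instead of `Sys.timezone()` to keep the result reproducible. The timestamp itself is an illustrative value in the same layout as the hourly files.

```r
# Illustrative timestamp in the 12-hour layout used by the hourly files
raw <- "4/12/2016 1:00:00 PM"

# %m/%d/%Y = month/day/four-digit year, %I = hour on a 12-hour clock, %p = AM/PM
parsed <- as.POSIXct(raw, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

date_short <- format(parsed, "%m/%d/%y")   # two-digit year
date_long  <- format(parsed, "%m/%d/%Y")   # four-digit year
time_24h   <- format(parsed, "%H:%M:%S")   # 24-hour clock

c(date_short, date_long, time_24h)
```

If `as.POSIXct()` returns `NA`, the format string does not match the text, which is the most common cause of silent failures in this step.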
#To identify formats and structure.
glimpse(daily_ac)
glimpse(hourly_c)
glimpse(hourly_i)
glimpse(sleep_d)
glimpse(w_log)
Rows: 940
Columns: 16
$ Id <dbl> 1503960366, 1503960366, 1503960366, 150396036…
$ ActivityDate <dttm> 2016-04-12, 2016-04-13, 2016-04-14, 2016-04-…
$ TotalSteps <dbl> 13162, 10735, 10460, 9762, 12669, 9705, 13019…
$ TotalDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
$ TrackerDistance <dbl> 8.50, 6.97, 6.74, 6.28, 8.16, 6.48, 8.59, 9.8…
$ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ VeryActiveDistance <dbl> 1.88, 1.57, 2.44, 2.14, 2.71, 3.19, 3.25, 3.5…
$ ModeratelyActiveDistance <dbl> 0.55, 0.69, 0.40, 1.26, 0.41, 0.78, 0.64, 1.3…
$ LightActiveDistance <dbl> 6.06, 4.71, 3.91, 2.83, 5.04, 2.51, 4.71, 5.0…
$ SedentaryActiveDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ VeryActiveMinutes <dbl> 25, 21, 30, 29, 36, 38, 42, 50, 28, 19, 66, 4…
$ FairlyActiveMinutes <dbl> 13, 19, 11, 34, 10, 20, 16, 31, 12, 8, 27, 21…
$ LightlyActiveMinutes <dbl> 328, 217, 181, 209, 221, 164, 233, 264, 205, …
$ SedentaryMinutes <dbl> 728, 776, 1218, 726, 773, 539, 1149, 775, 818…
$ Calories <dbl> 1985, 1797, 1776, 1745, 1863, 1728, 1921, 203…
$ date <chr> "04/12/16", "04/13/16", "04/14/16", "04/15/16…
Rows: 22,099
Columns: 5
$ Id <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150396036…
$ ActivityHour <dttm> 2016-04-12 00:00:00, 2016-04-12 01:00:00, 2016-04-12 02:…
$ Calories <dbl> 81, 61, 59, 47, 48, 48, 48, 47, 68, 141, 99, 76, 73, 66, …
$ date <chr> "04/12/16", "04/12/16", "04/12/16", "04/12/16", "04/12/16…
$ time <chr> "00:00:00", "01:00:00", "02:00:00", "03:00:00", "04:00:00…
Rows: 22,099
Columns: 6
$ Id <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 15039…
$ ActivityHour <dttm> 2016-04-12 00:00:00, 2016-04-12 01:00:00, 2016-04-12…
$ TotalIntensity <dbl> 20, 8, 7, 0, 0, 0, 0, 0, 13, 30, 29, 12, 11, 6, 36, 5…
$ AverageIntensity <dbl> 0.333333, 0.133333, 0.116667, 0.000000, 0.000000, 0.0…
$ date <chr> "04/12/16", "04/12/16", "04/12/16", "04/12/16", "04/1…
$ time <chr> "00:00:00", "01:00:00", "02:00:00", "03:00:00", "04:0…
Rows: 413
Columns: 7
$ Id <dbl> 1503960366, 1503960366, 1503960366, 1503960366, 150…
$ SleepDay <dttm> 2016-04-12, 2016-04-13, 2016-04-15, 2016-04-16, 20…
$ TotalSleepRecords <dbl> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ TotalMinutesAsleep <dbl> 327, 384, 412, 340, 700, 304, 360, 325, 361, 430, 2…
$ TotalTimeInBed <dbl> 346, 407, 442, 367, 712, 320, 377, 364, 384, 449, 3…
$ date <chr> "04/12/16", "04/13/16", "04/15/16", "04/16/16", "04…
$ time <chr> "00:00:00", "00:00:00", "00:00:00", "00:00:00", "00…
Rows: 67
Columns: 10
$ Id <dbl> 1503960366, 1503960366, 1927972279, 2873212765, 2873212…
$ Date <dttm> 2016-05-02 23:59:59, 2016-05-03 23:59:59, 2016-04-13 0…
$ WeightKg <dbl> 52.6, 52.6, 133.5, 56.7, 57.3, 72.4, 72.3, 69.7, 70.3, …
$ WeightPounds <dbl> 115.9631, 115.9631, 294.3171, 125.0021, 126.3249, 159.6…
$ Fat <dbl> 22, NA, NA, NA, NA, 25, NA, NA, NA, NA, NA, NA, NA, NA,…
$ BMI <dbl> 22.65, 22.65, 47.54, 21.45, 21.69, 27.45, 27.38, 27.25,…
$ IsManualReport <lgl> TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, …
$ LogId <dbl> 1.462234e+12, 1.462320e+12, 1.460510e+12, 1.461283e+12,…
$ date <chr> "05/02/16", "05/03/16", "04/13/16", "04/21/16", "05/12/…
$ time <chr> "23:59:59", "23:59:59", "01:08:52", "23:59:59", "23:59:…
ANALYZE:
The data has been cleaned, so we now move on to the heart of the analysis process. There are four phases in the analysis process.
# Use n_distinct() to count the distinct rows in each data frame.
n_distinct(daily_ac)
n_distinct(hourly_c)
n_distinct(hourly_i)
n_distinct(sleep_d)
n_distinct(w_log)
940
22099
22099
410
67
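The counts above are the number of distinct rows (records) in each data set, not the number of participants. The sleep data has 410 distinct rows out of 413, so 3 rows are duplicates. To count participants, apply n_distinct() to the Id column of each data frame instead.
Note: 67 records in "weight log" are too few to support any recommendations or conclusions based on this data.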
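To make the difference between counting rows and counting participants concrete, here is a minimal base R sketch on made-up data. The toy data frame and its values are assumptions for illustration only; with dplyr loaded, `n_distinct(toy$Id)` would give the same participant count.

```r
# Made-up records: three participants, one exact duplicate row (Id 2, 150 steps)
toy <- data.frame(
  Id         = c(1, 1, 2, 2, 2, 3),
  TotalSteps = c(100, 200, 150, 150, 300, 50)
)

n_rows          <- nrow(toy)               # all records
n_distinct_rows <- nrow(unique(toy))       # records after dropping exact duplicates
n_participants  <- length(unique(toy$Id))  # distinct participant Ids

c(rows = n_rows, distinct_rows = n_distinct_rows, participants = n_participants)
```

Calling a distinct-count function on a whole data frame answers "how many different records are there", so sample size should always be judged from the Id column.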
# Summary of steps, distance, sedentary minutes and calories
daily_ac %>%
select(TotalSteps,
TotalDistance,
SedentaryMinutes, Calories) %>%
summary()
# Active minutes by activity level
daily_ac %>%
select(VeryActiveMinutes, FairlyActiveMinutes, LightlyActiveMinutes) %>%
summary()
# Average intensities
hourly_i%>% select(TotalIntensity) %>% summary()
# Summary for calories
hourly_c %>%
select(Calories) %>%
summary()
#Summary for sleep
sleep_d %>% select(TotalSleepRecords, TotalMinutesAsleep, TotalTimeInBed) %>%
summary()
TotalSteps TotalDistance SedentaryMinutes Calories
Min. : 0 Min. : 0.000 Min. : 0.0 Min. : 0
1st Qu.: 3790 1st Qu.: 2.620 1st Qu.: 729.8 1st Qu.:1828
Median : 7406 Median : 5.245 Median :1057.5 Median :2134
Mean : 7638 Mean : 5.490 Mean : 991.2 Mean :2304
3rd Qu.:10727 3rd Qu.: 7.713 3rd Qu.:1229.5 3rd Qu.:2793
Max. :36019 Max. :28.030 Max. :1440.0 Max. :4900
VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes
Min. : 0.00 Min. : 0.00 Min. : 0.0
1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.:127.0
Median : 4.00 Median : 6.00 Median :199.0
Mean : 21.16 Mean : 13.56 Mean :192.8
3rd Qu.: 32.00 3rd Qu.: 19.00 3rd Qu.:264.0
Max. :210.00 Max. :143.00 Max. :518.0
TotalIntensity
Min. : 0.00
1st Qu.: 0.00
Median : 3.00
Mean : 12.04
3rd Qu.: 16.00
Max. :180.00
Calories
Min. : 42.00
1st Qu.: 63.00
Median : 83.00
Mean : 97.39
3rd Qu.:108.00
Max. :948.00
TotalSleepRecords TotalMinutesAsleep TotalTimeInBed
Min. :1.000 Min. : 58.0 Min. : 61.0
1st Qu.:1.000 1st Qu.:361.0 1st Qu.:403.0
Median :1.000 Median :433.0 Median :463.0
Mean :1.119 Mean :419.5 Mean :458.6
3rd Qu.:1.000 3rd Qu.:490.0 3rd Qu.:526.0
Max. :3.000 Max. :796.0 Max. :961.0
The weight data is too sparse to draw conclusions from. I include its summary here for completeness, but you can reasonably leave it out of your own analysis.
w_log %>% select(WeightKg, BMI) %>% summary()
WeightKg BMI
Min. : 52.60 Min. :21.45
1st Qu.: 61.40 1st Qu.:23.96
Median : 62.50 Median :24.39
Mean : 72.04 Mean :25.19
3rd Qu.: 85.05 3rd Qu.:25.56
Max. :133.50 Max. :47.54
These summaries give me a better understanding of the data before moving on to the next step, visualization.
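The means in the summary tables are easier to interpret in hours. The sketch below converts three of the reported means (419.5 minutes asleep, 458.6 minutes in bed, 991.2 sedentary minutes) and computes the share of time in bed spent asleep. Note the assumption: a ratio of two means is only a rough indicator and is not the same as the average per-night ratio.

```r
# Means copied from the summary() output above
mean_asleep_min    <- 419.5
mean_in_bed_min    <- 458.6
mean_sedentary_min <- 991.2

asleep_hours    <- mean_asleep_min / 60               # average sleep per record, in hours
sedentary_hours <- mean_sedentary_min / 60            # average sedentary time per day, in hours
asleep_share    <- mean_asleep_min / mean_in_bed_min  # rough share of bed time spent asleep

round(c(asleep_hours = asleep_hours,
        sedentary_hours = sedentary_hours,
        asleep_share_pct = 100 * asleep_share), 1)
```

Read this way, users average about 7 hours of sleep per record and roughly 16.5 sedentary hours per day, which is a useful starting point for the plots that follow.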
Data Merging
merged_data<-merge(sleep_d,daily_ac, by=c('Id', 'date'))
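By default, base R's `merge()` performs an inner join, so only Id and date combinations present in both data frames are kept. The sketch below shows this on two made-up data frames (all values are illustrative assumptions) and contrasts it with a left join via `all.x = TRUE`.

```r
# Made-up sleep records
toy_sleep <- data.frame(
  Id   = c(1, 1, 2),
  date = c("04/12/16", "04/13/16", "04/12/16"),
  TotalMinutesAsleep = c(327, 384, 412)
)

# Made-up activity records; participant 2 has no activity on 04/12/16
toy_activity <- data.frame(
  Id   = c(1, 1, 2, 3),
  date = c("04/12/16", "04/13/16", "04/13/16", "04/12/16"),
  SedentaryMinutes = c(728, 776, 1218, 726)
)

# Inner join (default): only rows whose Id and date match in both inputs
inner <- merge(toy_sleep, toy_activity, by = c("Id", "date"))

# Left join: keep every sleep record, filling missing activity with NA
left <- merge(toy_sleep, toy_activity, by = c("Id", "date"), all.x = TRUE)

nrow(inner)
nrow(left)
```

Comparing the row count of the merged result with the inputs is a quick way to see how many records were dropped by the join.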
Visualization:
ggplot(data = daily_ac, aes(x = TotalSteps, y = Calories)) +
  geom_point(colour = "Blue") +
  geom_line() +
  geom_smooth(colour = "Red") +
  labs(title = "Relation Between Calories and TotalSteps") +
  theme(plot.title = element_text(color = "Green"))
Checking the relationship between sleep time and time in bed
ggplot(data = sleep_d, aes(x = TotalMinutesAsleep, y = TotalTimeInBed)) +
  geom_point(colour = "Blue") +
  geom_line() +
  geom_smooth(colour = "Red") +
  labs(title = "Correlation Between Sleep Time in Minutes and Total Time in Bed") +
  theme(plot.title = element_text(color = "Green"))
The points fall close to a straight line, which suggests a strong positive linear association between Total Minutes Asleep and Total Time in Bed: users who spend more time in bed also tend to sleep longer.
Relationship between total intensity and time of day (hours).
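A visual impression of linearity can be backed up with a number. The sketch below computes the Pearson correlation and a simple linear fit using only the first ten sleep records visible in the glimpse() output earlier. This is an assumption-heavy sample for illustration, and the result for the full sleep_d data frame may differ.

```r
# First ten values of each column as shown in the glimpse(sleep_d) output
minutes_asleep <- c(327, 384, 412, 340, 700, 304, 360, 325, 361, 430)
time_in_bed    <- c(346, 407, 442, 367, 712, 320, 377, 364, 384, 449)

# Pearson correlation: values near 1 indicate a strong positive linear association
r <- cor(minutes_asleep, time_in_bed)

# Simple linear regression of time in bed on minutes asleep
fit   <- lm(time_in_bed ~ minutes_asleep)
slope <- unname(coef(fit)["minutes_asleep"])

r
slope
```

On the full data, the same calls would be `cor(sleep_d$TotalMinutesAsleep, sleep_d$TotalTimeInBed)` and `lm(TotalTimeInBed ~ TotalMinutesAsleep, data = sleep_d)`.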
hourly_i %>% group_by(time) %>% drop_na() %>%
summarize(mean_total_intensities=mean(TotalIntensity)) %>%
ggplot(aes(x=time,y=mean_total_intensities))+
geom_col(width=0.4,position=position_dodge(width=0.5),colour="white",fill="lightgreen")+
labs(title="Relation B/W Total Intensities and Hourly Time",x="Time",y="Average Total Intensities")+
theme(axis.text.x =element_text(angle = 90))+
theme(plot.title=element_text(colour="black"))+
theme(plot.title=element_text(face="bold"))+
theme(panel.background =element_rect(fill = "black"))+
theme(axis.text.x =element_text(face="bold",colour = "black" ))+
theme(axis.text.y =element_text(face="bold",colour = "black" ))
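The grouping step behind the bar chart (drop missing values, then average intensity per hour) can be sketched in base R on made-up data. The toy values below are assumptions for illustration; `aggregate()` here plays the role of `group_by()` plus `summarize()` in the pipeline above.

```r
# Made-up hourly records for two hours of the day, with one missing value
toy <- data.frame(
  time           = c("08:00:00", "08:00:00", "09:00:00", "09:00:00", "09:00:00"),
  TotalIntensity = c(10, 20, 30, NA, 50)
)

# Drop incomplete rows, mirroring drop_na()
toy_complete <- toy[complete.cases(toy), ]

# Mean intensity for each hour, mirroring group_by(time) + summarize(mean(...))
hourly_mean <- aggregate(TotalIntensity ~ time, data = toy_complete, FUN = mean)

hourly_mean
```

Each row of the result corresponds to one bar in the chart, so inspecting this table directly is a quick way to verify what the plot shows.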
Relationship between Total Minutes Asleep and Sedentary Minutes
ggplot(data=merged_data, aes(x=TotalMinutesAsleep,y=SedentaryMinutes))+
geom_point(colour="green")+
geom_smooth(colour="white")+
geom_line(colour="white")+
theme(plot.background = element_rect(fill="grey"),
panel.background = element_rect(fill="black"),
axis.text.x=element_text(face="bold",colour="black"),
axis.text.y=element_text(face="bold",colour="black"),
plot.title = element_text(face="bold",colour="black"))+
labs(title="Correlation B/W TotalMinutesAsleep and SedentaryMinutes",x="Total Minutes Asleep",y="Sedentary Minutes")
SHARE:
This phase is delivered as a presentation, but you can also view the full analysis in my Kaggle notebook.
Key tasks
ACT:
This phase will be carried out by the executive team, the Director of Marketing (Bellabeat), and the Marketing Analytics team, who will act on my analysis. This is known as data-driven decision making.
Data-Driven decisions:
My recommendations based on my analysis.
These are just some ideas for improving the Bellabeat app. It is important to analyze the data further and conduct user research to determine the most effective strategies for improving user engagement, health outcomes, and overall satisfaction with the app.