Bellabeat | My Case Study to become Data Analyst by Google
daniel vanegas
people analytics | HR | people ops | ex-novartis AG | master’s candidate in Environmental Science
Introduction
Bellabeat is a high-tech company that manufactures health-focused products for women.
Founded in 2013 by Ur?ka Sr?en (Chief Creative Officer) and Sando Mur, Bellabeat has grown rapidly and quickly positioned itself as a wellness-tech company for women.
By 2016, Bellabeat had opened offices around the world and launched multiple products, including an app (Bellabeat), a wellness tracker (Leaf), a wellness smartwatch (Time), a smart bottle (Spring) and a subscription-based membership program (Bellabeat) that provides users access to personalised guidance on having a healthy lifestyle.
The co-founder Ur?ka Sr?en knows that the company have the potential to become a larger player in the global smart device market. She believes that analyzing smart device fitness data could help unlocking new growth opportunities for the company.?
Based on that, Sr?en has asked the marketing analytics team to analyze smart device data to get insights on how consumers are using their smart devices. With the output of this process, she would like to understand if there is any further opportunity to strengthen the Bellabeat marketing strategy.
Ask
Business Task
The business task is to analyze the consumer data on how they are using wellness-tech products while providing recommendations to the marketing leadership team in order to make data-driven decisions oriented to build up the business strategy.
The purpose of this research is to help Bellabeat understands the habits of its users, identifies potential new features for its products and/or inspires ideas for developing new products with a greater impact.
Stakeholers
Primary stakeholders:
Secondary stakeholders:
Prepare
Data Source:
The data comes from FitBit Fitness Tracker Data (CC0: Public Domain license), which is stored on Kaggle and made available by Mobius.
This dataset has in total 18 files in .csv format organized in long format. It contains personal fitness tracker from FitBit users, who consented the submission of their personal data so that it can be used to explore user habits.
ROCCC Analisys:
Credibility and Integrity:
Process
To process the data, the following tools will be used:
Data Cleaning
The FitBit Fitness Tracker Data (Fitabase Data 4.12.16-5.12.16) was downloaded from the Mobius profile in Kaggle and loaded into Google sheet for data cleaning.?
After reviewing format inconsistencies with Google sheet and its spreadsheet functions across the different files, the field 'time' has been separated in a different column.
The data was loaded into Google BigQuery and, after analyzing and running a set of queries to check the number of unique ‘Id’, the result was:
Unique 'Id' per table:
The query syntax used for each of them:
SELECT DISTINCT
Id
FROM
`proyecto-vX.Bellabeat_data_X.X_merged`
It also can be visualized after being loaded and analyzed with Tableau:
After loading de dataset into RStudio, the next step was:
The syntax used in each case for the set of data frames was:
```{r Dimensions (n cols & rows)}
dim(Ready_20230303_dailyActivity_merged)
```
```{r Searching for NA}
sum(is.na(Ready_20230303_dailyActivity_merged))
```
```{r Searching for Duplicates}
sum(duplicated(Ready_20230303_dailyActivity_merged))
```
```{r Removing Duplicates}
SpleeDay_WO_Dup <- Ready_20230302_sleepDay_merged[!duplicated(Ready_20230302_sleepDay_merged), ]
```
```{r Adding 'WeekDay'}
DailyActivity_Weekday <- Ready_20230303_dailyActivity_merged %>%
mutate(Weekday = weekdays(as.Date(`ActivityDate MOD`, "%a")))
```
```{r Standardizing'Date'}
DailyActivity_Weekday <- rename(DailyActivity_Weekday, Date = 'ActivityDate MOD')
```
From the above results, the number of observations for each SQL data frame is:
After having done this analysis, we can confirm that the data contained in the 'dailyCalories', 'dailyIntensities' and 'dailySteps' data frames is present in the 'dailyActivity' (Activity) data frame. So, we move on with the data available on Activity, Sleep (df: 'SleepDay') and Weight (df: 'weightLogInfo') using the rest of the information as referral purposes.
Analysis
The hypothesis was made with the data observed on Activity, Sleep and Weight.
With this parameter in mind and this being the first step taken, two variables were contrasted to visualize the relationship between them: Per Id, Total_Activity in Distance and Total_Steps reported during the period.
The code used to obtain the image was:
```r{Relationship Total Active Distance & Total Steps}
ggplot(data=DailyActSte_DistMin_Num_NoNA, aes(x=TotalSteps, y = TotalActiveDistance))+
geom_point()+
stat_smooth(method=lm)+
theme_minimal()+
labs(title='Steps & Distance',
subtitle='Total Active Distance and Total Steps per Id during the period',
caption='db Daily Activity by devanegas March, 2023',
x='Total Steps', y='Total Active Distance') +
theme(plot.title=element_text(family='Sans Serif', face='bold', size=32),
plot.subtitle=element_text(family='Sans Serif', size=16, color='grey'),
plot.caption = element_text(family='Sans Serif', size=12),
axis.title.x = element_text(family='Sans Serif', face='bold', size=22, hjust=1),
axis.title.y = element_text(family='Sans Serif', face='bold', size=22, hjust=0.5),
axis.text = element_text(size=14),
strip.text=element_text(family='Sans Serif', face='bold', size=6, hjust=1, color='white'),
strip.background=element_rect(fill='purple'))+
geom_point(color='purple', size=6)
```
We can observe a positive relationship between Activity in distance and Steps made, meaning the more steps users took, the more distance is covered. So now, we want to validate the correlation data, not only between them but also the rest of the variables. That is why we calculate a correlation matrix.
```r{Correlation matrix}
round(cor(DailyActSte_DistMin_Num_NoNA),1)
```
```r{To validate if it is statiticaly significant we calculate the p-value}
rcorr(as.matrix(DailyActSte_DistMin_Num_NoNA))
```
```r{Save the matrix as an object}
correlacion <-round(cor(DailyActSte_DistMin_Num_NoNA),1)
```
```r{Correlation Matrix Chart}
corrplot(correlacion,
method = c("number"),
type = c("upper"),
col = NULL,
col.lim = NULL,
bg = "white",
title = "Correlation Matrix",
is.corr = TRUE,
add = FALSE,
diag = TRUE,
outline = FALSE,
mar = c(0, 0, 0, 0),
addgrid.col = 'grey',
addCoef.col = NULL,
addCoefasPercent = FALSE,
order = c("original"),
hclust.method = c("complete"),
addrect = NULL,
rect.col = "black",
rect.lwd = 7,
tl.pos = NULL,
tl.cex = 0.6,
tl.col = "red",
tl.offset = 0.5,
tl.srt = 45,
cl.pos = NULL,
cl.length = 11,
cl.cex = 2,
cl.ratio = 0.2,
cl.align.text = "c",
cl.offset = 1,
number.cex = 1.25,
number.font = 12,
number.digits = 1,
addshade = c("all"),
shade.lwd = 1,
shade.col = "white",
p.mat = NULL,
sig.level = 0.05,
insig = c("label_sig"),
pch = 4,
pch.col = "black",
pch.cex = 3,
plotCI = c("n"),
lowCI.mat = NULL,
uppCI.mat = NULL,
na.label = "?",
na.label.col = "black",
win.asp = 1)
```
Once validated the high correlation (r = 0.8) between Steps and Activity | Distance, we have to go deeper to understand if this relationship could help us to validate our hypothesis.
Share
Activity Distance effect
We assume that as far as Steps and Activity | Distance are related, the variable 'Calories' (Calories) should be affected somehow, this is because according to studies the more steps taken (with their consequent physical activity), the more calories are burned.
To validate this assumption, we used the Pivot Table tool of Google Spreadsheet to calculate the average values of Steps & Calories per Weekday to observe their behavior.
Although we observe that the 2 variables have similar behavior, the slope of the Steps trend line is steeper up versus Calories, which means that both are not affected to the same extent.
The correlation factor is r=0.6, which demonstrate a positive quantitative relationship between them.
Once analyzed Activity | Distance versus Calories burned per weekday, now we want to understand the Activity | Minutes characteristics in order to evaluate its possible relevance in this reasoning.
Uncovering Activity Minutes
What would be the proportional distribution of each level of Activity in minutes? To answer this question we used the Pivot Table tool of Google Spreadsheet to create a pie chart showing each category and its proportional participation of the total Activity | Minutes
领英推荐
We can observe the Activity | Minutes data in four categories: Very Active, Fairly Active, Lightly Active and Sedentary. In these blocks we can notice that users spent 81% of their daily activity in Sedentary Minutes, reporting the rest 19% in Active Minutes (grouping Very Active, Fairly Active and Lightly Active categories).
Once we have quantified and segmented the user behavior in terms of the quality of time spent on physical activity, we are interested in inquiring about the characteristics of the Active Minutes group to analyze its relevance in the process. To do so, we started by contrasting it under the lens of the normal distribution curve chart (Gauss Bell).
To obtain the necessary data, we calculated the quartiles and the sample mean, which, from the code shown below, gave us the consequent graphic display.
```r{Total Active Minutes Mean, Max, Min, Quartiles}
mean_Act_Minutes <- mean(Act_Minutes$Total_Active)
max_Act_Minutes <- max(Act_Minutes$Total_Active)
min_Act_Minutes <- min(Act_Minutes$Total_Active)
median_Act_Minutes <- median(Act_Minutes$Total_Active)
p25_Act_Minutes <- quantile(Act_Minutes$Total_Active,.25)
p50_Act_Minutes <- quantile(Act_Minutes$Total_Active,.50)
p75_Act_Minutes <- quantile(Act_Minutes$Total_Active,.75)
summary(Act_Minutes$Total_Active)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 146.8 247.0 227.5 317.2 552.0
```
The resulting values of Total Active Minutes during the period are:
We used this info to calculate and analyze the Gaussanian distribution.
```r{Histogram and Gaussian Distribution}
hist(Act_Minutes$Total_Active, prob = TRUE,
main = "Distribution Minutes Active",
ylab = "Density", xlab = "Minutes Active | Grouping Very Fairly & Lightly Active", col = "darkgoldenrod1", ylim=c(0,0.004))
x <- seq(min(Act_Minutes$Total_Active), max(Act_Minutes$Total_Active), length = 600)
f <- dnorm(x, mean = mean(Act_Minutes$Total_Active), sd = sd(Act_Minutes$Total_Active))
grid(nx = NA, ny = NULL, lty = 10, col = "white", lwd = 1)
hist(Act_Minutes$Total_Active, prob = TRUE, add = TRUE, col = "darkgoldenrod1")
curve(dnorm(x, mean = mean(Act_Minutes$Total_Active), sd = sd(Act_Minutes$Total_Active)), from=-600,to=600,
add=TRUE, col="red", lwd=2)
legend("topleft", col=c("red"), legend =c("Estimated Normal Density"), lwd=2, bty="n", size=4)
```
```r{Skewness}
skewness(Act_Minutes$Total_Active)
[1] -0.3617311
′′′
```r{Kurtosis}
kurtosis(Act_Minutes$Total_Active)
[1] 2.443673
```
In this normal distribution we can note some asymmetry, so that we proceed to calculate the Skewness and Kurtosis factors in order to understand the level of concentration of data and degree of asymmetry.
The result of Skewness was -0.36 which means that the data points are more concentrated towards the right-hand side of the distribution. In this case (a negative Skewness distribution), the Mean and Median bend towards the right having a relationship between them as follows, Median > Mean (247 > 228).
Kurtosis returned a value of 2.44, which leads us to deduce that it is a Platicurtic distribution; it means that the data points are heavy-tailed (there are more values in the tail than a normal distribution) and presents a low degree of concentration (greater dispersion) of values around their Mean.
Now that we have found out how the data set could get distorted, what its properties are and the concentration degree within its points, we will now investigate the relationship of these variables with trends.
Physical activity habits, contrast
According to the World Health Organization Fact Sheet for Physical Activity, adults aged between 18–64 years "should do at least 150–300 minutes of moderate-intensity aerobic physical activity; or at least 75–150 minutes of vigorous-intensity aerobic physical activity; or an equivalent combination of moderate- and vigorous-intensity activity throughout the week" to improve health. Since we do not have age data in the sample, we assume the population surveyed belong to this range (18-64 years old).
Based on that, to know how many users practice healthy habits (aligned with the WHO guidelines), we have created a code that count those individuals who daily registered at least 21.4 minutes of Fairly Active or 10.7 minutes of Very Active.
The result is 30 users who met WHO parameters in terms of time invested in physical activity. That would be the lower limit of the range.
Now, what about the upper limit? That is, the people who daily registered at least 42.8 minutes of Fairly Active or 21.4 minutes of Very Active. Well, the result was 28 users.
```r{Users with Daily Physical Activity | Low Limit}
Fair_VeryAct_InfUsers <- DailyActivity_Weekday %>%
filter(FairlyActiveMinutes >= 21.4 | VeryActiveMinutes>=10.7) %>%
group_by(Id) %>%
count(Id)
```
```r{Users with Daily Physical Activity | Upper Limit}
Fair_VeryAct_UppUsers <- DailyActivity_Weekday %>%
filter(FairlyActiveMinutes >= 42.8 | VeryActiveMinutes>=21.4) %>%
group_by(Id) %>%
count(Id)
```
This is interesting, because even when the 81% of the time reported by users was dedicated to Sedentary Activity, the 91% of individuals meets the minimal parameters recommended by WHO to improve good health by investing time in more intense activity and at least 85% of users spend more minutes than necessary.
The relevance of this information, together with what we checked previously finding out the relationship between Calories & Activity | Distance, gives us the opportunity to analyze the behavior of burned Calories and Activity | Minutes with new glasses.
The graph above resulting from Tableau, reflects the very similar behavior between Total Minutes spent in physical activity and Total Calories burned among the weekdays.
The highest correlation between Calories and any of the categories inside the Total Activity | Minutes variable is r=0.6. This value describes the positive quantitative relationship between Total Calories burned and Total Very Active Minutes invested in physical exercise.
Calories are a pragmatic measure to track and identify the effect of personal efforts to obtain desired results and thus achieve a healthy life. According to Center for Desease Control and Prevention (CDC) the burning of calories through physical activity is specially important to maintain a healthy weight which ultimatly contributes to good health.
Once identified how Activity | Distance (analyzed through the Steps variable) and Activity | Minutes are related to Calories, we have a clearer insight to go deeper and estimate the next steps in business matter.
An Active point of view
The inquiries in our dataset have yield some relevant insights so far:
Having said that and based on the available information, we could argue the user profile is a person with an active life style oriented to improve him/her health well-being relying on smart-technology to measure progress.
Now, where could we find a business opportunity in this escenario? Let's zoom in on the data.
How many users spend more than the minutes set in the lower limit of WHO recommendations for each Activity | Minutes category alone without having reported any other category.
```r{Low Limit WHO Guidelines | Very Active Only | Minutes}
FairNO_VeryActYES_LightNO_InfUsers <- DailyActivity_Weekday %>%
filter(VeryActiveMinutes>=10.7 & FairlyActiveMinutes <1 & LightlyActiveMinutes <1) %>%
group_by(Id) %>%
count(Id)
```
```r{Low Limit WHO Guidelines | Fairly Active Only | Minutes}
FairYES_VeryActNO_LightNO_InfUsers <- DailyActivity_Weekday %>%
filter(FairlyActiveMinutes >= 21.4 & VeryActiveMinutes<1 & LightlyActiveMinutes <1) %>%
group_by(Id) %>%
count(Id)
```
```r{Low Limit WHO Guidelines | Lightly Active Only | Minutes}
FairNO_VeryActNO_LightYES_InfUsers <- DailyActivity_Weekday %>%
filter(LightlyActiveMinutes >= 42.8 & VeryActiveMinutes<1 & FairlyActiveMinutes <1) %>%
group_by(Id) %>%
count(Id)
```
That would be a total of 4 users (12% of the sample) that reported Activity | Minutes in only one Activity category without reporting any other one.
These results tell us that 88% of users (n=29) use their smart devices not only to report specific categories of Activity | Minutes, but also a combination, distinguishing between them according to personal criteria or their device setting for tracking progress.
Now, how many users spend less minutes than the value set in the lower limit of WHO recommendations for the categories Lightly Activity | Minutes and Fairly Activity | Minutes without having reported Very Active | Minutes?
```r{Low Limit WHO Guidelines | Fairly and Lightly Active Only | Minutes}
Oportunidad_FairYES_VeryActNO_LightYES_InfUsers <- DailyActivity_Weekday %>%
filter(FairlyActiveMinutes < 21.4 & VeryActiveMinutes<1 & LightlyActiveMinutes <42.8) %>%
group_by(Id) %>%
count(Id)
```
The answer is 21 individuals that have reported less Activity | Minutes than the ones recommended by WHO for Fairly | Minutes or Lightly | Minutes. That represent the 63% of the sample.
Taking into account that 11 users (33% of the sample) reported more than zero Lightly | Minutes and less Activity | Minutes than the ones recommended by WHO, we can assume that there is a potential conversion opportunity for Lightly | Activity users to reach the minimum numbers of health wellness standards in terms of Fairy and Very Active | Minutes and quantity of users per category.
```r{Low Limit WHO Guidelines | Fairly and Lightly>0 Active Only | Minutes}
Oport_LighMayorCero_FairYES_VeryActNO_InfUsers <- DailyActivity_Weekday %>%
filter(FairlyActiveMinutes < 21.4 & VeryActiveMinutes<1 & LightlyActiveMinutes <42.8 & LightlyActiveMinutes >0) %>%
group_by(Id) %>%
count(Id)
```
Given that the user profile is a person with an active life style using smart-technology to track his/her well-health evolution, there is a high possibility of conversion.
This being the starting point, we use the summary function in RStudio to calculate the detail of each category of Activity | Minutes.
```r{Summary Daily Physical Activity | Minutes}
summary(Act_Minutes)
```
VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes Total_Active
Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.:127.0 1st Qu.:146.8
Median : 4.00 Median : 6.00 Median :199.0 Median :247.0
Mean : 21.16 Mean : 13.56 Mean :192.8 Mean :227.5
3rd Qu.: 32.00 3rd Qu.: 19.00 3rd Qu.:264.0 3rd Qu.:317.2
Max. :210.00 Max. :143.00 Max. :518.0 Max. :552.0
As we already know our sample has an assymetric distribution with heavy-tailed concentration towards the right-hand side of the curve, which means a great dispersion of values around their Mean.
This can be reflected in the summary table above, where the 3rd Qu of Very Active & Fairly Active triples (and even more) the Median value, and the 1st Qu result is zero in both cases.
Looking at Total Active values in P75 (3rd Qu) while comparing it against Lightly Active, a range of minutes can be observed that could be susceptible to analysis by calculating target conversion times or any other relevant measure indicator; In the same way, this calculation can be supported with the detailed demographic segmentation and the specification of the physical activities performed.
Act
Bellabeat is a successful high-tech manufacturer of health-focused products for women with the potential to become a larger player in the global smart device market.
This research has been requested by the leadership team to have a better understanding about the habits of its users, identify potential new features for its products and/or inspire ideas for developing new ones with a greater impact.
Below are some findings and recommendations that could help the commercial team of Bellabeat to make data-driven decisions on company future products / functionalities and business growth.
Findings:
Recommendations:
Thanks for your attention, I hope this information is helpful to you.
If you want to contact me, find me here:??
LinkedIn: danielvanegas | my site: devanegas | twitter: devanegas
kaggle: danielvanegas | github: de-vanegas