From Data to Insights: Understanding Obesity through Statistical and Exploratory Data Analysis

Introduction

For a business analyst, the ability to analyze datasets and uncover valuable insights is critical. In this article, we explore a dataset from the UCI Machine Learning Repository built to estimate obesity levels based on the eating habits and physical condition of individuals from Colombia, Peru, and Mexico. The dataset contains 2,111 instances and 17 attributes. About 77% of the data was synthetically generated using the Weka tool and the SMOTE filter, while 23% was collected directly from users via a web platform. Through a step-by-step exploration, this article aims to help junior analysts understand the methodologies, techniques, and reasoning needed to draw insights from complex data.

Dataset Overview

The chosen dataset contains detailed information on individuals' demographics, eating habits, and physical activity levels, making it suitable for exploring various analytical approaches. The data includes attributes such as age, gender, height, weight, and other lifestyle factors that contribute to obesity levels. The objective is to analyze this dataset using Python and derive meaningful insights that help us understand obesity trends and patterns.

Analytical Approach

To conduct a comprehensive analysis, we will employ several steps and techniques. These steps include:

Data Preprocessing: Handling missing values and converting categorical variables to numerical values and vice-versa.
Exploratory Data Analysis (EDA): Visualizing data attributes through histograms, bar plots, and scatter plots to identify patterns and trends.
Statistical Analysis: Creating a heatmap to visualize correlations between different attributes.

NOTE: ChatGPT was used as a consulting tool to help write the code where needed.

Data Preprocessing

We start by importing the libraries needed for our analysis. To avoid typing the full library name every time we call one of its functions, we give each library a short alias:

pandas (pd) for data manipulation
matplotlib.pyplot (plt) and seaborn (sns) for data visualization

Next, we load our dataset using pd.read_csv, which reads the CSV file into a DataFrame (df).

We then check the shape of the DataFrame to confirm its dimensions.

Next, we use df.info() to get an overview of the DataFrame, including the data types and non-null counts of each column.
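
The loading and inspection steps above can be sketched as follows. This is a minimal illustration, not the original notebook: the three sample rows are hypothetical and are read from an in-memory string so the snippet runs without the real CSV file. With the actual file, you would pass its path to pd.read_csv instead.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for the real CSV file.
sample_csv = io.StringIO(
    "Gender,Age,Height,Weight,FCVC,NCP,NObeyesdad\n"
    "Female,21.0,1.62,64.0,2.0,3.0,Normal_Weight\n"
    "Male,27.0,1.80,87.0,2.0,3.0,Overweight_Level_I\n"
    "Female,23.664709,1.738836,133.472641,3.0,3.0,Obesity_Type_III\n"
)

# Read the CSV data into a DataFrame
df = pd.read_csv(sample_csv)

# Confirm the dimensions as (rows, columns)
print(df.shape)

# Data types and non-null counts for each column
df.info()
```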

In pandas, the object dtype typically means that the column contains string data, but it can also include any Python object. Columns with this dtype can hold any kind of data, making them very flexible but less efficient in terms of memory and performance compared to more specific data types.
The float64 dtype is precise and can store a wide range of values, making it suitable for numerical data that requires decimal precision, such as measurements, calculations, and other numeric values with decimal points.

We also use df.describe() to get statistical summaries of the numerical columns and spot potential outliers right away.

Since nothing stands out, we display the first few rows using df.head() to get a sense of the data.

The data appears consistent, but since we know that 77% of it was synthetically generated, it's prudent to inspect the last rows as well with df.tail().
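
As a rough sketch of these inspection calls, the snippet below uses a small hypothetical DataFrame in which the first rows look like survey answers and the last rows look synthetic. The values are illustrative only.

```python
import pandas as pd

# Hypothetical sample: survey-like rows first, synthetic-looking rows last.
df = pd.DataFrame({
    "Age": [21.0, 27.0, 23.0, 23.664709, 20.976842],
    "Height": [1.62, 1.80, 1.80, 1.738836, 1.710730],
    "Weight": [64.0, 87.0, 77.0, 133.472641, 131.408528],
    "FCVC": [2.0, 2.0, 3.0, 3.0, 3.0],
})

# Count, mean, std, min, quartiles, and max for each numerical column
summary = df.describe()
print(summary)

# First rows of the DataFrame
print(df.head(3))

# Last rows, where the long decimals hint at synthetic records
print(df.tail(2))
```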

Looking at the last few rows, we notice non-integer values in columns where integers would be expected, and the float variables carry many more decimal places, reflecting the synthetic nature of the data.

To enhance readability and consistency, we round the 'Height' and 'Weight' columns to two and one decimal places, respectively.

Since many columns represent ordinal survey responses, we convert their float values (including the synthetic, non-integer ones) to integers to better represent this data.
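
A minimal sketch of the rounding and integer conversion is shown below. The sample values and the choice of ordinal columns are hypothetical, and rounding to the nearest integer before casting is an assumption (a plain cast would truncate instead).

```python
import pandas as pd

# Hypothetical rows mixing survey-style and synthetic-style values
df = pd.DataFrame({
    "Height": [1.738836, 1.62],
    "Weight": [133.472641, 64.0],
    "FCVC": [2.967300, 2.0],
    "NCP": [3.0, 1.411685],
})

# Two decimals for height (meters), one decimal for weight (kilograms)
df["Height"] = df["Height"].round(2)
df["Weight"] = df["Weight"].round(1)

# Assumption: ordinal columns are rounded to the nearest integer, then cast
ordinal_columns = ["FCVC", "NCP"]
for column in ordinal_columns:
    df[column] = df[column].round().astype(int)

print(df)
```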

We then create a new column for BMI (Body Mass Index). We calculate the BMI by dividing 'Weight' by the square of 'Height' and rounding the result to one decimal place for consistency.
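
The BMI calculation can be sketched as below, assuming 'Weight' is in kilograms and 'Height' is in meters, so that BMI = weight / height². The three rows are hypothetical.

```python
import pandas as pd

# Hypothetical sample with weight in kilograms and height in meters
df = pd.DataFrame({
    "Height": [1.62, 1.80, 1.74],
    "Weight": [64.0, 87.0, 133.5],
})

# BMI = weight (kg) divided by the square of height (m), one decimal place
df["BMI"] = (df["Weight"] / df["Height"] ** 2).round(1)

print(df)
```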

We aim to reclassify the obesity levels in our dataset based on BMI according to the WHO Classification, as the original data includes an additional classification for overweight.

First, we create a new column, 'Weight_Classification', based on the BMI values. This involves defining a function, classify_obesity, which categorizes BMI into various classes:

Underweight: BMI less than 18.5
Normal: BMI between 18.5 and 24.9
Overweight: BMI between 25.0 and 29.9
Obesity I: BMI between 30.0 and 34.9
Obesity II: BMI between 35.0 and 39.9
Obesity III: BMI 40 or higher

We apply this function to the BMI column to create the 'Weight_Classification' column and remove the old 'NObeyesdad' column as it is now redundant.
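
One way to write the classification step is sketched below. The function name and the thresholds come from the text above; the body is an assumption about how they might be implemented, and the sample rows are hypothetical. Because BMI is already rounded to one decimal place, comparing against the upper bound of each class (for example, below 25.0) matches the ranges listed.

```python
import pandas as pd


def classify_obesity(bmi):
    """Map a BMI value to the weight classes listed above."""
    if bmi < 18.5:
        return "Underweight"
    elif bmi < 25.0:
        return "Normal"
    elif bmi < 30.0:
        return "Overweight"
    elif bmi < 35.0:
        return "Obesity I"
    elif bmi < 40.0:
        return "Obesity II"
    return "Obesity III"


# Hypothetical sample with one row per class
df = pd.DataFrame({
    "BMI": [17.3, 24.4, 26.9, 31.2, 37.5, 44.1],
    "NObeyesdad": [
        "Insufficient_Weight", "Normal_Weight", "Overweight_Level_I",
        "Obesity_Type_I", "Obesity_Type_II", "Obesity_Type_III",
    ],
})

# Classify each BMI value, then drop the original label column
df["Weight_Classification"] = df["BMI"].apply(classify_obesity)
df = df.drop(columns=["NObeyesdad"])

print(df)
```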

To better reflect the original survey responses, we map numerical values into descriptive categories for certain columns.

Finally, we convert object/text variables to category variables for efficient processing and better analysis.
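
The mapping and dtype conversion might look like the sketch below. The label dictionary is hypothetical: the exact wording of the survey answers used in the original analysis is not shown in this article.

```python
import pandas as pd

# Hypothetical sample with one text column and one ordinal column
df = pd.DataFrame({
    "Gender": ["Female", "Male", "Female"],
    "FCVC": [1, 2, 3],
})

# Hypothetical labels for the vegetable-consumption answers
fcvc_labels = {1: "Never", 2: "Sometimes", 3: "Always"}
df["FCVC"] = df["FCVC"].map(fcvc_labels)

# Convert the text columns to the memory-efficient category dtype
text_columns = ["Gender", "FCVC"]
for column in text_columns:
    df[column] = df[column].astype("category")

print(df.dtypes)
```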

We verify the changes using df.info(), which provides an overview of the DataFrame, including the data types and non-null counts of each column.

Upon reviewing the data, we notice missing values in the 'Number of Main Meals' column ('NCP'). This happens because the original survey offered only three answer options, while the dataset contains values from 1 to 4, so some values have no matching category. To ensure consistency, we drop these rows and reset the index of the DataFrame.
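
The sketch below illustrates how such missing values can arise and how the rows are dropped. It assumes the gaps come from mapping values to labels, where a value without a label becomes NaN; the labels and rows are hypothetical.

```python
import pandas as pd

# Hypothetical sample in which one NCP value (4) has no survey option
df = pd.DataFrame({"NCP": [3, 1, 4, 2], "Age": [21, 27, 23, 22]})

# Hypothetical labels for the three survey answer options
ncp_labels = {1: "Between 1 and 2", 2: "Three", 3: "More than three"}

# Values missing from the dictionary (here, 4) become NaN
df["NCP"] = df["NCP"].map(ncp_labels)
print(df["NCP"].isna().sum())

# Drop the rows with missing NCP and renumber the index from 0
df = df.dropna(subset=["NCP"]).reset_index(drop=True)
print(df)
```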

Now the data is clean and ready to be analyzed.

Exploratory Data Analysis (EDA)

Numerical Data

To visualize the distribution of the numerical columns in our dataset, we will create histograms for each column. The numerical columns we are focusing on are 'Age', 'Height', 'Weight', and 'BMI'.

We start by defining the columns we want to plot.
Then, we define the figure size for our plots to ensure that they are large enough to be easily readable. In this case, we used 12 inches wide and 6 inches tall.
We use a loop to create a subplot for each numerical column.

The ‘i’ represents the index of the current element in the created list.
The enumerate function helps us keep track of the index and the column name simultaneously.

Within the loop, we use plt.subplot to specify the position of the current plot.

This line specifies that each histogram will be placed in a 2x2 grid of subplots. The position of each subplot is determined by i+1 (since subplot positions start at 1 and the Python index at 0).

We then use plt.hist to create the histogram for the current column. The edgecolor argument adds a border around each bar in the histogram, and alpha makes the bars semi-transparent.
We set the title of each subplot to the column name and label the y-axis as 'Frequency' to indicate that the histogram shows the frequency distribution of the values.
We use plt.tight_layout to automatically adjust the subplot parameters so that the plots fit into the figure area without overlapping.
Finally, we use plt.show() to display all the histograms.
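
The steps above can be sketched as follows. The sample data, the number of bins, and the alpha value are assumptions chosen for illustration. The Agg backend line only makes the snippet run without a display and can be omitted in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "Age": [21, 27, 23, 22, 30, 19, 41, 25],
    "Height": [1.62, 1.80, 1.74, 1.55, 1.68, 1.71, 1.65, 1.78],
    "Weight": [64.0, 87.0, 133.5, 50.2, 80.1, 77.3, 95.0, 102.4],
})
df["BMI"] = (df["Weight"] / df["Height"] ** 2).round(1)

# Columns to plot
numerical_columns = ["Age", "Height", "Weight", "BMI"]

# Figure 12 inches wide and 6 inches tall
fig = plt.figure(figsize=(12, 6))

for i, column in enumerate(numerical_columns):
    # 2x2 grid; subplot positions start at 1, Python indices at 0
    plt.subplot(2, 2, i + 1)
    plt.hist(df[column], bins=10, edgecolor="black", alpha=0.7)
    plt.title(column)
    plt.ylabel("Frequency")

# Prevent overlapping subplots, then display
plt.tight_layout()
plt.show()
```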

The age distribution is heavily skewed towards younger individuals, with the majority of the data points clustered between ages 18 and 30. There are very few instances of individuals above the age of 40. This indicates that the survey from which the data was collected likely targeted or attracted a younger demographic. This age concentration should be considered when generalizing findings to the broader population, as it may not be representative of older age groups.
The height distribution appears to be roughly normal, centered around 1.65 meters. Most individuals fall within the range of 1.55 to 1.75 meters, indicating a fairly standard height distribution for an adult population.

The weight distribution shows a wider spread, with a peak around 80 kg. There is a noticeable drop in frequency for weights above 100 kg and below 50 kg, indicating that most individuals in the dataset have a weight between 50 kg and 100 kg. This could be due to the influence of various factors such as lifestyle, diet, and physical activity. The dataset includes a considerable number of individuals with higher weights, which may need special attention in health-related analyses.

However, just like the age variable, this can indicate that the survey targeted or attracted individuals with a higher body weight/larger body size.

The BMI distribution is also roughly normal, with a peak around 25-30, which falls within the 'Overweight' category according to WHO standards. There is a significant number of individuals with BMI values indicating overweight and obesity, suggesting a prevalence of higher BMI in the dataset. The presence of many individuals with BMI values indicating overweight and obesity highlights potential public health concerns related to weight management and associated health risks.

Categorical Data

To visualize the distribution of the categorical variables in our dataset, we will create bar plots for each column. The categorical columns we are focusing on are 'Gender', 'family_history_with_overweight', 'FAVC', 'FCVC', 'NCP', 'CAEC', 'SMOKE', 'CH2O', 'SCC', 'FAF', 'TUE', 'CALC', 'MTRANS' and 'Weight_Classification'.

We start by listing all the categorical columns we want to plot.
To ensure that our plots are visually appealing, we define a color palette using seaborn.
For some columns, we want to display the categories in a specific order, so we create a dictionary to hold these orders.
Similar to the histograms, we use a loop to create a subplot for each categorical column.
The plt.subplot function specifies the position of the current plot.
We use value_counts to get the frequency of each category and reorder the categories if a specific order is defined.
We plot the bar charts, setting the color, title, and labels.
Finally, we adjust the layout with plt.tight_layout() and display the plots with plt.show().
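
A reduced sketch of this loop, using only two of the categorical columns, is shown below. The sample rows, the "pastel" palette, and the category labels are assumptions. The bars are drawn with plt.bar, which is one of several ways to plot the counts.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical sample with two categorical columns
df = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male", "Male"],
    "FCVC": ["Sometimes", "Always", "Sometimes", "Never", "Always"],
})

categorical_columns = ["Gender", "FCVC"]

# Color palette from seaborn (the palette name is an assumption)
palette = sns.color_palette("pastel")

# Columns whose categories should appear in a specific order
category_orders = {"FCVC": ["Never", "Sometimes", "Always"]}

fig = plt.figure(figsize=(10, 4))

for i, column in enumerate(categorical_columns):
    plt.subplot(1, 2, i + 1)
    # Frequency of each category
    counts = df[column].value_counts()
    # Reorder the categories when an order is defined for this column
    if column in category_orders:
        counts = counts.reindex(category_orders[column], fill_value=0)
    plt.bar(list(counts.index), list(counts.values), color=palette[:len(counts)])
    plt.title(column)
    plt.ylabel("Count")

plt.tight_layout()
plt.show()
```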

The dataset has a nearly equal representation of males and females, ensuring balanced gender representation. This balance is beneficial for ensuring that any conclusions drawn from the data are applicable to both genders.

The majority of individuals have a family history of being overweight, suggesting a potential genetic or environmental influence on weight.
Most individuals frequently consume high-calorie food (FAVC), indicating dietary habits that might contribute to higher BMI values.
Vegetable consumption (FCVC) is fairly frequent, with most individuals consuming vegetables sometimes or always, which suggests some awareness of healthy eating habits. However, part of the population still consumes vegetables infrequently, indicating room for improvement in dietary education and promotion.

In terms of Number of Main Meals (NCP), most individuals eat between one and two main meals, with fewer individuals having three or more meals. This pattern could influence overall caloric intake and metabolic health, and further analysis could explore its impact on weight status.

A substantial number of people consume food between meals (CAEC) sometimes or frequently.

Most of the dataset consists of non-smokers, which might positively influence their overall health status.

Individuals have varying levels of daily water intake (CH2O). However, a significant number of individuals drink less than a liter of water a day.

Most individuals do not monitor their caloric intake (SCC), indicating a lack of dietary awareness or discipline.

Physical activity frequency (FAF) varies, with a notable portion engaging in some form of weekly exercise, though many also report never exercising.

The usage of technology devices (TUE) is mixed, with most individuals spending between 0-2 hours and some spending more time.

Alcohol consumption (CALC) is generally low, with most individuals reporting no or infrequent consumption.

Public transportation is the most common mode of transport (MTRANS), followed by automobiles, which suggests predominantly urban living conditions.

The dataset shows a balanced distribution across different weight classifications, with a significant number of individuals categorized as overweight or obese, emphasizing the prevalence of higher BMI values.

Gender Distribution

To analyze the relationship between height and weight by gender, as well as the distribution of weight classification per gender, we will create scatter and count plots, respectively.

Scatter Plot of Height vs Weight

We specify different markers and colors for males and females to distinguish them clearly in the scatter plot.
We create a scatter plot using seaborn's scatterplot function. The hue and style parameters differentiate the points by gender using the specified markers and colors.
We add a title and labels for the axes to make the plot informative.
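
A minimal sketch of this scatter plot is shown below. The marker shapes, colors, title, and sample rows are hypothetical choices, not the ones used in the original analysis.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical sample data
df = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male"],
    "Height": [1.62, 1.80, 1.55, 1.78],
    "Weight": [64.0, 87.0, 50.2, 102.4],
})

# Hypothetical marker and color choices for each gender
markers = {"Female": "o", "Male": "s"}
colors = {"Female": "tab:orange", "Male": "tab:blue"}

fig, ax = plt.subplots(figsize=(8, 6))

# hue sets the color and style sets the marker, both keyed on gender
sns.scatterplot(
    data=df, x="Height", y="Weight",
    hue="Gender", style="Gender",
    markers=markers, palette=colors, ax=ax,
)

ax.set_title("Height vs Weight by Gender")
ax.set_xlabel("Height (m)")
ax.set_ylabel("Weight (kg)")
plt.show()
```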

Males generally have higher weights for a given height compared to females. This is consistent with differences in body composition between genders, such as males typically having more muscle mass, although the dataset does not measure body composition directly.
The distribution of points shows that males have a wider range of both height and weight compared to females. This suggests greater variability in body size among males.

Distribution of Weight Classification per Gender

We use seaborn's countplot function to create a bar plot that shows the frequency of each weight classification, separated by gender.
We specify the order of weight classifications and use the custom color palette defined earlier.
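
The count plot can be sketched as follows, again with hypothetical sample rows, colors, and labels. The order list follows the weight classes defined during preprocessing.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical sample data
df = pd.DataFrame({
    "Gender": ["Female", "Female", "Male", "Male", "Male", "Female"],
    "Weight_Classification": [
        "Underweight", "Normal", "Overweight",
        "Obesity I", "Obesity II", "Obesity III",
    ],
})

# Display order of the weight classes on the x-axis
class_order = [
    "Underweight", "Normal", "Overweight",
    "Obesity I", "Obesity II", "Obesity III",
]

# Hypothetical colors for each gender
colors = {"Female": "tab:orange", "Male": "tab:blue"}

fig, ax = plt.subplots(figsize=(10, 6))

# One bar per weight class and gender
sns.countplot(
    data=df, x="Weight_Classification", hue="Gender",
    order=class_order, palette=colors, ax=ax,
)

ax.set_title("Weight Classification per Gender")
ax.set_xlabel("Weight Classification")
ax.set_ylabel("Count")
plt.show()
```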

There are more underweight females than males. This could be due to different societal pressures or biological factors influencing weight.

The normal weight category has a relatively equal distribution between males and females, suggesting that both genders have similar proportions of individuals within a healthy weight range.

The overweight category shows a higher frequency of males compared to females. This may reflect differences in lifestyle, diet, or physical activity levels between genders.

The distribution of individuals in the obesity categories reveals that males tend to have higher frequencies in Obesity I and II, while females have higher frequencies in Obesity III. This indicates that while males are more frequently overweight or in the lower obesity categories, females who are obese are more likely to fall into the more severe category (Obesity III).

Correlation Analysis

To analyze the statistical relationships between various features in our dataset, we will convert categorical variables to numerical ones and then create a correlation heatmap.

We define dictionaries to map categorical text values to numerical values.
We apply the mappings to convert text values to numerical values.

We then convert the categorical columns to integers and check the DataFrame structure.

For the correlation heatmap, we drop the columns 'Height', 'Weight', and 'BMI', as they are used to calculate Weight_Classification and would bias the correlation analysis.
Finally, we calculate the correlation matrix and plot the heatmap.
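
These steps might be sketched as below for a reduced set of columns. The encodings in the mapping dictionaries are hypothetical, as are the sample rows and the heatmap styling. Note that the sign of any correlation involving an encoded column depends on the encoding chosen: with Female = 0 and Male = 1, for example, a positive correlation with Gender points toward males.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical sample data
df = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male", "Male", "Female"],
    "FAVC": ["yes", "no", "yes", "yes", "no", "yes"],
    "Height": [1.62, 1.80, 1.74, 1.78, 1.70, 1.65],
    "Weight": [64.0, 87.0, 133.5, 55.0, 92.0, 60.0],
    "Weight_Classification": [
        "Normal", "Overweight", "Obesity III",
        "Underweight", "Obesity I", "Normal",
    ],
})
df["BMI"] = (df["Weight"] / df["Height"] ** 2).round(1)

# Hypothetical text-to-number encodings
mappings = {
    "Gender": {"Female": 0, "Male": 1},
    "FAVC": {"no": 0, "yes": 1},
    "Weight_Classification": {
        "Underweight": 0, "Normal": 1, "Overweight": 2,
        "Obesity I": 3, "Obesity II": 4, "Obesity III": 5,
    },
}

# Apply the mappings and convert the results to integers
for column, mapping in mappings.items():
    df[column] = df[column].map(mapping).astype(int)

# Drop the columns used to derive Weight_Classification, then correlate
corr = df.drop(columns=["Height", "Weight", "BMI"]).corr()

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation Heatmap")
plt.tight_layout()
plt.show()
```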

Gender:

Gender shows a moderate positive correlation with FCVC (0.28), indicating that females are more likely to consume vegetables frequently.
Gender also has a weak negative correlation with FAF (-0.18), suggesting that males are more likely to engage in frequent physical activity.

Age:

Age shows a positive correlation with family_history_with_overweight (0.2) and Weight_Classification (0.21), indicating that older individuals are more likely to have a family history of being overweight and are more likely to fall into higher weight classifications.

Family History with Overweight:

family_history_with_overweight shows a moderate positive correlation with Weight_Classification (0.5), the largest among the correlations reported here, which is consistent with a genetic or familial predisposition to higher weight categories.

High-Calorie Food Consumption (FAVC):

FAVC is positively correlated with Weight_Classification (0.28), indicating that individuals who frequently consume high-calorie foods are more likely to have higher weight classifications.

Vegetable Consumption (FCVC):

FCVC has a positive correlation with Weight_Classification (0.23), which could suggest a relationship between vegetable consumption and weight, though it might be influenced by other dietary habits.

Physical Activity (FAF):

FAF has a negative correlation with Weight_Classification (-0.14), indicating that more frequent physical activity is associated with lower weight classifications.

Frequency of Food Consumption between Meals (CAEC):

The negative correlation between CAEC and Weight_Classification (-0.32) suggests that individuals who snack more frequently tend to have lower weight classifications. One possible explanation is that healthier, frequent snacking helps manage hunger and prevents overeating during main meals, but a correlation alone cannot confirm this.

Technology Use (TUE):

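TUE shows a very weak negative correlation with Weight_Classification (-0.09). Taken at face value, this means that more time spent on technology devices is associated with slightly lower weight classifications, not higher ones, but the relationship is too weak to support a conclusion about sedentary behavior.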

Key Insights and Recommendations

The concentration of younger individuals may influence the interpretation of health-related variables such as BMI and weight, as younger populations tend to have different health profiles compared to older populations.
The presence of higher BMI values suggests that interventions or studies focused on weight management, physical activity, and diet could be particularly relevant for this population.
High-calorie food consumption is prevalent, so educational campaigns that promote healthier eating are essential, ideally supported by calorie monitoring tools.
Physical activity levels vary, but many individuals report no exercise. Community programs and workplace fitness initiatives are recommended.
There is a clear positive correlation between family history of overweight and higher weight classifications, so familial predisposition should inform personalized health plans.
Many people drink less than a liter of water daily. Campaigns should promote hydration and reinforce the existing trend of low alcohol consumption.
High use of automobiles and public transportation suggests an urban lifestyle. Urban planning should promote walkability and cycling to increase physical activity and, consequently, reduce obesity risk.

References

UCI Machine Learning Repository. (n.d.). https://archive.ics.uci.edu/dataset/544/estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition

Palechor, F. M., & De La Hoz Manotas, A. (2019). Dataset for estimation of obesity levels based on eating habits and physical condition in individuals from Colombia, Peru and Mexico. Data in Brief, 25, 104344. https://doi.org/10.1016/j.dib.2019.104344

De-La-Hoz-Correa, E., Mendoza-Palechor, F. E., De-La-Hoz-Manotas, A., Morales-Ortega, R. C., & Adriana, S. H. B. (2019). Obesity Level Estimation Software based on Decision Trees. Journal of Computer Science, 15(1), 67–77. https://doi.org/10.3844/jcssp.2019.67.77

OpenAI. (2024, June 26). ChatGPT. https://www.openai.com/chatgpt

Camila Stanize, MBA

Pharmacovigilance | Market Access | Medical Affairs | Drug Safety | Compliance | Pharmaceutical & Healthcare | Data Analytics
