My first Python file.

https://colab.research.google.com/drive/1PWZb2sVuoXr81XKvFLd6u8ldaWIdeeC8?usp=drive_link

Context and Objective

As part of my coursework for an Intro to Python summer class, I had the opportunity to analyze a dataset with demographic and lifestyle information from a few Latin American countries, including my own: Colombia. Most of the data was synthetically generated with ML algorithms in the Weka tool, and a smaller percentage was collected from individuals in the three countries studied: Mexico, Peru and Colombia.

Stage 1

Dataset: overview and preprocessing

The study incorporated 17 variables covering demographic details, dietary habits, measures of physical activity and obesity classifications for men and women of different age groups, for a total of 2111 instances. The aim of my analysis was to explore the connection between these attributes and the diagnosed obesity levels (I, II and III). A list of the variables and the role they play in this data set is shared below for a better interpretation of the attributes and their impact on the obesity levels.

No preliminary cleaning was required, as the data had no missing values. The observations were also evenly distributed between genders, which made the sample more representative of the overall population, and the heights exhibited minimal skewness. To prepare the data for the analysis that follows, several categorical variables, including gender, family_history_overweight and FAVC, as well as the target variable (Obesity_levels), were encoded using OneHotEncoder.


Stage 2

Exploratory Data Analysis (EDA)

Identifying patterns and trends

The first part of my analysis focused on determining the measures of central tendency for each numerical variable. Each feature had a total count of 2111, confirming that there were no missing values within the dataset. The average age of the sample population was 24 years, with a mean height and weight of 1.70 m and 86.58 kg respectively.

Among all variables, age exhibited the largest skewness, with a minimum value of 14, a maximum value of 61 and a standard deviation of 6.3. To better represent the statistical findings, I complemented the summary of averages with plots to visualize the distribution of the features gender, weight and height. Although the lowest age was 14, the histogram indicated that most individuals were between 18 and 26 years of age.
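As a rough illustration of how these summary statistics and the skewness can be obtained with pandas, here is a small sketch. The ten rows are made up and only mimic the shape of the real columns (a cluster of young adults plus a few older people); the names `sample`, `summary` and `skewness` are my own.

```python
import pandas as pd

# Made-up ages, heights and weights standing in for the real columns.
sample = pd.DataFrame({
    "Age": [18, 19, 21, 21, 22, 23, 24, 26, 38, 61],
    "Height": [1.55, 1.62, 1.65, 1.68, 1.70, 1.72, 1.75, 1.78, 1.80, 1.85],
    "Weight": [52.0, 60.0, 68.0, 75.0, 80.0, 86.0, 90.0, 98.0, 110.0, 125.0],
})

# count, mean, std, min, quartiles and max for every numeric column.
summary = sample.describe()
print(summary.loc[["count", "mean", "std", "min", "max"]])

# Positive skewness means a long right tail (a few much older people).
skewness = sample.skew()
print(skewness.sort_values(ascending=False))
```

A count equal to the number of rows in every column is a quick confirmation that no values are missing.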


#Univariate analysis

- FCVC: Most of the individuals sampled consumed either 2 or 3 vegetables in their meals.

- NCP: Similarly, more than half of the respondents reported eating 3 meals per day, with slightly over 200 individuals indicating consumption of only 1 meal per day.

- FAF: In terms of physical activity, the majority of participants reported no physical activity at all.

- CAEC: Nearly 1750 individuals reported a frequency of “sometimes” for snack consumption, with very few (fewer than 100) reporting “no” or “always”.

- Family_history_overweight: This variable indicated that close to 1750 individuals in the sample had family members with a history of being overweight.

- Obesity_Levels: As the target variable, the results were relatively evenly balanced among the different levels, including underweight, normal weight and all overweight and obesity levels. Obesity type I displayed a slightly higher count, but otherwise the results were consistent.
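Counts like the ones in the list above can be read off with `value_counts`. The sketch below uses six invented responses rather than the real data, and the category spellings are assumptions.

```python
import pandas as pd

# Made-up responses for two of the categorical variables.
responses = pd.DataFrame({
    "CAEC": ["Sometimes", "Sometimes", "Frequently", "Sometimes", "no", "Always"],
    "family_history_overweight": ["yes", "yes", "no", "yes", "yes", "no"],
})

# Absolute counts per category, most frequent first.
caec_counts = responses["CAEC"].value_counts()
print(caec_counts)

# Shares instead of counts make columns of different sizes comparable.
family_share = responses["family_history_overweight"].value_counts(normalize=True)
print(family_share.round(2))
```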

#Bivariate analysis

When exploring the relationships between each pair of variables through bivariate analysis, I found few clear associations within the data. One of the few exceptions was the relationship between age and means of transportation. Despite the low representation of older individuals in the analysis, the correlation coefficient of -0.6 on the heatmap revealed that older individuals tended to rely less on public transportation and walking than their younger counterparts.
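One way to obtain a coefficient like this is to turn the transport column into 0/1 indicators and correlate each with age. The sketch below assumes that approach; it does not reproduce the heatmap, the ten rows are invented, and `MTRANS` with these category names is my assumption about how the column is labeled.

```python
import pandas as pd

# Made-up ages and transport modes.
trips = pd.DataFrame({
    "Age": [18, 20, 21, 23, 25, 33, 38, 41, 45, 52],
    "MTRANS": ["Public_Transportation", "Walking", "Public_Transportation",
               "Public_Transportation", "Public_Transportation", "Automobile",
               "Automobile", "Public_Transportation", "Automobile", "Automobile"],
})

# Turn the categorical column into 0/1 indicator columns.
dummies = pd.get_dummies(trips["MTRANS"], dtype=float)

# Pearson correlation of age with each transport indicator.
age_corr = dummies.corrwith(trips["Age"])
print(age_corr.round(2))
```

A negative value for an indicator means that the people using that mode tend to be younger than the rest of the sample.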

Explanation for the target variable through bivariate analysis: The distribution of obesity levels suggested that obesity_type_III was present exclusively among females, while obesity_type_II appeared only among males. Additional insights on weight and gender included a higher prevalence of underweight individuals among the female population, with nearly 50% more females than males deemed underweight, and a more even distribution of normal weight between the two genders. This stage of the analysis was crucial because it steered my attention towards understanding the gender dynamics in relation to obesity levels.
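A cross-tabulation is a compact way to see which obesity levels occur for which gender. The following is only a sketch on eight invented rows; the label spellings are assumed and may differ from the dataset's.

```python
import pandas as pd

# Made-up rows; the real column names may differ in capitalisation.
people = pd.DataFrame({
    "Gender": ["Female", "Female", "Female", "Male", "Male", "Male", "Female", "Male"],
    "Obesity_levels": ["Obesity_Type_III", "Insufficient_Weight", "Normal_Weight",
                       "Obesity_Type_II", "Normal_Weight", "Obesity_Type_II",
                       "Obesity_Type_III", "Insufficient_Weight"],
})

# Rows are obesity levels, columns are genders, cells are head counts.
table = pd.crosstab(people["Obesity_levels"], people["Gender"])
print(table)

# Levels that appear for only one gender in this sample.
one_gender_only = table[(table == 0).any(axis=1)].index.tolist()
print(one_gender_only)
```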

#Multivariate Analysis

The multivariate analysis consisted of 289 scatter plots (a 17 × 17 grid, one panel per pair of variables) run across the entire dataset. Few discernible patterns emerged, the most relevant being the positive correlation between height and weight.

Being particularly drawn towards gender demographics, my primary interest lay in understanding the thresholds between the different obesity levels for women. To perform this portion of the analysis, I excluded the age outliers, restricting the sample to females under the age of 50.

I identified that obesity type III usually manifested at approximately 100 kg (220.46 lb) for women. Additionally, I noticed that most females reaching a weight of 80 kg, regardless of height, were generally categorized at an overweight level, closely approaching the first level of obesity. I observed small clusters of women between 155 and 165 cm in height who weighed about 80 kg; other weight and height combinations were less prevalent throughout the data set. It was also striking to observe the contrast in weight between the two tallest women, who were both about 185 cm (6 ft) tall yet exhibited a weight difference of nearly 120 kg (265 lb).
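The filter and the per-level weight thresholds can be sketched as follows. The six rows are invented, the label spellings are assumed, and taking the lowest observed weight per level is just one simple way to approximate a threshold, not necessarily how I read it off the plots.

```python
import pandas as pd

KG_TO_LB = 2.20462  # pounds per kilogram

# Made-up rows standing in for the real data.
data = pd.DataFrame({
    "Gender": ["Female", "Female", "Female", "Female", "Male", "Female"],
    "Age": [22, 35, 26, 55, 30, 41],
    "Weight": [80.0, 102.0, 110.0, 95.0, 120.0, 78.0],
    "Obesity_levels": ["Overweight_Level_II", "Obesity_Type_III", "Obesity_Type_III",
                       "Obesity_Type_I", "Obesity_Type_II", "Overweight_Level_II"],
})

# Keep only women younger than 50, as in the analysis above.
women = data[(data["Gender"] == "Female") & (data["Age"] < 50)]

# Lowest observed weight per obesity level, in kg and in lb.
thresholds = women.groupby("Obesity_levels")["Weight"].min().to_frame("min_kg")
thresholds["min_lb"] = (thresholds["min_kg"] * KG_TO_LB).round(2)
print(thresholds)
```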

Stage 3

Estimation Models

Building on the initial exploration of the data and the impact of all 16 variables on obesity levels, I attempted to create a more robust predictive model that could capture how the different variable types combine to determine the obesity level. In this last stage I applied three classification algorithms: Logistic Regression, Decision Tree and Random Forest.

Logistic Regression: The first is a regression model that enabled me to predict results for multiple levels of obesity, an approach known as multinomial logistic regression (UCLA). The confusion matrix indicated that the model produced 70 true positives for obesity level I and 63 true positives for obesity level III, but it did not have the same predictive capability for normal weight (40) and underweight (56). The model was also less accurate for the overweight levels than for the obesity levels, possibly because the weight categories are very closely clustered together. Overall, this model achieved an accuracy of 87% on the dataset.

Decision Tree: I used this as a second model because it can work with both numerical and categorical data, which closely resembles the structure of this dataset. The precision of this model for obesity type III was 100%, meaning that every observation it labeled as type III truly belonged to that class. Other obesity levels also showed high precision, with 95% for both obesity level II and level I. However, the model's overall accuracy of 83% indicated it was a less well-fitted model than the logistic regression, which reached 87%.
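To make the difference between per-class precision and overall accuracy concrete, here is a small worked example on a confusion matrix. Every count in it is invented and the three class names are shortened placeholders; it only shows the arithmetic, not my actual results.

```python
# Rows are true classes, columns are predicted classes (made-up counts).
labels = ["Normal", "Obesity_I", "Obesity_III"]
confusion = [
    [45, 5, 0],   # true Normal
    [10, 40, 0],  # true Obesity_I
    [0, 0, 50],   # true Obesity_III
]

# Accuracy = correct predictions (the diagonal) / all predictions.
total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(labels)))
accuracy = correct / total

# Precision for a class = correct predictions of it / all predictions of it.
precision = {}
for j, name in enumerate(labels):
    predicted_as_j = sum(row[j] for row in confusion)
    precision[name] = confusion[j][j] / predicted_as_j

print(f"accuracy = {accuracy:.2f}")
for name, value in precision.items():
    print(f"precision[{name}] = {value:.2f}")
```

In this toy matrix one class reaches a precision of 1.0 while the overall accuracy is 0.90, which is the same pattern as a model that is perfect on one obesity type but weaker overall.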

Random Forest: This model was less explainable but more accurate. Like the decision tree, the Random Forest estimator achieved 100% prediction accuracy for obesity type III. This can be attributed to the nature of the model, which leverages multiple decision trees and aggregates their results to improve prediction accuracy and reduce overfitting (IBM, 2024). Unlike the previous two models, it also had a higher success rate for the two other obesity types, with 99% and 97%, which is reflected in the overall accuracy of 93%.
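The comparison of the three classifiers can be sketched with scikit-learn as shown below. This is an illustration under assumptions, not the notebook's code: the data comes from `make_classification` instead of the obesity dataset, and the split size, `max_iter`, `n_estimators` and random seeds are arbitrary choices of mine, so the printed accuracies will not match the figures reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded obesity data: 16 features, 3 classes.
X, y = make_classification(
    n_samples=600, n_features=16, n_informative=8,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0,
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

# Fit each model on the same split and score it on the held-out rows.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {scores[name]:.2f}")
```

Scoring all three models on the same held-out rows keeps the accuracies directly comparable.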

Limitations

1. The dataset lacks precise definitions for certain variables. For example, the measure of water intake could refer to the frequency of consumption within a specified timeframe or to the amount consumed per day. The same applies to other variables, such as vegetable consumption in meals and physical activity. The lack of definition in these measures can lead to inconsistencies in results and affect the reliability of conclusions.

2. Overfitting of the models to the data, which can also skew conclusions and reduce the applicability of the models to real-world scenarios.
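A common way to check for overfitting is to compare training accuracy with held-out accuracy and to cross-validate. I did not necessarily run this check in the project; the sketch below shows the idea on synthetic data from `make_classification`, with arbitrary seeds and split sizes.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the real features and labels would go here.
X, y = make_classification(
    n_samples=500, n_features=16, n_informative=6,
    n_classes=3, random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1,
)

# An unpruned tree can memorise its training rows.
tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)
print(f"train = {train_acc:.2f}, test = {test_acc:.2f}, gap = {train_acc - test_acc:.2f}")

# 5-fold cross-validation gives a steadier estimate than one split.
cv_scores = cross_val_score(DecisionTreeClassifier(random_state=1), X, y, cv=5)
print(f"cv mean = {cv_scores.mean():.2f}, cv std = {cv_scores.std():.2f}")
```

A large gap between training and test accuracy is the usual warning sign that a model will not carry over well to new data.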


References

IBM. (n.d.). What is random forest? Retrieved June 26, 2024, from What Is Random Forest? | IBM

OpenAI. (2024, June 27). Rewrite statements about Random Forest model performance [Chat with ChatGPT]. Retrieved from ChatGPT

UCI Machine Learning Repository. (2019, August 26). Estimation of obesity levels based on eating habits and physical condition [Data set]. Retrieved June 13, 2024, from Estimation of Obesity Levels Based On Eating Habits and Physical Condition - UCI Machine Learning Repository

UCLA. (n.d.). Multinomial logistic regression | Stata data analysis examples. Retrieved June 20, 2024, from Multinomial Logistic Regression | Stata Data Analysis Examples (ucla.edu)


