The Science of Cholesterol Prediction: An MLR Analysis

Cardiovascular diseases are the leading cause of death worldwide. They lead to poor quality of life, disability and death. They can often be prevented by controlling risk factors such as high blood pressure (BP) and high cholesterol levels.

Cholesterol is a type of lipid found in all higher animals. It is distributed throughout body tissues, especially the brain and spinal cord, and it helps the body perform many important functions. However, too much cholesterol in the blood is harmful: it can enter the artery walls and damage their integrity, leading to the formation of hardened deposits (atherosclerotic plaque). High cholesterol is a silent killer, since it can take years for an individual to notice it, and by then the plaque can cause serious problems such as coronary artery disease (CAD) and strokes. To keep cholesterol levels in check, preventive measures must be taken before such problems take root. This is where statistics and machine learning (ML) modeling come into play. In this article, I explain how multiple linear regression (MLR) can be used to predict total cholesterol levels, using a sample dataset for chronic heart disease.

Machine learning has a growing impact on healthcare, where effective analysis of chronic disease data supports accurate diagnosis and proper treatment. This kind of prediction plays a major role in estimating a patient's risk of disease. Predicting chronic diseases early is one of the most effective ways to reduce the mortality they cause, because it allows preventive action to be taken in time.

What is multiple linear regression?

Multiple linear regression is a technique used to understand the association between several independent variables and one continuous dependent (or outcome) variable. For example, when we cook, the flavor of a dish depends on several factors, such as the type of ingredients used, their quantity and the method of preparation. Here, the ingredients are the independent variables that affect the outcome of the dependent variable to be predicted, i.e. the dish. Similarly, total cholesterol levels should be predicted by taking several factors into account rather than a single variable. For my dataset, I took the various independent variables and ran two iterations of regression, which can be used to predict total cholesterol values and to report the significance level of the results.

Y = a + m1X1 + m2X2

(The equation above is the multiple linear regression equation for two independent variables, X1 and X2. Here a is the intercept, and the coefficients m1 and m2 act as weights that determine how much each independent variable contributes to the predicted cholesterol level, Y.)
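To make the equation concrete, here is a minimal sketch that fits Y = a + m1X1 + m2X2 by ordinary least squares using NumPy. The data is synthetic and generated only for illustration; the predictor ranges, the "true" coefficients and the noise level are all assumptions, not values from the article's dataset.

```python
import numpy as np

# Synthetic data for illustration only (NOT the article's dataset).
rng = np.random.default_rng(0)
n = 200
X1 = rng.uniform(30, 70, n)   # hypothetical predictor, e.g. age in years
X2 = rng.uniform(18, 35, n)   # hypothetical predictor, e.g. BMI

# Assumed "true" relationship used to generate Y, plus random noise.
a_true, m1_true, m2_true = 120.0, 1.5, 2.0
Y = a_true + m1_true * X1 + m2_true * X2 + rng.normal(0, 5, n)

# Design matrix: a leading column of ones lets least squares estimate
# the intercept a alongside the slopes m1 and m2.
A = np.column_stack([np.ones(n), X1, X2])
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
a, m1, m2 = coef


def predict(x1, x2):
    """Apply the fitted equation Y = a + m1*X1 + m2*X2."""
    return a + m1 * x1 + m2 * x2


print(f"a = {a:.2f}, m1 = {m1:.2f}, m2 = {m2:.2f}")
print(f"Prediction for X1=50, X2=25: {predict(50, 25):.1f}")
```

The fitted coefficients should land close to the values used to generate the data, and `predict(0, 0)` returns the intercept, which matches the usual interpretation of a as the predicted value when every independent variable is zero.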

The Prediction Model

I used two platforms to perform MLR, to make the approach to predictive modeling more thorough:

  • Using Excel
  • Using Python

Below is a short introduction to the dataset used in this analysis.

  1. Gender - Sex of the patient
  2. Age - Age of the patient
  3. Education - Education level of the patient (1 - Primary, 2 - Lower Secondary, 3 - Upper Secondary, 4 - Post-Secondary Non-Tertiary Education)
  4. CurrentSmoker - Whether the patient currently smokes
  5. Cigsperday - If the patient smokes, the average number of cigarettes smoked per day
  6. Bpmeds - Whether the patient is on BP medication
  7. Prevalentstroke - Whether the patient has a history of stroke
  8. Prevalenthyp - Whether the patient has hypertension
  9. Diabetes - Whether the patient has diabetes
  10. totChol - Total cholesterol level of the patient
  11. sysBP - Systolic BP level of the patient
  12. diaBP - Diastolic BP level of the patient
  13. BMI - Body Mass Index of the patient
  14. Heartrate - Number of times the patient's heart beats per minute
  15. Glucose - Blood glucose (sugar) level of the patient
  16. TenYearCHD - Whether the patient has had chronic heart disease over the past 10 years

Data Dictionary
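As a minimal sketch of the cleaning and preparation a table like this usually needs before regression, the snippet below drops rows with missing values and encodes a categorical column as numbers. The sample rows, the string labels "Male" and "Female", and the 0/1 coding are assumptions made for illustration; the article does not state how the actual dataset stores or encodes these fields.

```python
# Hypothetical sample rows using a few column names from the data dictionary.
# None marks a missing value.
rows = [
    {"Gender": "Male", "Age": 45, "BMI": 27.1, "totChol": 230.0},
    {"Gender": "Female", "Age": 52, "BMI": None, "totChol": 250.0},
    {"Gender": "Female", "Age": 39, "BMI": 22.4, "totChol": 195.0},
]

# Data cleaning: keep only rows where every field is present.
clean = [r for r in rows if all(v is not None for v in r.values())]

# Data preparation: map the categorical Gender column to numeric codes
# so it can enter the regression equation (assumed coding, see above).
gender_codes = {"Female": 0, "Male": 1}
encoded = [{**r, "Gender": gender_codes[r["Gender"]]} for r in clean]

for r in encoded:
    print(r)
```

Dropping incomplete rows is the simplest option; imputing missing values is a common alternative when too many rows would otherwise be lost.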

The process involves the following steps:

  1. Data cleaning (removing missing values, inconsistencies etc.)
  2. Data preparation (handling the categorical values)
  3. Feature selection on the basis of the significance level of each independent variable
  4. Model evaluation on the basis of several metrics

Metrics like R-squared and Mean Squared Error (MSE) tell us how well the model predicts values on unseen data. A high R-squared indicates a good fit between the predicted and actual cholesterol values. A low MSE means the model's predictions are close to the actual values. Once a well-performing model is obtained, one can interpret the β coefficients and understand how factors like age, gender, or smoking habits influence cholesterol levels, based on their sign (positive or negative) and their magnitude.
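The two evaluation metrics named above can be written out directly from their standard definitions: MSE is the mean of the squared prediction errors, and R-squared is 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses a handful of made-up actual and predicted values purely for illustration; they are not results from this analysis.

```python
import numpy as np


def mean_squared_error(y_true, y_pred):
    """Mean of squared differences between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


def r_squared(y_true, y_pred):
    """Proportion of variance in y_true explained by the predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return float(1.0 - ss_res / ss_tot)


# Made-up values for illustration only.
y_actual = [200.0, 220.0, 240.0, 260.0]
y_predicted = [210.0, 215.0, 245.0, 250.0]

mse = mean_squared_error(y_actual, y_predicted)
r2 = r_squared(y_actual, y_predicted)
print(f"MSE = {mse}, R-squared = {r2}")  # MSE = 62.5, R-squared = 0.875
```

To judge performance on unseen data, these functions would be applied to predictions made on rows held out from model fitting rather than on the training rows.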

Outputs

Excel MLR Output
Python MLR output

The Excel output gives us an analysis of the multiple linear regression model generated. It provides insight into the coefficient and significance of each independent variable with respect to the dependent variable (Y). The intercept value of 115.7674 represents the predicted value of the dependent variable (here, total cholesterol) when all the independent variables are equal to zero. R-squared (R2) is a statistical measure of the proportion of variance in the dependent variable that can be explained by the independent variables included in the model. The R2 value for this model is 0.096, which means the given independent variables account for only 9.6% of the total variance and are therefore weak predictors of total cholesterol. Examining the individual coefficients (β), we found that Age, Diastolic BP, BMI and Heart Rate had a positive and statistically significant effect (β > 0, p-value < 0.05), suggesting that higher values of these variables are associated with an increase in total cholesterol.

Both the Python and Excel predictions reflected the overall pattern of the data, albeit with varying degrees of precision. However, in some cases the differences between actual and predicted values are large, indicating that the models are not fitting the data optimally.

Significance of the study

This study provided insight into how ML modeling tools can be used to identify the significant relationships that are useful for predicting cholesterol levels. ML models can also be used to identify subgroups of individuals who are unlikely to develop very high-risk levels of cholesterol, and to formulate screening schedules for individuals. Such strategies can also be used to develop personalized care plans for managing cholesterol levels. However, there should be a heavy emphasis on the quality of the data provided to the model for training: MLR cannot provide perfect predictions, but it can provide insights that help in achieving optimal heart health.


