
Investigating the Correlation Between %Diabetics, %Inactivity, and %Obesity: A Comprehensive Regression Analysis

shalabh singh yadav

Data Analyst | M.S. in Data Science Candidate | Google-Certified | Driving 30% Revenue Growth with Automated ETL Workflows

Published: September 26, 2023

We first made a separate file, final.csv, that combines the values from the different files: “diabetic”, “Obesity”, and “Inactivity”. This file contains the ‘county_state’ column, which we obtained by merging the counties based on state.

We then used an inner join to obtain the related data from all the files.
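As a rough illustration of what the inner join does, here is a minimal sketch in plain Python. The rows, the key format, and the `inner_join` helper are all made up for the example and are not taken from the real files; in pandas the same step is usually a merge with `how="inner"` on the shared key column.

```python
# Hypothetical rows: each table maps a county_state key to one percentage.
# The keys and numbers are invented for illustration only.
diabetes = {"Autauga_Alabama": 10.2, "Baldwin_Alabama": 8.7, "Kent_Delaware": 9.9}
obesity = {"Autauga_Alabama": 33.1, "Baldwin_Alabama": 29.4}
inactivity = {"Autauga_Alabama": 27.5, "Baldwin_Alabama": 24.0, "Kent_Delaware": 25.3}


def inner_join(*tables):
    """Keep only the keys that appear in every table (hypothetical helper)."""
    shared = set(tables[0])
    for table in tables[1:]:
        shared &= set(table)
    # One tuple per surviving key, with values in the same order as the tables.
    return {key: tuple(table[key] for table in tables) for key in sorted(shared)}


final = inner_join(diabetes, obesity, inactivity)
for county_state, (pct_diabetic, pct_obese, pct_inactive) in final.items():
    print(county_state, pct_diabetic, pct_obese, pct_inactive)
```

A county that is missing from any one table (here, the one with no obesity value) drops out of the result, which is why an inner join can leave fewer rows than any single input file.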

After collecting the data, we built a linear model with two independent variables (obesity and inactivity) and one dependent variable (diabetics).

For this, we split the data into two sets.

Training set
Testing set

Here I decided to put 60% of the data into the training set and the remaining 40% into the testing set.
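The split and the model fit can be sketched as follows. This is only an illustration under stated assumptions: the data is synthetic (generated here, not read from final.csv), the coefficients used to generate it are arbitrary, and `fit_linear` is a hypothetical helper that does ordinary least squares with NumPy rather than whatever library was used for the original analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for final.csv (made-up numbers, not the article's data).
n = 200
obesity = rng.uniform(20, 40, n)
inactivity = rng.uniform(15, 35, n)
diabetics = 2.0 + 0.1 * obesity + 0.2 * inactivity + rng.normal(0, 1.0, n)

X = np.column_stack([obesity, inactivity])  # two independent variables
y = diabetics                               # one dependent variable

# Shuffle the row indices, then cut them 60% / 40%.
indices = rng.permutation(n)
n_train = round(0.6 * n)
train_idx, test_idx = indices[:n_train], indices[n_train:]


def fit_linear(X, y):
    """Ordinary least squares; returns [intercept, slope_obesity, slope_inactivity]."""
    design = np.column_stack([np.ones(len(X)), X])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs


# Fit on the training rows only; the testing rows stay unseen.
coefs = fit_linear(X[train_idx], y[train_idx])
print("training rows:", len(train_idx), "testing rows:", len(test_idx))
print("intercept and slopes:", coefs)
```

Shuffling before cutting matters when the file is sorted (for example by state), because otherwise the testing set would come from different states than the training set.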

After this, we evaluated the performance of our model by looking at its accuracy, but the model was only 22% accurate.
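The post does not say how the 22% figure was computed. For a regression model a common score is R² (the same metric quoted later for cross-validation), so the sketch below assumes that is what is meant. The `r2_score` function and the four sample values are written for this example only.

```python
import numpy as np


def r2_score(y_true, y_pred):
    """R² = 1 - (sum of squared residuals) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Made-up values: four actual percentages and four model predictions.
actual = [8.0, 9.0, 10.0, 11.0]
predicted = [8.5, 9.0, 9.5, 11.5]
score = r2_score(actual, predicted)
print(score)
```

Under this reading, a score of 0.22 means the model explains about 22% of the variance in the diabetes percentages on the testing set, and a model that always predicts the mean would score 0.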

After that, I plotted some scatter plots: predicted vs. actual values, and then the residual plot.

In the predicted vs. actual plot, we observe that the data points are clustered in the middle, which is not a good sign for a model; ideally they would spread out along the diagonal where the predicted value equals the actual value.


The same can be observed in the residual plot: the residuals should be scattered randomly around zero across the whole range, but they are mostly bunched in the middle.
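The quantities behind the two plots can be sketched like this. The data is synthetic again (not the article's), and for simplicity the model is fitted and evaluated on the same rows, which is an assumption of this example rather than what was done with the train/test split.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in data (made-up numbers).
n = 150
obesity = rng.uniform(20, 40, n)
inactivity = rng.uniform(15, 35, n)
diabetics = 2.0 + 0.1 * obesity + 0.2 * inactivity + rng.normal(0, 1.0, n)

# Least-squares fit with an intercept column.
design = np.column_stack([np.ones(n), obesity, inactivity])
coefs, *_ = np.linalg.lstsq(design, diabetics, rcond=None)

predicted = design @ coefs
residuals = diabetics - predicted

# (predicted, diabetics) pairs make up the predicted-vs-actual plot;
# (predicted, residuals) pairs make up the residual plot.
print("mean residual:", residuals.mean())
print("residual std:", residuals.std())
print("corr(predicted, residuals):", np.corrcoef(predicted, residuals)[0, 1])
```

Passing `predicted` and `residuals` to a scatter-plot function gives the residual plot. For a well-behaved fit the points form a band around zero with no visible pattern.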

After this, I tried applying k-fold cross-validation, which we learned about in class today.

Cross-validation

According to Wikipedia:

Cross-validation is a resampling method that uses different portions of the data to test and train a model on different iterations. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice.

In simpler words, we divide the data into several chunks (folds). In each round we hold out one chunk to test the model and train the model on all the remaining chunks, repeating until every chunk has been used for testing once.

So, after applying k-fold cross-validation to my model with 10 folds, I got an R² score of 33%.
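A minimal sketch of 10-fold cross-validation for a two-variable linear model is shown below. In standard k-fold, each fold is held out once for testing while the model trains on the other folds. Everything here is an assumption made for illustration: the data is synthetic, and `fit_linear`, `predict`, `r2_score`, and `k_fold_r2` are hypothetical helpers written with NumPy, so the scores will not match the 33% reported above.

```python
import numpy as np


def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns the coefficient vector."""
    design = np.column_stack([np.ones(len(X)), X])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs


def predict(coefs, X):
    return coefs[0] + X @ coefs[1:]


def r2_score(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def k_fold_r2(X, y, k=10, seed=0):
    """Return one R² score per fold: train on k-1 folds, test on the held-out fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        coefs = fit_linear(X[train_idx], y[train_idx])
        scores.append(r2_score(y[test_idx], predict(coefs, X[test_idx])))
    return np.array(scores)


# Synthetic stand-in data (made-up numbers, not the article's dataset).
rng = np.random.default_rng(42)
n = 300
X = np.column_stack([rng.uniform(20, 40, n), rng.uniform(15, 35, n)])
y = 2.0 + 0.1 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 1.0, n)

scores = k_fold_r2(X, y, k=10)
print("R² per fold:", np.round(scores, 3))
print("mean R²:", scores.mean())
```

Averaging the per-fold scores gives a single estimate that depends less on one lucky or unlucky split than a single 60/40 evaluation does.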
