Investigating the Correlation Between %Diabetics, %Inactivity, and %Obesity: A Comprehensive Regression Analysis
shalabh singh yadav
Data Analyst | M.S. in Data Science Candidate | Google-Certified | Driving 30% Revenue Growth with Automated ETL Workflows
We first built a single file, final.csv, that combines the values from the separate “diabetic”, “Obesity”, and “Inactivity” files. It contains a ‘county_state’ column, which we created by combining each county with its state.
Using an inner join on that column, we then pulled the related data from all three files.
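The join step can be sketched in pandas as follows. This is a minimal illustration, not the exact code used: the three small tables, their values, and the column names `pct_diabetic`, `pct_obesity`, and `pct_inactivity` are made-up placeholders; only the `county_state` key comes from the description above.

```python
import pandas as pd

# Made-up placeholder tables standing in for the three source files.
# Only the 'county_state' key is taken from the article; everything else is assumed.
diabetic = pd.DataFrame({
    "county_state": ["CountyA_State1", "CountyB_State1", "CountyC_State2"],
    "pct_diabetic": [9.1, 8.4, 10.2],
})
obesity = pd.DataFrame({
    "county_state": ["CountyA_State1", "CountyC_State2", "CountyD_State3"],
    "pct_obesity": [31.0, 33.5, 27.8],
})
inactivity = pd.DataFrame({
    "county_state": ["CountyA_State1", "CountyB_State1", "CountyC_State2"],
    "pct_inactivity": [24.6, 21.3, 26.0],
})

# An inner join keeps only the counties that appear in every table.
final = (
    diabetic
    .merge(obesity, on="county_state", how="inner")
    .merge(inactivity, on="county_state", how="inner")
)

print(final)
# final.to_csv("final.csv", index=False)  # uncomment to write the combined file
```

Because the join is an inner join, any county missing from even one file is dropped, so the combined table can be noticeably smaller than the largest input.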
After collecting the data, we fit a linear model with two independent variables (obesity and inactivity) and one dependent variable (diabetics).
For this, we split the data into two sets.
I put 60% of the data into the training set and the remaining 40% into the test set.
We then evaluated the model by looking at its accuracy, but my model was only 22% accurate.
After that, I plotted two scatter plots: predicted vs. actual values, and the residual plot.
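A rough sketch of the split-and-fit step with scikit-learn is shown here. It assumes the "accuracy" above is the R² score, which is what `LinearRegression.score` reports; the data is synthetic and the column names are hypothetical, so the printed number will not match the 22% from the real data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for final.csv (assumed column names, made-up values).
rng = np.random.default_rng(0)
n = 300
pct_obesity = rng.normal(30, 4, n)
pct_inactivity = rng.normal(25, 5, n)
pct_diabetic = 2.0 + 0.10 * pct_obesity + 0.15 * pct_inactivity + rng.normal(0, 1.5, n)
final = pd.DataFrame({
    "pct_obesity": pct_obesity,
    "pct_inactivity": pct_inactivity,
    "pct_diabetic": pct_diabetic,
})

X = final[["pct_obesity", "pct_inactivity"]]  # two independent variables
y = final["pct_diabetic"]                     # dependent variable

# 60% of the rows go to training, the remaining 40% to testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
r2_test = model.score(X_test, y_test)  # R^2 on the held-out 40%
print(f"Test R^2: {r2_test:.2f}")
```

Fixing `random_state` makes the split reproducible; with a different seed the test score can shift, which is one reason a single split gives a noisy estimate.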
In this plot, the data points are clustered in the middle, which is not a good sign for a model; they should be spread out along the diagonal, where the predicted value equals the actual value.
The same can be observed in the residual plot: the residuals should be scattered evenly across the plot, but they are mostly concentrated in the middle.
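Both diagnostic plots can be produced with a small helper like the one below. This is only a sketch: `plot_diagnostics` is a hypothetical helper name, and the `y_true` and `y_pred` arrays are synthetic values chosen to mimic predictions that cluster near the mean, not the article's data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to view plots in a window
import matplotlib.pyplot as plt
import numpy as np


def plot_diagnostics(y_true, y_pred):
    """Draw predicted-vs-actual and residual plots; return the figure and residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

    # Predicted vs. actual: a good model puts points near the diagonal.
    ax1.scatter(y_true, y_pred, s=10, alpha=0.6)
    lo, hi = y_true.min(), y_true.max()
    ax1.plot([lo, hi], [lo, hi], color="red", linewidth=1)
    ax1.set_xlabel("Actual")
    ax1.set_ylabel("Predicted")
    ax1.set_title("Predicted vs. actual")

    # Residuals vs. predicted: ideally an even band around zero.
    ax2.scatter(y_pred, residuals, s=10, alpha=0.6)
    ax2.axhline(0, color="red", linewidth=1)
    ax2.set_xlabel("Predicted")
    ax2.set_ylabel("Residual (actual - predicted)")
    ax2.set_title("Residual plot")

    fig.tight_layout()
    return fig, residuals


# Synthetic example: predictions that are pulled toward the mean of the target.
rng = np.random.default_rng(1)
y_true = rng.normal(9, 2, 200)
y_pred = 9 + 0.3 * (y_true - 9) + rng.normal(0, 0.5, 200)

fig, residuals = plot_diagnostics(y_true, y_pred)
```

Call `fig.savefig("diagnostics.png")` to write the figure to disk, or drop the `matplotlib.use("Agg")` line and call `plt.show()` to view it interactively.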
After this, I applied k-fold cross-validation, which we learned today in class.
Cross-validation
According to Wikipedia
Cross-validation is a resampling method that uses different portions of the data to test and train a model on different iterations. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice.
In simpler words, we divide the data into several chunks (folds), then train the model on all but one chunk and test it on the chunk that was held out, repeating until every chunk has served as the test set once.
So, after applying k-fold cross-validation to my model with 10 folds, I got an R² score of 33%.
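A 10-fold run of this kind can be sketched with scikit-learn's `KFold` and `cross_val_score`. The data here is synthetic and the shuffling seed is an assumption, so the resulting score is illustrative only and will differ from the 33% obtained on the real data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the obesity/inactivity features and the diabetes target.
rng = np.random.default_rng(0)
n = 300
X = np.column_stack([rng.normal(30, 4, n), rng.normal(25, 5, n)])
y = 2.0 + 0.10 * X[:, 0] + 0.15 * X[:, 1] + rng.normal(0, 1.5, n)

# 10 folds: each iteration trains on 9 folds and tests on the remaining one.
kfold = KFold(n_splits=10, shuffle=True, random_state=42)
fold_sizes = [(len(train_idx), len(test_idx)) for train_idx, test_idx in kfold.split(X)]

# One R^2 score per fold; the mean is the cross-validated estimate.
scores = cross_val_score(LinearRegression(), X, y, cv=kfold, scoring="r2")
mean_r2 = scores.mean()

print("Train/test sizes per fold:", fold_sizes[0])
print(f"Mean R^2 over 10 folds: {mean_r2:.2f} (std {scores.std():.2f})")
```

Reporting the spread across folds alongside the mean helps show how much the score depends on which rows land in the test fold.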