Checking for the Assumptions of Linear Regression using the mtcars dataset


Linear regression is a foundational tool in data science. Like all statistical methods, it rests on certain assumptions, and when those assumptions aren't met, the model's conclusions can be misleading. In this article, we'll walk through these assumptions and show how to verify them in Python (the code below uses seaborn's built-in mpg dataset, a close analogue of the classic mtcars data).


1. Linearity: The relationship between the independent variables and the dependent variable should be linear.

  • Check Using mtcars: Plot the residuals against the predicted values. If the plot shows a systematic pattern (for example, a curve), consider transforming the data or using a non-linear model.

import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

# seaborn ships the 'mpg' dataset, a close cousin of mtcars;
# keep only numeric columns and drop rows with missing values
data = sns.load_dataset('mpg').select_dtypes('number').dropna()
X = sm.add_constant(data.drop('mpg', axis=1))
y = data['mpg']
model = sm.OLS(y, X).fit()
predictions = model.predict(X)
residuals = model.resid

plt.scatter(predictions, residuals)
plt.axhline(0, color='red')
plt.xlabel("Predicted Values")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Values")
plt.show()


2. Independence: Residuals should be independent of each other.

  • Check Using mtcars: Apply the Durbin-Watson test to the residuals. The statistic ranges from 0 to 4; values near 2 suggest no autocorrelation, values well below 2 point to positive autocorrelation, and values well above 2 to negative autocorrelation.

from statsmodels.stats.stattools import durbin_watson

dw = durbin_watson(residuals)
print(f"Durbin-Watson: {dw:.2f}")

3. Homoscedasticity (Constant Variance): The variance of errors should be consistent.

  • Check Using mtcars: Inspect the same residuals vs. predicted values plot. The vertical spread of the residuals should stay roughly constant across the range of predictions; a funnel or fan shape signals heteroscedasticity.

plt.scatter(predictions, residuals)
plt.axhline(0, color='red')
plt.xlabel("Predicted Values")
plt.ylabel("Residuals")
plt.title("Checking for Homoscedasticity")
plt.show()
        


4. Normality of Errors: The residuals should be normally distributed.

  • Check Using mtcars: Use a Q-Q plot. Points falling along the diagonal reference line indicate approximately normal residuals.

sm.qqplot(residuals, line='s')
plt.title("Normality Q-Q plot")
plt.show()

5. No Multicollinearity: Independent variables shouldn't be highly correlated.

  • Check Using mtcars: Calculate the Variance Inflation Factor (VIF) for each predictor (excluding the intercept, whose VIF isn't meaningful). A VIF above 10 indicates high multicollinearity.

from statsmodels.stats.outliers_influence import variance_inflation_factor

# skip the intercept column added by add_constant: its VIF is not meaningful
for i, col in enumerate(X.columns):
    if col == 'const':
        continue
    print(f"{col}: VIF = {variance_inflation_factor(X.values, i):.2f}")

In Conclusion: Meeting the assumptions of linear regression is crucial for robust, valid results. Always test these assumptions before drawing insights from your models. Understanding the nuances of your data is as important as mastering the tools and techniques you apply to it.

Happy modeling!
