Checking the Assumptions of Linear Regression Using the mtcars Dataset
Linear regression is a foundational tool in data science. Like all statistical methods, it rests on certain assumptions. When these assumptions aren't met, the model's coefficient estimates, standard errors, and p-values can be misleading. In this article, we'll explore these assumptions and demonstrate how to verify them using the mtcars dataset (the code loads seaborn's 'mpg' dataset, the Auto MPG data, as a stand-in).
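1. Linearity: The relationship between the independent and dependent variables should be linear. In a plot of residuals against predicted values, the points should scatter randomly around zero with no visible curve.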
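import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

# seaborn's 'mpg' is the Auto MPG dataset, used here in place of R's mtcars
# Drop rows with missing values (a few horsepower entries are NaN)
data = sns.load_dataset('mpg').dropna()
# Keep numeric predictors only; OLS cannot use the text columns 'origin' and 'name'
X = data.drop('mpg', axis=1).select_dtypes(include='number')
y = data['mpg']
X = sm.add_constant(X)  # add the intercept column
model = sm.OLS(y, X).fit()
predictions = model.predict(X)
residuals = model.resid
# Residuals vs. predicted values: look for curvature or other patterns
plt.scatter(predictions, residuals)
plt.axhline(0, color='red')
plt.xlabel("Predicted Values")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Values")
plt.show()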
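To make the linearity check concrete, here is a minimal numeric sketch on synthetic data (not the car data above). It assumes a made-up quadratic relationship, fits a straight line and then a line plus a squared term, and measures how much curvature is left in the straight-line residuals. The helper `fit_residuals` and all variable names are hypothetical, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
x = rng.uniform(-3, 3, size=300)
# Synthetic target that is truly quadratic in x (an assumption of this sketch)
y = 1.0 + 2.0 * x + 0.5 * x**2 + rng.normal(0, 1, size=300)

def fit_residuals(design, target):
    """Fit ordinary least squares with numpy and return the residuals."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return target - design @ coef

ones = np.ones_like(x)
resid_linear = fit_residuals(np.column_stack([ones, x]), y)
resid_quad = fit_residuals(np.column_stack([ones, x, x**2]), y)

# Correlation between the straight-line residuals and the omitted x^2 term:
# a value far from zero means the residuals still carry a curve
curvature = np.corrcoef(resid_linear, x**2)[0, 1]
rss_linear = float(resid_linear @ resid_linear)
rss_quad = float(resid_quad @ resid_quad)

print(f"Residual/x^2 correlation (linear fit): {curvature:.2f}")
print(f"Residual sum of squares: linear={rss_linear:.1f}, quadratic={rss_quad:.1f}")
```

When the straight-line residuals track an omitted term this strongly, adding a transformed predictor (or transforming the response) is the usual next step.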
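2. Independence: Residuals should be independent of each other. The Durbin-Watson statistic ranges from 0 to 4, and values near 2 indicate no first-order autocorrelation.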
from statsmodels.stats.stattools import durbin_watson
dw = durbin_watson(residuals)
print(f"Durbin-Watson: {dw}")
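For intuition about what the statistic measures, the sketch below computes it by hand as the sum of squared successive differences of the residuals divided by the sum of squared residuals. It then compares independent noise with a synthetic autocorrelated series. The function name `durbin_watson_stat` and the AR(1) coefficient of 0.8 are assumptions made for this illustration.

```python
import numpy as np

def durbin_watson_stat(resid):
    """Sum of squared successive differences over the sum of squares."""
    resid = np.asarray(resid, dtype=float)
    return float(np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2))

rng = np.random.default_rng(1)  # fixed seed for reproducibility
n = 500

# Independent noise: the statistic should land near 2
white = rng.normal(size=n)

# Positively autocorrelated series: each value carries over 80% of the last
shocks = rng.normal(size=n)
ar1 = np.zeros(n)
for t in range(1, n):
    ar1[t] = 0.8 * ar1[t - 1] + shocks[t]

dw_white = durbin_watson_stat(white)
dw_ar1 = durbin_watson_stat(ar1)
print(f"Independent residuals:    DW = {dw_white:.2f}")
print(f"Autocorrelated residuals: DW = {dw_ar1:.2f}")
```

Values well below 2 point to positive autocorrelation and values well above 2 to negative autocorrelation. The check is most meaningful when the rows have a natural order, such as time.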
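3. Homoscedasticity (Constant Variance): The variance of the errors should be constant across the range of predicted values. A funnel or fan shape in the residual plot suggests heteroscedasticity.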
plt.scatter(predictions, residuals)
plt.axhline(0, color='red')
plt.xlabel("Predicted Values")
plt.ylabel("Residuals")
plt.title("Checking for Homoscedasticity")
plt.show()
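A plot can be complemented with a number. The sketch below hand-computes the studentized (Koenker) form of the Breusch-Pagan statistic, n times the R-squared from regressing the squared residuals on the predictor, for synthetic data with constant and with growing error variance. This is a swapped-in technique rather than something the article itself uses, and the function `breusch_pagan_lm` and the data are hypothetical.

```python
import numpy as np

def breusch_pagan_lm(x, y):
    """n * R^2 from regressing squared OLS residuals on the predictor."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    sq_resid = (y - design @ coef) ** 2
    # Auxiliary regression: can the predictor explain the squared residuals?
    aux_coef, *_ = np.linalg.lstsq(design, sq_resid, rcond=None)
    fitted = design @ aux_coef
    ss_res = np.sum((sq_resid - fitted) ** 2)
    ss_tot = np.sum((sq_resid - sq_resid.mean()) ** 2)
    return float(len(y) * (1.0 - ss_res / ss_tot))

rng = np.random.default_rng(2)  # fixed seed for reproducibility
x = rng.uniform(1, 10, size=500)
# Constant error spread vs. spread that grows with x (a funnel shape)
y_const = 3.0 + 2.0 * x + rng.normal(0, 2.0, size=500)
y_funnel = 3.0 + 2.0 * x + rng.normal(0, 1.0, size=500) * x

lm_const = breusch_pagan_lm(x, y_const)
lm_funnel = breusch_pagan_lm(x, y_funnel)
print(f"LM statistic, constant variance: {lm_const:.2f}")
print(f"LM statistic, funnel-shaped:     {lm_funnel:.2f}")
```

With a single predictor the statistic is compared against a chi-square distribution with one degree of freedom, whose 5% critical value is about 3.84.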
4. Normality of Errors: The residuals should be normally distributed. On a Q-Q plot, the points should fall close to the reference line.
sm.qqplot(residuals, line='s')
plt.title("Normality Q-Q plot")
plt.show()
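As a numeric companion to the Q-Q plot, the sketch below hand-computes the Jarque-Bera statistic, JB = n/6 * (S^2 + (K - 3)^2 / 4), where S is the sample skewness and K the sample kurtosis. It is a swapped-in check, not one the article uses, and it runs on synthetic samples. The function name `jarque_bera_stat` is hypothetical.

```python
import numpy as np

def jarque_bera_stat(resid):
    """Jarque-Bera statistic from sample skewness and kurtosis."""
    resid = np.asarray(resid, dtype=float)
    n = resid.size
    centered = resid - resid.mean()
    variance = np.mean(centered ** 2)
    skewness = np.mean(centered ** 3) / variance ** 1.5
    kurtosis = np.mean(centered ** 4) / variance ** 2
    return float(n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0))

rng = np.random.default_rng(3)  # fixed seed for reproducibility
normal_resid = rng.normal(size=1000)
# Right-skewed stand-in for non-normal residuals, shifted to mean zero
skewed_resid = rng.exponential(scale=1.0, size=1000) - 1.0

jb_normal = jarque_bera_stat(normal_resid)
jb_skewed = jarque_bera_stat(skewed_resid)
print(f"JB, normal sample: {jb_normal:.2f}")
print(f"JB, skewed sample: {jb_skewed:.2f}")
```

Under normality the statistic approximately follows a chi-square distribution with two degrees of freedom, so values far above roughly 5.99 cast doubt on the assumption at the 5% level.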
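5. No Multicollinearity: Independent variables shouldn't be highly correlated with one another. A common rule of thumb treats a variance inflation factor (VIF) above 5 or 10 as a warning sign.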
from statsmodels.stats.outliers_influence import variance_inflation_factor
# One VIF per column of X; the value reported for 'const' can be ignored
vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
for col, val in zip(X.columns, vif):
    print(f"{col}: VIF = {val}")
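To show where the number comes from, the sketch below applies the textbook definition VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the remaining predictors. It uses three synthetic predictors, two of them nearly identical. The function `vif_by_hand` and the data are hypothetical, and this is an illustration of the definition rather than a description of how any library implements it.

```python
import numpy as np

def vif_by_hand(features):
    """Return 1 / (1 - R^2) for each column regressed on the others."""
    features = np.asarray(features, dtype=float)
    n_rows, n_cols = features.shape
    result = []
    for j in range(n_cols):
        target = features[:, j]
        others = np.delete(features, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])  # with intercept
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r_squared = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        result.append(float(1.0 / (1.0 - r_squared)))
    return result

rng = np.random.default_rng(4)  # fixed seed for reproducibility
x1 = rng.normal(size=500)
x2 = x1 + 0.1 * rng.normal(size=500)  # nearly a copy of x1
x3 = rng.normal(size=500)             # unrelated to the other two

vifs = vif_by_hand(np.column_stack([x1, x2, x3]))
for name, value in zip(["x1", "x2", "x3"], vifs):
    print(f"{name}: VIF = {value:.1f}")
```

The two near-duplicate predictors get very large values while the unrelated one stays close to 1. In practice, dropping or combining one of the correlated predictors is a common remedy.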
In Conclusion: Ensuring the assumptions of linear regression are met is crucial for robust and valid results. Always test these assumptions before drawing insights from your models. Remember, understanding the nuances of your data is as crucial as mastering the tools and techniques you apply to it.
Happy modeling!