The Sanity Check: How to Plot Your Linear Model Fit by Groups in Python + R
We continue our multi-part series on linear regression this week by demonstrating the importance of plotting your model fit against the data. This includes showing how to plot separate linear model fits for different groups using the classic Anscombe's quartet dataset. The dataset contains four very different groups that nonetheless produce almost identical summary statistics when fitted with a linear regression model. As shown in the figure above, each group looks very different, and only in group 1 would you say the linear model captures the trend in the data.
We'll start with the Python example this week using the seaborn package.
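A minimal sketch of one way to draw the per-group fits with seaborn's `lmplot` follows. The quartet values are typed in by hand so the snippet runs offline (`sns.load_dataset("anscombe")` fetches the same table online); the variable names and the output filename are assumptions made for illustration.

```python
import pandas as pd
import seaborn as sns

# Anscombe's quartet, entered by hand; sets I-III share the same x values
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Long format: one row per observation, with a column naming its group
df = pd.DataFrame(
    [(name, x, y) for name, (xs, ys) in quartet.items() for x, y in zip(xs, ys)],
    columns=["dataset", "x", "y"],
)

# One panel per group, each with its own least-squares line (ci=None hides the band)
grid = sns.lmplot(data=df, x="x", y="y", col="dataset", col_wrap=2, ci=None, height=3)
grid.savefig("anscombe_lmplot.png")  # hypothetical output filename
```

Faceting with `col` keeps each group in its own panel, which makes it easy to see where the straight line does and does not follow the points.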
The resulting image is shown below:
The Anscombe dataset is also available in R but requires a little reshaping to get it into a long format for plotting. This is the code that was used for the cover image of this article, and it leverages the powerful ggplot2 API.
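The reshaping step itself is not specific to R. As a rough Python sketch (not the R code used for the cover image), assuming a wide table with columns `x1`..`x4` and `y1`..`y4` like R's built-in `anscombe` data frame, `pandas.wide_to_long` gives one row per observation and group. The names `wide`, `long_df`, `obs` and `set` are illustrative.

```python
import pandas as pd

# Wide layout assumed to mirror R's `anscombe` data frame: x1..x4, y1..y4
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
wide = pd.DataFrame({
    "x1": x123, "x2": x123, "x3": x123,
    "x4": [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
    "y1": [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    "y2": [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    "y3": [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    "y4": [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
})

# Split each column name into a stub (x or y) and a numeric suffix (the group),
# keeping an observation id so rows can be matched up across stubs
long_df = (
    pd.wide_to_long(
        wide.rename_axis("obs").reset_index(),
        stubnames=["x", "y"], i="obs", j="set",
    )
    .reset_index()
    .sort_values(["set", "obs"], ignore_index=True)
)
print(long_df.head())
```

The long table has a single `x` and `y` column plus a `set` column, which is the shape that faceted plotting functions generally expect.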
To demonstrate how such different datasets can still have very similar fitted model statistics, you can run the code below in R to fit a simple linear regression model to each group and extract the intercept, slope and R-squared value for each.
The R-squared values, intercepts and slopes for each group are virtually the same for practical purposes. The only way you can really tell which model fit is reasonable, and which groups require more data cleaning, investigation or a different model type, is by plotting the fitted model against the data. So before selecting models, fine-tuning or anything else, plot your fitted model against your data. This is your sanity check.
Each week I publish a new article in Data Science Code in Python + R. The articles focus on how to do one data science job in both languages and can be digested in 5 minutes or less. You can subscribe to the newsletter to get a simple reminder in your LinkedIn notifications. Keep progressing toward building your Python and R data science skills by building on the skills you already have.
~ Matt
Data Analyst/Certified Machine Learning & Artificial Intelligence/Electrical Engineer/Mathematician
2 years ago: great models
Actuarial Analyst
2 years ago: Ernst Wagner
Consultor Fiduciario BID - /Rstudio/Stats/PowerBi
2 years ago: Matt Rosinski Thanks for sharing. In the case of image number 4, it seems that it is classified as a linear regression, something similar to a logistic regression.
Sharing my journey to becoming a Generative AI Data Scientist. Join 1,000+ in my next free workshop.
2 years ago: Love that ggplot tidyquant theme!