The Sanity Check: How to Plot Your Linear Model Fit by Groups in Python + R

We continue our multi-part series on linear regression this week by demonstrating the importance of plotting your model fit against the data. This includes showing how to plot linear model fits for several groups at once, using the classic Anscombe's quartet dataset. The dataset contains four very different groups that nonetheless produce almost identical statistics when fitted with a linear regression model. As shown in the figure above, each group looks very different, and only for group 1 would you say the linear model captures the trend in the data.

We'll start with the Python example this week using the seaborn package.

Plotting a linear model fit using seaborn in Python
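The listing itself is not reproduced in this text, so what follows is only a minimal sketch of how such a plot could be produced, not the author's exact code. It assumes seaborn's `lmplot`, which draws one regression fit per facet. The quartet values are typed in directly so the example runs offline; `sns.load_dataset("anscombe")` returns the same data in the same long format but needs a network connection. The output filename is an arbitrary choice.

```python
import pandas as pd
import seaborn as sns

# Anscombe's quartet; groups I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Long format: one row per observation, with a column naming the group.
df = pd.concat(
    [pd.DataFrame({"dataset": name, "x": xs, "y": ys})
     for name, (xs, ys) in quartet.items()],
    ignore_index=True,
)

# One panel per group, each with its own fitted line (no confidence band).
g = sns.lmplot(
    data=df, x="x", y="y",
    col="dataset", hue="dataset",
    col_wrap=2, ci=None, height=4,
)
g.savefig("anscombe_lmplot.png")  # arbitrary output filename
```

Faceting with `col="dataset"` is what makes the comparison easy: the four fitted lines look alike while the point clouds around them clearly do not.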

The resulting image is shown below:

Plot of linear model fit using seaborn

The Anscombe dataset is also available in R, but it requires a little reshaping to get it into a long format for plotting. This is the code that was used for the cover image of this article, and it leverages the powerful ggplot2 API.

Plotting linear model fits by group using ggplot in R

To demonstrate how such different datasets can still have very similar fitted model statistics, you can run the code below in R to fit a simple linear regression model to each group and extract the intercept, slope and R-squared for each.

Extracting some linear model statistics for multiple models in R
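The R listing is not reproduced in this text. As a rough Python counterpart, and not the author's code, the sketch below fits an ordinary least-squares line to each group using only the standard library and reports the intercept, slope and R-squared. The helper name `fit_line` is hypothetical, and the quartet values are typed in directly.

```python
from statistics import fmean


def fit_line(xs, ys):
    """Return (intercept, slope, r_squared) for the least-squares line y = a + b*x."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # With a single predictor, R-squared equals the squared correlation.
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared


# Anscombe's quartet; groups I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

results = {name: fit_line(xs, ys) for name, (xs, ys) in quartet.items()}
for name, (intercept, slope, r_squared) in results.items():
    print(f"{name}: intercept={intercept:.2f} slope={slope:.2f} R^2={r_squared:.2f}")
```

Rounded to two decimals, every group comes out at an intercept of about 3.00, a slope of about 0.50 and an R-squared of about 0.67, which is exactly why the numbers alone cannot tell the four groups apart.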

The R-squared values, intercepts and slopes for each group are virtually the same for practical purposes. The only way you can really tell which model fit is reasonable, and which needs more data cleaning, investigation or a different model type, is by plotting the fitted model against the data. So before selecting models, fine-tuning or anything else, plot your fitted model against your data. This is your sanity check.

Each week I publish a new article in Data Science Code in Python + R. The articles focus on how to do one data science job in both languages and can be digested in 5 minutes or less. You can subscribe to the newsletter to get a simple reminder in your LinkedIn notifications. Keep progressing toward building your Python and R data science skills by building on the skills you already have.

~ Matt

Afeworki Ytbarek

Data Analyst/Certified Machine Learning & Artificial Intelligence/Electrical Engineer/Mathematician

2 years ago

great models

Freddy Sanchez Vallejos

Consultor Fiduciario BID - /Rstudio/Stats/PowerBi

2 years ago

Matt Rosinski Thanks for sharing. In the case of image number 4, it seems that it is classified as a linear regression, something similar to a logistic regression.

Matt Dancho

Sharing my journey to becoming a Generative AI Data Scientist. Join 1,000+ in my next free workshop.

2 years ago

Love that ggplot tidyquant theme!
