Delve deeper into R-squared.
CHETAN SALUNKE
Data Scientist | Globally Certified TensorFlow Developer | Silver Medal in Master of Statistics | ML | DL | NLP | LLM | Gen AI | Prompt Engineering IBM Certified Data Professional | Python | SQL | Power BI | Statistics
A good model can have a low R2 value. On the other hand, a biased model can have a high R2 value!
R-squared is a goodness-of-fit measure for linear regression models. This statistic indicates the percentage of the variance in the dependent variable that the independent variables explain collectively.
After fitting a linear regression model, you need to determine how well the model fits the data. Does it do a good job of explaining changes in the dependent variable? There are several key goodness-of-fit statistics for regression analysis. In this post, we’ll examine R-squared (R2-Score), highlight some of its limitations, and discover some surprises. For instance, small R-squared values are not always a problem, and high R-squared values are not necessarily good!
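For a least-squares model with an intercept, evaluated on the data it was fitted to, R-squared is always between 0 and 100%. A value of 0% means the model explains none of the variance in the dependent variable around its mean, and 100% means it explains all of it.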
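As a minimal sketch of how this number is usually computed, R-squared can be written as 1 - RSS/TSS, where RSS is the residual sum of squares and TSS is the total sum of squares around the mean. The helper name r_squared and the small data set below are made up for illustration, and NumPy is assumed to be available.

```python
import numpy as np

def r_squared(y, y_hat):
    """Return R2 = 1 - RSS/TSS as a fraction (multiply by 100 for a percentage)."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    rss = np.sum((y - y_hat) ** 2)      # variation left unexplained by the model
    tss = np.sum((y - y.mean()) ** 2)   # total variation around the mean
    return 1.0 - rss / tss

# Made-up illustrative data (not from the article)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 5.8])

# Least-squares line; np.polyfit returns the highest-degree coefficient first
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x

r2 = r_squared(y, y_hat)
# Predicting the mean for every observation explains none of the variance
r2_mean_only = r_squared(y, np.full_like(y, y.mean()))

print(f"R2 of the fitted line: {r2:.3f}")
print(f"R2 of the mean-only prediction: {r2_mean_only:.3f}")
```

The mean-only prediction is the natural baseline: it gives RSS equal to TSS, so its R-squared is 0.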
In Simple Linear Regression:
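The R2 statistic is a measure of the linear relationship between X and Y. Recall that correlation, defined as

\[ \mathrm{Cor}(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}, \]

is also a measure of the linear relationship between X and Y.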
This suggests that we might be able to use r = Cor(X, Y) instead of R2 in order to assess the fit of the linear model. It can be shown that in the simple linear regression setting, R^2 = r^2. In other words, the squared correlation and the R2 statistic are identical. However, in the next section, we will discuss the multiple linear regression problem, in which we use several predictors simultaneously to predict the response. The concept of correlation between the predictors and the response does not extend automatically to this setting, since correlation quantifies the association between a single pair of variables rather than between a larger number of variables. We will see that R2 fills this role.
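The identity between R2 and the squared correlation in simple linear regression can be checked numerically. The sketch below uses simulated data; the seed, sample size, and coefficients are arbitrary assumptions chosen only for illustration.

```python
import numpy as np

# Simulated data: one predictor, linear signal plus noise (all values are arbitrary)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=200)

# Fit the least-squares line with an intercept
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x

# R2 from its definition, 1 - RSS/TSS
r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

# Sample correlation between predictor and response
r = np.corrcoef(x, y)[0, 1]

print(f"R2  = {r2:.6f}")
print(f"r^2 = {r ** 2:.6f}")
```

The two printed values should agree up to floating-point rounding.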
In Multiple Linear Regression:
In multiple linear regression, it turns out that R2 equals Cor(Y, Ŷ)^2, the square of the correlation between the response and the fitted linear model; in fact, one property of the fitted linear model is that it maximizes this correlation among all possible linear models.
Reference: An Introduction to Statistical Learning (ISLR), p. 79.
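The multiple-regression property can also be checked on simulated data. In this sketch, the three predictors, their coefficients, and the perturbed coefficient vector other_beta are all illustrative assumptions; the fit uses NumPy's least-squares solver rather than any particular modelling library.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300

# Three simulated predictors and a linear response with noise (arbitrary values)
X = rng.normal(size=(n, 3))
y = 1.0 + X @ np.array([0.8, -0.5, 0.3]) + rng.normal(scale=1.5, size=n)

# Design matrix with an intercept column, then ordinary least squares
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ beta

# R2 from its definition, 1 - RSS/TSS
r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

# Squared correlation between the response and the fitted values
cor_sq = np.corrcoef(y, y_hat)[0, 1] ** 2

# A different (non-least-squares) linear model correlates no better with y
other_beta = beta + rng.normal(scale=0.5, size=beta.size)
cor_sq_other = np.corrcoef(y, A @ other_beta)[0, 1] ** 2

print(f"R2                    = {r2:.6f}")
print(f"Cor(Y, Y_hat)^2       = {cor_sq:.6f}")
print(f"Cor(Y, other model)^2 = {cor_sq_other:.6f}")
```

The first two values should match up to rounding, and the third should not exceed them.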
Visual Representation of R-squared
To visually demonstrate how R-squared values represent the scatter around the regression line, you can plot the fitted values against the observed values.
The R-squared for the regression model on the left is 15%, and for the model on the right it is 85%. When a regression model accounts for more of the variance, the data points are closer to the regression line. In practice, you’ll never see a regression model with an R2 of 100%. If you did, the fitted values would equal the data values and, consequently, all the observations would fall exactly on the regression line.
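A numerical sketch of the same idea: the same underlying line is observed with two different noise levels, and the noisier data set yields a much lower R-squared. The slope, intercept, and noise scales here are assumptions picked so that the two values land in roughly the same range as the figures quoted above; they are not the data behind the original plots. Plotting is left out, but either data set can be passed to a plotting library to see the scatter.

```python
import numpy as np

def fit_line_r2(x, y):
    """Fit a least-squares line and return its R2 as a fraction."""
    slope, intercept = np.polyfit(x, y, deg=1)
    y_hat = intercept + slope * x
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(42)
x = np.linspace(0, 10, 200)
signal = 3.0 + 1.0 * x  # the same true line in both cases

# Wide scatter around the line versus tight scatter around the line
y_noisy = signal + rng.normal(scale=7.0, size=x.size)
y_tight = signal + rng.normal(scale=1.2, size=x.size)

r2_noisy = fit_line_r2(x, y_noisy)
r2_tight = fit_line_r2(x, y_tight)

print(f"R2 with wide scatter:  {r2_noisy:.2f}")
print(f"R2 with tight scatter: {r2_tight:.2f}")
```

In both cases the fitted model has the correct functional form; only the amount of unexplained noise differs.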
R-squared has Limitations
You cannot use R-squared to determine whether the coefficient estimates and predictions are biased, which is why you must assess the residual plots.
R-squared does not indicate if a regression model provides an adequate fit to your data. A good model can have a low R2 value. On the other hand, a biased model can have a high R2 value!
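To illustrate the second half of that claim, the sketch below fits a straight line to data that are actually quadratic. The data are simulated and every constant is an arbitrary assumption. R-squared comes out high, yet the residuals show a systematic pattern (positive at both ends, negative in the middle), which is the kind of bias a residual plot would reveal.

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(0, 10, 100)

# The true relationship is quadratic, but we will fit a straight line to it
y = 2.0 + 0.5 * x ** 2 + rng.normal(scale=1.0, size=x.size)

slope, intercept = np.polyfit(x, y, deg=1)
resid = y - (intercept + slope * x)
r2 = 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

# Average residual in the left, middle, and right thirds of the x range
left, middle, right = np.array_split(resid, 3)

print(f"R2 of the misspecified linear fit: {r2:.3f}")
print(f"Mean residual (left third):   {left.mean():+.2f}")
print(f"Mean residual (middle third): {middle.mean():+.2f}")
print(f"Mean residual (right third):  {right.mean():+.2f}")
```

An unbiased fit would leave residuals that average close to zero in every region of x, so the sign pattern here signals a misspecified model despite the high R-squared.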
The value of R2 always lies between 0 and 1. However, it can still be challenging to determine what is a good R2 value, and in general, this will depend on the application. For instance, in certain problems in physics, we may know that the data truly comes from a linear model with a small residual error. In this case, we would expect to see an R2 value that is extremely close to 1, and a substantially smaller R2 value might indicate a serious problem with the experiment in which the data were generated. On the other hand, in typical applications in biology, psychology, marketing, and other domains, the linear model is at best an extremely rough approximation to the data, and residual errors due to other unmeasured factors are often very large. In this setting, we would expect only a very small proportion of the variance in the response to be explained by the predictor, and an R2 value well below 0.1 might be more realistic!
R-squared Is Not Always Straightforward
At first glance, R-squared seems like an easy-to-understand statistic that indicates how well a regression model fits a data set. However, it doesn’t tell us the entire story. To get the full picture, you must consider R2 values in combination with residual plots, other statistics, and in-depth knowledge of the subject area.