R Linear Regression
Malini Shukla
Basics of Linear Regression
Regression analysis is a statistical tool for determining relationships between variables. Variables that remain unaffected by changes in other variables are known as independent variables (also called predictor or explanatory variables), while those that are affected are known as dependent variables (also called response variables).
Linear regression is a statistical procedure which is used to predict the value of a response variable, on the basis of one or more predictor variables.
There are two types of linear regressions in R:
- Simple Linear Regression – Value of response variable depends on a single explanatory variable.
- Multiple Linear Regression – Value of the response variable depends on more than one explanatory variable.
Some common applications of linear regression include estimating GDP, the capital asset pricing model (CAPM), oil and gas prices, and predictions in medical diagnosis.
Simple Linear Regression in R
Simple linear regression in R enables us to find a relationship between a continuous dependent variable Y and a continuous independent variable X. It is assumed that the values of X are controlled and not subject to measurement error, and that the corresponding values of Y are observed.
The general simple linear regression model to evaluate the value of Y for a value of X:
yi = β0 + β1xi + εi
Here, the ith data point, yi, is determined by the value of the predictor xi;
β0 and β1 are the regression coefficients (intercept and slope);
εi is the error associated with the ith observation of y.
Regression analysis is implemented to do the following:
- Establish a relationship between independent (x) and dependent (y) variables.
- Predict the value of y based on a set of values of x1, x2…xn.
- Identify which independent variables are important in explaining the dependent variable, and thereby establish a more precise and accurate causal relationship between the variables.
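For example, a simple linear regression can be fitted in R with the lm() function. The sketch below uses R's built-in mtcars dataset purely for illustration, predicting fuel efficiency (mpg) from car weight (wt):

```r
# Simple linear regression: predict fuel efficiency (mpg) from car weight (wt)
model <- lm(mpg ~ wt, data = mtcars)

# Estimated regression coefficients: intercept (beta0) and slope (beta1)
coef(model)

# Full summary: coefficients, standard errors, R-squared, F-statistic, etc.
summary(model)

# Predict the response for new values of the predictor
predict(model, newdata = data.frame(wt = c(2.5, 3.0, 3.5)))
```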
Multiple Linear Regression in R
In the real world, you may encounter situations where you have to deal with more than one predictor variable to evaluate the value of the response variable. In this case, simple linear models cannot be used, and you need multiple linear regression in R to perform such an analysis with multiple predictor variables.
A multiple linear regression model in R with two explanatory variables can be written as:
yi = β0 + β1x1i + β2x2i + εi
Here, the ith data point, yi, is determined by the levels of the two continuous explanatory variables x1i and x2i, by the three parameters β0, β1, and β2 of the model, and by the residual εi of point i from the fitted surface.
The general multiple regression model with k explanatory variables can be represented as:
yi = β0 + Σ βjxji + εi   (summing over j = 1, …, k)
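A minimal sketch of a multiple linear regression in R, again using the built-in mtcars dataset for illustration, with wt and hp as the two explanatory variables:

```r
# Multiple linear regression with two explanatory variables: wt and hp
model_multi <- lm(mpg ~ wt + hp, data = mtcars)

# beta0 (intercept), beta1 (coefficient of wt), and beta2 (coefficient of hp)
coef(model_multi)

# Summary includes a t-test for each coefficient and the overall F-test
summary(model_multi)
```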
Least Squares Estimation
Simple and multiple regression models cannot explain a non-linear relationship between the variables.
Multiple regression equations are defined in the same way as the simple regression equation, using the least squares method. The values of the unknown parameters are calculated by the least squares estimation method.
The least squares estimation method minimizes the sum of squared errors to find the best-fitting line for the given data. These errors arise from the deviation of the observed points from the proposed line; this deviation is called the residual in regression analysis.
The sum of squares of residuals (SSR) is calculated as follows:
SSR = Σe² = Σ(y − (b0 + b1x))²
Where e is the error, y and x are the variables, and b0 and b1 are the unknown parameters or coefficients.
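In R, lm() performs this least squares estimation for you, and the residuals and their sum of squares can be inspected directly. The sketch below uses the mtcars dataset purely for illustration:

```r
# Fit a simple linear model; lm() chooses b0 and b1 by least squares
fit <- lm(mpg ~ wt, data = mtcars)

# Residuals: observed y minus fitted values (b0 + b1 * x)
e <- residuals(fit)

# Sum of squared residuals (SSR) that least squares minimizes
sum(e^2)

# The same quantity written out explicitly from the coefficients
b <- coef(fit)
sum((mtcars$mpg - (b[1] + b[2] * mtcars$wt))^2)
```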
Checking Model Adequacy
Regression models are used for predictions. For reliable predictions, it is important to first check the adequacy of these models.
R Squared and Adjusted R Squared methods are used to check the adequacy of models.
High values of R-squared indicate a strong correlation between the response and predictor variables, while low values mean that the developed regression model is not appropriate for the required predictions.
The value of R-squared lies between 0 and 1, where 0 means no correlation between the sample data and 1 means an exact linear relationship.
One can calculate R Squared using the following formula:
R2 = 1 – (SSR/SST)
Here, SST is the total sum of squares and SSR is the sum of squared residuals (errors), as defined above.
When deciding whether to add a new explanatory variable to an existing regression model, use adjusted R-squared rather than R-squared. The adjusted R-squared method takes the number of explanatory variables into account and includes a statistical penalty for each new predictor added to the model.
Similar to R-squared, adjusted R-squared measures the proportion of the variation in the dependent variable explained by all the explanatory variables.
We can calculate the Adjusted R Squared as follows:
Adjusted R² = R² – [k(1 − R²) / (n − k − 1)]
Here, n represents the number of observations and k represents the number of explanatory variables (predictors).
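Both measures are reported by summary() on a fitted lm object and can also be computed by hand from the formulas above. A sketch using the mtcars dataset for illustration:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# R-squared and adjusted R-squared as reported by summary()
summary(fit)$r.squared
summary(fit)$adj.r.squared

# Manual check: R^2 = 1 - SSR/SST
ssr <- sum(residuals(fit)^2)
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
r2  <- 1 - ssr / sst

# Adjusted R^2 = R^2 - k(1 - R^2) / (n - k - 1),
# with n observations and k explanatory variables
n <- nrow(mtcars)
k <- 2
r2 - k * (1 - r2) / (n - k - 1)
```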
Regression Assumptions
When building a regression model, statisticians make some basic assumptions to ensure the validity of the regression model. These are:
- Linearity – Assumes a linear relationship between the dependent and independent variables. Because it treats the predictor variables as fixed values (see above), linearity is really only a restriction on the parameters.
- Independence – This assumes that the errors of the response variables are uncorrelated with each other.
- Homoscedasticity – This means that different response variables have the same variance in their errors, regardless of the values of the predictor variables. In practice this assumption is invalid (i.e. the errors are heteroscedastic) if the response variables can vary over a wide scale.
- Normality – Assumes normal distribution of errors in the collected samples.
The regression model may be insufficient for making predictions if any of these assumptions is violated.
Note: The complexity of a regression model increases with the number of parameters.
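These assumptions can be checked informally with R's built-in diagnostic plots for lm objects. The sketch below, using mtcars for illustration, also adds a Shapiro–Wilk test of residual normality:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Four standard diagnostic plots: residuals vs fitted (linearity),
# normal Q-Q (normality), scale-location (homoscedasticity),
# and residuals vs leverage (influential points)
par(mfrow = c(2, 2))
plot(fit)

# A formal test of normality of the residuals (Shapiro-Wilk)
shapiro.test(residuals(fit))
```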
Multicollinearity
Multicollinearity refers to redundancy among predictors: it is a linear relationship between two or more explanatory variables, leading to inaccurate parameter estimates. Multicollinearity exists when two or more explanatory variables have an exact or approximate linear relationship with each other.
One can detect the Multicollinearity by calculating VIF with the help of the following formula:
VIF = 1/ (1-Ri2)
Here, Ri² is the coefficient of determination obtained by regressing the explanatory variable xi on all the other explanatory variables.
In a regression model, multicollinearity is identified when significant changes are observed in the estimated regression coefficients while adding or deleting explanatory variables, or when the VIF is high (5 or above) for a predictor.
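As an illustrative sketch (again using mtcars), the VIF for a single predictor can be computed by hand from the auxiliary regression described above; the vif() function from the external car package computes it for all predictors at once, assuming that package is installed:

```r
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

# VIF for wt by hand: regress wt on the other predictors and
# plug the R-squared of that auxiliary regression into 1 / (1 - Ri^2)
r2_wt <- summary(lm(wt ~ hp + disp, data = mtcars))$r.squared
1 / (1 - r2_wt)

# vif() from the 'car' package computes the VIF for every predictor
# (assumes install.packages("car") has been run)
library(car)
vif(fit)
```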
Following are some impacts of Multicollinearity:
- Unreliable estimates of the regression coefficients.
- Inability to estimate the standard errors and coefficients precisely.
- High variance and covariance in the ordinary least squares estimates for closely related variables, making it difficult to assess the estimates precisely.
- Relatively large standard errors, which increase the chance of failing to reject the null hypothesis.
- Deflated t-statistics and degraded model predictability.
Having seen the effects of multicollinearity, remove or reduce it, where possible, in the following ways:
- Specifying the regression model again.
- Using prior information or restrictions while estimating the coefficients.
- Collecting new data or increasing the sample size.