Linear Regression

Regression
Regression analysis is a statistical methodology that allows us to determine the strength and direction of the relationship between variables. Regression is not limited to two variables; we can have two or more variables showing a relationship. The results from the regression help in predicting an unknown value based on its relationship with the predictor variables. For example, someone’s height and weight usually have a relationship: generally, taller people tend to weigh more. We could use regression analysis to help predict the weight of an individual, given their height.

When there is a single input variable, the regression is referred to as Simple Linear Regression. We use the single (independent) variable to model a linear relationship with the target (dependent) variable by fitting a model that describes the relationship. If there is more than one predicting variable, the regression is referred to as Multiple Linear Regression.

Ordinary Least Squares

You may have heard of Ordinary Least Squares Regression. When we are attempting to find the “best fit line”, the regression model is sometimes referred to as Ordinary Least Squares Regression. This just means that we choose the line with the smallest sum of squared errors. The error is the difference between the actual y value and the predicted y value. Each difference is squared so that it is always positive, and the squared differences are summed.

error = y_actual - y_predicted
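As a quick sketch of the quantity OLS minimizes, here is a small numpy example (the arrays are hypothetical, chosen only for illustration):

#Sum of squared errors with numpy
import numpy as np

y_actual = np.array([3.0, 4.5, 6.1, 7.8])      #hypothetical observed values
y_predicted = np.array([3.2, 4.4, 6.5, 7.5])   #hypothetical predictions from a candidate line
errors = y_actual - y_predicted                #vertical offsets
sum_squared_errors = np.sum(errors ** 2)       #OLS picks the line that minimizes this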
“Hat” Notation

When using the data points to draw a regression line, we’re actually working with estimations. Our goal is to find the line that best describes the data. When working with estimations, we can make use of the “hat” notation (^). The formula for drawing the “best fit line” when working with estimations is the same as the straight line formula, but with hat notation.

Straight Line Formula: y = mx + c
        - Where {m} is the slope and {c} is the intercept

Straight Line Formula with estimations:

ŷ = m̂x + ĉ

Or

ŷ = β̂0 + β̂1x

We’re just estimating a proper intercept and slope. Once we have drawn the “best fit line” we are ready to make some predictions. However, since our prediction is based on the parameter values we estimate, when we predict new y values given x, there will be some error, or vertical offset, between the line and each observed point. This error is denoted as |ŷ − y|, where ŷ is the value on our regression line and y is the actual observed value.

Regression Coefficients

When performing simple linear regression, the four main components are:

Dependent Variable — Target variable / will be estimated and predicted
Independent Variable — Predictor variable / used to estimate and predict
Slope — Angle of the line / denoted as m or β̂1
Intercept — Where the line crosses the y-axis / denoted as c or β̂0

The last two, slope and intercept, are the coefficients/parameters of a linear regression model, so when we calculate the regression model, we’re just calculating these two. In the end, we’re trying to find the best-fit line describing the data, out of an infinite number of possible lines. To find the slope of a line, we can choose any segment of the line and divide the change in y by the change in x.

Δy — Change in y
Δx — Change in x

We need to calculate some statistical measures before calculating the “best fit line”:

Slope Formula With the Least Squares Method: m = (mean(x) · mean(y) − mean(x·y)) / (mean(x)² − mean(x²))

Intercept Formula Using the Slope: c = mean(y) − m · mean(x)

The intercept formula above multiplies the slope by the mean of x and subtracts that value from the mean of y.

For the example plot (a scatter of sample data with the fitted line), we can see that:

Δy = 4.66
Δx = 8.64
4.66 (change in y) / 8.64 (change in x) = 0.54 (slope)
c (Intercept) = 6.38, where the line intersects the y-axis
m (Slope) = 0.54
Hand Coding Slope with Numpy
numerator = (np.mean(X) * np.mean(Y)) - np.mean(X * Y)
denominator = (np.mean(X)**2) - np.mean(X**2)
m = numerator / denominator
Hand Coding Intercept with Numpy
c = np.mean(Y) - m * np.mean(X)
Prediction With Best Fit Line

If we have a new value (x), we can calculate the prediction (y) with the data we already have.

new_x_value = 9
y_predicted = (m * new_x_value) + c
y_predicted
#Output:
11.233796296296294
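For completeness, here is a minimal, self-contained sketch of the whole calculation with hypothetical data (these numbers are made up for illustration and will not reproduce the values above):

#Slope, intercept, and prediction end to end
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   #hypothetical predictor values
Y = np.array([7.1, 7.6, 8.2, 8.5, 9.1])   #hypothetical target values

m = ((np.mean(X) * np.mean(Y)) - np.mean(X * Y)) / ((np.mean(X)**2) - np.mean(X**2))
c = np.mean(Y) - m * np.mean(X)
y_predicted = (m * 9) + c                 #prediction for a new x value of 9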
R-Squared / Coefficient of Determination

We know that the goal of linear regression is to find the “best fit line” that describes the data. However, we saw above that the line won’t fully represent the relationship between the variables; there will always be some error (y_actual - y_predicted). The R-Squared measure can be used to determine how well a model fits the data. This measure is also known as the Coefficient of Determination.

R-Squared starts with a simple baseline model that uses the mean of the actual_y values to predict new_y values; this baseline always predicts the mean as new_y regardless of the x value. The fitted regression model is compared to this simple model to determine how well it fits.

R-Squared Formula:

R² = 1 - (SS_residual / SS_total) = 1 - Σ(y_actual - y_predicted)² / Σ(y_actual - mean(y_actual))²

The formula above can be read as: one minus the ratio of the model’s sum of squared errors to the sum of squared errors of the mean-only baseline model.
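As a rough numpy sketch of the same calculation (again with hypothetical actual and predicted values):

#R-Squared with numpy
import numpy as np

y_actual = np.array([3.0, 4.5, 6.1, 7.8])
y_predicted = np.array([3.2, 4.4, 6.5, 7.5])

ss_residual = np.sum((y_actual - y_predicted) ** 2)      #model's squared errors
ss_total = np.sum((y_actual - np.mean(y_actual)) ** 2)   #mean-only baseline's squared errors
r_squared = 1 - (ss_residual / ss_total)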
Assumptions for Linear Regression

Regression, being a parametric technique, relies on parameters learned from the data. This also means that the data must fulfill certain assumptions. These assumptions are necessary for obtaining reliable results. If the assumptions aren’t fulfilled, our predictions may be biased.

The plots used below were created using the Advertising dataset from Kaggle.

1.) Linearity

There is a linear relationship between the dependent variable (y) and the independent variable (x). We can check for linearity by creating a scatter plot.
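A scatter plot like the one described here can be produced with seaborn; a minimal sketch, assuming the Advertising data is loaded into a dataframe named sales (as it is later in this post):

#Scatter plot to check linearity
sns.scatterplot(x='TV', y='sales', data=sales);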

The scatter plot of TV advertisement spending against sales shows that there is a linear relationship between the two variables. We can interpret it by saying “As TV advertisement spending increases, so do the sales.” If you think you’re violating this assumption, try log-scaling your data.

2.) Normality

This assumption states that the residuals (the differences between actual_y and predicted_y) of the model are normally distributed. This assumption can be checked by creating histograms or Q-Q-Plots.

Q-Q-Plots (quantile-quantile plots) are scatterplots of two sets of quantiles plotted against each other.

To check the normality assumption using Q-Q-plots, we’re looking for a pretty straight line. It is worth noting that this is only a visual check. Another method of checking the normality assumption is the Jarque-Bera (JB) test.
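A minimal sketch of both checks, assuming fsm is the fitted OLS model used later in the post:

#Q-Q-Plot of the residuals and Jarque-Bera test
stats.probplot(fsm.resid, dist="norm", plot=plt);
jb_stat, jb_p_value = stats.jarque_bera(fsm.resid)
print("Jarque-Bera p-value:", jb_p_value)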

The Q-Q-Plot for the TV advertising and sales model shows a roughly straight line, which indicates that the residuals of the model are normally distributed.

3.) Little/No Multicollinearity (Multiple Linear Regression)

Multicollinearity describes the correlation between the predictor variables. This assumption states that the predictor variables are independent. We can check this assumption by creating pair plots and/or heat maps. Another method would be to calculate the Variance Inflation Factor (VIF).

The variance inflation factor is a measure for the increase of the variance of the parameter estimates if an additional variable, given by exog_idx is added to the linear regression. It is a measure for multicollinearity of the design matrix, exog.

— Statsmodels User Guide

#Pairplot
sns.pairplot(sales, vars = ['TV', 'radio', 'newspaper']);
Looking at the pairplot, we can see that the three advertising features (TV, radio, newspaper) are not highly correlated with one another.

#Heatmap (correlation matrix)
sns.heatmap(sales.drop('sales', axis=1).corr(), annot=True, cmap='plasma');
In the heatmap, we can see the actual correlation coefficients because the annot= parameter is set to True. Radio and newspaper show a correlation coefficient of 0.35, which tells us the features are not highly correlated.

#Variance Inflation Factor (VIF)
from statsmodels.stats.outliers_influence import variance_inflation_factor

r = sales[["newspaper", "radio"]].values
vif_df = pd.DataFrame()
vif_df["VIF"] = [variance_inflation_factor(r, i) for i in range(2)]
vif_df["feature"] = ["newspaper", "radio"]
vif_df
In general, a VIF value of 5 or more is considered too high. Looking at the resulting dataframe, we see little multicollinearity between the newspaper and radio features.

What would it mean if we did have features showing multicollinearity? For example, let’s say that the newspaper and radio features show multicollinearity. This would make it difficult for us to separate the effects of just newspaper on sales.

4.) Homoscedasticity

Homoscedasticity refers to the variability of the dependent variable being equal across the values of the independent variable. We can check this assumption by creating a scatterplot of the model predictions and residuals; we’re looking for the residuals to be spread evenly across the range of predictions. We could also use a significance test such as the Breusch-Pagan test.

sns.scatterplot(x=fsm.predict(), y=fsm.resid);
Looking at the resulting plot, the values do not appear to form a pattern on the right side, but they do form a pattern on the left side. The plot shows that the error increases with the predicted values, so the model is heteroscedastic.
        

#Breusch-Pagan test
#fsm_resids are the model residuals (fsm.resid)
fsm_resids = fsm.resid
lm, lm_p_value, fvalue, f_p_value = het_breuschpagan(fsm_resids, sales[["TV"]])
print("F-statistic p-value:", f_p_value)
The Breusch-Pagan test returns a very small p-value here, which tells us we can reject the null hypothesis (homoscedasticity); therefore, we’re violating the homoscedasticity assumption.

5.) Little/No Autocorrelation in Residuals

Autocorrelation refers to the model residuals not being independent. If there were correlation in the error terms, our model’s accuracy would decrease. This assumption can be checked using the Durbin-Watson test or by creating an error plot. For the Durbin-Watson (DW) test, we’re looking for a value between 1.5 and 2.5. A few things to know regarding the DW test:

2: No Autocorrelation
0–1.9: Positive Autocorrelation
2.1–4: Negative Autocorrelation        

#Durbin-Watson test in OLS summary

fsm.summary()
Looking at the OLS Summary, we can see the Durbin-Watson score is 1.935. This score tells us there is little to no autocorrelation in the model residuals.
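If you want the statistic by itself rather than the full summary, statsmodels also exposes it directly; a small sketch, again assuming fsm is the fitted model:

#Durbin-Watson statistic computed directly
from statsmodels.stats.stattools import durbin_watson

dw_score = durbin_watson(fsm.resid)   #values near 2 indicate little/no autocorrelation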

Linear Regression Analysis

Now that we’ve checked the assumptions, let’s fit a linear regression model and evaluate the summary table.

Importing necessary libraries and data
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
import seaborn as sns
from scipy import stats
from statsmodels.formula.api import ols
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.linear_model import LinearRegression
sales = pd.read_csv('data/Advertising.csv', index_col=0)
sales = sales[['TV', 'sales']]
sales.head()
The sales feature is our target (dependent variable) and TV is our predictor (independent variable).

Creating formula for OLS model
formula = 'sales~TV'
Fitting the model
model = ols(formula=formula, data=sales).fit()
Viewing the model summary
model.summary()
The left part of the top table provides information on the data and model
The right part of the top table provides information on how well the model is fit
The middle table is a coefficient report
The bottom table provides information on residuals, autocorrelation, and multicollinearity

R-Squared: Percent of variance in the target explained by the model

Adj. R-Squared: R-Squared adjusted so that additional independent variables are penalized

F-statistic: Significance of the overall fit

Prob (F-statistic): Probability of observing an F-statistic at least this large if the model had no explanatory power

Log-Likelihood: Log of the likelihood function

AIC: Akaike Information Criterion; penalizes the model as more independent variables are added

BIC: Bayesian Information Criterion; similar to AIC but with a higher penalty

coef: Estimated coefficient value

std err: Standard error of the coefficient estimate

t: Measure of statistical significance for coefficient

P>|t|: p-value for the test that the coefficient is equal to 0

[0.025 0.975]: Lower and upper bounds of the 95% confidence interval

Omnibus: D’Agostino’s omnibus test, a statistical test for skewness and kurtosis

Prob(Omnibus): Omnibus statistic as a probability

Skew: Measure of the symmetry of the data around the mean

Kurtosis: Measure of shape of the distribution

Durbin-Watson: Test for autocorrelation

Jarque-Bera (JB): Test for skewness & kurtosis

Prob (JB): Jarque-Bera statistic as a probability

Cond. No.: Condition number; an indicator of multicollinearity
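Most of these quantities are also available as attributes of the fitted results object, which is handy when you want them programmatically rather than from the rendered table. A small sketch using the model fitted above:

#Pulling key values off the fitted model
print(model.rsquared)                 #R-Squared
print(model.rsquared_adj)             #Adj. R-Squared
print(model.fvalue, model.f_pvalue)   #F-statistic and Prob (F-statistic)
print(model.params)                   #coef for the Intercept and TV
print(model.pvalues)                  #P>|t| for each coefficient
print(model.conf_int())               #[0.025 0.975] confidence intervals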

Conclusion

There you have it, a breakdown of linear regression analysis. Regression analysis is one of the first modeling techniques to learn as a data scientist. It can be helpful when forecasting continuous values, e.g., sales or temperature. There are quite a few formulas to learn, but they’re necessary to understand what’s happening “under the hood” when we run linear regression models. As you saw above, there are many ways to check the assumptions of linear regression; hopefully you now have a better understanding of them. Thanks so much for taking the time to check out this post!

