Python Residual Sum Of Squares: Tutorial & Examples

Residual sum of squares (RSS) is a statistical measure of the variation in a dependent variable that a regression model doesn't explain. It quantifies the distance between a regression model's predictions and the ground truth values. Therefore, the higher the RSS, the worse a model is performing.

Python residual sum of squares, which uses the Python programming language to calculate RSS, is useful for applications where validating a model's predictive capabilities is essential. For example, financial analysis and financial modeling are typical applications for Python RSS.

There are multiple ways to implement RSS using Python. This article will explain how to implement RSS in pure Python and use the statsmodels API on a real-world dataset. We will also summarize the different ways to implement RSS in Python, and the best practices for utilizing RSS in a real-world project.

Python residual sum of squares key concepts

Below is a summary of key RSS concepts that will help you better understand the following sections.

- Residual: another word for error; the distance between a model's prediction and the ground truth value.
- RSS (residual sum of squares): the sum of the squared residuals; the variation in the data the model leaves unexplained.
- SSR (sum of squares regression): the squared distance between the predictions and the mean of the dependent variable; the variation the model explains.
- SST (sum of squares total): the overall variation of the dependent variable around its mean; SST = SSR + RSS.

RSS in regression analysis and time series analysis

The residual sum of squares is a key metric that gives a numerical representation of how well your regression model fits the data. Specifically, RSS indicates how well an independent variable predicts a dependent variable. This technique works for long-term forecasting as well as forecasting with limited data (DeepCasting).

Residual sum of squares is widely used in time series analysis, especially in finance and FinTech. RSS helps firms assess and iterate on their financial models to gain a competitive advantage and improve the quality of their predictions.

For example, investors and financial analysts who track an asset's price over time and try to predict its future value can use RSS to gauge how accurate their regression model's predictions are.

RSS can also be used in other time-series analysis applications, like sales forecasting, predicting real estate prices over time, and modeling biological signals such as electrical activity in the brain.

How is RSS calculated?

The term “residual” is just another word for “error”: the distance between two data points, usually a predicted value and a ground truth value. As the name suggests, RSS is calculated by summing the squared residuals. The formula is as follows:

RSS = Σ (yᵢ − ŷᵢ)²

where yᵢ is the ground truth value, ŷᵢ is the model's prediction for point i, and the sum runs over all n data points.

The relationship between RSS, SSR, and SST

There is often some confusion between the sum of squares regression (SSR), the sum of squares total (SST), and the residual sum of squares. While the names are similar, they each play a different role and are commonly used in regression analysis.

SST is the sum of squared differences between the dependent variable (target variable) and its mean. SST provides insight into the overall variance of the target variable.

SSR is the sum of squared differences between the predictions and the mean of the dependent variable. It measures the total distance between the predictions and the center of the dependent variable.

The relationship between the three metrics is as follows:

SST = SSR + RSS

To put it in words: the total variance of the data equals the variability explained by the regression line plus the unexplained variability in the dataset (noise).
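To see the identity hold numerically, here is a minimal sketch that fits an ordinary least squares line with scikit-learn, reusing the sample data from the examples below. Note that the decomposition only holds exactly for least-squares fits that include an intercept:

import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([1.2, 5.0, 6.1, 9.1, 12.3, 11.2, 15.0]).reshape(-1, 1)
y = np.array([2.1, 4.2, 5.5, 10.1, 12.3, 11.2, 14.9])

y_hat = LinearRegression().fit(x, y).predict(x)

sst = np.sum((y - y.mean()) ** 2)      # total variability of the target
ssr = np.sum((y_hat - y.mean()) ** 2)  # variability explained by the line
rss = np.sum((y - y_hat) ** 2)         # unexplained variability (residuals)

print(np.isclose(sst, ssr + rss))      # True, up to floating-point error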

How to calculate RSS in Python

In the sections below, we'll provide sample code to help you get started with Python residual sum of squares calculations.

Calculating RSS in pure Python

def RSS(y, y_hat):
    """Return the residual sum of squares between two sequences."""
    total_RSS = 0.
    for y_, y_hat_ in zip(y, y_hat):
        total_RSS += (y_ - y_hat_) ** 2
    return total_RSS

y = [2.2, 4.5, 5.1, 8.0, 12.9, 14.3]      # dependent variable (ground truth)
y_hat = [2.2, 5.1, 5.3, 9.1, 11.9, 14.3]  # model's predictions

print(RSS(y, y_hat))
	Out [1]: 2.609999999999999
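If NumPy is available, the same calculation collapses to a vectorized one-liner; this is just an alternative sketch, not required for the rest of the article:

import numpy as np

def RSS_np(y, y_hat):
    # element-wise residuals, squared and summed in one vectorized step
    return np.sum((np.asarray(y) - np.asarray(y_hat)) ** 2)

print(RSS_np(y, y_hat))  # 2.61, up to floating-point error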
        

Calculating RSS from a linear regression model

import numpy as np
from sklearn.linear_model import LinearRegression

def RSS(y, y_hat):
    total_RSS = 0.
    for y_, y_hat_ in zip(y, y_hat):
        total_RSS += (y_ - y_hat_) ** 2
    return total_RSS

x = np.array([1.2, 5.0, 6.1, 9.1, 12.3, 11.2, 15.0]).reshape(-1, 1)
y = np.array([2.1, 4.2, 5.5, 10.1, 12.3, 11.2, 14.9]).reshape(-1, 1)

model = LinearRegression()
model.fit(x, y)
y_hat = model.predict(x)

print(RSS(y, y_hat))  # y is a column vector, so this prints a 1-element array
	 

	Out [1]: [2.78149043]

Calculating RSS using the statsmodels API

import statsmodels.api as sm

x = [1.2, 5.0, 6.1, 9.1, 12.3, 11.2, 15.0]
y = [2.1, 4.2, 5.5, 10.1, 12.3, 11.2, 14.9]

# add_constant adds the constant "1" column that OLS needs to fit an intercept
x = sm.add_constant(x)

# fit linear regression model
model = sm.OLS(y, x).fit()

# display model summary
print(model.summary())
print('\n', '='*20, '\n')

# residual sum of squares
print('RSS = ', model.ssr)

The output when using statsmodels includes a general summary of the OLS model, but what's relevant to this article is the RSS score in the last line of the output below.

Out [1]:                            OLS Regression Results                           
==============================================================================
Dep. Variable:                      y   R-squared:                       0.991
Model:                            OLS   Adj. R-squared:                  0.990
Method:                 Least Squares   F-statistic:                     794.4
Date:                Mon, 05 Sep 2022   Prob (F-statistic):           1.82e-08
Time:                        07:00:55   Log-Likelihood:                -7.4864
No. Observations:                   9   AIC:                             18.97
Df Residuals:                       7   BIC:                             19.37
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.1438      0.440      0.327      0.753      -0.896       1.184
x1             0.9920      0.035     28.185      0.000       0.909       1.075
==============================================================================
Omnibus:                        0.044   Durbin-Watson:                   2.319
Prob(Omnibus):                  0.978   Jarque-Bera (JB):                0.196
Skew:                           0.115   Prob(JB):                        0.907
Kurtosis:                       2.315   Cond. No.                         26.3
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

 ==================== 

RSS =  2.7814904265783675         
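As a sanity check, the fitted results object also exposes the raw residuals (model.resid), so the same number can be recomputed by hand:

import numpy as np

# Summing the squared residuals directly should reproduce model.ssr
print(np.sum(model.resid ** 2))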

Python residual sum of squares in practice

For this example, we'll apply a regression model using statsmodels to the Swedish auto insurance dataset. This small, simple dataset can be found here.

The dataset has one dependent variable and one independent variable. The independent variable is the number of insurance claims a person has filed. The dependent variable is the total payment for all claims, in thousands of Swedish kronor.

First, import the dataset and clean it up.

import pandas as pd

url = 'https://raw.githubusercontent.com/hargurjeet/MachineLearning/Swedish-Auto-Insurance-Dataset/insurance.csv'

# Read each line as a single raw column, then split it into the two real columns
df_raw = pd.read_csv(url, sep='delimiter', header=None, engine='python')

df = df_raw.drop([0, 1, 2, 3], axis=0).reset_index(drop=True).rename(columns={0: 'N_Claims'})
df = df.N_Claims.str.split(',', expand=True).rename(columns={0: 'N_Claims', 1: 'TotalPayment'})

# Convert the columns from str to numeric values
df.N_Claims = df.N_Claims.astype('float')
df.TotalPayment = df.TotalPayment.astype('float')
df.head()

Next, plot the dataset with Plotly express.

import plotly.express as px

fig = px.scatter(df, x='N_Claims', y='TotalPayment')
fig.show()

Next, fit the regression model and calculate RSS using statsmodels. For simplicity's sake we won't do a train/test split here, but you should always evaluate your model on a held-out test set, and ideally a validation set as well (see the sketch below).
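If you do want a quick hold-out evaluation, here is a minimal sketch using scikit-learn's train_test_split; the 80/20 split ratio and the random seed are arbitrary choices for illustration:

from sklearn.model_selection import train_test_split

# Hold out 20% of the rows: fit the model on train_df,
# then compute RSS on test_df to estimate out-of-sample error
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)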

import statsmodels.api as sm

x = df.N_Claims
y = df.TotalPayment

# add_constant adds the constant "1" column that OLS needs to fit an intercept
x = sm.add_constant(x)

# fit linear regression model
model = sm.OLS(y, x).fit()

# display model summary
print(model.summary())
print('\n', '='*20, '\n')

# residual sum of squares
print('RSS = ', model.ssr)


Out [1]: OLS Regression Results                           
==============================================================================
Dep. Variable:           TotalPayment   R-squared:                       0.833
Model:                            OLS   Adj. R-squared:                  0.831
Method:                 Least Squares   F-statistic:                     305.0
Date:                Tue, 06 Sep 2022   Prob (F-statistic):           2.05e-25
Time:                        09:08:49   Log-Likelihood:                -314.04
No. Observations:                  63   AIC:                             632.1
Df Residuals:                      61   BIC:                             636.4
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         19.9945      6.368      3.140      0.003       7.261      32.728
N_Claims       3.4138      0.195     17.465      0.000       3.023       3.805
==============================================================================
Omnibus:                        1.613   Durbin-Watson:                   1.199
Prob(Omnibus):                  0.446   Jarque-Bera (JB):                1.429
Skew:                           0.364   Prob(JB):                        0.489
Kurtosis:                       2.875   Cond. No.                         45.8
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

 ==================== 

RSS =  78796.74155103254
	         


Note that here the RSS is 78796.74, much larger than in our previous example. Because we square the residuals, RSS is heavily influenced by the magnitude of the variables.

For a single model, the raw RSS value doesn't directly tell you how well the model performs, unlike metrics such as the Mean Absolute Error (MAE), which stays on the same scale as the data. Standard practice is to compare candidate models and pick the regression line with the lowest RSS score, as that is the best-performing line.
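Since RSS grows with both the sample size and the scale of the target, normalizing it makes it easier to interpret. A minimal sketch using the fitted statsmodels results object from above:

import numpy as np

n = int(model.nobs)                  # number of observations
mse = model.ssr / n                  # mean squared error = RSS / n
sst = np.sum((y - y.mean()) ** 2)    # total sum of squares of the target
print('MSE =', mse)
print('R^2 =', 1 - model.ssr / sst)  # should match model.rsquared (0.833)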

Next, plot the regression line and data using matplotlib.

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

fig = plt.figure(figsize=(20, 4))
x_vals = np.arange(130)                              # claim counts 0-129
y_vals = model.params[0] + model.params[1] * x_vals  # intercept + slope * x

plt.scatter(df.N_Claims, df.TotalPayment)
plt.plot(x_vals, y_vals, '--')
plt.show()
        



Tips for optimizing RSS calculations

The basic code above is a practical way to get started with Python RSS. However, you'll need to do more data processing to achieve reliable results, including cleaning the dataset of null values, duplicates, infinite values, and other unusable entries, as sketched below.
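A minimal pandas sketch of that cleanup, assuming df is the DataFrame you are about to fit on:

import numpy as np

# Treat infinities as missing values, then drop missing and duplicate rows
df = df.replace([np.inf, -np.inf], np.nan)
df = df.dropna().drop_duplicates()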

Finally, perform hyperparameter optimization. Grid search with cross-validation iterates over a grid of candidate hyperparameters, tests each combination, and keeps the best-performing set, as in the sketch below.
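Plain linear regression has few hyperparameters to tune, so this hypothetical sketch searches over the regularization strength of a Ridge regression on the insurance data instead; the alpha grid and fold count are arbitrary:

from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

search = GridSearchCV(
    Ridge(),                                       # regularized linear model
    param_grid={'alpha': [0.01, 0.1, 1.0, 10.0]},  # candidate strengths
    scoring='neg_mean_squared_error',              # lower MSE = higher score
    cv=5,                                          # 5-fold cross-validation
)
search.fit(df[['N_Claims']], df['TotalPayment'])
print(search.best_params_, -search.best_score_)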

Additionally, it is best practice to evaluate your regression model with multiple metrics, such as MAE, Mean Squared Error (MSE), and the R² score. Each metric gives you an idea of the model's performance from a different point of view.
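As an illustration, here is a minimal sketch with scikit-learn's metrics module, assuming y holds the ground truth values and y_hat the predictions (for the statsmodels fit above, y_hat = model.fittedvalues):

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

print('MAE:', mean_absolute_error(y, y_hat))  # same units as the target
print('MSE:', mean_squared_error(y, y_hat))   # RSS divided by n
print('R2: ', r2_score(y, y_hat))             # fraction of variance explained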

Conclusion

We went over what RSS is, why and where it is used, how to calculate RSS in Python, and the relationship between RSS and other similar metrics.

While the conceptual knowledge you build writing code from scratch is valuable, it is usually better to leverage existing frameworks. Popular data analysis frameworks are often optimized to run calculations faster, handle exceptions, and have well-written documentation that can help you hit the ground running in real-world projects.

This article was originally published at https://www.ikigailabs.io/multivariate-time-series-forecasting-python/python-residual-sum-of-squares


