Polynomial Regression
Harry Thapa
Introduction
Polynomial Regression represents a sophisticated approach to modelling the relationship between one or more independent variables and a dependent variable, wherein this relationship is expressed as an nth degree polynomial. This technique is particularly useful for capturing the nuances of curvilinear relationships that linear regression may not adequately represent.
The core methodology for fitting Polynomial Regression models is the method of least squares. This method, grounded in the principles of the Gauss-Markov Theorem, aims to minimise the sum of the squares of the differences between observed and predicted values. By doing so, it ensures the estimation of coefficients that bring about the best possible fit to the data under the constraints of the theorem.
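Concretely, for m observations the least squares criterion picks the coefficients $\boldsymbol{\beta}$ that minimise the residual sum of squares:

$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^{m} \left(y_i - \hat{y}_i\right)^2$$

where $\hat{y}_i$ is the model's prediction for the i-th observation.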
What sets Polynomial Regression apart as a special case of Linear Regression is its ability to model data with a curvilinear relationship through the use of polynomial equations. This adaptability allows it to capture complex patterns and trends in the data, making it a powerful tool for predictive modelling where the relationship between variables is not strictly linear.
Let's take the example of a quadratic polynomial model:

$$y = a x^2 + b x + c$$

The curve is convex when $a > 0$ and concave when $a < 0$ (its second derivative is $2a$).
Polynomial Model Assumption
Assumptions underlying Polynomial Regression describe the conditions necessary for its effective application and provide a framework for understanding how the dependent variable's behaviour is influenced by one or more independent variables. Key assumptions include:
The behaviour of the dependent variable can be described by a polynomial (additive) relationship between it and the independent variables.
The errors are independent of one another.
The errors are normally distributed with mean zero.
The errors have constant variance (homoscedasticity).
Linear Interpolation
Linear interpolation estimates an unknown value by fitting a straight line between two known data points.
For two known points $(x_1, y_1)$ and $(x_2, y_2)$, the value of the dependent variable at a point $x$ between them is:

$$y = y_1 + (y_2 - y_1)\,\frac{x - x_1}{x_2 - x_1}$$
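As a quick illustration with made-up points (not from the article's dataset), NumPy's np.interp applies exactly this formula:

import numpy as np
# two known points (x1, y1) = (1, 2) and (x2, y2) = (3, 10); interpolate at x = 2
x1, y1 = 1.0, 2.0
x2, y2 = 3.0, 10.0
x = 2.0
y_manual = y1 + (y2 - y1) * (x - x1) / (x2 - x1)  # formula above -> 6.0
y_numpy = np.interp(x, [x1, x2], [y1, y2])        # same result via np.interp
print(y_manual, y_numpy)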
Matrix Representation of Polynomial Regression Equations
In polynomial regression, we approximate the relationship between the independent variable x and the dependent variable y using a polynomial function of degree n.
The general form of a polynomial model of degree n is:

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_n x^n + \varepsilon$$

Matrix Notation: stacking the m observations gives

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$

where $\mathbf{X}$ is the design (Vandermonde) matrix whose i-th row is $[1, x_i, x_i^2, \dots, x_i^n]$.

Least Squares Solution: the coefficient vector that minimises the residual sum of squares is

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$$

Matrix Operations: in practice this means forming $\mathbf{X}^\top\mathbf{X}$ and $\mathbf{X}^\top\mathbf{y}$ and solving the resulting normal equations for $\hat{\boldsymbol{\beta}}$.
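A minimal NumPy sketch of these operations, using made-up sample data (the scikit-learn code later in the article achieves the same thing with PolynomialFeatures and LinearRegression):

import numpy as np
# made-up sample roughly following y = 1 + 2x + 0.5x^2
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 3.4, 7.2, 11.9, 17.5])
n = 2                                             # polynomial degree
X = np.vander(x, n + 1, increasing=True)          # design matrix with columns [1, x, x^2]
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # least squares solution
print(beta_hat)                                   # estimated [beta_0, beta_1, beta_2]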
Why do we need Polynomial Regression?
When we develop a predictive model and observe its performance, we might find that it significantly underperforms. This discrepancy often becomes apparent when comparing the actual values to the predicted outcomes, especially if we notice that the actual data points form a curved pattern, while our prediction model—assuming it's linear—generates a straight line that fails to approximate the central tendency of these points accurately. This misalignment signals the need for a more sophisticated approach, and that's where polynomial regression comes into play.
Polynomial regression extends beyond the limitations of linear regression by accommodating the curved relationships between the independent (predictors) and dependent (outcome) variables. Unlike linear regression, which presupposes a straight-line relationship between variables, polynomial regression can model data with inherent curvature. This capability makes it a powerful alternative when linear models fall short, unable to capture the complex dynamics within the dataset.
The fundamental advantage of polynomial regression lies in its flexibility to fit a wide range of curvatures, thereby providing a more accurate representation of the data. This method does not require the relationship between variables to be linear, marking a critical distinction from linear regression. Consequently, polynomial regression is particularly valuable in scenarios where the data points deviate from a linear pattern, offering a clearer, more precise modeling of intricate relationships.
Feature Engineering
Loading the libraries and our advertising dataset.
import pandas as pd
import numpy as np
import seaborn as sns
import scipy.stats as stats
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
df = pd.read_csv('advertising.csv')
df[:5]
We are using the same dataset that we used earlier for the linear regression model.
#feature selection
X = df['TV'].values.reshape(-1, 1)
X[:5]
#target
y = df['Sales'].values
y[:5]
# splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Now we have selected TV as the feature and Sales as the target from our columns. Next, we split the data into training and testing sets: 80 percent for training and 20 percent for testing, with a random_state value of 42. This value is used as a seed for the random number generator and controls the shuffling applied to the data before it is split.
Model Selection
# Linear Regression
lr = LinearRegression()
lr.fit(X_train, y_train)
# predictions for plotting linear regression
X_train_range = np.arange(X_train.min(), 300, 1)[:, np.newaxis]
y_linear = lr.predict(X_train_range)
We create a prediction range from the minimum value of the training set up to 300, which is the maximum value of the X-axis scale, so predictions cover the full span of the plot.
pr_quad = LinearRegression()
pr_cubic = LinearRegression()
pr_quartic = LinearRegression()
pr_quintic = LinearRegression()
quadratic = PolynomialFeatures(degree=2)
cubic = PolynomialFeatures(degree=3)
quartic = PolynomialFeatures(degree=4)
quintic = PolynomialFeatures(degree=5)
X_train_quad = quadratic.fit_transform(X_train)
X_train_cubic = cubic.fit_transform(X_train)
X_train_quartic = quartic.fit_transform(X_train)
X_train_quintic = quintic.fit_transform(X_train)
X_test_quad = quadratic.transform(X_test)
X_test_cubic = cubic.transform(X_test)
X_test_quartic = quartic.transform(X_test)
X_test_quintic = quintic.transform(X_test)
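# Quick sanity check (illustrative, not part of the original pipeline): for a
# single made-up value x = 3, the degree-2 expansion returns the bias column,
# x and x^2, i.e. [[1., 3., 9.]].
demo = PolynomialFeatures(degree=2)
print(demo.fit_transform(np.array([[3.0]])))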
pr_quad.fit(X_train_quad, y_train)
pr_cubic.fit(X_train_cubic, y_train)
pr_quartic.fit(X_train_quartic, y_train)
pr_quintic.fit(X_train_quintic, y_train)
y_quad_train = pr_quad.predict(X_train_quad)
y_cubic_train = pr_cubic.predict(X_train_cubic)
y_quartic_train = pr_quartic.predict(X_train_quartic)
y_quintic_train = pr_quintic.predict(X_train_quintic)
plt.figure(figsize=(10, 5))
plt.scatter(X_train, y_train, label="Training data points", color='black')
plt.scatter(X_test, y_test, label="Testing data points", color='red')
X_train_range = np.arange(X_train.min(), X_train.max(), 1)[:, np.newaxis]
y_quad_range = pr_quad.predict(quadratic.fit_transform(X_train_range))
y_cubic_range = pr_cubic.predict(cubic.fit_transform(X_train_range))
y_quartic_range = pr_quartic.predict(quartic.fit_transform(X_train_range))
y_quintic_range = pr_quintic.predict(quintic.fit_transform(X_train_range))
plt.plot(X_train_range, y_quad_range, label="Quadratic fit", color='red', lw=2)
plt.plot(X_train_range, y_cubic_range, label="Cubic fit", color='blue', lw=2)
plt.plot(X_train_range, y_quartic_range, label="Quartic fit", color='green', lw=2)
plt.plot(X_train_range, y_quintic_range, label="Quintic fit", color='orange', lw=2)
plt.xlabel("TV")
plt.ylabel("Sales")
plt.legend()
plt.title("Polynomial Regression")
plt.grid(True, ls='--', alpha=0.5)
plt.show()
Although the quartic and quintic models may visually appear to provide a better fit to the data, we must be careful about overfitting. Further statistical analysis, including cross-validation and performance metrics (such as R-squared, RMSE and MAE), is necessary to choose the model that provides the best trade-off between bias and variance, ensuring both accuracy and generalisability.
Model Evaluation
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
Now we are going to evaluate the models, that is, measure how accurate and how robust our predictions are.
r2_quad = r2_score(y_train, y_quad_train)
mse_quad = mean_squared_error(y_train, y_quad_train)
rmse_quad = np.sqrt(mse_quad)
mae_quad = mean_absolute_error(y_train, y_quad_train)
r2_cubic = r2_score(y_train, y_cubic_train)
mse_cubic = mean_squared_error(y_train, y_cubic_train)
rmse_cubic = np.sqrt(mse_cubic)
mae_cubic = mean_absolute_error(y_train, y_cubic_train)
r2_quartic = r2_score(y_train, y_quartic_train)
mse_quartic = mean_squared_error(y_train, y_quartic_train)
rmse_quartic = np.sqrt(mse_quartic)
mae_quartic = mean_absolute_error(y_train, y_quartic_train)
r2_quintic = r2_score(y_train, y_quintic_train)
mse_quintic = mean_squared_error(y_train, y_quintic_train)
rmse_quintic = np.sqrt(mse_quintic)
mae_quintic = mean_absolute_error(y_train, y_quintic_train)
print("Polynomial Regression Metrics:")
print("Quadratic:")
print(f"R-squared: {r2_quad:.4f}")
print(f"MSE: {mse_quad:.4f}")
print(f"RMSE: {rmse_quad:.4f}")
print(f"MAE: {mae_quad:.4f}")
print("\nCubic:")
print(f"R-squared: {r2_cubic:.4f}")
print(f"MSE: {mse_cubic:.4f}")
print(f"RMSE: {rmse_cubic:.4f}")
print(f"MAE: {mae_cubic:.4f}")
print("\nQuartic:")
print(f"R-squared: {r2_quartic:.4f}")
print(f"MSE: {mse_quartic:.4f}")
print(f"RMSE: {rmse_quartic:.4f}")
print(f"MAE: {mae_quartic:.4f}")
print("\nQuintic:")
print(f"R-squared: {r2_quintic:.4f}")
print(f"MSE: {mse_quintic:.4f}")
print(f"RMSE: {rmse_quintic:.4f}")
print(f"MAE: {mae_quintic:.4f}")
Cross Validation
5-fold cross-validation follows the same principles as general k-fold cross-validation, with k fixed at 5: the training data are split into five folds, and each fold is held out once for validation while the model is fitted on the remaining four.
k=5 cross-validation offers a robust estimate of model performance by addressing variability in evaluation caused by random data partitioning. It ensures that each data point is used for both training and testing, reducing bias in the evaluation process.
Compared to leave-one-out cross-validation (LOOCV), which can be computationally expensive for large datasets, 5-fold cross-validation strikes a balance between computational cost and reliability of the estimation.
from tabulate import tabulate
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

model_names = ["Quadratic", "Cubic", "Quartic", "Quintic"]
degrees = [2, 3, 4, 5]
results = []
for name, degree in zip(model_names, degrees):
    # wrap the polynomial expansion and the linear model in a pipeline so the
    # feature transform is refitted inside each cross-validation fold
    pipeline = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='r2')
    results.append([name] + [round(score, 4) for score in cv_scores])
headers = ["Model", "CV 1", "CV 2", "CV 3", "CV 4", "CV 5"]
print(tabulate(results, headers=headers))
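As an optional follow-up (not in the original post), each model's five scores can be condensed into a mean and standard deviation, which makes the comparison easier to read:

for row in results:
    scores = np.array(row[1:], dtype=float)
    print(f"{row[0]}: mean R^2 = {scores.mean():.4f} (std = {scores.std():.4f})")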
Residual Analysis
residuals_linear = y_train - lr.predict(X_train)
residuals_quad = y_train - y_quad_train
residuals_cubic = y_train - y_cubic_train
residuals_quartic = y_train - y_quartic_train
residuals_quintic = y_train - y_quintic_train
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(10, 10))
ax1.scatter(y_quad_train, residuals_quad, color='black', alpha=0.5)
ax1.set_xlabel('Fitted values')
ax1.set_ylabel('Residuals')
ax1.set_title('Residual Plot for Quadratic Regression')
ax1.axhline(y=0, color='red', linestyle='--')
ax1.grid(True, ls='--', alpha=0.5)
ax2.scatter(y_cubic_train, residuals_cubic, color='blue', alpha=0.5)
ax2.set_xlabel('Fitted values')
ax2.set_ylabel('Residuals')
ax2.set_title('Residual Plot for Cubic Regression')
ax2.axhline(y=0, color='red', linestyle='--')
ax2.grid(True, ls='--', alpha=0.5)
ax3.scatter(pr_quartic.predict(X_train_quartic), residuals_quartic, color='green', alpha=0.5)
ax3.set_xlabel('Fitted values')
ax3.set_ylabel('Residuals')
ax3.set_title('Residual Plot for Quartic Regression')
ax3.axhline(y=0, color='red', linestyle='--')
ax3.grid(True, ls='--', alpha=0.5)
ax4.scatter(pr_quintic.predict(X_train_quintic), residuals_quintic, color='orange', alpha=0.5)
ax4.set_xlabel('Fitted values')
ax4.set_ylabel('Residuals')
ax4.set_title('Residual Plot for Quintic Regression')
ax4.axhline(y=0, color='red', linestyle='--')
ax4.grid(True, ls='--', alpha=0.5)
plt.tight_layout()
plt.show()
Residual Distribution Plots
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(10, 8))
sns.histplot(residuals_quad, kde=True, color='black', alpha=0.5, ax=ax1)
ax1.set_title('Histogram Plot of Residuals for Quadratic Regression')
ax1.set_xlabel('Residuals')
ax1.set_ylabel('Frequency')
ax1.grid(True, ls='--', alpha=0.5)
sns.histplot(residuals_cubic, kde=True, color='blue', alpha=0.5, ax=ax2)
ax2.set_title('Histogram Plot of Residuals for Cubic Regression')
ax2.set_xlabel('Residuals')
ax2.set_ylabel('Frequency')
ax2.grid(True, ls='--', alpha=0.5)
sns.histplot(residuals_quartic, kde=True, color='green', alpha=0.5, ax=ax3)
ax3.set_title('Histogram Plot of Residuals for Quartic Regression')
ax3.set_xlabel('Residuals')
ax3.set_ylabel('Frequency')
ax3.grid(True, ls='--', alpha=0.5)
sns.histplot(residuals_quintic, kde=True, color='orange', alpha=0.5, ax=ax4)
ax4.set_title('Histogram Plot of Residuals for Quintic Regression')
ax4.set_xlabel('Residuals')
ax4.set_ylabel('Frequency')
ax4.grid(True, ls='--', alpha=0.5)
plt.tight_layout()
plt.show()
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(8, 8))
stats.probplot(residuals_quad, dist="norm", plot=ax1)
ax1.set_title('Q-Q Plot of Residuals for Quadratic Regression')
ax1.set_xlabel('Theoretical Quantiles')
ax1.set_ylabel('Sample Quantiles')
ax1.grid(True, ls='--', alpha=0.5)
stats.probplot(residuals_cubic, dist="norm", plot=ax2)
ax2.set_title('Q-Q Plot of Residuals for Cubic Regression')
ax2.set_xlabel('Theoretical Quantiles')
ax2.set_ylabel('Sample Quantiles')
ax2.grid(True, ls='--', alpha=0.5)
stats.probplot(residuals_quartic, dist="norm", plot=ax3)
ax3.set_title('Q-Q Plot of Residuals for Quartic Regression')
ax3.set_xlabel('Theoretical Quantiles')
ax3.set_ylabel('Sample Quantiles')
ax3.grid(True, ls='--', alpha=0.5)
stats.probplot(residuals_quintic, dist="norm", plot=ax4)
ax4.set_title('Q-Q Plot of Residuals for Quintic Regression')
ax4.set_xlabel('Theoretical Quantiles')
ax4.set_ylabel('Sample Quantiles')
ax4.grid(True, ls='--', alpha=0.5)
plt.tight_layout()
plt.show()
All Regressions: The Q-Q plots should ideally show all points falling on the diagonal line, which would indicate that the residuals are normally distributed. In all polynomial models, most points follow the line, but there are deviations at the ends, suggesting the presence of outliers or heavy tails in the distribution of residuals.
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(8, 10))
sns.boxplot(y=residuals_quad, color='black', ax=ax1)
ax1.set_title('Box Plot of Residuals for Quadratic Regression')
ax1.set_xlabel('Residuals')
ax1.grid(True)
sns.boxplot(y=residuals_cubic, color='blue', ax=ax2)
ax2.set_title('Box Plot of Residuals for Cubic Regression')
ax2.set_xlabel('Residuals')
ax2.grid(True)
sns.boxplot(y=residuals_quartic, color='green', ax=ax3)
ax3.set_title('Box Plot of Residuals for Quartic Regression')
ax3.set_xlabel('Residuals')
ax3.grid(True)
sns.boxplot(y=residuals_quintic, color='orange', ax=ax4)
ax4.set_title('Box Plot of Residuals for Quintic Regression')
ax4.set_xlabel('Residuals')
ax4.grid(True)
plt.tight_layout()
plt.show()
Quadratic to Quintic Regressions: All box plots show a relatively symmetric distribution of residuals around the median. However, there are outliers present in all models, as evidenced by the points beyond the whiskers. The box plot for the quintic regression seems to indicate a larger spread of residuals compared to others.
Conclusion and Recommendations
References
Polynomial models: https://www.slideserve.com/ifama/polynomial-models
PolynomialFeatures (scikit-learn documentation): https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html