Evaluating Linear Regression Models
Rany ElHousieny, PhD
Generative AI ENGINEERING MANAGER | ex-Microsoft | AI Solutions Architect | Generative AI & NLP Expert | Proven Leader in AI-Driven Innovation | Former Microsoft Research & Azure AI | Software Engineering Manager
Linear regression is a powerful and commonly used technique in machine learning and statistics. It helps us understand the relationship between a dependent variable (the one we want to predict) and one or more independent variables (the ones we use for prediction). But how do we know if our linear regression model is doing well? In this article, we'll explore how to measure the performance of a linear regression model with Python using a practical example.
Understanding the Data
For our example, we will work with a dataset that contains information about houses, such as square footage, the number of bedrooms and bathrooms, the number of offers made, and other factors that can influence the price of a house. You can load the dataset directly from the URL shown in the code below. Let's load the data and explore it a bit:
import pandas as pd
# Load the dataset
csv_url = "https://raw.githubusercontent.com/ywchiu/riii/master/data/house-prices.csv"
data = pd.read_csv(csv_url)
# Let's take a look at the first few rows of the data
data.head()
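Before modeling, it can also help to confirm the column types and check for missing values. A quick way to do that, assuming the CSV loaded as expected:
# Show column dtypes, non-null counts, and memory usage
data.info()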
Preparing the Data
Before we can build a linear regression model, we need to prepare the data. This involves converting categorical variables into a numerical format, and defining our independent (features) and dependent (target) variables.
One-Hot Encoding for Categorical Variables
Categorical variables like 'Brick' and 'Neighborhood' need to be converted into a numerical format, most commonly with one-hot encoding, for two main reasons: linear regression can only operate on numeric inputs, and one-hot encoding represents categories without imposing an artificial order on them.
Keep in mind that one-hot encoding, which creates binary (0 or 1) columns for each category, is just one method to convert categorical variables. Other techniques, such as label encoding or ordinal encoding, may be used depending on the nature of the categorical data and the specific problem you are trying to solve.
Here's how to apply one-hot encoding with pandas:
# Convert categorical variables to numerical using one-hot encoding
data = pd.get_dummies(data, columns=['Brick', 'Neighborhood'], drop_first=True)
data.head()
Now, our dataset contains numerical representations of these categorical variables.
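For contrast, here is a minimal sketch of ordinal encoding with scikit-learn, applied to a made-up toy column (our 'Brick' and 'Neighborhood' categories have no natural order, so one-hot encoding remains the better choice for this dataset):
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd
# Toy example: ordinal encoding maps each category to a single integer code
toy = pd.DataFrame({'Brick': ['Yes', 'No', 'Yes']})
encoder = OrdinalEncoder()
print(encoder.fit_transform(toy[['Brick']]))  # 'No' -> 0.0, 'Yes' -> 1.0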
Defining Features and Target
We need to define our independent variables (features) and our dependent variable (target). In this case, we want to predict the house 'Price' based on other features. Here's how we do it:
# Define the independent variables (features)
X = data[['SqFt', 'Bedrooms', 'Bathrooms', 'Offers', 'Brick_Yes', 'Neighborhood_North', 'Neighborhood_West']]
# Define the dependent variable
Y = data['Price']
Now we're ready to build and evaluate our linear regression model.
Building and Evaluating the Model
We'll use Python's scikit-learn library to build and evaluate our linear regression model. Here's the code:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
# Create a Linear Regression model
model = LinearRegression()
# Fit the model to our data
model.fit(X, Y)
# Predict house prices using the model
predicted_prices = model.predict(X)
# Calculate R-squared (R^2)
r_squared = r2_score(Y, predicted_prices)
# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(Y, predicted_prices)
# Print model coefficients
coefficients = model.coef_
intercept = model.intercept_
print("Model Coefficients:")
for feature, coef in zip(X.columns, coefficients):
    print(f"{feature}: {coef:.2f}")
print(f"\nIntercept: {intercept:.2f}")
print(f"R-squared (R^2): {r_squared:.2f}")
print(f"Mean Squared Error (MSE): {mse:.2f}")
Let's break down what we've done: we created a LinearRegression model, fit it to the features and target, generated predictions for every house, computed R-squared and Mean Squared Error against the actual prices, and printed the learned coefficients and intercept.
Evaluating the Model:
Measuring the performance or accuracy of a multiple regression model with multiple features involves using various evaluation metrics to assess how well the model fits the data and makes predictions. Here are some common methods to evaluate the performance of a multiple regression model:
R-squared (R^2):
R-squared measures the proportion of variance in the target that the model explains, ranging from 0 (no explanatory power) to 1 (perfect fit).
# Calculate R-squared (R^2)
r_squared = r2_score(Y, predicted_prices)
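Under the hood, r2_score compares the model's squared residuals to the total variance of the target. A minimal sketch of the same calculation, reusing the Y and predicted_prices defined above:
import numpy as np
# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((Y - predicted_prices) ** 2)
ss_tot = np.sum((Y - Y.mean()) ** 2)
print(1 - ss_res / ss_tot)  # should match r2_score(Y, predicted_prices)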
Adjusted R-squared (Adjusted R^2):
Adjusted R-squared corrects R-squared for the number of predictors, so it rewards only features that genuinely improve the fit.
# Calculate R-squared (R^2)
r_squared = model.score(X, Y)
# Calculate the total number of data points and the number of independent variables
n = len(Y) # Total number of data points
k = X.shape[1] # Number of independent variables
# Calculate Adjusted R-squared
adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)
print("R-squared:", r_squared)
print("Adjusted R-squared:", adjusted_r_squared)
R-squared: 0.8686210289688724
Adjusted R-squared: 0.8609572556587233
This code prints the R-squared value from scikit-learn's LinearRegression model alongside the adjusted R-squared, calculated manually from that R-squared value, the number of data points, and the number of features.
Why do we use Adjusted R^2 Over R^2?
R-squared (R^2) and adjusted R-squared (adjusted R^2) are both metrics used in regression analysis to evaluate the goodness of fit of a regression model. However, they serve slightly different purposes and have some key differences: R-squared never decreases when more predictors are added, even irrelevant ones, while adjusted R-squared applies a penalty for each additional predictor and increases only when a new variable improves the model more than chance would.
In summary, R-squared and adjusted R-squared both provide information about the goodness of fit of a regression model. R-squared is straightforward and can be influenced by the number of predictors, while adjusted R-squared adjusts for model complexity by considering the number of independent variables in the model. Adjusted R-squared is often preferred when comparing models with different numbers of predictors, as it penalizes the addition of irrelevant variables that do not improve model performance.
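To see the penalty in action, here is an illustrative sketch (the 'noise' column is a made-up feature added purely for demonstration): adding a random predictor can nudge R-squared up, while adjusted R-squared typically drops.
import numpy as np
# Add a pure-noise feature that carries no real information about price
rng = np.random.default_rng(42)
X_noisy = X.copy()
X_noisy['noise'] = rng.standard_normal(len(X))
noisy_model = LinearRegression().fit(X_noisy, Y)
r2_noisy = noisy_model.score(X_noisy, Y)
n, k = len(Y), X_noisy.shape[1]
adjusted_r2_noisy = 1 - (1 - r2_noisy) * (n - 1) / (n - k - 1)
print("R-squared with noise feature:", r2_noisy)
print("Adjusted R-squared with noise feature:", adjusted_r2_noisy)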
Insights on the previous numbers:
Let's explore what this means using the results we got for the house prices model:
R-squared: 0.8686210289688724
Adjusted R-squared: 0.8609572556587233
The R-squared (R^2) and adjusted R-squared values provide important insights into the performance and validation of the linear regression model. An R-squared of about 0.87 means the model explains roughly 87% of the variation in house prices, which is a strong fit for this kind of data. The adjusted R-squared of about 0.86 is only slightly lower, which suggests that the included predictors are genuinely contributing rather than inflating the score.
Overall, these metrics show that the model performs well in explaining house prices, while leaving some room for improvement or simplification.
Other Methods of Evaluation:
Here are some other common ways to evaluate the model: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), residual analysis, and cross-validation.
When assessing the performance of a multiple regression model, it's often advisable to consider multiple metrics and techniques to gain a comprehensive understanding of how well the model fits the data and makes accurate predictions. The choice of evaluation metrics depends on the specific goals of the analysis and the characteristics of the dataset.
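As a minimal sketch of a few of these alternatives, reusing the variables defined earlier: MAE and RMSE report errors in the same units as 'Price', and cross-validation checks how the model generalizes beyond the data it was fit on.
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import cross_val_score
import numpy as np
# Mean Absolute Error: average absolute difference between actual and predicted
mae = mean_absolute_error(Y, predicted_prices)
# Root Mean Squared Error: square root of MSE, back in price units
rmse = np.sqrt(mean_squared_error(Y, predicted_prices))
# 5-fold cross-validated R^2: evaluated on held-out folds, not training data
cv_r2 = cross_val_score(LinearRegression(), X, Y, cv=5, scoring='r2')
print(f"MAE: {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"Cross-validated R^2: {cv_r2.mean():.2f}")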
Visualizing the Model
To understand how our model performs visually, we can create scatter plots to compare actual prices to predicted prices. You can use libraries like Matplotlib to create these plots.
import matplotlib.pyplot as plt
# Create a scatter plot of actual vs. predicted prices
plt.scatter(Y, predicted_prices)
plt.xlabel("Actual Prices")
plt.ylabel("Predicted Prices")
plt.title("Actual Prices vs. Predicted Prices")
plt.show()
This scatter plot helps us see how well our model's predictions align with the actual prices. Ideally, the points should fall close to the diagonal line where the predicted price equals the actual price.
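A residual plot is another common diagnostic: if the model's assumptions hold, the residuals should scatter randomly around zero with no visible pattern. A minimal sketch using the same variables:
# Plot residuals (actual minus predicted) against predicted prices
residuals = Y - predicted_prices
plt.scatter(predicted_prices, residuals)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel("Predicted Prices")
plt.ylabel("Residuals")
plt.title("Residual Plot")
plt.show()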
Conclusion
Measuring the performance of a linear regression model is crucial in understanding how well it predicts outcomes. R-squared and Mean Squared Error are two essential metrics for this purpose. By following the steps outlined in this article and using Python, you can build, evaluate, and visualize a linear regression model, making data analysis and predictions more accessible to beginners and non-technical individuals. Linear regression provides a solid foundation for understanding more advanced machine learning techniques, so mastering it is an excellent starting point for anyone interested in the field.