A Practical Example for Improving ML Models with Multiple Linear Regression
Rany ElHousieny, PhD
This article is a continuation of the following article:
Let's create a new Multiple Linear Regression model using additional relevant features from the dataset: 'SqFt', 'Bedrooms', 'Bathrooms', 'Offers', 'Brick', and 'Neighborhood'. We will also evaluate the model's performance in Python and discuss the evaluation metrics.
Here's how to proceed step by step:
# Load the dataset
# Step 1: Download the CSV file from GitHub (raw URL)
csv_url = "https://raw.githubusercontent.com/ywchiu/riii/master/data/house-prices.csv"
# Step 2: Load the CSV data into a Pandas DataFrame
data = pd.read_csv(csv_url)
# Explore the loaded data (e.g., check the first few rows)
data.head()
# Convert categorical variables ('Brick' and 'Neighborhood') to numerical using one-hot encoding
data = pd.get_dummies(data, columns=['Brick', 'Neighborhood'], drop_first=True)
# Define the independent variables (features)
X = data[['SqFt', 'Bedrooms', 'Bathrooms', 'Offers', 'Brick_Yes', 'Neighborhood_North', 'Neighborhood_West']]
# Define the dependent variable
Y = data['Price']
# Fit a Multiple Linear Regression model
model = LinearRegression()
model.fit(X, Y)
# Predict house prices using the model
predicted_prices = model.predict(X)
# Calculate R-squared (R^2)
r_squared = r2_score(Y, predicted_prices)
# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(Y, predicted_prices)
Here's a Python example that demonstrates these steps:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
# Load the dataset
# Step 1: Download the CSV file from GitHub (raw URL)
csv_url = "https://raw.githubusercontent.com/ywchiu/riii/master/data/house-prices.csv"
# Step 2: Load the CSV data into a Pandas DataFrame
data = pd.read_csv(csv_url)
# Explore the loaded data (e.g., check the first few rows)
data.head()
# Convert categorical variables ('Brick' and 'Neighborhood') to numerical using one-hot encoding
data = pd.get_dummies(data, columns=['Brick', 'Neighborhood'], drop_first=True)
# Define the independent variables (features)
X = data[['SqFt', 'Bedrooms', 'Bathrooms', 'Offers', 'Brick_Yes', 'Neighborhood_North', 'Neighborhood_West']]
# Define the dependent variable
Y = data['Price']
# Fit a Multiple Linear Regression model
model = LinearRegression()
model.fit(X, Y)
# Predict house prices using the model
predicted_prices = model.predict(X)
# Calculate R-squared (R^2)
r_squared = r2_score(Y, predicted_prices)
# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(Y, predicted_prices)
# Print model coefficients
coefficients = model.coef_
intercept = model.intercept_
print("Model Coefficients:")
for feature, coef in zip(X.columns, coefficients):
    print(f"{feature}: {coef:.2f}")
print(f"\nIntercept: {intercept:.2f}")
print(f"R-squared (R^2): {r_squared:.2f}")
print(f"Mean Squared Error (MSE): {mse:.2f}")
Output:
Model Coefficients:
SqFt: 52.99
Bedrooms: 4246.79
Bathrooms: 7883.28
Offers: -8267.49
Brick_Yes: 17297.35
Neighborhood_North: 1560.58
Neighborhood_West: 22241.62
Intercept: 598.92
R-squared (R^2): 0.87
Mean Squared Error (MSE): 94105539.95
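In this code, the categorical columns 'Brick' and 'Neighborhood' are one-hot encoded; because drop_first=True, one category of each is dropped and acts as the baseline. The new model considers multiple features, so it can potentially predict house prices better by combining several factors.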
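The one-hot encoding step can be looked at in isolation. The following is a minimal sketch on a toy DataFrame whose rows are made up for illustration; the category names 'No' and 'East' are assumed here as the baselines implied by the column names 'Brick_Yes', 'Neighborhood_North', and 'Neighborhood_West' used in the article's code.

```python
import pandas as pd

# Hypothetical toy rows; the values are invented for illustration only.
toy = pd.DataFrame({
    "SqFt": [1800, 2100, 1950, 2300],
    "Brick": ["No", "Yes", "No", "Yes"],
    "Neighborhood": ["East", "North", "West", "East"],
})

# drop_first=True drops the first category (in sorted order) of each column,
# so 'No' and 'East' become implicit baselines with no column of their own.
encoded = pd.get_dummies(toy, columns=["Brick", "Neighborhood"], drop_first=True)

print(list(encoded.columns))
print(encoded.astype(int))  # show the indicator columns as 0/1
```

A row with zeros in both neighborhood columns belongs to the baseline neighborhood, which is why the model needs only two neighborhood coefficients for three neighborhoods.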
Insights:
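The results from the Multiple Linear Regression model indicate a significant improvement in model performance compared to the previous simple linear regression. The R^2 of 0.87 means the model explains about 87% of the variance in Price on the data it was fitted to, and the MSE of 94,105,539.95 corresponds to a root mean squared error of roughly 9,700 in price units. Each coefficient is the change in predicted price associated with a one-unit change in that feature while the others are held fixed: about 53 per additional square foot, about 4,247 per bedroom, about 7,883 per bathroom, and about 8,267 less per additional offer. Brick houses are predicted to be about 17,297 higher than non-brick ones, and houses in the West neighborhood about 22,242 higher than those in the baseline neighborhood, while the North neighborhood differs from the baseline by only about 1,561.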
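To make the coefficients concrete, the sketch below plugs one house into the fitted equation by hand, using the coefficients and intercept copied from the printed output. The house itself is hypothetical, and the helper name predict_price is introduced only for this illustration. The same block converts the reported MSE into an RMSE, which is in the same units as Price.

```python
import math

# Coefficients and intercept copied from the model output printed above.
coefficients = {
    "SqFt": 52.99,
    "Bedrooms": 4246.79,
    "Bathrooms": 7883.28,
    "Offers": -8267.49,
    "Brick_Yes": 17297.35,
    "Neighborhood_North": 1560.58,
    "Neighborhood_West": 22241.62,
}
intercept = 598.92

# Hypothetical house (not a row from the dataset): 2,000 sq ft, 3 bedrooms,
# 2 bathrooms, 2 offers, brick, located in the West neighborhood.
house = {
    "SqFt": 2000,
    "Bedrooms": 3,
    "Bathrooms": 2,
    "Offers": 2,
    "Brick_Yes": 1,
    "Neighborhood_North": 0,
    "Neighborhood_West": 1,
}

def predict_price(features):
    """Linear prediction: intercept plus the sum of coefficient * feature value."""
    return intercept + sum(coefficients[name] * value for name, value in features.items())

predicted = predict_price(house)
rmse = math.sqrt(94105539.95)  # square root of the reported MSE

print(f"Predicted price: {predicted:,.2f}")
print(f"RMSE: {rmse:,.2f}")
```

Because the coefficients were rounded to two decimals when printed, this hand calculation can differ slightly from what model.predict would return for the same house.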
Insights and Recommendations:
In summary, the Multiple Linear Regression model with the selected features shows promising performance, explaining about 87% of the variance in house prices (R^2 = 0.87). Because these metrics were computed on the same data used to fit the model, further refinement and validation on held-out data are needed before relying on it for practical use in the real estate market.
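One way to carry out the validation mentioned above is K-fold cross-validation, which scores the model only on rows it was not fitted on. The following is a sketch, not part of the original analysis: the helper cross_validated_r2 is a hypothetical name, and the data is synthetic so that the block runs without downloading the CSV.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

def cross_validated_r2(X, y, n_splits=5, seed=0):
    """Return the R^2 score on each held-out fold of a shuffled K-fold split."""
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return cross_val_score(LinearRegression(), X, y, cv=folds, scoring="r2")

# Synthetic stand-in data (an assumption for this sketch, not the house-price data):
# three features with a known linear relationship plus a little noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=120)

scores = cross_validated_r2(X_demo, y_demo)
print("R^2 per fold:", np.round(scores, 3))
print("Mean R^2:", round(float(scores.mean()), 3))
```

With the article's variables, the call would be cross_validated_r2(X, Y). A mean held-out R^2 noticeably below the in-sample 0.87 would suggest the model is overfitting.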
To learn more about model evaluation, please refer to the following article: