A Practical Example for Improving ML Models with Multiple Linear Regression

This article is a continuation of the previous article, where we built a Simple Linear Regression model to predict house prices.

Let's create a new Multiple Linear Regression model using additional relevant features from the dataset, including 'SqFt,' 'Bedrooms,' 'Bathrooms,' 'Offers,' 'Brick,' and 'Neighborhood.' We will also evaluate the model's performance using Python and discuss performance evaluation metrics.

Here's how to proceed step by step:

  • Data Preprocessing: Load the dataset into a Pandas DataFrame, then convert categorical variables like 'Brick' and 'Neighborhood' into numerical representations (e.g., one-hot encoding) so they can be included in the model.

import pandas as pd

# Load the dataset
# Step 1: Point to the raw CSV file on GitHub
csv_url = "https://raw.githubusercontent.com/ywchiu/riii/master/data/house-prices.csv"

# Step 2: Load the CSV data into a Pandas DataFrame
data = pd.read_csv(csv_url)

# Explore the loaded data (e.g., check the first few rows)
data.head()


  • Feature Selection: Choose 'SqFt,' 'Bedrooms,' 'Bathrooms,' 'Offers,' 'Brick,' and 'Neighborhood' as independent variables (features) for the Multiple Linear Regression model.

# Convert categorical variables ('Brick' and 'Neighborhood') to numerical using one-hot encoding
data = pd.get_dummies(data, columns=['Brick', 'Neighborhood'], drop_first=True)

# Define the independent variables (features)
X = data[['SqFt', 'Bedrooms', 'Bathrooms', 'Offers', 'Brick_Yes', 'Neighborhood_North', 'Neighborhood_West']]        


  • Model Building: Fit a Multiple Linear Regression model using the selected features to predict 'Price' (house prices).

from sklearn.linear_model import LinearRegression

# Define the dependent variable
Y = data['Price']

# Fit a Multiple Linear Regression model
model = LinearRegression()
model.fit(X, Y)

# Predict house prices using the model
predicted_prices = model.predict(X)        

  • Performance Evaluation: Use performance evaluation metrics such as R^2 (Coefficient of Determination) and Mean Squared Error (MSE) to assess the model's accuracy.

from sklearn.metrics import r2_score, mean_squared_error

# Calculate R-squared (R^2)
r_squared = r2_score(Y, predicted_prices)

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(Y, predicted_prices)
        

Here's a Python example that demonstrates these steps:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

# Load the dataset
# Step 1: Point to the raw CSV file on GitHub
csv_url = "https://raw.githubusercontent.com/ywchiu/riii/master/data/house-prices.csv"

# Step 2: Load the CSV data into a Pandas DataFrame
data = pd.read_csv(csv_url)

# Explore the loaded data (e.g., check the first few rows)
data.head()



# Convert categorical variables ('Brick' and 'Neighborhood') to numerical using one-hot encoding
data = pd.get_dummies(data, columns=['Brick', 'Neighborhood'], drop_first=True)

# Define the independent variables (features)
X = data[['SqFt', 'Bedrooms', 'Bathrooms', 'Offers', 'Brick_Yes', 'Neighborhood_North', 'Neighborhood_West']]

# Define the dependent variable
Y = data['Price']

# Fit a Multiple Linear Regression model
model = LinearRegression()
model.fit(X, Y)

# Predict house prices using the model
predicted_prices = model.predict(X)

# Calculate R-squared (R^2)
r_squared = r2_score(Y, predicted_prices)

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(Y, predicted_prices)

# Print model coefficients
coefficients = model.coef_
intercept = model.intercept_

print("Model Coefficients:")
for feature, coef in zip(X.columns, coefficients):
    print(f"{feature}: {coef:.2f}")

print(f"\nIntercept: {intercept:.2f}")
print(f"R-squared (R^2): {r_squared:.2f}")
print(f"Mean Squared Error (MSE): {mse:.2f}")

        
Model Coefficients:
SqFt: 52.99
Bedrooms: 4246.79
Bathrooms: 7883.28
Offers: -8267.49
Brick_Yes: 17297.35
Neighborhood_North: 1560.58
Neighborhood_West: 22241.62

Intercept: 598.92
R-squared (R^2): 0.87
Mean Squared Error (MSE): 94105539.95        

In this code:

  • We load the dataset and convert categorical variables ('Brick' and 'Neighborhood') into numerical form using one-hot encoding.
  • We select the independent variables (features) 'SqFt,' 'Bedrooms,' 'Bathrooms,' 'Offers,' 'Brick_Yes,' 'Neighborhood_North,' and 'Neighborhood_West' for the Multiple Linear Regression model.
  • We fit the model, make predictions, and calculate R^2 and MSE as performance metrics.
  • Finally, we print the model coefficients, intercept, R^2, and MSE to evaluate the model's performance.

This new model considers multiple features, allowing it to potentially provide better predictions of house prices based on a combination of factors.


Insights:

The results from the Multiple Linear Regression model indicate a significant improvement in model performance compared to the previous simple linear regression. Let's analyze the coefficients, R^2 value, and MSE and provide insights and recommendations:

  1. Model Coefficients: Each coefficient represents the estimated change in house price for a one-unit change in the corresponding feature, holding all other features constant. Insights from the coefficients:
      • 'SqFt' has a positive coefficient of approximately 52.99, suggesting that each additional square foot of living space increases the estimated house price by $52.99.
      • 'Bedrooms' has a positive coefficient of approximately 4,246.79, indicating that each additional bedroom is associated with an estimated increase of $4,246.79 in house price.
      • 'Bathrooms' has a positive coefficient of approximately 7,883.28, implying that each additional bathroom is associated with an estimated increase of $7,883.28 in house price.
      • 'Offers' has a negative coefficient of approximately -8,267.49, suggesting that each additional offer received is associated with a decrease of approximately $8,267.49 in house price.
      • 'Brick_Yes' (presence of brick construction) has a positive coefficient of approximately 17,297.35, indicating that brick houses are estimated to sell for $17,297.35 more than comparable non-brick houses.
      • 'Neighborhood_North' and 'Neighborhood_West' are dummy variables for neighborhood location. 'Neighborhood_West' has the largest positive coefficient, indicating that houses in the 'West' neighborhood are estimated to have higher prices than those in the reference neighborhood.
  2. R^2 (Coefficient of Determination): The R^2 value of approximately 0.87 indicates that the model explains roughly 87% of the variance in house prices using the selected features. This is a significant improvement over the previous model, suggesting that the model fits the data well and captures a substantial portion of the variability.
  3. Mean Squared Error (MSE): The MSE of approximately 94,105,539.95 is the average squared difference between the model's predicted prices and the actual prices, so its units are squared dollars. Taking the square root gives a Root Mean Squared Error (RMSE) of roughly $9,700, meaning the predictions are off by about $9,700 on average. Whether that is acceptable depends on the price range in the dataset and the application (see the sketch after this list).
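
To make these numbers more concrete, here is a minimal sketch that continues from the full example above; it assumes model, X, Y, predicted_prices, and mse are still in scope. It converts the MSE into a dollar-scale RMSE, re-derives R^2 by hand, and rebuilds a single prediction from the intercept and coefficients.

import numpy as np

# Root Mean Squared Error: the square root of MSE, back in dollar units
rmse = np.sqrt(mse)
print(f"RMSE: {rmse:,.2f}")  # roughly 9,700 for the MSE reported above

# Verify R^2 by hand: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((Y - predicted_prices) ** 2)
ss_tot = np.sum((Y - Y.mean()) ** 2)
print(f"R^2 (manual): {1 - ss_res / ss_tot:.2f}")

# Rebuild the first prediction from the intercept and coefficients
first_house = X.iloc[0].astype(float)
manual_prediction = model.intercept_ + np.dot(model.coef_, first_house)
print(f"Manual prediction for the first house: {manual_prediction:,.2f}")
print(f"model.predict for the first house:     {predicted_prices[0]:,.2f}")

The RMSE is usually the more intuitive headline number because it is expressed in the same units as the target variable.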

Insights and Recommendations:

  1. Feature Importance: The coefficients reveal the importance of each feature in influencing house prices. Bedrooms, bathrooms, square footage, and the presence of brick construction have substantial positive effects on house prices, while the number of offers has a negative effect.
  2. Model Fit: The high R^2 value suggests that the model fits the data well and explains a significant portion of the variability in house prices. This is a positive indication of the model's performance.
  3. Outliers: Investigate potential outliers in the dataset that might be affecting model performance. Outliers can have a disproportionate impact on coefficients and predictions.
  4. Further Exploration: Continue exploring and refining the model by considering additional features, interactions between features, or polynomial regression if there are nonlinear relationships (the sketch after this list includes a simple polynomial pipeline).
  5. Validation: Validate the model's performance on a separate test dataset to ensure it generalizes well to unseen data (see the train/test split in the sketch after this list).
  6. Business Application: Consider how the model can be applied to real estate business operations, such as pricing houses accurately and providing insights to clients and buyers.
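
Recommendations 4 and 5 above can be sketched in a few lines of code. The example below is illustrative rather than a tuned model: it reuses the X and Y defined earlier, and the degree-2 polynomial pipeline is simply one assumption about how nonlinear relationships might be probed.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hold out 20% of the rows as an unseen test set
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Baseline: the same multiple linear regression, fit on the training split only
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)
linear_pred = linear_model.predict(X_test)
print(f"Linear model  - test R^2: {r2_score(y_test, linear_pred):.2f}, "
      f"test RMSE: {np.sqrt(mean_squared_error(y_test, linear_pred)):,.2f}")

# Exploration: add degree-2 polynomial and interaction terms before the regression
poly_model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                           LinearRegression())
poly_model.fit(X_train, y_train)
poly_pred = poly_model.predict(X_test)
print(f"Poly pipeline - test R^2: {r2_score(y_test, poly_pred):.2f}, "
      f"test RMSE: {np.sqrt(mean_squared_error(y_test, poly_pred)):,.2f}")

If the held-out R^2 drops well below the in-sample 0.87, the model is overfitting; with a dataset this small, k-fold cross-validation would give a more stable estimate than a single train/test split.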

In summary, the Multiple Linear Regression model with the selected features shows promising performance, explaining a significant portion of the variance in house prices. Further refinements and validations can enhance its accuracy for practical use in the real estate market.

To learn more about model evaluation, please refer to the following article:

