A Practical Example for Improving ML Models with Multiple Linear Regression

This article is a continuation of the previous article, where we built a Simple Linear Regression model to predict house prices.

Let's create a new Multiple Linear Regression model using additional relevant features from the dataset, including 'SqFt,' 'Bedrooms,' 'Bathrooms,' 'Offers,' 'Brick,' and 'Neighborhood.' We will also evaluate the model's performance using Python and discuss performance evaluation metrics.

Here's how to proceed step by step:

  • Data Preprocessing: Load the dataset into a Pandas DataFrame, then convert categorical variables like 'Brick' and 'Neighborhood' into numerical representations (e.g., one-hot encoding) so they can be included in the model.

import pandas as pd

# Load the dataset
# Step 1: Point to the raw CSV file on GitHub
csv_url = "https://raw.githubusercontent.com/ywchiu/riii/master/data/house-prices.csv"

# Step 2: Load the CSV data into a Pandas DataFrame
data = pd.read_csv(csv_url)

# Explore the loaded data (e.g., check the first few rows)
data.head()


  • Feature Selection: Choose 'SqFt,' 'Bedrooms,' 'Bathrooms,' 'Offers,' 'Brick,' and 'Neighborhood' as independent variables (features) for the Multiple Linear Regression model.

# Convert categorical variables ('Brick' and 'Neighborhood') to numerical using one-hot encoding
data = pd.get_dummies(data, columns=['Brick', 'Neighborhood'], drop_first=True)

# Define the independent variables (features)
X = data[['SqFt', 'Bedrooms', 'Bathrooms', 'Offers', 'Brick_Yes', 'Neighborhood_North', 'Neighborhood_West']]        


  • Model Building: Fit a Multiple Linear Regression model using the selected features to predict 'Price' (house prices).

from sklearn.linear_model import LinearRegression

# Define the dependent variable
Y = data['Price']

# Fit a Multiple Linear Regression model
model = LinearRegression()
model.fit(X, Y)

# Predict house prices using the model
predicted_prices = model.predict(X)        

  • Performance Evaluation: Use performance evaluation metrics such as R^2 (Coefficient of Determination) and Mean Squared Error (MSE) to assess the model's accuracy.

from sklearn.metrics import r2_score, mean_squared_error

# Calculate R-squared (R^2)
r_squared = r2_score(Y, predicted_prices)

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(Y, predicted_prices)
        

Here's a Python example that demonstrates these steps:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

# Load the dataset
# Step 1: Point to the raw CSV file on GitHub
csv_url = "https://raw.githubusercontent.com/ywchiu/riii/master/data/house-prices.csv"

# Step 2: Load the CSV data into a Pandas DataFrame
data = pd.read_csv(csv_url)

# Explore the loaded data (e.g., check the first few rows)
data.head()



# Convert categorical variables ('Brick' and 'Neighborhood') to numerical using one-hot encoding
data = pd.get_dummies(data, columns=['Brick', 'Neighborhood'], drop_first=True)

# Define the independent variables (features)
X = data[['SqFt', 'Bedrooms', 'Bathrooms', 'Offers', 'Brick_Yes', 'Neighborhood_North', 'Neighborhood_West']]

# Define the dependent variable
Y = data['Price']

# Fit a Multiple Linear Regression model
model = LinearRegression()
model.fit(X, Y)

# Predict house prices using the model
predicted_prices = model.predict(X)

# Calculate R-squared (R^2)
r_squared = r2_score(Y, predicted_prices)

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(Y, predicted_prices)

# Print model coefficients
coefficients = model.coef_
intercept = model.intercept_

print("Model Coefficients:")
for feature, coef in zip(X.columns, coefficients):
    print(f"{feature}: {coef:.2f}")

print(f"\nIntercept: {intercept:.2f}")
print(f"R-squared (R^2): {r_squared:.2f}")
print(f"Mean Squared Error (MSE): {mse:.2f}")

        
Model Coefficients:
SqFt: 52.99
Bedrooms: 4246.79
Bathrooms: 7883.28
Offers: -8267.49
Brick_Yes: 17297.35
Neighborhood_North: 1560.58
Neighborhood_West: 22241.62

Intercept: 598.92
R-squared (R^2): 0.87
Mean Squared Error (MSE): 94105539.95        

In this code:

  • We load the dataset and convert categorical variables ('Brick' and 'Neighborhood') into numerical form using one-hot encoding.
  • We select the independent variables (features) 'SqFt,' 'Bedrooms,' 'Bathrooms,' 'Offers,' 'Brick_Yes,' 'Neighborhood_North,' and 'Neighborhood_West' for the Multiple Linear Regression model.
  • We fit the model, make predictions, and calculate R^2 and MSE as performance metrics.
  • Finally, we print the model coefficients, intercept, R^2, and MSE to evaluate the model's performance.

This new model considers multiple features, allowing it to potentially provide better predictions of house prices based on a combination of factors.


Insights:

The results from the Multiple Linear Regression model indicate a significant improvement in model performance compared to the previous simple linear regression. Let's analyze the coefficients, R^2 value, and MSE and provide insights and recommendations:

  1. Model Coefficients: Each coefficient represents the estimated change in house price for a one-unit change in the corresponding feature, holding all other features constant. Insights from the coefficients:
      • 'SqFt' has a positive coefficient of approximately 52.99, suggesting that each additional square foot of living space increases the estimated house price by $52.99.
      • 'Bedrooms' has a positive coefficient of approximately 4,246.79, indicating that each additional bedroom is associated with an estimated increase of $4,246.79 in house price.
      • 'Bathrooms' has a positive coefficient of approximately 7,883.28, implying that each additional bathroom is associated with an estimated increase of $7,883.28 in house price.
      • 'Offers' has a negative coefficient of approximately -8,267.49, suggesting that each additional offer received is associated with a decrease of approximately $8,267.49 in house price.
      • 'Brick_Yes' (presence of brick construction) has a positive coefficient of approximately 17,297.35, indicating that brick houses are estimated to sell for $17,297.35 more than comparable non-brick houses.
      • 'Neighborhood_North' and 'Neighborhood_West' are dummy variables for neighborhood location. 'Neighborhood_West' has the largest positive coefficient, indicating that houses in the 'West' neighborhood are estimated to have higher prices than those in the reference neighborhood.
  2. R^2 (Coefficient of Determination): The R^2 value of approximately 0.87 indicates that the model explains roughly 87% of the variance in house prices using the selected features. This is a significant improvement over the previous model, suggesting that the model fits the data well and captures a substantial portion of the variability.
  3. Mean Squared Error (MSE): The MSE of approximately 94,105,539.95 is the average squared difference between the model's predicted prices and the actual prices, so its units are squared dollars. Taking the square root gives a Root Mean Squared Error (RMSE) of roughly $9,700, meaning the predictions are off by about $9,700 on average. Whether that is acceptable depends on the price range in the dataset and the application (see the sketch after this list).
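
To make these numbers more concrete, here is a minimal sketch that continues from the full example above; it assumes model, X, Y, predicted_prices, and mse are still in scope. It converts the MSE into a dollar-scale RMSE, re-derives R^2 by hand, and rebuilds a single prediction from the intercept and coefficients.

import numpy as np

# Root Mean Squared Error: the square root of MSE, back in dollar units
rmse = np.sqrt(mse)
print(f"RMSE: {rmse:,.2f}")  # roughly 9,700 for the MSE reported above

# Verify R^2 by hand: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((Y - predicted_prices) ** 2)
ss_tot = np.sum((Y - Y.mean()) ** 2)
print(f"R^2 (manual): {1 - ss_res / ss_tot:.2f}")

# Rebuild the first prediction from the intercept and coefficients
first_house = X.iloc[0].astype(float)
manual_prediction = model.intercept_ + np.dot(model.coef_, first_house)
print(f"Manual prediction for the first house: {manual_prediction:,.2f}")
print(f"model.predict for the first house:     {predicted_prices[0]:,.2f}")

The RMSE is usually the more intuitive headline number because it is expressed in the same units as the target variable.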

Insights and Recommendations:

  1. Feature Importance: The coefficients reveal the importance of each feature in influencing house prices. Bedrooms, bathrooms, square footage, and the presence of brick construction have substantial positive effects on house prices, while the number of offers has a negative effect.
  2. Model Fit: The high R^2 value suggests that the model fits the data well and explains a significant portion of the variability in house prices. This is a positive indication of the model's performance.
  3. Outliers: Investigate potential outliers in the dataset that might be affecting model performance. Outliers can have a disproportionate impact on coefficients and predictions.
  4. Further Exploration: Continue exploring and refining the model by considering additional features, interactions between features, or polynomial regression if there are nonlinear relationships (the sketch after this list includes a simple polynomial pipeline).
  5. Validation: Validate the model's performance on a separate test dataset to ensure it generalizes well to unseen data (see the train/test split in the sketch after this list).
  6. Business Application: Consider how the model can be applied to real estate business operations, such as pricing houses accurately and providing insights to clients and buyers.
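
Recommendations 4 and 5 above can be sketched in a few lines of code. The example below is illustrative rather than a tuned model: it reuses the X and Y defined earlier, and the degree-2 polynomial pipeline is simply one assumption about how nonlinear relationships might be probed.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hold out 20% of the rows as an unseen test set
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Baseline: the same multiple linear regression, fit on the training split only
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)
linear_pred = linear_model.predict(X_test)
print(f"Linear model  - test R^2: {r2_score(y_test, linear_pred):.2f}, "
      f"test RMSE: {np.sqrt(mean_squared_error(y_test, linear_pred)):,.2f}")

# Exploration: add degree-2 polynomial and interaction terms before the regression
poly_model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                           LinearRegression())
poly_model.fit(X_train, y_train)
poly_pred = poly_model.predict(X_test)
print(f"Poly pipeline - test R^2: {r2_score(y_test, poly_pred):.2f}, "
      f"test RMSE: {np.sqrt(mean_squared_error(y_test, poly_pred)):,.2f}")

If the held-out R^2 drops well below the in-sample 0.87, the model is overfitting; with a dataset this small, k-fold cross-validation would give a more stable estimate than a single train/test split.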

In summary, the Multiple Linear Regression model with the selected features shows promising performance, explaining a significant portion of the variance in house prices. Further refinements and validations can enhance its accuracy for practical use in the real estate market.

To learn more about model evaluation, please refer to the following article:

