Simple Linear Regression Practical Example
Rany ElHousieny, PhD
Generative AI ENGINEERING MANAGER | ex-Microsoft | AI Solutions Architect | Generative AI & NLP Expert | Proven Leader in AI-Driven Innovation | Former Microsoft Research & Azure AI | Software Engineering Manager
This article is a continuation of the following article:
In this article, we will apply what we learned in the previous article using an example.
Note: We will be using Google Colaboratory Python notebooks to avoid setup and environment delays. The focus of this article is to get you up and running in Machine Learning with Python, and we can do everything we need there. The following article can help you get started with Google Colab:
Note: If you are new to Python, the following article can help you get started:
Problem Statement:
You are a real estate analyst working for a property valuation company. Your company is tasked with predicting house prices based on various features. One of your clients, a real estate agency, has provided you with a dataset containing information about houses, including their square footage, number of bedrooms, number of bathrooms, the presence of brick construction, the number of offers received, and the neighborhood in which each house is located. The dataset can be found at:
"https://raw.githubusercontent.com/ywchiu/riii/master/data/house-prices.csv"
Your client wants to understand how the number of bedrooms affects house prices. They believe that the number of bedrooms is a critical factor in determining the price of a house. To provide your client with valuable insights and predictions, you need to perform a simple linear regression analysis to model the relationship between the number of bedrooms and house prices.
Solution:
To solve this problem, you will:
1. Load the dataset containing information about houses, including the number of bedrooms and house prices.
# Step 1: Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# Step 2: Download the CSV file from GitHub (raw URL)
csv_url = "https://raw.githubusercontent.com/ywchiu/riii/master/data/house-prices.csv"
# Step 3: Load the CSV data into a Pandas DataFrame
data = pd.read_csv(csv_url)
# Explore the loaded data (e.g., check the first few rows)
data.head()
2. Preprocess the data, ensuring that it is clean and suitable for analysis.
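The article does not show code for this step, so here is a minimal sketch of what a basic check could look like, assuming the columns are named Bedrooms and Price as in the code that follows. The small `sample` frame and the `basic_clean` helper are made up for illustration so the snippet runs without a download; in the notebook you would pass `data` instead.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset (values invented for illustration)
sample = pd.DataFrame({
    "Bedrooms": [2, 3, None, 4, 3],
    "Price": [114300, 114200, 114800, None, 119800],
})

def basic_clean(df, columns=("Bedrooms", "Price")):
    """Drop rows with missing values in the given columns and cast them to float."""
    cols = list(columns)
    cleaned = df.dropna(subset=cols).copy()
    cleaned[cols] = cleaned[cols].astype(float)
    return cleaned

# Count missing values per column before cleaning
print(sample.isna().sum())

cleaned = basic_clean(sample)
# Rows remaining after incomplete ones are dropped
print(cleaned.shape)
```

Dropping incomplete rows is only one option; whether it is appropriate depends on how many rows are affected.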
3. Perform a simple linear regression analysis by using the number of bedrooms as the independent variable (X) and house prices as the dependent variable (Y).
# Extract the independent variable (X) and dependent variable (Y)
X = data[['Bedrooms']].values # Using 'Bedrooms' as the independent variable
Y = data['Price'].values # Using 'Price' as the dependent variable
4. Fit a linear regression model to the data, finding the best-fitting line that represents the relationship between the number of bedrooms and house prices.
# Fit a simple linear regression model
model = LinearRegression()
model.fit(X, Y)
5. Calculate the intercept and slope coefficients of the regression line to quantify the relationship.
# Calculate the coefficients (intercept and slope)
intercept = model.intercept_
slope = model.coef_[0]
# Print the coefficients
print(f"Intercept (a): {intercept:.2f}")
print(f"Slope (b): {slope:.2f}")
Intercept (a): 71574.71
Slope (b): 19465.47
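For intuition about where these two numbers come from: with a single feature, the least-squares line has a closed form, b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and a = ȳ − b·x̄. The sketch below applies it to a few invented points (not the article's dataset); `fit_line` is a hypothetical helper name, not part of scikit-learn.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope

# Toy data lying exactly on price = 70000 + 20000 * bedrooms (invented values)
bedrooms = [2, 3, 4, 5]
prices = [110000, 130000, 150000, 170000]

a, b = fit_line(bedrooms, prices)
print(f"Intercept (a): {a:.2f}")
print(f"Slope (b): {b:.2f}")
```

Because the toy points sit exactly on a line, the function recovers that line's intercept and slope; on real data the result is the line that minimizes the sum of squared errors.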
6. Visualize the data and the regression line to understand how the number of bedrooms influences house prices.
# Visualize the data and regression line
plt.scatter(X, Y, c='b', marker='o', label='Data')
plt.plot(X, model.predict(X), color='red', label='Regression Line')
plt.xlabel("Number of Bedrooms")
plt.ylabel("House Price")
plt.title("Linear Regression: House Price vs. Number of Bedrooms")
plt.legend()
plt.show()
7. Provide your client with insights into the impact of the number of bedrooms on house prices and offer predictions based on the linear regression model.
To provide your client with insights into the impact of the number of bedrooms on house prices and offer predictions based on the linear regression model, you can follow these steps:
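Step1: Interpret the Coefficients:
Intercept (a) = 71574.71: the price the model predicts for a house with zero bedrooms.
Slope (b) = 19465.47: the change in predicted price for each additional bedroom.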
Step2: Provide Insights:
Explain to your client that the intercept (a) might not be directly interpretable because houses typically have a minimum number of bedrooms greater than zero. However, it's essential to mention it for completeness. Focus on the slope (b) as the key insight. You can say something like, "For each additional bedroom in a house, we expect the house price to increase by approximately $19,465 on average (b ≈ $19.5k)."
Step3: Offer Predictions:
To make predictions for house prices based on the number of bedrooms, your client can use the linear regression model's equation:
House Price = a + b × Number of Bedrooms
For example, if a house has 3 bedrooms, you can calculate the predicted price as:
Predicted Price = a + b × 3 = 71574.71 + 19465.47 × 3 = $129,971.12
Step4: Provide your client with a way to Predict:
Provide your client with code or a calculator that allows them to easily calculate predicted prices for houses with different numbers of bedrooms.
# Provided intercept and slope
intercept = 71574.71
slope = 19465.47
# Function to calculate predicted price based on the number of bedrooms
def predict_price(num_bedrooms):
    predicted_price = intercept + slope * num_bedrooms
    return predicted_price
# Input: Number of bedrooms from the console
num_bedrooms = float(input("Enter the number of bedrooms: "))
# Calculate the predicted price
predicted_price = predict_price(num_bedrooms)
# Print the predicted price
print(f"Predicted Price for {num_bedrooms} bedrooms: ${predicted_price:.2f}")
This program defines a function predict_price(num_bedrooms) that takes the number of bedrooms as input and calculates the predicted house price using the linear regression equation. It then takes the number of bedrooms as input from the console, calculates the predicted price, and prints the result.
Your client can run this program, enter the number of bedrooms they want to predict the price for, and receive the predicted house price as output.
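If the client would rather have a lookup table than an interactive prompt, a non-interactive variant could look like the sketch below. It reuses the intercept and slope reported earlier; the name `price_table` and the choice of 1 to 5 bedrooms are assumptions made for illustration.

```python
# Coefficients reported earlier in the article
INTERCEPT = 71574.71
SLOPE = 19465.47

def price_table(bedroom_counts):
    """Map each bedroom count to its predicted price, rounded to cents."""
    return {n: round(INTERCEPT + SLOPE * n, 2) for n in bedroom_counts}

table = price_table(range(1, 6))
for n, price in table.items():
    print(f"{n} bedrooms: ${price:,.2f}")
```

The row for 3 bedrooms matches the worked example of $129,971.12. Predictions for bedroom counts outside the range seen in the dataset are extrapolations and should be treated with extra caution.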
Step5: Highlight Model Performance:
Mention the model's performance metrics, such as R^2 or Mean Squared Error (MSE), to indicate how well the model fits the data. A higher R^2 value indicates a better fit.
R^2 (Coefficient of Determination):
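The R^2 value, also known as the coefficient of determination, measures the proportion of the variance in the dependent variable (house prices) that is explained by the independent variable (number of bedrooms). For a least-squares fit evaluated on the data it was trained on, it ranges from 0 to 1, with a higher R^2 indicating a better fit. Specifically:
R^2 = 0 means the model explains none of the variance (it does no better than always predicting the mean price).
R^2 = 1 means the model explains all of the variance (every prediction matches the actual price).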
In simple terms, R^2 helps us understand how well the linear regression model captures the patterns in the data. A high R^2 suggests that the model's predictions closely follow the actual data points.
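To make the definition concrete, R^2 can be computed by hand as 1 − SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the actual values from their mean. The sketch below uses invented numbers, and `r_squared` is a hypothetical helper; on the same inputs it should agree with scikit-learn's r2_score.

```python
def r_squared(actual, predicted):
    """Compute R^2 = 1 - SS_res / SS_tot for paired actual/predicted values."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_actual) ** 2 for y in actual)
    return 1 - ss_res / ss_tot

# Invented values for illustration only
actual = [100.0, 120.0, 140.0, 160.0]
predicted = [110.0, 115.0, 145.0, 150.0]

score = r_squared(actual, predicted)
print(f"R-squared (R^2): {score:.3f}")  # SS_res = 250, SS_tot = 2000, so 0.875
```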
Mean Squared Error (MSE):
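The Mean Squared Error (MSE) is the average of the squared differences between the actual and predicted values, so lower values indicate a better fit. Because the errors are squared, MSE is expressed in squared units (here, dollars squared). The formula for MSE is:
MSE = (1/n) × Σ (y_i − ŷ_i)², summed over i = 1, ..., n
Where:
n is the number of observations
y_i is the actual value (house price) of observation i
ŷ_i is the value predicted by the model for observation i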
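As a small worked example, MSE is the mean of the squared differences between actual and predicted values, and its square root (RMSE) brings the error back to the original units. The numbers and the helper name `mse_manual` below are invented for illustration.

```python
import math

def mse_manual(actual, predicted):
    """Average of squared differences between actual and predicted values."""
    n = len(actual)
    return sum((y - p) ** 2 for y, p in zip(actual, predicted)) / n

# Invented values for illustration only
actual = [100.0, 120.0, 140.0, 160.0]
predicted = [110.0, 115.0, 145.0, 150.0]

mse = mse_manual(actual, predicted)   # (100 + 25 + 25 + 100) / 4 = 62.5
rmse = math.sqrt(mse)                 # same units as the target variable
print(f"MSE: {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
```

RMSE is usually the easier figure to share with a client because it is expressed in dollars rather than dollars squared.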
Now, let's calculate and interpret these performance metrics using Python based on the provided linear regression model:
from sklearn.metrics import r2_score, mean_squared_error
# Assuming you have actual prices and predictions
actual_prices = data['Price'].values
predicted_prices = predict_price(X) # Using the previously defined predict_price function
# Calculate R-squared (R^2)
r_squared = r2_score(actual_prices, predicted_prices)
print(f"R-squared (R^2): {r_squared:.2f}")
# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(actual_prices, predicted_prices)
print(f"Mean Squared Error (MSE): {mse:.2f}")
R-squared (R^2): 0.28
Mean Squared Error (MSE): 518165995.28
In this code, we calculate both R^2 and MSE using scikit-learn's r2_score and mean_squared_error functions. Together, these metrics indicate how well the linear regression model fits the data.
You can use these metrics to communicate the model's performance to your client. For instance, you can say that the model has an R^2 value of 0.28, meaning that the number of bedrooms alone explains about 28% of the variance in house prices. The MSE of 518,165,995.28 is measured in squared dollars, so it is easier to report its square root (the RMSE): roughly $22,763, which is the typical size of the model's prediction error.
Let's analyze the metrics and discuss how to interpret them and potentially improve the model:
R-squared (R^2): 0.28
Mean Squared Error (MSE): 518165995.28
Insights:
Improvements: To improve the model's performance and increase its predictive accuracy:
Remember that linear regression is a simplification of reality, and house price prediction is a complex task influenced by numerous factors. By incorporating additional features and using more advanced modeling techniques, you can enhance the model's ability to make accurate predictions.
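As a rough illustration of why additional features can help, the sketch below fits one model on bedrooms alone and another on bedrooms plus square footage, then compares R^2. The data are invented so that price depends on both features; the variable names and numbers are assumptions and do not come from the article's dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Invented data: price is an exact linear function of both features
sqft = np.array([1500, 1800, 2100, 1700, 2400, 2000], dtype=float)
bedrooms = np.array([2, 3, 3, 2, 4, 4], dtype=float)
price = 20000 + 50 * sqft + 5000 * bedrooms

# Feature matrices: bedrooms only vs. square footage plus bedrooms
X_one = bedrooms.reshape(-1, 1)
X_two = np.column_stack([sqft, bedrooms])

model_one = LinearRegression().fit(X_one, price)
model_two = LinearRegression().fit(X_two, price)

r2_one = r2_score(price, model_one.predict(X_one))
r2_two = r2_score(price, model_two.predict(X_two))
print(f"R^2 with bedrooms only: {r2_one:.3f}")
print(f"R^2 with bedrooms and square footage: {r2_two:.3f}")
```

Note that R^2 measured on the training data never decreases when a feature is added, so a fair comparison on real data should use a held-out test set.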
Note: In the following articles, we will address each of the previous suggestions and compare performance.
Step6: Discuss Limitations:
Let's add more features to improve performance as explained in the following article: