LINEAR REGRESSION ON BOSTON DATASET
Giancarlo Ronci
Senior Data & Analytics Manager, Data Engineer, Business Intelligence and Data Warehouse at Soldo Ltd
The Boston dataset is a classic dataset used for regression problems, especially for predicting house prices in different areas of Boston. This dataset is commonly used for machine learning experiments and is often imported into R via the MASS package.
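The dataset contains 506 observations on 14 variables, with the median home value (medv, in thousands of dollars) as the target variable. The other variables include crim (per-capita crime rate), rm (average number of rooms per dwelling), and lstat (percentage of lower-status population).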
You can load it and view it in R with the following commands:
# Load the MASS package
library(MASS)
# Load the Boston dataset
data("Boston")
# Display the first rows of the dataset
head(Boston)
Here is a preview of the first few rows of the Boston dataset:
The medv column holds the median home value, while the other columns describe demographic and environmental characteristics of the various areas of Boston; these serve as the features (predictors) in this example.
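As a quick check of the dimensions and column names described above, a minimal sketch using only base R functions on the Boston data frame from MASS:

```r
# Load the package that ships the Boston data frame
library(MASS)
data("Boston")

# Number of rows (observations) and columns (variables)
dim(Boston)

# Column names, including the target variable medv
names(Boston)

# Five-number summary (plus mean) of the target variable
summary(Boston$medv)
```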
Here is an example of how to run a linear regression using the Boston dataset in R to predict the output variable medv (median home price) from the other variables.
The steps are to load the data, inspect its structure, fit the model, and view the results. Here is the step-by-step R code:
# 1. Load the MASS package and the Boston dataset
library(MASS)
data("Boston")
# 2. Look at the structure of the dataset
str(Boston)
# 3. Run the linear regression
# Here we use all the independent variables to predict medv
model <- lm(medv ~ ., data=Boston)
# 4. View the regression model results
summary(model)
Explanation:
lm(medv ~ ., data=Boston): This formula indicates that we want to run a linear regression where the dependent variable is medv and all the other variables (.) are used as predictors.
summary(model): Returns a detailed summary of the model, including the coefficients, the R² value, a p-value for each predictor, and other useful statistics.
This gives you a high-level view of which variables are statistically significant and how each one is associated with the median home price.
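The individual pieces of the summary can also be pulled out programmatically. The sketch below is one way to do it; the variable names and the 0.05 significance threshold are my own assumptions, not part of the original walkthrough.

```r
library(MASS)

# Fit the same full model as above
model <- lm(medv ~ ., data = Boston)
model_summary <- summary(model)

# Goodness-of-fit measures stored in the summary object
r_squared <- model_summary$r.squared
adj_r_squared <- model_summary$adj.r.squared

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(model_summary)
p_values <- coef_table[, "Pr(>|t|)"]

# Terms whose p-value falls below the assumed 0.05 threshold
significant <- names(p_values)[p_values < 0.05]

print(r_squared)
print(adj_r_squared)
print(significant)
```

Changing the threshold changes which predictors are flagged, so treat the resulting list as a screening aid and not as a final verdict.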
Here's how you can extract and visualize linear regression coefficients in R using the Boston dataset.
After you create the linear regression model with the lm() command, you can use the coef() function to get the estimated coefficients.
Steps to extract coefficients
# Load the necessary package
library(MASS)
# Create the linear regression model
model <- lm(medv ~ ., data = Boston)
# Extract the model coefficients
coefficients <- coef(model)
# Display the coefficients
print(coefficients)
Interpreting the coefficients
The output of coef(model) will show the coefficients for each predictor variable in the model. The coefficients are estimates of how much the target variable medv (median house value) changes for a unit increase in each independent variable, holding the other variables constant.
Example of coefficients summarizing linear regression:
Interpretation:
Intercept: When all predictors are zero, the predicted median home value is about 36.46 (in thousands of dollars).
crim: For every 1-unit increase in the per-capita crime rate, the median home value decreases by about 0.108.
rm: Each additional room in the average number of rooms per dwelling increases the median home value by about 3.81.
lstat: Each 1-percentage-point increase in the share of lower-status population decreases the median home value by about 0.525.
Each coefficient then represents the impact of that variable on the target variable medv (the median home value), taking into account the other variables in the model.
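The "unit increase, other variables held constant" reading can be verified numerically. In this sketch (the names base_row, bumped_row, and delta are illustrative), one observation is copied, rm is raised by one, and the change in the prediction is compared with the rm coefficient:

```r
library(MASS)

model <- lm(medv ~ ., data = Boston)

# Take one observation and create a copy with rm increased by one unit
base_row <- Boston[1, ]
bumped_row <- base_row
bumped_row$rm <- bumped_row$rm + 1

# Difference between the two predictions, everything else unchanged
delta <- predict(model, newdata = bumped_row) - predict(model, newdata = base_row)

# For a linear model this difference equals the rm coefficient
print(unname(delta))
print(unname(coef(model)["rm"]))
```

The same check works for any other predictor, because in a linear model the effect of a one-unit change does not depend on the starting values.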
So we can say that the following R command fits the linear regression model that best fits the data in the least-squares sense, that is, the one that minimizes the sum of squared residuals.
model <- lm(medv ~ ., data = Boston)
The following image summarizes the linear regression activity comparing the actual values with the predicted ones.
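A comparison of this kind can be sketched with base R graphics. The plot title, axis labels, colour, and the RMSE calculation are my own choices and may differ from the image.

```r
library(MASS)

model <- lm(medv ~ ., data = Boston)

# In-sample predictions for every observation
predicted <- fitted(model)

# Scatter plot of actual vs. predicted median home values
plot(Boston$medv, predicted,
     xlab = "Actual medv",
     ylab = "Predicted medv",
     main = "Actual vs. predicted median home value")

# Reference line: points on it are predicted exactly
abline(a = 0, b = 1, col = "red")

# Root mean squared error of the in-sample predictions
rmse <- sqrt(mean((Boston$medv - predicted)^2))
print(rmse)
```

Because the predictions are made on the same data used to fit the model, this error is likely to be optimistic compared with performance on unseen data.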