LINEAR REGRESSION ON BOSTON DATASET

The Boston dataset is a classic dataset used for regression problems, especially for predicting house prices in different areas of Boston. This dataset is commonly used for machine learning experiments and is often imported into R via the MASS package.

The dataset contains 506 observations on 14 variables, with the median house price (medv) being the target variable. Some of the other variables include:

  • crim: Per capita crime rate by town
  • zn: Proportion of residential land zoned for lots over 25,000 square feet
  • indus: Proportion of non-retail business acres per town
  • chas: Binary variable indicating whether the area borders the Charles River (1 if it does, 0 otherwise)
  • nox: Concentration of nitrogen oxides (parts per 10 million)
  • rm: Average number of rooms per dwelling
  • age: Proportion of owner-occupied dwellings built before 1940
  • dis: Weighted mean distance to five Boston employment centers
  • rad: Index of accessibility to radial highways
  • tax: Full-value property tax rate per $10,000
  • ptratio: Pupil/teacher ratio by town
  • black: Transformed measure of the proportion of Black residents by town, 1000(Bk - 0.63)^2, where Bk is that proportion
  • lstat: Percentage of the population with lower socioeconomic status

You can load it and view it in R with the following commands:

# Load the MASS package

library(MASS)

# Load the Boston dataset

data("Boston")

# Display the first rows of the dataset

head(Boston)

Here is a preview of the first few rows of the Boston dataset:

The medv column holds the median value of owner-occupied homes (in thousands of dollars), while the other columns describe demographic and environmental characteristics of the various areas of Boston. These other columns serve as the features (predictors) in this example.
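As a quick check of the figures quoted above, the following minimal sketch confirms the dataset's size and column names after loading it. It uses only base R and the MASS package already mentioned; the variable name n_dims is chosen here purely for illustration.

```r
# Load the MASS package and the Boston dataset
library(MASS)
data("Boston")

# Number of rows (observations) and columns (variables)
n_dims <- dim(Boston)
print(n_dims)

# Column names, including the target variable medv
print(names(Boston))

# Basic distribution of the target variable (in thousands of dollars)
print(summary(Boston$medv))
```

The dimensions should match the 506 observations and 14 variables described earlier.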

The following example runs a linear regression on the Boston dataset in R to predict the output variable medv (median home value) from the other variables.

The steps are:

  • Load the dataset and necessary packages
  • Inspect the structure of the variables
  • Run the linear regression
  • View the regression results

Here is the step-by-step R code:

# 1. Load the MASS package and the Boston dataset

library(MASS)

data("Boston")

# 2. Look at the structure of the dataset

str(Boston)

# 3. Run the linear regression

# Here we use all the independent variables to predict medv

model <- lm(medv ~ ., data=Boston)

# 4. View the regression model results

summary(model)

Explanation:

lm(medv ~ ., data=Boston): This formula indicates that we want to run a linear regression where the dependent variable is medv and all the other variables (.) are used as predictors.

summary(model): Returns a detailed summary of the model, including the coefficients, the R² value, the p-value for each predictor, and other useful statistics.

This will give you a high-level view of the variables that are significant and how they affect the median home price.
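If you want to work with those statistics programmatically rather than read them off the printed summary, a sketch along the following lines may help. The names model_summary, coef_table, r_squared and significant are illustrative, and the 0.05 significance threshold is an assumption, not something fixed by the model.

```r
library(MASS)
data("Boston")
model <- lm(medv ~ ., data = Boston)

# summary() returns an object whose parts can be accessed directly
model_summary <- summary(model)

# Matrix with columns: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(model_summary)

# Share of the variance in medv explained by the model
r_squared <- model_summary$r.squared
print(round(r_squared, 4))

# Terms whose p-value is below the assumed 0.05 threshold
significant <- rownames(coef_table)[coef_table[, "Pr(>|t|)"] < 0.05]
print(significant)
```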

Here's how you can extract and visualize linear regression coefficients in R using the Boston dataset.

After you create the linear regression model with the lm() command, you can use the coef() function to get the estimated coefficients.

Steps to extract coefficients

# Load the necessary package

library(MASS)

# Create the linear regression model

model <- lm(medv ~ ., data = Boston)

# Extract the model coefficients

coefficients <- coef(model)

# Display the coefficients

print(coefficients)

Interpreting the coefficients

The output of coef(model) will show the coefficients for each predictor variable in the model. The coefficients are estimates of how much the target variable medv (median house value) changes for a unit increase in each independent variable, holding the other variables constant.
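To make "holding the other variables constant" concrete, the sketch below takes one observation, adds one room to it while leaving every other value unchanged, and compares the two predictions. The choice of the first row and the names base_row, more_rooms and effect_of_rm are illustrative assumptions.

```r
library(MASS)
data("Boston")
model <- lm(medv ~ ., data = Boston)

# Take the first observation and make a copy with one extra room
base_row <- Boston[1, ]
more_rooms <- base_row
more_rooms$rm <- more_rooms$rm + 1

# Change in predicted medv when only rm changes
effect_of_rm <- predict(model, newdata = more_rooms) -
  predict(model, newdata = base_row)

# The change matches the rm coefficient, up to floating-point rounding
print(unname(effect_of_rm))
print(unname(coef(model)["rm"]))
```

Because the model is linear, the same difference would be obtained starting from any other row.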

Example of coefficients summarizing linear regression:

Interpretation:

Intercept: When all predictors are zero, the predicted median home value is about 36.46 (medv is measured in thousands of dollars).

crim: For every 1-unit increase in the per capita crime rate, the predicted median home value decreases by about 0.108.

rm: Each additional room in the average number of rooms per dwelling increases the predicted median home value by about 3.81.

lstat: Each one-percentage-point increase in the share of people with low socioeconomic status decreases the predicted median home value by about 0.525.

Each coefficient then represents the impact of that variable on the target variable medv (the median home value), taking into account the other variables in the model.

So we can say that the following R command fits the linear regression model that best matches the data in the least-squares sense, that is, the model that minimizes the sum of squared differences between the actual and predicted medv values.

model <- lm(medv ~ ., data = Boston)
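One way to see what lm() computes is to solve the ordinary least squares normal equations by hand and compare the result with coef(model). This is only an illustrative sketch: internally lm() uses a QR decomposition rather than the normal equations, and the names X, y, beta_hat and max_abs_diff are introduced here for the example.

```r
library(MASS)
data("Boston")
model <- lm(medv ~ ., data = Boston)

# Design matrix (including the intercept column) and response vector
X <- model.matrix(medv ~ ., data = Boston)
y <- Boston$medv

# Solve the normal equations (X'X) beta = X'y
beta_hat <- drop(solve(crossprod(X), crossprod(X, y)))

# Largest absolute gap between the hand-computed and lm() estimates
max_abs_diff <- max(abs(beta_hat - coef(model)))
print(max_abs_diff)
```

The gap should be negligible, reflecting numerical rounding only.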

The following image summarizes the linear regression activity comparing the actual values with the predicted ones.
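A comparison of this kind can be reproduced with a few lines of base R graphics. The sketch below is one possible version, not necessarily the code behind the original image; the axis labels, title and line color are arbitrary choices.

```r
library(MASS)
data("Boston")
model <- lm(medv ~ ., data = Boston)

# Predicted values for the training data and the observed values
predicted <- fitted(model)
actual <- Boston$medv

# Scatter plot of actual vs. predicted median home values
plot(actual, predicted,
     xlab = "Actual medv",
     ylab = "Predicted medv",
     main = "Actual vs. predicted values")

# Reference line: points on it are predicted exactly
abline(a = 0, b = 1, col = "red")
```

Points close to the red line are well predicted; points far from it are areas where the linear model fits poorly.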

