LINEAR REGRESSION ON BOSTON DATASET

Giancarlo Ronci

Senior Data & Analytics Manager, Data Engineer, Business Intelligence and Data Warehouse at Soldo Ltd

Published: October 6, 2024

The Boston dataset is a classic dataset used for regression problems, especially for predicting house prices in different areas of Boston. This dataset is commonly used for machine learning experiments and is often imported into R via the MASS package.

The dataset contains 506 observations on 14 variables, with the median home value (medv, in thousands of dollars) as the target variable. The other variables are:

crim: Per capita crime rate by town
zn: Proportion of residential land zoned for lots over 25,000 square feet
indus: Proportion of non-retail business acres per town
chas: Binary variable indicating whether the area borders the Charles River (1 if it does, 0 otherwise)
nox: Concentration of nitrogen oxides (parts per 10 million)
rm: Average number of rooms per dwelling
age: Proportion of owner-occupied units built before 1940
dis: Weighted mean of distances to five Boston employment centers
rad: Index of accessibility to radial highways
tax: Full-value property tax rate per $10,000
ptratio: Pupil/teacher ratio by town
black: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town (the column is named black in MASS)
lstat: Percentage of the population of lower socioeconomic status

You can load it and view it in R with the following commands:

# Load the MASS package

library(MASS)

# Load the Boston dataset

data("Boston")

# Display the first rows of the dataset

head(Boston)

Here is a preview of the first few rows of the Boston dataset:

The medv column represents the median home value (in thousands of dollars), while the other columns describe demographic and environmental characteristics of the various areas of Boston. These columns are the features used to predict medv.
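As a quick sanity check on the description above, the following sketch inspects the dimensions and column names after loading. The variable names dims and predictors are illustrative and are not part of the original walkthrough.

```r
library(MASS)
data("Boston")

# Dimensions: one row per area, one column per variable
dims <- dim(Boston)
print(dims)

# medv is the target; every other column is a candidate predictor
predictors <- setdiff(names(Boston), "medv")
print(predictors)

# Basic distribution of the target variable
print(summary(Boston$medv))
```

If the dataset loaded correctly, dims should report 506 rows and 14 columns.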

Here is an example of how to run a linear regression using the Boston dataset in R to predict the output variable medv (median home price) from the other variables.

Here are the steps to run a linear regression in R:

Load the dataset and necessary packages
View a summary of the variables
Run the linear regression
View the regression results

Here is the step-by-step R code:

# 1. Load the MASS package and the Boston dataset

library(MASS)

data("Boston")

# 2. Look at the structure of the dataset

str(Boston)

# 3. Run the linear regression

# Here we use all the independent variables to predict medv

model <- lm(medv ~ ., data=Boston)

# 4. View the regression model results

summary(model)

Explanation:

lm(medv ~ ., data=Boston): This formula indicates that we want to run a linear regression where the dependent variable is medv and all the other variables (.) are used as predictors.

summary(model): Returns a detailed summary of the model, including the coefficients, the R^2 value, the p-value for each predictor, and other useful statistics.

This will give you a high-level view of the variables that are significant and how they affect the median home price.
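The individual pieces of the summary can also be pulled out programmatically. The sketch below is one way to do it; the 0.05 threshold (alpha) is an assumption chosen for illustration, not something the analysis above prescribes.

```r
library(MASS)
data("Boston")

model <- lm(medv ~ ., data = Boston)
model_summary <- summary(model)

# Goodness of fit
r_squared <- model_summary$r.squared
adj_r_squared <- model_summary$adj.r.squared

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(model_summary)
p_values <- coef_table[, "Pr(>|t|)"]

# Assumed significance threshold, for illustration only
alpha <- 0.05
significant <- names(p_values)[p_values < alpha]

print(round(r_squared, 4))
print(significant)
```

Terms whose p-value falls below the chosen threshold are the ones the summary marks with stars.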

Here's how you can extract and visualize linear regression coefficients in R using the Boston dataset.

After you create the linear regression model with the lm() command, you can use the coef() function to get the estimated coefficients.

Steps to extract coefficients

# Load the necessary package

library(MASS)

# Create the linear regression model

model <- lm(medv ~ ., data = Boston)

# Extract the model coefficients

coefficients <- coef(model)

# Display the coefficients

print(coefficients)

Interpreting the coefficients

The output of coef(model) will show the coefficients for each predictor variable in the model. The coefficients are estimates of how much the target variable medv (median house value) changes for a unit increase in each independent variable, holding the other variables constant.

Example of coefficients summarizing linear regression:

Interpretation:

Intercept: When all predictors are zero, the predicted median home value is about 36.46 (in thousands of dollars, like medv itself).

crim: For every 1-unit increase in the per capita crime rate, the predicted median home value decreases by about 0.108.

rm: Each additional room in the average number of rooms per dwelling increases the predicted median home value by about 3.81.

lstat: Each 1-percentage-point increase in the share of people with low socioeconomic status decreases the predicted median home value by about 0.525.

Each coefficient then represents the impact of that variable on the target variable medv (the median home value), taking into account the other variables in the model.
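The "holding the other variables constant" reading can be checked directly. This sketch takes one observation, raises rm by one unit while leaving everything else unchanged, and compares the two predictions. The names base_row and bumped_row are illustrative.

```r
library(MASS)
data("Boston")

model <- lm(medv ~ ., data = Boston)

# Take one observation and create a copy with rm increased by 1
base_row <- Boston[1, ]
bumped_row <- base_row
bumped_row$rm <- bumped_row$rm + 1

# Difference in predicted medv between the two rows
pred_diff <- predict(model, newdata = bumped_row) -
  predict(model, newdata = base_row)

print(unname(pred_diff))
print(coef(model)[["rm"]])
```

Because the model is linear, the difference in predictions matches the rm coefficient regardless of which row is used as the starting point.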

So the following R command fits the linear regression model that best fits the data in the least-squares sense, that is, the one that minimizes the sum of squared residuals on this dataset.

model <- lm(medv ~ ., data = Boston)
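To make "best fit" concrete, the sketch below recomputes the coefficients from the normal equations (X'X) b = X'y and compares them with what lm() returns. This is for illustration only: lm() itself uses a QR decomposition internally, which is numerically more stable, and the names X, y and beta_hat are introduced here.

```r
library(MASS)
data("Boston")

model <- lm(medv ~ ., data = Boston)

# Design matrix (with intercept column) and response vector
X <- model.matrix(model)
y <- Boston$medv

# Solve the normal equations (X'X) b = X'y
beta_hat <- drop(solve(crossprod(X), crossprod(X, y)))

# Residual sum of squares at the fitted coefficients
rss <- sum(residuals(model)^2)

# Side-by-side comparison of the two sets of estimates
print(cbind(normal_eq = beta_hat, lm = coef(model)))
print(rss)
```

Any other choice of coefficients would give a larger residual sum of squares than rss on this data.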

The following image summarizes the linear regression activity comparing the actual values with the predicted ones.
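A comparable plot can be produced with base R graphics. This is a minimal sketch and not necessarily how the original figure was made; the axis labels, title and reference line are choices made here.

```r
library(MASS)
data("Boston")

model <- lm(medv ~ ., data = Boston)

# In-sample predictions and the observed target
predicted <- fitted(model)
actual <- Boston$medv

# Scatter plot of actual vs predicted values
plot(actual, predicted,
     xlab = "Actual medv",
     ylab = "Predicted medv",
     main = "Linear regression on Boston: actual vs predicted")

# Points on this line would be perfectly predicted
abline(a = 0, b = 1, col = "red", lty = 2)

# Root mean squared error on the training data
rmse <- sqrt(mean((actual - predicted)^2))
print(rmse)
```

Points close to the dashed line are well predicted. Note that these are in-sample predictions, so the plot says nothing about performance on unseen data.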
