
Everything you need to know about Linear Regression

Anandh Shanmugaraj

Group CEO & MD at Gladwin International & Company, India's leading Interim Leadership Consulting, Executive Search and Leadership Advisory Firm.

Published: November 23, 2016

Linear regression is used to predict the value of an outcome variable Y from one or more input predictor variables X. The aim is to establish a linear relationship (a mathematical formula) between the predictor variable(s) and the response variable, so that the formula can be used to estimate the response Y when only the values of the predictors (Xs) are known.

Introduction

The aim of linear regression is to model a continuous variable Y as a mathematical function of one or more X variable(s), so that the regression model can be used to predict Y when only X is known. This mathematical equation can be generalized as follows:

$$Y = \beta_1 + \beta_2 X + \epsilon$$

where $\beta_1$ is the intercept and $\beta_2$ is the slope. Collectively, they are called regression coefficients. $\epsilon$ is the error term, the part of Y the regression model is unable to explain.
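To make the roles of the intercept and the slope concrete, here is a minimal sketch that simulates data from a known line and recovers the two coefficients with lm(). The true values (intercept 2, slope 3), the noise level, the seed, and all variable names are assumptions chosen purely for illustration; none of them come from the article.

```r
# Simulate data from Y = 2 + 3 * X + error (illustrative values, assumed)
set.seed(42)                           # make the simulation reproducible
x   <- runif(200, min = 0, max = 10)   # predictor
eps <- rnorm(200, mean = 0, sd = 1)    # error term
y   <- 2 + 3 * x + eps                 # response

sim_fit <- lm(y ~ x)                   # least-squares estimate of the line
beta    <- coef(sim_fit)               # named vector: "(Intercept)" and "x"
print(beta)
```

The estimates should land close to, but not exactly on, the true values, because the error term adds noise that the fitted line cannot explain.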

Example Problem

For this analysis, we will use the cars dataset that comes with R by default. cars is a standard built-in dataset, which makes it convenient for demonstrating linear regression in a simple, easy-to-understand fashion. You can access this dataset simply by typing cars in your R console. You will find that it consists of 50 observations (rows) and 2 variables (columns): dist and speed. Let's print out the first six observations here.

head(cars)  # display the first 6 observations
#>   speed dist
#> 1     4    2
#> 2     4   10
#> 3     7    4
#> 4     7   22
#> 5     8   16
#> 6     9   10

Before we begin building the regression model, it is a good practice to analyze and understand the variables. The graphical analysis and correlation study below will help with this.
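As a quick first look before any plotting, the sketch below checks the shape of the dataset and computes the correlation between the two variables. The variable names dims and speed_dist_cor are introduced here for illustration only, and the choice of the default Pearson correlation is an assumption rather than something the article specifies.

```r
# Basic structure of the built-in cars dataset
dims <- dim(cars)        # number of rows and columns
print(dims)
str(cars)                # column names and types
print(summary(cars))     # five-number summary and mean per column

# Pearson correlation between speed and dist (default method of cor())
speed_dist_cor <- cor(cars$speed, cars$dist)
print(speed_dist_cor)
```

A correlation close to 1 or -1 indicates a strong linear association between the two variables, while a value near 0 indicates a weak one.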

Graphical Analysis

The aim of this exercise is to build a simple regression model that we can use to predict Distance (dist) by establishing a statistically significant linear relationship with Speed (speed). But before jumping into the syntax, let's try to understand these variables graphically. Typically, for each of the independent variables (predictors), the following plots are drawn:

Scatter plot: To visualize the linear relationship between the predictor and the response.
Box plot: To spot any outlier observations in the variable. Outliers in a predictor can drastically affect the predictions, because they can easily change the direction/slope of the line of best fit.
Density plot: To see the distribution of the predictor variable. Ideally, a close-to-normal distribution (a bell-shaped curve) that is not skewed to the left or right is preferred.

Let us see how to make each one of them.

Scatter Plot

Scatter plots can help visualize any linear relationships between the dependent (response) variable and the independent (predictor) variables. Ideally, if you have multiple predictor variables, a scatter plot is drawn for each of them against the response, along with the line of best fit, as seen below.

scatter.smooth(x=cars$speed, y=cars$dist, main="Dist ~ Speed")  # scatterplot

The scatter plot, along with the smoothing line above, suggests a linearly increasing relationship between the ‘dist’ and ‘speed’ variables. This is a good thing, because one of the underlying assumptions in linear regression is that the relationship between the response and predictor variables is linear and additive.
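Note that scatter.smooth() draws a loess smoothing curve rather than a straight line. If a straight line of best fit is wanted on the plot, one possible sketch is to fit the model with lm() and overlay it with abline(). The object name line_fit and the styling arguments are assumptions made for this example.

```r
# Plain scatter plot with the least-squares line of best fit overlaid
line_fit <- lm(dist ~ speed, data = cars)   # simple linear regression

plot(cars$speed, cars$dist,
     xlab = "speed", ylab = "dist", main = "Dist ~ Speed")
abline(line_fit, col = "red", lwd = 2)      # straight line from the fitted model
```

If the loess curve and the straight line track each other closely, that supports the linearity assumption described above.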

BoxPlot – Check for outliers

Generally, any data point that lies more than 1.5 times the interquartile range (1.5 * IQR) below the 25th percentile or above the 75th percentile is considered an outlier, where the IQR is the distance between the 25th and 75th percentile values of that variable.

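par(mfrow=c(1, 2))  # divide graph area in 2 columns
# boxplot.stats()$out returns the outlying values (not row numbers); collapse them into a single label
boxplot(cars$speed, main="Speed", sub=paste("Outlier values:", paste(boxplot.stats(cars$speed)$out, collapse=" ")))  # box plot for 'speed'
boxplot(cars$dist, main="Distance", sub=paste("Outlier values:", paste(boxplot.stats(cars$dist)$out, collapse=" ")))  # box plot for 'distance'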
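The 1.5 * IQR rule can also be applied by hand, which makes it easier to see where the cut-offs fall. The following is a minimal sketch for the dist variable; the names q, iqr, lower_fence, upper_fence and dist_outliers are introduced here for illustration. It uses quantile() with its default settings, whereas boxplot.stats() works from hinges, so the two approaches can differ slightly on other datasets.

```r
# Apply the 1.5 * IQR rule manually to cars$dist
q   <- quantile(cars$dist, probs = c(0.25, 0.75))  # 25th and 75th percentiles
iqr <- q[[2]] - q[[1]]                             # interquartile range

lower_fence <- q[[1]] - 1.5 * iqr                  # values below this are outliers
upper_fence <- q[[2]] + 1.5 * iqr                  # values above this are outliers

dist_outliers <- cars$dist[cars$dist < lower_fence | cars$dist > upper_fence]
print(dist_outliers)
```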

Density plot – Check if the response variable is close to normality
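The body of this section is not reproduced here, so the following is only a sketch of how such a check might look in base R, not the author's own code. It assumes the response variable is dist and uses a kernel density estimate with default settings; the object name dist_density, the title and the fill colour are illustrative choices.

```r
# Kernel density estimate of the response variable
dist_density <- density(cars$dist)

plot(dist_density, main = "Density Plot: Distance",
     xlab = "dist", ylab = "Density")
polygon(dist_density, col = "lightblue")   # shade the area under the curve
```

A roughly symmetric, bell-shaped curve would suggest the variable is close to normal, while a long tail on one side would indicate skew.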

Continue Reading here on www.gladwinanalytics.com - world's leading network of data science professionals.

Special thanks to r-statistics for the great content.
