Linear Regression : (What/Why & How)

The theory behind linear regression

Linear regression is a basic and commonly used type of predictive analysis. The overall idea of regression is to examine two things: (1) Does a set of predictor variables do a good job of predicting an outcome (dependent) variable? (2) Which variables in particular are significant predictors of the outcome variable, and in what way do they impact it, as indicated by the magnitude and sign of the beta estimates? These regression estimates are used to explain the relationship between one dependent variable and one or more independent variables.

The simplest form of the regression equation with one dependent and one independent variable is defined by the formula y = c + b*x, where y = estimated dependent variable score, c = constant, b = regression coefficient, and x = score on the independent variable.
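To make the formula concrete, here is a minimal sketch that estimates c and b with the standard least-squares formulas. The function name `fit_simple_regression` and the data points are made up for illustration.

```python
import numpy as np

def fit_simple_regression(x, y):
    """Return (c, b) for y = c + b*x using the least-squares formulas."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # b = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2)
    b = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # The fitted line always passes through the point (x_mean, y_mean)
    c = y_mean - b * x_mean
    return c, b

# Made-up points that lie exactly on y = 2 + 3*x
x = [0, 1, 2, 3, 4]
y = [2, 5, 8, 11, 14]
c, b = fit_simple_regression(x, y)
print(f"y = {c:.2f} + {b:.2f}*x")  # prints: y = 2.00 + 3.00*x
```

Because the points sit exactly on a line, the estimates recover the constant 2 and the coefficient 3; with real, noisy data they would be the best-fitting values instead.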

Simple linear regression: 1 dependent variable (interval or ratio), 1 independent variable (interval, ratio, or dichotomous)

Multiple linear regression: 1 dependent variable (interval or ratio), 2+ independent variables (interval, ratio, or dichotomous)

Three major uses for regression analysis are (1) determining the strength of predictors, (2) forecasting an effect, and (3) trend forecasting.

First, the regression might be used to identify the strength of the effect that the independent variable(s) have on a dependent variable. Typical questions are what is the strength of the relationship between dose and effect, sales and marketing spending, or age and income.

Second, it can be used to forecast the effects or impact of changes. That is, the regression analysis helps us to understand how much the dependent variable changes with a change in one or more independent variables. A typical question is, “how much additional sales income do I get for each additional $1000 spent on marketing?”

Third, the regression analysis predicts trends and future values. The regression analysis can be used to get point estimates. 
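The three uses above can be sketched on a toy example. The spend and sales figures below are invented for illustration, and `np.polyfit` is used here only as a convenient way to fit a straight line.

```python
import numpy as np

# Made-up figures: marketing spend (in $1000s) and the resulting sales income
spend = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
sales = np.array([25.0, 45.0, 65.0, 85.0, 105.0])

# np.polyfit with degree 1 returns [slope, intercept]
b, c = np.polyfit(spend, sales, 1)

# Use 1: strength of the relationship, measured here by the correlation coefficient
strength = np.corrcoef(spend, sales)[0, 1]

# Use 2: effect of a change; the slope is the extra sales per extra $1000 spent
extra_sales_per_1000 = b

# Use 3: point estimate for a spend level that is not in the data
forecast_at_60 = c + b * 60.0

print(f"Correlation between spend and sales: {strength:.2f}")
print(f"Each additional $1000 adds about {extra_sales_per_1000:.2f} in sales")
print(f"Forecast at a spend of 60: {forecast_at_60:.2f}")
```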

Example:

Step 1: Reading and Understanding the Data

Here we are using the advertising dataset.

Here the first three columns, TV, Radio, and Newspaper, are the predictor variables, and the fourth column, Sales, is the target variable. Each row shows the amount a company invested in each advertising channel and the sales it achieved. The aim is to find the advertising channel that best drives sales.
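A minimal sketch of the loading step, assuming the data sits in a CSV file with the four columns named above. The file name `advertising.csv` and the rows below are made up for illustration; they are not the article's data.

```python
import io
import pandas as pd

# Stand-in for the real file so the sketch runs on its own; values are invented
csv_text = """TV,Radio,Newspaper,Sales
200.0,30.0,50.0,20.0
50.0,40.0,45.0,10.0
20.0,45.0,60.0,9.0
150.0,41.0,55.0,18.0
"""

# With a real file this would be something like pd.read_csv("advertising.csv")
advertising = pd.read_csv(io.StringIO(csv_text))

# Look at the first few rows to confirm the columns were read as expected
print(advertising.head())
```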

Inspect the dataset's shape, missing values, and summary statistics such as the min/max values to get familiar with the data.
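These checks might look like the following in pandas. The small DataFrame is a made-up stand-in for the advertising data, with one missing value inserted on purpose so the check has something to find.

```python
import numpy as np
import pandas as pd

# Small invented frame standing in for the advertising data
advertising = pd.DataFrame({
    "TV": [200.0, 50.0, 20.0, 150.0, 180.0],
    "Radio": [30.0, 40.0, np.nan, 41.0, 10.0],  # one missing value on purpose
    "Newspaper": [50.0, 45.0, 60.0, 55.0, 58.0],
    "Sales": [20.0, 10.0, 9.0, 18.0, 19.0],
})

print(advertising.shape)   # (rows, columns)
advertising.info()         # column types and non-null counts

# Number of missing values per column
missing = advertising.isnull().sum()
print(missing)

# Summary statistics; min and max are two of the rows describe() returns
summary = advertising.describe()
print(summary.loc[["min", "max"]])
```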

Step 2: Visualize the dataset

Plotting each of the three advertising channels against Sales shows how they relate. TV shows a much clearer positive relationship with Sales than the other two channels do.
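One way to draw such plots is sketched below with matplotlib. The data is synthetic and deliberately built so that only TV drives Sales, so the picture mimics the pattern described rather than reproducing the real dataset.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in data: Sales depends on TV only, plus random noise
rng = np.random.default_rng(0)
n = 100
tv = rng.uniform(0, 300, n)
advertising = pd.DataFrame({
    "TV": tv,
    "Radio": rng.uniform(0, 50, n),
    "Newspaper": rng.uniform(0, 100, n),
    "Sales": 7 + 0.05 * tv + rng.normal(0, 2, n),
})

# One scatter plot per channel, sharing the Sales axis for easy comparison
channels = ["TV", "Radio", "Newspaper"]
fig, axes = plt.subplots(1, 3, figsize=(12, 4), sharey=True)
for ax, channel in zip(axes, channels):
    ax.scatter(advertising[channel], advertising["Sales"], s=10)
    ax.set_xlabel(channel)
axes[0].set_ylabel("Sales")
fig.tight_layout()
```

In a notebook, the figure appears once the cell runs; in a script, drop the `Agg` line and call `plt.show()`.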

Secondly, we can also look at the correlations between the variables.
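A correlation matrix can be computed directly with pandas, as in this sketch. The data is again synthetic, built so that TV is the channel most correlated with Sales.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: Sales depends on TV only, plus random noise
rng = np.random.default_rng(0)
n = 100
tv = rng.uniform(0, 300, n)
advertising = pd.DataFrame({
    "TV": tv,
    "Radio": rng.uniform(0, 50, n),
    "Newspaper": rng.uniform(0, 100, n),
    "Sales": 7 + 0.05 * tv + rng.normal(0, 2, n),
})

# Pairwise Pearson correlations between all columns
corr = advertising.corr()
print(corr.round(2))

# Correlation of each channel with Sales, strongest first
sales_corr = corr["Sales"].drop("Sales").sort_values(ascending=False)
print(sales_corr.round(2))
```

The matrix can also be drawn as a heatmap, but the numbers alone are enough to pick the channel most strongly related to Sales.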

Step 3: Building the model

Since this example illustrates simple linear regression, which takes one predictor variable and one target variable, we choose TV and Sales because they show the strongest relationship.

Here the fitted equation is Y = 6.94 + 0.054*TV.

  • The slope is positive, which indicates a positive relationship between TV spend and Sales.
  • The intercept is positive, which means that even with zero spend on TV advertising the model predicts sales of 6.94 rather than zero.
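To see what the two coefficients mean in practice, the fitted equation can be wrapped in a small function. The name `predict_sales` is hypothetical; the coefficients are the ones quoted above.

```python
INTERCEPT = 6.94  # predicted sales when TV spend is zero
SLOPE = 0.054     # change in predicted sales per one-unit increase in TV spend

def predict_sales(tv_spend):
    """Apply the fitted equation Y = 6.94 + 0.054*TV."""
    return INTERCEPT + SLOPE * tv_spend

baseline = predict_sales(0)
at_100 = predict_sales(100)

print(f"Predicted sales with no TV spend: {baseline:.2f}")
print(f"Predicted sales with TV spend of 100: {at_100:.2f}")
print(f"Gain from 100 units of TV spend: {at_100 - baseline:.2f}")
```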

Residual Analysis [Analysis on Training Set]
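Residual analysis usually checks that the errors on the training set are centered on zero and show no pattern against the predictor. The sketch below is one common way to do this, not necessarily the exact plots used in this article; it runs on synthetic data and fits the line with `np.polyfit` to stay short.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic training data with a known linear relationship plus noise
rng = np.random.default_rng(0)
tv_train = rng.uniform(0, 300, 140)
sales_train = 7 + 0.05 * tv_train + rng.normal(0, 2, 140)

# Fit the line and compute residuals = actual - predicted
slope, intercept = np.polyfit(tv_train, sales_train, 1)
sales_train_pred = intercept + slope * tv_train
residuals = sales_train - sales_train_pred

fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))

# Left: the residuals should look roughly bell-shaped around zero
ax_hist.hist(residuals, bins=15)
ax_hist.set_title("Distribution of residuals")

# Right: the residuals should scatter evenly around the zero line
ax_scatter.scatter(tv_train, residuals, s=10)
ax_scatter.axhline(0, color="red")
ax_scatter.set_xlabel("TV")
ax_scatter.set_ylabel("Residual")
fig.tight_layout()

print(f"Mean residual: {residuals.mean():.4f}")
```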

Prediction and Evaluation on the Test Set

Here we can see that the R² value on the training set is 81.57%, which is the same as the value shown in the summary table above. The R² score on the test set is 79.21%, which is within 5% of the training value. This suggests that what the model learned from the training data carries over to the unseen test data.
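A comparison like this can be computed with scikit-learn's `r2_score`, as sketched below. The data is synthetic and the line is fitted with `np.polyfit` rather than statsmodels, so the scores will differ from the ones reported above.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a known linear relationship plus noise
rng = np.random.default_rng(0)
tv = rng.uniform(0, 300, 200)
sales = 7 + 0.05 * tv + rng.normal(0, 2, 200)

tv_train, tv_test, sales_train, sales_test = train_test_split(
    tv, sales, train_size=0.7, random_state=100
)

# Fit on the training set only
slope, intercept = np.polyfit(tv_train, sales_train, 1)

# Score the same fitted line on both sets; r2_score takes (y_true, y_pred)
r2_train = r2_score(sales_train, intercept + slope * tv_train)
r2_test = r2_score(sales_test, intercept + slope * tv_test)

print(f"R2 on training set: {r2_train:.4f}")
print(f"R2 on test set:     {r2_test:.4f}")
print(f"Difference:         {abs(r2_train - r2_test):.4f}")
```

A test score close to the training score is a sign that the model is not overfitting the training rows.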

A plot of the predictions against the test data shows a good fit, which indicates our model has done a good job.

