Regression!
A technique for determining the statistical relationship between two or more variables where a change in a dependent variable is associated with, and depends on, a change in one or more independent variables.
Regression is a supervised learning technique for predicting continuous values, and it is one of the most popular algorithms in machine learning.
Continuous and Discrete Data:
Discrete data take distinct, separate values and are often descriptive labels (like “fast” or “slow”), whereas continuous data are numeric values that vary smoothly with the independent variable.
Linear Regression example:
Take the following example.
Let us assume the value of Y depends on X. The graph above is roughly drawn for this data.
Here, the red line that roughly touches all the points is called the regression line, and you can clearly see that the line intersects the Y axis at ‘c’ (the Y intercept).
And if “m” is the slope of the line, then the equation of the line is given by:
Y = mX + c
Based on this formula, you can predict the value of Y for any given X, and this is the basic idea behind using regression in ML.
From the above graph, since the relationship between the independent variable (X) and the dependent variable (Y) is linear, the regression is known as linear regression.
Training Data:
We apply regression to our data to find the slope and intercept; then, using the line formula, we can solve for Y given any X. In other words, our machine predicts Y for any given X.
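The fitting step above can be sketched in plain Python. This is a minimal least-squares implementation with made-up data points, not the article's original example:

```python
# Fit a least-squares regression line y = m*x + c to sample data.
# The data points below are illustrative assumptions.

def fit_line(xs, ys):
    """Return slope m and intercept c of the least-squares line."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # m = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2)
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    den = sum((x - x_mean) ** 2 for x in xs)
    m = num / den
    c = y_mean - m * x_mean   # the line passes through the mean point
    return m, c

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]         # exactly y = 2x, so m = 2 and c = 0
m, c = fit_line(xs, ys)
print(m, c)                   # 2.0 0.0

# Predict Y for a new X using the line formula
print(m * 6 + c)              # 12.0
```

Once m and c are known, prediction is just a matter of plugging X into the line formula.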
Errors:
Errors refer to the distances between the data points and the regression line. The regression line is drawn so that it passes through the mean point of the data, i.e.:
y’ = mx’ + c, where
y’ = Σy / n
x’ = Σx / n
n = number of data points
Because of this, in real-world data, some points may not fall exactly on the line, and these differences are the errors.
Mean Squared Error:
This tells you how close your regression line is to a set of points. It takes the distances from the points to the regression line and squares them; these distances are the errors. The squaring is done to remove any negative signs. It’s called the MSE because you’re finding the average of a set of squared errors.
Steps to calculate MSE :
- Find the regression line.
- Substitute your values of X to find the predicted values of Y.
- Find the difference between each predicted value and the actual value.
- Square these differences.
- Add all of the squared errors and find their mean.
MSE is used to find the line of best fit: the smaller the value, the better the result.
Equation:
MSE = (1/n) · Σ (yᵢ − ŷᵢ)²
where yᵢ is the actual value, ŷᵢ is the predicted value on the line, and n is the number of data points.
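The steps listed above can be followed directly in code. This is a small sketch with an assumed line and toy data points:

```python
# Walk through the MSE steps for a toy regression line y = m*x + c.
# The line and data points below are illustrative assumptions.

def mse(xs, ys, m, c):
    """Mean squared error of the line y = m*x + c on the data."""
    preds = [m * x + c for x in xs]               # predicted Y values
    errors = [y - p for y, p in zip(ys, preds)]   # predicted vs. actual
    squared = [e ** 2 for e in errors]            # square the differences
    return sum(squared) / len(squared)            # mean of the squared errors

xs = [1, 2, 3]
ys = [2, 4, 7]
print(mse(xs, ys, 2, 0))   # errors are 0, 0, 1 -> MSE = 1/3
```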
R-squared error :
R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression.
The definition of R-squared is fairly straight-forward. It is the percentage of the response variable variation that is explained by a linear model. Or:
R-squared = Explained variation / Total variation
R-squared is always between 0 and 100%:
- 0% indicates that the model explains none of the variability of the response data around its mean.
- 100% indicates that the model explains all the variability of the response data around its mean.
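R-squared can be computed as one minus the ratio of unexplained to total variation, which matches the "explained variation / total variation" definition above. A minimal sketch with assumed data:

```python
# R-squared = 1 - (residual sum of squares / total sum of squares),
# i.e. the fraction of the response's variability explained by the model.

def r_squared(ys, preds):
    y_mean = sum(ys) / len(ys)
    ss_tot = sum((y - y_mean) ** 2 for y in ys)            # total variation
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))  # unexplained variation
    return 1 - ss_res / ss_tot

# A perfect fit explains 100% of the variability:
print(r_squared([1, 2, 3], [1, 2, 3]))   # 1.0
# Always predicting the mean explains 0%:
print(r_squared([1, 2, 3], [2, 2, 2]))   # 0.0
```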
Equation of Hypothesis:
h(x) = θ₀ + θ₁x
(the hypothesis for linear regression with a single variable, in the notation used in Andrew Ng’s course linked below)
Cost Function :
A cost function is something you want to minimize. For example, your cost function might be the sum of squared errors over your training set.
Gradient Descent :
It is a method for finding the minimum of a function of multiple variables, so you can use gradient descent to minimize your cost function. If your cost is a function of N variables, then the gradient is the length-N vector that points in the direction in which the cost increases most rapidly; gradient descent repeatedly steps in the opposite direction.
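As a concrete sketch, here is gradient descent minimizing the MSE cost of a line y = m·x + c. The learning rate, iteration count, and data are illustrative choices, not values from the article:

```python
# Gradient descent on the MSE cost of a line y = m*x + c.

def gradient_descent(xs, ys, lr=0.05, steps=2000):
    m, c = 0.0, 0.0                      # start from an arbitrary point
    n = len(xs)
    for _ in range(steps):
        preds = [m * x + c for x in xs]
        # Partial derivatives of MSE with respect to m and c
        dm = (-2 / n) * sum(x * (y - p) for x, y, p in zip(xs, ys, preds))
        dc = (-2 / n) * sum(y - p for y, p in zip(ys, preds))
        # Step opposite the gradient: the direction of fastest decrease
        m -= lr * dm
        c -= lr * dc
    return m, c

m, c = gradient_descent([1, 2, 3, 4], [3, 5, 7, 9])   # true line: y = 2x + 1
print(round(m, 2), round(c, 2))                       # approximately 2.0 and 1.0
```

Each iteration moves the parameters slightly downhill on the cost surface, which is exactly the picture the Coursera course illustrates.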
For an excellent explanation of gradient descent, please visit week 1 of this course by Andrew Ng on Coursera. It is recommended that you take this course :)
https://www.coursera.org/learn/machine-learning
(Image above: an illustration of gradient descent. Credits: Andrew Ng’s ML course on Coursera.)
Logistic Regression :
In statistics, the logistic model (or logit model) is a statistical model that is usually applied to a binary dependent variable. In regression analysis, logistic regression (or logit regression) means estimating the parameters of a logistic model. More formally, a logistic model is one where the log-odds of the probability of an event is a linear combination of independent (predictor) variables. The two possible dependent variable values are often labelled “0” and “1”, representing outcomes such as pass/fail, win/lose, alive/dead, or healthy/sick.
It uses the sigmoid (logistic) function to predict the result.
Classification with Logistic Regression :
It is a quite simple algorithm and is very useful for binary classification (classifying yes/no, fast/slow, dead/alive, etc.). It can be used to handle multiple classes too; such a classification is called ‘one-vs-all’. One-vs-all is basically a collection of binary classifiers, each producing a probability, and the class with the highest probability is chosen as the final result.
Sigmoid Function:
y(z) = 1 / (1 + e^(−z))
- The value y(0) = 0.5.
- If y(z) >= 0.5, then z >= 0.
- If y(z) < 0.5, then z < 0.
- Hence, 0.5 is the cut-off for binary classification.
- If y(z) >= 0.5, the answer is 1 (positive case); if y(z) < 0.5, the answer is 0 (negative case).
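The cut-off behaviour described above can be verified with a few lines of Python:

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function: squashes any real z into (0, 1)."""
    return 1 / (1 + math.exp(-z))

print(sigmoid(0))         # 0.5 -> the cut-off point
print(sigmoid(2) >= 0.5)  # True: z >= 0 maps to the positive class (1)
print(sigmoid(-2) < 0.5)  # True: z < 0 maps to the negative class (0)
```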
Decision Boundary :
It acts as a separation between the two classes. The line separating the region predicted as y = 0 from the region predicted as y = 1 is the decision boundary. If the feature variables xi enter the model non-linearly, then the decision boundary can be non-linear as well.
In the above figure, the red line acts as a decision boundary between 2 classes i.e. blue circles and the green triangles.
The above graph is a circular decision boundary which separates the green triangles and the blue circles.
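A linear decision boundary can be sketched as the set of points where the model's input to the sigmoid is zero. The weights below are made-up values for illustration, not a trained model:

```python
# A linear decision boundary over two features: the line where
# w0 + w1*x1 + w2*x2 = 0. One side is class 1, the other class 0.

import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def predict(w, x1, x2):
    z = w[0] + w[1] * x1 + w[2] * x2
    return 1 if sigmoid(z) >= 0.5 else 0   # equivalently: 1 if z >= 0

w = [-3.0, 1.0, 1.0]       # boundary is the line x1 + x2 = 3
print(predict(w, 4, 2))    # 1: the point lies above the line
print(predict(w, 1, 1))    # 0: the point lies below the line
```

Replacing the linear term z with a non-linear one (e.g. involving x1² and x2²) would give a curved boundary, like the circular example above.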
One vs all classification :
We saw the cases with 2 classes above.
Now let us consider the case with 3 classes.
If we try class 1 vs all, we get a binary classification like the one below :
If we try class 2 vs all, we get a binary classification like the one below :
If we try class 3 vs all, we get a binary classification like the one below :
This way, we find the value of the hypothesis for all 3 cases and choose the highest value among them. That is the final answer we need. Since there are 3 classes, it becomes 3 binary classification problems.
The same can be extended to N classes, giving N binary classification problems.
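The final selection step is just an argmax over the per-class probabilities. The scores below are hypothetical outputs of three binary classifiers, not a trained model:

```python
# One-vs-all: one binary classifier per class, each giving the probability
# that the input belongs to that class; the highest probability wins.

def one_vs_all(probabilities):
    """probabilities maps class label -> P(input belongs to that class)."""
    return max(probabilities, key=probabilities.get)

scores = {"class 1": 0.20, "class 2": 0.75, "class 3": 0.40}
print(one_vs_all(scores))   # class 2 has the highest probability
```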
Other Regression:
Linear and logistic regression are rigid and do not work well when the dataset has a large number of outliers; we might have to preprocess the data with techniques like feature selection or PCA. Many other types of regression are also available, such as Lasso regression, Elastic Net, etc.
- Written by Aditya Shenoy and Samyuktha Prabhu
#IndiaStudents #MachineLearning #ComputerScience #DataScience #DataAnalytics #ArtificialIntelligence #Computers #Engineering #Regression #LinearRegression #LogisticRegression