SIMPLE LINEAR REGRESSION IN PYTHON :

In this article, I will talk about one of the most important machine learning algorithms: Simple Linear Regression (SLR). In this tutorial, you will learn how to do simple linear regression in Python.

Overview of this article :

* What is Simple Linear Regression?

* Why Linear?

* Equation of straight line

* Simple Linear Regression in Python

* Import required libraries and dataset for Simple Linear Regression

* Understanding the dataset in Python

* Split into Independent and Dependent values

* Visualizing the dataset

* Split the dataset into training and test set

* Simple Linear Regression model fitting

* Predict for test set values

* Making single prediction

* Residual

* R2_Score (Co-efficient of determination)

* Residual Sum of Squares (RSS)

* Total Sum of Squares (TSS)

* Getting the final regression equation with the values of the coefficients

1. Simple Linear Regression :

Simple Linear Regression is one of the most widely used algorithms in Supervised Machine Learning. It is used to find the relationship between one independent variable (X) and one dependent variable (Y).

2. Why Linear ?

The relationship between X and Y is modelled as points arranged in, or extending along, a straight or nearly straight line. That is why it is called linear.

Equation of straight line :

Y = mX + c

Y = Dependent variable (outcome)

(The variable that needs to be estimated and predicted)

X = Independent variable (feature)

(The variable that is controllable. It is the input)

m = Slope

(It determines the angle of the line. It is the parameter usually denoted as β)

c = Intercept

(A constant that gives the value of Y when X is 0)
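
As a small illustration of the straight-line equation, the sketch below evaluates Y = mX + c for a few inputs. The function name and the slope and intercept values are made up for this example and are not taken from the dataset used later.

```python
def predict_line(x, m, c):
    """Return Y for a given X using the straight-line equation Y = mX + c."""
    return m * x + c


# Hypothetical slope and intercept, chosen only for illustration.
m = 2.0  # slope: Y rises by 2 for each unit increase in X
c = 1.0  # intercept: the value of Y when X is 0

for x in [0, 1, 2, 3]:
    print(x, predict_line(x, m, c))
```

When X is 0 the result equals the intercept c, and each step of 1 in X changes Y by the slope m.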

3. Simple Linear Regression in Python :

Simple Linear Regression is commonly used for predictive analysis. In this tutorial, you will learn how to do simple linear regression in Python.

Goal : Predict the percentage score of a student based on the number of study hours.

3.1 Import required libraries and dataset :

The name pandas comes from "panel data" and is also read as "Python Data Analysis". It is used to analyze data and create data frames.

NumPy stands for "Numerical Python" and it is used for working with arrays.

Matplotlib and Seaborn are used for data visualization.

Train & Test split (train_test_split in scikit-learn) is used to split our dataset into a training set and a test set.

Linear Regression (LinearRegression in scikit-learn) is used to fit the model on our dataset.

3.2 Import the Dataset :

I'm using a study dataset for this model. It has only 2 variables: Hours and Scores.

Hours : The number of hours the student studied per day.

Scores : The marks the student scored in the exam.
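
The loading step might look like the sketch below. The file used in this article is not named here, so the example reads a few made-up rows from an in-memory string instead. With a real file you would pass its path to pd.read_csv. The variable name df and the sample values are assumptions for illustration only.

```python
import io

import pandas as pd

# Made-up sample rows standing in for the real file (illustrative values only).
csv_text = """Hours,Scores
1.5,18
3.0,33
4.5,47
6.0,62
7.5,76
"""

# pd.read_csv accepts a file path or a file-like object such as StringIO.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())  # first rows of the data frame
```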

3.3 Understanding the dataset :

Our dataset is tiny, which makes it easy for beginners to understand. It has only 2 variables and 25 rows. One is Hours and the other is Scores.

Our dataset has no null values (missing values). If there were any missing values, we could replace them with the mean or median.

The Hours column has float values and the Scores column has integer values.

Describing our dataset gives the count, mean, median (50%), standard deviation, minimum, maximum, first quartile (25%), and third quartile (75%) of each column.
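
The checks described in this section can be sketched with pandas as follows. The data frame here is a small made-up stand-in with the same two column names, so the printed numbers will differ from the real dataset.

```python
import pandas as pd

# Hypothetical data; the values are illustrative, not the article's dataset.
df = pd.DataFrame({
    "Hours": [1.5, 3.0, 4.5, 6.0, 7.5],
    "Scores": [18, 33, 47, 62, 76],
})

print(df.shape)           # (number of rows, number of columns)
print(df.isnull().sum())  # count of missing values in each column
print(df.dtypes)          # data type of each column
print(df.describe())      # count, mean, std, min, quartiles and max
```

If isnull().sum() reported missing values, one option would be df.fillna(df.mean()) or df.fillna(df.median()).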

Split Independent and Dependent Values :

Now, we want to split our dataset into independent and dependent variables.

X is the independent variable (it is controllable and it is the input value)

Y is the dependent variable (the variable that needs to be predicted)
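
One possible way to do this split with pandas is sketched below. The data is made up for the example, and selecting columns by name is an assumption. Selecting by position with iloc would work as well.

```python
import pandas as pd

# Hypothetical data; the values are illustrative, not the article's dataset.
df = pd.DataFrame({
    "Hours": [1.5, 3.0, 4.5, 6.0, 7.5],
    "Scores": [18, 33, 47, 62, 76],
})

# Double brackets keep X two-dimensional, which scikit-learn expects for features.
X = df[["Hours"]]
# Single brackets return a one-dimensional Series for the target.
y = df["Scores"]

print(X.shape, y.shape)
```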

3.4 Visualizing the dataset :

If we had any categorical values, we would need to convert them into numerical values first.
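
For the visualization step, a scatter plot of hours against scores is a natural choice. The sketch below uses Matplotlib only, with made-up values and an assumed output file name. The Agg backend is selected so that the script also runs on a machine without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, no display needed
import matplotlib.pyplot as plt

# Hypothetical data; the values are illustrative, not the article's dataset.
hours = [1.5, 3.0, 4.5, 6.0, 7.5]
scores = [18, 33, 47, 62, 76]

fig, ax = plt.subplots()
ax.scatter(hours, scores)        # one point per student
ax.set_xlabel("Hours")
ax.set_ylabel("Scores")
ax.set_title("Hours vs Scores")
fig.savefig("hours_vs_scores.png")  # assumed file name
```

If the points fall roughly along a straight line, a linear model is a reasonable choice.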

Split our dataset into training set and test set :

The train and test split is a technique for evaluating the performance of a machine learning algorithm.

Training Set : It is used to fit the machine learning model

Test Set : It is used to evaluate the fitted machine learning model

We split the data using the Pareto rule (80:20) for the test size. That means 80% of our dataset goes to the training set and 20% goes to the test set.

If we do not set a random state, we get different train and test sets on each execution, because the shuffling is not reproducible. If we set the random state to a fixed value, we get the same train and test sets every time we execute that line.

After splitting our dataset into train and test sets, we have 20 rows in the training set and 5 rows in the test set.
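
A minimal sketch of an 80:20 split with scikit-learn's train_test_split is shown below. The 25 rows are generated only to reproduce the split sizes, and random_state=0 is an assumed value, since any fixed integer makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset with 25 rows, generated only to show the split sizes.
hours = np.linspace(1.0, 9.0, 25).reshape(-1, 1)  # features, shape (25, 1)
scores = 2.0 + 10.0 * hours.ravel()               # made-up target values

# test_size=0.2 sends 20% of the rows to the test set.
X_train, X_test, y_train, y_test = train_test_split(
    hours, scores, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)
```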

3.5 Simple Linear Regression model fitting :

Here, we fit our simple linear regression model on the independent and dependent training data.
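
The fitting step can be sketched as follows, assuming scikit-learn's LinearRegression. The data is made up and perfectly linear (scores = 2 + 10 * hours), so the fitted coefficient and intercept should come out very close to 10 and 2. Real data will not be this clean.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Made-up, perfectly linear data used only for illustration.
hours = np.linspace(1.0, 9.0, 25).reshape(-1, 1)
scores = 2.0 + 10.0 * hours.ravel()

X_train, X_test, y_train, y_test = train_test_split(
    hours, scores, test_size=0.2, random_state=0
)

# Fit the model on the training data only.
regressor = LinearRegression()
regressor.fit(X_train, y_train)

print(regressor.coef_, regressor.intercept_)
```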

3.6 Predict for test set values :

These are the predicted values for our test data.
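
Predicting for the test set might be sketched as below, again on made-up, perfectly linear data rather than the real dataset. The name y_pred is an assumption for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Made-up, perfectly linear data used only for illustration.
hours = np.linspace(1.0, 9.0, 25).reshape(-1, 1)
scores = 2.0 + 10.0 * hours.ravel()

X_train, X_test, y_train, y_test = train_test_split(
    hours, scores, test_size=0.2, random_state=0
)
regressor = LinearRegression().fit(X_train, y_train)

# One predicted score for each row of the test set.
y_pred = regressor.predict(X_test)

# Print actual and predicted values side by side.
for actual, predicted in zip(y_test, y_pred):
    print(actual, predicted)
```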

3.7 Making single prediction:

But how can we know how reliable our predictions are? We can use the R2_Score to measure how well the model fits the data.
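
A single prediction can be sketched as below. The model is fitted on made-up data (scores = 2 + 10 * hours), and the input of 5.0 hours is an arbitrary value chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up, perfectly linear data used only for illustration.
hours = np.linspace(1.0, 9.0, 25).reshape(-1, 1)
scores = 2.0 + 10.0 * hours.ravel()

regressor = LinearRegression().fit(hours, scores)

# predict expects a 2-D array: one row per sample, one column per feature.
single_pred = regressor.predict(np.array([[5.0]]))

print(single_pred[0])
```

Note the double brackets around the input. Passing a plain number or a 1-D list raises an error, because scikit-learn expects a 2-D array of features.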

3.8 Residual :

The residual for each observation is the difference between the observed value of y and the predicted value of y.

How to calculate the residual ?

RESIDUAL = ACTUAL Y VALUE - PREDICTED Y VALUE

$r_i = y_i - \hat{y}_i$
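
The residual formula can be applied row by row, as in this small sketch. The actual and predicted values are made up for the example.

```python
# Hypothetical actual and predicted scores (illustrative values only).
actual = [20, 30, 40, 50]
predicted = [22, 28, 41, 49]

# Residual = actual value minus predicted value, for each observation.
residuals = [y - y_hat for y, y_hat in zip(actual, predicted)]

print(residuals)  # [-2, 2, -1, 1]
```

A negative residual means the model predicted too high, and a positive residual means it predicted too low.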

3.9 R2_Score (Co-efficient of determination) :

R squared (Co-efficient of determination) is computed as

$R^2 = 1 - \frac{RSS}{TSS}$

RSS stands for Residual Sum of Squares, and TSS stands for Total Sum of Squares.

3.9.1 RSS:

Residual sum of squares = $\sum_{i} (y_i - \hat{y}_i)^2$

RSS measures the variation in the actual values that the regression model does not explain. It estimates the level of error in the model's predictions. A smaller RSS is better.

3.9.2 TSS:

Total sum of squares = $\sum_{i} (y_i - \bar{y})^2$

First, we find the mean of y.

Then we subtract the mean from each actual y value.

After that, we square those differences and add them up.

R squared is the fraction of the variance in the actual values that the model explains, compared with always predicting the mean of the actual values. This value usually lies between 0 and 1, and a value closer to 1 means a better fit.
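
The steps for RSS, TSS and R squared can be put together in a few lines of plain Python. This is a sketch with made-up values, and the function name r2_from_sums is hypothetical.

```python
def r2_from_sums(actual, predicted):
    """Return (RSS, TSS, R squared) for paired actual and predicted values."""
    mean_y = sum(actual) / len(actual)
    # RSS: squared differences between actual and predicted values.
    rss = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))
    # TSS: squared differences between actual values and their mean.
    tss = sum((y - mean_y) ** 2 for y in actual)
    return rss, tss, 1 - rss / tss


# Hypothetical actual and predicted scores (illustrative values only).
actual = [20, 30, 40, 50]
predicted = [22, 28, 41, 49]

rss, tss, r2 = r2_from_sums(actual, predicted)
print(rss, tss, r2)
```

Here RSS is 10 and TSS is 500, so R squared is about 0.98. In scikit-learn, r2_score(y_true, y_pred) from sklearn.metrics computes the same quantity.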

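We got 0.94, which means our model explains about 94% of the variance in the scores. It is a measure of fit, not the probability that a student will score the predicted 93.69%.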

3.10 Getting the final regression equation with the values of the coefficients :

Simple Linear Regression Formula : y = b0 + b1x1

Therefore, the equation of our simple linear regression model is

Scores = regressor intercept + (regressor coefficient * Number of Study Hours)

Or

Scores = 2.018160041434683 + (9.91065648 * Number of Study Hours)
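
The final equation can be used directly, without the fitted model object. The sketch below plugs in the intercept and coefficient from the equation above. The function name predict_score and the input of 9.25 hours are assumptions for illustration.

```python
# Intercept and coefficient as reported in the equation above.
INTERCEPT = 2.018160041434683
SLOPE = 9.91065648


def predict_score(hours):
    """Return the predicted score for a given number of study hours."""
    return INTERCEPT + SLOPE * hours


hours = 9.25  # assumed input, chosen for illustration
print(round(predict_score(hours), 2))  # 93.69
```

With 9.25 hours the equation gives about 93.69, which matches the predicted score quoted in the R2_Score section.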

Author : SANTHOSH KUMAR M