Tutorial 3: Applying Linear Regression in Python
GitHub Repo: Tutorial 3: Applying Linear Regression
In this tutorial, we will learn the Python way of applying a model to a cleaned dataset and visualising how good the model is. You will learn:
1. How to apply a model to a cleaned dataset
2. How to evaluate how good the model is
Context
The goal of our exploration was to identify which factors strongly influence the happiness of a country, and which do not.
GitHub : Tutorial 2: Exploratory Data Analysis
LinkedIn Article : Tutorial 2 : Exploratory Data Analysis
The conclusions from our exploratory data analysis were:
- Countries from Western Europe are the happiest
- Economy (GDP) and Health are the biggest predictors of Happiness in a country
- Interestingly, freedom in a country shows only a moderate (roughly 50%) correlation with happiness
Clean the dataset
Cleaning the dataset involves identifying the dependent and independent variables. This is also the point where we decide which inputs (x) to give the model. The code is given in the snapshot below, or you can access the GitHub repo linked at the top of this article.
In the above code you can see that the columns removed are:
1. Country (we use pd.get_dummies to create numerical indicator columns for each country)
2. Region (we use pd.get_dummies to create numerical indicator columns for each region)
3. Happiness Rank
4. Happiness Score
5. Standard Error
By printing the shape of the matrix, we can see the size of the dataset: 158 examples (rows) and 16 input columns. In other words, there are 16 independent variables and one dependent variable, the Happiness Score.
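The cleaning step above can be sketched as follows. This is a minimal illustration using a tiny stand-in DataFrame whose column names match the 2015 World Happiness Report file but whose values are illustrative only; the actual repo loads the full CSV.

```python
import pandas as pd

# Tiny stand-in for the 2015 World Happiness Report data
# (column names match the real file; values are illustrative only)
df = pd.DataFrame({
    "Country": ["Switzerland", "Iceland", "Denmark"],
    "Region": ["Western Europe", "Western Europe", "Western Europe"],
    "Happiness Rank": [1, 2, 3],
    "Happiness Score": [7.587, 7.561, 7.527],
    "Standard Error": [0.034, 0.049, 0.033],
    "Economy (GDP per Capita)": [1.397, 1.302, 1.325],
    "Freedom": [0.665, 0.629, 0.649],
})

# Dependent variable (target)
y = df["Happiness Score"]

# Drop the rank/score/error columns, then one-hot encode
# Country and Region to get purely numerical inputs
X = pd.get_dummies(
    df.drop(columns=["Happiness Rank", "Happiness Score", "Standard Error"]),
    columns=["Country", "Region"],
)

print(X.shape)  # rows x input columns
```

On the full dataset, printing `X.shape` is what shows the 158 rows and 16 input columns mentioned above.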
Split the dataset to training and test data
Once we have the cleaned dataset and have decided on the dependent and independent variables, we need to split the data into training and test sets. The training set is what we train the model on; the test set is what we use to check the model's accuracy.
From the above, we are clear on the input features. We use scikit-learn's train_test_split to split the data (imported from sklearn.cross_validation in older versions of scikit-learn; in current versions this module is sklearn.model_selection).
It is recommended that the training set be larger than the test set. Some people opt for an 80-20 split; this code uses a 75-25 split.
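A sketch of the split, using random stand-in arrays in place of the real feature matrix and target (the 158x16 shape mirrors the cleaned dataset described above):

```python
import numpy as np
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in old versions

rng = np.random.default_rng(0)
X = rng.normal(size=(158, 16))   # stand-in for the 16 input features
y = rng.normal(size=158)         # stand-in for the Happiness Score

# 75-25 split, matching the tutorial; random_state makes it reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

print(X_train.shape, X_test.shape)
```

`test_size=0.25` reserves a quarter of the rows for evaluation; swap in `test_size=0.2` for an 80-20 split.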
Using the statsmodels.api library to apply Linear Regression
Since the output is continuous, the model to apply is linear regression. Note that logistic regression is for binary outputs (0/1).
Once the training and test sets are ready, we fit the training data to the model as shown in the code above. We name the fitted linear regression model 'model', then use it to produce the predictions y_pred.
Viewing the accuracy of the model – RMSE (Root Mean Square Error)
In this section, we evaluate how good the model is. The code does three things:
1. Calculates the RMSE (Root Mean Squared Error)
2. Plots a line graph of the predicted and the test values
3. Plots a scatter plot of the predicted and the test values
In the example below, the predicted values match the actual values closely, and the root mean squared error is very low, indicating that the model fits the data well.
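These evaluation steps can be sketched as follows. The `y_test` and `y_pred` arrays here are small illustrative stand-ins; in the repo they come from the fitted model. The plot is saved to a hypothetical file name (`model_fit.png`) rather than shown interactively.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error

# Stand-in actuals and predictions (replace with y_test / y_pred from the model)
y_test = np.array([7.2, 5.1, 4.8, 6.3, 3.9])
y_pred = np.array([7.0, 5.3, 4.6, 6.4, 4.1])

# RMSE: square root of the mean squared error
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"RMSE: {rmse:.3f}")

# 1) Line graph of actual vs. predicted, 2) scatter of predicted against actual
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(y_test, label="actual")
ax1.plot(y_pred, label="predicted")
ax1.legend()
ax2.scatter(y_test, y_pred)
ax2.set_xlabel("actual")
ax2.set_ylabel("predicted")
fig.savefig("model_fit.png")
```

A low RMSE relative to the spread of the Happiness Score (roughly 3 to 8) means the predictions sit close to the actual values, which is what the plots confirm visually.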