Tutorial 3: Applying Linear Regression in Python


  GitHub Repo: Tutorial 3: Applying Linear Regression

In this tutorial, we will learn how to apply a model to a cleaned dataset in Python and how to visualize how well the model performs. You will learn:

1.      How to apply a model to a cleaned dataset

2.      How to evaluate how well the model performs

Context

The goal of our exploration was to accurately identify which factors strongly influence happiness in a country, and which factors do not.

GitHub : Tutorial 2: Exploratory Data Analysis

LinkedIn Article : Tutorial 2 : Exploratory Data Analysis

The conclusions from our exploratory data analysis were:

  1. Countries from Western Europe are the happiest
  2. Economy (GDP) and Health are the biggest predictors of happiness in a country
  3. It is interesting that freedom in a country correlates with happiness only at around 50%

Clean the dataset

Cleaning the dataset involves identifying all the dependent and independent variables. At this point we also decide which inputs (x) we want to give the model. The code is given in the snapshot below, or you can access the GitHub repo linked at the top of this article.

In the above code you can see that the columns that have been removed are:

1.      Country (as we have used pd.get_dummies to create numerical values for countries)

2.      Region (as we have used pd.get_dummies to create numerical values for regions)

3.      Happiness Rank

4.      Happiness Score

5.      Standard Error

By printing the shape of the matrix, we can see the dimensions of the dataset: 158 examples (rows) and 16 input features. As a result, there are 16 independent variables and 1 dependent variable, the Happiness Score.
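The cleaning step above can be sketched as follows. This is an illustrative reconstruction, not the repo's exact code: the tiny DataFrame stands in for the World Happiness Report data, with column names following the 2015 report and toy values.

```python
import pandas as pd

# Toy stand-in for the World Happiness Report dataset
# (column names follow the 2015 report; values are for illustration only)
df = pd.DataFrame({
    "Country": ["Switzerland", "Iceland", "Denmark"],
    "Region": ["Western Europe", "Western Europe", "Western Europe"],
    "Happiness Rank": [1, 2, 3],
    "Happiness Score": [7.587, 7.561, 7.527],
    "Standard Error": [0.03411, 0.04884, 0.03328],
    "Economy (GDP per Capita)": [1.39651, 1.30232, 1.32548],
    "Health (Life Expectancy)": [0.94143, 0.94784, 0.87464],
    "Freedom": [0.66557, 0.62877, 0.64938],
})

# One-hot encode the categorical columns
dummies = pd.get_dummies(df[["Country", "Region"]])

# The dependent variable (y) is the Happiness Score; the dropped columns
# either leak the target (Rank, Score, Standard Error) or are replaced
# by their dummy encodings (Country, Region)
y = df["Happiness Score"]
X = pd.concat(
    [df.drop(columns=["Country", "Region", "Happiness Rank",
                      "Happiness Score", "Standard Error"]),
     dummies],
    axis=1,
)

print(X.shape)  # (examples, features)
```

On the real dataset this produces the 158 x 16 feature matrix described above; here the toy frame yields 3 rows and 7 features.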

Split the dataset to training and test data

Once we have the cleaned dataset and have decided on the dependent and independent variables, we need to split the data into training and test sets. The training set is used to train our model; the test set is used to check the model's accuracy.

From the above, we are clear on the input features. We use sklearn's train_test_split to split the data into training and test sets (this lived in sklearn.cross_validation in older scikit-learn releases; it is now in sklearn.model_selection).

It is recommended that the training set be larger than the test set. Some people opt for an 80-20 split; this code makes a 75-25 split.
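A minimal sketch of the split, assuming the current scikit-learn API; X and y here are random stand-ins with the same shape as the cleaned dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-ins shaped like the cleaned dataset: 158 rows, 16 features
rng = np.random.default_rng(0)
X = rng.random((158, 16))
y = rng.random(158)

# test_size=0.25 gives the 75-25 split used in this tutorial;
# random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

print(X_train.shape, X_test.shape)  # 75% vs 25% of the 158 rows
```

With 158 rows, a 0.25 test fraction puts 118 rows in the training set and 40 in the test set.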

Using the statsmodels.api library to apply Linear Regression

Since the output is continuous, the model to apply is linear regression. Note that logistic regression is used for binary outputs (0/1).

Once we have the training and test sets ready, we fit the data to the model as shown in the code above. We name the fitted linear regression model 'model' and then use it to output the predictions, y_pred.

 Viewing the accuracy of the model – RMSE (Root Mean Square Error)

In this section, we evaluate how good our model is. The code does the following:

1.      The RMSE (Root Mean Squared Error) is calculated.

2.      A line graph of the predicted and the test values is plotted.

3.      A scatter plot of the predicted and the test values is plotted.

In this example, the predicted values match the actual values very closely. The root mean squared error is very low, indicating that our model fits the data well!
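The three evaluation steps can be sketched as below. Here y_test and y_pred are synthetic stand-ins, with predictions generated close to the actuals so the RMSE and plots behave like the tutorial's example:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render plots off-screen
import matplotlib.pyplot as plt

# Synthetic stand-ins: actual scores in a happiness-like range,
# predictions equal to the actuals plus small noise
rng = np.random.default_rng(0)
y_test = 4 + 3 * rng.random(40)
y_pred = y_test + rng.normal(0, 0.05, 40)

# 1. RMSE: square root of the mean squared prediction error
rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))
print(f"RMSE: {rmse:.4f}")

# 2. Line graph of the predicted and the test values
plt.figure()
plt.plot(y_test, label="actual")
plt.plot(y_pred, label="predicted")
plt.legend()
plt.savefig("line.png")

# 3. Scatter plot of the predicted and the test values
plt.figure()
plt.scatter(y_test, y_pred)
plt.xlabel("actual")
plt.ylabel("predicted")
plt.savefig("scatter.png")
```

A small RMSE relative to the spread of the Happiness Score (roughly 3 to 8 in the dataset) is what signals a good fit; the scatter plot should hug the diagonal.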


