Tutorial 3: Applying Linear Regression in Python
GitHub Repo: Tutorial 3: Applying Linear Regression
In this tutorial, we will learn the Python way of applying a model to a cleaned dataset and visualising how good the model is. You will learn:
1. How to apply a model to a cleaned dataset
2. How to evaluate how good the model is
Context
The goal of our exploration was to identify which factors strongly influence the happiness of a country, and which do not.
GitHub : Tutorial 2: Exploratory Data Analysis
LinkedIn Article : Tutorial 2 : Exploratory Data Analysis
The conclusions from our exploratory data analysis were:
- Countries from Western Europe are the happiest
- Economy (GDP) and Health are the biggest predictors of Happiness in a country
- Interestingly, freedom in a country shows only a moderate (roughly 50%) correlation with happiness
Clean the dataset
Cleaning the dataset involves identifying the dependent and independent variables. This is also the point where we decide which inputs (x) to give the model. The code is given in the snapshot below, or you can access the GitHub repo linked at the top of this article.
In the above code you can see that the columns removed are:
1. Country (we use pd.get_dummies to create numerical indicator columns for each country)
2. Region (we use pd.get_dummies to create numerical indicator columns for each region)
3. Happiness Rank
4. Happiness Score
5. Standard Error
By printing the shape of the matrix, we can see the size of the dataset: 158 examples (rows) and 16 input columns. In other words, there are 16 independent variables and one dependent variable, the Happiness Score.
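The cleaning step above can be sketched as follows. This is a minimal illustration using a tiny stand-in DataFrame whose column names match the 2015 World Happiness Report file but whose values are illustrative only; the actual repo loads the full CSV.

```python
import pandas as pd

# Tiny stand-in for the 2015 World Happiness Report data
# (column names match the real file; values are illustrative only)
df = pd.DataFrame({
    "Country": ["Switzerland", "Iceland", "Denmark"],
    "Region": ["Western Europe", "Western Europe", "Western Europe"],
    "Happiness Rank": [1, 2, 3],
    "Happiness Score": [7.587, 7.561, 7.527],
    "Standard Error": [0.034, 0.049, 0.033],
    "Economy (GDP per Capita)": [1.397, 1.302, 1.325],
    "Freedom": [0.665, 0.629, 0.649],
})

# Dependent variable (target)
y = df["Happiness Score"]

# Drop the rank/score/error columns, then one-hot encode
# Country and Region to get purely numerical inputs
X = pd.get_dummies(
    df.drop(columns=["Happiness Rank", "Happiness Score", "Standard Error"]),
    columns=["Country", "Region"],
)

print(X.shape)  # rows x input columns
```

On the full dataset, printing `X.shape` is what shows the 158 rows and 16 input columns mentioned above.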
Split the dataset to training and test data
Once we have the cleaned dataset and have decided on the dependent and independent variables, we need to split the data into training and test sets. The training set is what we train the model on; the test set is what we use to check the model's accuracy.
From the above, we are clear on the input features. We use scikit-learn's train_test_split to split the data (imported from sklearn.cross_validation in older versions of scikit-learn; in current versions this module is sklearn.model_selection).
It is recommended that the training set be larger than the test set. Some people opt for an 80-20 split; this code uses a 75-25 split.
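A sketch of the split, using random stand-in arrays in place of the real feature matrix and target (the 158x16 shape mirrors the cleaned dataset described above):

```python
import numpy as np
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in old versions

rng = np.random.default_rng(0)
X = rng.normal(size=(158, 16))   # stand-in for the 16 input features
y = rng.normal(size=158)         # stand-in for the Happiness Score

# 75-25 split, matching the tutorial; random_state makes it reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

print(X_train.shape, X_test.shape)
```

`test_size=0.25` reserves a quarter of the rows for evaluation; swap in `test_size=0.2` for an 80-20 split.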
Using the statsmodels.api library to apply Linear Regression
Since the output is continuous, the model to apply is linear regression. Note that logistic regression is for binary outputs (0/1).
Once the training and test sets are ready, we fit the training data to the model as shown in the code above. We name the fitted linear regression model 'model', then use it to produce the predictions y_pred.
Viewing the accuracy of the model – RMSE (Root Mean Square Error)
In this section, we evaluate how good the model is. The code does three things:
1. Calculates the RMSE (Root Mean Squared Error)
2. Plots a line graph of the predicted and the test values
3. Plots a scatter plot of the predicted and the test values
In the example below, the predicted values match the actual values closely, and the root mean squared error is very low, indicating that the model fits the data well.
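These evaluation steps can be sketched as follows. The `y_test` and `y_pred` arrays here are small illustrative stand-ins; in the repo they come from the fitted model. The plot is saved to a hypothetical file name (`model_fit.png`) rather than shown interactively.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error

# Stand-in actuals and predictions (replace with y_test / y_pred from the model)
y_test = np.array([7.2, 5.1, 4.8, 6.3, 3.9])
y_pred = np.array([7.0, 5.3, 4.6, 6.4, 4.1])

# RMSE: square root of the mean squared error
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"RMSE: {rmse:.3f}")

# 1) Line graph of actual vs. predicted, 2) scatter of predicted against actual
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(y_test, label="actual")
ax1.plot(y_pred, label="predicted")
ax1.legend()
ax2.scatter(y_test, y_pred)
ax2.set_xlabel("actual")
ax2.set_ylabel("predicted")
fig.savefig("model_fit.png")
```

A low RMSE relative to the spread of the Happiness Score (roughly 3 to 8) means the predictions sit close to the actual values, which is what the plots confirm visually.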