Applied Machine Learning: Linear Regression, LassoCV, ElasticNet, RidgeCV, and xgboost

Want to build a resume-worthy machine learning project with real-life significance in just a couple of hours? Are you a beginner in this field with no clue where to start? You are in the right place. Read on!

There has been a lot of buzz around Artificial Intelligence and Machine Learning in the last few years. That is not surprising, considering that our existence is full of patterns. To me, machine learning is, at a fundamental level, simply pattern recognition.

If you want to learn the basics, you should first study statistics, probability, and the fundamentals of programming, and then the core algorithms of prediction and pattern recognition. However, this article is not about any of that. There are plenty of online courses and articles to learn it all.

What I am going to cover here is applied machine learning. This series is not concerned with the inner workings of various algorithms; it focuses on where and how to apply these models to problems with real-life significance.

I am assuming that my reader is an absolute beginner to this field (and is using a Windows PC). So, without further delay, let’s get started in building your first project.

Step 1: Install the Anaconda Framework

  • Download the installation setup here. Select the setup based on your PC specification.
  • Install this in your system. It will install Anaconda Prompt and Spyder IDE as well.
  • These can be simply accessed from the search bar on Windows.
  • We will be using Spyder IDE to do all our programming.

Step 2: Choose Your Domain of Application

  • The three major areas in applied machine learning would be Computer Vision and Image Analysis, Speech Recognition and Natural Language Processing, and Prediction Analysis.
  • In this tutorial, I’ll cover a classic prediction problem.

Step 3: Choosing Your Dataset

  • This is one of the most important steps of your project.
  • Every machine learning project will need a dataset for training and testing purposes. Kaggle is an excellent place to pick your dataset.
  • Go to the above hyperlink and search for relevant datasets. You can also search for datasets elsewhere.
  • If you are unsure what kind of problem you want to solve, you can also pick a dataset first and then decide what to do with it.
  • One of the datasets I spotted is the ‘House Sales in King County, USA’ dataset, which I shall use for predicting the price of a house (more on this in a moment). For prediction purposes, you can choose from a wide variety of sales datasets, product price datasets, sports datasets to predict who will win, and so on.
  • The above choice is completely arbitrary and you can choose any dataset that you may find interesting.

Step 4: Understanding What is Inside Your Dataset

  • Extract the downloaded zip file and examine the contents of the dataset you chose. In my case, it is as follows:
  • The House Sales Dataset has various features (19 to be precise) of houses along with their price mentioned. Let’s take all these features as X and the Price as Y, which we want to predict.
  • Now open Spyder IDE and create a new project. Copy the dataset file into the project folder.

For any project, the very first step would be importing the required libraries. This is done as follows-

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  #and so on...

In this way, all the required libraries can be imported. If a required library is not already installed, you can launch ‘Anaconda Prompt’ from the Start menu search bar and install it with a simple pip command.

pip install pandas #This command would automatically install pandas

In a similar manner, you can install any library, provided you know its name. The pandas library is generally used for reading datasets.

A small tip before going further- Whenever you use Spyder to code, use ‘#%%’ to divide your code into cells. You can then execute individual cells separately using ‘Ctrl + Enter’. This approach keeps the code clean and makes debugging easier.
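For example, a script divided into cells could look like the following sketch (the ‘#%%’ markers are the Spyder-specific part; the file name assumes the same King County dataset used later in this article):

#%% Cell 1: imports
import pandas as pd

#%% Cell 2: load the data (assumes kc_house_data.csv is in the project folder)
data = pd.read_csv('kc_house_data.csv')

#%% Cell 3: a quick check, runnable on its own with Ctrl + Enter
print(data.shape)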

Step 5: Building Your Model- Predicting the Price of a House

  • Importing the basic libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
  • Reading our dataset-
data = pd.read_csv('kc_house_data.csv')
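Before cleaning anything, it helps to take a quick look at what was actually loaded. An optional sanity check (these pandas calls simply print the size, column names, and first rows of the dataset):

print(data.shape)    #number of rows and columns
print(data.columns)  #all column names, including 'price'
print(data.head())   #first five rows of the dataset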
  • Cleaning the data- Let us remove unwanted or irrelevant data from our dataset. Which columns to drop can be based on your intuition; here I’m removing a couple as an example. The following line removes the ‘date’ and ‘zipcode’ columns from the dataset.
data = data.drop(['date', 'zipcode'], axis = 1)
  • Our problem has a lot of known features (X) (like the number of bedrooms, living area, etc.) on which the price (Y) depends. We shall now plot a correlation matrix to understand which of the features most strongly influence the price of a house. This is visualized using the seaborn library, and this step helps us understand our data better.
corr = data.corr()
mask = np.zeros_like(corr, dtype=bool)  #np.bool is deprecated in recent NumPy versions; plain bool works the same way
mask[np.triu_indices_from(mask)] = True
f, ax = plt.subplots(figsize=(12, 9))
cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0, square=True, linewidths=.5, cbar_kws={"shrink": .5})

(Dark pink blocks indicate higher correlation among the corresponding features)
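If the heatmap is hard to read, the same information can be listed numerically. A small optional addition, using the corr matrix computed above:

#Features most correlated with price, strongest first
print(corr['price'].sort_values(ascending=False))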

  • Choosing an algorithm- Whenever we have a situation where ‘Y’ depends on several ‘X’, we can approach it using Linear Regression, where we try to establish a linear relation between Y and all the X’s, i.e., Y = a1X1 + a2X2 + … and so on. So, given X1, X2, …, we can predict the value of Y provided we know the values of a1, a2, and so on. That is exactly what we are trying to do here: finding the best approximations for a1, a2, and so on (a quick way to inspect these learned coefficients is sketched right after the training code below). Generally, the dataset is split into 70% training data and 30% testing data, on which the trained model is evaluated to measure its performance.
features = ['bedrooms', 'bathrooms', 'sqft_living', 
'sqft_lot', 'floors', 'waterfront', 'view', 'grade', 
'sqft_above', 'sqft_basement', 'condition', 'yr_built', 
'yr_renovated', 'lat', 'long', 'sqft_living15', 'sqft_lot15']

X = data[features]
y = data.price
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

regressor = LinearRegression()
regressor.fit(X_train, y_train)     #This line trains our model on the 70% training split
y_pred = regressor.predict(X_test)  #This line predicts prices for the 30% test split
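Since linear regression literally learns the coefficients a1, a2, … described above, you can inspect them once the model is fitted. A minimal sketch using scikit-learn’s coef_ and intercept_ attributes:

#Learned weight (a1, a2, ...) for each feature, plus the intercept
coefficients = pd.Series(regressor.coef_, index=features)
print(coefficients.sort_values())
print("Intercept:", regressor.intercept_)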
  • Our model is now trained and tested. We can use some standard metrics to measure how well our model performed.
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
from sklearn import metrics 
 
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
df.head()

#Results
Mean Absolute Error:     125002.07442244422
Mean Squared Error:      43690646677.93878
Root Mean Squared Error: 209023.07690285967

#First 5 Predictions of our model
Actual     Predicted
297000.0   3.872015e+05
1578000.0  1.502715e+06
562100.0   5.274534e+05
631500.0   5.779358e+05
780000.0   9.993390e+05
  • Let us now plot our predictions against the actual prices for a better understanding. Points lying along a straight line with a slope of 45 degrees would indicate a perfect model (a way to overlay that reference line is sketched right after the plotting code below).
plt.scatter(y_test, y_pred)
plt.xlabel("Prices")
plt.ylabel("Predicted prices")
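To make the 45-degree reference visible, you can overlay the identity line on the same scatter plot. A minimal sketch; any point lying exactly on this line would be a perfectly predicted house:

lims = [min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())]
plt.plot(lims, lims, 'r--', linewidth=1)  #y = x reference line
plt.show()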
  • It may seem as if we have a lot of error. The performance is admittedly poor, but the model was still able to give an estimate of the price. As can be seen in the first five predictions, for example, our model predicted the price of a house to be 1,502,715 while the actual price was 1,578,000. This is fine for a very rough estimate, but there is huge scope for improvement.
  • Improving our model: Keep a note of the Root Mean Squared Error: 209023.07690285967 for our model. Let us see if we can reduce this error. Here are some bonus techniques for you to try.
  • I’ll be implementing three more advanced techniques, namely LassoCV, ElasticNet, and RidgeCV. These are solid prediction models that address some of the shortcomings of plain linear regression. For now, you can simply familiarize yourself with these techniques and see how they are implemented in practice; once you master linear regression, you should definitely look into how these models work under the hood. Here, CV stands for Cross-Validation. Notice that we don’t use X_test or y_test while evaluating these models. That is because cross-validation works on the whole dataset: the training process is repeated several times, each time holding out a different subset as test data and training on the rest, and the reported error is averaged over all of these iterations. For LassoCV and RidgeCV, the alpha with the best cross-validated score is the one selected (a way to see which alpha was picked is sketched after the results below).
from sklearn.linear_model import LassoCV, RidgeCV, ElasticNet
from sklearn.model_selection import cross_val_score


#Implementation of LassoCV
lasso = LassoCV(alphas=[0.0001, 0.001, 0.01, 0.1, 1, 10, 100])
print("Root Mean Squared Error (Lasso): ", np.sqrt(-cross_val_score(lasso, X, y, cv=10, scoring='neg_mean_squared_error')).mean())


#Implementation of ElasticNet
elastic = ElasticNet(alpha=0.001)
print("Root Mean Squared Error (ElasticNet): ", np.sqrt(-cross_val_score(elastic, X, y, cv=10, scoring='neg_mean_squared_error')).mean())


#Implementation of RidgeCV
ridge = RidgeCV(alphas=[0.0001, 0.001, 0.01, 0.1, 1, 10, 100])
print("Root Mean Squared Error (Ridge): ", np.sqrt(-cross_val_score(ridge, X, y, cv=10, scoring='neg_mean_squared_error')).mean())


#Results
Root Mean Squared Error (Lasso):       203421.22072610114
Root Mean Squared Error (ElasticNet):  203442.40673916895
Root Mean Squared Error (Ridge):       203411.816202574
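As an optional aside, if you fit LassoCV or RidgeCV directly (instead of passing them to cross_val_score), you can see which alpha the cross-validation selected from the list we supplied:

lasso.fit(X, y)
ridge.fit(X, y)
print("Alpha chosen by LassoCV:", lasso.alpha_)
print("Alpha chosen by RidgeCV:", ridge.alpha_)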
  • As you can see, the best we could do is bring the error down to about 203,411 using RidgeCV, which is still not very impressive. Now, let me try another method, known as xgboost, and see how it performs.
#Implementation of xgboost
import xgboost as xgb
regr = xgb.XGBRegressor(colsample_bytree=0.2, gamma=0.0, learning_rate=0.01,
                        max_depth=4, min_child_weight=1.5, n_estimators=7200,
                        reg_alpha=0.9, reg_lambda=0.6, subsample=0.2,
                        seed=42, silent=1)
regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)
print("Root Mean Squared Error (Xgboost): ", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

plt.scatter(y_test, y_pred)
plt.xlabel("Prices")
plt.ylabel("Predicted prices")
plt.title("xgboost")

#Result
Root Mean Squared Error (Xgboost):  122692.65863401273

#The error is almost half the previous models!
  • And there you go! That’s a significant improvement over our previous models. So far, xgboost has turned out to be the best model for this problem. xgboost is a fast and robust gradient-boosting library that tends to give high performance in prediction problems. Also, please note that depending on the size and nature of the dataset you selected, the performance of your models could be much better than what I have got here.
  • Let’s check how well our model works on a case-by-case basis by looking at the first five predictions.
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df.head()

#Result with xgboost
Actual     Predicted
297000.0   2.918043e+05
1578000.0  1.695336e+06
562100.0   5.116367e+05
631500.0   5.994548e+05
780000.0   6.963796e+05

This is much better than linear regression. For example, check the fourth prediction. 631,500 was the actual price and our model came as close as 599,454.
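If you are curious which features the boosted model relied on most, XGBRegressor exposes a feature_importances_ attribute after fitting. A short optional sketch:

#Importance score of each feature as learned by the boosted trees
importances = pd.Series(regr.feature_importances_, index=features)
print(importances.sort_values(ascending=False).head(10))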

With that, we can conclude our first prediction project! You can now use the trained model to estimate the price of a real house, provided you have values for all the ‘X’ features used in the dataset, with just this line of code (a concrete example follows right after).

Predicted_price = model_name.predict(X)
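For example, to estimate the price of one specific house, build a single-row DataFrame with the same feature columns used for training and pass it to the fitted model. The numbers below are made-up placeholder values for illustration, not a real listing:

#Hypothetical house; every value here is a placeholder
new_house = pd.DataFrame([{
    'bedrooms': 3, 'bathrooms': 2, 'sqft_living': 1800, 'sqft_lot': 5000,
    'floors': 1, 'waterfront': 0, 'view': 0, 'grade': 7,
    'sqft_above': 1800, 'sqft_basement': 0, 'condition': 3, 'yr_built': 1990,
    'yr_renovated': 0, 'lat': 47.5, 'long': -122.2,
    'sqft_living15': 1800, 'sqft_lot15': 5000
}], columns=features)

print(regr.predict(new_house)[0])  #predicted price from the trained xgboost model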

If you took your own dataset and problem statement and followed the same steps, you will have reached a similar conclusion for your own project, with its own real-life applicability.

If you have any doubts about applied machine learning, or if you get stuck somewhere while implementing your model, feel free to ask in the comments below.

Stay tuned for the next article where we will explore more diverse models and their application.

This article was first published by me on The Research Nest's blog here.

#IndiaStudents #StudentVoices #MachineLearning #Regression #ArtificialIntelligence

Amar Parajuli

Gen AI Engineer @ Mahfin || Business Understanding|| NLP Research || Graph Neural Network

5y

Great!! Can we use any algorithm other than Linear Regression for the same problem set? Maybe Logistic Regression? Even after using xgboost the error was 122692; isn't that large? Will "Feature Scaling" help to reduce the error some more? Thanks.

Federico Vargas

Profesional de Tecnología de Información y Comunicaciones.

5y

This is the "Hello World" of machine learning. A great example to learn the power of this technology.

Nice insight into ML. Thank you!
