Machine Learning For Beginners
Introduction
For beginners, machine learning might seem intimidating with all the calculus, statistics, and algorithms that confuse you even before you start.
In this article, I want to demonstrate machine learning in the simplest way and show that it is not as difficult as it seems.
Intuition
The intuition of machine learning is to train a model with historical data and then use that model to make predictions on unseen data.
Here are the basic steps to write simple Python code that makes predictions about your data.
Import Data
We’ll use the famous Titanic dataset in this tutorial because it is widely used to illustrate machine learning for beginners.
import pandas as pd
train_data = pd.read_csv('your-file-path/train.csv')
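Before going further, it helps to take a quick look at what was loaded. The sketch below uses a few made-up rows that only mimic some Titanic columns (they are not the real data), so it runs without the CSV file; with the real file you would call the same methods on train_data.

```python
import pandas as pd

# Hypothetical toy rows that mimic a few Titanic columns (not the real data).
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 1],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, None, 35.0],  # one missing age on purpose
})

print(toy.shape)   # (number of rows, number of columns)
print(toy.head())  # the first rows of the table
print(toy.dtypes)  # which columns are numbers and which are text
```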
Data Understanding
The sinking of the Titanic is one of the most infamous shipwrecks in history.
On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
In this tutorial, we will build a predictive model that predicts whether a passenger survived.
Let’s first check the null values in the data.
train_data.isna().sum()
# Drop columns in the training set
train_data.drop(['Ticket','Cabin','Name','PassengerId'],axis=1,inplace=True)
Note that we removed ‘PassengerId’ and ‘Ticket’ from the training set, along with ‘Name’ and ‘Cabin’. The id of each passenger doesn’t tell us much about whether they will survive or not.
Most machine learning models cannot take null values as input while making predictions. Therefore we should replace the missing values with the most probable value (make your closest guess). The simplest way of making a guess is to use a central value of the distribution (mean, median, or mode).
To fill in the missing values with their central values, we first need to calculate them. Note that we drop the null values before calculating the central values so that they don’t affect the result.
# Calculates central values
freq_port = train_data.Embarked.dropna().mode()[0]
avg_fare = round(train_data.Fare.dropna().mean(),2)
avg_age = round(train_data.Age.dropna().mean(),2)
# Replace missing values
train_data['Embarked'] = train_data['Embarked'].fillna(freq_port)
train_data["Fare"] = train_data["Fare"].fillna(avg_fare)
train_data["Age"] = train_data["Age"].fillna(avg_age)
# Check missing values
train_data.isna().sum()
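To see why the choice of central value matters, here is a minimal sketch on a handful of made-up ages (hypothetical numbers, not taken from the Titanic data). It also shows that pandas skips missing values by default when computing the mean or median, so the explicit dropna() above is a safe but optional step.

```python
import pandas as pd

# Hypothetical ages with one missing value and one unusually old passenger.
ages = pd.Series([20.0, 22.0, 24.0, None, 80.0])

mean_age = ages.mean()      # missing values are skipped by default
median_age = ages.median()  # less sensitive to the single large value

# Fill the gap with the median instead of the mean.
filled = ages.fillna(median_age)

print(mean_age, median_age)
print(filled.tolist())
```

In this toy case the single large age pulls the mean well above the median, which is one reason the median is sometimes preferred for skewed columns.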
Some of the models cannot take categorical variables as input. So we’ll need to convert them into integers.
# Label Encoding for "Embarked" column
train_data['Embarked'] = train_data['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)
train_data['Embarked']
# Label Encoding for "Sex" column
train_data['Sex'] = train_data['Sex'].map( {'female': 1, 'male': 0} ).astype(int)
train_data['Sex']
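One thing to watch with map: any value that is not in the dictionary becomes a missing value, and astype(int) then fails. The sketch below uses a made-up series in which "X" stands for a hypothetical unexpected port code, and shows one way to spot such values before converting.

```python
import pandas as pd

# Hypothetical port codes; "X" is an unexpected value not in the mapping.
ports = pd.Series(["S", "C", "Q", "X"])
mapping = {"S": 0, "C": 1, "Q": 2}

encoded = ports.map(mapping)       # "X" has no entry, so it becomes NaN
unmapped = ports[encoded.isna()]   # the original values that were not mapped

print(encoded.tolist())
print(unmapped.tolist())
```

If unmapped is not empty, it is usually better to extend the mapping or fill the gaps first than to let the integer conversion raise an error.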
Now that we’re ready to train our model, we need to split the data into training and testing sets. A training set is the data used to train the model, while a testing set is used to evaluate the accuracy of the model.
from sklearn.model_selection import train_test_split
X_train = train_data.drop("Survived", axis=1)
y_train = train_data["Survived"]
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size = 0.2)
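The split above is random, so the rows in each set change on every run. As a minimal sketch on made-up data (the column values here are hypothetical), the random_state and stratify parameters of train_test_split make the split repeatable and keep the share of survivors the same in both sets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows, half survived and half did not.
X = pd.DataFrame({"Age": range(20, 40), "Sex": [0, 1] * 10})
y = pd.Series([0, 1] * 10, name="Survived")

# random_state fixes the shuffle; stratify=y keeps the class balance.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(len(X_tr), len(X_te))  # sizes of the two sets
print(y_te.value_counts())   # class balance in the testing set
```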
Model Training with LightGBM
There are many models we can consider, but Gradient Boosted Trees is always my first choice since it is pretty robust across most prediction problems. In real business problems, it’s not always this easy. We will need to do feature extraction and hyperparameter tuning. However, in today’s article, we’ll keep it simple by staying with the defaults. `model.fit` is the single line of code that trains the model on our data.
from lightgbm import LGBMClassifier
model = LGBMClassifier()
model.fit(X_train,y_train)
Make predictions with the trained model
After we call model.fit(), “model” is the trained model. So, we will use “model” to make predictions using the X variables in the testing set.
y_pred = model.predict(X_test)
y_pred
Note that y_pred is now a NumPy array of outputs from the model, using X_test as the X variables. Let’s put it into a pandas DataFrame alongside the true values so that we can evaluate the output.
df_test = X_test.copy()
df_test['Survived'] = y_test
df_test['Prediction'] = y_pred
df_test
Prediction Accuracy
We need to measure how accurate our model’s predictions are, so we compare the predictions against the true values.
from sklearn.metrics import accuracy_score
y_true = df_test['Survived']
y_pred = df_test['Prediction']
accuracy = accuracy_score(y_true, y_pred)
print(str(round(accuracy*100,2))+'%')
The simplest evaluation metric for a classification problem is the accuracy score. However, in many situations, accuracy is insufficient to tell the true story. Anyway, let’s save this topic for another day. After all, this article is to show how machine learning is done in the simplest way.
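As a small taste of why accuracy alone can mislead, here is a sketch on made-up labels (hypothetical, not the Titanic results): a "model" that always predicts 0 still scores high accuracy when most of the true values are 0, even though it never finds a single survivor.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical labels: 9 passengers did not survive, 1 survived.
y_true = [0] * 9 + [1]
# A lazy "model" that always predicts "did not survive".
y_pred = [0] * 10

acc = accuracy_score(y_true, y_pred)   # share of correct predictions
rec = recall_score(y_true, y_pred)     # share of actual survivors found
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted

print(acc, rec)
print(cm)
```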
Make Prediction
Now that we have trained a model and evaluated its accuracy, we can use it to make predictions based on given X values. Yes, the model is not a crystal ball and cannot make predictions out of thin air. We need to pass it the same X variables that we used to train the model.
Note that the values in the X variables need not be the same as in our training set. They are the values that you want to use to make predictions. What will the output be when the X variables are these values?
predict_data = pd.read_csv('your-file-path/test.csv')
# Calculates central values
freq_port = predict_data.Embarked.dropna().mode()[0]
avg_fare = round(predict_data.Fare.dropna().mean(),2)
avg_age = round(predict_data.Age.dropna().mean(),2)
# Replace missing values
predict_data['Embarked'] = predict_data['Embarked'].fillna(freq_port)
predict_data["Fare"] = predict_data["Fare"].fillna(avg_fare)
predict_data["Age"] = predict_data["Age"].fillna(avg_age)
# Check missing values
predict_data.isna().sum()
We will need to give the data the same structure as our training set.
X_predict = predict_data.drop(columns = ['PassengerId','Cabin','Name','Ticket'])
X_predict['Embarked'] = predict_data['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)
X_predict['Sex'] = predict_data['Sex'].map( {'female': 1, 'male': 0} ).astype(int)
X_predict.info()
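A quick way to confirm that the structure matches is to compare column names and order. The sketch below is only an illustration: train_cols is a hypothetical stand-in for list(X_train.columns), and X_new is a made-up one-row table standing in for X_predict.

```python
import pandas as pd

# Hypothetical stand-in for list(X_train.columns).
train_cols = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

# Made-up prediction row with the same columns in a different order.
X_new = pd.DataFrame({
    "Embarked": [0], "Pclass": [3], "Sex": [1], "Age": [30.0],
    "SibSp": [0], "Parch": [0], "Fare": [7.25],
})

# Columns the model expects but the new data lacks.
missing = [c for c in train_cols if c not in X_new.columns]

# Reorder the columns to match the training set.
X_new = X_new[train_cols]

print(missing)
print(list(X_new.columns))
```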
Now that the prediction set is ready, we will use our trained model to make predictions on the data.
prediction = model.predict(X_predict)
predict_data['Prediction'] = prediction
predict_data
You can then export this data to a .csv file for your analysis.
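As a minimal sketch of the export step, the example below writes a small made-up results table (hypothetical ids and predictions) with DataFrame.to_csv. It writes to an in-memory buffer so it runs anywhere; in practice you would pass a file path of your choice instead of the buffer.

```python
import io

import pandas as pd

# Hypothetical results table standing in for predict_data.
results = pd.DataFrame({"PassengerId": [892, 893], "Prediction": [0, 1]})

# index=False leaves the row numbers out of the file.
buffer = io.StringIO()
results.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(csv_text)
```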
Conclusion
Machine learning is training a model with historical data and using the model to make predictions on unseen data. We’ve seen how model training is just a line of code, as is using the trained model to make predictions on an unseen data set.
There is more than just that. In real-world problems, much effort is needed for steps we skipped here, such as feature extraction and hyperparameter tuning.
Don’t be discouraged by the complications that might arise. Everyone has to start somewhere, and this article aims to give you the big picture so that you can set the right direction when you first start.
With all that said, I will share my experience here along my learning journey so that it can benefit my readers.