Machine Learning For Beginners
Photo by Element5 Digital on Unsplash

Introduction

For beginners, machine learning might seem intimidating with all the calculus, statistics, and algorithms that confuse you even before you start.

In this article, I want to show, in the simplest way I can, that machine learning is not as difficult as it seems.

Intuition

The intuition of machine learning is to train a model with historical data and use the model to make predictions.


Here are the basic steps for writing simple Python code that makes predictions from your data.

Import Data

We’ll use the famous Titanic dataset in this tutorial because it is well suited to illustrating machine learning for beginners.

import pandas as pd
train_data = pd.read_csv('your-file-path/train.csv')        

Data Understanding

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this tutorial, we will build a predictive model that answers the question: “What sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).

Let’s first check the null values in the data.

# Count the missing values in each column
train_data.isna().sum()        

  • Cabin has 687 missing values out of 891 instances, so it is better to exclude the column.
  • Embarked has only 2 missing values. We can replace them with the mode of the variable without many side effects.
  • Age is more complicated. It has 177 missing values out of 891. Replacing them with a fixed number risks adding bias to the model, while dropping the variable without good justification risks losing important information. For simplicity’s sake, we will fill in the missing values with the average age in this tutorial.
  • For now, we will remove Cabin from our training data.
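To put the counts above in proportion, one option is to look at the share of missing values per column rather than the raw count. The sketch below uses a small made-up DataFrame (not the real Titanic data), so the values and the resulting shares are illustrative assumptions only.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the Titanic training set (values are made up)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
    "Embarked": ["S", "C", "S", np.nan, "Q"],
})

# isna() marks missing cells as True; the mean of booleans is the share of True values
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

On the real training set, the counts quoted above work out to roughly 77% missing for Cabin (687 of 891) and about 20% for Age (177 of 891).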

# Drop columns in the training set
train_data.drop(['Ticket','Cabin','Name','PassengerId'],axis=1,inplace=True)        

Note that we also removed ‘PassengerId’, ‘Ticket’, and ‘Name’ from the training set. This is because identifiers such as the passenger id tell us little about whether a passenger will survive.

Replace Missing Values

Most machine learning models cannot take null values as input while making predictions. Therefore we should replace the missing values with the most probable value (make your closest guess). The simplest way of making a guess is the central value of the distribution (mean, mode, median).

To fill in the missing values with their central values, we first need to calculate them. Note that we removed the null values from the central value calculations so that they don’t affect the result.

# Calculates central values
freq_port = train_data.Embarked.dropna().mode()[0]
avg_fare = round(train_data.Fare.dropna().mean(),2)
avg_age = round(train_data.Age.dropna().mean(),2)

# Replace missing values
train_data['Embarked'] = train_data['Embarked'].fillna(freq_port)
train_data["Fare"] = train_data["Fare"].fillna(avg_fare)
train_data["Age"] = train_data["Age"].fillna(avg_age)

# Check missing values
train_data.isna().sum()        
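A small aside on the central values: pandas skips missing values by default when computing `mean`, so the explicit `dropna()` is a readability choice rather than a requirement. The toy Series below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up ages with two missing entries
ages = pd.Series([20.0, 30.0, np.nan, 40.0, np.nan])

# Both expressions ignore the missing entries and give the same mean
mean_with_dropna = ages.dropna().mean()
mean_default = ages.mean()  # skipna=True is the default

# fillna returns a new Series with the gaps replaced by the chosen value
filled = ages.fillna(mean_default)
print(mean_default, filled.isna().sum())
```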

Label Encoding

Some of the models cannot take categorical variables as input. So we’ll need to convert them into integers.

Embarked column before transformation
# Label Encoding for "Embarked" column
train_data['Embarked'] = train_data['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)
train_data['Embarked']        
Embarked column after transformation
Sex column before transformation
# Label Encoding for "Sex" column
train_data['Sex'] = train_data['Sex'].map( {'female': 1, 'male': 0} ).astype(int)
train_data['Sex']
Sex column after transformation
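To illustrate what the mapping does, here is a minimal sketch on a made-up Series. It also shows one caveat worth knowing: `map` returns NaN for any value that is not a key in the dictionary, which is one reason the missing values were filled in before this step.

```python
import pandas as pd

# Made-up port codes; 'X' is deliberately not in the mapping
ports = pd.Series(["S", "C", "Q", "S", "X"])
port_codes = {"S": 0, "C": 1, "Q": 2}

encoded = ports.map(port_codes)

# Values missing from the dictionary become NaN, so check before casting to int
unmapped = ports[encoded.isna()].unique().tolist()
print(unmapped)  # ['X']

# Cast only the rows that were mapped successfully
encoded_int = encoded.dropna().astype(int)
print(encoded_int.tolist())  # [0, 1, 2, 0]
```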

Train-test-split

Now that we’re ready to train our model, we need to split the data into training and testing sets. The training set is the data used to train the model, while the testing set is held back to evaluate the model’s accuracy.

from sklearn.model_selection import train_test_split
X_train = train_data.drop("Survived", axis=1)
y_train = train_data["Survived"]

X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size = 0.2)        
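As a hedged illustration of the split, the sketch below applies `train_test_split` to a made-up 10-row table. The `random_state` and `stratify` arguments are optional additions that are not used in the code above: the first makes the split reproducible, the second keeps the share of survivors similar in both sets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up feature table and target with 10 rows (5 survivors, 5 non-survivors)
X = pd.DataFrame({"Age": range(20, 30), "Sex": [0, 1] * 5})
y = pd.Series([0, 1] * 5, name="Survived")

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y,
    test_size=0.2,    # 20% of rows go to the testing set
    random_state=42,  # assumption: any fixed integer makes the split repeatable
    stratify=y,       # assumption: keep class proportions similar in both sets
)

print(len(X_tr), len(X_te))  # 8 2
```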

Model Training with LightGBM

There are many models we can consider, but Gradient Boosted Trees is always my first choice since it is pretty robust to most prediction problems; LightGBM is one implementation of it. Real business problems are rarely this easy, and usually call for feature extraction and hyperparameter tuning. In today’s article, however, we’ll keep it simple and stay with the defaults. `model.fit` is the one line of code that trains the model on our data.

from lightgbm import LGBMClassifier

model = LGBMClassifier()
model.fit(X_train,y_train)        

Make predictions with the trained model

After we call model.fit(), “model” holds the trained model. So, we will use “model” to make predictions using the X variables in the testing set.

y_pred = model.predict(X_test)
y_pred        

Note that y_pred is now an array of model outputs, one for each row of X_test. Let’s combine it with X_test and y_test in a pandas DataFrame so that we can evaluate the output.

df_test = X_test.copy()
df_test['Survived'] = y_test
df_test['Prediction'] = y_pred
df_test        
Combine X_test, y_test and y_pred

Prediction Accuracy

We need to measure how accurate our model’s predictions are, so we compare them against the true values.

from sklearn.metrics import accuracy_score

y_true = df_test['Survived']
y_pred = df_test['Prediction']

accuracy = accuracy_score(y_true, y_pred)

print(str(round(accuracy*100,2))+'%')        

The simplest evaluation metric for a classification problem is the accuracy score. However, in many situations, accuracy is insufficient to tell the true story. Anyway, let’s save this topic for another day. After all, this article is to show how machine learning is done in the simplest way.
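Accuracy is simply the number of correct predictions divided by the total number of predictions. The sketch below computes it by hand on made-up labels, and hints at why it can mislead: a "model" that always predicts the majority class can still score well when the classes are imbalanced.

```python
# Made-up labels: 1 = survived, 0 = did not survive
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0]

def accuracy(actual, predicted):
    """Share of positions where the prediction equals the true label."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

print(accuracy(y_true, y_pred))  # 0.8

# Baseline that ignores the inputs and always predicts "did not survive"
always_zero = [0] * len(y_true)
print(accuracy(y_true, always_zero))  # 0.8
```

In this toy case both score 80%, even though the baseline never identifies a single survivor.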

Make Prediction

Now that we have a trained model and have evaluated its accuracy, we can use it to make predictions for given X values. Of course, the model is not a crystal ball and cannot make predictions out of thin air: we need to pass it the same X variables that we used to train it.


Note that the values of the X variables need not be the same as those in our training set; they are simply the values you want predictions for. What will the output be when the X variables take these values?

predict_data = pd.read_csv('your-file-path/test.csv')

# Calculates central values
freq_port = predict_data.Embarked.dropna().mode()[0]
avg_fare = round(predict_data.Fare.dropna().mean(),2)
avg_age = round(predict_data.Age.dropna().mean(),2)

# Replace missing values
predict_data['Embarked'] = predict_data['Embarked'].fillna(freq_port)
predict_data["Fare"] = predict_data["Fare"].fillna(avg_fare)
predict_data["Age"] = predict_data["Age"].fillna(avg_age)

# Check missing values
predict_data.isna().sum()        

We need to give the data the same structure as our training set.

X_predict = predict_data.drop(columns = ['PassengerId','Cabin','Name','Ticket'])
X_predict['Embarked'] = predict_data['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)
X_predict['Sex'] = predict_data['Sex'].map( {'female': 1, 'male': 0} ).astype(int)
X_predict.info()        
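One way to confirm that the prediction set matches the training features is to compare column names and order directly. The sketch below uses two made-up frames; `train_features` and `new_data` are hypothetical names standing in for `X_train` and `X_predict`.

```python
import pandas as pd

# Hypothetical stand-ins for X_train and X_predict (values are made up)
train_features = pd.DataFrame({"Pclass": [3, 1], "Sex": [0, 1], "Age": [22.0, 38.0]})
new_data = pd.DataFrame({"Age": [34.5], "Sex": [0], "Pclass": [3], "Name": ["Doe, Mr. J"]})

# Columns the model expects but the new data lacks, and extra ones it was not trained on
missing_cols = [c for c in train_features.columns if c not in new_data.columns]
extra_cols = [c for c in new_data.columns if c not in train_features.columns]
print(missing_cols, extra_cols)  # [] ['Name']

# Select the training columns in the training order; raises KeyError if any is missing
aligned = new_data[list(train_features.columns)]
print(list(aligned.columns))  # ['Pclass', 'Sex', 'Age']
```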

Now that the prediction set is ready, we will use our trained model to make predictions on the data.

prediction = model.predict(X_predict)
predict_data['Prediction'] = prediction
predict_data        
Prediction of unseen data

You can then export this data to a .csv file for your analysis.
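A minimal sketch of the export step, using pandas’ `to_csv`. The frame contents and the file name `predictions.csv` are placeholders; `index=False` leaves out the row index so the file contains only the data columns.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for predict_data with its new Prediction column
results = pd.DataFrame({"PassengerId": [892, 893], "Prediction": [0, 1]})

# Write to a temporary folder here; in practice, point this at your own path
out_path = os.path.join(tempfile.mkdtemp(), "predictions.csv")
results.to_csv(out_path, index=False)

# Read the file back to confirm what was written
reloaded = pd.read_csv(out_path)
print(reloaded)
```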

Conclusion

Machine learning is training a model with historical data and using the model to make predictions on unseen data. We’ve seen that model training is just one line of code, and so is using the trained model to make predictions on an unseen data set.

There is more to it than that. In real-world problems, much effort is needed to:

  1. Clean the data
  2. Extract useful features and remove redundant ones
  3. Tune hyperparameters to prevent underfitting and overfitting
  4. Evaluate a few models with different evaluation metrics (or choose the right metric)
  5. Explain and justify your model to the stakeholders

Don’t be discouraged by the complications that might arise. Everyone has to start somewhere, and this article aims to give you the big picture so that you can set off in the right direction when you first start.

With all that said, I will share my experience here along my learning journey so that it can benefit my readers.

Photo by Mantas Hesthaven on Unsplash
