Machine Learning For Beginners

Chee-Chuan Foo

Data Scientist | Writer | Consultant

Published: March 12, 2023

Introduction

For beginners, machine learning might seem intimidating with all the calculus, statistics, and algorithms that confuse you even before you start.

In this article, I want to show, in the simplest way possible, that machine learning is not as difficult as it seems.

Intuition

The intuition of machine learning is to train a model with historical data and use the model to make predictions.

[Image by Author]

Here are the basic steps for writing simple Python code that makes predictions about your data.

Import Data

We’ll use the famous Titanic dataset in this tutorial because it is widely used to illustrate machine learning for beginners.

import pandas as pd
train_data = pd.read_csv('your-file-path/train.csv')

Data Understanding

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this tutorial, we will build a predictive model that answers the question “What sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).

Let’s first check the null values in the data.

# Count the missing values in each column
train_data.isna().sum()

Cabin has 687 missing values out of 891 instances. It’s better to exclude the column.
Embarked has only 2 missing values. We can easily replace them with the mode of the variable without many side effects.
Age is complicated. It has 177 missing values out of 891. If we replace the missing values with some number, we risk adding bias to our model. Excluding the variable without good justification would risk losing important information. For simplicity’s sake, we will just fill in the missing values with the average value in this tutorial.
For now, we will remove Cabin from our training data.

# Drop columns in the training set
train_data.drop(['Ticket','Cabin','Name','PassengerId'],axis=1,inplace=True)

Note that we also removed ‘PassengerId’, ‘Ticket’, and ‘Name’ from the training set. These columns mostly identify individual passengers, and an identifier tells us little about whether a passenger will survive.

Replace Missing Values

Most machine learning models cannot take null values as input while making predictions. Therefore, we should replace the missing values with the most probable value (make your closest guess). The simplest guess is a central value of the distribution (mean, median, or mode).

To fill in the missing values with their central values, we first need to calculate them. Note that we drop the null values before calculating the central values so that they don’t affect the result.

# Calculates central values
freq_port = train_data.Embarked.dropna().mode()[0]
avg_fare = round(train_data.Fare.dropna().mean(),2)
avg_age = round(train_data.Age.dropna().mean(),2)

# Replace missing values
train_data['Embarked'] = train_data['Embarked'].fillna(freq_port)
train_data["Fare"] = train_data["Fare"].fillna(avg_fare)
train_data["Age"] = train_data["Age"].fillna(avg_age)

# Check missing values
train_data.isna().sum()
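
As a minimal sketch of the same idea on a tiny made-up table (the `toy` DataFrame below is hypothetical, not the Titanic file), the fill works like this. Note that pandas `mean()` and `mode()` already skip missing values by default, so the result is the same with or without `dropna()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the real Titanic file
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 38.0],
    "Embarked": ["S", "C", None, "S"],
})

# mean() and mode() ignore missing values by default
avg_age = round(toy["Age"].mean(), 2)   # (22 + 30 + 38) / 3
freq_port = toy["Embarked"].mode()[0]   # most frequent port

# Fill the gaps with the central values
toy["Age"] = toy["Age"].fillna(avg_age)
toy["Embarked"] = toy["Embarked"].fillna(freq_port)

print(toy)
```

After the fill, no column in the toy table contains a null value.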

Label Encoding

Some of the models cannot take categorical variables as input. So we’ll need to convert them into integers.

# Label Encoding for "Embarked" column
train_data['Embarked'] = train_data['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)
train_data['Embarked']

# Label Encoding for "Sex" column
train_data['Sex'] = train_data['Sex'].map( {'female': 1, 'male': 0} ).astype(int)
train_data['Sex']
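
One thing worth knowing about `map` with a dictionary: any value that is not a key in the dictionary becomes NaN, and `astype(int)` then raises an error. This is one reason the missing values are filled before encoding. The sketch below uses a made-up series (the value "X" is hypothetical) to show how you might check for unmapped categories first.

```python
import pandas as pd

# Made-up port codes; "X" is a hypothetical value missing from the mapping
ports = pd.Series(["S", "C", "Q", "X"])
mapping = {"S": 0, "C": 1, "Q": 2}

# Values without a key in the mapping come back as NaN
encoded = ports.map(mapping)

# List the original values that could not be encoded
unmapped = ports[encoded.isna()].tolist()
print(unmapped)
```

If the list is empty, it is safe to convert the encoded column to integers.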

Train-test-split

Now that we’re ready to train our model, we need to split the data into training and testing sets. A training set is the data used to train the model, while a testing set is used to evaluate the accuracy of the model.

from sklearn.model_selection import train_test_split

# Separate the features (X) from the target (y)
X = train_data.drop("Survived", axis=1)
y = train_data["Survived"]

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
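
The split above is random, so the rows in each set change on every run. If you want the same split each time, a possible variation is to pass `random_state`, and optionally `stratify` to keep the share of survivors similar in both sets. The sketch below uses a small made-up table (`toy` and its values are hypothetical), and the seed 42 is an arbitrary choice.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with 5 survivors and 5 non-survivors
toy = pd.DataFrame({
    "Pclass": [1, 3, 2, 3, 1, 2, 3, 1, 2, 3],
    "Sex": [1, 0, 1, 0, 0, 1, 0, 1, 0, 1],
    "Survived": [1, 0, 1, 0, 0, 1, 0, 1, 0, 1],
})
X = toy.drop("Survived", axis=1)
y = toy["Survived"]

# random_state fixes the shuffle; stratify keeps the class balance in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(len(X_tr), len(X_te))
```

With 10 rows and `test_size=0.2`, 8 rows go to training and 2 to testing.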

Model Training with LightGBM

There are many models we can consider, but Gradient Boosted Trees is always my first choice since it is pretty robust across most prediction problems. In real business problems, it’s not always this easy: we will need to do feature extraction and hyperparameter tuning. However, in today’s article, we’ll keep it simple by staying with the defaults. `model.fit` is the one line of code that trains the model on our data.

from lightgbm import LGBMClassifier

model = LGBMClassifier()
model.fit(X_train,y_train)

Make predictions with the trained model

After we call model.fit(), “model” is the trained model. So, we will use “model” to make predictions using the X variables in the testing set.

y_pred = model.predict(X_test)
y_pred
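
Besides the predicted labels, classifiers that follow the scikit-learn interface (LGBMClassifier does) also offer `predict_proba`, which returns the estimated probability of each class. The sketch below is only an illustration: it swaps in scikit-learn's `GradientBoostingClassifier` so that it runs without LightGBM installed, and the data is randomly generated rather than taken from the Titanic set.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data: the label depends only on the sign of the first feature
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)

# Stand-in gradient boosted model (not LightGBM) with default settings
clf = GradientBoostingClassifier(random_state=0)
clf.fit(X, y)

labels = clf.predict(X[:5])        # hard 0/1 predictions
proba = clf.predict_proba(X[:5])   # one column per class, rows sum to 1

print(labels)
print(proba.round(3))
```

Each predicted label corresponds to the class with the higher probability in that row.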

Note that y_pred is now an array of outputs from the model using X_test as the X variables. Let’s put it back into a pandas DataFrame so that we can evaluate the output.

df_test = X_test.copy()
df_test['Survived'] = y_test
df_test['Prediction'] = y_pred
df_test

Prediction Accuracy

We need to measure how accurate our model’s predictions are, so we compare the predictions against the true values.

from sklearn.metrics import accuracy_score

y_true = df_test['Survived']
y_pred = df_test['Prediction']

accuracy = accuracy_score(y_true, y_pred)

print(str(round(accuracy*100,2))+'%')

The simplest evaluation metric for a classification problem is the accuracy score. However, in many situations, accuracy is insufficient to tell the true story. Anyway, let’s save this topic for another day. After all, this article is to show how machine learning is done in the simplest way.
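
As a small illustration of why accuracy alone can mislead, consider a made-up case (the labels below are hypothetical) where only 2 of 10 passengers survived and a "model" always predicts 0. Accuracy is simply the number of correct predictions divided by the total, so this model still looks good on paper.

```python
# Hypothetical labels: 1 = survived, 0 = did not survive
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0] * len(y_true)  # a "model" that always predicts 0

# Accuracy = correct predictions / all predictions
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall for class 1 = survivors found / actual survivors
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_pos = sum(t == 1 for t in y_true)
recall = true_pos / actual_pos

print(accuracy, recall)
```

The accuracy is 80%, yet the model does not identify a single survivor.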

Make Prediction

Now that we have trained a model and evaluated its accuracy, we can use it to make predictions based on given X values. Yes, the model is not a crystal ball and cannot make predictions out of thin air. We need to pass in the same X variables that we used to train the model.

Note that the values of the X variables need not be the same as in our training set. They are the values that you want to make predictions for: what will the output be when the X variables take these values?

predict_data = pd.read_csv('your-file-path/test.csv')

# Calculates central values
freq_port = predict_data.Embarked.dropna().mode()[0]
avg_fare = round(predict_data.Fare.dropna().mean(),2)
avg_age = round(predict_data.Age.dropna().mean(),2)

# Replace missing values
predict_data['Embarked'] = predict_data['Embarked'].fillna(freq_port)
predict_data["Fare"] = predict_data["Fare"].fillna(avg_fare)
predict_data["Age"] = predict_data["Age"].fillna(avg_age)

# Check missing values
predict_data.isna().sum()

We need to give the prediction data the same structure as our training set.

X_predict = predict_data.drop(columns = ['PassengerId','Cabin','Name','Ticket'])
X_predict['Embarked'] = predict_data['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)
X_predict['Sex'] = predict_data['Sex'].map( {'female': 1, 'male': 0} ).astype(int)
X_predict.info()
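
Before predicting, it can help to confirm that the prediction features have exactly the same columns, in the same order, as the training features. The sketch below shows one possible check on two made-up frames (`X_train_toy` and `X_new` are hypothetical stand-ins for X_train and X_predict).

```python
import pandas as pd

# Hypothetical stand-ins for the training and prediction features
X_train_toy = pd.DataFrame({"Pclass": [1], "Sex": [0], "Age": [30.0]})
X_new = pd.DataFrame({"Age": [25.0], "Pclass": [3], "Sex": [1]})

# Columns present in one frame but not the other
missing = set(X_train_toy.columns) - set(X_new.columns)
extra = set(X_new.columns) - set(X_train_toy.columns)
assert not missing and not extra, (missing, extra)

# Reorder the prediction columns to match the training order
X_new = X_new[X_train_toy.columns]

print(list(X_new.columns))
```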

Now that the prediction set is ready, we will use our trained model to make predictions on the data.

prediction = model.predict(X_predict)
predict_data['Prediction'] = prediction
predict_data

You can then export this data to a .csv file for your analysis.
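
A minimal sketch of the export step, assuming a small made-up result table (the IDs and predictions below are hypothetical). It writes to an in-memory buffer so that it runs anywhere; passing a file path such as 'your-file-path/predictions.csv' (a placeholder name) to `to_csv` would write a file instead.

```python
import io

import pandas as pd

# Hypothetical results, standing in for predict_data
result = pd.DataFrame({"PassengerId": [892, 893], "Prediction": [0, 1]})

# index=False leaves the row index out of the file
buffer = io.StringIO()
result.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(csv_text)
```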

Conclusion

Machine learning is training a model with historical data and using the model to make predictions on unseen data. We’ve seen how model training is just a line of code, as is using the trained model to make predictions on an unseen data set.

There is more to it than that. In real-world problems, much effort is needed to:

Clean the data
Extract useful features and remove redundant ones
Tune hyperparameters to prevent underfitting and overfitting
Evaluate a few models with different (or the right) evaluation metrics
Explain and justify your model to the stakeholders

Don’t be discouraged by the complications that might arise. Everyone has to start somewhere, and this article aims to give you the big picture so that you can set the right direction when you first start.

With all that said, I will share my experience here along my learning journey so that it can benefit my readers.
