
Top 6 Machine Learning Classification Algorithms

MRINMOY PAUL

Global Delivery & Program Leader | GCP | Strategic Transformation | Program & Product Management | PRINCE2 | Six Sigma Black Belt (CSSBB) | Startup Advisor | ESG | Certified Mentor | Independent Director Aspirant

Published: October 7, 2023


How to Build a Machine Learning Model Pipeline in Python

Machine Learning Algorithms for Classification

Supervised vs. Unsupervised Learning vs. Reinforcement Learning

The easiest way to distinguish between supervised and unsupervised learning is to check whether the data is labelled. Supervised learning involves learning a function that predicts a defined label based on input data. This can mean either classifying data into categories (a classification problem) or predicting a continuous outcome (a regression problem).

Unsupervised learning uncovers underlying patterns in datasets that are not explicitly labelled, for example by discovering similarities between data points (clustering algorithms) or revealing hidden relationships among variables (association rule algorithms).

Reinforcement learning is another type of machine learning, in which an agent learns to perform actions based on its interactions with the environment, with the goal of maximizing rewards. This is most similar to the human learning process and follows a trial-and-error approach.

Classification and Regression

Supervised learning can be further divided into classification algorithms and regression algorithms. Classification models identify which category an object belongs to, and regression models predict continuous outputs. The line between the two can be blurred: many algorithms can be used for both, and a binary classifier can often be viewed as a regression model with a threshold applied. If the predicted number is greater than the threshold, the observation is classified as true; otherwise it is classified as false.

This article describes the top six machine learning algorithms for classification problems: logistic regression, decision trees, random forests, support vector machines, k-nearest neighbours, and naive Bayes. For each, I summarize the theory and its implementation in Python.

Logistic Regression

Logistic regression uses the sigmoid function to return the probability of a label. It is often used when the classification problem is binary (true or false, win or lose, positive or negative). The sigmoid function maps any real-valued score to a probability between 0 and 1, and a label is assigned to the object by comparing that probability with a predefined threshold.
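To make the thresholding step concrete, here is a minimal sketch using only NumPy. The scores and the 0.5 threshold are illustrative assumptions, not values from the article's dataset.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores (w.x + b) for four observations
scores = np.array([-2.0, -0.5, 0.0, 3.0])
probabilities = sigmoid(scores)

threshold = 0.5  # assumed cut-off; other values trade precision for recall
labels = (probabilities >= threshold).astype(int)

print(probabilities.round(3))
print(labels)  # [0 0 1 1]
```

A score of exactly 0 maps to a probability of 0.5, so with this threshold it falls on the positive side.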

Below is the code snippet for a default logistic regression, followed by the common hyperparameters to experiment with to see which combination gives the best result.

from sklearn.linear_model import LogisticRegression
reg = LogisticRegression()
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)

logistic regression common hyperparameters: penalty, max_iter, C, solver
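One way to experiment with these hyperparameters is a grid search with cross-validation. The sketch below is only an illustration: it uses synthetic data from make_classification as a stand-in for the real dataset, and the grid values are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption); replace with X_train, y_train
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Illustrative grid over two of the hyperparameters listed above
param_grid = {
    "C": [0.01, 0.1, 1, 10],
    "solver": ["lbfgs", "liblinear"],
}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=5,                # 5-fold cross-validation
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The same pattern applies to the other classifiers in this article by swapping the estimator and the grid.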

Decision Tree

A decision tree builds its branches hierarchically, and each branch can be thought of as an if-else statement. The branches develop by partitioning the dataset into subsets based on the most important features. The final classification happens at the leaves of the decision tree.

from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)
y_pred = dtc.predict(X_test)

decision tree common hyperparameters: criterion, max_depth, min_samples_split, min_samples_leaf, max_features
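The if-else view of a decision tree can be inspected directly with scikit-learn's export_text. This sketch uses the bundled iris dataset as a stand-in (an assumption, chosen only because it ships with scikit-learn) and limits the depth so the printed rules stay short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rule set readable
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each indented line is a condition; each "class:" line is a leaf
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```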

Random Forest

As the name suggests, a random forest is a collection of decision trees. It is a common type of ensemble method, which aggregates results from multiple predictors. Random forest additionally uses a bagging technique in which each tree is trained on a random sample of the original dataset, and the final prediction is taken by majority vote across the trees. Compared to a single decision tree, it generalizes better but is less interpretable, because the prediction comes from many trees rather than one readable set of rules.

from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)

random forest common hyperparameters: n_estimators, max_features, max_depth, min_samples_split, min_samples_leaf, bootstrap
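The bagging and majority-vote idea can be sketched by hand with plain decision trees. This is a simplified illustration on synthetic data, not how RandomForestClassifier is implemented; scikit-learn's version averages the trees' predicted probabilities rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data as a stand-in (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

rng = np.random.default_rng(0)
trees = []
for _ in range(25):  # odd number of trees, so a binary vote cannot tie
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    tree = DecisionTreeClassifier(
        max_features="sqrt",  # random feature subset at each split
        random_state=int(rng.integers(0, 1_000_000)),
    )
    trees.append(tree.fit(X_tr[idx], y_tr[idx]))

# One row of 0/1 predictions per tree, then a majority vote per test point
votes = np.stack([t.predict(X_te) for t in trees])
majority = (votes.mean(axis=0) >= 0.5).astype(int)

bagged_accuracy = (majority == y_te).mean()
print(round(bagged_accuracy, 3))
```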

Support Vector Machine (SVM)

A support vector machine classifies data based on its position relative to a border between the positive class and the negative class. This border is known as the hyperplane, and it is chosen to maximize the margin, that is, the distance between the hyperplane and the nearest data points from each class. Similar to decision trees and random forests, support vector machines can be used for both classification and regression; SVC (support vector classifier) is for classification problems.

from sklearn.svm import SVC
svc = SVC()
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)

support vector machine common hyperparameters: C, kernel, gamma

K-Nearest Neighbour (KNN)

You can think of the k-nearest neighbour algorithm as representing each data point in an n-dimensional space defined by n features. It calculates the distance from one point to another, and then assigns a label to unobserved data based on the labels of the nearest observed data points. KNN can also be used for building a recommendation system.

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

KNN common hyperparameters: n_neighbors, weights, leaf_size, p
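The distance-then-vote logic behind KNN is short enough to write out directly. The sketch below is a from-scratch illustration on a hypothetical toy dataset with Euclidean distance; KNeighborsClassifier adds faster neighbour search and more options on top of the same idea.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Label x_new by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distance
    nearest = np.argsort(distances)[:k]                   # indices of k closest
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]

# Hypothetical toy data: two well-separated clusters in 2 dimensions
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([1.1, 1.0])))  # 0
print(knn_predict(X_toy, y_toy, np.array([5.1, 5.0])))  # 1
```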

Naive Bayes

Naive Bayes is based on Bayes’ Theorem, an approach to calculating conditional probability based on prior knowledge, together with the naive assumption that each feature is independent of the others. The biggest advantage of Naive Bayes is that, while most machine learning algorithms rely on large amounts of training data, it performs relatively well even when the training data size is small. Gaussian Naive Bayes is a type of Naive Bayes classifier that assumes the features follow a normal distribution within each class.

from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)

gaussian naive bayes common hyperparameters: priors, var_smoothing

Build a Classification Model Pipeline

1. Loading Dataset and Data Overview

I chose the popular dataset Heart Disease UCI on Kaggle for predicting the presence of heart disease based on several health-related factors.

Use df.info() to get a summarized view of the dataset, including data types, missing data and the number of records.
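As a quick illustration of what this overview reports, the sketch below builds a tiny DataFrame by hand. The rows are hypothetical; only the column names are borrowed from the heart disease dataset, and in practice the frame would come from pd.read_csv on the downloaded file.

```python
import pandas as pd

# Hypothetical rows for illustration only (not real records)
df_demo = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "sex": [1, 1, 0, 1],
    "chol": [233, 250, 204, None],  # one missing value on purpose
    "target": [1, 1, 1, 0],
})

df_demo.info()                   # dtypes, non-null counts, number of rows
missing = df_demo.isnull().sum() # missing values per column
print(missing)
```

The non-null count for chol is one lower than the row count, which is how missing data shows up in the info() summary.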

2. Exploratory Data Analysis (EDA)

Histograms, grouped bar charts and box plots are suitable EDA techniques for classification problems.

Univariate Analysis

A histogram is used for all features because all features have been encoded into numeric values in the dataset. This saves us the time for categorical encoding which usually happens during the feature engineering stage.

Categorical Features vs. Target — Grouped Bar Chart

A grouped bar chart is a straightforward way to show how much a categorical value weighs in determining the target value. For example, sex = 1 and sex = 0 have distinct distributions of the target value, which indicates that sex is likely to contribute to the prediction of the target. Conversely, if the target distribution is the same regardless of the categorical feature, then the two are very likely not correlated.
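The numbers behind a grouped bar chart can be tabulated with pd.crosstab. The data below is a hypothetical ten-row example, not the real dataset; it only shows the shape of the comparison.

```python
import pandas as pd

# Hypothetical sex/target pairs for illustration
df_demo = pd.DataFrame({
    "sex":    [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "target": [1, 1, 1, 0, 0, 0, 0, 1, 0, 1],
})

# Raw counts of each target value within each category
counts = pd.crosstab(df_demo["sex"], df_demo["target"])

# Row-normalized shares make groups of different sizes comparable
shares = pd.crosstab(df_demo["sex"], df_demo["target"], normalize="index")

print(counts)
print(shares.round(2))
```

If matplotlib is installed, calling counts.plot(kind="bar") on this table draws the grouped bar chart itself.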

Numerical Features vs. Target—Box Plot

The box plot shows how the values of numerical features vary across target groups. For example, we can tell that ‘oldpeak’ differs distinctly when the target is 0 vs. when the target is 1, suggesting that it is an important predictor. However, ‘trestbps’ and ‘chol’ appear to be less informative, as the box plot distribution is similar between target groups.

3. Split Dataset into Training and Testing Set

The classification algorithm falls under the category of supervised learning, so the dataset needs to be split into a subset for training and a subset for testing (sometimes also a validation set). The model is trained on the training set and then examined using the testing set.

from sklearn.model_selection import train_test_split
from sklearn import preprocessing

X = df.drop(['target'], axis=1)
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

4. Machine Learning Model Pipeline

In order to create a pipeline, I append the default state of all classification algorithms mentioned above into the model list and then iterate through them to train, test, predict and evaluate.
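A minimal sketch of such a loop is shown below. It runs on synthetic data from make_classification as a stand-in for the heart disease dataset (an assumption so the block is self-contained), and the dictionary of model names is my own labelling.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 13 features (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

# All six classifiers in their default state
models = {
    "Logistic Regression": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
}

# Train, predict and evaluate each model in turn
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = accuracy_score(y_test, y_pred)

for name, acc in sorted(results.items(), key=lambda item: -item[1]):
    print(f"{name}: {acc:.3f}")
```

The scores printed here describe the synthetic data only and say nothing about how the models rank on the real dataset.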

5. Model Evaluation

Below is a high-level explanation of commonly used evaluation methods for classification models: accuracy, ROC & AUC, and the confusion matrix. Each of these metrics is worth a deeper dive; feel free to visit my article on logistic regression for a more detailed illustration.

1. Accuracy

Accuracy is the most straightforward indicator of model performance. It measures the percentage of accurate predictions: accuracy = (true positive + true negative) / (true positive + true negative + false positive + false negative)
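As a worked example of the accuracy calculation, with hypothetical counts for 100 predictions:

```python
# Hypothetical outcome counts for 100 predictions
tp, tn, fp, fn = 40, 45, 5, 10

# Correct predictions divided by all predictions
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.85
```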

2. ROC & AUC

ROC is the plot of the true positive rate against the false positive rate at various classification thresholds. AUC is the area under the ROC curve, and a higher AUC indicates better model performance.
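Both quantities are available in sklearn.metrics. The sketch below uses four hypothetical predictions; y_score stands for the predicted probability of the positive class, which for a fitted model would typically come from predict_proba.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and predicted positive-class probabilities
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# One (false positive rate, true positive rate) point per threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)

auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75
```

The value 0.75 can be read as follows: of the four positive-negative pairs, three are ranked correctly (the positive scores higher than the negative).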

3. Confusion matrix

A confusion matrix indicates the actual values vs. predicted values and summarizes the true negative, false positive, false negative and true positive values in a matrix format.
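A small sketch with hypothetical labels shows the layout scikit-learn uses: rows are actual classes, columns are predicted classes.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted labels for eight observations
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[3 1]
#  [1 3]]

# For binary labels, ravel() unpacks in this order
tn, fp, fn, tp = cm.ravel()
```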

Then we can use Seaborn to visualize the confusion matrix in a heatmap.

Based on the three evaluation techniques mentioned above, random forest and naive Bayes perform best, while KNN performs poorly. However, this does not imply that naive Bayes and random forest are better algorithms in general. We can only conclude that they are better suited to this dataset, given its small size and the differing scales of its features.

For example, KNN is sensitive to features at different scales and multicollinearity impacts the outcome of logistic regression. Each algorithm has its own preferences and requires a distinct approach to data processing and feature engineering. We may weigh the trade-offs and choose the best model for the dataset by being aware of the features of each.
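The scale sensitivity of KNN can be checked by comparing it with and without standardization. This sketch uses scikit-learn's bundled wine dataset as a stand-in (an assumption, picked because its features sit on very different scales), not the heart disease data.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_wine, y_wine = load_wine(return_X_y=True)

# KNN on raw features: large-valued features dominate the distance
raw_knn = KNeighborsClassifier()

# KNN after standardizing each feature to zero mean and unit variance
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())

raw_score = cross_val_score(raw_knn, X_wine, y_wine, cv=5).mean()
scaled_score = cross_val_score(scaled_knn, X_wine, y_wine, cv=5).mean()

print(f"raw: {raw_score:.3f}, scaled: {scaled_score:.3f}")
```

Putting the scaler inside the pipeline means it is fitted on the training folds only, which avoids leaking information from the held-out fold.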


DISCLAIMER: All views expressed in this article are only for information purposes and any actions by the readers will be their own responsibility.

© 2023 Mrinmoy Paul

Source: https://towardsdatascience.com/

