Machine Learning Classification Algorithms - 1/2 An Introduction
Elsayed Rashed
Technology Leader || Helping businesses to achieve their digital transformation by leveraging the power of Data, AI/ML, and Cloud Computing through agile engineering and problem-solving creativity
Supervised Machine Learning algorithms are classified into Regression and Classification. Regression predicts continuous values, while Classification is used for predicting categorical values.
Classification is a widely used technique in machine and statistical learning. It is mainly used for identifying spam emails, analyzing financial risks, predicting customer churn, and discovering potential customers.
In two articles, I will introduce classification algorithms and provide an example of their application to solve a classification problem.
Article I: Machine Learning Classification Algorithms - An Introduction
Article II: Language Detector
Introduction to Machine Learning
Machine learning is a subset of Artificial Intelligence and a subfield of Data Science. It involves the study of how software can learn from past experiences. Machine learning enables computers to learn on their own by using statistical methods to improve performance and predict output without the need for explicit programming.
In the last 5-10 years, the field of machine learning has grown rapidly, owing to breakthroughs in new algorithms such as deep learning. This, combined with a steep increase in computing power, especially for parallel operations on GPUs and TPUs, has allowed for huge improvements in the training of machine learning models.
Types of Machine Learning
Supervised Learning:
Supervised learning is a popular type of machine learning approach, where labeled data is provided to the machine learning system for training. The system predicts the output based on this training. It is a simple and widely known automatic learning task. It relies on pre-defined examples, where the category of each input is already known.
For example, a spam filtering dataset contains both spam and non-spam messages, each labeled as such. Because we know which messages are spam during training, we can train our model to classify new and unseen messages.
In the context of supervised learning, there are two main types of tasks: classification and regression.
In other words, classification tasks predict a categorical label for the class attribute, while regression tasks predict a numeric value for it.
Common supervised learning applications include:
Unsupervised Learning:
Unsupervised learning refers to a method where a machine learns without any guidance or supervision. In this type of learning, data points do not have any labels or predetermined classes. Therefore, the algorithm needs to infer the groupings from the unlabeled dataset itself, which means that its primary goal is to describe the underlying structure of the data.
To enable unsupervised learning, clustering techniques are used to group unlabeled data based on similarity measures, revealing hidden patterns and facilitating feature learning.
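As a rough illustration of grouping unlabeled points by similarity, here is a minimal k-means sketch. K-means is only one of several clustering techniques, and the toy points, the starting centroids, and the function name `k_means` are assumptions made up for this example.

```python
import math

def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then move the centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its closest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

# Unlabeled toy data, invented for illustration: two visibly separate groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

No labels are passed in; the two groups emerge only from the distances between points.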
Common unsupervised learning applications include:
Reinforcement Learning:
Reinforcement learning is a technique where the model learns from a series of actions or behaviors, allowing it to improve over time. The complexity of datasets or sample complexity is crucial in the success of reinforcement learning algorithms, as it affects the ability of the algorithm to learn the target function effectively.
Reinforcement learning is a feedback-based machine learning method where an agent gets rewarded for taking correct actions and penalized for incorrect ones.
Common reinforcement learning applications include:
Introduction to Classification Technique
Classification Technique is a type of Supervised Learning that helps to identify the appropriate category for new observations based on the training data. In this method, a program learns from the available dataset, and then assigns new observations into different categories or classes, such as Yes or No, 0 or 1, Spam or Not Spam, and so on. This approach is useful for making accurate predictions and improving decision-making processes.
The classification algorithm requires labeled input data, with input and corresponding output variables representing categories.
Types of Classification
The algorithm used to classify a dataset is called a classifier. There are three types of classification:
Binomial (Binary) Classifier
Classifies data into two categories, such as presence/absence, positive/negative, or diseased/healthy.
Multinomial Classifier
Classifies data into three or more classes, such as document classification for Politics, Sports, Social issues, and the Economy.
Ordinal Classifier
Classifies data into three or more ordered classes such as "low", "medium", or "high" based on risk level.
Classification Algorithms
A classification algorithm is a type of Supervised Learning technique that helps in identifying the category of new observations based on training data. To better understand classification algorithms, you can refer to the following diagram. The diagram shows two classes - Class A and Class B, with features that are similar within the same class and dissimilar across different classes.
There are mainly two categories of Classification Algorithms:
Linear Models, like:
Non-linear Models, like:
Logistic Regression
Logistic Regression is a popular Machine Learning algorithm that can provide probabilities and classify new data using both continuous and discrete datasets.
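To make the "provides probabilities, then classifies" idea concrete, here is a minimal sketch of the prediction step only. The weights and bias are hypothetical, hand-picked values rather than fitted ones, and the 0.5 threshold is an assumption; a real model would learn the parameters from training data.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for one observation."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def predict(features, weights, bias, threshold=0.5):
    """Turn the probability into a class label (1 or 0)."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0

# Hypothetical parameters, chosen by hand for illustration (not fitted).
weights = [2.0, -1.0]
bias = 0.0

p = predict_proba([1.0, 2.0], weights, bias)  # z = 2*1 - 1*2 = 0, so p = 0.5
label = predict([3.0, 1.0], weights, bias)    # z = 5, probability close to 1
```

Raising or lowering `threshold` trades false positives against false negatives without changing the underlying probabilities.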
K-Nearest Neighbors (KNN)
KNN is a simple non-parametric machine learning algorithm that does not make any assumptions about the underlying data.
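A minimal sketch of the idea, assuming Euclidean distance and a simple majority vote among the k closest training points. The toy points and the name `knn_predict` are invented for illustration.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy data invented for illustration: two well-separated groups.
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (8.0, 8.0), (8.5, 9.0), (9.0, 8.0)]
train_labels = ["A", "A", "A", "B", "B", "B"]

near_a = knn_predict(train_points, train_labels, (1.2, 1.4))
near_b = knn_predict(train_points, train_labels, (8.7, 8.2))
```

There is no training step here: the "model" is just the stored data, which is why KNN makes no assumptions about how the data is distributed.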
Support Vector Machine (SVM)
As a machine learning algorithm, SVM is considered powerful and well-suited for smaller datasets. However, its effectiveness extends to complex datasets as well. SVM constructs hyperplanes or a set of hyperplanes in a high or infinite-dimensional space that can be used for classification.
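Once a separating hyperplane is known, classifying a point comes down to checking which side of it the point falls on. The sketch below shows only that decision step for the linear case; the weight vector `w` and offset `b` are hypothetical values picked by hand, whereas an actual SVM would find them by maximizing the margin on training data.

```python
def side_of_hyperplane(x, w, b):
    """Return +1 or -1 depending on which side of the hyperplane w.x + b = 0 the point lies."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Hypothetical hyperplane x1 + x2 - 5 = 0 (not learned from data).
w = [1.0, 1.0]
b = -5.0

above = side_of_hyperplane((4.0, 4.0), w, b)  # score = 3
below = side_of_hyperplane((1.0, 1.0), w, b)  # score = -3
```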
Naive Bayes
As a probabilistic classifier, Naive Bayes uses the Maximum A Posteriori decision rule in a Bayesian setting to make classifications. One of the advantages of Naive Bayes is its ability to handle imbalanced data, making it a popular choice for text classification tasks such as spam filtering.
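The Maximum A Posteriori rule picks the class whose prior multiplied by the likelihood of the observed words is largest. Below is a minimal sketch for text, assuming a bag-of-words representation and add-one (Laplace) smoothing; the four tiny documents and the function names are made up for illustration.

```python
import math
from collections import Counter

def train_naive_bayes(documents, labels):
    """Count documents per class and words per class."""
    class_counts = Counter(labels)
    word_counts = {c: Counter() for c in class_counts}
    for words, label in zip(documents, labels):
        word_counts[label].update(words)
    vocabulary = {w for words in documents for w in words}
    return class_counts, word_counts, vocabulary

def predict_map(words, class_counts, word_counts, vocabulary):
    """Return the class with the highest log posterior (MAP decision rule)."""
    total_docs = sum(class_counts.values())
    best_class, best_score = None, -math.inf
    for c in class_counts:
        score = math.log(class_counts[c] / total_docs)  # log prior
        total_words = sum(word_counts[c].values())
        for w in words:
            # Add-one smoothing keeps unseen words from zeroing out the product.
            likelihood = (word_counts[c][w] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy training set, invented for illustration.
documents = [
    ["win", "money", "now"],
    ["free", "money"],
    ["meeting", "at", "noon"],
    ["lunch", "at", "noon"],
]
labels = ["spam", "spam", "ham", "ham"]

model = train_naive_bayes(documents, labels)
prediction = predict_map(["free", "money", "now"], *model)
```

Working in log space turns the product of many small probabilities into a sum, which avoids numerical underflow on longer messages.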
Decision Tree
The Decision Tree algorithm is a popular algorithm due to its simple approach in dealing with complex datasets. It has a hierarchical, tree structure consisting of a root node, branches, internal nodes, and leaf nodes.
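The structure described above can be sketched as nested dictionaries: internal nodes test a feature, branches correspond to feature values, and leaf nodes hold a class label. The tree below is hypothetical and built by hand; a real decision tree algorithm would learn which features to split on from the data.

```python
# Hand-built tree for illustration only; the feature names are invented.
tree = {
    "feature": "contains_link",          # root node
    "branches": {
        True: {
            "feature": "known_sender",   # internal node
            "branches": {
                True: {"label": "not spam"},   # leaf node
                False: {"label": "spam"},      # leaf node
            },
        },
        False: {"label": "not spam"},          # leaf node
    },
}

def classify(node, observation):
    """Walk from the root to a leaf, following the branch that matches each feature value."""
    while "label" not in node:
        value = observation[node["feature"]]
        node = node["branches"][value]
    return node["label"]

result = classify(tree, {"contains_link": True, "known_sender": False})
```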
Random Forest
Random Forest is a type of classifier that consists of multiple decision trees. It combines the predictions of these trees, typically by majority vote or by averaging their predicted class probabilities, to improve the accuracy of predictions. This method is based on the concept of ensemble learning, which involves combining multiple classifiers to solve complex problems.
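For classification, one common way to combine the trees is a majority vote over their individual predictions. The sketch below shows only that combining step; the three "trees" are hypothetical one-rule stand-ins written by hand, whereas a real random forest would train each tree on a random sample of the data and features.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the most classifiers."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical stand-ins for trained trees (each is a single hand-written rule).
def tree_1(x):
    return "spam" if x["links"] > 2 else "not spam"

def tree_2(x):
    return "spam" if x["exclamations"] > 3 else "not spam"

def tree_3(x):
    return "not spam" if x["known_sender"] else "spam"

def forest_predict(trees, x):
    """Ask every tree for a prediction and return the majority label."""
    return majority_vote([tree(x) for tree in trees])

trees = [tree_1, tree_2, tree_3]
message = {"links": 5, "exclamations": 4, "known_sender": True}
result = forest_predict(trees, message)  # votes: spam, spam, not spam
```

Using an odd number of trees with two classes avoids tied votes.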
Evaluating Classification Models
After completing the classification model, it is important to evaluate its performance using the confusion matrix and associated metrics.
Confusion Matrix
The confusion matrix, also known as the error matrix, is a table that outlines the performance of the model.
The matrix displays the number of correct and incorrect predictions in a summarized table format:
By utilizing the confusion matrix, we can calculate the model's accuracy and several other performance metrics.
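For a binary problem, the four cells of the confusion matrix are true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN). A minimal sketch of counting them from actual and predicted labels follows; the label lists are invented for illustration and the function name `confusion_counts` is an assumption.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN, and TN for a binary classification problem."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1   # predicted positive, actually positive
        elif predicted == positive:
            fp += 1   # predicted positive, actually negative
        elif actual == positive:
            fn += 1   # predicted negative, actually positive
        else:
            tn += 1   # predicted negative, actually negative
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}

# Toy labels, invented for illustration (1 = positive, 0 = negative).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)
# counts == {"TP": 3, "FP": 1, "FN": 1, "TN": 5}
```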
Accuracy
Accuracy is one of the most common measures of how well a classification model performs. It refers to the frequency of correct predictions made by the model, and is computed by dividing the number of correct predictions made by the classifier by the total number of predictions. The formula for accuracy is given below:
Accuracy = (TP + TN) / (TP + FP + FN + TN)
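As a quick worked example, the formula can be written as a small function. The counts passed in below are made up for illustration.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all predictions that were correct."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0

# Hypothetical counts: 3 true positives, 5 true negatives, 1 false positive, 1 false negative.
acc = accuracy(tp=3, tn=5, fp=1, fn=1)  # (3 + 5) / 10 = 0.8
```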
Precision
The precision of a model is the proportion of positive predictions that are actually correct. Precision answers the question: what proportion of predicted positives is truly positive?
Precision is a valid choice of evaluation metric when we want to be very sure of our prediction.
It can be calculated using the below formula:
Precision = TP / (TP + FP)
Recall
Recall measures the fraction of positive cases that are correctly predicted by the model. Recall answers the question: what proportion of actual positives is correctly classified?
Recall is a valid choice of evaluation metric when we want to capture as many positives as possible.
It can be calculated using the below formula:
Recall = TP / (TP + FN)
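Precision and recall share the same numerator (TP) but divide by different totals, so the same model can score quite differently on each. A minimal sketch, using made-up counts:

```python
def precision(tp, fp):
    """Of everything predicted positive, how much was truly positive?"""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Of everything truly positive, how much did the model find?"""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical counts: the model is cautious, so it misses many positives.
tp, fp, fn = 8, 2, 8

p = precision(tp, fp)  # 8 / 10 = 0.8
r = recall(tp, fn)     # 8 / 16 = 0.5
```

Here most positive predictions are right (high precision), yet half of the actual positives go undetected (low recall).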
F1 Score
Comparing two models that have either low precision and high recall or high precision and low recall can be challenging. To overcome this, we can use the F-score, which evaluates both recall and precision simultaneously. For a given average of precision and recall, the F-score is highest when the two are equal, so it penalizes models that do well on one metric at the expense of the other.
F1 score is the harmonic mean of precision and recall, and is a value between 0 and 1. It balances the precision and recall of a classifier.
It can be calculated using the below formula:
F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
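To see how the harmonic mean behaves, the sketch below compares a balanced classifier with a lopsided one. The precision and recall values are made up for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)

# Two hypothetical classifiers with the same arithmetic mean (0.5).
balanced = f1_score(0.5, 0.5)   # 2 * 0.25 / 1.0 = 0.5
lopsided = f1_score(0.9, 0.1)   # 2 * 0.09 / 1.0 = 0.18
```

Both classifiers average 0.5 on precision and recall, but the lopsided one gets a much lower F1 score because the harmonic mean is pulled toward the smaller value.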