Do you have someone who can turn you off when they want? And why do i?
25/05/2022 - Alistair Pernigo
Machine Learning algorithm:?a brief 13-point vision.
In the era where it is no longer man who seeks and sees the world, but it is the world that shows itself to man to be investigated, the phenomenon of Machine Learning (ML) occupies a pivotal role.
Machine Learning could be defined as "an algorithmic evolution" capable of "building itself". (P. Domingos, 2016, p. 15) since, thanks to its ability to learn, it predicts and structures data: the technology that characterizes this algorithm, favoring the development of a new approach to the world, offers the possibility of understanding and knowing what we desire or what we might desire, even before the aforementioned desire reaches our awareness.
I have recently been working on building an algorithm for a Deep learning system based on neural networks that communicate and instruct a specific Bio-mimetic sensor?(AI+IOT).
Deep learning is a sub-field of Machine Learning and I will not deal with it directly in this paper.
For now, I would like to take a step back and summarise some of the most common ML algorithms that I have come across and studied.?Leaving at the end a more philosophical observation.
Topics covered:
? Classification of Machine Learning algorithms: Supervised Learning, Unsupervised Learning, Reinforcement Learning.
? Basic Machine Learning Algorithms: Linear Regression, Support Vector Machines (SVM), Nearest Neighbors (KNN), Logistic Regression, Decision Trees, K-means, K-Means Clustering, Hierarchical Clustering, DBSCAN Clustering, Principal Component Analysis (PCA), Random Forest, Naive Bayes, Dimensionality Reduction, Gradient Boosting.
? Formulas, diagrams, cases.
Machine Learning algorithms can be roughly divided into three categories:
? Supervised learning algorithm (Supervised Algorithms): During the supervised learning training process, a pattern (function/learning model) can be learned or established from the training data set, and new instances can be inferred based on this pattern. The algorithm requires specific inputs/outputs, and first, you need to decide what kind of data to use as a reference.
For example, a handwritten character in a text recognition application, or a line of handwritten text.?
The main algorithms include the Neural network, Support vector machine, Nearest Neighbor method, Naive Bayes method, Decision Tree, etc.
? Unsupervised Learning Algorithms (Unsupervised Algorithms): Unsupervised learning models are used when we only have the input variables (X) and no corresponding output variables. They use unlabeled training data to model the underlying structure of the data.
? Reinforcement learning algorithm (Reinforcement Algorithms): Reinforcement learning has strong universality and is mainly based on decision-making for training.
The algorithm trains itself according to the success or error of the output result (decision). The optimized algorithm will be able to give better results through a lot of experience training. Prediction. Similar organisms gradually form anticipation of stimuli under the stimulation of rewards or punishments given by the environment and produce habitual behaviors that can obtain the greatest benefits. In the context of operations research and cybernetics, reinforcement learning is called "approximate dynamic programming" (ADP).
Let's delve into.
Basic Machine Learning Algorithms:
1. Linear Regression Algorithm (Supervised?learning algorithm).
Regression analysis is a statistical data analysis method, the purpose is to understand whether two or more variables are correlated, the direction and strength of the correlation, and to establish a mathematical model to observe specific variables to predict changes in other variables.
The linear regression algorithm (Linear Regression) modeling process is to use data points to find the best fit line. The formula, y = m x + c, where y is the dependent variable and x is the independent variable, finds the values of m and c using the given data set.
Linear regression is divided into two types, namely simple linear regression (simple linear regression), only one independent variable, and multivariate regression (multiple regression), and at least two independent variables.
Here is a linear regression example: described based on the Python toolkit.
2. Support Vector Machine (SVM) algorithm (Supervised?learning algorithm).
Support Vector Machines/Network Algorithms (SVMs) are classified algorithms.
The SVM model represents instances as points in space, and a line will be used to separate the data points.
It should be noted that SVMs require full labeling of the input data and are only directly applicable to two-class tasks, the application reduces the multi-class task needs to a few binary problems.
SVM distinguishes classes by drawing a?decision boundary.?How to draw or determine the decision boundary is the most critical part of SVM algorithms. Before creating the decision boundary, each observation (or data point) is plotted in n-dimensional space. “n” is the number of features used. For instance, if we use “length” and “width” to classify different “cells”, observations are plotted in a 2-dimensional space and the decision boundary is a line. If we use 3 features, decision boundary is a plane in a 3-dimensional space. If we use more than 3 features, decision boundary becomes a hyperplane which is really hard to visualize.
A decision boundary in 2D space is a line
The decision boundary is drawn in a way that the distance to support vectors is maximized.?If the decision boundary is too close to a support vector, it will be highly sensitive to noises and not generalize well. Even very small changes in independent variables may cause a misclassification.
The data points are not always linearly separable like in the figure above. In these cases, SVM uses?the kernel trick?which measures the similarity (or closeness) of data points in a higher-dimensional space in order to make them linearly separable.
function is kind of a similarity measure. The inputs are original features and the output is a similarity measure in the new feature space. The similarity here means a degree of closeness. It is a costly operation to actually transform data points to a high-dimensional feature space. The algorithm does not actually transform the data points to a new, high-dimensional feature space.
Kernelized SVM computes decision boundaries in terms of similarity measures in a high-dimensional feature space without actually doing a transformation. I think this is why it is also called?the kernel trick.
SVM is especially effective in cases where a number of dimensions are more than the number of samples. When finding the decision boundary, SVM uses a subset of training points rather than all points which makes it memory efficient. On the other hand, training time increases for large datasets which negatively affects the performance.
3. The Nearest Neighbor/K-Nearest Neighbor Algorithm (KNN) (Supervised?learning algorithm).
The KNN algorithm is an instance-based learning, or a local approximation and lazy learning that defers all computations until after classification. Use the nearest neighbors (k) to predict unknown data points. The value of k is a key factor in prediction accuracy. Whether it is classification or regression, it is very useful to measure the weight of neighbors, with closer neighbors being weighted more heavily than distant neighbors.
The disadvantage of the KNN algorithm is that it is very sensitive to the local structure of the data. It is computationally intensive and requires normalizing the data so that each data point is in the same range.
Extension: One disadvantage of KNN is that it relies on the entire training dataset, Learning Vector Quantization (LVQ) is a supervised learning human neural network algorithm that allows you to select training instances.
LVQ is driven by data, searches for the two closest neurons to it, draws the same type of neurons, repels the different types of neurons, and finally obtains the distribution pattern of the data. If a better dataset classification effect can be obtained based on KNN, using LVQ can reduce the storage size of the training dataset. Typical learning vector quantization algorithms include LVQ1, LVQ2, and LVQ3, especially LVQ2 is the most widely used.
4. Logistic regression algorithm Logistic Regression.
Logistic regression is a?supervised?learning algorithm that is mostly used for?binary?classification?problems. Although “regression” contradicts “classification”, the focus here is on the word “logistic” referring to?the logistic function?which does the classification task in this algorithm.
Logistic regression is a simple yet very effective classification algorithm so it is commonly used for many binary classification tasks. Customer churn, spam email, website or ad click predictions are some examples of the areas where logistic regression offers a powerful solution.
The basis of logistic regression is the logistic function, also called the sigmoid function, which takes in any real-valued number and maps it to a value between 0 and 1.
Consider we have the following linear equation to solve:
The logistic regression model takes a linear equation as input and uses logistic function and log odds to perform a binary classification task. Then, we will get the famous s-shaped graph of logistic regression:
We can use the calculated probability as is. For example, the output can be “the probability that this email is spam is 95%” or “the probability that customer will click on this ad is 70%”. However, in most cases, probabilities are used to classify data points. For instance, if the probability is greater than 50%, the prediction is a positive class (1). Otherwise, the prediction is a negative class (0).
It is not always desired to choose a positive class for all probability values higher than 50%. Regarding the spam email case, we have to be almost sure in order to classify an email as spam. Since emails detected as spam directly go to spam folder, we do not want the user to miss important emails. Emails are not classified as spam unless we are almost sure.
On the other hand, when classification in a health-related issue requires us to be much more sensitive. Even if we are a little suspicious that a cell is malignant, we do not want to miss it. So the value that serves as a threshold between positive and negative class is problem-dependent. Good thing is that logistic regression allows us to adjust this threshold value.
5. Decision Tree Algorithm.
A decision tree is a special tree structure that consists of a decision graph and possible outcomes (such as cost and risk) to aid decision making. In machine learning, a decision tree is a predictive model. Each node in the tree represents an object, each bifurcation path represents a possible attribute value, and each leaf node corresponds to the root node to the leaf node. The value of the object is represented by the traversed path. Decision trees have only a single output, and usually, the algorithm is used to solve classification problems.
A decision tree contains three types of nodes:
? Decision node: usually represented by a rectangular box
? Opportunity node: usually represented by a circle
? Termination point: usually represented by a triangle
An example of a simple decision tree algorithm to determine who in a crowd prefers to use a credit card. Considering the age and marital status of the population, people are more likely to choose a credit card if they are 30 years old or married, and less if they are married.
This decision tree can be further extended by identifying suitable attributes to define more categories. In this example, if a person is married and he is over 30, they are more likely to have a credit card (100% preference).
Test data is used to generate decision trees.
Note: For data with an inconsistent number of samples in each category, the results of information gained in decision trees are biased towards those features with more numerical values.
6. K-Means Algorithm.
The k-means algorithm (K-Means) is an unsupervised learning algorithm that provides a solution to the clustering problem.
The K-Means algorithm divides n points (which can be an observation or an instance of a sample) into k clusters so that each point belongs to the cluster corresponding to the nearest mean (ie, the cluster center, centroid). Repeat the above process until the center of gravity does not change.
K-Means Clustering.
Clustering is a way to group a set of data points in a way that similar data points are grouped together. Therefore, clustering algorithms look for similarities or dissimilarities among data points. Clustering is an unsupervised learning method so there is no label associated with data points. Clustering algorithms try to find the underlying structure of the data.
Clustering is not classification.
Observations (or data points) in a classification task have labels. Each observation is classified according to some measurements. Classification algorithms try to model the relationship between measurements (features) on observations and their assigned class. Then the model predicts the class of new observations.
K-means clustering?aims to partition data into k clusters in a way that data points in the same cluster are similar and data points in the different clusters are farther apart. Thus, it is a?partition-based?clustering technique. The similarity of the two points is determined by the distance between them.
K-means clustering tries to minimize distances within a cluster and maximize the distance between different clusters. K-means algorithm is not capable of determining the number of clusters. We need to define it when creating the KMeans object which may be a challenging task.
Consider the following 2D visualization of a dataset:
It can be partitioned into 4 different clusters as below:
Real-life datasets are much more complex in that clusters are not clearly separated. However, the algorithm works in the same way. K-means is an iterative process.
It is built on?an expectation-maximization? algorithm. After the number of clusters are determined, it works by executing the following steps:
K-Means clustering is relatively fast and easy to interpret. It is also able to choose the positions of initial centroids in a smart way that speeds up the convergence.
One challenge with k-means is that number of clusters must be pre-determined. K-means algorithm is not able to guess how many clusters exist in the data. If there is a non-linear structure separating groups in the data, k-means will not be a good choice.
7. Hierarchical Clustering.
Hierarchical clustering means creating a tree of clusters by iteratively grouping or separating data points. There are two types of hierarchical clustering:
领英推荐
One of the advantages of hierarchical clustering is that we do not have to specify the number of clusters (but we can).
Agglomerative clustering is kind of a bottom-up approach. Each data point is assumed to be a separate cluster at first. Then the similar clusters are iteratively combined.
The figure above is called?a dendrogram?which is a diagram representing a tree-based approach. In hierarchical clustering, dendrograms are used to visualize the relationship among clusters.
One of the advantages of hierarchical clustering is that we do not have to specify the number of clusters beforehand. However, it is not wise to combine all data points into one cluster. We should stop combining clusters at some point. Scikit-learn provides two options for this:
Divisive clustering is not commonly used in real life so I will mention it briefly. A simple yet clear explanation is that?divisive clustering?is the opposite of agglomerative clustering. We start with one giant cluster including all data points. Then data points are separated into different clusters. It is an up-to-bottom approach.
Hierarchical clustering always generates the same clusters. K-means clustering may result in different clusters depending on how the centroids (center of cluster) are initiated. However, it is a slower algorithm compared to k-means. Hierarchical clustering takes a long time to run especially for large data sets.
8. DBSCAN Clustering.
Partition-based and hierarchical clustering techniques are highly efficient with normal-shaped clusters. However, when it comes to arbitrary-shaped clusters or detecting outliers, density-based techniques are more efficient.
DBSCAN stands for?density-based?spatial?clustering of?applications with?noise. It is able to find arbitrary-shaped clusters and clusters with noise (i.e. outliers).
The main idea behind DBSCAN is that a point belongs to a cluster if it is close to many points from that cluster.
There are two key parameters of DBSCAN:
Based on these two parameters, points are classified as core points, border points, or outliers:
DBSCAN does not require to specify the number of clusters beforehand. It is robust to outliers and able to detect outliers.
In some cases, determining an appropriate distance of neighborhood (eps) is not easy and it requires domain knowledge.
9. Random Forest Algorithm.
The name of the Random Forest Algorithm (Random Forest) comes from random decision forests proposed by Bell Labs in 1995. As its name says, the random forest can be regarded as a collection of decision trees.
Each decision tree in a random forest estimates a classification, a process called "voting". Ideally, we choose the class with the most votes for each vote for each decision tree.
? Paper Random Forest | Leo Breiman | Statistics Department University of California Berkeley
Random forests are built using a method called?bagging?in which decision trees are used as parallel estimators. If used for a classification problem, the result is based on a majority vote of the results received from each decision tree. For regression, the prediction of a leaf node is the mean value of the target values in that leaf. Random forest regression takes the mean value of the results from decision trees.
Random forests reduce the risk of overfitting and accuracy is much higher than a single decision tree. Furthermore, decision trees in a random forest run in parallel so that the time does not become a bottleneck.
The success of a random forest highly depends on using uncorrelated decision trees. If we use the same or very similar trees, the overall result will not be much different from the result of a single decision tree. Random forests achieve to have uncorrelated decision trees by?bootstrapping?and?feature randomness.
Bootstrapping?is randomly selecting samples from training data with replacement. They are called bootstrap samples.
Feature randomness?is achieved by selecting features randomly for each decision tree in a random forest. The number of features used for each tree in a random forest can be controlled with?the max_features?parameter.
Random forest is a highly accurate model for many different problems and does not require normalization or scaling. However, it is not a good choice for high-dimensional data sets (i.e. text classification) compared to fast linear models (i.e. Naive Bayes).
10. Naive Bayes Algorithm.
Naive Bayes is based on the Bayes theorem of probability theory and has a wide range of applications, from text classification, spam filters, medical diagnosis, and more. Naive Bayes is suitable for independent scenarios between features, such as predicting flower type using the length and width of petals. The connotation of "simplicity" can be understood as the strong independence between features and features.
A concept closely related to the naive Bayesian algorithm is Maximum likelihood estimation, and most of the maximum likelihood estimation theory in history has also been greatly developed in Bayesian statistics. For example, to establish a population height model, it is difficult to have the human and material resources to count the height of everyone in the country, but it is possible to obtain the height of some people through sampling, and then obtain the mean and variance of the distribution through maximum likelihood estimation.
Naive Bayes is called naive because it assumes that each input variable is independent.
The intuition behind the Naive Bayes algorithm is the Bayes's theorem:
p(A|B): Probability of event A given event B has already occurred.
p(B|A): Probability of event B given event A has already occurred.
p(A): Probability of event A.
p(B): Probability of event B.
Naive Bayes classifier calculates the probability of a class given a set of feature values (i.e. p(yi | x1, x2, … , xn)).?Input this into Bayes’ theorem:
p(x1, x2 , … , xn | yi)?means the probability of a specific combination of features (an observation/row in a dataset) given a class label. We need extremely large datasets to have an estimate on the probability distribution for all different combinations of feature values.
To overcome this issue,?the Naive Bayes algorithm assumes that all features are independent of each other.?Furthermore, denominator (p(x1,x2, … , xn)) can be removed to simplify the equation because it only normalizes the value of conditional probability of a class given an observation ( p(yi | x1,x2, … , xn)).
The probability of a class ( p(yi) ) is very simple to calculate:
Under the assumption of features being independent,?p(x1, x2 , … , xn | yi)?can be written as:
The conditional probability for a single feature given the class label (i.e. p(x1 | yi) ) can be more easily estimated from the data. The algorithm needs to store probability distributions of features for each class independently. For example, if there are 5 classes and 10 features, 50 different probability distributions need to be stored. Adding all these up, it became an easy task for the Naive Bayes algorithm to calculate the probability to observe a class given values of features (p(yi | x1, x2 , … , xn) )
The assumption that all features are independent makes the Naive Bayes algorithm?very fast?compared to complicated algorithms.?In some cases, speed is preferred over higher accuracy. On the other hand, the same assumption makes the Naive Bayes algorithms less accurate than complicated algorithms. Speed comes at a cost!
11. Dimensional Reduction.
In the field of machine learning and statistics, dimensionality reduction refers to the process of reducing the number of random variables under limited conditions to obtain a set of "irrelevant" main variables, which can be further subdivided into feature selection and feature extraction.
Some datasets may contain many intractable variables. Especially in the case of abundant resources, the data in the system will be very detailed. In this case, the dataset may contain thousands of variables, most of which may also be unnecessary. In this case, it is nearly impossible to identify the variables that have the greatest impact on our predictions. At this point, we need to use a dimensionality reduction algorithm, and other algorithms may also be used in the process of dimensionality reduction, such as borrowing random forests and decision trees to identify the most important variables.
12. Gradient Boosted Decision Trees (GBDT).
GBDT?is an ensemble algorithm that uses?boosting?method to combine individual decision trees.
Boosting means combining a learning algorithm in series to achieve a strong learner from many sequentially connected weak learners. In the case of GBDT, the weak learners are decision trees.
Each tree attempts to minimize the errors of the previous tree. Trees in boosting are weak learners but adding many trees in series and each focusing on the errors from the previous one make boosting a highly efficient and accurate model. Unlike bagging, boosting does not involve bootstrap sampling. Every time a new tree is added, it fits on a modified version of the initial dataset.
Since trees are added sequentially, boosting algorithms learn slowly. In statistical learning, models that learn slowly perform better.
A loss function is used to detect the residuals. For instance, mean squared error (MSE) can be used for a regression task and logarithmic loss (log loss) can be used for classification tasks. It is worth noting that existing trees in the model do not change when a new tree is added. The added decision tree fits the residuals from the current model.
Learning rate?and?n_estimators?are two critical hyperparameters for gradient boosting decision trees. Learning rate, denoted as α, simply means how fast the model learns. Each new tree modifies the overall model. The magnitude of the modification is controlled by learning rate.?n_estimator?is the number of trees used in the model. If the learning rate is low, we need more trees to train the model. However, we need to be very careful in selecting the number of trees. It creates a high risk of overfitting to use too many trees.
GBDT is very efficient on both classification and regression tasks and provides more accurate predictions compared to random forests. It can handle the mixed types of features and no pre-processing is needed. GBDT requires careful tuning of hyperparameters in order to prevent the model from overfitting.
GBDT algorithm is so powerful that there are many upgraded versions of it have been implemented such as XGBOOST, LightGBM, CatBoost.
Note on overfitting.
One key difference between random forests and gradient boosting decision trees is the number of trees used in the model. Increasing the number of trees in random forests does not cause overfitting. After some point, the accuracy of the model does not increase by adding more trees but it is also not negatively affected by adding excessive trees. You still do not want to add an unnecessary amount of trees due to computational reasons but there is no risk of overfitting associated with the number of trees in random forests.
However, the number of trees in gradient boosting decision trees is very critical in terms of overfitting. Adding too many trees will cause overfitting so it is important to stop adding trees at some point.
13. Principal Component Analysis (PCA)
PCA is a dimensionality reduction algorithm that basically derives new features from the existing ones with keeping as much information as possible. PCA is an unsupervised learning algorithm but it is also widely used as a preprocessing step for supervised learning algorithms.
PCA derives new features by finding the relations among features within a dataset.
Note: PCA is a linear dimensionality reduction algorithm. There are also non-linear methods available.
The aim of PCA is to explain the variance within the original dataset as much as possible by using fewer features (or columns). The new derived features are called?principal components.?The order of principal components is determined according to the fraction of variance of the original dataset they explain.
The principal components are linear combinations of the features of original dataset.
The advantage of PCA is that a significant amount of variance of the original dataset is retained using a much smaller number of features than the original dataset. Principal components are ordered according to the amount of variance they explain.
Conclusion
If we are speaking about artificial intelligence and its impact on our daily lives; it is important to consider both technology and philosophy, machines and men, after all, it is to increasingly develop the ability to know the nature of human beings with their needs, and to develop the ability to integrate with technological innovation without getting lost in it.
The spread of ICT, Information Communications Technology, determines the entry of a new historical phase of a predominantly informational nature, called Hyper-history, in which, as Luciano Floridi argues, "social and individual well-being depends on the use of the same" technologies since their correct use facilitates communication between "human and computational systems" (L. Floridi, 2017).
The formation of multi-agent systems in which technology reorders space and time tends to produce significant anthropological and ethical change.?
It is then a question of investigating the re-location of man, who is no longer configured as a bio, but as a 'bio-techno-being' managed and constantly connected with algorithms and Artificial Intelligence.
This significant ontological novelty, modifies, as argued by Roberto Marchesini (Post-human, 2002), the reflection on man and the representation of man, who has now become a subject hybridized not only with the animal world but also, and above all, with technology.
A reality inhabited by learners and meta-learners would increase the economic profit of companies, and dispense man from any worries related to choices at work, which will be taken care of by the ultimate algorithm, for this reason, it is necessary to understand and know the algorithmic functionality that dominates today and with which we are constantly interfaced.
In an environment in which technologies penetrate the categories of the human to the point of redefining their predicates, the subject is "re-identified" (L. Floridi, 2017) and reformulated by a digital environment, in which, he constructs a virtual identity; existence is "digitized" and the increasingly frequent and assiduous relations of hybridization suggest that an ontological hermeneutics of the subject should be addressed.
Man's behavior is ultimately profoundly modified by technological partners with which he measures himself; the enormity and ubiquity of hybridization processes should not, however, lead us to a total surrender of thought, but rather to a rethinking of the categories of the human being that takes into account the occurred (and not yet finished) computer and biotechnological revolution.
This existential transformation that is manifested and realized through new technologies, therefore, needs to be investigated and analyzed in order to be able to structure a new philosophical, ethical, and moral reflection (A. Fabris, 2019) on the use of technology 'beyond good and evil'.
What do you think?
I do wish to have stimulated your curiosity and thank you for your time.
If you think of any corrections, clarifications, questions and feedback.
Please do not hesitate to contact me.
Alistair Pernigo
Email: [email protected]