Scenarios: Which Machine Learning (ML) to choose?
Inspired by “Which chart to choose?" [1], which helps you choose the right chart for your data, we developed the idea of charting “Which Machine Learning (ML) to choose?”
Before we present the “Which Machine Learning (ML) to choose?” flowchart, part of the "Architectural Blueprints—The “4+1” View Model of Machine Learning," let us take a look at the big picture and zoom in on the steps where this flowchart can guide your selection of a machine learning approach to solve a business problem.
ML Architectural Blueprints = {Scenarios, Accuracy, Complexity, Interpretability, Operations}
To solve a problem and find its solution, you can follow these steps:
Good data quality is a necessary prerequisite to building an accurate ML model. [Data Science Approaches to Data Quality: From Raw Data to Datasets]
Your processing pipeline should include at least the following stages:
a. Data preprocessing and preparation
b. Dataset sampling for training and validation
c. Model training, validation, and evaluation
d. Prediction model deployment
e. Production model monitoring, feedback, and retraining
Which Machine Learning (ML) to choose? Chart: Visual Science Informatics, LLC
Selecting a learning paradigm or computational method involves four major categories, four major algorithm types, and two major techniques. The four major categories are supervised, semi-supervised, unsupervised, and reinforcement learning. The four major algorithm types are classification, regression, association, and clustering. The two techniques are ensemble methods and reward feedback. The chart above, “Which Machine Learning (ML) to choose?”, guides you through the major categories, data types, and objectives to determine which algorithm types or techniques to choose.
Choosing the right machine learning (ML) approach depends on various factors related to the problem you are trying to solve, the nature of your data, and the goals of your project. Here are some common scenarios and the types of ML techniques that might be suitable for each:
Predicting a Continuous Value
Regression is a machine learning task where the goal is to predict a continuous numerical value. This is in contrast to classification, where the goal is to predict a categorical label. Types of Regression:
1. Linear Regression:
2. Polynomial Regression:
3. Logistic Regression (despite its name, typically used for classification rather than predicting a continuous value):
4. Support Vector Regression (SVR):
5. Decision Trees and Random Forests:
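To make the first of these concrete, here is a minimal, hedged linear regression sketch with scikit-learn on a small synthetic dataset (the data and coefficients are illustrative assumptions, not taken from the chart above):

```python
# Minimal linear regression sketch (assumes scikit-learn is installed).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic data: y is roughly a linear function of x plus noise (illustrative only).
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 5.0 + rng.normal(0, 1.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)                      # learn slope and intercept
pred = model.predict(X_test)                     # predict a continuous value

print("coef:", model.coef_, "intercept:", model.intercept_)
print("test MSE:", mean_squared_error(y_test, pred))
```

Polynomial regression, SVR, and tree-based regressors follow the same fit/predict pattern; only the estimator (and, for polynomial regression, a feature-expansion step such as PolynomialFeatures) changes.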
Classifying Data into Categories
Classification is a machine learning task where the goal is to predict a categorical label or class for a given input. There are two main types of classification: binary and multi-class.
Binary Classification
- Support Vector Machines (SVMs) are a powerful supervised learning algorithm used for both classification and regression tasks. In classification, SVMs aim to find an optimal hyperplane that effectively separates data points into different classes in a high-dimensional space. This hyperplane is determined by maximizing the margin, which is the distance between the hyperplane and the closest data points of each class, known as support vectors. SVMs can handle both linear and nonlinearly separable data through the use of kernel functions, which map the data into a higher-dimensional space where linear separation becomes possible. This flexibility makes SVMs highly effective in various applications, including image recognition, text classification, and bioinformatics.
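A minimal, hedged sketch of a nonlinear SVM classifier with an RBF kernel on a synthetic two-class dataset (the dataset and hyperparameter values are illustrative assumptions):

```python
# Binary classification with an SVM and RBF kernel (assumes scikit-learn).
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Two interleaving half-moons: not linearly separable in the original space.
X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Feature scaling matters for SVMs; the RBF kernel maps the data into a
# higher-dimensional space where a linear separation becomes possible.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X_train, y_train)

print("test accuracy:", clf.score(X_test, y_test))
```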
Multi-Class Classification
- Naive Bayes is a simple yet powerful probabilistic classification algorithm based on Bayes' theorem, assuming independence between features given the class label. Naive Bayes is capable of both binary and multi-class classification. It can handle datasets with two or more class labels.
One-vs-One (OvO) Classification
1. Train a binary classifier for each pair of classes.
2. For a new input, each classifier predicts the more likely class of the pair.
3. The class with the most wins is predicted as the final class.
One-vs-All (OvA) / One-vs-Rest (OvR) Classification
1. Train a binary classifier for each class, treating that class as positive and the rest as negative.
2. For a new input, predict the class with the highest probability from all the binary classifiers.
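A hedged sketch contrasting a natively multi-class model (Gaussian Naive Bayes) with explicit one-vs-rest and one-vs-one wrappers around a binary classifier, on the classic Iris dataset (the choice of base estimator is an illustrative assumption):

```python
# Multi-class strategies sketch (assumes scikit-learn).
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

X, y = load_iris(return_X_y=True)  # 3 classes

models = {
    "Naive Bayes (native multi-class)": GaussianNB(),
    "One-vs-Rest (logistic base)": OneVsRestClassifier(LogisticRegression(max_iter=1000)),
    "One-vs-One (logistic base)": OneVsOneClassifier(LogisticRegression(max_iter=1000)),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```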
In summary, binary classification deals with two classes, while multi-class classification handles more than two.
Clustering Data into Groups
Clustering is an unsupervised machine learning technique used to group similar data points together. It is a powerful tool for discovering patterns and relationships within data that might not be immediately apparent. Types of Clustering Algorithms:
1. Partitioning Clustering:
2. Hierarchical Clustering:
3. Density-Based Clustering:
4. Distribution-Based Clustering:
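Of the families above, partitioning methods such as k-means are often the first choice. Below is a minimal, hedged k-means sketch on synthetic blob data (the dataset and number of clusters are illustrative assumptions that you would normally validate, for example with silhouette scores):

```python
# Partitioning clustering with k-means (assumes scikit-learn).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data with 3 "true" groups (illustrative only).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=7)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=7)
labels = kmeans.fit_predict(X)             # unsupervised: no target labels used

print("cluster sizes:", [int((labels == k).sum()) for k in range(3)])
print("silhouette score:", silhouette_score(X, labels))  # higher is better
```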
Choosing the right clustering algorithm
The best clustering algorithm depends on the specific characteristics of the data and the desired outcome. Consider factors such as the size of the dataset, the expected cluster shapes, the presence of noise and outliers, and whether the number of clusters is known in advance.
By carefully considering these factors, you can select the appropriate clustering algorithm for your specific application.
In each scenario, you will also need to consider factors such as data availability, interpretability, computational resources, and model complexity. It often helps to experiment with multiple approaches and evaluate them based on performance metrics relevant to your specific problem.
ML Exploration Workflow
Machine Learning Exploration Workflow. Diagram: Google
Understanding a model's problem-solving capabilities, process, inputs, and outputs is essential before selecting your ML model. An applicable machine learning model depends on your problem and objectives. Machine learning approaches are deployed where it is highly complex or infeasible to develop conventional algorithms to perform needed tasks or solve problems. Machine learning models are utilized in many domains, such as advertising, agriculture, communication, computer vision, customer services, finance, gaming, investing, marketing, medicine, robotics, security, visualization, and weather.
Range of Business/Machine Learning Algorithms. Mind map: GEEKSFORGEEKS
Choosing an applicable metric for evaluating machine learning models depends on the problem and objectives. From a business perspective, two of the most significant measurements are accuracy and interpretability. Accuracy measures how reliable the conclusion is, while interpretability (reasoning) measures how well the model enables understanding of the justification and reasoning behind the conclusion.
Evaluating the accuracy of a machine learning model is critical in selecting and deploying a machine learning model. Choosing the right accuracy metric for evaluating your machine learning model depends on your problem solution objectives and datasets. Before choosing one, it is important to understand the business problem context, the pros and cons, and the usefulness of each error metric.
Chart by Alvira Swalin via “Choosing the Right Metric for Evaluating Machine Learning Models — Part 1" [2] and “Choosing the Right Metric for Evaluating Machine Learning Models — Part 2" [3]
The chart above captures and categorizes useful metrics for evaluating machine learning models for a variety of machine learning algorithms, computational methods, and techniques.
Measuring a binary output prediction (classification), for instance, is captured in a specific table layout, a confusion matrix, which visualizes whether a model is confusing two classes. Each row of the matrix represents the instances in an actual class, while each column represents the instances in a predicted class. Four measures are captured: True Positive, False Negative, False Positive, and True Negative.
Accuracy is derived from the four values in a confusion matrix. Additional metrics, with formulas on the right and below, are classification evaluation metrics. These metrics include but are not limited to the following: Sensitivity, Specificity, Accuracy, Negative Predictive Value, and Precision.
Confusion Matrix and Classification Evaluation Metrics. Table: Maninder Virk
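A minimal sketch of computing a confusion matrix and the derived classification metrics with scikit-learn (the actual and predicted labels below are hard-coded illustrative values):

```python
# Confusion matrix and derived metrics (assumes scikit-learn).
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # actual classes (illustrative)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # model predictions (illustrative)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, FN, FP, TN:", tp, fn, fp, tn)

print("Accuracy   :", accuracy_score(y_true, y_pred))   # (TP+TN)/(TP+TN+FP+FN)
print("Precision  :", precision_score(y_true, y_pred))  # TP/(TP+FP)
print("Sensitivity:", recall_score(y_true, y_pred))     # recall = TP/(TP+FN)
print("Specificity:", tn / (tn + fp))                   # TN/(TN+FP)
```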
An accurate classification model correctly distinguishes positive cases from negative ones.
On the other hand, measuring interpretability (reasoning) is a more complex task because there is neither a universally agreed-upon definition nor an objective quantitative measure. In general, opaque computational methods obtain higher accuracies than transparent ones. Some computational methods produce an interpretable predictive model, such as a post hoc interpretable model or an intrinsically interpretable algorithm. One measure of interpretability, based on the “triptych predictivity, stability, and simplicity,” is proposed by Vincent Margot in “How to measure interpretability?" [4]. [Interpretability: “Seeing Machines Learn”]
Chart by Sharayu Rane via “The balance: Accuracy vs. Interpretability" [5]
The chart “The balance: Accuracy vs. Interpretability” sorts out the trade-off between accuracy and interpretability (reasoning) for a variety of machine learning algorithms, computational methods, and techniques. [Accuracy: The Bias-Variance Trade-off]
Overall, selecting a machine learning technique depends on your problem, objectives, and data. As we mentioned above, there are four major categories, four major algorithm types, and two major techniques. The chart at the top “Which Machine Learning (ML) to choose?” guides you through the major categories, data types, and objectives of which algorithm types or techniques to choose. The chart below extends to additional horizontal ML techniques such as attribute and row importance, feature extraction, and anomaly detection.
Machine Learning Techniques. Chart: Data Science School
Ensemble Methods
Ensemble methods are powerful techniques in machine learning that combine multiple models to improve predictive performance. By harnessing the strengths of diverse models, ensembles can often outperform individual models.
Ensemble Methods. Diagrams: Neri Van Otten
Bagging (Bootstrap Aggregating)
1. Create multiple subsets of the training data through bootstrapping.
2. Train a base model on each subset.
3. Combine predictions from all models, often by averaging or voting.
- Random Forest: A powerful ML algorithm that combines the output of multiple decision trees and mitigates the drawbacks of individual decision trees to make accurate predictions. It works by creating a multitude of decision trees during training, each trained on a random subset of the data and features. By combining multiple trees, Random Forest effectively balances between bias and variance resulting in models that are both accurate and robust. This ensemble approach mitigates underfitting, reduces overfitting, and improves generalization performance. Random Forest is versatile and can handle both classification and regression problems, making it a popular choice for various applications.
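A minimal bagging/Random Forest sketch with scikit-learn (the dataset and hyperparameters are illustrative assumptions):

```python
# Random Forest: bagging over decision trees (assumes scikit-learn).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Each of the 200 trees is trained on a bootstrap sample and a random feature subset,
# and the forest combines their votes into a single prediction.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=1)
rf.fit(X_train, y_train)

print("test accuracy:", rf.score(X_test, y_test))
# Feature importances hint at which inputs drive the ensemble's votes.
print("largest feature importance:", rf.feature_importances_.max())
```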
Boosting
1. Train a base model on the entire dataset.
2. Assign weights to data points based on their classification accuracy.
3. Train subsequent models, giving more weight to misclassified data points.
4. Combine predictions using weighted voting.
- AdaBoost (Adaptive Boosting): An ensemble learning method that combines multiple weak learners (e.g., decision trees) to create a strong classifier. It focuses on misclassified samples, assigning higher weights to them in subsequent iterations. Prone to overfitting if not carefully tuned, especially with a large number of iterations or weak learners. Generally reduces bias by sequentially adding weak learners to correct errors, but can potentially increase variance if not tuned properly.
- CatBoost (Categorical Boosting): A gradient boosting algorithm specifically designed to handle categorical features effectively. It uses a novel technique called "ordered boosting" to improve accuracy and efficiency. Generally robust to overfitting due to its regularization techniques and categorical feature handling. Reduces both bias and variance due to its effective handling of categorical features and regularization techniques.
- Gradient Boosting: A general machine learning ensemble method that produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. Primarily reduces bias by sequentially adding weak learners to minimize a loss function. Can overfit if the model complexity is too high or the number of iterations is excessive.
- XGBoost (Extreme Gradient Boosting): An optimized distributed gradient boosting library that is highly efficient, flexible, and portable. It implements machine learning algorithms under the Gradient Boosting framework, often outperforming other boosting algorithms in terms of accuracy and speed. Reduces both bias and variance through regularization techniques, optimization algorithms, and careful model tuning. Less prone to overfitting due to its regularization techniques, early stopping, and efficient optimization algorithms. However, it can still overfit if not tuned properly.
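A hedged sketch of gradient boosting using scikit-learn's built-in implementation; XGBoost, CatBoost, and AdaBoost expose very similar fit/predict interfaces, and the hyperparameters below are illustrative assumptions:

```python
# Gradient boosting sketch (assumes scikit-learn; XGBoost/CatBoost would be analogous).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

# Trees are added sequentially; each new tree fits the errors of the current ensemble.
gbm = GradientBoostingClassifier(
    n_estimators=300,      # number of boosting stages
    learning_rate=0.05,    # shrinkage: smaller values reduce overfitting but need more trees
    max_depth=3,           # weak learners are kept shallow
    random_state=2,
)
gbm.fit(X_train, y_train)
print("test accuracy:", gbm.score(X_test, y_test))
```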
Stacking (Stacked Generalization)
1. Train multiple base models on the training data.
2. Use the base models to make predictions on a holdout set.
3. Use the predictions from the base models as features for a meta-model.
4. Train the meta-model to make final predictions.
Meta-Learner: The optimal choice of a meta-learner depends on the specific problem, the complexity of the relationships between the base models' predictions, and the desired level of interpretability. Experimentation with different meta-learners, such as linear or logistic regression, decision trees, neural networks, or Support Vector Machines (SVMs), is often necessary to find the best performing model.
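A minimal stacking sketch with scikit-learn, using two different base models and a logistic regression meta-learner (the specific estimators are illustrative assumptions):

```python
# Stacked generalization sketch (assumes scikit-learn).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3)

base_models = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=3)),
    ("svm", SVC(probability=True, random_state=3)),
]

# The base models' out-of-fold predictions become features for the meta-learner.
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression(max_iter=1000),
                           cv=5)
stack.fit(X_train, y_train)
print("stacked test accuracy:", stack.score(X_test, y_test))
```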
Cascading
Key Differences of Ensemble Methods. Table: Gemini
Choosing the right ensemble method depends on the specific problem, dataset, and desired performance.
Ensemble Methods Comparison. Table: Neri Van Otten
If you have multiple ML models with similar accuracy, precision, recall, or other metrics, you can create a majority vote classifier, weighted voting, or stacking ensemble with a meta-model that utilizes all models, especially when the models are different in nature. Combining multiple high-performing models through ensemble techniques is a powerful strategy to enhance predictive accuracy and robustness.
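A hedged sketch of majority-vote and weighted-vote ensembles over dissimilar models (the estimators and weights are illustrative assumptions):

```python
# Voting ensemble sketch (assumes scikit-learn).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)

estimators = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("nb", GaussianNB()),
    ("dt", DecisionTreeClassifier(max_depth=5, random_state=4)),
]

hard_vote = VotingClassifier(estimators, voting="hard")                      # majority vote
soft_vote = VotingClassifier(estimators, voting="soft", weights=[2, 1, 1])   # weighted probabilities

for name, clf in [("majority vote", hard_vote), ("weighted soft vote", soft_vote)]:
    clf.fit(X_train, y_train)
    print(name, "accuracy:", clf.score(X_test, y_test))
```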
Reinforcement Learning (RL)
Reinforcement Learning (RL) offers various approaches to solve problems where an agent learns to make decisions by interacting with an environment. The primary computational methods can be categorized into:
Reinforcement Learning (RL) Agent Taxonomy. Diagram adapted: Pratap Dangeti
Model-Based Reinforcement Learning
1. Learn a transition model: P(s'|s, a)
2. Learn a reward model: R(s, a)
3. Use planning algorithms (e.g., dynamic programming, search) to find optimal actions based on the learned model.
Policy-Based Reinforcement Learning
Value-Based Reinforcement Learning
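To give the value-based approach some concreteness, here is a minimal, hedged sketch of tabular Q-learning on a tiny made-up chain environment (the environment, rewards, and hyperparameters are illustrative assumptions, not a standard benchmark):

```python
# Tabular Q-learning sketch: a value-based RL method (NumPy only).
import numpy as np

n_states, n_actions = 5, 2      # toy chain: action 0 moves left, action 1 moves right
goal = n_states - 1             # reward of +1 only when the goal state is reached
alpha, gamma, epsilon = 0.1, 0.9, 0.1

Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

def step(state, action):
    """Illustrative deterministic transition: moving right approaches the goal."""
    next_state = min(state + 1, goal) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == goal else 0.0
    return next_state, reward, next_state == goal

for _ in range(500):                         # episodes
    state, done = 0, False
    while not done:
        # Epsilon-greedy action selection from the learned Q-values.
        action = rng.integers(n_actions) if rng.random() < epsilon else int(Q[state].argmax())
        next_state, reward, done = step(state, action)
        # Q-learning update: move Q toward the bootstrapped target value.
        target = reward + gamma * Q[next_state].max() * (not done)
        Q[state, action] += alpha * (target - Q[state, action])
        state = next_state

print("Greedy policy (0=left, 1=right):", Q.argmax(axis=1))
```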
Actor-Critic Reinforcement Learning
- Actor: Learns a policy using policy gradients.
- Critic: Learns a value function to estimate the expected return.
- The actor improves its policy based on the critic's evaluation.
Key Differences of RL Primary Computational Methods. Table: Gemini
Choosing the right computational method depends on the specific problem and environment.
Time Series Components. Chart: Nirmal Gaud
"Time series is a ML technique that forecasts target value based solely on a known history of target values. It is a specialized form of regression known in the literature as auto-regressive modeling. The input to time series analysis is a sequence of target values." [Oracle]
Time Series Forecasting
Time series analysis comprises methods for analyzing time series data to extract meaningful statistics and predictive characteristics. Time series regression, with autoregressive dynamics, is a statistical method for predicting a future response based on the response history.
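A minimal sketch of autoregressive forecasting: fit a linear model on lagged target values and predict the next point (the synthetic series and lag order are illustrative assumptions):

```python
# Autoregressive (AR) forecasting sketch using lagged values (NumPy only).
import numpy as np

rng = np.random.default_rng(5)
t = np.arange(200)
# Synthetic series: trend + seasonality + noise (illustrative only).
series = 10 + 0.05 * t + 2 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 0.5, size=t.size)

p = 3  # lag order: predict y[t] from y[t-1], y[t-2], y[t-3]
X = np.column_stack([series[p - k - 1 : len(series) - k - 1] for k in range(p)])
X = np.column_stack([np.ones(len(X)), X])      # add an intercept term
y = series[p:]

coefs, *_ = np.linalg.lstsq(X, y, rcond=None)  # ordinary least squares fit

# One-step-ahead forecast from the last p observations.
last_lags = np.concatenate(([1.0], series[-1 : -p - 1 : -1]))
print("next-value forecast:", last_lags @ coefs)
```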
Categorized ML Algorithms. Mind map: Gina Acosta Gutiérrez
After choosing your ML scenario, your next step is to choose your ML algorithm. To do so, you can utilize the categorized ML algorithms diagram, a partial list of ML and data mining algorithms organized into a hierarchical tree of ML algorithm categories.
Your data type is a critical success factor when selecting your ML algorithm. For example, tree-based models outperform deep learning on typical tabular data. An experimental in-depth analysis of ML algorithms on tabular datasets with both categorical and numerical features, by Léo Grinsztajn et al., provided empirical results and insights into the reasons:
"1. Neural networks are biased to overly smooth solutions
2. Neural networks are more impacted by uninformative features
3. Data is non-invariant by rotation, so should be learning procedures"
Benchmark on medium-sized datasets. Graphs: Léo Grinsztajn et al.
Also, on the one hand, deep learning models are notorious for requiring extensive hyperparameter optimization. On the other hand, tree-based models (e.g., XGBoost) are simpler algorithms, easier to tune, and the best performers on tabular data.
At a higher level, there are six archetypical analysis methods: Descriptive, Exploratory, Inference, Predictive, Prescriptive, and Causality. These analysis methods are defined as:
Six Archetypical Analyses. Chart: Visual Science Informatics, LLC
Each archetypical analysis method aims to answer different questions. The higher the complexity of the analyses (in terms of knowledge, cost, and time), the more valuable the answer output of the analytic method. [Complexity: Time, Space, & Sample]
The Value of Analytics Methods. Chart: Visual Science Informatics, LLC
Establishing learning goals and objectives is significant. Organizing objectives helps to clarify them.
"Bloom's taxonomy is a set of three hierarchical models used for the classification of educational learning objectives into levels of complexity and specificity. The three lists cover the learning objectives in the cognitive, affective, and psychomotor domains."
Bloom's Revised Taxonomy. Diagram: Jessica Shabatura, UARK
There are six levels of cognitive learning according to the revised version of Bloom's Taxonomy. Each level is conceptually different. The six levels are remembering, understanding, applying, analyzing, evaluating, and creating. The new terms are defined as:
This Bloom's taxonomy was adapted for machine learning.
Bloom’s Taxonomy Adapted for Machine Learning (ML). Chart: Visual Science Informatics, LLC
There are six levels of model learning in the adapted version of Bloom's Taxonomy for ML. Each level is a conceptually different learning model. The levels are ordered from lower-order learning to higher-order learning. The six levels are Store, Sort, Search, Descriptive, Discriminative, and Generative. The terms of Bloom's Taxonomy adapted for ML are defined as:
Neural Networks (NNs)
A Neural Network (NN) is a series of algorithms inspired by the structure and function of the human brain. Neural networks are used for a variety of tasks, including image recognition, speech recognition, and natural language processing.
Neural Networks have high predictive power, but have low interpretability because the nature of neural networks is a black box where the inner working of deep networks is not fully explainable.
“An artificial neuron simply hosts the mathematical computations. Like our neurons, it triggers when it encounters sufficient stimuli. The neuron combines input from the data with a set of coefficients, or weights, which either amplify or dampen that input, which thereby assigns significance to inputs for the task the algorithm is trying to learn". [Anddy Cabrera]
Neural Networks learn by adjusting the weights of the connections between neurons. The weights determine how much influence one neuron has on another. By adjusting the weights, a neural network can learn to perform a specific task.
Neural Networks' Architectures: ANN, RNN, LSTM & CNN. Diagrams: A. Catherine Cabrera, and B. InterviewBit
"Neural Network Standard Components:
Backpropagation is a fundamental algorithm used to train artificial neural networks. It is essentially a computational method for calculating the gradient of the error function with respect to the network's weights. In simpler terms, it helps the network learn from its mistakes by adjusting its parameters to minimize the error between its predicted output and the actual output. The biggest development advance in neural networks between 1987 and 1993 was a wide adaptation of the backpropagation algorithm. This algorithm provided an efficient method for training multi-layer neural networks, allowing them to learn complex patterns and relationships in data. It was a significant breakthrough that revitalized interest in neural networks and paved the way for their subsequent applications in various fields.
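A minimal, hedged sketch of backpropagation for a tiny one-hidden-layer network trained on XOR, written in plain NumPy (the architecture, learning rate, and iteration count are illustrative assumptions):

```python
# Backpropagation sketch: one hidden layer, sigmoid activations, trained on XOR (NumPy only).
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)   # XOR targets

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Randomly initialized weights and biases: 2 inputs -> 4 hidden units -> 1 output.
W1, b1 = rng.normal(0, 1, (2, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(0, 1, (4, 1)), np.zeros((1, 1))
lr = 0.5

for epoch in range(5000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backward pass: gradients of the squared error with respect to each layer.
    d_out = (out - y) * out * (1 - out)          # error signal at the output layer
    d_h = (d_out @ W2.T) * h * (1 - h)           # error propagated back to the hidden layer

    # Gradient descent step: adjust weights to reduce the error.
    W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ d_h;    b1 -= lr * d_h.sum(axis=0, keepdims=True)

print("predictions:", out.round(3).ravel())      # should approach [0, 1, 1, 0]
```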
Different neural networks have distinct architectures tailored to their functions and strengths. Here are descriptions of the major neural network architectures:
Here is a table summarizing the key differences:
Key Differences of Neural Networks' Architectures. Table: Gemini
Deep Neural Networks (DNNs) are trained using large sets of labeled or unlabeled data and increasingly learn abstract features directly from the data without manual feature extraction. Traditional neural networks may contain around 2-3 hidden layers, while deep networks can have as many as 100-200 hidden layers.
The Neural Network Zoo. Node Maps: Van Veen, F. & Leijnen, S. (2019). The Asimov Institute
Note that "Node Maps" have limitations in portraying the nuances of deep learning models. There are numerous differences in the usage scenarios, scalabilities, restrictions, and mitigations (decaying, vanishing, and exploding information).?Additional functionality could be?to preprocess, encode, or decode?information, parallel competitive learning, predicting, and generating, or un-black-box.?Additionally, the differences are in inputs: data, feedback, and noise, connectivity: past, present, future, random, reversed, stacked, and extra, and states: activations, triggers, stateless, memory, probabilistic, and pooling multiple weights as a vector.
Also, there are numerous more special networks, layers, and operations such as transformers, latent diffusion models, inception, features pyramid networks, etc.
Deep Belief Networks (DBNs) are a type of deep learning architecture used for unsupervised learning tasks. They can be thought of as building blocks for more complex neural networks. Here is a breakdown of how they work:
- Building Blocks: Restricted Boltzmann Machines (RBMs)
- Stacked for Learning: The Deep Belief Network
- Unsupervised Training
- Applications of Deep Belief Networks
- Some limitations of DBNs include:
Overall, Deep Belief Networks are a powerful tool for unsupervised feature learning and can be a valuable component in building more complex deep learning architectures.
A Generative Adversarial Network (GAN) is a type of deep learning system that uses two neural networks to compete against each other. Here is a breakdown of how it works:
- Generator: This network creates new data, such as images or music, based on the data it has been trained on.
- Discriminator: This network tries to tell the difference between the new data created by the generator and real data from the training set.
Conditional Generative Adversarial Network Model Architecture. Chart: Jason Brownlee
Another decision point in choosing a machine learning model is the difference between discriminative, predictive, and generative models. A discriminative approach focuses on a solution and performs better for classification tasks by dividing the data space into classes by learning the boundaries. A predictive approach relies on historical data, statistical modeling, and machine learning algorithms to forecast future trends, outcomes, or behaviors for making informed guesses about what might happen next. A generative model approach understands how data is embedded throughout space and generates new data points.
Discriminative vs. Generative. Table: Supervised Learning Cheatsheet
Generative AI focuses on creating new content, such as images, text, or music, by learning patterns from existing data. It often employs complex models such as GANs and transformers to generate highly creative and realistic outputs. On the other hand, Predictive AI is designed to analyze historical data to forecast future trends and outcomes. It utilizes techniques such as regression, decision trees, and neural networks to identify patterns and make predictions. While Generative AI excels at creativity and innovation, Predictive AI is invaluable for decision-making and risk assessment.
Generative vs. Predictive. Table: Gemini
The table above compares Generative AI vs. Predictive AI employing the "Architectural Blueprints—The “4+1” View Model of Machine Learning" and the "Data Science Approaches to Data Quality: From Raw Data to Datasets" as an architectural evaluation framework.
Activation Function in NNs
An Artificial Neuron in Action. Animation: Anddy Cabrera
Non-Linear Activation Functions. Graphs: Nikita Prasad
Activation functions play a crucial role in neural networks by introducing non-linearity, enabling them to learn complex patterns. The evolution of activation functions has been closely tied to the development of neural network architectures and training algorithms. Early activation functions were Sigmoid and Tanh, but since then there are improved activation functions, which are listed in the table below.
Activation Function Comparison. Table: Gemini
Note: This table provides a brief overview of common activation functions. The choice of activation function often depends on the specific task and architecture of the neural network.
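For reference, here is a small NumPy sketch of several common activation functions (GELU uses the widely used tanh approximation; the formulas are standard, but treat the exact constants as illustrative):

```python
# Common activation functions (NumPy only).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))             # squashes to (0, 1); can saturate

def tanh(x):
    return np.tanh(x)                            # squashes to (-1, 1); zero-centered

def relu(x):
    return np.maximum(0.0, x)                    # cheap, sparse, no saturation for x > 0

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)         # small slope for x < 0 keeps gradients alive

def gelu(x):
    # Tanh approximation of GELU, common in transformer models.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def swish(x, beta=1.0):
    return x * sigmoid(beta * x)                 # smooth and non-monotonic

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for fn in (sigmoid, tanh, relu, leaky_relu, gelu, swish):
    print(f"{fn.__name__:>10}: {np.round(fn(x), 3)}")
```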
Factors driving evolution
For example, ReLU is a common choice for many deep learning architectures; it helps to alleviate the vanishing gradient problem in deep networks, leading to improved performance. More specialized functions such as GELU and Swish may be better suited for certain tasks such as translation, text classification, and natural language processing.
However, it is important to note that other activation functions and their variants can also perform well in many cases. The best activation function for a given task may need to be determined through experimentation and evaluation.
If you are considering using a specific activation function in your own projects, it is recommended to try it alongside other activation functions and evaluate its performance on your specific dataset.
Best practices for neural network training
softmax Function Layer
A softmax function layer can be used in the output layer of neural networks for multi-class classification problems. It takes a vector of real numbers as input and transforms it into a probability distribution over the possible classes.
1. Input: The softmax layer receives a vector of raw scores, one for each class, often the output of a preceding layer in the neural network.
2. Exponentiation: Each raw score is exponentiated to ensure positive values.
3. Normalization: The exponentiated values are divided by their sum, scaling them to a probability distribution.
4. Probability Distribution Output: The resulting values represent the probability of the input belonging to each class.
5. Prediction: The class with the highest probability is selected as the predicted class.
softmax equation. Google
In summary, softmax is a function that transforms raw output scores into probabilities for multi-class classification. It ensures that the probabilities sum to 1, making the model's predictions interpretable and comparable.
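A minimal NumPy sketch of a numerically stable softmax, following the steps above (the example scores are illustrative):

```python
# Softmax: raw class scores -> probability distribution (NumPy only).
import numpy as np

def softmax(scores):
    # Subtracting the max is a standard trick for numerical stability;
    # it does not change the result because softmax is shift-invariant.
    shifted = scores - np.max(scores)
    exp_scores = np.exp(shifted)          # step 2: exponentiation
    return exp_scores / exp_scores.sum()  # step 3: normalization

raw_scores = np.array([2.0, 1.0, 0.1])    # illustrative logits for 3 classes
probs = softmax(raw_scores)

print("probabilities:", np.round(probs, 3))   # sums to 1
print("predicted class:", int(np.argmax(probs)))
```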
Another aspect of the ML exploration workflow is MLOps (Machine Learning Operations). MLOps is a set of practices that aims to deploy and maintain machine learning models in production reliably and efficiently. It involves a combination of software engineering and continuous machine learning best practices to streamline the entire ML lifecycle, from data ingestion and model training to deployment and monitoring. [Operations: MLOps, Continuous ML, & AutoML]
ML Algorithms Cheat Sheet. Diagram: SAS
In conclusion, choosing a Machine Learning (ML) approach depends on multiple complex factors and challenging trade-offs. You will need to consider at least four competing architectural factors: Accuracy, Complexity, Interpretability, and Operations. Selecting a machine learning approach that balances all decision factors is important because the capital investment in the processing pipeline stages is costly and requires considerable time and effort. Therefore, it is highly valuable to employ a rigorous process in choosing machine learning. [29]
Next, read the "Accuracy: The Bias-Variance Trade-off" article at https://www.dhirubhai.net/pulse/accuracy-bias-variance-tradeoff-yair-rajwan-ms-dsc.