Scenarios: Which Machine Learning (ML) to choose?

Inspired by “Which chart to choose?" [1], which helps you choose the right chart for your data, we developed the idea to chart “Which Machine Learning (ML) to choose?”

Before we present the flowchart of “Which Machine Learning (ML) to choose?”, part of the "Architectural Blueprints—The “4+1” View Model of Machine Learning," let us look at the big picture and zoom in on the steps in which this flowchart can guide your selection of a machine learning approach to solve a business problem.

To solve a problem and find its solution, you can follow these steps:

  1. Your strategy can be to select Artificial Intelligence (AI) as your conceptual framework.
  2. One of the viable approaches within AI is Machine Learning (ML).
  3. After formulating the problem and exploring feasible data acquisition, part of your methodology is choosing a logical learning paradigm.
  4. Then you can identify the available data type and define an objective. The logical learning paradigm, data type, and objectives are the criteria for selecting a physical learning method.
  5. The next step is to follow a workflow procedure.
  6. This workflow procedure can be customized for specific techniques.
  7. Finally, you can select a machine learning algorithm.

Good data quality is a necessary prerequisite to building an accurate ML model. [Data Science Approaches to Data Quality: From Raw Data to Datasets]

The processing pipeline should include at least the following stages (a minimal code sketch follows the list):

a) Data preprocessing and preparation

b) Dataset sampling for training and validation

c) Model training, validation, and evaluation

d) Prediction model deployment

e) Production model monitoring, feedback, and retraining
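
As a minimal illustration of stages (a) through (e), here is a hedged sketch using scikit-learn; the dataset, model, and file name are illustrative assumptions rather than recommendations, and monitoring is represented only by a comment.

```python
# Minimal pipeline sketch covering stages (a)-(e); assumes scikit-learn and joblib.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import joblib

X, y = load_breast_cancer(return_X_y=True)

# (a) Data preprocessing and preparation (here, feature scaling inside the pipeline).
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# (b) Dataset sampling for training and validation.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# (c) Model training, validation, and evaluation.
pipeline.fit(X_train, y_train)
print("Validation accuracy:", accuracy_score(y_val, pipeline.predict(X_val)))

# (d) Prediction model deployment (here, simply persisting the fitted pipeline).
joblib.dump(pipeline, "model.joblib")

# (e) Production monitoring, feedback, and retraining would repeat (a)-(d) on fresh data.
```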

Which Machine Learning (ML) to choose? Chart: Visual Science Informatics, LLC

Selecting a logical learning paradigm or a computational method involves four major categories, four major algorithm types, and two major techniques. The four major categories are supervised, semi-supervised, unsupervised, and reinforcement learning. The four major algorithm types are classification, regression, associations, and clustering. The two techniques are ensemble methods and reward feedback. The chart above, “Which Machine Learning (ML) to choose?”, guides you through the major categories, data types, and objectives to determine which algorithm types or techniques to choose.

Choosing the right machine learning (ML) approach depends on various factors related to the problem you are trying to solve, the nature of your data, and the goals of your project. Here are some common scenarios and the types of ML techniques that might be suitable for each:

Predicting a Continuous Value

Regression is a machine learning task where the goal is to predict a continuous numerical value. This is in contrast to classification, where the goal is to predict a categorical label. Types of Regression:

  1. Linear Regression:

  • A simple model that assumes a linear relationship between the independent variables and the dependent variable.
  • Used for predicting numerical values like house prices, stock prices, or sales figures.

2. Polynomial Regression:

  • Extends linear regression to model non-linear relationships by fitting a polynomial curve to the data.
  • Used for predicting values that have a non-linear relationship with the independent variables.

3. Logistic Regression:

  • While often used for binary classification, logistic regression can also be used for predicting continuous values that are bounded between 0 and 1, such as probabilities or proportions.

4. Support Vector Regression (SVR):

  • A regression technique that uses support vectors to create a linear or non-linear regression model.
  • Used for predicting continuous values with outliers or noise in the data.

5. Decision Trees and Random Forests:

  • Can be used for both classification and regression.
  • Decision trees create a series of if-else questions to predict a continuous value.
  • Random forests combine multiple decision trees to improve accuracy and reduce overfitting.
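
To make the trade-off concrete, here is a hedged scikit-learn sketch comparing a linear model with a tree-based ensemble on the same synthetic, non-linear regression task; the data and parameter values are illustrative assumptions.

```python
# Sketch: a linear model vs. a tree-based ensemble on a non-linear regression task.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=500)  # non-linear target with noise

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (LinearRegression(), RandomForestRegressor(n_estimators=200, random_state=0)):
    model.fit(X_train, y_train)
    mae = mean_absolute_error(y_test, model.predict(X_test))
    print(f"{model.__class__.__name__}: MAE = {mae:.3f}")
```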

Classifying Data into Categories

Classification is a machine learning task where the goal is to predict a categorical label or class for a given input. There are two main types of classification: binary and multi-class.

Binary Classification

  • Definition: In binary classification, the model predicts only one of two possible classes.
  • Scenarios: Spam detection (spam or not spam), Sentiment analysis (positive or negative), Credit card fraud detection (fraudulent or legitimate)

Multi-Class Classification

  • Definition: In multi-class classification, the model predicts one of more than two possible classes.
  • Scenarios: Image classification (cat, dog, bird, etc.), Genre classification (rock, pop, jazz, etc.), Language identification

One-vs.-All Classification

  • Method: A strategy for handling multi-class classification problems by training a binary classifier for each class. Each classifier distinguishes that class from all the others. One-vs.-All will independently predict a probability for each possible class.
  • Process:

1. Train a binary classifier for each class, treating that class as positive and the rest as negative.

2. For a new input, predict the class with the highest probability from all the binary classifiers.

Softmax (One-vs.-One Classification)

  • Method: Another strategy for handling multi-class classification problems that directly predicts probabilities for each class. One-vs.-One (multi-class with softmax) is a good choice in cases where the possible classes the model could predict are mutually exclusive.
  • Process:

1. Apply a softmax activation function to the output layer of the neural network.

2. The softmax function converts the raw outputs into probabilities that sum to 1.

3. The class with the highest probability is predicted.

Softmax equation. Image: Google

In summary, binary classification deals with two classes, while multi-class classification handles more than two. One-vs.-All and softmax are common approaches to tackle multi-class classification problems.
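
As a small illustration of both strategies, the sketch below fits a one-vs.-rest classifier with scikit-learn and applies a plain NumPy softmax to a vector of raw scores; the Iris dataset and the score values are illustrative assumptions.

```python
# Sketch: one-vs.-all classification and a softmax over raw class scores.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)

# One-vs.-All: one binary classifier per class; each independently predicts a probability.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print("OvR probabilities (first sample):", ovr.predict_proba(X[:1]).round(3))

# Softmax: converts raw scores into mutually exclusive probabilities that sum to 1.
def softmax(z):
    z = z - np.max(z)        # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

scores = np.array([2.0, 1.0, 0.1])
print("Softmax probabilities:", softmax(scores).round(3))
```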

Clustering Data into Groups

Clustering is an unsupervised machine learning technique used to group similar data points together. It is a powerful tool for discovering patterns and relationships within data that might not be immediately apparent. Types of Clustering Algorithms:

  1. Partitioning Clustering:

  • Divides the dataset into a predefined number of clusters.
  • K-means: One of the most popular algorithms, it divides data into K clusters based on the distance to cluster centroids.
  • K-medoids: Similar to K-means, but uses actual data points as cluster centers instead of centroids.
  • Fuzzy c-means: Allows data points to belong to multiple clusters with varying degrees of membership.

2. Hierarchical Clustering:

  • Creates a hierarchy of clusters, starting with individual data points and merging them into larger clusters.
  • Agglomerative: Starts with each data point as a separate cluster and merges them based on similarity.
  • Divisive: Starts with one large cluster and divides it into smaller clusters.

3. Density-Based Clustering:

  • Identifies clusters based on the density of data points in a region.
  • DBSCAN: Groups densely packed points into clusters and treats points in low-density regions as noise (outliers).
  • OPTICS: Similar to DBSCAN but provides an ordering of data points based on their density.

4. Distribution-Based Clustering:

  • Assumes that the data points belong to different probability distributions.
  • Gaussian Mixture Models: Assumes that the data is generated from a mixture of Gaussian distributions.

Choosing the right clustering algorithm

The best clustering algorithm depends on the specific characteristics of the data and the desired outcome. Consider factors such as:

  • Shape of clusters: Some algorithms are better suited for spherical clusters (e.g., K-means), while others can handle more complex shapes (e.g., DBSCAN).
  • Number of clusters: If you know the approximate number of clusters, partitioning algorithms like K-means might be suitable.
  • Noise: If the data contains noise or outliers, density-based algorithms like DBSCAN can be effective.
  • Computational efficiency: For large datasets, algorithms like K-means might be more efficient than hierarchical clustering.

By carefully considering these factors, you can select the appropriate clustering algorithm for your specific application.
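
For example, the hedged sketch below contrasts a partitioning algorithm (K-means) with a density-based algorithm (DBSCAN) on the same non-spherical synthetic data; the dataset and parameter values are illustrative assumptions.

```python
# Sketch: partitioning (K-means) vs. density-based (DBSCAN) clustering.
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)  # non-spherical clusters

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# K-means assumes roughly spherical clusters and needs K up front;
# DBSCAN discovers arbitrarily shaped clusters and flags outliers as label -1.
print("K-means cluster labels:", set(kmeans_labels))
print("DBSCAN cluster labels :", set(dbscan_labels))
```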

In each scenario, you will also need to consider factors such as data availability, interpretability, computational resources, and model complexity. It often helps to experiment with multiple approaches and evaluate them based on performance metrics relevant to your specific problem.

Machine learning exploration workflow. Diagram: Google

Understanding a model's problem-solving capabilities, process, inputs, and outputs is essential before selecting your ML model. An applicable machine learning model depends on your problem and objectives. Machine learning approaches are deployed where it is highly complex or infeasible to develop conventional algorithms to perform the needed tasks or solve the problem. Machine learning models are utilized in many domains, such as advertising, agriculture, communication, computer vision, customer service, finance, gaming, investing, marketing, medicine, robotics, security, visualization, and weather.

Range of Business/Machine Learning Algorithms. Mind map: GEEKSFORGEEKS

Choosing an applicable metric for evaluating machine learning models depends on the problem and objectives. From a business perspective, two of the most significant measurements are accuracy and interpretability. Accuracy measures how reliable the conclusion is, while interpretability (reasoning) measures how well the model enables understanding of the justification and reasoning behind that conclusion.

Evaluating the accuracy of a machine learning model is critical in selecting and deploying a machine learning model. Choosing the right accuracy metric for evaluating your machine learning model depends on your problem solution objectives and datasets. Before choosing one, it is important to understand the business problem context, the pros and cons, and the usefulness of each error metric.

Choosing the Right Metric for Evaluating Machine Learning Models. Chart: Alvira Swalin, via “Choosing the Right Metric for Evaluating Machine Learning Models — Part 1" [2] and “Choosing the Right Metric for Evaluating Machine Learning Models — Part 2" [3]

The chart above captures and categorizes useful metrics for evaluating machine learning models for a variety of machine learning algorithms, computational methods, and techniques.

Measuring, for instance, a binary output prediction (classification) is captured in a specific table layout, a confusion matrix, which visualizes whether a model is confusing two classes. Each row of the matrix represents the instances in an actual class, while each column represents the instances in a predicted class. Four measures are captured: True Positive, False Negative, False Positive, and True Negative.

Accuracy is derived from the four values in a confusion matrix. Additional metrics, with their formulas shown alongside and below the matrix, are classification evaluation metrics. These metrics include, but are not limited to, the following: Sensitivity, Specificity, Accuracy, Negative Predictive Value, and Precision.

Confusion Matrix and Classification Evaluation Metrics. Table: Maninder Virk

An accurate classification model can correctly distinguish positives from negatives.
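
The sketch below derives these metrics from the four confusion-matrix counts using the standard formulas; the labels are a made-up toy example.

```python
# Sketch: deriving common classification metrics from confusion-matrix counts.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)          # recall / true positive rate
specificity = tn / (tn + fp)          # true negative rate
precision   = tp / (tp + fp)          # positive predictive value
npv         = tn / (tn + fn)          # negative predictive value
accuracy    = (tp + tn) / (tp + tn + fp + fn)

print(f"Sensitivity={sensitivity:.2f}, Specificity={specificity:.2f}, "
      f"Precision={precision:.2f}, NPV={npv:.2f}, Accuracy={accuracy:.2f}")
```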

On the other hand, measuring interpretability (reasoning) is a more complex task because there is neither a universally agreed definition nor an objective quantitative measure. In general, opaque computational methods obtain higher accuracy than transparent ones. There are computational methods that produce an interpretable predictive model, such as a post hoc interpretable model or an intrinsically interpretable algorithm. One measure of interpretability, based on the triptych “predictivity, stability, and simplicity,” is proposed by Vincent Margot in “How to measure interpretability?" [4]. [Interpretability/Explainability: “Seeing Machines Learn”]

The balance: Accuracy vs. Interpretability. Chart: Sharayu Rane, via “The balance: Accuracy vs. Interpretability" [5]

The chart “The balance: Accuracy vs. Interpretability” sorts out the trade-off between accuracy and interpretability (reasoning) for a variety of machine learning algorithms, computational methods, and techniques. [Accuracy: The Bias-Variance Trade-off]

Overall, selecting a machine learning technique depends on your problem, objectives, and data. As we mentioned above, there are four major categories, four major algorithm types, and two major techniques. The chart at the top “Which Machine Learning (ML) to choose?” guides you through the major categories, data types, and objectives of which algorithm types or techniques to choose. The chart below extends to additional horizontal ML techniques such as attribute and row importance, feature extraction, and anomaly detection.

Machine Learning Techniques. Chart: Data Science School

Ensemble methods are powerful techniques in machine learning that combine multiple models to improve predictive performance. By harnessing the strengths of diverse models, ensembles can often outperform individual models.

Ensemble Methods. Diagrams: Neri Van Otten

Bagging (Bootstrap Aggregating)

  • Concept: Trains multiple models on different subsets of the training data created by random sampling with replacement.
  • Process:

1. Create multiple subsets of the training data through bootstrapping.

2. Train a base model on each subset.

3. Combine predictions from all models, often by averaging or voting.

  • Advantages: Reduces variance, improves stability, and can handle both classification and regression problems.
  • Example: Random Forest

Boosting

  • Concept: Sequentially trains multiple models, where each model focuses on correcting the errors of its predecessors.
  • Process:

1. Train a base model on the entire dataset.

2. Assign weights to data points based on their classification accuracy.

3. Train subsequent models, giving more weight to misclassified data points.

4. Combine predictions using weighted voting.

  • Advantages: Typically achieves high accuracy, can handle complex patterns, and is effective for both classification and regression.
  • Examples: Gradient Boosting, AdaBoost

Stacking (Stacked Generalization)

  • Concept: Trains a meta-model to combine the predictions of multiple base models.
  • Process:

1. Train multiple base models on the training data.

2. Use the base models to make predictions on a holdout set.

3. Use the predictions from the base models as features for a meta-model.

4. Train the meta-model to make final predictions.

  • Advantages: Can leverage the strengths of different base models, often achieving higher accuracy than individual models.
  • Challenges: Can be computationally expensive and requires careful tuning of base models and meta-model.

Cascading

  • Concept: Organizes models in a hierarchical structure, where the output of one model is used as input to the next.
  • Process:

  1. Train a base model to filter out easy instances.
  2. Pass difficult instances to the next model in the cascade.
  3. Continue building layers of models until desired performance is achieved.

  • Advantages: Can improve efficiency by focusing computational resources on difficult instances.
  • Challenges: Requires careful design of the cascade structure and can be sensitive to the performance of early-stage models.

Key Differences of Ensemble Methods. Table: Gemini

Choosing the right ensemble method depends on the specific problem, dataset, and desired performance.

Ensemble Methods Comparison. Table: Neri Van Otten

If you have multiple ML models with similar accuracy, precision, recall, or other metrics, you can create a majority vote classifier, weighted voting, or stacking ensemble with a meta-model that utilizes all models, especially when the models are different in nature. Combining multiple high-performing models through ensemble techniques is a powerful strategy to enhance predictive accuracy and robustness.
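
As a hedged illustration, the sketch below combines three base models with soft voting and with stacking using scikit-learn; the dataset, estimators, and parameters are illustrative assumptions rather than a recommendation.

```python
# Sketch: combining similarly performing models with soft voting and stacking.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier, StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
base = [
    ("lr", LogisticRegression(max_iter=5000)),
    ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
    ("svm", SVC(probability=True, random_state=0)),
]

voting = VotingClassifier(estimators=base, voting="soft")            # averages predicted probabilities
stacking = StackingClassifier(estimators=base,
                              final_estimator=LogisticRegression())  # meta-model on base predictions

for name, model in [("Voting", voting), ("Stacking", stacking)]:
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name} ensemble CV accuracy: {score:.3f}")
```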

Reinforcement Learning (RL) offers various approaches to solve problems where an agent learns to make decisions by interacting with an environment. The primary computational methods can be categorized into:

Reinforcement Learning (RL) Agent Taxonomy. Diagram adapted: Pratap Dangeti

Model-Based Reinforcement Learning

  • Concept: The agent learns a model of the environment, predicting how the environment will respond to different actions. This model is then used to plan optimal actions.
  • Process:

1. Learn a transition model: P(s'|s, a)

2. Learn a reward model: R(s, a)

3. Use planning algorithms (e.g., dynamic programming, search) to find optimal actions based on the learned model.

  • Advantages: Can be efficient in environments with well-structured dynamics.
  • Disadvantages: Relies on accurate model learning, which can be challenging in complex environments.

Policy-Based Reinforcement Learning

  • Concept: The agent directly learns a policy, which maps states to actions.
  • Process: The agent learns to improve its policy by adjusting parameters based on the received rewards.
  • Advantages: Can represent complex stochastic policies and often converges to locally optimal solutions.
  • Disadvantages: Can be less sample-efficient than value-based methods and might get stuck in local optima.

Value-Based Reinforcement Learning

  • Concept: The agent learns a value function, which estimates the expected return from a given state or state-action pair.
  • Process: The agent selects actions based on the estimated values and updates the value function based on observed rewards.
  • Advantages: Often sample-efficient and can find globally optimal solutions.
  • Disadvantages: Difficulty in representing complex stochastic policies.

Actor-Critic Reinforcement Learning

  • Concept: Combines the strengths of policy-based and value-based methods.
  • Process:

- Actor: Learns a policy using policy gradients.

- Critic: Learns a value function to estimate the expected return.

- The actor improves its policy based on the critic's evaluation.

  • Advantages: Combines the exploration benefits of policy-based methods with the stability of value-based methods.
  • Disadvantages: Can be more complex to implement than pure policy-based or value-based methods.

Key Differences of RL Primary Computational Methods. Table: Gemini

Choosing the right computational method depends on the specific problem and environment.

  • Model-based: Suitable for environments with well-defined dynamics and where building a model is feasible.
  • Policy-based: Good for complex environments with continuous action spaces or stochastic policies.
  • Value-based: Effective in simpler environments with discrete action spaces and where sample efficiency is crucial.
  • Actor-Critic: Versatile approach that can handle various environments and often provides good performance.
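
To ground the value-based approach, here is a hedged sketch of tabular Q-learning on a tiny, made-up chain environment; the environment, rewards, and hyperparameters are illustrative assumptions.

```python
# Sketch: tabular Q-learning (a value-based method) on a toy chain environment.
import numpy as np

n_states, n_actions = 5, 2          # actions: 0 = move left, 1 = move right
q_table = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.1

def step(state, action):
    """Move along a chain; reaching the last state yields reward 1 and ends the episode."""
    next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == n_states - 1 else 0.0
    done = next_state == n_states - 1
    return next_state, reward, done

rng = np.random.default_rng(0)
for episode in range(500):
    state, done = 0, False
    while not done:
        # Epsilon-greedy action selection: mostly exploit, occasionally explore.
        action = rng.integers(n_actions) if rng.random() < epsilon else int(q_table[state].argmax())
        next_state, reward, done = step(state, action)
        # Q-learning update: move Q(s, a) toward reward + gamma * max_a' Q(s', a').
        q_table[state, action] += alpha * (reward + gamma * q_table[next_state].max()
                                           - q_table[state, action])
        state = next_state

print("Learned Q-values:\n", q_table.round(2))
```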

"Time series is a ML technique that forecasts target value based solely on a known history of target values. It is a specialized form of regression known in the literature as auto-regressive modeling. The input to time series analysis is a sequence of target values." [Oracle]

Time Series Components. Chart: Nirmal Gaud

Time Series Forecasting

  • Scenario: You need to forecast future sales based on historical data. Suitable techniques include:

  • ARIMA (Auto Regressive Integrated Moving Average): Traditional statistical method for time series forecasting.
  • Exponential Smoothing (ETS): Useful for data with trends and seasonality.
  • Recurrent Neural Networks (RNN): Especially Long Short-Term Memory (LSTM) networks for capturing long-term dependencies in time series.
  • Prophet: Developed by Facebook, good for handling seasonality and holidays.

Time series analysis comprises methods for analyzing time series data to extract meaningful statistics and other characteristics of the data. Time series regression, with autoregressive dynamics, is a statistical method for predicting a future response based on the response history.
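
As a brief illustration, the sketch below fits a classical ARIMA model to a synthetic monthly sales series and forecasts six months ahead, assuming the statsmodels package is available; the series and the (1, 1, 1) order are illustrative assumptions.

```python
# Sketch: ARIMA forecast on a synthetic monthly sales series (statsmodels).
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Synthetic sales history: trend + yearly seasonality + noise.
idx = pd.date_range("2020-01-01", periods=48, freq="MS")
rng = np.random.default_rng(0)
sales = pd.Series(100 + np.arange(48) * 2
                  + 10 * np.sin(np.arange(48) * 2 * np.pi / 12)
                  + rng.normal(0, 3, 48), index=idx)

model = ARIMA(sales, order=(1, 1, 1)).fit()
forecast = model.forecast(steps=6)     # forecast the next six months
print(forecast.round(1))
```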

Categorized ML Algorithms. Mind map: Gina Acosta Gutiérrez

After choosing your ML scenario, your next step is to choose your ML algorithm. To do so, you can utilize the categorized ML algorithms mind map above, a partial list of ML and data mining algorithms organized into a hierarchical tree of algorithm categories.

Your data type is a critical success factor when selecting your ML algorithm. For example, tree-based models outperform deep learning on typical tabular data. An experimental in-depth analysis of ML algorithms on tabular datasets with both categorical and numerical features, by Léo Grinsztajn et al., provided empirical results and insights into the reasons:

"1. Neural networks are biased to overly smooth solutions

2. Neural networks are more impacted by uninformative features

3. Data is non-invariant by rotation, so should be learning procedures"

Benchmark on medium-sized datasets. Graphs: Léo Grinsztajn et al.

Also, on the one hand, deep learning models are notorious for requiring extensive hyperparameter optimization. On the other hand, tree-based models (e.g., XGBoost) are simpler algorithms, easier to tune, and the best performers on tabular data.
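
For a rough, hedged comparison in that spirit (not a reproduction of the benchmark above), the sketch below cross-validates a tree-based gradient boosting model against a small multilayer perceptron on a public tabular dataset; the dataset and settings are illustrative assumptions.

```python
# Sketch: tree-based ensemble vs. neural network on a tabular regression dataset.
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

X, y = fetch_california_housing(return_X_y=True)

models = {
    "Gradient boosting (tree-based)": HistGradientBoostingRegressor(random_state=0),
    "MLP (neural network)": make_pipeline(StandardScaler(),
                                          MLPRegressor(hidden_layer_sizes=(64, 64),
                                                       max_iter=500, random_state=0)),
}
for name, model in models.items():
    r2 = cross_val_score(model, X, y, cv=3).mean()   # default scoring: R^2
    print(f"{name}: mean R^2 = {r2:.3f}")
```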

At a higher level, there are six archetypical analysis methods: Descriptive, Exploratory, Inference, Predictive, Prescriptive, and Causality. These analysis methods are defined as:

Six Archetypical Analyses. Chart: Visual Science Informatics, LLC

  • Descriptive statistics is the discipline of quantitatively describing the main features of a data collection.
  • Exploratory Data Analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics in an easy-to-understand form, often with visual graphs and dynamic visualization capabilities, without using a statistical model or having formulated a hypothesis.
  • Inference is the process of drawing conclusions from data that is subject to random variation.
  • Predictive analytics analyzes historical facts to forecast future trends, behavior patterns, or unknown events.
  • Prescriptive analytics synthesizes big data, past performance, mathematical sciences, business rules, and machine learning to suggest decision options to take advantage of a probable future outcome of an event or a likelihood of a situation occurring.
  • Causality (causation) is the relationship between an event or a set of factors (the cause) and a second event or phenomenon (the effect), where the second event is understood as a consequence of the first.

Each archetypical analysis method aims to answer different questions. The higher the complexity of the analyses (in terms of knowledge, cost, and time), the more valuable the answer output of the analytic method. [Complexity: Time, Space, & Sample]

The Value of Analytics Methods. Chart: Visual Science Informatics, LLC

It is important to establish learning goals and objectives. Organizing objectives helps to clarify them.

"Bloom's taxonomy is a set of three hierarchical models used for the classification of educational learning objectives into levels of complexity and specificity. The three lists cover the learning objectives in the cognitive, affective, and psychomotor domains.

Bloom's Revised Taxonomy. Diagram: Jessica Shabatura, UARK

There are six levels of cognitive learning according to the revised version of Bloom's Taxonomy. Each level is conceptually different. The six levels are remembering, understanding, applying, analyzing, evaluating, and creating. The new terms are defined as:

  • Remembering: Retrieving, recognizing, and recalling relevant knowledge from long-term memory.
  • Understanding: Constructing meaning from oral, written, and graphic messages through interpreting, exemplifying, classifying, summarizing, inferring, comparing, and explaining.
  • Applying: Carrying out or using a procedure through executing, or implementing.
  • Analyzing: Breaking material into constituent parts, determining how the parts relate to one another and to an overall structure or purpose through differentiating, organizing, and attributing.
  • Evaluating: Making judgments based on criteria and standards through checking and critiquing.
  • Creating: Combining elements to form a coherent or functional whole; reorganizing elements into a new pattern or structure through generating, planning, or producing." [Anderson & Krathwohl, 2001, pp. 67-68]

This Bloom's taxonomy was adapted for machine learning.

Bloom’s Taxonomy Adapted for Machine Learning (ML). Chart: Visual Science Informatics, LLC

There are six levels of model learning in the adapted version of Bloom's Taxonomy for ML. Each level is a conceptually different learning model. The levels are ordered from lower-order to higher-order learning. The six levels are Store, Sort, Search, Descriptive, Discriminative, and Generative. Bloom’s Taxonomy adapted for ML terms are defined as:

  • Store models capture three perspectives: Physical, Logical, and Conceptual data models. Physical data models describe the physical means by which data are stored. Logical data models describe the semantics represented by a particular data manipulation technology. Conceptual data models describe a domain's semantics in the model's scope. Extract, Transform, and Load (ETL) operations are a three-phase process where data is extracted, transformed, and loaded into store models. Collected data can be from one or more sources. ETL data can be stored in one or more models.
  • Sort models arrange data in a meaningful order and systematic representation, which enables searching, analyzing, and visualizing.
  • Search models solve a search problem to retrieve information stored within some data structure, or calculated in the search space of a problem domain, either with discrete or continuous values.
  • Descriptive models specify statistics that quantitatively describe or summarize features and identify trends and relationships.
  • Discriminative models focus on a solution and perform better for classification tasks by dividing the data space into classes by learning the boundaries.
  • Generative models understand how data is embedded throughout space and generate new data points.

Conditional Generative Adversarial Network Model Architecture Example. Diagram: Jason Brownlee

Another decision point in choosing a machine learning model is the difference between a discriminative and a generative model. A discriminative approach focuses on a solution and performs better for classification tasks by dividing the data space into classes by learning the boundaries. A generative approach models how data is embedded throughout the space and generates new data points.

Discriminative vs. Generative. Table: Supervised Learning Cheatsheet

A Neural Network (NN) is a series of algorithms inspired by the structure and function of the human brain. Neural networks are used for a variety of tasks, including image recognition, speech recognition, and natural language processing.

Neural networks have high predictive power but low interpretability, because a neural network is essentially a black box: the inner workings of deep networks are not fully explainable.

“An artificial neuron simply hosts the mathematical computations. Like our neurons, it triggers when it encounters sufficient stimuli. The neuron combines input from the data with a set of coefficients, or weights, which either amplify or dampen that input, which thereby assigns significance to inputs for the task the algorithm is trying to learn". [Anddy Cabrera]

Neural Networks learn by adjusting the weights of the connections between neurons. The weights determine how much influence one neuron has on another. By adjusting the weights, a neural network can learn to perform a specific task.

Neural Networks' Architectures: ANN, RNN, LSTM & CNN. Diagrams: A. Catherine Cabrera, and B. InterviewBit

"Neural Network Standard Components:

  • Nodes: A set of nodes, analogous to neurons, organized in layers.
  • Weights: A set of weights representing the connections between each neural network layer and the layer beneath it. The layer beneath may be another neural network layer, or some other kind of layer.
  • Biases: A set of biases, one for each node.
  • Activation Function: An activation function that transforms the output of each node in a layer. Different layers may have different activation functions." [Google]

Backpropagation is a fundamental algorithm used to train artificial neural networks. It is essentially a computational method for calculating the gradient of the error function with respect to the network's weights. In simpler terms, it helps the network learn from its mistakes by adjusting its parameters to minimize the error between its predicted output and the actual output.
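
The hedged NumPy sketch below shows one forward pass and one backpropagation step for a tiny two-layer network with sigmoid activations and squared error; the layer sizes and learning rate are illustrative assumptions.

```python
# Sketch: one forward pass and one backpropagation step for a tiny 2-layer network.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 3))           # one input sample with 3 features
y = np.array([[1.0]])                 # target output

W1, b1 = rng.normal(size=(3, 4)), np.zeros((1, 4))   # input  -> hidden weights, biases
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))   # hidden -> output weights, biases
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.1

# Forward pass: compute activations layer by layer.
h = sigmoid(x @ W1 + b1)
y_hat = sigmoid(h @ W2 + b2)
loss = 0.5 * np.sum((y_hat - y) ** 2)

# Backward pass: apply the chain rule from the output error back to each weight.
d_out = (y_hat - y) * y_hat * (1 - y_hat)        # gradient at the output pre-activation
dW2, db2 = h.T @ d_out, d_out
d_hid = (d_out @ W2.T) * h * (1 - h)             # propagate the error to the hidden layer
dW1, db1 = x.T @ d_hid, d_hid

# Gradient step: adjust the weights to reduce the error.
W1 -= lr * dW1
b1 -= lr * db1
W2 -= lr * dW2
b2 -= lr * db2
print(f"Loss after the forward pass: {loss:.4f}")
```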

Different neural networks have distinct architectures tailored to their functions and strengths. Here are descriptions of the major neural network architectures:

  • Artificial Neural Network (ANN): ANN is the foundation for other NN architectures. ANNs are loosely inspired by the structure and function of the human brain. They consist of interconnected nodes called neurons, arranged in layers. Data is fed into the input layer, processed through hidden layers, and an output is generated. ANNs are powerful for various tasks such as function approximation, classification, and regression.
  • Recurrent Neural Network (RNN): RNNs are a special kind of ANN designed to handle sequential data such as text or speech. Unlike ANNs where data flows forward, RNNs have connections that loop back, allowing information to persist across steps. This is helpful for tasks such as language translation, speech recognition, and time series forecasting. However, RNNs can struggle with long-term dependencies in data.
  • Long Short-Term Memory (LSTM): LSTMs are a type of RNN specifically designed to address the long-term dependency problems of RNNs. LSTMs have internal mechanisms that can learn to remember information for longer periods, making them very effective for tasks such as time series forecasting, machine translation, caption generation, and handwriting recognition.
  • Convolutional Neural Network (CNN): CNNs are another specialized type of ANN excelling at image and video analysis. CNNs use a specific architecture with convolutional layers that can automatically extract features from the data. This makes them very powerful for tasks such as image recognition, object detection, and image segmentation.

Here is a table summarizing the key differences:

Key Differences of Neural Networks' Architectures. Table: Gemini

Deep Neural Networks (DNNs) are trained using large sets of labeled or unlabeled data and increasingly learn abstract features directly from the data without manual feature extraction. Traditional neural networks may contain around 2-3 hidden layers, while deep networks can have as many as 100-200 hidden layers.

The Neural Network Zoo. Node Maps: Van Veen, F. & Leijnen, S. (2019). The Asimov Institute

A Generative Adversarial Network (GAN) is a type of deep learning system that uses two neural networks to compete against each other. Here is a breakdown of how it works:

  • Two Neural Networks: There are two main parts to a GAN:

- Generator: This network creates new data, such as images or music, based on the data it has been trained on.

- Discriminator: This network tries to tell the difference between the new data created by the generator and real data from the training set.

  • The Adversarial Process: These two networks are pitted against each other. The generator is getting better at creating new data, while the discriminator is getting better at spotting fakes. This creates an ongoing competition that refines both networks.
  • The Result: Over time, the generator learns to create new data that is increasingly difficult for the discriminator to distinguish from real data. Ideally, the generator becomes so good that it can create very realistic and convincing new data.
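
The hedged PyTorch sketch below shows the shape of that adversarial training loop on toy two-dimensional data; the network sizes, learning rates, and data distribution are illustrative assumptions rather than a working image or music generator.

```python
# Sketch: the adversarial training loop of a GAN on toy 2-D data (PyTorch).
import torch
from torch import nn

latent_dim, data_dim = 8, 2
G = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(1000):
    real = torch.randn(64, data_dim) * 0.5 + 2.0    # "real" samples from a toy distribution
    noise = torch.randn(64, latent_dim)
    fake = G(noise)

    # Discriminator step: label real samples as 1 and generated samples as 0.
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # Generator step: try to make the discriminator label fakes as real.
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()

print(f"Final losses - D: {d_loss.item():.3f}, G: {g_loss.item():.3f}")
```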


Deep Belief Networks (DBNs) are a type of deep learning architecture used for unsupervised learning tasks. They can be thought of as building blocks for more complex neural networks. Here is a breakdown of how they work:

- Building Blocks: Restricted Boltzmann Machines (RBMs)

  • DBNs are composed of multiple layers of processing units called Restricted Boltzmann Machines (RBMs).
  • RBMs are relatively simple neural networks with two layers: a visible layer that receives input data, and a hidden layer that extracts features from the data.
  • The key aspect of RBMs is that connections only exist between units in different layers, not between units within the same layer.

- Stacked for Learning: The Deep Belief Network

  • A DBN is essentially a stack of multiple RBMs.
  • The hidden layer of one RBM becomes the visible layer for the next RBM in the stack.
  • This stacking allows DBNs to learn complex features from data in a hierarchical way.
  • Each layer learns to represent the data in a more abstract form, building on the knowledge from the previous layer.

- Unsupervised Training

  • Unlike some deep learning models, DBNs are trained in an unsupervised manner. This means they do not require labeled data for training.
  • The training process involves adjusting the connections between units in each RBM to minimize a specific energy function.
  • By minimizing this energy, the DBN learns to reconstruct the input data as accurately as possible, essentially capturing the underlying patterns within the data.

- Applications of Deep Belief Networks

  • DBNs are often used as a pre-training step for more complex supervised deep learning models.
  • By learning good feature representations from the data in an unsupervised way, DBNs can improve the performance of supervised models when they are fine-tuned for specific tasks such as image recognition or natural language processing.
  • DBNs can also be used for dimensionality reduction, which is helpful for compressing data without losing important information.

- Some limitations of DBNs include:

  • They can be computationally expensive to train, especially with large datasets.
  • Fine-tuning a pre-trained DBN for a specific supervised task can require additional training data.

Overall, Deep Belief Networks are a powerful tool for unsupervised feature learning and can be a valuable component in building more complex deep learning architectures.

There are many more specialized networks, layers, and operations, such as transformers, latent diffusion models, Inception modules, feature pyramid networks, etc.

Note that node maps have limitations in portraying the nuances of deep learning models. There are numerous differences in their usage scenarios, restrictions, mitigations (decaying, vanishing, and exploding information), and scalability. Additional functionality could be to preprocess, encode, or decode information; to learn, predict, or generate through parallel competition; or to un-black-box a model. The differences also lie in inputs (data, feedback, and noise), connectivity (past, present, future, random, reversed, stacked, and extra), and states (activations, triggers, stateless, memory, probabilistic, and pooling multiple weights as a vector).

Activation Function in NN

An Artificial Neuron in Action. Animation: Anddy Cabrera

Activation functions play a crucial role in neural networks by introducing non-linearity, enabling them to learn complex patterns. The evolution of activation functions has been closely tied to the development of neural network architectures and training algorithms. Early activation functions were sigmoid and tanh; since then, improved activation functions have been introduced, which are listed in the table below.

Activation Function Comparison. Table: Gemini

Note: This table provides a brief overview of common activation functions. The choice of activation function often depends on the specific task and architecture of the neural network.
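
For reference, here are plain NumPy implementations of several of these activation functions; GELU uses the common tanh approximation, and Swish is written as x times the sigmoid of x.

```python
# Sketch: NumPy implementations of common activation functions.
import numpy as np

def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))
def tanh(x):    return np.tanh(x)
def relu(x):    return np.maximum(0.0, x)
def leaky_relu(x, alpha=0.01): return np.where(x > 0, x, alpha * x)
def swish(x):   return x * sigmoid(x)
def gelu(x):    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

z = np.linspace(-3, 3, 7)
for fn in (sigmoid, tanh, relu, leaky_relu, swish, gelu):
    print(f"{fn.__name__:>10}: {np.round(fn(z), 2)}")
```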

Factors driving evolution:

  • Vanishing gradient problem: The difficulty of training deep networks due to the multiplication of small gradients.
  • Computational efficiency: The need for efficient activation functions to handle large-scale datasets.
  • Improved performance: The need for better performance in various tasks, such as image classification, natural language processing, and speech recognition.
  • Biological inspiration: The desire to mimic the behavior of biological neurons.
  • Empirical evaluation: The exploration of new activation functions through experimentation and benchmarking.

For example, ReLU is a common choice for many deep learning architectures and helps to alleviate the vanishing gradient problem in deep networks, leading to improved performance, while more specialized functions such as GELU and Swish may be better suited for certain tasks such as translation, text classification, and natural language processing.

However, it is important to note that other activation functions and their variants can also perform well in many cases. The best activation function for a given task may need to be determined through experimentation and evaluation.

If you are considering using a specific activation function in your own projects, it is recommended to try it alongside other activation functions and evaluate its performance on your specific dataset.

Best practices for neural network training:

  • "Prevent Vanishing Gradients: The ReLU activation function can help prevent vanishing gradients.
  • Prevent Exploding Gradients: Batch normalization can help prevent exploding gradients, as can lowering the learning rate.
  • Prevent Dead ReLU Units: Lowering the learning rate can help keep ReLU units from dying.
  • Prevent Overfitting: Dropout regularization removes a random selection of a fixed number of the units in a network layer for a single gradient step. The more units dropped out, the stronger the regularization. This is analogous to training the network to emulate an exponentially large ensemble of smaller networks.
  • "Feed-forward": Only the output node changes. Because inference for this neural network is "feed-forward" (calculations progress from start to finish), the addition of a new layer to the network will only affect nodes after the new layer, not those that precede it." [Google]

Additionally, ML Operations (MLOps) and Continuous ML (CML) are a set of practices that aims to deploy and maintain machine learning models in production reliably and efficiently. [Operations: MLOps, Continuous ML, & AutoML]

ML Algorithms Cheat Sheet. Diagram: SSAS

In conclusion, choosing a Machine Learning (ML) approach depends on multiple complex factors and challenging trade-offs. You will need to consider at least four competing architectural factors: Accuracy, Complexity, Interpretability/Explainability, and Operations. Selecting a machine learning approach that balances all decision factors is important, because the capital investment in the processing pipeline stages is costly and requires considerable time and effort. Therefore, it is highly valuable to employ a rigorous process when choosing machine learning. [27]

Next, read my "Accuracy: The Bias-Variance Trade-off" article at https://www.dhirubhai.net/pulse/accuracy-bias-variance-tradeoff-yair-rajwan-ms-dsc.

---------------------------------------------------------

[1] https://www.dhirubhai.net/pulse/how-choose-right-chart-your-data-yair-rajwan-ms-dsc

[2] https://medium.com/usf-msds/choosing-the-right-metric-for-machine-learning-models-part-1-a99d7d7414e4

[3] https://www.kdnuggets.com/2018/06/right-metric-evaluating-machine-learning-models-2.html

[4] https://towardsdatascience.com/how-to-measure-interpretability-d93237b23cd3

[5] https://towardsdatascience.com/the-balance-accuracy-vs-interpretability-1b3861408062
