Complexity: Time, Space, & Sample
Computational Complexity of Machine Learning Algorithms. Table: thekerneltrip.com 2018

In the article "Which Machine Learning (ML) to choose?" [1], part of the "Architectural Blueprints—The '4+1' View Model of Machine Learning," which helps you choose the right ML for your data, we indicated that "From a business perspective, two of the most significant measurements are accuracy [2] and interpretability" [3]. However, we did not discuss computational complexity.

ML Architectural Blueprints = {Scenarios, Accuracy, Complexity, Interpretability, Operations}

Computational complexity of an algorithm is a fundamental concept in computer science. It is necessary to take computational complexity into account because it affects the amount of resources required to run your model. These required resources are estimated for time, space, and sample complexities.

These computational complexities are approximated for a variety of machine learning algorithms during training and prediction.

Note: Computational complexity and ML model complexity are distinct concepts, though they are often related.

Computational Complexity

  • Definition: Refers to the amount of computational resources (time and space) required to solve a problem using a particular algorithm.
  • Focus: Primarily concerned with the efficiency of the algorithm and its scalability with increasing input size.
  • Metrics: Often measured using Big O notation (e.g., O(n), O(n^2), O(log n)), as illustrated by the short counting sketch below.
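To make these growth rates concrete, here is a minimal Python sketch (illustrative only; the routines are toy examples, not ML algorithms) that counts elementary operations for inputs of increasing size. The quadratic routine quickly dominates.

# Minimal sketch (illustrative only): counting elementary operations
# to contrast O(log n), O(n), and O(n^2) growth for the same input size n.

def count_log_n(n):
    """Binary-search-style halving: about log2(n) steps."""
    steps = 0
    while n > 1:
        n //= 2
        steps += 1
    return steps

def count_n(n):
    """Single pass over n items: about n steps."""
    return sum(1 for _ in range(n))

def count_n_squared(n):
    """Nested pass over n items: about n*n steps."""
    return sum(1 for _ in range(n) for _ in range(n))

for n in (10, 100, 1000):
    print(f"n={n:>5}  O(log n)={count_log_n(n):>3}  "
          f"O(n)={count_n(n):>7}  O(n^2)={count_n_squared(n):>9}")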

ML Model Complexity

  • Definition: Refers to the capacity of a machine learning model to fit complex patterns in the data.
  • Focus: Concerned with the model's ability to learn and generalize from the data.
  • Metrics: Often controlled using regularization techniques (e.g., L1, L2) or assessed by analyzing the model's architecture (e.g., number of layers, neurons).

Relationship between the two:

  • More complex models often require more computational resources: A deep neural network with many layers and neurons will generally be more computationally expensive to train and use compared to a simpler linear regression model.
  • Overfitting: A complex model can be more prone to overfitting, where it learns the training data too well and performs poorly on new data. This can be mitigated using regularization techniques, which can also increase computational complexity.

In summary, while computational complexity is about the efficiency of the algorithm, ML model complexity is about the model's ability to learn and generalize from the data. Understanding both concepts is essential for building effective and efficient machine learning systems.


Definitions: Time, Space, & Sample

Computational Complexity of Machine Learning. Diagram: Visual Science Informatics

  • "Time complexity is the computational complexity that estimates the number of elementary operations performed by an algorithm."
  • "Space complexity of an algorithm is an estimated amount of memory space required to solve an instance of the computational problem as a function of characteristics of the input."
  • "Sample complexity of a machine learning algorithm represents the number of training samples it needs in order to successfully learn a target function."

Relationships between AI, ML, NN, and DL. Diagram: Dirk Zee

The diagram above depicts the relationships between artificial intelligence, machine learning, neural networks, and deep learning, including MLP (multilayer perceptron), CNN (convolutional neural network), RNN (recurrent neural network), DBN (deep belief network), and GAN (generative adversarial network).

Artificial Intelligence (AI) is the broadest term, encompassing the creation of intelligent agents. AI is the overarching goal of creating intelligent machines.

Machine Learning (ML) is a subset of AI that focuses on algorithms allowing computers to learn from data without explicit programming. ML is a computational method to achieve AI by enabling computers to learn from data.

Neural Networks (NNs) are a type of ML algorithm inspired by the human brain's structure. They consist of interconnected nodes (artificial neurons) that process information. NNs are a specific type of ML algorithm.

Deep Learning (DL) is a subset of ML that uses neural networks with multiple layers (deep networks) to learn complex patterns in data. DL is a powerful technique within neural networks for handling complex data.


Brief History of Artificial Intelligence (AI), Machine Learning (ML), & Neural Networks (NNs)

History of Artificial Intelligence (AI), Machine Learning (ML), and Neural Networks (NNs). Diagram: Visual Science Informatics

The brief history diagram above of Artificial Intelligence (AI), Machine Learning (ML), and Neural Networks (NNs) illustrates the evolution of these fields. It also captures the causes and effects on the cyclical nature of their development, including periods of rapid advancement (waves) and subsequent slower progress or decline (winters).

Artificial Intelligence (AI)

The concept of AI can be traced back to ancient myths and legends, but the modern field of AI emerged in the mid-20th century.

  • Early AI: The first AI programs were developed in the 1950s and 1960s, focusing on tasks such as playing games and solving mathematical problems.
  • AI Winter: The field faced setbacks in the 1970s and 1980s due to limitations in computing power and the complexity of real-world problems.
  • AI Renaissance: The resurgence of AI began in the 1990s with advances in machine learning and the availability of larger datasets.

Machine Learning (ML)

ML is a subset of AI that involves teaching computers to learn from data and improve their performance over time without being explicitly programmed.

  • Early ML: Early ML algorithms focused on supervised learning, where the computer is provided with labeled data to learn from.
  • Unsupervised Learning: In the 1990s, unsupervised learning techniques became more popular, allowing computers to find patterns in unlabeled data.
  • Deep Learning: The breakthrough in deep learning in the 2010s, driven by advances in neural networks, has revolutionized the field of ML.

Neural Networks (NNs)

NNs are a type of ML algorithm inspired by the structure and function of the human brain.

  • Early NNs: The first NNs were developed in the 1940s and 1950s, but they were limited in their capabilities.
  • Backpropagation: The development of the backpropagation algorithm in the 1980s allowed NNs to train more efficiently.
  • Deep NNs: Deep NNs with multiple layers have become increasingly powerful in recent years, enabling breakthroughs in tasks such as image recognition, natural language processing, and game playing.

Key milestones and breakthroughs

  • 1956: Dartmouth Conference: The term "artificial intelligence" is coined.
  • 1960s: Early AI programs such as Eliza and SHRDLU.
  • 1980s: Backpropagation algorithm, expert systems.
  • 1990s: Neural networks, support vector machines.
  • 2010s: Deep learning revolution, AlphaGo, self-driving cars.
  • 2020s: Generative AI, large language models.


Overview of ML Data-Driven Methods & Algorithms

The diagram below provides an overview of ML data-driven method categories, such as Artificial Neural Network (ANN), Decision Tree (DT), Density-Based Spatial Clustering of Applications with Noise (DBSCAN), Generalized DBSCAN (GDBSCAN), Gaussian Means (GM), Hierarchical Clustering (HC), k-Nearest Neighbors (kNN), Random Forest (RF), Support Vector Machines (SVM), and eXtreme Gradient Boosting (XGBoost).

Types of ML Methods & Algorithms. Diagram adapted: Aristek System

ML Algorithms. Table: George Firican

Common ML Algorithms. Diagrams: datasciencedojo

Selecting a learning paradigm or computational method involves four primary categories, four major algorithm types, and two major techniques. The four categories are supervised, semi-supervised, unsupervised, and reinforcement learning. The four algorithm types are classification, regression, associations, and clustering. The two techniques are ensemble methods and reward feedback. The chart "Which Machine Learning (ML) to choose?" [Scenarios: Which Machine Learning (ML) to choose?] guides you through the major categories, data types, and objectives to determine which algorithm types or techniques to choose.

Additionally, complexity increases when adding model-agnostic explainability methods. [Interpretability: “Seeing Machines Learn”]

Moreover, complexity affects ML Operations (MLOps) and Continuous ML (CML), which are a set of practices that aims to deploy and maintain machine learning models in production reliably and efficiently. [Operations: MLOps, Continuous ML, & AutoML]

Furthermore, complexity is affected by the quality of your data; good data quality is a necessary prerequisite to building your ML model. [Data Science Approaches to Data Quality: From Raw Data to Datasets]


Time & Space: Big O Notation

"Big O notation is used to find the upper bound (the highest possible amount) of a function's growth rate, meaning it works out the time of the longest route it will take to turn input into output. It is often used to compare the efficiency of different algorithms, which is done by calculating, in a worst-case scenario, how much memory is needed, and how much time it takes to complete." [Wikipedia]

  • Computational Complexity of ML Algorithms [4]

"Calling n the number of training sample,

p the number of features,

n(trees) the number of trees (for computational methods based on various trees),

n(sv) the number of support vectors, and

n(li) the number of neurons at layer i in a neural network,

we have the following approximations." [thekerneltrip]

The enclosed table proposes a theoretical point of view of some algorithms’ upper bounds.
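Complementing these theoretical upper bounds, a quick empirical check (a hedged sketch assuming scikit-learn; the results depend on hardware and solver settings) is to time the same algorithm while doubling the number of training samples and observe how the fit time grows.

# Hedged sketch: observing how training time grows with the number of samples n.
# Assumes numpy and scikit-learn; the trend, not the absolute times, is the point.
import time

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(42)
for n in (1_000, 2_000, 4_000, 8_000):
    X = rng.normal(size=(n, 20))                 # n samples, p = 20 features
    y = (X[:, 0] > 0).astype(int)
    t0 = time.perf_counter()
    SVC(kernel="rbf").fit(X, y)                  # kernel SVM: roughly O(n^2)-O(n^3) training
    print(f"n={n:>5}: fit time {time.perf_counter() - t0:.2f} s")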

Computational Complexity of Machine Learning Algorithms. Table: thekerneltrip 2018 [4]

These computational complexities, with their assumptions, are approximated for a variety of machine learning algorithms in terms of training time, prediction time, and auxiliary space complexity, where:

n = number of training examples,

m = number of features,

n' = number of support vectors,

k = number of neighbors,

k' = number of trees.

Computational Complexity of ML Algorithms. Table: Krish Naik

These are the computational complexities of ML algorithms, where:

- n is the size of the samples [training] set,

- m is the number of dimensions [features],

- epoch refers to the one entire passing of training data through the algorithm,

- c is the number of classes,

- d(tree) is the depth of a tree,

- n(sv) is the number of support vectors,

- k is the number of clusters, and

- i is the number of iterations.

Training and Inference Time Complexity of ML Algorithms. Table: Avi Chawla

Abbreviations: Convolutional Neural Networks (CNN); Density-Based Spatial Clustering of Applications with Noise (DBSCAN); K-Nearest Neighbors (KNN); Non-negative Matrix Factorization (NMF); One-vs-Rest (OvR); Ordinary Least Squares (OLS); Principal Component Analysis (PCA); Recurrent Neural Networks (RNN); Stochastic Gradient Descent (SGD); Support Vector Machines (SVM); and t-distributed Stochastic Neighbor Embedding (t-SNE).

Also, if you deploy your model within ensemble methods, they multiply the complexity of your model. A lower computational complexity of an ML algorithm is significant for optimizing the efficiency of ensemble methods such as Bagging, Boosting, Stacking, and Cascading, because ensembles combine the results of numerous base ML algorithms.
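As a rough illustration of this multiplication effect (a sketch assuming scikit-learn; the exact times are hardware-dependent), the snippet below compares training a single decision tree against a bagged ensemble of 100 such trees on the same synthetic data.

# Hedged sketch: an ensemble's training cost scales roughly with the number of
# base estimators it trains. Assumes numpy and scikit-learn are available.
import time

import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(7)
X = rng.normal(size=(5_000, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

t0 = time.perf_counter()
DecisionTreeClassifier().fit(X, y)
single = time.perf_counter() - t0

t0 = time.perf_counter()
BaggingClassifier(DecisionTreeClassifier(), n_estimators=100).fit(X, y)
ensemble = time.perf_counter() - t0

print(f"single tree: {single:.2f} s, 100-tree bagging: {ensemble:.2f} s "
      f"(~{ensemble / single:.0f}x)")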

ML Algorithms for Prediction. Table: AISOMA

The table above captures the advantages and disadvantages of top machine learning algorithms for prediction: linear, tree-based, and neural network models.

Execution Speeds Dashboard. Chart: KNIME AG

For example, an AutoML model dashboard can visualize execution speed, which is measured during two stages: training time and deployment speed. Training time, typically in seconds, reflects how long it takes to train the model on the data. Deployment speed, measured in milliseconds, focuses on the average time to make a single prediction on new data. This highlights how fast the model can process one new input. Interactive features allow users to adjust plot settings and explore data dynamically, providing valuable insights into both training efficiency and deployment performance of the machine learning model.
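The same two measurements can be reproduced outside a dashboard with a few lines of code (a sketch assuming scikit-learn; the numbers are illustrative): training time in seconds and average per-prediction latency in milliseconds.

# Hedged sketch: measuring training time (seconds) and average single-prediction
# latency (milliseconds), mirroring the two stages described above.
import time

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(5_000, 30))
y = (X[:, 0] > 0).astype(int)

t0 = time.perf_counter()
model = RandomForestClassifier(n_estimators=200).fit(X, y)
print(f"training time: {time.perf_counter() - t0:.2f} s")

x_new = X[:1]                          # one new input
t0 = time.perf_counter()
for _ in range(100):                   # average over repeated single predictions
    model.predict(x_new)
latency_ms = (time.perf_counter() - t0) / 100 * 1_000
print(f"deployment speed: {latency_ms:.2f} ms per prediction")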


Mathematical Formulas

Understanding the mathematical formulas of machine learning algorithms is instrumental in gaining insight into their computational complexity.

Mathematical Formulas of ML Algorithms. Table: Bindu Reddy

Mathematical Formulas of Regression Algorithms. Table: Avi Chawla


Sample: Sampling & Fundamental Concepts

"Sample complexity of a machine learning algorithm represents the number of training samples it needs in order to successfully learn a target function." The two main categories are probability sampling and non-probability sampling methods. In probability sampling, every member of the target population has a known chance of being included. This allows researchers to make statistically significant inferences about the population from the sample. Non-probability sampling methods are used when it's not possible or practical to get a random sample, but they limit the ability to generalize the findings to the larger population.

Sampling Methods and Techniques. Diagram: Asad Naveed

Monte Carlo and Markov Chain Monte Carlo (MCMC) Sampling Methods

Both Monte Carlo and Markov Chain Monte Carlo (MCMC) are powerful techniques for generating samples from complex probability distributions and are often used in Bayesian inference. The main difference is that Monte Carlo methods use random sampling to generate independent samples, while MCMC methods generate a sequence of dependent samples by simulating a Markov chain. While Monte Carlo methods can be used for direct sampling in some cases, MCMC methods often provide a more efficient and flexible approach, especially for complex distributions.

Monte Carlo Methods

  • Direct Sampling: When possible, Monte Carlo methods directly sample from the target distribution.
  • Indirect Sampling: For complex distributions, Monte Carlo methods can use techniques such as importance sampling or rejection sampling to approximate the target distribution.

Markov Chain Monte Carlo (MCMC) Methods

  • Indirect Sampling: MCMC methods construct a Markov chain that converges to the target distribution. By simulating this Markov chain, you can obtain samples that approximate the target distribution.
  • Common MCMC Algorithms: Metropolis-Hastings algorithm and Gibbs sampling.
  • Efficiency: MCMC methods are often more efficient than direct Monte Carlo methods for complex distributions, especially in high-dimensional spaces.
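A minimal Metropolis-Hastings sketch (illustrative only, assuming NumPy, a standard normal target, and a Gaussian random-walk proposal) shows the dependent-sample idea in a few lines.

# Hedged sketch: random-walk Metropolis-Hastings targeting a standard normal.
# Each sample depends on the previous one; the chain converges to the target.
import numpy as np

def log_target(x):
    """Log-density of the target (standard normal, up to a constant)."""
    return -0.5 * x * x

rng = np.random.default_rng(0)
x, samples = 0.0, []
for _ in range(50_000):
    proposal = x + rng.normal(scale=1.0)            # random-walk proposal
    # Accept with probability min(1, target(proposal) / target(x)).
    if np.log(rng.uniform()) < log_target(proposal) - log_target(x):
        x = proposal
    samples.append(x)

burned = np.array(samples[5_000:])                  # discard burn-in
print(f"mean ~ {burned.mean():.3f}, std ~ {burned.std():.3f}  (target: 0, 1)")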

Key Differences between Monte Carlo vs. MCMC. Table: Gemini

In conclusion, both methods are valuable tools for statistical inference and machine learning, but their applicability and efficiency can vary depending on the specific problem and the complexity of the target distribution. The choice of method depends on the complexity of the distribution and the desired level of accuracy. MCMC methods are particularly powerful for handling complex distributions and high-dimensional problems.


Entropy in Information Theory

When discussing the learnability of models and the sample complexity required for effective training, Entropy in information theory, Probably Approximately Correct (PAC) learning theory, and VC (Vapnik-Chervonenkis) dimension are fundamental concepts.

  • Based on information theory, Entropy calculates the average amount of information needed to correctly classify an instance from a node, considering the probabilities of each class. Values range from 0 to log2(k) (number of classes): 0 represents a pure node, and higher values imply greater uncertainty. [Accuracy: The Bias-Variance Trade-off #8) Gini Impurity and Entropy]

Here is a breakdown of how it works:

  • Decision trees: These are tree-like models used for classification, where each internal node represents a feature (attribute) of the data, and each branch represents a possible value of that feature. The leaf nodes represent the class labels.
  • Entropy at a node: It reflects the impurity or randomness of the data reaching a particular node in the tree. Higher entropy indicates more uncertainty about the class labels of the data points.
  • Information needed to classify: Classifying an instance involves navigating the decision tree based on its features. Entropy essentially tells you how much information, on average, is needed (in terms of the number of questions asked) to correctly classify an instance reaching a specific node.

For example, suppose you have a dataset of emails classified as spam or not spam. At a particular node in the decision tree, the entropy might indicate that, on average, you would need to ask about two additional features (such as the presence of certain keywords) to correctly classify an incoming email as spam or not spam.

Here is the formula for calculating entropy in information theory:

Entropy = -Σ(p(i) * log2(p(i)))
        

Where:

  • Σ (sigma) represents the sum across all possible classes (i).
  • p(i) represents the probability of class i.
  • log2 is the logarithm base 2.

The outcome is in bits, representing the average information needed per classification.
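A minimal sketch of this calculation (plain NumPy; the class probabilities are assumed for illustration):

# Hedged sketch: entropy in bits for a discrete class distribution,
# Entropy = -sum(p(i) * log2(p(i))); terms with p(i) = 0 contribute nothing.
import numpy as np

def entropy(probabilities):
    p = np.asarray(probabilities, dtype=float)
    p = p[p > 0]                       # 0 * log2(0) is treated as 0
    return float(-np.sum(p * np.log2(p)))

print(entropy([1.0, 0.0]))             # pure node            -> 0.0 bits
print(entropy([0.5, 0.5]))             # maximum uncertainty  -> 1.0 bit
print(entropy([0.9, 0.1]))             # mostly one class     -> ~0.47 bits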

Probably Approximately Correct (PAC) Learning Theory

  • PAC learning formalizes the idea of learning a model from data with some guarantees.
  • It sets up a framework where we can say a model is "learnable" if it can be trained on a dataset of sufficient size to perform well on unseen data with high probability.
  • PAC learning uses two parameters:

- ε (epsilon): This is the desired error tolerance. We want the model's generalization error (error on unseen data) to be within ε of its training error (error on the training data) with high probability.

- δ (delta): This is the confidence parameter. We want this level of confidence (1 - δ) that the model's generalization error is within ε of the training error.

VC (Vapnik-Chervonenkis) Dimension

  • VC dimension is a measure of the complexity of a hypothesis class (the set of all possible models) used in learning.
  • It essentially tells us how well the class can represent different concepts based on the data it sees.
  • A higher VC dimension indicates a more complex class that can fit a wider range of patterns. However, this also makes it more prone to overfitting.

There is a connection between VC dimension and sample complexity in PAC learning. Intuitively, learning a more complex concept class (higher VC dimension) requires more training data to achieve the desired accuracy with high probability (as specified by ε and δ in PAC learning).

PAC learning theory provides bounds on the number of training samples (sample complexity) needed to learn a concept class with a given VC dimension within the desired accuracy (ε) and confidence (δ).

In simpler terms, if a concept class has a lower VC dimension (less complex), it can be learned with a smaller training dataset to achieve the same level of performance compared to a class with a higher VC dimension.

VC dimension helps us understand the inherent complexity of a model class, while PAC learning provides a framework to analyze how much data is needed to train a model effectively with some guarantees on its generalization performance.
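To make this data-versus-complexity trade-off concrete, the sketch below evaluates a commonly quoted PAC-style sample-complexity bound for a hypothesis class of VC dimension d. The constant factors are deliberately omitted; the point is how the required sample size grows with d, the error tolerance ε, and the confidence parameter δ.

# Hedged sketch: the classic PAC-style bound m = O((1/eps) * (d*ln(1/eps) + ln(1/delta))),
# evaluated here WITHOUT its constant factors -- the point is how the required
# sample size grows with VC dimension d, error tolerance eps, and confidence delta.
import math

def pac_sample_bound(vc_dim, eps, delta):
    """Order-of-magnitude sample size for VC dimension vc_dim, error eps, confidence 1 - delta."""
    return (vc_dim * math.log(1 / eps) + math.log(1 / delta)) / eps

for d in (2, 10, 100):
    m = pac_sample_bound(vc_dim=d, eps=0.05, delta=0.05)
    print(f"VC dim {d:>3}: on the order of {m:,.0f} samples (eps=0.05, delta=0.05)")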


Generative Models Complexity

The roots of generative AI can be traced back to the early days of AI research. While the term "Generative AI" is relatively new, the underlying concepts and techniques have been evolving for decades. Here are some key milestones:

  • 1950s: Alan Turing introduced the Turing Test, a method to determine if a machine could exhibit intelligent behavior indistinguishable from a human.
  • 1960s: Early experiments in natural language processing and pattern recognition laid the groundwork for generative models.
  • 1980s and 1990s: Neural networks, inspired by the human brain, began to be applied to AI tasks, including generative modeling.
  • 2014: A breakthrough came with the introduction of Generative Adversarial Networks (GANs) by Ian Goodfellow and others. GANs use a competitive process between a generator and a discriminator to create highly realistic and diverse outputs.
  • 2020s: The widespread availability of large datasets and powerful hardware, combined with advancements in deep learning techniques, has led to a surge in generative AI applications.

Where did Generative AI come from? Chart: Laurie McCabe

Generative models learn how the data is distributed throughout the feature space and can generate new data points. Discriminative models focus on the decision itself and typically perform better for classification tasks, dividing the data space into classes by learning the boundaries between them.

Discriminative vs. Generative. Table: Supervised Learning Cheatsheet

Deep Learning Algorithms. Table: George Firican


A full Transformer contains both an encoder and a decoder. Diagram: Google

Transformer Model Architecture. Diagrams: Ashish Vaswani et al.


Big O for LLMs

"Self-attention forces every word in the context to learn the relevance of all the other words in the context. So, it is tempting to proclaim this an O(N2) problem, where:

  • N is the number of tokens in the context.

As if the preceding Big O were not disturbing enough, Transformers contain multiple self-attention layers and multiple self-attention heads per self-attention layer, so Big O is actually:

O(N^2 · S · D)
        

where:

  • S is the number of self-attention layers.
  • D is the number of heads per layer." [Google]
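The quadratic term is easy to see in code: a minimal, framework-free NumPy sketch of one self-attention head (an illustration, not a production Transformer) builds an N × N score matrix, so doubling the context length quadruples that matrix.

# Hedged sketch: a single self-attention head in numpy. The scores matrix is
# N x N, which is where the O(N^2) term in the complexity estimate comes from.
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])        # shape (N, N): quadratic in N
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over each row
    return weights @ V

rng = np.random.default_rng(0)
N, d = 512, 64                                     # context length, head dimension
X = rng.normal(size=(N, d))
W = [rng.normal(size=(d, d)) * 0.1 for _ in range(3)]
out = self_attention(X, *W)
print(out.shape, f"score matrix holds {N * N:,} entries")  # (512, 64), 262,144 entries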

A Large Language Model (LLM) is a type of generative model that excels at understanding and generating human language. LLMs are trained on massive amounts of text data, allowing them to process information, respond to questions, and even create different creative text formats. Attention is a technique in neural networks that allows the model to focus on specific parts of its input data; the attention mechanism focuses on the important bits. It is inspired by the human ability to selectively concentrate on relevant information while ignoring irrelevant details. Comparing LLM complexity requires granular parameters of time, space, and sample. Comparing the complexity of different LLM models can be based on:

Dataset Size:

  • Dataset size refers to the total amount of data used to train the model. It is measured in units such as terabytes (TB), petabytes (PB), or exabytes (EB).

Tokens:

  • A token is the basic unit of text a model understands, like a word or sub-word. The total number of tokens in the training dataset gives an idea of the model's exposure to different text combinations.

Compute Size:

  • This refers to the amount of computational resources used to train the model. It's typically measured in petaflop/s-days (floating-point operations per second for a certain number of days).

Model Size:

  • This refers to the number of parameters the LLM has. Parameters are like dials or knobs in the model's internal network that get adjusted during training. A higher parameter count generally indicates a more complex and potentially more capable model.

Estimating Word Count (Indirectly):

  • If you have the dataset size in tokens and an assumed words-per-token ratio (often quoted as roughly 0.75 words per token for English text), you can calculate an approximate word count (number of tokens * words per token), as in the tiny calculation below. This is a rough estimate because tokenization, average word length, and vocabulary vary across datasets.
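A tiny back-of-the-envelope calculation (both the token count and the 0.75 words-per-token ratio below are assumptions for illustration):

# Hedged sketch: rough word-count estimate from a token count.
# Both numbers below are illustrative assumptions, not real dataset statistics.
tokens = 1_000_000_000_000        # assumed: a 1-trillion-token training corpus
words_per_token = 0.75            # assumed rule of thumb for English text

approx_words = tokens * words_per_token
print(f"~{approx_words:,.0f} words ({approx_words / 1e9:,.0f} billion)")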

Understanding computational complexity helps you choose an efficient machine-learning algorithm for your problem, considering non-functional factors such as required response time, memory and storage size, and dataset sample size.


Sorting Algorithms' Computational Complexity

An example from the domain of sorting algorithms illustrates the impact of computational complexity on algorithms' efficiency. "Efficient sorting is consequential for optimizing the efficiency of other algorithms, such as search and merge algorithms, which require input data to be in sorted lists."

The following animations illustrate how effectively data sets from different starting points can be sorted using different algorithms.
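For a quick numerical feel (a sketch in pure Python; absolute times depend on the machine), the snippet below times an O(n^2) insertion sort against Python's built-in O(n log n) sort (Timsort) on the same random data.

# Hedged sketch: O(n^2) insertion sort vs. the built-in O(n log n) sort (Timsort).
import random
import time

def insertion_sort(values):
    values = list(values)
    for i in range(1, len(values)):
        key, j = values[i], i - 1
        while j >= 0 and values[j] > key:   # shift larger elements right
            values[j + 1] = values[j]
            j -= 1
        values[j + 1] = key
    return values

data = [random.random() for _ in range(10_000)]

t0 = time.perf_counter()
insertion_sort(data)
print(f"insertion sort: {time.perf_counter() - t0:.2f} s")

t0 = time.perf_counter()
sorted(data)
print(f"built-in sort:  {time.perf_counter() - t0:.4f} s")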

Sorting Algorithms. Animation: Toptal


Statistical Machine Learning (SML)

Statistical Machine Learning (SML) merges statistics with the computational sciences: computer science, systems science, and optimization. SML provides mathematical tools for analyzing behavior and the generalization performance of ML algorithms. “The major difference between machine learning and statistics is their purpose. Machine learning models are designed to make the most accurate predictions possible. Statistical models are designed for inference about the relationships between variables." [5]

Algorithms of ML. Mind map: Jason Brownlee

Next, read the "Interpretability: Seeing Machines Learn - Understanding models' prediction/decision reasoning/transparency: Why? – Trust; How?" article at https://www.dhirubhai.net/pulse/interpretabilityexplainability-understanding-models-rajwan-ms-dsc.

---------------------------------------------------------

[1] https://www.dhirubhai.net/pulse/machine-learning-101-which-ml-choose-yair-rajwan-ms-dsc

[2] https://www.dhirubhai.net/pulse/accuracy-bias-variance-tradeoff-yair-rajwan-ms-dsc

[3] https://www.dhirubhai.net/pulse/interpretabilityexplainability-understanding-models-rajwan-ms-dsc/

[4] https://www.thekerneltrip.com/machine/learning/computational-complexity-learning-algorithms

[5] https://towardsdatascience.com/the-actual-difference-between-statistics-and-machine-learning-64b49f07ea3

A must-read is "Computational complexity of machine learning algorithms" at https://www.thekerneltrip.com/machine/learning/computational-complexity-learning-algorithms. Also, an excellent in-depth read is "The Future of Computation for Machine Learning and Data Science" at https://towardsdatascience.com/the-future-of-computation-for-machine-learning-and-data-science-fad7062bc27d
