The Big O notation and its significance in LLMs

This chart shows various complexities using Big-O notation, describing how the runtime or resource usage of an algorithm scales with input size (n). I'll explain these complexities in the context of large language models (LLMs) and transformer architectures. But first, an introduction.

Introduction

Big O notation is a mathematical tool used to describe how quickly the resource requirements of an algorithm—such as time or memory—grow relative to the size of the input. Instead of focusing on exact runtimes, Big O gives a high-level view of an algorithm’s efficiency as input size moves toward very large values.

It provides a way to classify algorithms into broad categories (like O(n), O(n²), or O(log n)) that reflect how computation scales, making it easier for engineers, researchers, and developers to compare methods and guide their choices.

In the context of large language models (LLMs), understanding Big O notation becomes especially important. Modern LLMs often process enormous corpora of text and handle increasingly lengthy input sequences.

As these models grow in parameter count and complexity, the computational and memory demands can scale dramatically, influencing everything from training costs to inference latency.

By applying Big O principles, practitioners can better estimate the resources required to train, fine-tune, or run these models, and gain insights into how new techniques might improve efficiency. In short, Big O notation offers a conceptual framework for understanding and managing the computational challenges at the heart of developing and deploying LLMs.

According to Wikipedia

Big O notation is a mathematical notation that describes the limiting behavior of a function when the argument tends towards a particular value or infinity. Big O is a member of a family of notations invented by German mathematicians Paul Bachmann, Edmund Landau and others, collectively called Bachmann–Landau notation or asymptotic notation. The letter O was chosen by Bachmann to stand for Ordnung, meaning the order of approximation.

Drawing a chart for all possibilities


A graph illustrating all 14 Big-O notations as shown in the chart. Each curve shows how runtime scales with input size for a different growth rate.
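
If you want to reproduce a chart like this yourself, here is a minimal matplotlib sketch. It covers only a handful of the growth rates, with values clipped so the slower curves stay visible; the chosen input range and clipping threshold are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

n = np.arange(1, 51, dtype=float)
curves = {
    "O(1)": np.ones_like(n),
    "O(log n)": np.log2(n),
    "O(n)": n,
    "O(n log n)": n * np.log2(n),
    "O(n^2)": n ** 2,
    "O(2^n)": 2.0 ** n,
}

plt.figure(figsize=(7, 4))
for label, y in curves.items():
    # Clip very fast-growing curves so the slower ones remain readable on a linear axis.
    plt.plot(n, np.clip(y, None, 2_500), label=label)
plt.xlabel("input size n")
plt.ylabel("operations (clipped)")
plt.title("Common Big-O growth rates")
plt.legend()
plt.show()
```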

Brief Explanation of all the notations w.r.t LLMs

Explanation of the 14 Notations with Examples in Large Language Models (LLMs) and Transformers


Complete Big-O Notations and potential LLM Examples

Simulating Attention Mechanisms

Simulating Scaled Dot-Product Attention

This is the core of the attention mechanism:

  • Visualizing Attention Scores: Using a small example sequence, we can calculate and visualize attention scores for each token.
  • Simulating Multi-Head Attention: We see how attention is computed across multiple "heads" to capture diverse relationships.
  • Sparse Attention Mechanisms: For longer sequences, we simulate sparse attention patterns like local attention or Longformer-style attention.


Attention Weights (Softmax Scores)

The simulation showcases the scaled dot-product attention mechanism with:

  • Heatmap of Attention Weights: Rows represent query tokens and columns represent key tokens. The values in the heatmap are the attention scores after applying the softmax function, showing how much each token attends to the others.
  • Attention Outputs: The weighted sum of value vectors (V) computed using these attention weights. These outputs are the "contextualized" embeddings of each token.
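
As a concrete reference, here is a minimal NumPy sketch of scaled dot-product attention on a toy sequence. The sequence length, embedding size, and random Q, K, V values are illustrative assumptions, not the exact setup behind the figures above.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for a single sequence."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (seq_len, seq_len) pairwise scores -> O(n^2)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the key dimension
    return weights @ V, weights                      # contextualized outputs, attention heatmap

# Toy example: 5 tokens, embedding dim 8 (illustrative sizes only)
rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
Q = rng.normal(size=(seq_len, d_model))
K = rng.normal(size=(seq_len, d_model))
V = rng.normal(size=(seq_len, d_model))

outputs, attn_weights = scaled_dot_product_attention(Q, K, V)
print(attn_weights.round(2))   # rows = query tokens, columns = key tokens
print(outputs.shape)           # (5, 8)
```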


Extending the Simulation to Multi-Head Attention

Multi-Head Attention is an extension of the scaled dot-product attention mechanism that improves the model's ability to capture diverse relationships in the input. Instead of performing a single attention calculation, multiple attention mechanisms (or "heads") are computed in parallel, and their outputs are concatenated.

Here’s how it works:

Key Components of Multi-Head Attention:

  • Multiple Heads: Each head has its own Q, K, and V matrices and focuses on a different part of the input or captures different relationships.
  • Parallel Attention: Each head computes scaled dot-product attention independently.
  • Concatenation and Linear Transformation: The outputs of all heads are concatenated and passed through a linear transformation to produce the final result.
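
Here is a minimal sketch of that flow, reusing the scaled_dot_product_attention helper from the previous snippet. The head count, dimensions, and random (untrained) projection matrices are illustrative assumptions.

```python
import numpy as np

def multi_head_attention(X, num_heads, rng):
    """Toy multi-head attention: per-head Q/K/V projections, parallel attention, concat + linear."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    head_outputs, head_weights = [], []
    for _ in range(num_heads):
        # Each head gets its own (random, untrained) projection matrices.
        W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        out, w = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
        head_outputs.append(out)
        head_weights.append(w)
    concat = np.concatenate(head_outputs, axis=-1)   # (seq_len, d_model)
    W_o = rng.normal(size=(d_model, d_model))        # final linear transformation
    return concat @ W_o, head_weights

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                          # 5 tokens, embedding dim 8
output, per_head_weights = multi_head_attention(X, num_heads=2, rng=rng)
print(output.shape)                                  # (5, 8)
print(len(per_head_weights))                         # one 5x5 attention heatmap per head
```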

Multi-Head Attention

Exploring the multi-head attention mechanism, with multiple heads focusing on different input relationships.


Multi-Head Attention Simulation

Here’s the visualization of multi-head attention:

Attention Heatmaps for Each Head:

  • Each heatmap represents the attention weights computed by a specific head.
  • Each head attends to the sequence tokens differently, allowing the model to capture diverse relationships.

Concatenated Output

After computing attention for each head, the outputs (weighted sums of values) are concatenated.

The concatenated output has dimensions (sequence_length, embedding_dim), here 5 × 8.


Key Takeaways:

  • Multi-head attention enhances the model's ability to attend to different aspects of the input sequence.
  • Each head operates on a subset of the embedding dimensions, making the computations more diverse and efficient.

Sparse Attention and Its Role in Transformers

Sparse Attention is an optimization of the standard attention mechanism that addresses the inefficiency of computing O(n²) relationships in large sequences. Sparse attention reduces the computational and memory costs by focusing on a subset of token pairs instead of all possible combinations.

Key Types of Sparse Attention

Local Attention:

  • Each token attends to only a fixed number of nearby tokens (e.g., a sliding window).
  • Useful for tasks like text generation, where immediate context matters more than distant relationships.

Strided Attention:

  • Each token attends to tokens at fixed intervals.
  • Helps capture periodic or structured relationships in data.

Global Attention:

  • A small set of special tokens (e.g., [CLS]) attends to all tokens, and vice versa.
  • Enables long-range dependencies while reducing computation.

Random Attention:

  • Tokens attend to a random subset of other tokens.
  • Balances efficiency and generalization.

Longformer-Style Attention:

  • Combines local and global attention patterns.
  • Used for handling very long sequences efficiently.
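
To make these patterns concrete, here is a small sketch that builds boolean attention masks for the local, strided, and global cases and combines local and global into a Longformer-style pattern. The sequence length, window size, stride, and the choice of token 0 as the global token are illustrative assumptions.

```python
import numpy as np

def local_mask(n, window=2):
    """True where query i may attend to key j: |i - j| <= window."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= window

def strided_mask(n, stride=3):
    """Each token attends to tokens at fixed intervals (same stride class)."""
    idx = np.arange(n)
    return (idx[:, None] - idx[None, :]) % stride == 0

def global_mask(n, global_tokens=(0,)):
    """Designated tokens (e.g., a [CLS]-like token at position 0) attend to and from everything."""
    mask = np.zeros((n, n), dtype=bool)
    mask[list(global_tokens), :] = True   # global tokens attend to all positions
    mask[:, list(global_tokens)] = True   # all positions attend to the global tokens
    return mask

n = 8
longformer_style = local_mask(n, window=2) | global_mask(n, global_tokens=(0,))
print(longformer_style.astype(int))       # diagonal band plus a full first row and column
```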


Sparse Attention Simulation

Here is the visualization of sparse attention using local attention with a sliding window size of 2. Each token attends only to its nearby neighbors (±2 tokens), significantly reducing the computational complexity compared to standard full attention.
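
Here is a minimal NumPy sketch of that local-attention simulation, reusing the local_mask helper from the sketch above: a ±2 window is enforced by masking the score matrix before the softmax. The sequence length and random inputs are illustrative assumptions.

```python
def local_attention(Q, K, V, window=2):
    """Sliding-window attention: mask scores outside ±window, then softmax as usual."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    mask = local_mask(len(Q), window=window)        # from the sketch above
    scores = np.where(mask, scores, -np.inf)        # block attention outside the window
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d = 8, 4
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
_, w = local_attention(Q, K, V, window=2)
print((w > 0).astype(int))   # non-zero weights form a diagonal band of width 2*window + 1
```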


Key Observations:

Attention Heatmap:

The non-zero regions form a diagonal band, corresponding to the tokens that each query token attends to within its local context.

Benefits:

  • Reduces computational cost from O(n²) (full attention) to O(n · window_size).
  • Suitable for long sequences where distant relationships are less important (e.g., local patterns in text).

Applications in Transformers:

  • Longformer: Combines local attention with global attention to efficiently handle long documents.
  • Efficient Transformers: Variants like Reformer or BigBird use sparse attention to reduce memory usage while maintaining good performance.

Finally

Big O notation is vital for AI companies, researchers, and startups as it provides a mathematical framework for evaluating algorithm efficiency and scalability.

In AI-driven businesses, where models process vast datasets, understanding computational complexity helps optimize performance while managing costs.

For researchers, Big O analysis guides model selection, architecture design, and algorithm innovation by revealing computational bottlenecks and enabling resource-efficient scaling.

Startups benefit by leveraging complexity insights to build scalable solutions with limited infrastructure, ensuring competitiveness in a rapidly evolving tech landscape. Mastering Big O notation empowers stakeholders across the AI ecosystem to make data-driven decisions that balance accuracy, speed, and operational feasibility.

Hope this helps a bit.

PS: We've been training our Hominis.io for nearly a year now and already have our 13B model ready for use and currently on track to deliver our 1B model for edge- and mobile computing.

Reach out to me at [email protected] if you want to know more.

Addendum

For the header image above, here is a description of each notation:

  1. O(1) – Constant Time. Meaning: The runtime does not depend on the input size. Example in Transformers: Rare in practice. An operation like accessing a specific parameter or performing a constant-sized embedding lookup is O(1) since it takes the same amount of time regardless of input length.
  2. O(log n) – Logarithmic Time. Meaning: The runtime grows logarithmically as the input size increases. Example in Transformers: Efficient searching methods, such as using hierarchical softmax or a binary search structure to select the next token from a large vocabulary, may achieve O(log n) complexity.
  3. O(n) – Linear Time. Meaning: The runtime grows linearly with the input size n. Example in Transformers: Creating token embeddings for an input sequence of length n is O(n) because each token is embedded once.
  4. O(n log n) – Quasilinear Time. Meaning: The runtime is proportional to n log n. Example in Transformers: Sorting operations during token generation or in beam search decoding could be O(n log n), especially when ranking probabilities over large vocabularies.
  5. O(n²) – Quadratic Time. Meaning: The runtime grows quadratically with the input size. Example in Transformers: The self-attention mechanism in standard transformers has O(n²) complexity. This is because the model computes pairwise interactions between all pairs of tokens in a sequence, which becomes expensive for long input sequences.
  6. O(2ⁿ) – Exponential Time. Meaning: The runtime doubles with each additional input element. Example in Transformers: Rarely encountered. It would appear in exhaustive combinatorial searches or attempts to consider all possible paths. Such complexity is generally avoided due to its impracticality.
  7. O(n!) – Factorial Time. Meaning: The runtime grows factorially with the input size. Example in Transformers: Not directly related to common transformer tasks, but could appear in theoretical scenarios where considering all permutations of a sequence is required. This complexity is never used in practice for LLMs due to its extreme computational cost.

Key Considerations for LLMs and Transformers:

Quadratic Bottleneck (O(n²)): The primary computational challenge in transformers arises from the O(n²) complexity of the self-attention mechanism over long sequences. As sequence length increases, the pairwise token interactions grow quadratically. To address this, researchers explore methods like sparse attention or linear transformers, which aim to reduce this bottleneck.

Vocabulary Size (O(n)): While sequence length contributes a quadratic cost in traditional attention, vocabulary size generally affects runtime linearly during token prediction. Larger vocabularies require more computation per token, but this typically grows more slowly than the quadratic cost from longer sequences.
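
A rough back-of-the-envelope comparison makes the difference concrete. The hidden size and vocabulary size below are assumed values, and constants and layer counts are ignored.

```python
# Rough FLOP estimates (multiply-accumulates), ignoring constants and layer counts.
d_model = 1024          # hidden size (assumed)
vocab = 50_000          # vocabulary size (assumed)

for n in (1_024, 2_048, 4_096, 8_192):          # sequence lengths
    attention_flops = n * n * d_model           # QK^T scores + weighted sum ~ O(n^2 * d)
    vocab_flops = n * d_model * vocab           # output projection ~ O(n * d * V)
    print(f"n={n:5d}  attention~{attention_flops:.2e}  vocab projection~{vocab_flops:.2e}")
```

Doubling the sequence length quadruples the attention term but only doubles the vocabulary-projection term.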

Practical Optimizations:

  1. Linearized Attention: Converts O(n²) complexity into O(n) by approximating or reformulating the attention mechanism, making it more scalable for long sequences (see the sketch after this list).
  2. Efficient Decoding Algorithms: Techniques like beam search or nucleus sampling can achieve approximately O(n log n) performance, balancing quality and efficiency when generating output tokens.
  3. Chunked Processing: Splitting long input sequences into smaller chunks reduces both memory and computational overhead. This approach helps circumvent the quadratic complexity by effectively handling smaller subsequences at a time.
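
As an illustration of the first point, here is a minimal sketch of kernel-based linearized attention in the spirit of linear-transformer work. The feature map (elu(x) + 1 is one common choice) and the random inputs are assumptions, not any specific model's implementation.

```python
import numpy as np

def feature_map(x):
    """A simple positive feature map, elu(x) + 1, one common choice in linear attention."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Reorders softmax-style attention into phi(Q) (phi(K)^T V), avoiding the n x n score matrix."""
    Qf, Kf = feature_map(Q), feature_map(K)              # (n, d)
    KV = Kf.T @ V                                        # (d, d)  -- costs O(n * d^2), not O(n^2)
    normalizer = Qf @ Kf.sum(axis=0, keepdims=True).T    # (n, 1) row-wise normalization
    return (Qf @ KV) / normalizer

rng = np.random.default_rng(0)
n, d = 1_000, 16
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
out = linear_attention(Q, K, V)
print(out.shape)   # (1000, 16), computed without forming a 1000 x 1000 attention matrix
```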


Stanley Chong

IIoT Edge Gateway / Co-Founder / BUILD + Deploy Customer Solutions

2 months ago

Insightful write-up mate. Funny how we're wrestling with O(n²) complexity in LLMs while dreaming of AGI. Maybe the real breakthrough isn't just throwing more compute at the problem, but finding those elegant mathematical shortcuts our future AGI overlords will figure out... #AIEfficiency #FutureOfCompute
