Optimisation Strategies to Speed up Transformers
Himank Jain
Senior Data Scientist @Bajaj Finserv Health | Translating complex data into simpler solutions for Healthcare | Problem Solver | Learner
Transformers have revolutionised the field of natural language processing (NLP) and have found applications in various domains such as machine translation, text generation, and more. The transformer architecture employs self-attention mechanisms to process input data in parallel, addressing the limitations of recurrent neural networks (RNNs) and achieving state-of-the-art performance on many tasks. Despite their success, transformers are computationally intensive and often require significant resources in terms of memory and processing power. This makes optimising transformers critical for practical applications, especially in environments with limited computational resources or where low-latency responses are crucial.
Types of Optimisation Strategies
Optimisation strategies for transformers can be broadly categorised into:
- Algorithmic Improvements: modifying the transformer architecture or training procedure to enhance efficiency. Techniques such as reducing the complexity of the self-attention mechanism, employing sparse attention, and applying knowledge distillation fall under this category.
- Hardware Acceleration: leveraging specialised hardware such as GPUs, TPUs, and FPGAs to speed up computation. These accelerators are often coupled with low-level optimisations and parallel processing capabilities to maximise performance.
- Software Engineering Practices: optimising the implementation of transformer models. This includes leveraging optimised libraries, using mixed-precision training, and implementing efficient data handling and batching.
In this exploration, we delve into Software Engineering practices and how they can be used to efficiently speed up Transformers. By examining these strategies, we aim to provide a comprehensive understanding of the current state-of-the-art methods for enhancing transformer efficiency, thus enabling their wider and more effective deployment across different applications and industries.
Optimisation Strategies
Fixed-Length Padding, Dynamic Padding and Uniform Length Batching
Fixed-length padding is the process of ensuring that all input sequences in a batch fed into the model have the same length by adding padding tokens to shorter sequences.
Let's illustrate fixed-length padding with an example. Consider the following two sentences:
- "She enjoys reading books."
- "He loves coding in Python."
If the maximum sequence length allowed by the model is 8 tokens, each sentence is padded with [PAD] tokens until it reaches that length:
Padded sentences:
- "She enjoys reading books [PAD] [PAD] [PAD] [PAD]"
- "He loves coding in Python [PAD] [PAD] [PAD]"
Dynamic padding pads the sequences in each batch only to the length of the longest sequence within that batch, rather than to a single fixed length or to the longest sequence in the entire dataset. Unlike fixed-length padding, which standardises sequence lengths across the whole dataset, dynamic padding adds only as many [PAD] tokens as each batch actually needs, as in the sketch below.
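A minimal sketch of dynamic padding, assuming the Hugging Face `DataCollatorWithPadding` utility, which pads each batch only to its own longest sequence:

```python
# Dynamic padding: sequences are tokenised without padding, and a data collator
# pads each batch only to the longest sequence in that particular batch.
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # illustrative model
collator = DataCollatorWithPadding(tokenizer=tokenizer)

examples = [tokenizer(s) for s in ["She enjoys reading books.",
                                   "He loves coding in Python."]]
batch = collator(examples)
print(batch["input_ids"].shape)  # padded only to the longest sequence in this batch
```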
Uniform-length batching goes one step further by grouping sequences of similar length into the same batch for training or inference. Whereas dynamic padding pads to the longest sequence within an arbitrary batch, uniform-length batching forms batches whose sequences are already close to the same length, so little or no padding is required. This is typically done by dividing the dataset into buckets based on sequence length and then forming batches by sampling sequences from these buckets.
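The bucketing idea can be sketched in a few lines; the helper below is illustrative rather than taken from a specific library:

```python
# Uniform-length batching: sort tokenised examples by length and group neighbours,
# so every batch contains sequences of (nearly) the same length.
def uniform_length_batches(tokenized_examples, batch_size):
    ordered = sorted(tokenized_examples, key=lambda ex: len(ex["input_ids"]))
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
```

In practice the resulting batches are usually shuffled among themselves after bucketing, so training still sees the data in a randomised order.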
In the illustrated example, fixed-length padding uses 168 tokens in total; dynamic padding reduces this to 160 tokens, and uniform-length batching drops it further to 124 tokens.
Numeric Precision Reduction
Numeric precision reduction speeds up computation by representing weights and activations with lower-precision floating-point numbers, and it is one of the most common ways to improve prediction speed.
Automatic Mixed Precision (AMP) is a technique that accelerates deep learning model training by combining single-precision (float32) and half-precision (float16) arithmetic. Modern GPUs handle float16 computations faster and with less memory than float32.
AMP dynamically changes the precision during training, using float16 for most calculations but switching to float32 when necessary to avoid numerical issues like underflow or overflow.
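In PyTorch, AMP can be enabled with `autocast` and a `GradScaler`; the sketch below assumes `model`, `optimizer`, `loss_fn`, and `dataloader` are already defined:

```python
# Automatic Mixed Precision: run the forward pass in float16 where safe,
# and scale the loss to avoid float16 gradient underflow.
import torch

scaler = torch.cuda.amp.GradScaler()

for inputs, labels in dataloader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():      # eligible ops run in float16
        loss = loss_fn(model(inputs), labels)
    scaler.scale(loss).backward()        # backward pass on the scaled loss
    scaler.step(optimizer)               # unscale gradients, then update weights
    scaler.update()                      # adjust the scale factor for the next step
```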
Parameter Sharing
Parameter sharing in transformers is a technique used to reduce the number of parameters in a model by reusing the same parameters across multiple layers. This approach can help mitigate the memory and computational overhead of deep transformer models, making them more efficient while maintaining their performance.
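A minimal sketch of cross-layer parameter sharing (in the spirit of ALBERT): a single encoder layer is created once and applied repeatedly, so every "layer" of the stack reuses the same weights. The class name and dimensions below are illustrative:

```python
# Cross-layer parameter sharing: one TransformerEncoderLayer is reused at every depth,
# so the parameter count stays that of a single layer instead of num_layers copies.
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    def __init__(self, d_model=512, nhead=8, num_layers=6):
        super().__init__()
        self.shared_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.num_layers = num_layers

    def forward(self, x):
        for _ in range(self.num_layers):
            x = self.shared_layer(x)  # the same weights are applied at every depth
        return x
```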
Freezing Embeddings
Freezing the embedding layer during fine-tuning of a pre-trained language model is a straightforward yet effective way to improve training efficiency. The embedding layer, which converts input tokens into dense vectors, holds a large share of the model's parameters and consumes significant memory. By freezing this layer, you prevent its weights from being updated during backpropagation. This approach offers several benefits:
- Memory Efficiency: By not updating the embedding layer, you save significant GPU memory. This freed-up memory can be utilized to increase the batch size, which can lead to more stable and faster convergence during training.
- Training Speed: Since the embedding layer's weights are not adjusted, fewer computations are required during each training step. This reduction in computational load speeds up the overall training process.
- Stability: Pre-trained embeddings already capture a lot of useful information about the language. Freezing them ensures that this valuable information is preserved, which can lead to better performance on downstream tasks, especially when the amount of fine-tuning data is limited.
- Simplified Optimisation: With fewer parameters to update, the optimisation process becomes simpler and potentially more effective, as the model can focus on adjusting the more task-specific layers.
By freezing the embedding layer, you leverage the pre-trained knowledge effectively while optimising resource usage, making it a practical strategy for fine-tuning large language models.
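A minimal sketch of freezing the embedding layer of a Hugging Face BERT model before fine-tuning (the model name is illustrative):

```python
# Freeze the embedding layer so its weights are not updated during fine-tuning.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

for param in model.bert.embeddings.parameters():
    param.requires_grad = False  # exclude embeddings from backpropagation updates

# Pass only the remaining trainable parameters to the optimizer.
trainable_params = [p for p in model.parameters() if p.requires_grad]
```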
Gradient Accumulation
Gradient accumulation is a strategy to optimize computational resources, especially when GPU memory is limited. It works by accumulating gradients over several smaller batches of data rather than updating model parameters after each small batch.
- Mini-batch: a small subset of the training data processed in one training step to compute gradients and update parameters. Smaller mini-batches consume less memory but may yield noisier gradient estimates.
- Gradients: the derivatives of the loss function with respect to the model parameters. They guide the parameter updates during training, with the aim of minimising the loss.
By accumulating gradients over multiple mini-batches, gradient accumulation allows for a larger effective batch size without requiring all data to be stored in memory simultaneously. This method balances memory usage with computational efficiency in training deep learning models.
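A minimal sketch of gradient accumulation over four mini-batches, again assuming `model`, `optimizer`, `loss_fn`, and `dataloader` are already defined:

```python
# Gradient accumulation: gradients from several small mini-batches are summed
# before a single optimizer step, emulating a larger effective batch size.
accumulation_steps = 4

optimizer.zero_grad()
for step, (inputs, labels) in enumerate(dataloader):
    loss = loss_fn(model(inputs), labels)
    (loss / accumulation_steps).backward()   # scale so the sum matches one big batch
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                      # update weights once per 4 mini-batches
        optimizer.zero_grad()                 # reset gradients for the next window
```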
Gradient Checkpointing
Gradient checkpointing is a technique that cuts down on memory usage during deep neural network training, at the cost of slightly longer computation times. It trades computational resources to save memory, making it possible to train bigger or wider models and use larger mini-batch sizes.
Here's how it works:
- Backpropagation in chunks: Instead of storing the entire backpropagation graph in memory, gradient checkpointing breaks the computation into smaller segments.
- Recalculation of activations: It saves memory by recalculating intermediate activations during the backward pass, rather than storing them.
Although this approach increases computation due to extra forward passes for each chunk, it significantly reduces memory demands. Despite slowing down training, it enables handling larger batches effectively, which can be advantageous in certain training scenarios.
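For Hugging Face models, gradient checkpointing can be switched on with a single call; the model name below is illustrative:

```python
# Gradient checkpointing: intermediate activations are recomputed during the
# backward pass instead of being stored, trading extra compute for lower memory use.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
model.gradient_checkpointing_enable()
```

For custom PyTorch modules, a similar effect can be achieved by wrapping segments of the forward pass with `torch.utils.checkpoint.checkpoint`.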
Conclusion
Optimisation strategies to speed up transformers are essential for harnessing their full potential in practical applications. By implementing techniques such as fixed-length and dynamic padding, uniform length batching, and numeric precision reduction, we can significantly improve the efficiency and performance of transformer models. Methods like freezing embeddings and gradient accumulation further enhance training speed and resource utilisation.
As transformer models continue to grow in complexity and application scope, these optimisation strategies will play a critical role in making them accessible and usable in real-world scenarios. By reducing computational demands and accelerating training times, these methods enable more efficient use of resources, allowing for the deployment of powerful transformer models in environments with limited computational capabilities. Ultimately, these optimizations contribute to the broader adoption and impact of transformer-based technologies across various domains.