AI Performance Stories 02.24.2024
Hi Everyone,
Two weeks is a long time in the frenzied world of Generative AI. In this edition, though, I want to take a step back to the basics of one of the hottest areas in AI research - LLM inference. The inference market has exploded primarily because of the easy availability of pre-trained foundation models (thanks Meta!), so it is important to understand what's going on in this space. The first article, from Sharada Yeluri, was posted on LinkedIn (https://www.dhirubhai.net/pulse/llm-inference-hwsw-optimizations-sharada-yeluri-wfdyc) and is an excellent overview of how LLM inference works and of the state-of-the-art HW/SW optimizations available. For a more detailed analysis, we dive into "Where do LLMs spend their FLOPS?". I follow that up with commentary from Dylan Patel ("Inference Race To The Bottom - Make It Up On Volume?") and Finbarr Timbers ("The evolution of the LLM API market") on the market for LLM inference. And we end with a website I discovered recently, https://artificialanalysis.ai/, which has done a terrific job of providing an astounding number of inference-related metrics for both closed and open models. Kudos for a job well done!
As always, all articles are accompanied by AI-generated summaries, thanks to our generous friends at Anthropic (claude.ai).
Happy reading!
Here is a detailed summary of the key points in the text:
Introduction
- The article explains the details of large language model (LLM) inference workflow and how it differs from training.
- It covers the optimizations done to make inference efficient and the hardware landscape for inference.
LLM Inferencing
- Inference refers to the process of getting a response from a trained LLM model for a user's query or prompt.
- It is a critical step in deploying LLMs. The trained model undergoes optimizations to reduce memory footprint and computations before deployment.
- Inference serving systems refer to the entire infrastructure and software ecosystem to manage and serve models for inference. Key components include load balancers, inference servers, and accelerator clusters.
Steps to Predict Next Token
- The input sequence is tokenized, then fed to the embedding layer to create embedding vectors.
- Queries, keys, and values are computed for each token through a linear projection of the embeddings.
- The attention output is computed using the queries, keys, and values, and then goes through a prediction layer that assigns a probability to each token in the vocabulary.
- The predicted next token is either the highest-probability token (greedy) or one sampled from that distribution (a minimal code sketch of these steps follows this list).
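To make these steps concrete, here is a minimal single-head sketch in NumPy with random stand-in weights; the toy shapes, the single layer, and the absence of multi-head attention and layer stacking are simplifications for illustration, not the architecture described in the article.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model, seq_len = 100, 64, 8

# Toy "trained" weights (random here; a real model loads these from a checkpoint)
W_emb = rng.standard_normal((vocab, d_model)) * 0.02
W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(3))
W_lm = rng.standard_normal((d_model, vocab)) * 0.02

token_ids = rng.integers(0, vocab, seq_len)          # 1) tokenized input sequence
x = W_emb[token_ids]                                 # 2) embedding lookup -> (seq, d_model)

q, k, v = x @ W_q, x @ W_k, x @ W_v                  # 3) linear projections to Q, K, V
scores = (q @ k.T) / np.sqrt(d_model)                # 4) scaled dot-product attention
scores += np.triu(np.full((seq_len, seq_len), -1e9), 1)  # causal mask: no peeking ahead
attn = np.exp(scores - scores.max(-1, keepdims=True))
attn /= attn.sum(-1, keepdims=True)
context = attn @ v                                   # attention output per position

logits = context[-1] @ W_lm                          # 5) prediction head on the last position
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                 # probability for every vocab token
next_token = int(probs.argmax())                     # greedy pick; could also sample from probs
print("predicted next token id:", next_token)
```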
Steps to Generate Full Sequence
- Prefill phase: First token generated based on user input sequence. Parallel computation possible.
- Feedback loop: Generated token concatenated to input sequence.
- Sequential prediction: Updated sequence fed back to model to generate next token. Key-value caching saves compute cycles.
- Continuation: The above steps repeat in an autoregressive manner until a stopping criterion is met (the loop is sketched in code below).
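And a hedged sketch of the prefill-plus-decode loop, reusing the toy weights (W_emb, W_q, W_k, W_v, W_lm) and d_model from the snippet above; real serving stacks batch requests and manage the cache far more carefully.

```python
def attend(q_last, K, V):
    """Attention of the newest query against all cached keys/values."""
    s = (q_last @ K.T) / np.sqrt(d_model)
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ V

def generate(prompt_ids, max_new_tokens=5, eos_id=0):
    x = W_emb[np.array(prompt_ids)]
    K, V = x @ W_k, x @ W_v                  # prefill: all prompt keys/values in parallel
    q_last = x[-1] @ W_q
    out = list(prompt_ids)
    for _ in range(max_new_tokens):
        ctx = attend(q_last, K, V)
        nxt = int((ctx @ W_lm).argmax())     # greedy next token
        out.append(nxt)
        if nxt == eos_id:                    # stopping criterion
            break
        e = W_emb[nxt]                       # feedback loop: embed only the new token
        K = np.vstack([K, e @ W_k])          # append one row to the key cache
        V = np.vstack([V, e @ W_v])          # ...and one to the value cache
        q_last = e @ W_q
    return out

print(generate([3, 14, 15, 92]))
```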
Optimizing Inference Cost
- Techniques include model architecture optimizations, memory optimizations, throughput optimizations, and hardware optimizations.
Model Architecture Optimizations
- Enhanced attention mechanisms like multi-query attention reduce computation (a quick KV-cache arithmetic sketch follows this list).
- Model distillation transfers knowledge from larger models to smaller, efficient models.
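As a quick illustration of why shrinking the number of key/value heads matters, here is a back-of-the-envelope comparison of KV-cache bytes per token under full multi-head attention versus multi-query attention; the layer and head counts are hypothetical round numbers, not a specific model's published configuration.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # 2x = one key vector plus one value vector per KV head per layer (fp16 by default)
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

n_layers, n_heads, head_dim = 32, 32, 128
print("MHA:", kv_bytes_per_token(n_layers, n_heads, head_dim) / 1e6, "MB/token")  # 32 KV heads
print("MQA:", kv_bytes_per_token(n_layers, 1, head_dim) / 1e6, "MB/token")        # 1 shared KV head
```

With these assumed dimensions the cache shrinks by the full factor of 32; grouped-query attention sits in between, sharing each key/value head across a group of query heads.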
Memory Optimizations
- Quantization and pruning reduce model size (a minimal quantization sketch follows this list).
- Dynamic key-value cache allocation improves memory utilization.
- Storing parameters in server memory and prefetching reduces latency.
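Below is a minimal sketch of symmetric per-tensor int8 weight quantization, just to show where the memory saving comes from; production schemes (per-channel scales, GPTQ/AWQ, activation quantization) are considerably more involved.

```python
import numpy as np

def quantize_int8(w):
    scale = np.abs(w).max() / 127.0                     # map the largest magnitude to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).standard_normal((4096, 4096)).astype(np.float32)
q, s = quantize_int8(w)
print("fp32:", w.nbytes / 2**20, "MiB  int8:", q.nbytes / 2**20, "MiB")  # 64 MiB -> 16 MiB
print("max abs reconstruction error:", float(np.abs(w - dequantize(q, s)).max()))
```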
Throughput Optimizations
- Optimal model partitioning across accelerators.
- Improved batching schemes increase utilization.
- Speculative multi-token predictions hide memory latency (sketched below).
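Here is a simplified, greedy sketch of the speculative-decoding idea: a cheap draft model proposes several tokens, the expensive target model checks them all in one batched pass, and the longest agreeing prefix is kept. The draft_next and target_next_batch callables are hypothetical stand-ins, and the published algorithms verify proposals with rejection sampling over full distributions rather than greedy matching.

```python
from typing import Callable, List

def speculative_step(seq: List[int],
                     draft_next: Callable[[List[int]], int],
                     target_next_batch: Callable[[List[List[int]]], List[int]],
                     k: int = 4) -> List[int]:
    # 1) Draft model proposes k tokens autoregressively (cheap, sequential).
    proposal = list(seq)
    for _ in range(k):
        proposal.append(draft_next(proposal))

    # 2) Target model scores all k positions in a single parallel pass
    #    (one "expensive" call instead of k sequential ones).
    prefixes = [proposal[:len(seq) + i] for i in range(k)]
    target_choices = target_next_batch(prefixes)

    # 3) Accept drafted tokens while they match the target's own choice; on the
    #    first mismatch, take the target's token instead and stop.
    accepted = list(seq)
    for i in range(k):
        drafted = proposal[len(seq) + i]
        if drafted == target_choices[i]:
            accepted.append(drafted)
        else:
            accepted.append(target_choices[i])
            break
    return accepted

# Toy usage: the "draft" always proposes 7; the "target" agrees until sequences get long.
out = speculative_step([1, 2, 3],
                       draft_next=lambda s: 7,
                       target_next_batch=lambda ps: [7 if len(p) < 5 else 9 for p in ps])
print(out)  # [1, 2, 3, 7, 7, 9] -- three tokens gained from one target call
```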
Hardware Optimizations
- Dedicated inference accelerators optimize die area and cost.
- New number formats improve efficiency.
- In-memory compute also being explored.
Inference Metrics
- Time to first token, time per output token, throughput, and tail latency are the key metrics (a minimal measurement sketch follows).
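A minimal sketch of how these metrics can be measured against a streaming endpoint; stream_tokens is a hypothetical generator yielding one token at a time, and tail latency would additionally require aggregating many such measurements across requests.

```python
import time

def measure(stream_tokens):
    t0 = time.perf_counter()
    times = []
    for _ in stream_tokens:
        times.append(time.perf_counter())
    ttft = times[0] - t0                                    # time to first token
    tpot = (times[-1] - times[0]) / max(len(times) - 1, 1)  # avg time per output token
    throughput = len(times) / (times[-1] - t0)              # output tokens per second
    return {"ttft_s": ttft, "tpot_s": tpot, "tokens_per_s": throughput}

# Toy usage with a fake 20-token stream at roughly 50 tokens/s:
def fake_stream(n=20, delay=0.02):
    for i in range(n):
        time.sleep(delay)
        yield i

print(measure(fake_stream()))
```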
Accelerator Memory Requirements
- Model parameters and key-value caches take up significant memory (see the back-of-the-envelope calculation after this list).
- Conservative allocation leads to overprovisioning.
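A back-of-the-envelope estimate of what those two components cost in accelerator memory; the 70B-parameter / 80-layer / 8-KV-head / 128-head-dim figures are hypothetical round numbers chosen only to make the arithmetic concrete.

```python
def serving_memory_gib(n_params_b, n_layers, n_kv_heads, head_dim,
                       batch, context_len, bytes_per_elem=2):
    weights = n_params_b * 1e9 * bytes_per_elem                           # fp16 weights
    kv_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V
    kv_cache = kv_per_token * context_len * batch
    return (weights + kv_cache) / 2**30

# fp16 weights for a 70B model plus KV cache for 32 concurrent 4k-token sequences
print(round(serving_memory_gib(70, 80, 8, 128, batch=32, context_len=4096), 1), "GiB")
```

Even with grouped-query attention, the KV cache alone adds tens of GiB at modest batch sizes, which is why provisioning for worst-case context lengths leads to the overprovisioning noted above.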
Summary
- Many optimizations across hardware, software and architecture to make LLM inference sustainable and economical.
Here is a summary of the key points from the text:
- Theoretical analysis estimates that LLMs spend 25% of compute on QKV projections, 8% on the attention output projection, and 66% on the feedforward networks (FFN); the attention mechanism itself is negligible. (The arithmetic behind this split, and the KV-cache figure below, is sketched after this list.)
- For GPT-3, the KV cache would require 4.72 MB per token generated. Modern architectures like Mistral use techniques to reduce this memory requirement.
- Doubling model depth doubles parameters and FLOPs. Doubling width quadruples parameters and FLOPs. Wider models parallelize better.
- Empirical profiling of Llama2 found 40% of time in attention and 53% in FFN, roughly matching theory. Attention time dominated by QKV projections.
- Increasing model width showed little speedup until 2048+ dimension, likely due to overhead and lack of parallelism. Increasing depth slowed speed linearly.
- Generation time and memory scale linearly with number of tokens, as expected with KV cache. Some overhead found empirically versus theoretical memory estimate.
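Both the 25% / 8% / 66% split and the 4.72 MB-per-token figure fall out of simple per-layer matmul counting. A sketch of that arithmetic, assuming a standard dense decoder with a 4x FFN expansion and full (non-grouped) multi-head KV:

```python
d = 12288                        # hidden size (GPT-3 scale, for illustration)
qkv   = 3 * 2 * d * d            # Q, K, V projections: three d x d matmuls per token
out   = 2 * d * d                # attention output projection
ffn   = 2 * (2 * d * (4 * d))    # FFN up- and down-projections with 4x expansion
total = qkv + out + ffn
print([round(x / total, 3) for x in (qkv, out, ffn)])    # -> [0.25, 0.083, 0.667]

# KV cache per generated token for a GPT-3-like config: 96 layers, d=12288, fp16
print(round(2 * 96 * 12288 * 2 / 1e6, 2), "MB")          # -> 4.72 MB
```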
Here is a summary of the key points from the text:
1. There are now many companies that have developed large language models comparable to or better than GPT-3.5, with some achieving this with very small teams. This shows that developing these models has become commoditized.
2. Firms with unique distribution advantages (like Microsoft), ability to fine-tune models, ensure legal compliance, etc. will still have competitive edges. Those just serving open models will not.
3. There is a "race to the bottom" on inference pricing for these models, with companies subsidizing costs to acquire customers. Mistral's Mixtral model sets pricing just below GPT-3.5, but then many companies announced even lower pricing, likely below cost.
4. Inference costs do not drop as quickly with increased batch size for mixture-of-experts models like Mixtral as they do for dense models, which limits the benefits of scale (a rough arithmetic sketch follows this list).
5. Speculative decoding methods can improve throughput but are less effective on mixture-of-experts models. Quantization can also help but risks quality losses without proper fine-tuning.
6. Upcoming hardware like the H200 and MI300X GPUs will significantly improve inference cost and performance for these large language models.
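A rough illustration of point 4: with a small batch, different tokens route to different experts, so the expert weight bytes streamed per token shrink much more slowly than for a dense model. The uniform, independent top-2-of-8 routing assumed below is a deliberate oversimplification of how Mixtral-style routers behave.

```python
def expected_experts_touched(batch, n_experts=8, top_k=2):
    # Expected number of distinct experts activated by `batch` tokens, assuming
    # (unrealistically) uniform and independent routing decisions.
    p_untouched = (1 - top_k / n_experts) ** batch
    return n_experts * (1 - p_untouched)

for batch in (1, 2, 4, 8, 16, 64):
    touched = expected_experts_touched(batch)
    # A dense model reads one fixed set of FFN weights per forward pass, so its
    # weight bytes per token fall like 1/batch; the MoE ratio below falls more slowly.
    print(f"batch={batch:3d}  ~{touched:.2f} experts touched  "
          f"~{touched / batch:.2f} expert-FFN weight loads per token")
```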
Here are the key points made in the text:
1. The LLM API market began with OpenAI having a monopoly with ChatGPT, but now has multiple competitors like Bard, Claude, and Gemini. This competition erodes profit margins.
2. There is a "ruthless competition for efficiency" as companies aim to serve LLMs at the lowest cost possible through optimizations like quantization and building custom chips.
3. There is a bifurcation occurring between expensive, high-quality models from big labs and cheaper, lower-quality open weight models. The open models are steadily improving in quality and decreasing in cost.
4. Most economically valuable LLM tasks likely do not require the most complex, expensive models. As tooling makes switching between APIs easier, developers will opt for the lowest-cost model that can accomplish their task.
5. Successful consumer LLM companies will start training their own models to reduce dependence on expensive APIs from other providers.
6. The market seems to be converging towards low-cost models for simpler tasks, with only the most complex tasks still requiring the biggest, most expensive models from labs like OpenAI.