Engineering Smarter Conversations: A Resource Guide for Large-Scale Chat LLM

In the fast-paced world of technology, deploying a Chat Large Language Model (LLM) service is a complex task that requires careful planning and resource estimation. With the rise of conversational AI, businesses are increasingly looking to provide real-time, engaging experiences to their users. In this blog post, we'll explore a detailed system design and resource estimation for a Chat LLM service that caters to 100,000 Daily Active Users (DAU), ensuring a seamless experience with sub-second latency.

When designing a system to support a Chat LLM service, several critical factors must be considered to ensure that the service can handle the expected load while meeting performance requirements. Here's how we can approach this challenge:

Estimating Model Size and Memory Requirements

Our Chat LLM, fine-tuned with QLoRA using 4-bit quantization, has 7 billion parameters. At 32-bit floating-point precision, the weights alone would occupy 28 GB (7 billion parameters × 4 bytes). Quantizing to 4 bits reduces that to one-eighth of the original size, resulting in a roughly 3.5 GB model. For each user interaction, with an average conversation length of 500 tokens, we estimate approximately 1 GB of memory per request to account for the model's share, the input, intermediate states, and the output.
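
As a quick sanity check, here is that arithmetic as a small Python sketch (it ignores quantization metadata such as scales and zero points, and any LoRA adapter weights):

```python
# Back-of-the-envelope model-size estimate (weights only; quantization
# metadata and LoRA adapter weights are ignored).
PARAMS = 7e9  # 7 billion parameters

def model_size_gb(bits_per_param: float) -> float:
    """Weight size in GB at a given per-parameter precision."""
    return PARAMS * bits_per_param / 8 / 1e9

fp32_gb = model_size_gb(32)  # ~28 GB at 32-bit floats
int4_gb = model_size_gb(4)   # ~3.5 GB after 4-bit quantization

print(f"FP32: {fp32_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB")
```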

Throughput and Capacity Planning

To maintain a latency of under 1 second, we need to determine the throughput per GPU. If a single GPU can complete 10 requests per second, it can serve roughly 864,000 requests per day. Assuming each user engages in 10 sessions per day, with each session averaging about 10 model requests (roughly 100 requests per user per day), one GPU can serve up to 8,640 users daily.
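
A minimal sketch of that capacity math, with the per-session request count called out explicitly as an assumption:

```python
# Rough daily capacity of one GPU, assuming a steady 10 requests completed
# per second around the clock, 10 sessions per user per day, and an assumed
# average of ~10 model calls (turns) per session.
REQUESTS_PER_SECOND = 10
SECONDS_PER_DAY = 86_400
SESSIONS_PER_USER = 10
REQUESTS_PER_SESSION = 10  # assumption: average turns per session

daily_requests = REQUESTS_PER_SECOND * SECONDS_PER_DAY        # 864,000
requests_per_user = SESSIONS_PER_USER * REQUESTS_PER_SESSION  # 100
users_per_gpu = daily_requests // requests_per_user           # 8,640
print(users_per_gpu)
```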

Peak Load and Batching

Considering that peak concurrency might be around 10% of DAU, we can expect roughly 10,000 concurrent users at peak times. Batching requests lets us raise throughput efficiently: with a batch size of 32, serving all 10,000 peak requests within a second means processing approximately 312.5 batches per second (10,000 ÷ 32).
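
The peak-load arithmetic, spelled out:

```python
# Peak-load batching estimate: assume all peak-concurrent users issue a
# request within the same second and requests are batched 32 at a time.
DAU = 100_000
PEAK_FRACTION = 0.10  # ~10% of DAU online at peak (assumption from the post)
BATCH_SIZE = 32

peak_concurrent = int(DAU * PEAK_FRACTION)         # 10,000 requests
batches_per_second = peak_concurrent / BATCH_SIZE  # 312.5 batches/s
print(peak_concurrent, batches_per_second)
```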

Memory and GPU Estimation

For our QLoRA model, we assume a transient working-memory footprint of roughly 4 KB per token for activations and the KV cache. With 500-token conversations and a batch size of 32, that translates to roughly 64 MB per batch once memory-alignment padding is included. At 312.5 batches per second, the fleet as a whole cycles through about 20 GB of working memory per second at peak. Because this memory is released after each batch and the quantized weights occupy only 3.5 GB, provisioning each GPU with at least 16 GB of memory leaves comfortable headroom.
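
And the working-memory arithmetic, using the rough per-token figure assumed above:

```python
# Transient working memory per batch and per second, using the rough
# 4 KB-per-token assumption (activations + KV cache, padding included).
BYTES_PER_TOKEN = 4_000       # ~4 KB per token (assumed average)
TOKENS_PER_REQUEST = 500
BATCH_SIZE = 32
BATCHES_PER_SECOND = 312.5
NUM_GPUS = 15

batch_mb = BYTES_PER_TOKEN * TOKENS_PER_REQUEST * BATCH_SIZE / 1e6  # ~64 MB
fleet_gb_per_s = batch_mb * BATCHES_PER_SECOND / 1e3                # ~20 GB/s
per_gpu_gb_per_s = fleet_gb_per_s / NUM_GPUS                        # ~1.3 GB/s
print(batch_mb, fleet_gb_per_s, per_gpu_gb_per_s)
```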

Resource Estimation Overview

Each user interaction averages a conversation length of 500 tokens, for which we estimate approximately 1 GB of memory per request. This covers the model's share of memory, the input, intermediate states, and the output generated during the interaction.

Given our performance targets, each GPU can complete 10 requests per second, and with users typically engaging in 10 sessions per day (about 100 model requests per user), a single GPU can cater to roughly 8,640 users daily. To support the entire 100,000-user base with sub-second response times, we initially estimated the need for 12 GPUs (100,000 ÷ 8,640 ≈ 11.6). To account for high availability and maintenance, we add roughly 20% redundancy, bringing the total to 15 GPUs.
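
The GPU-count calculation, including the redundancy margin:

```python
import math

# GPU count: total DAU divided by the per-GPU daily capacity, plus a
# 20% redundancy margin for high availability and maintenance.
DAU = 100_000
USERS_PER_GPU = 8_640
REDUNDANCY = 0.20

base_gpus = math.ceil(DAU / USERS_PER_GPU)            # 12
total_gpus = math.ceil(base_gpus * (1 + REDUNDANCY))  # 15
print(base_gpus, total_gpus)
```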

Now, let's address the memory requirements. The quantized model size of 3.5 GB keeps the per-GPU memory allocation modest. Taking into account the memory needed for processing batched requests and system overhead, we recommend equipping each GPU with at least 16 GB of memory.

Our system design will thus consist of containerized model servers, each powered by a high-end GPU with 16 GB of memory. This setup is sufficient to manage the load and provides room for scaling up as demand increases. A shared storage solution will store conversation history and other pertinent data, ensuring that all servers have access to the information they need. A load balancer will play a crucial role in evenly distributing user requests across the available GPUs, optimizing resource utilization and maintaining the health of the system.
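
To make the load-balancing idea concrete, here is a deliberately simplified round-robin dispatcher; the endpoint names are placeholders, and a real deployment would sit behind a production load balancer and an inference server rather than hand-rolled routing:

```python
import itertools

# Illustrative only: round-robin dispatch across containerized model-server
# endpoints, each backed by one 16 GB GPU. Endpoint names are placeholders.
class RoundRobinBalancer:
    def __init__(self, endpoints):
        self._cycle = itertools.cycle(endpoints)

    def route(self, request_payload):
        endpoint = next(self._cycle)
        # In practice this would be an HTTP/gRPC call to the model server.
        return endpoint, request_payload

balancer = RoundRobinBalancer([f"gpu-server-{i}:8000" for i in range(15)])
endpoint, _ = balancer.route({"prompt": "Hello!", "max_tokens": 500})
print(endpoint)
```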

In conclusion, our revised system design for the Chat LLM service will require around 15 GPUs, each outfitted with 16 GB of memory. This configuration strikes a balance between meeting current demands and allowing for future expansion. As we deploy our service, we will continuously monitor its performance and make necessary adjustments to resources, ensuring that we provide a responsive, efficient, and cost-effective solution for real-time conversational AI.

Conclusion and Next Steps

This estimation serves as a starting point for deploying a Chat LLM service. It's essential to monitor actual load and performance to adjust resources accordingly. Exploring dynamic batching and considering accelerators like TPUs could further optimize efficiency. Remember to include buffer and safety margins in your resource allocation to handle traffic spikes, and consider expanding the system to multiple data centers for enhanced availability and fault tolerance.
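
As a flavor of what dynamic batching looks like, here is a minimal, illustrative sketch that groups incoming requests until either the batch fills or a small time budget expires; the parameter values are placeholders, not tuned recommendations:

```python
import queue
import time

# Minimal dynamic-batching sketch: collect requests until the batch is full
# or a small waiting budget expires, then hand the batch to the model.
def collect_batch(request_queue: "queue.Queue", max_batch: int = 32,
                  max_wait_s: float = 0.05) -> list:
    batch = []
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break
    return batch
```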

By following this guide, you'll be well on your way to estimating the resources needed for your Chat LLM service and designing a scalable and efficient architecture that meets the demands of today's conversational AI landscape.
