How do you optimize latency for Conversational AI?
For most applications, latency is a mild concern. However, for conversational AI, latency is what separates good applications from great ones.
For starters, the goal of conversational AI is fairly aspirational—to provide the same feeling, touch, and voice as a human conversation, while surpassing a human in intelligence. To accomplish this, an application must converse without long silent gaps. Otherwise, the realism is shattered.
Conversational AI’s latency challenge is compounded by its piecemeal nature. Conversational AI is a series of intermediate processes, all considered state-of-the-art in their respective fields. Each of these processes incurs additive latency.
As a generative voice company, we’ve spent a long time studying how to minimize latency for conversational AI. Today, we want to share our learnings, in the hope that they’ll be helpful to anyone building conversational AI applications.
The Four Core Components
Every conversational AI application involves at least four steps: speech-to-text, turn-taking, text processing (i.e. LLMs), and text-to-speech. While these steps are executed in parallel, each one still contributes latency.
Notably, conversational AI’s latency equation is unique. Many latency problems reduce to a single bottleneck. For instance, when a website makes a database request, the network latency between the client and the server drives the total latency, with only trivial contributions from the backend’s VPC latency. Conversational AI’s components, however, aren’t so lopsided. Their contributions are uneven, but each is within an order of magnitude of the others. Accordingly, latency is driven by the sum of the parts.
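To make the sum-of-parts point concrete, here is a minimal, illustrative latency budget in Python; the component names and millisecond figures are placeholders rather than measurements:

```python
# Illustrative latency budget: conversational AI latency is a sum of parts,
# not a single dominant bottleneck. All numbers below are placeholders.
COMPONENT_LATENCY_MS = {
    "asr_finalization": 100,   # finalizing the transcript after the user stops speaking
    "turn_taking_vad": 250,    # silence threshold before declaring the end of a turn
    "llm_first_token": 350,    # time for the language model to emit its first token
    "tts_first_byte": 135,     # time to the first byte of synthesized audio
    "network": 100,            # round trips between the client and the servers
}

def total_latency_ms(budget: dict[str, int]) -> int:
    """The perceived response delay is roughly the sum of every component."""
    return sum(budget.values())

if __name__ == "__main__":
    print(f"Estimated response delay: {total_latency_ms(COMPONENT_LATENCY_MS)} ms")
```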
Automatic Speech Recognition
The System’s “Ear”
Automatic speech recognition (ASR)—sometimes referred to as speech-to-text (STT)—is the process of converting spoken audio into written text.
ASR’s latency is not the time it takes to generate text, as the speech-to-text process runs in the background while the user speaks. Instead, the latency is the time between the end of the speech and the end of the text generation.
Accordingly, short and long speaking intervals can incur similar ASR latency. Latency also varies across ASR implementations (in some cases there is no network latency at all, because the model is embedded in the browser, as in Chrome/Chromium). The standard open-source model, Whisper, adds 300ms+ of latency. Our custom implementation adds <100ms.
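As a sketch of how this latency is measured in practice, the timer starts when the user stops speaking, not when they start; `asr_client` below is a hypothetical streaming speech-to-text client, not a specific API:

```python
import time

def measure_asr_latency(asr_client, audio_frames) -> float:
    """Return the ms between the end of speech and the final transcript."""
    for frame in audio_frames:
        asr_client.send(frame)          # transcription runs while the user speaks
    speech_end = time.monotonic()       # the user has just finished speaking
    asr_client.final_transcript()       # block until the last partial result settles
    return (time.monotonic() - speech_end) * 1000
```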
Turn-Taking / Interruption
The System’s “Instinct”
Turn-taking / Interruption (TTI) is an intermediary process that determines when a user is finished speaking. The underlying model is known as a Voice Activity Detector (or VAD).
Turn-taking involves a complex set of rules. A short burst of speech (e.g. “uh-huh”) shouldn’t trigger a turn; otherwise, conversations would feel too staccato. Instead, the model must assess when a user is actually trying to get its attention, and determine when the user has finished relaying their thoughts.
A good VAD will not signal a new turn whenever it detects silence. There is silence between words (and phrases), and the model needs to be confident that the user is actually done speaking. To do this reliably, it waits for a threshold of silence (or, more precisely, a sustained lack of speech) to be crossed. That wait introduces a delay, contributing to the overall latency experienced by the user.
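The silence-threshold logic can be sketched in a few lines; `vad.is_speech` stands in for any frame-level voice activity model, and the 700ms threshold is an illustrative value, not a recommendation:

```python
SILENCE_THRESHOLD_MS = 700   # assumed threshold; tune per application
FRAME_MS = 20                # typical VAD frame size

def detect_end_of_turn(vad, frames) -> bool:
    """Declare the turn over only after enough consecutive non-speech frames."""
    silence_ms = 0
    for frame in frames:
        if vad.is_speech(frame):
            silence_ms = 0                   # any speech resets the counter
        else:
            silence_ms += FRAME_MS
            if silence_ms >= SILENCE_THRESHOLD_MS:
                return True                  # the user is likely done speaking
    return False
```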
Technically speaking, if all of the other conversational AI components incurred zero latency, the latency attributed to TTI would be a good thing. Humans take a beat before responding to speech. A machine taking a similar pause gives realism to the interaction. However, since other components of conversational AI already incur latency, minimal TTI latency is ideal.
Text Processing
The System’s “Brain”
Next, the system needs to generate a response. Today, this is typically accomplished with a Large Language Model (LLM), such as GPT-4 or Gemini 1.5 Flash.
The choice of language model makes a significant difference. Models like Gemini 1.5 Flash are incredibly fast, generating output in less than 350ms. More robust models that can handle complex queries, such as the GPT-4 variants and Claude, can take between 700ms and 1,000ms. Choosing the right model is typically the easiest lever when optimizing a conversational AI pipeline for latency.
However, the LLM’s latency is the time it takes to start generating tokens. Those tokens can be streamed immediately into the text-to-speech process. Because text-to-speech is paced by the natural speed of a human voice, the LLM reliably outpaces it; what matters most is the time to first token (the LLM equivalent of time to first byte).
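A hedged sketch of how that first-token latency might be measured; `stream_completion` is a hypothetical streaming LLM client that yields tokens as they are generated:

```python
import time

def time_to_first_token_ms(stream_completion, prompt: str) -> float:
    """Measure how long the LLM takes to emit its first token."""
    start = time.monotonic()
    for _token in stream_completion(prompt):   # tokens are forwarded to TTS as they arrive
        return (time.monotonic() - start) * 1000
    return float("inf")                         # the stream produced no tokens
```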
There are other contributors to an LLM’s latency beyond the choice of model, including the prompt length and the size of the knowledge base. The larger either one is, the longer the latency. It boils down to a simple principle: the more the LLM needs to consider, the longer it takes. Accordingly, companies need to strike a balance, providing a healthy amount of context without overburdening the model.
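One way to keep that balance is to cap the conversation history at a token budget; this is a minimal sketch where `count_tokens` is a hypothetical tokenizer helper and the budget is illustrative:

```python
MAX_CONTEXT_TOKENS = 2000  # assumed budget; depends on the model and the use case

def trim_history(turns: list[str], count_tokens) -> list[str]:
    """Keep only the most recent turns that fit within the token budget."""
    kept, used = [], 0
    for turn in reversed(turns):               # walk from the newest turn backwards
        cost = count_tokens(turn)
        if used + cost > MAX_CONTEXT_TOKENS:
            break
        kept.append(turn)
        used += cost
    return list(reversed(kept))                # restore chronological order
```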
Text to Speech
The System’s “Mouth”
The final component of conversational AI is text-to-speech (TTS). Text-to-speech’s net latency is the time it takes to begin speaking after receiving input tokens from text-processing. That’s it—because additional tokens are made available at a rate faster than human speech, text-to-speech’s latency is strictly the time to first byte.
Previously, text-to-speech was particularly slow, taking as long as 2-3s to generate speech. However, state-of-the-art models like our Turbo engine are able to generate speech with just 300ms of latency, and the new Flash TTS engine is even faster. Flash has a model time of 75ms and can achieve an end-to-end time to first audio byte of 135ms, the best score in the field (we have to brag a little!).
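Putting the last two stages together, the pipelining described above can be sketched as follows; `stream_completion` and `tts.stream` are hypothetical streaming clients, not a specific SDK:

```python
import time

def time_to_first_audio_ms(stream_completion, tts, prompt: str) -> float:
    """Measure the delay until the first chunk of audio, with LLM tokens
    streamed directly into the TTS engine."""
    start = time.monotonic()
    audio_chunks = tts.stream(stream_completion(prompt))  # tokens in, audio chunks out
    for _chunk in audio_chunks:
        return (time.monotonic() - start) * 1000          # first audible audio
    return float("inf")
```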
Additional Contributors
Network Latency
There will always be latency associated with sending data from one location to another. Ideally, the ASR, TTI, LLM, and TTS processes are co-located, so the only source of non-trivial network latency is the path between the speaker and the system. This gives us a latency advantage: because we have our own TTS and an internal transcription solution, we can save two server calls.
Function Calling
A lot of conversational AI applications exist to invoke functions (i.e. interface with tools and services). For instance, I might verbally ask the AI to check the weather. This requires additional API calls invoked at the text-processing layer, which can add significantly more latency depending on what the task requires.
For example, if I need to order a pizza verbally, there might be multiple API calls that are necessary, some with excessive lag (e.g. processing a credit card).
However, a conversational AI system can mask the delays associated with function calling by prompting the LLM to respond to the user before the function call is finished (e.g. “Let me check the weather for you”). This mirrors a real-life conversation and keeps the user engaged.
These async patterns are typically accomplished by leveraging webhooks to avoid long-running requests.
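A minimal sketch of this pattern, assuming async helpers `speak` and `fetch_weather` that wrap the TTS output and the external API call respectively:

```python
import asyncio

async def answer_weather_question(speak, fetch_weather, city: str) -> None:
    """Acknowledge the user immediately while the slow tool call runs."""
    weather_task = asyncio.create_task(fetch_weather(city))   # start the slow call
    await speak("Let me check the weather for you.")           # keep the user engaged
    weather = await weather_task                               # result arrives meanwhile
    await speak(f"It's currently {weather} in {city}.")
```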
Telephony
Another common feature of conversational AI platforms is letting the user dial in via phone (or, in some cases, having the system place a phone call on behalf of the user). Telephony incurs additional latency, and that latency can be heavily geography-dependent.
As a baseline, telephony adds roughly 200ms of latency when the call stays within the same region. For global calls (e.g. Asia → USA), the travel time increases significantly, with latency hitting ~500ms. This pattern is common when users have phone numbers outside of the region they’re located in, forcing a hop through their home country’s phone network.
A Closing Thought
We hope this round-trip exploration of conversational AI was interesting. In summary, applications should target a sub-second latency. This can typically be accomplished by choosing the right LLM for the task. They should also interface with the user whenever more complex processes are running in the background to prevent long pauses.
At the end of the day, the goal is to create realism. A user needs to feel the ease of talking to a human while getting the benefits of a computer program. By tightening the sub-processes, this is now possible.
At ElevenLabs, we are optimizing every piece of a conversational AI system with our state-of-the-art STT and TTS models. By working on each part of the process, we can achieve seamless conversation flows. This top-down view of orchestration lets us shave off latency, even a single millisecond, at every juncture.