LLM Routing: The Bottleneck Is Compute, Not the WAN
Introduction
I've been spinning up virtual compute in different countries around the world to indulge my curiosity in intersecting areas such as LLMs, observability, and marketing. As things stand today, a complex LLM task can take 10 seconds or more to return a complete response, which is far greater than the propagation delay through an IP wide area network between any of the parts of the world I have tested, and, I suspect, between any two points on the planet. What is the consequence? The relative penalty for using an LLM in another region, country, or continent is small, and in some cases the distant LLM may be more accurate and faster, saving time. This latency arbitrage may diminish over time as LLMs become faster, but for now, using LLMs in another region should not be ruled out as part of an LLM strategy.
Architecture
Each site of the testbed runs as a virtual compute asset in a cloud provider. Early testing has focused on Northern California, Virginia, Sydney, Cape Town, and Frankfurt, chosen with no particular logic in mind. Further sites are planned for London, the Middle East, and Asia.
The software stack consists of Linux running on two vCPUs with 20 GB of memory (not needed currently, but possibly required for future test cases), plus active-active proxy servers that forward requests to a service process that handles them. Each of the three processes is written in Python and uses FastAPI for the messaging.
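As a concrete illustration of the proxy-to-service hop, here is a minimal sketch of what one of the proxy processes might look like. The endpoint name, port, and payload fields are my own assumptions for illustration, not the testbed's actual code.

```python
# Minimal sketch of a proxy process, assuming FastAPI and httpx.
# Endpoint name, service port, and payload fields are illustrative.
import httpx
from fastapi import FastAPI
from pydantic import BaseModel

proxy = FastAPI()
SERVICE_URL = "http://127.0.0.1:9000/task"  # local service process (assumed port)

class TaskRequest(BaseModel):
    prompt: str
    target_region: str  # region whose LLM endpoint should fulfill the task

@proxy.post("/task")
async def forward(req: TaskRequest):
    # The proxy does no real work; it relays the request to the
    # service process and returns whatever the service answers.
    async with httpx.AsyncClient(timeout=60.0) as client:
        resp = await client.post(SERVICE_URL, json=req.model_dump())
    return resp.json()
```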
From my laptop at home, I send requests to the proxy servers at each site requesting an LLM task to be completed. The requests are passed from the proxy servers to a service process. The service process records the time from the LLM invocation to full completion, separating the time from the overall request time. The request specifies which region the LLM request will be sent to, including remote areas. Requests are usually sent straight from a site to an LLM, but a request can also be sent from one site to another, with the other site fulfilling the request.
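The timing split described above could be captured along these lines: the service process times only the LLM call and returns that figure with the completion, while the laptop measures the overall RTT and derives the non-LLM share. The `call_llm()` helper and the field names are placeholders, not the testbed's real code.

```python
# Sketch of the timing split. call_llm() stands in for whatever provider
# SDK the site uses; all names and fields here are assumptions.
import time
import requests
from fastapi import FastAPI
from pydantic import BaseModel

service = FastAPI()

class TaskRequest(BaseModel):
    prompt: str
    target_region: str

def call_llm(prompt: str, region: str) -> str:
    # Placeholder for the provider SDK call made in `region`,
    # returning the full (non-streamed) completion.
    raise NotImplementedError

@service.post("/task")
def handle(req: TaskRequest):
    start = time.perf_counter()
    completion = call_llm(req.prompt, req.target_region)
    llm_seconds = time.perf_counter() - start  # LLM invocation to full completion
    return {"completion": completion, "llm_seconds": llm_seconds}

# --- on the laptop ---
def measure(site_url: str, prompt: str, region: str) -> None:
    start = time.perf_counter()
    body = requests.post(f"{site_url}/task",
                         json={"prompt": prompt, "target_region": region},
                         timeout=120).json()
    rtt = time.perf_counter() - start
    print(f"total RTT {rtt:.2f}s, LLM {body['llm_seconds']:.2f}s "
          f"({100 * body['llm_seconds'] / rtt:.0f}% of RTT)")
```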
Results
To get warmed up, we start by looking at some low-level timings. An ICMP ping from my couch to the Northern California site takes about 10ms. Using curl to a simple web server takes about 40ms, and a response from a Python process using FastAPI takes about 80ms. Ordinarily, these numbers would have me reaching for some process optimizations, perhaps with Golang, but they soon fade into insignificance in the context of this blog.
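For the HTTP baselines, a few lines of Python from the laptop are enough to gather comparable numbers; the URLs below are placeholders, not the testbed's real addresses.

```python
# Rough sketch of gathering the HTTP baselines from the laptop:
# time plain GETs against the simple web server and the FastAPI endpoint.
import time
import requests

def time_get(url: str, n: int = 10) -> float:
    """Return the median wall-clock time, in ms, of n GET requests."""
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        requests.get(url, timeout=10)
        samples.append((time.perf_counter() - start) * 1000)
    samples.sort()
    return samples[len(samples) // 2]

print("plain web server:", time_get("http://norcal.example:80/"), "ms")
print("FastAPI endpoint:", time_get("http://norcal.example:8000/health"), "ms")
```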
So, let's call some LLMs with a very simple "Hi." A little wrinkle: the Northern California site does not offer the LLM service I am using, so it is off to Oregon!
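The post does not name the managed LLM service, but as one concrete possibility, here is what the timed "Hi" call could look like if the provider were AWS Bedrock reached via boto3; the region and model ID are assumptions for illustration.

```python
# One possible shape of the "Hi" test, assuming AWS Bedrock via boto3.
# The region and model ID are illustrative, not confirmed by the post.
import time
import boto3

def timed_hi(region: str, model_id: str) -> None:
    client = boto3.client("bedrock-runtime", region_name=region)
    start = time.perf_counter()
    response = client.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": "Hi"}]}],
    )
    elapsed = time.perf_counter() - start
    text = response["output"]["message"]["content"][0]["text"]
    print(f"{region} / {model_id}: {elapsed:.2f}s -> {text!r}")

# Northern California lacks the service, so the call goes to Oregon.
timed_hi("us-west-2", "anthropic.claude-3-5-sonnet-20240620-v1:0")
```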
The RTT quickly ramps up, with the LLM processing generally accounting for 80% to 90% of the total RTT. In this test, I used two of the most capable current models because they are more accurate for some tasks, and Anthropic's Sonnet 3.5 has impressed many. (Note: I use various models daily, including Meta's Llama 3, Google's Gemma, and more; this is just a testing example.)
Let's throw something more substantial at the LLMs: a well-structured blog on the drivers that led to the emergence of Calculus.
Now, the times ramp dramatically to 11-16 seconds (seconds, not milliseconds) for a completed response (streaming was not used in this testing). We are now at a point where any communications/propagation delay is completely irrelevant compared to the LLM processing time, which is around 99% of the RTT.
Let's switch our attention to a distant land, Sydney, Australia. Again, we start with something simple.
In this test, the local LLM time of roughly 600ms is substantially less than the total RTT to Oregon. Going to Oregon would not be unbearable for the user and may only be marginally perceptible, but clearly there is a performance difference. OK, let's ask the LLMs to do something more complicated.
Now the local LLM advantage goes away on two fronts. Firstly, Anthropic's Sonnet 3.5 is not available in Sydney (for the provider I am using), and secondly, we are now in the range of 12-17 seconds, so either way there is a hit to customer experience. However, the total RTT for going to Oregon to use Sonnet 3.5 is consistently less than the local LLM processing time for Mistral 2402. If you operate in Sydney and use this provider, whether you want to shave a few seconds off or you simply prefer Sonnet 3.5, there is a reward for asking the Oregon region to fulfill complex LLM tasks (such as a one-page blog on calculus).
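To make the routing idea concrete, here is a minimal sketch of the kind of rule a site could apply when deciding whether to fulfill a task locally or hand it to a distant region. The model names, availability sets, and timing figures below are placeholders, not measurements from the testbed.

```python
# Illustrative routing rule: go remote when the preferred model is not
# offered locally, or when the remote total (processing + extra RTT) is
# expected to beat the local processing time. All figures are placeholders.
from dataclasses import dataclass

@dataclass
class RegionProfile:
    name: str
    models: set[str]            # models offered in this region
    llm_seconds: float          # typical processing time for a complex task
    extra_rtt_seconds: float    # added network RTT from the local site

def pick_region(preferred_model: str, local: RegionProfile, remote: RegionProfile) -> str:
    if preferred_model not in local.models:
        return remote.name
    remote_total = remote.llm_seconds + remote.extra_rtt_seconds
    return remote.name if remote_total < local.llm_seconds else local.name

sydney = RegionProfile("ap-southeast-2", {"mistral-large-2402"},
                       llm_seconds=15.0, extra_rtt_seconds=0.0)
oregon = RegionProfile("us-west-2", {"claude-3-5-sonnet", "mistral-large-2402"},
                       llm_seconds=12.0, extra_rtt_seconds=0.2)

print(pick_region("claude-3-5-sonnet", local=sydney, remote=oregon))  # -> us-west-2
```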
Conclusion
There are many considerations in using LLMs: financial cost, time cost, accuracy, specific training, etc. However, today, the LLM processing time can be so extensive that the distance penalty is not always relevant, and in some scenarios, traveling the distance may have benefits.
LLM processing time may decline in the future, and the latency arbitrage may go away. Perhaps every region worldwide will also have access to the same models, which is not the case today (in cloud-managed scenarios). However, those responsible for LLM strategy today may want to consider where best to route different LLM requests. In addition, regardless of where you are in the world, distance need not put you at a disadvantage.