LLM Routing: The Bottleneck Is Compute, Not the WAN
Introduction
I've been spinning up virtual compute in different countries around the world to indulge my curiosity in intersecting areas such as LLMs, observability, and marketing. As things stand today, a complex LLM task can take 10 seconds or more to return a complete response, which is far greater than the propagation delay through an IP wide area network between any of the parts of the world I have tested, and, I suspect, between any two points on the planet. What is the consequence? The relative penalty for using an LLM in another region, country, or continent is small, and in some cases the distant LLM may be more accurate and faster, saving time. This latency arbitrage may diminish over time as LLMs become faster, but for now, using LLMs in another region should not be ruled out as part of an LLM strategy.
Architecture
Each site of the testbed runs as a virtual compute asset in a cloud provider. Early testing has focused on Northern California, Virginia, Sydney, Cape Town, and Frankfurt, chosen with no particular logic in mind. Further sites are planned for London, the Middle East, and Asia.
The software stack consists of Linux running on two vCPUs with 20 GB of memory (not needed currently, but possibly required for future test cases), plus active-active proxy servers that forward requests to a service process that handles them. Each of the three processes is written in Python and uses FastAPI for the messaging.
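As a concrete illustration of the proxy-to-service hop, here is a minimal sketch of what one of the proxy processes might look like. The endpoint name, port, and payload fields are my own assumptions for illustration, not the testbed's actual code.

```python
# Minimal sketch of a proxy process, assuming FastAPI and httpx.
# Endpoint name, service port, and payload fields are illustrative.
import httpx
from fastapi import FastAPI
from pydantic import BaseModel

proxy = FastAPI()
SERVICE_URL = "http://127.0.0.1:9000/task"  # local service process (assumed port)

class TaskRequest(BaseModel):
    prompt: str
    target_region: str  # region whose LLM endpoint should fulfill the task

@proxy.post("/task")
async def forward(req: TaskRequest):
    # The proxy does no real work; it relays the request to the
    # service process and returns whatever the service answers.
    async with httpx.AsyncClient(timeout=60.0) as client:
        resp = await client.post(SERVICE_URL, json=req.model_dump())
    return resp.json()
```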
From my laptop at home, I send requests to the proxy servers at each site requesting an LLM task to be completed. The requests are passed from the proxy servers to a service process. The service process records the time from the LLM invocation to full completion, separating the time from the overall request time. The request specifies which region the LLM request will be sent to, including remote areas. Requests are usually sent straight from a site to an LLM, but a request can also be sent from one site to another, with the other site fulfilling the request.
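The timing split described above could be captured along these lines: the service process times only the LLM call and returns that figure with the completion, while the laptop measures the overall RTT and derives the non-LLM share. The `call_llm()` helper and the field names are placeholders, not the testbed's real code.

```python
# Sketch of the timing split. call_llm() stands in for whatever provider
# SDK the site uses; all names and fields here are assumptions.
import time
import requests
from fastapi import FastAPI
from pydantic import BaseModel

service = FastAPI()

class TaskRequest(BaseModel):
    prompt: str
    target_region: str

def call_llm(prompt: str, region: str) -> str:
    # Placeholder for the provider SDK call made in `region`,
    # returning the full (non-streamed) completion.
    raise NotImplementedError

@service.post("/task")
def handle(req: TaskRequest):
    start = time.perf_counter()
    completion = call_llm(req.prompt, req.target_region)
    llm_seconds = time.perf_counter() - start  # LLM invocation to full completion
    return {"completion": completion, "llm_seconds": llm_seconds}

# --- on the laptop ---
def measure(site_url: str, prompt: str, region: str) -> None:
    start = time.perf_counter()
    body = requests.post(f"{site_url}/task",
                         json={"prompt": prompt, "target_region": region},
                         timeout=120).json()
    rtt = time.perf_counter() - start
    print(f"total RTT {rtt:.2f}s, LLM {body['llm_seconds']:.2f}s "
          f"({100 * body['llm_seconds'] / rtt:.0f}% of RTT)")
```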
Results
To get warmed up, we start by looking at some low-level timings. An ICMP ping from my couch to the Northern California site takes about 10ms. Using curl to a simple web server takes about 40ms, and a response from a Python process using FastAPI takes about 80ms. Ordinarily, these numbers would have me reaching for some process optimizations, perhaps with Golang, but they soon fade into insignificance in the context of this blog.
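For the HTTP baselines, a few lines of Python from the laptop are enough to gather comparable numbers; the URLs below are placeholders, not the testbed's real addresses.

```python
# Rough sketch of gathering the HTTP baselines from the laptop:
# time plain GETs against the simple web server and the FastAPI endpoint.
import time
import requests

def time_get(url: str, n: int = 10) -> float:
    """Return the median wall-clock time, in ms, of n GET requests."""
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        requests.get(url, timeout=10)
        samples.append((time.perf_counter() - start) * 1000)
    samples.sort()
    return samples[len(samples) // 2]

print("plain web server:", time_get("http://norcal.example:80/"), "ms")
print("FastAPI endpoint:", time_get("http://norcal.example:8000/health"), "ms")
```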
So, let's call some LLMs with a very simple "Hi." A little wrinkle: the Northern California site does not offer the LLM service I am using, so it is off to Oregon!
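The post does not name the managed LLM service, but as one concrete possibility, here is what the timed "Hi" call could look like if the provider were AWS Bedrock reached via boto3; the region and model ID are assumptions for illustration.

```python
# One possible shape of the "Hi" test, assuming AWS Bedrock via boto3.
# The region and model ID are illustrative, not confirmed by the post.
import time
import boto3

def timed_hi(region: str, model_id: str) -> None:
    client = boto3.client("bedrock-runtime", region_name=region)
    start = time.perf_counter()
    response = client.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": "Hi"}]}],
    )
    elapsed = time.perf_counter() - start
    text = response["output"]["message"]["content"][0]["text"]
    print(f"{region} / {model_id}: {elapsed:.2f}s -> {text!r}")

# Northern California lacks the service, so the call goes to Oregon.
timed_hi("us-west-2", "anthropic.claude-3-5-sonnet-20240620-v1:0")
```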
The RTT quickly ramps up, with the LLM processing generally accounting for 80% to 90% of the total RTT. In this test, I used two of the most capable current models because they are more accurate for some tasks, and Anthropic's Sonnet 3.5 has impressed many. (Note: I use various models daily, including Meta's Llama 3, Google's Gemma, and more; this is just a testing example.)
Let's throw something more substantial at the LLMs: a well-structured blog on the drivers that led to the emergence of Calculus.
Now, the times ramp dramatically to 11-16 seconds (seconds, not milliseconds) for a completed response (streaming was not used in this testing). We are now at a point where any communications/propagation delay is completely irrelevant compared to the LLM processing time, which is around 99% of the RTT.
Let's switch our attention to a distant land, Sydney, Australia. Again, we start with something simple.
In this test, the local LLM time of roughly 600ms is substantially less than the total RTT to Oregon. Going to Oregon would not be unbearable for the user and may only be marginally perceptible, but clearly there is a performance difference. OK, let's ask the LLMs to do something more complicated.
Now the local LLM advantage goes away on two fronts. Firstly, Anthropic's Sonnet 3.5 is not available in Sydney (for the provider I am using), and secondly, we are now in the range of 12-17 seconds, so either way there is a hit to customer experience. However, the total RTT for going to Oregon to use Sonnet 3.5 is consistently less than the local LLM processing time for Mistral 2402. If you operate in Sydney and use this provider, whether you want to shave a few seconds off or you simply prefer Sonnet 3.5, there is a reward for asking the Oregon region to fulfill complex LLM tasks (such as a one-page blog on calculus).
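To make the routing idea concrete, here is a minimal sketch of the kind of rule a site could apply when deciding whether to fulfill a task locally or hand it to a distant region. The model names, availability sets, and timing figures below are placeholders, not measurements from the testbed.

```python
# Illustrative routing rule: go remote when the preferred model is not
# offered locally, or when the remote total (processing + extra RTT) is
# expected to beat the local processing time. All figures are placeholders.
from dataclasses import dataclass

@dataclass
class RegionProfile:
    name: str
    models: set[str]            # models offered in this region
    llm_seconds: float          # typical processing time for a complex task
    extra_rtt_seconds: float    # added network RTT from the local site

def pick_region(preferred_model: str, local: RegionProfile, remote: RegionProfile) -> str:
    if preferred_model not in local.models:
        return remote.name
    remote_total = remote.llm_seconds + remote.extra_rtt_seconds
    return remote.name if remote_total < local.llm_seconds else local.name

sydney = RegionProfile("ap-southeast-2", {"mistral-large-2402"},
                       llm_seconds=15.0, extra_rtt_seconds=0.0)
oregon = RegionProfile("us-west-2", {"claude-3-5-sonnet", "mistral-large-2402"},
                       llm_seconds=12.0, extra_rtt_seconds=0.2)

print(pick_region("claude-3-5-sonnet", local=sydney, remote=oregon))  # -> us-west-2
```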
Conclusion
There are many considerations in using LLMs: financial cost, time cost, accuracy, specific training, etc. However, today, the LLM processing time can be so extensive that the distance penalty is not always relevant, and in some scenarios, traveling the distance may have benefits.
LLM processing time may decline in the future, and the latency arbitrage may go away. Perhaps every region worldwide will also have access to the same models, which is not the case today (in cloud-managed scenarios). However, those responsible for LLM strategy today may want to consider where best to route different LLM requests. In addition, regardless of where you are in the world, distance need not put you at a disadvantage.