Using round robin costs you money!

A lot of load balancers use round robin (RR) as their default policy, and most users leave it that way. Many teams are focused on hitting the right results and SLOs; but if they are not also tuning their load balancers, they are very likely losing money.

This typically can happen in two ways:

  • You never notice that your response times are much worse than they should be, and so you offer SLAs that are much weaker than they could be.
  • You do what most people do when faced with slow performance: you scale out your infrastructure.

There is, of course, a third and better option: take a good hard look at what your load balancer is doing. It does not matter which environment you are on (Google, IBM, Amazon, or in-house), it is highly likely that you are using a load balancer of some kind, and just as likely that it defaults to round robin.

But according to my tests, albeit on a simulator, round robin is pretty bad. Here is a glimpse:

[Figure: JSQ vs RR response times with 15 servers. RR's spikes will result in really bad SLAs.]

This experiment was run with 15 simulated servers, all with the same service rate, responding to jobs whose sizes are drawn from a Pareto distribution. The exact meaning of the axes is not important; the diagram simply gives a sense of how much time jobs spend in the system at discrete time steps. As you can see, under round robin the response times are very bad, with many large spikes.
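The setup above can be sketched in a few lines of Python. To be clear, this is my own minimal reconstruction, not the author's simulator: the arrival rate, Pareto shape, and seed are illustrative, and I approximate JSQ by routing each job to the server with the least unfinished work, which behaves similarly when all servers have the same service rate.

```python
import random

def simulate(policy, n_servers=15, n_jobs=5000, arrival_rate=5.0, seed=42):
    """Toy model: Poisson arrivals, Pareto job sizes, one FIFO queue
    per server. Returns the response time of every job."""
    rng = random.Random(seed)
    free_at = [0.0] * n_servers      # when each server next becomes idle
    rr_next = 0                      # round-robin pointer
    t = 0.0
    responses = []
    for _ in range(n_jobs):
        t += rng.expovariate(arrival_rate)       # next arrival time
        size = rng.paretovariate(2.0)            # heavy-tailed job size
        if policy == "rr":                       # blind rotation
            s = rr_next
            rr_next = (rr_next + 1) % n_servers
        else:                                    # JSQ-like: least backlog
            s = min(range(n_servers), key=free_at.__getitem__)
        start = max(t, free_at[s])               # wait if server is busy
        free_at[s] = start + size
        responses.append(free_at[s] - t)         # queueing + service time
    return responses

rr = simulate("rr")
jsq = simulate("jsq")
print(sum(rr) / len(rr), sum(jsq) / len(jsq))   # RR's mean should be notably worse
```

Because RR ignores server state, a server that just received a huge Pareto-tail job keeps getting every 15th arrival anyway, which is exactly where the spikes come from.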

Looking at the diagrams, it is not difficult to surmise that, if not analyzed carefully, these very high response times will end up baked into SLAs. Compared with the left-hand side of the diagram, which shows the result of the JSQ (Join the Shortest Queue) algorithm, we would be building our SLAs around truly extreme response times.

Note: JSQ is provably close to optimal for load balancing, and it is unlikely that another algorithm can improve on it by more than 10%. If interested, you can read about it here. However, I do think it is a really bad idea to deploy JSQ in a microservices-like environment.
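To see why JSQ is awkward to deploy, compare the information each policy needs. The sketch below is my own illustration (the class names and the dict-based queue view are assumptions, not any real load balancer's API):

```python
import itertools

class RoundRobin:
    """Stateless apart from a rotating pointer: O(1) per decision,
    and no feedback from the servers is required."""
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)

class JoinShortestQueue:
    """Needs an accurate, up-to-date queue length for every server.
    In a microservices setup with many distributed dispatchers,
    keeping this view fresh is the hard (and costly) part."""
    def __init__(self, queue_lengths):
        self.queue_lengths = queue_lengths   # e.g. {"s1": 3, "s2": 0}

    def pick(self):
        return min(self.queue_lengths, key=self.queue_lengths.get)
```

The selection logic itself is trivial; the operational cost of JSQ is entirely in maintaining that shared `queue_lengths` view across many dispatchers without it going stale.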

Now, some might observe that scaling the environment out improves response times under RR. Here is a glimpse of what happens when the number of servers is raised from 15 to 50:

[Figure: JSQ vs RR response times with 50 servers; resources are idle.]

We can see that round robin is already doing much better. If we keep raising the number of servers, we find that beyond a certain point RR performs as well as JSQ. This makes intuitive sense: we now have many more servers while the load profile has remained the same (I am loath to say the load has remained "constant", though its distribution has). The servers are far less loaded, and so they provide much better response times simply because each one sees fewer requests.
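The intuition can be checked with back-of-the-envelope arithmetic. The numbers below are illustrative, not taken from the experiment:

```python
def per_server_utilization(arrival_rate, mean_job_size, n_servers):
    """Fraction of time each server is busy, assuming the offered
    load (arrival_rate * mean_job_size units of work per second)
    is spread evenly across the servers."""
    return arrival_rate * mean_job_size / n_servers

# Same load profile, more servers: each server is far less busy.
print(per_server_utilization(5.0, 2.0, 15))   # ~0.67
print(per_server_utilization(5.0, 2.0, 50))   # 0.2
```

Since queueing delay blows up as utilization approaches 1, dropping per-server utilization from roughly 0.67 to 0.2 means waits are near zero even for a blind policy like RR. You are buying good latency with idle hardware.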

RR is not all bad, though. If, for a given service, all servers have the same service rate and job sizes are nearly constant, then RR is the cheapest load balancing option. But that is pretty much an edge case.

Now, I would be the first to agree that the cases showcased here are a little extreme. But in my experience, extreme cases tend to drive the point home. Hopefully I have made a convincing argument that sticking with the default RR is going to cost money, and at least a few readers will find it worthwhile to spend some time, and maybe even some money, on their load balancer.

Zahir Almani

Leading digitalization & Information technology at Mineral Development Oman

4y

Based on our production environment, this is true when configuring a basic service with multiple real servers. But when we configured X-Forwarded-For, we found RR was better.

Mohamed Saleem

Technical Consultant at Datacom

4y

Bhai. After a long time I read a lengthy post like this. So JSQ it is!!
