Using round robin costs you money!

A lot of load balancers use round robin (RR) as their default policy, and most users leave it that way. Many teams are focused on hitting the right results and SLOs; but if they are not also tuning their load balancers, they are very likely losing money.

This typically can happen in two ways:

  • You never notice that your response times are much worse than they should be, and so you offer SLAs that are much weaker than they could be.
  • You do what most people do when faced with slow performance: you scale out your infrastructure.

There is, of course, a third and better option: take a good hard look at what your load balancer is doing. It does not matter which environment you are on (Google, IBM, Amazon, or in-house), it is highly likely that you are using a load balancer of some kind, and just as likely that it defaults to round robin.

But according to my tests, albeit on a simulator, round robin is pretty bad. Here is a glimpse:

[Figure: JSQ vs RR response times with 15 servers. RR's spikes will result in really bad SLAs.]

This experiment was run with 15 simulated servers, all with the same service rate, responding to jobs whose sizes are drawn from a Pareto distribution. The exact meaning of the axes is not important; the diagram simply gives a sense of how much time jobs spend in the system at discrete time steps. As you can see, under round robin the response times are very bad, with many large spikes.
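The setup above can be sketched in a few lines of Python. To be clear, this is my own minimal reconstruction, not the author's simulator: the arrival rate, Pareto shape, and seed are illustrative, and I approximate JSQ by routing each job to the server with the least unfinished work, which behaves similarly when all servers have the same service rate.

```python
import random

def simulate(policy, n_servers=15, n_jobs=5000, arrival_rate=5.0, seed=42):
    """Toy model: Poisson arrivals, Pareto job sizes, one FIFO queue
    per server. Returns the response time of every job."""
    rng = random.Random(seed)
    free_at = [0.0] * n_servers      # when each server next becomes idle
    rr_next = 0                      # round-robin pointer
    t = 0.0
    responses = []
    for _ in range(n_jobs):
        t += rng.expovariate(arrival_rate)       # next arrival time
        size = rng.paretovariate(2.0)            # heavy-tailed job size
        if policy == "rr":                       # blind rotation
            s = rr_next
            rr_next = (rr_next + 1) % n_servers
        else:                                    # JSQ-like: least backlog
            s = min(range(n_servers), key=free_at.__getitem__)
        start = max(t, free_at[s])               # wait if server is busy
        free_at[s] = start + size
        responses.append(free_at[s] - t)         # queueing + service time
    return responses

rr = simulate("rr")
jsq = simulate("jsq")
print(sum(rr) / len(rr), sum(jsq) / len(jsq))   # RR's mean should be notably worse
```

Because RR ignores server state, a server that just received a huge Pareto-tail job keeps getting every 15th arrival anyway, which is exactly where the spikes come from.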

Looking at the diagrams, it is not difficult to surmise that, if not analyzed carefully, these very high response times will end up baked into SLAs. Compared with the left-hand side of the diagram, which shows the result of the JSQ (Join the Shortest Queue) algorithm, we would be building our SLAs around truly extreme response times.

Note: JSQ is provably close to optimal for load balancing, and it is unlikely that another algorithm can improve on it by more than 10%. If interested, you can read about it here. However, I do think it is a really bad idea to deploy JSQ in a microservices-like environment.
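To see why JSQ is awkward to deploy, compare the information each policy needs. The sketch below is my own illustration (the class names and the dict-based queue view are assumptions, not any real load balancer's API):

```python
import itertools

class RoundRobin:
    """Stateless apart from a rotating pointer: O(1) per decision,
    and no feedback from the servers is required."""
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)

class JoinShortestQueue:
    """Needs an accurate, up-to-date queue length for every server.
    In a microservices setup with many distributed dispatchers,
    keeping this view fresh is the hard (and costly) part."""
    def __init__(self, queue_lengths):
        self.queue_lengths = queue_lengths   # e.g. {"s1": 3, "s2": 0}

    def pick(self):
        return min(self.queue_lengths, key=self.queue_lengths.get)
```

The selection logic itself is trivial; the operational cost of JSQ is entirely in maintaining that shared `queue_lengths` view across many dispatchers without it going stale.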

Now, some might observe that scaling the environment out improves response times under RR. Here is a glimpse of what happens when the number of servers is raised from 15 to 50:

[Figure: JSQ vs RR response times with 50 servers; resources are idle.]

We can see that round robin is already doing much better. If we keep raising the number of servers, we find that beyond a certain point RR performs as well as JSQ. This makes intuitive sense: we now have many more servers while the load profile has remained the same (I am loath to say the load has remained "constant", though its distribution has). The servers are far less loaded, and so they provide much better response times simply because each one sees fewer requests.
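The intuition can be checked with back-of-the-envelope arithmetic. The numbers below are illustrative, not taken from the experiment:

```python
def per_server_utilization(arrival_rate, mean_job_size, n_servers):
    """Fraction of time each server is busy, assuming the offered
    load (arrival_rate * mean_job_size units of work per second)
    is spread evenly across the servers."""
    return arrival_rate * mean_job_size / n_servers

# Same load profile, more servers: each server is far less busy.
print(per_server_utilization(5.0, 2.0, 15))   # ~0.67
print(per_server_utilization(5.0, 2.0, 50))   # 0.2
```

Since queueing delay blows up as utilization approaches 1, dropping per-server utilization from roughly 0.67 to 0.2 means waits are near zero even for a blind policy like RR. You are buying good latency with idle hardware.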

RR is not all bad, though. If, for a given service, all servers have the same service rate and job sizes are nearly constant, then RR is the cheapest load balancing option. But that is pretty much an edge case.

Now, I would be the first to agree that the cases showcased here are a little extreme. But in my experience, extreme cases tend to drive the point home. Hopefully I have made a convincing argument that sticking with the default RR is going to cost money, and at least a few readers will find it worthwhile to spend some time, and maybe even some money, on their load balancer.

Zahir Almani

Leading digitalization & Information technology at Mineral Development Oman

4y

Based on our production environment, this is true when configuring a basic service with multiple real servers. But when we configured X-Forwarded-For, we found RR was better.

Mohamed Saleem

Technical Consultant at Datacom

4y

Bhai. After a long time I read a lengthy post like this. So JSQ it is!!
