Case Study: How Stack Overflow's monolith beats microservice performance

Stack Overflow, every software engineer's savior, operates remarkably well, serving around 260,000,000 (260 million) requests per month with an average latency of 18 ms. I used to believe that serving millions would require hundreds of instances spread across multiple availability domains.

Have a look at this diagram, an overview from: https://blog.bytebytego.com/i/76744008/how-will-you-design-the-stack-overflow-website

Image Courtesy: ByteByteGo.com

Here it is: the whole service runs on the following components

  • 9 web servers hosted across 3 data centers
  • 1 HAProxy load balancer with a standby
  • 1 Redis server with a replica
  • 1 SQL Server database with a standby
  • 3 Elasticsearch instances

A detailed overview of these components follows.


Simple Architecture: Stack Overflow

Scalable Monoliths: Critical Components

How does Stack Overflow use these components so efficiently? Several practices help them achieve this performance and scale, some of which even run counter to standard practice.

"An important point: because the application is a monolith, its components cannot scale independently; the servers have to handle the full load with the resources they are given."

Ensuring Availability

Multiple servers hosted across distributed data centers


Availability is the primary goal for any application or web service. Stack Overflow achieves it by distributing its nine multi-tenant servers evenly across three data centers. Hosting in multiple data centers is crucial because it

  • Reduces service downtime caused by hardware failures
  • Keeps latency low by distributing load and routing each request to the nearest data center

SQL Database with Hot Standby


A monolith has a single database for the whole service. For Stack Overflow, that is a SQL Server database with a read-only standby. The standby updates itself from the live DB server asynchronously, when read/write activity is low.

SQL Server cache: the SQL Server keeps the whole database in memory, "THE WHOLE OF IT". This minimizes disk operations and saves time on queries.
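One common way to exploit a read-only standby is to route SELECTs to it while sending writes to the primary. This is a toy sketch of that idea (the routing heuristic, class names, and fake servers are mine, not Stack Overflow's actual setup):

```python
class ReadWriteRouter:
    """Route reads to the read-only standby and writes to the live primary."""

    def __init__(self, primary, standby):
        self.primary = primary
        self.standby = standby

    def execute(self, sql, params=()):
        # Naive heuristic: anything that isn't a SELECT must hit the primary.
        is_read = sql.lstrip().upper().startswith("SELECT")
        target = self.standby if is_read else self.primary
        return target.run(sql, params)


class FakeServer:
    """Stand-in for a database connection, so the sketch is self-contained."""

    def __init__(self, name):
        self.name = name

    def run(self, sql, params):
        return (self.name, sql)


router = ReadWriteRouter(FakeServer("primary"), FakeServer("standby"))
print(router.execute("SELECT TOP 10 * FROM Posts")[0])   # served by the standby
print(router.execute("UPDATE Posts SET Score = 1")[0])   # served by the primary
```

Because the standby replicates asynchronously, reads served this way can be slightly stale, which is an acceptable trade-off for question pages but not for writes.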

Ensuring Performance

Load Balancing with HAProxy


It's not enough to just install multiple servers across different availability domains; a reverse proxy must distribute the load among them. Stack Overflow uses HAProxy for the job.

In simpler terms, HAProxy receives the traffic and then balances the load across your servers. HAProxy can also deal with any one of your servers failing since the load balancer can detect if a server becomes unresponsive and automatically stop sending traffic to it.

In the figure above, the failover proxy is a standby that replaces the live proxy if it runs into issues, ensuring availability.
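The behavior described above maps to a very small HAProxy configuration: a frontend receiving traffic, a backend with round-robin balancing, and per-server health checks so unresponsive servers stop receiving requests. This is an illustrative fragment with made-up names and addresses, not Stack Overflow's real config:

```
frontend web_in
    bind *:80
    default_backend web_servers

backend web_servers
    balance roundrobin
    option httpchk GET /health
    server web1 10.0.1.11:8080 check
    server web2 10.0.1.12:8080 check
    server web3 10.0.1.13:8080 check
```

The `check` keyword enables the health probes; a server that fails them is taken out of rotation automatically and re-added once it recovers.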

Caching with Redis


Redis is the shared cache tier sitting behind each server's local in-memory cache. It stores questions, answers, related questions (mapped by the tag engine), and similar data.

If a server misses its own cache, it goes to Redis for the data; if Redis also misses, the request falls through to the database server. Anything cached in Redis is shared by all servers. (I am not sure whether the database populates Redis on a cache miss or the server writes the value back itself.)
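The lookup chain above is the classic cache-aside pattern. A minimal sketch, using plain dicts as stand-ins for the local cache, Redis, and the database (keys and values are illustrative):

```python
local_cache = {}   # per-web-server in-memory cache
redis_like = {}    # shared cache tier, one copy visible to all servers
database = {"q:42": "How do I exit Vim?"}  # fake backing store

def get_post(key):
    # 1. Check the server's own in-memory cache.
    if key in local_cache:
        return local_cache[key]
    # 2. Check the shared Redis tier; warm the local cache on a hit.
    if key in redis_like:
        local_cache[key] = redis_like[key]
        return local_cache[key]
    # 3. Fall through to the database, then populate both cache tiers.
    value = database[key]
    redis_like[key] = value
    local_cache[key] = value
    return value

print(get_post("q:42"))  # first call: served from the DB, caches warmed
print(get_post("q:42"))  # second call: served from the local cache
```

In this sketch the application writes the value back to both tiers after a database read; as noted above, whether Stack Overflow's database or its servers perform that write-back isn't clear from public docs.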

Redis is crucial because it saves both CPU and user-facing latency: hot data is served from memory instead of waiting on database read operations.

Redis is used for more than caching: its pub/sub mechanism drives the WebSockets that push real-time updates for scores, reputation, and other dynamic values.

More on this: https://nickcraver.com/blog/2019/08/06/stack-overflow-how-we-do-app-caching/#redis
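To show the shape of the pub/sub pattern without needing a running Redis server, here is a tiny in-process stand-in; a real deployment would use a Redis client's publish/subscribe commands, and the channel name below is invented:

```python
from collections import defaultdict

class TinyPubSub:
    """Minimal in-process imitation of Redis pub/sub for illustration only."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, channel, callback):
        # In a real setup, the callback would be a WebSocket push handler.
        self.subscribers[channel].append(callback)

    def publish(self, channel, message):
        for callback in self.subscribers[channel]:
            callback(message)
        return len(self.subscribers[channel])  # number of receivers notified

bus = TinyPubSub()
received = []
bus.subscribe("question:42:score", received.append)
bus.publish("question:42:score", {"score": 128})
print(received)  # the subscriber saw the score update
```

The key property mirrored here is fan-out: one publish reaches every subscriber on the channel, which is what lets a single vote update appear on every open browser tab at once.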

Quick Search with ElasticSearch Engine


As a Q&A service, Stack Overflow has to handle a constant stream of queries hitting its search bar, making this a critical component to optimize.

To support millions of queries, three Elasticsearch instances are deployed behind a load balancer. Elasticsearch is almost everyone's first choice for full-text queries these days.

Stack Overflow maintains a table called Posts that holds both questions and answers, one per row, indexed on a regular schedule.

According to them, each row is a small entry, so indexing and updates take little to no time. Kudos to people for not adding a lot of images to their answers :)
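A full-text search against such an index is expressed as a small JSON query body sent to Elasticsearch's search API. The sketch below just builds that body as a Python dict; the index name, fields, and boost are illustrative, not Stack Overflow's actual schema:

```python
def build_search_body(text, size=10):
    """Build a hypothetical full-text query body for a 'posts' index."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                # Boost title matches over body matches with the ^2 suffix.
                "fields": ["title^2", "body"],
            }
        },
        "size": size,  # cap the number of hits returned
    }

body = build_search_body("exit vim")
print(body["query"]["multi_match"]["fields"])  # title hits count double
```

In a real client this dict would be passed to the search endpoint of the `posts` index; relevance scoring then ranks rows by how well they match the user's text.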

More on how Elastic Search fetches relevant docs from indices: https://www.elastic.co/guide/en/enterprise-search/current/engines.html#engines-index


Conclusion

This case study helped me break my bias that "microservices are essential to serving millions of requests". With proper practices and planning, a monolith can be scaled to serve a very large audience.

Being a monolith, Stack Overflow has to be careful to allocate resources properly to deal with bursts of traffic.

They are better off over-allocating and under-utilizing. Across the images above there is a common pattern: systems are provisioned with generous CPU and memory but generally run at ~10% of capacity.

You might ask: aren't they wasting 90% of their infrastructure costs? Someone asked a similar question here.

