Case Study: How Stack Overflow's monolith beats microservice performance
Navjot Bansal
Building Computer Vision Systems @Oracle | Software Architecture | System Design | ICPC Regionalist
Stack Overflow, every software engineer's savior, operates immaculately, serving around 260,000,000 (260 million) requests per month with an average latency of 18 ms. I used to believe that serving millions of requests required hundreds of instances spread across multiple availability domains.
Have a look at this overview diagram from: https://blog.bytebytego.com/i/76744008/how-will-you-design-the-stack-overflow-website
The whole service is hosted on the components shown above. A detailed overview of these components follows.
Scalable Monoliths: Critical Components
How does Stack Overflow utilize these components so efficiently? Several practices help them achieve this performance and scale, and some of them even run counter to standard industry practice.
"An important aspect to note is that, the service being a monolith, the components won't scale out individually; the servers have to handle all the load with the resources they have."
Ensuring Availability
Multiple Servers hosted over distributed data-centers
Availability is the primary goal for any application or web service. Stack Overflow achieves this by distributing its nine multi-tenant servers evenly across three data centers. Hosting on multiple data centers is crucial because it provides fault tolerance: if one data center goes down, the others keep serving traffic.
SQL Database with Hot Standby
Monoliths typically have a single database for the whole service. Accordingly, Stack Overflow adopted a SQL database with a read-only standby. The standby updates itself from the live DB server asynchronously, catching up when read/write load is low.
SQL Server cache. The SQL server keeps the whole database in memory, "THE WHOLE OF IT". This saves query time because it requires minimal disk operations.
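The primary/standby split described above can be sketched in a few lines. This is a toy illustration of the pattern, not Stack Overflow's actual code: writes go to the live database, reads are served from the read-only standby, and replication happens asynchronously, so a read may briefly lag a write.

```python
# Illustrative sketch of primary + read-only standby replication.
# The Database and ReplicatedStore classes are hypothetical.

class Database:
    def __init__(self, name):
        self.name = name
        self.rows = {}

    def execute(self, key, value):
        self.rows[key] = value          # write path (primary only)

    def query(self, key):
        return self.rows.get(key)       # read path


class ReplicatedStore:
    """Route writes to the primary; serve reads from the standby."""

    def __init__(self):
        self.primary = Database("live")
        self.standby = Database("standby")

    def write(self, key, value):
        self.primary.execute(key, value)

    def replicate(self):
        # Real replication is asynchronous and continuous;
        # here we copy on demand to make the lag visible.
        self.standby.rows = dict(self.primary.rows)

    def read(self, key):
        # The standby may lag behind the primary (eventual consistency).
        return self.standby.query(key)


store = ReplicatedStore()
store.write("q:1", "How do I exit vim?")
print(store.read("q:1"))   # None: replication hasn't caught up yet
store.replicate()
print(store.read("q:1"))   # "How do I exit vim?"
```

The lag between `write` and `replicate` is why the standby is kept read-only: serving writes from it would fork the data.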
Ensuring Performance
Load Balancing through HAProxies
It's not sufficient to just install multiple servers across different availability domains. A reverse proxy is needed to distribute the load among the servers. Stack Overflow uses HAProxy for the job.
In simpler terms, HAProxy receives the traffic and then balances the load across your servers. HAProxy can also deal with any one of your servers failing since the load balancer can detect if a server becomes unresponsive and automatically stop sending traffic to it.
In the figure above, the failover proxy is a standby that replaces the live proxy if it encounters any issues, ensuring availability.
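Conceptually, what HAProxy does here is round-robin balancing combined with health checks. A minimal sketch of that behavior (real HAProxy is configured declaratively, and these class names are made up for illustration):

```python
# Toy round-robin load balancer that skips unhealthy backends,
# mimicking HAProxy's health-check behavior. Names are illustrative.
from itertools import cycle

class Server:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy          # result of the last health check

class RoundRobinProxy:
    def __init__(self, servers):
        self.servers = servers
        self._ring = cycle(servers)

    def pick(self):
        # Try each server at most once per request.
        for _ in range(len(self.servers)):
            server = next(self._ring)
            if server.healthy:
                return server
        raise RuntimeError("no healthy backends")

proxy = RoundRobinProxy([Server("web-1"), Server("web-2"), Server("web-3")])
proxy.servers[1].healthy = False        # web-2 fails its health check
picks = [proxy.pick().name for _ in range(4)]
print(picks)   # ['web-1', 'web-3', 'web-1', 'web-3']
```

Once `web-2`'s health check fails, traffic silently flows to the remaining servers; when the check passes again it rejoins the rotation.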
Caching with Redis
Redis is the shared cache level that sits behind the servers' own in-memory caches. It is used to store questions, answers, related questions (mapped by the tag engine), and similar data.
If a server misses its local cache, it goes to Redis for the data; if Redis also misses, the request falls through to the database server. Any data in the Redis server is shared across all the web servers. I am not sure whether the DB populates Redis on cache misses or not.
Redis is crucial as it saves both CPU and latency for the user: hot data is served from memory instead of waiting on DB read operations.
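The lookup chain above (local cache, then Redis, then database) is the classic cache-aside pattern. A hedged sketch, with dictionaries standing in for the real stores and illustrative key names:

```python
# Cache-aside lookup: per-server local cache -> shared Redis -> database.
# Plain dicts stand in for the real stores; keys are illustrative.

local_cache = {}        # this server's in-process cache
redis_cache = {}        # stands in for the shared Redis instance
database = {"q:42": "Answer: use a monolith"}

def get(key):
    if key in local_cache:                  # server-local hit: cheapest
        return local_cache[key]
    if key in redis_cache:                  # shared Redis hit
        local_cache[key] = redis_cache[key]
        return local_cache[key]
    value = database.get(key)               # full miss: hit the DB
    if value is not None:
        redis_cache[key] = value            # now visible to all servers
        local_cache[key] = value
    return value

print(get("q:42"))   # first call falls through to the database
print(get("q:42"))   # subsequent calls are served from cache
```

Note that because Redis is shared, one server's database read warms the cache for every other server in the fleet.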
There are also non-cache uses for Redis: its pub/sub mechanism powers the websockets that provide real-time updates on scores, reputation, and other dynamic values.
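The pub/sub idea is simple enough to show in miniature. This tiny in-memory version illustrates the pattern only; the real system would use Redis's pub/sub commands, and the channel name and payload here are invented:

```python
# Minimal in-memory pub/sub, mimicking the pattern Redis provides:
# publishers push updates to a channel, every subscriber receives them.
from collections import defaultdict

class PubSub:
    def __init__(self):
        self.channels = defaultdict(list)

    def subscribe(self, channel, callback):
        self.channels[channel].append(callback)

    def publish(self, channel, message):
        # Fan the message out to every subscriber on this channel.
        for callback in self.channels[channel]:
            callback(message)

bus = PubSub()
received = []
bus.subscribe("rep-updates", received.append)   # e.g. a websocket handler
bus.publish("rep-updates", {"user": 7, "rep": 10})
print(received)   # [{'user': 7, 'rep': 10}]
```

The decoupling is the point: the code that awards reputation doesn't know which websocket connections exist; it just publishes, and Redis fans the message out.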
Quick Search with ElasticSearch Engine
Being a Q&A service, Stack Overflow has to be ready for the huge volume of queries hitting its search bar. This is a critical component to optimize.
To support millions of queries, three Elasticsearch instances are deployed behind a load balancer. Elasticsearch is almost everyone's first choice for full-text queries these days.
Stack Overflow maintains a table called Posts that holds both questions and answers, one per row, indexed incrementally over time.
According to them, each row is a small entry and thus requires little to no time for indexing and updates. Kudos to people for not adding a lot of images to their answers :)
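Under the hood, full-text search engines like Elasticsearch build an inverted index: each term maps to the set of documents containing it. A toy version over a Posts-like table (the schema and sample rows are invented for illustration):

```python
# Toy inverted index over a Posts-like table, showing the core idea
# behind full-text search. Schema and rows are illustrative only.
from collections import defaultdict

posts = [
    {"id": 1, "body": "how to exit vim"},
    {"id": 2, "body": "how to center a div"},
    {"id": 3, "body": "vim keybindings in vscode"},
]

# Build the index: term -> set of post ids containing that term.
index = defaultdict(set)
for post in posts:
    for term in post["body"].lower().split():
        index[term].add(post["id"])

def search(query):
    # AND semantics: a post must contain every query term.
    terms = query.lower().split()
    results = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(results)

print(search("vim"))        # [1, 3]
print(search("how vim"))    # [1]
```

Because each small post row touches only a handful of terms, updating the index on edit is cheap, which matches the article's point about fast indexing.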
More on how Elastic Search fetches relevant docs from indices: https://www.elastic.co/guide/en/enterprise-search/current/engines.html#engines-index
Conclusion
This case study helped me break my bias that "microservices are essential to serving millions of requests". With proper practices and planning, almost any service can be scaled to serve a large audience.
Being a monolith, Stack Overflow has to be careful to allocate enough resources to absorb bursts of traffic.
They are better off over-allocating and under-utilizing. Across all the attached images there's a common pattern: the systems are provisioned with a lot of CPU and memory yet generally run at ~10% of capacity.
You might ask: aren't they wasting 90% of their infra costs? Someone asked a similar question here.