Understanding Scalability: A Deep Dive
Credits - Google Gemini

Imagine attending a sold-out music festival featuring your favorite artist. The stage lights up, and the crowd erupts in cheers as the performer takes the stage. But what if the festival organizers underestimated the demand and booked a sound system designed only for a small, intimate gathering? The audio would be muffled and distorted, losing its clarity and ruining the experience for thousands of fans.

Similarly, when software is not designed to scale, it can't handle increased traffic or user load, leading to performance issues, errors, and ultimately, a poor experience. Scalability refers to the ability of a system to efficiently handle growing demands without compromising on performance, ensuring that every user can enjoy the music (or in this case, access the software) with clarity and quality.

Scalability is just as critical to the performance of a system as reliability, which I covered in the last edition.

The Interconnection of Reliability and Scalability

Scalability refers to a system’s ability to handle an increased load effectively. Interestingly, reliability and scalability are interconnected in many ways. For instance, a system that performs reliably for 10k concurrent users may not necessarily maintain the same performance level with 200k concurrent users.

The ‘load’ in a system can be defined by various parameters. In some systems, it could be the number of concurrent users, requests per second, writes per millisecond into a database, or the number of reads on a cache. In others, the load is better characterized by corner-case scenarios.

For example, consider YouTube. A popular channel like Mr Beast’s has 288M subscribers, which means certain parts of the system, such as new-video notifications, must handle this extreme load. In contrast, an average channel may have fewer than 10 subscribers. Similarly, on X/Twitter, posting a tweet may scale linearly, but fanning the tweet out to all followers can be a heavy-load operation.
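
To make the fan-out point concrete, here is a minimal, purely illustrative sketch of fan-out-on-write, where publishing a post triggers one timeline write per follower. This is not how X/Twitter or YouTube actually implement it; the names (`followers_of`, `home_timelines`, `publish_post`) are hypothetical stand-ins.

```python
# Minimal sketch of fan-out-on-write for a post (illustrative only; not
# the real X/Twitter or YouTube design). All names here are hypothetical.
from collections import defaultdict

# Toy in-memory stand-ins for a follower graph and per-user home timelines.
followers_of: dict[str, set[str]] = defaultdict(set)
home_timelines: dict[str, list[str]] = defaultdict(list)

def publish_post(author: str, post_id: str) -> int:
    """Write the post once, then fan it out to every follower's timeline.

    The write itself is O(1); the fan-out is O(number of followers), which
    is why an account with hundreds of millions of followers puts a very
    different load on the system than an average account with a handful.
    """
    fanned_out = 0
    for follower in followers_of[author]:
        home_timelines[follower].append(post_id)
        fanned_out += 1
    return fanned_out

# Usage: an account with 3 followers triggers 3 timeline writes; an account
# with 288M followers would trigger 288M of them.
followers_of["small_channel"] = {"u1", "u2", "u3"}
print(publish_post("small_channel", "post-42"))  # -> 3
```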

Performance Metrics: Percentiles over Averages

Batch systems usually measure performance by throughput, while online systems usually measure it by response time. Performance metrics are typically reported as percentiles rather than averages. They are notated as follows:

  • 50th percentile, notated as P50 (the median)
  • 95th percentile, notated as P95
  • 99th percentile, notated as P99
  • 99.99th percentile, notated as P99.99

The reason for this is to understand outliers, particularly tail latencies like P99. For instance, consider a retail website with 100M active users per week, where an average order (fewer than 10 items) completes with a P99.99 latency of 300ms. That means the slowest 0.01% of users (roughly 10,000 at that scale) experience latency higher than 300ms, and that could mean 2 seconds or even 10 seconds. If those users happen to be heavy buyers, either with large numbers of items in the cart or with a few high-value items (like the diamond rings sold at Costco), then these tail-end latencies directly affect the revenue of the business. Perhaps an additional backend anti-fraud check is causing the latency, but the end-user experience still suffers.
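
To illustrate why averages hide the tail, here is a minimal sketch that computes these percentiles over a synthetic latency distribution. All the numbers below are made up purely for illustration.

```python
# Minimal sketch: why percentiles, not averages, describe latency.
import random
import statistics

random.seed(7)
# Simulate ~100k request latencies in ms: most are fast, and a small tail
# (e.g., requests that hit an extra anti-fraud check) is very slow.
latencies = [random.gauss(120, 30) for _ in range(100_000)]
latencies += [random.uniform(2_000, 10_000) for _ in range(100)]  # slow tail

def percentile(samples: list[float], p: float) -> float:
    """Return the p-th percentile (0 < p < 100) of the samples."""
    ordered = sorted(samples)
    index = min(len(ordered) - 1, int(len(ordered) * p / 100))
    return ordered[index]

print(f"mean   : {statistics.fmean(latencies):7.1f} ms")   # hides the tail
print(f"P50    : {percentile(latencies, 50):7.1f} ms")
print(f"P95    : {percentile(latencies, 95):7.1f} ms")
print(f"P99    : {percentile(latencies, 99):7.1f} ms")
print(f"P99.99 : {percentile(latencies, 99.99):7.1f} ms")  # the slow outliers
```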

Designing for Scalability

Some approaches to design for scalability include:

  • Vertical Scaling: Adding more resources to a single node (a more powerful CPU, more memory, or lower-latency storage such as flash drives, possibly with reduced RAID overhead)
  • Horizontal Scaling: Adding multiple nodes and fronting them with a load balancer to distribute the requests.
  • Hybrid: Keeping a mix of both these strategies
  • Elastic scaling: A system that automatically adds or removes capacity based on the load (a minimal sketch of such a scaling decision follows this list).
  • Predictive scaling: To ensure seamless performance during peak periods, we can proactively add nodes or capacity ahead of anticipated high-traffic events, such as major shopping holidays (e.g., Black Friday), global online sales promotions (like Amazon Prime Day or Singles' Day in China), or popular sporting events (like the World Cup). This provides a level of predictability that elastic scaling alone cannot, offers operational simplicity, and minimizes the risk of last-minute scrambles to scale up during critical periods.
  • Caching: Techniques such as offloading static content to Content Delivery Networks (CDNs), moving processing to the clients, and distributed caching can be used so that compute is spent only where necessary.
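
As referenced above, here is a minimal sketch of a threshold-based elastic-scaling decision. The thresholds and node limits are arbitrary illustrative values; a real setup would use a cloud provider's autoscaling service with smoothing and cooldown windows rather than this toy logic.

```python
# Minimal sketch of a threshold-based elastic-scaling decision
# (illustrative only; thresholds and limits are arbitrary).
from dataclasses import dataclass

@dataclass
class ScalingPolicy:
    min_nodes: int = 2           # never go below this (keeps redundancy)
    max_nodes: int = 50          # budget/architecture ceiling
    scale_out_cpu: float = 0.70  # add a node above 70% average CPU
    scale_in_cpu: float = 0.30   # remove a node below 30% average CPU

def desired_node_count(current_nodes: int, avg_cpu: float,
                       policy: ScalingPolicy) -> int:
    """Decide how many nodes we want, given average CPU utilization (0..1)."""
    if avg_cpu > policy.scale_out_cpu:
        target = current_nodes + 1      # scale out under load
    elif avg_cpu < policy.scale_in_cpu:
        target = current_nodes - 1      # scale in when idle
    else:
        target = current_nodes          # stay put inside the band
    return max(policy.min_nodes, min(policy.max_nodes, target))

# Usage: at 85% CPU on 4 nodes, the policy asks for a 5th node.
print(desired_node_count(4, 0.85, ScalingPolicy()))  # -> 5
```

In practice this decision would be fed to an orchestrator or cloud autoscaling API rather than applied directly.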

It is to be noted that there is no one-size-fits-all solution. Each application requires a unique design, informed by a deep understanding of its users' characteristics, usage patterns, and underlying assumptions. For instance, a streaming service will require a distinct solution from a payment tech system, which in turn differs significantly from a social media application. As our assumptions evolve or are disproven by changing user behaviors or market demands, it is crucial that we remain vigilant and prepared to reevaluate and evolve our architecture accordingly.

So, how do we measure Scalability?

When evaluating the scalability of a system, it is essential to consider reliability alongside it. To achieve optimal results, we must measure and analyze various metrics under both simulated and actual load conditions.

The following quantified metrics are crucial in assessing a system's scalability (a minimal sketch of deriving them from request logs follows the list):

  • Error rates and crash rates: Identifying potential bottlenecks and areas for improvement.
  • Response time for key use cases: Ensuring that critical functions operate efficiently.
  • Requests/transactions per second: Measuring the system's ability to handle high volumes of traffic.
  • Resource utilization: Monitoring CPU, memory, and other resource usage to prevent overload.
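
As a rough illustration, here is a minimal sketch that aggregates error rate, P95 response time, and requests per second from a batch of request records. The record shape is a hypothetical stand-in for whatever your load-test tool or access logs actually emit; resource utilization would come from host metrics rather than request logs.

```python
# Minimal sketch: deriving scalability metrics from request records.
# The record format (timestamp, latency_ms, status) is a hypothetical
# stand-in for real access logs or load-test output.
from dataclasses import dataclass

@dataclass
class RequestRecord:
    timestamp: float   # seconds since the test started
    latency_ms: float
    status: int        # HTTP-style status code

def summarize(records: list[RequestRecord], window_seconds: float) -> dict:
    latencies = sorted(r.latency_ms for r in records)
    errors = sum(1 for r in records if r.status >= 500)
    p95_index = min(len(latencies) - 1, int(len(latencies) * 0.95))
    return {
        "error_rate": errors / len(records),                   # errors/crashes
        "p95_latency_ms": latencies[p95_index],                 # response time
        "requests_per_second": len(records) / window_seconds,   # throughput
    }

# Usage with three fake records over a 1-second window.
sample = [RequestRecord(0.1, 120, 200),
          RequestRecord(0.5, 340, 200),
          RequestRecord(0.9, 80, 503)]
print(summarize(sample, window_seconds=1.0))
```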

In addition to these metrics, we also continuously test and monitor any auto-scaling or pre-scaling models in place. This ensures that our systems adapt effectively to changing demands.

However, not all systems scale linearly, due to various choke points along the request control flow. In such cases, it is essential to understand these limitations thoroughly in order to architect and budget resources accordingly. For instance, if network bandwidth becomes the bottleneck, provisioning more bandwidth may be a relatively cost-effective solution.
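
One classic way to reason about why a system stops scaling linearly is Amdahl's law: if some fraction of each request is serialized behind a choke point (a single-writer database, a shared lock), adding nodes eventually stops helping. A rough sketch, with a purely illustrative 10% serialized fraction:

```python
# Rough sketch of Amdahl's law: speedup from N nodes when a fraction
# `serial` of the work cannot be parallelized. The 10% figure below is
# purely illustrative, not a measured value.
def amdahl_speedup(nodes: int, serial: float) -> float:
    return 1.0 / (serial + (1.0 - serial) / nodes)

for n in (1, 2, 4, 8, 16, 64, 1024):
    print(f"{n:5d} nodes -> {amdahl_speedup(n, serial=0.10):5.2f}x speedup")
# With 10% serialized work, even 1024 nodes top out near 10x:
# past a point, adding capacity buys very little.
```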

By adopting this quantifiable approach to scalability, we can ensure that our systems are not only highly available but also able to handle increased demands with confidence.

This is a perfect point to pause and go to the next topic - Maintainability, which I will cover in my next article.

Further Reading

For more insights into how large-scale systems handle these challenges, check out this blog post on how Walmart handles trillions of Kafka messages.

Also, Meta’s engineering blogs are a pleasure to read for techies. Read this one to understand how they think about their systems - Maintaining Large-Scale AI Capacity at Meta.

Jennifer DiFrancesca

Principal, Technical Data Operations at Dun & Bradstreet - Certified Kanban Practitioner

4 months ago

I was left giggling at the visual used. Thank you! Ha! You are 100% right. I just experienced something similar while watching World Superbike racing streaming this AM. Watching an entire race in chunky pixels was not the experience I was going for. Ha. You nailed it. Hyperscaling and Expectations. Since most of us live in Hyperscaling Environments, we EXPECT Hyperscale Performance. Most companies do not run under a Hyperscaler's Budget. This makes understanding these variables and points VITAL to designing a System that meets SLAs/KPIs. Thanks for the content.
