Building resilient Server-less systems using AWS
Vinayak Raghuvamshi
Driving Engineering Excellence | Cybersecurity, Distributed Systems | AWS | Azure | AI | Author | Mentor | Spirituality Coach
This article is inspired by and based upon a talk given by David Yanacek, Sr. Principal Engineer at AWS.
One of the CS courses at Carnegie Mellon University defines resiliency as:
A resilient system protects its critical capabilities (and associated assets) from harm by using protective resilience techniques to passively resist adverse events and conditions or actively detect these adversities, respond to them, and recover from the harm they cause.
When building services and systems on a serverless architecture, we get a good deal of resiliency out of the box. Even so, it is helpful to understand the various aspects of resiliency, the challenges involved, and the recommendations for addressing them.
Overload:
A system is said to be under overload when it is handling so many transactions that it processes all of them ineffectively and slows everybody down.
It is almost impossible to build a linearly scalable system in which throughput keeps rising along with load. As per Gunther's Universal Scalability Law, you can parallelize a system only up to the point where contention and coherency costs become the bottleneck. Beyond this point, you start seeing diminishing returns. Here you can watch a great session on Applying The Universal Scalability Law to Distributed Systems by Dr. Neil J. Gunther himself.
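For a concrete feel of what the USL predicts, here is a minimal Python sketch of the throughput curve. The coefficient values (lam, sigma, kappa) are illustrative assumptions, not measurements from any real system:

```python
# Universal Scalability Law: X(N) = lam * N / (1 + sigma*(N-1) + kappa*N*(N-1))
# lam   - throughput of a single worker with no contention
# sigma - contention (serialization) penalty
# kappa - coherency (crosstalk) penalty

def usl_throughput(n, lam=1000.0, sigma=0.05, kappa=0.001):
    """Expected throughput at concurrency n under the USL model."""
    return (lam * n) / (1 + sigma * (n - 1) + kappa * n * (n - 1))

if __name__ == "__main__":
    # Throughput climbs, flattens, and eventually declines as the
    # coherency term dominates - the diminishing returns described above.
    for n in (1, 8, 32, 64, 128, 256):
        print(f"concurrency={n:4d}  throughput={usl_throughput(n):8.1f} tps")
```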
Here is a sample graph of expected throughput vs. latency:
When even our fastest response time exceeds the client timeout, we say the system has browned out. In this example, any processing whose latency exceeds the client timeout is pretty much wasted: the system is doing a lot of work and utilization may be at 100%, yet it is not getting much done per unit of time, and clients time out before it can respond.
How to prevent this?
First we need an idea of how much load we can handle optimally. Load tests play a crucial role in measuring this; we essentially need to find the tipping point. Needless to say, such load tests should be performed in a UAT / load-test environment and not in production.
Once we know what the optimum load is, we can do a few things to avoid overload.
Load shedding.
This just means we reject extra work once we reach the optimum load, or tipping point. Load shedding helps ensure that we don't end up impacting everybody because of overload, at the cost of rejecting the transactions of a few.
We should design our systems to not waste work. Handling client timeouts is one key aspect to consider.
This can also have a cascading, cumulative effect when our server has other dependencies.
One of the ways to mitigate this issue is by setting server-side timeouts on the Lambda function. AWS Lambda lets you configure a function timeout of up to 900 seconds (15 minutes); the default is 3 seconds. The caveat is that we may time out on legitimate but expensive requests, and also in scenarios where our dependencies have latency spikes, penalizing the client for a server-side issue. So when setting these timeouts, it is good practice to keep the value close to the client timeout. If different types of clients have vastly different timeouts, we may want to consider providing them with different API endpoints.
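As a minimal sketch (the function name and timeout value below are assumptions), adjusting a function's timeout with boto3 looks like this:

```python
import boto3

lambda_client = boto3.client("lambda")

# Keep the server-side timeout close to the client timeout so the function
# does not keep working on requests the caller has already given up on.
lambda_client.update_function_configuration(
    FunctionName="order-processing",   # hypothetical function name
    Timeout=10,                        # seconds; cannot exceed 900
)
```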
The other alternative is to do 'bounded work': input size validation, pagination, and checkpointing. Bounded work basically means not taking on more work than we can efficiently handle within the established SLAs.
Checkpointing is the ability to do work incrementally and save state incrementally. A good example is a DynamoDB scan: when we scan a large table, we do not get the entire contents back in one go; instead we get a chunk at a time. If there are intermittent failures, we can pick up from where we last left off instead of starting all over again.
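A minimal boto3 sketch of this pattern, assuming a hypothetical "orders" table and a hypothetical process() helper for the per-item work:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("orders")  # hypothetical table name

def scan_with_checkpoint(start_key=None):
    """Scan one page at a time, returning the key to resume from.

    Persist the returned LastEvaluatedKey somewhere durable so a retry
    can pick up where the previous attempt left off instead of rescanning.
    """
    kwargs = {"Limit": 100}
    if start_key:
        kwargs["ExclusiveStartKey"] = start_key

    response = table.scan(**kwargs)
    for item in response.get("Items", []):
        process(item)  # hypothetical per-item work

    return response.get("LastEvaluatedKey")  # None when the scan is complete
```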
In Lambda execution environments (containers, micro-VMs, etc.) we have fixed resources per unit of work, which means every request gets the same amount of resources and each execution environment works on only one request at a time. This is also called workload isolation, and it gives predictable performance: because there is no contention for resources between requests, latency remains consistent across requests. This is one good way to enforce the tenet "do not take on too much work", so bounded work is also available pretty much out of the box with AWS serverless architecture.
To summarize, avoiding overload involves:
- Rejecting excess work, or load
- Reducing wasted work (server timeouts)
- Doing bounded work (input size validation, pagination, checkpointing)
- Not taking on too much work, by reserving the same amount of resources for each request (see the concurrency sketch below)
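One practical way to cap how much work a Lambda-backed service takes on is reserved concurrency. Here is a minimal boto3 sketch; the function name and limit are illustrative assumptions:

```python
import boto3

lambda_client = boto3.client("lambda")

# Cap concurrent executions so the service never takes on more work than it
# was sized for; invocations beyond this limit are throttled.
lambda_client.put_function_concurrency(
    FunctionName="order-processing",     # hypothetical function name
    ReservedConcurrentExecutions=50,
)
```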
The other method for building resilient systems is queueing.
When there is a surge in traffic, the queue gets bigger. The problem is that when the traffic spike goes away, we are still left with backlogged items in the queue. The bigger the queue, the farther we get from real-time processing of jobs, which can be bad for time-sensitive applications. An example is a calendar invite for an important, urgent event happening in 15 minutes: because the queue had a large backlog, the invite did not get sent to the invitees until after the event was supposed to start.
One way of mitigating this issue is by using priority queues. Of course, this works only when we have a good distribution of priorities across jobs; if all jobs have the same priority, this behaves like a normal queue.
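As a minimal sketch of the idea (the queue URLs are hypothetical), a producer can route urgent jobs to a separate high-priority SQS queue that workers drain first:

```python
import json
import boto3

sqs = boto3.client("sqs")

# Hypothetical queue URLs; workers poll the high-priority queue before the
# low-priority one, so urgent jobs do not sit behind a large backlog.
HIGH_PRIORITY_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/jobs-high"
LOW_PRIORITY_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/jobs-low"

def enqueue(job, urgent=False):
    sqs.send_message(
        QueueUrl=HIGH_PRIORITY_URL if urgent else LOW_PRIORITY_URL,
        MessageBody=json.dumps(job),
    )
```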
The other mechanism used to prevent the queue from growing beyond acceptable limits is backpressure, or throttling. You can use API Gateway to configure throttling limits, and you can combine priority queues with throttling to get the best of both worlds.
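A minimal boto3 sketch of attaching throttling limits through an API Gateway usage plan; the API id, stage, and rate/burst values are assumptions:

```python
import boto3

apigw = boto3.client("apigateway")

# Requests beyond these limits are rejected with 429, applying backpressure
# before the backlog grows out of control.
apigw.create_usage_plan(
    name="standard-clients",
    throttle={
        "rateLimit": 100.0,   # steady-state requests per second
        "burstLimit": 200,    # maximum burst size
    },
    apiStages=[{"apiId": "a1b2c3d4e5", "stage": "prod"}],  # hypothetical API
)
```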
AWS Lambda allows for asynchronous invocation and uses an internal queue to buffer the jobs.
For asynchronous invocation, Lambda places the event in a queue and returns a success response without additional information. A separate process reads events from the queue and sends them to your function. You can check out the asynchronous invocation configuration API for more details.
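A minimal boto3 sketch of an asynchronous invocation; the function name and payload are hypothetical:

```python
import json
import boto3

lambda_client = boto3.client("lambda")

# With InvocationType="Event", Lambda queues the event and returns immediately;
# a separate process delivers it to the function later.
response = lambda_client.invoke(
    FunctionName="send-calendar-invite",           # hypothetical function name
    InvocationType="Event",                        # asynchronous invocation
    Payload=json.dumps({"invite_id": "abc-123"}),  # hypothetical payload
)
print(response["StatusCode"])  # 202 when the event was accepted
```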
As an example of queued execution, it is common to trigger a Lambda function when an object is uploaded to an S3 bucket.
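A minimal sketch of what such an S3-triggered handler might look like; the per-object processing step is a placeholder:

```python
def handler(event, context):
    # S3 "object created" notifications arrive as a list of records.
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Placeholder: do the bounded, per-object work here.
        print(f"Processing s3://{bucket}/{key}")
```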
The other mechanism used to build resiliency into services is shuffle sharding. It is a big topic in itself. AWS Sr. Principal Engineer Colm MacCarthaigh has published a nice article explaining shuffle sharding, and I would highly recommend reading it here.
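To give a rough flavor of the idea (this is a simplified sketch, not AWS's implementation): each customer is deterministically mapped to a small subset of workers, so two customers rarely share their entire set and a single noisy customer cannot exhaust the whole fleet. The worker count and shard size below are illustrative assumptions.

```python
import hashlib
import random

WORKERS = [f"worker-{i}" for i in range(8)]  # hypothetical fleet
SHARD_SIZE = 2                               # workers assigned per customer

def shuffle_shard(customer_id):
    """Pick a stable, pseudo-random subset of workers for this customer."""
    seed = int(hashlib.sha256(customer_id.encode()).hexdigest(), 16)
    return random.Random(seed).sample(WORKERS, SHARD_SIZE)

# With 8 workers taken 2 at a time there are 28 distinct shards, so two
# customers may overlap on one worker but rarely on their full set.
print(shuffle_shard("customer-a"))
print(shuffle_shard("customer-b"))
```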
Finally, we need the right tools for operating and monitoring our services to ensure better resiliency. Here are a few key tools that we should be familiar with.
AWS CloudWatch Contributor Insights.
Hope this article helped provide an overview into how resilient systems are built using AWS technologies. I have tried to keep it high level. If you have any specific queries feel free to DM me.
Cheers!