Failure Engineering - API Edition

Introduction

The smallest crack in a mighty dam can bring it down. APIs are like that small crack: foundational pieces of every architecture. These omnipresent building blocks, whether hidden in monoliths, microservices, or an orchestra of edge workers, need our attention. We will discuss strategies to deploy when building APIs for high scale.

These are good practices in general, but they quickly become mandatory when you do anything noteworthy at scale. By the end of this blog, you will know what Failure Engineering is and how it applies to thinking about APIs.

This blog came out of a keynote; the slides are here.

What is Failure Engineering?

While most engineering starts and ends with the "happy path", failure engineering is all about focusing obsessively on what will fail. It's about treating failure cases as first-class citizens, solved with the same intensity as the happy path.

Most teams handle some shades of failure cases; to do failure engineering, however, one must obsess over how things might fail.

We'll talk about some high-level techniques here, but it really requires a focused sit-down on failure.

  • Think about adverse test data: not just edge cases, but clients flat-out breaking contracts. Inject some chaos into parameter exchanges (see the sketch after this list).
  • Understand customer journeys, map those "waterfalls", and then identify points of failure and retries along that path.
  • Paranoia is good; pick out highly improbable scenarios that might occur.
  • Failures, imagined or injected, need not be exotic; you will find that the simplest use-cases bring you down.
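
To make the first bullet concrete, here is a minimal sketch of chaos in parameter exchanges: take a known-good payload and mutate one field at a time into something contract-breaking. The endpoint, payload, and field values are hypothetical assumptions for illustration; the point is that the API should answer with a clean 4xx, never a 5xx.

```python
import random
import requests  # assumed HTTP client; any client works

# Hypothetical endpoint and payload; swap in your own API contract.
BASE_URL = "https://api.example.com/v1/orders"
VALID_PAYLOAD = {"user_id": 42, "sku": "ABC-123", "quantity": 1}

def mutate(payload: dict) -> dict:
    """Return a copy of the payload with one contract-breaking mutation."""
    broken = dict(payload)
    field = random.choice(list(broken))
    broken[field] = random.choice([None, "", -1, "A" * 10_000, {"nested": "junk"}])
    return broken

def chaos_run(iterations: int = 100) -> None:
    for _ in range(iterations):
        resp = requests.post(BASE_URL, json=mutate(VALID_PAYLOAD), timeout=5)
        # A well-engineered API answers bad input with a clean 4xx, never a 5xx.
        assert resp.status_code < 500, f"server fell over: {resp.status_code}"

if __name__ == "__main__":
    chaos_run()
```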

An effective tool for Failure Engineering is to pick your adverse case, magnify it, say, 50-100x, and see whether your system holds up. This technique will lead to a fair bit of resiliency conversation, and it will force you to focus on what is really critical, rather than trying to keep everything running.
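
A minimal sketch of that amplification idea follows, assuming a hypothetical adverse endpoint and a made-up baseline rate; the URL, numbers, and pacing are illustrative, and a real exercise would use a proper load-testing tool.

```python
import asyncio
import aiohttp  # assumed async HTTP client

# Hypothetical adverse scenario: a burst of cache-busting reads on one endpoint.
ADVERSE_URL = "https://api.example.com/v1/playback?nocache=1"
BASELINE_RPS = 20       # assumed observed rate of the adverse case in production
AMPLIFICATION = 100     # magnify 50-100x and watch what breaks first

async def fire(session: aiohttp.ClientSession) -> int:
    async with session.get(ADVERSE_URL) as resp:
        return resp.status

async def run_for(seconds: int = 60) -> None:
    async with aiohttp.ClientSession() as session:
        for _ in range(seconds):
            # Fire the amplified batch each second (a sketch; real tools pace better).
            batch = [fire(session) for _ in range(BASELINE_RPS * AMPLIFICATION)]
            statuses = await asyncio.gather(*batch, return_exceptions=True)
            errors = sum(1 for s in statuses if isinstance(s, Exception) or s >= 500)
            print(f"errors this second: {errors}/{len(statuses)}")
            await asyncio.sleep(1)

if __name__ == "__main__":
    asyncio.run(run_for())
```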

APIs - What do we pay attention to?

Classically, when we deploy, test, and monitor APIs, most teams will come up with the following list.

  • Status codes - 2xx / 4xx / 5xx
  • Latency - are you meeting your self-prescribed p90 response times? Look at percentiles, never averages, for this metric (a small sketch follows this list).
  • Resource utilization by pods / machines
  • Scaling ladders - when traffic grows, how should one tackle auto-scaling or manual scaling?
  • Performance testing (vertical) - scaling only your service, for your own use-cases
  • Endurance testing - the behaviour of your service over long periods of time, to see if there is a gradual degradation leading to eventual, slow failures.
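
On the latency point, here is a tiny illustration of why percentiles matter; the sample values are made up for the example.

```python
import statistics

# Hypothetical latency samples (ms): mostly fast responses plus a slow tail.
latencies_ms = [40 + i % 30 for i in range(95)] + [900, 950, 1000, 1100, 1200]

avg = statistics.mean(latencies_ms)
# quantiles(n=100) returns the 1st..99th percentile cut points.
cuts = statistics.quantiles(latencies_ms, n=100)
p90, p99 = cuts[89], cuts[98]

# The average (~100 ms) looks healthy against, say, a 150 ms target,
# yet the p99 shows the slowest requests waiting over a second.
# Averages hide the tail; percentile targets expose it.
print(f"avg={avg:.0f}ms p90={p90:.0f}ms p99={p99:.0f}ms")
```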

The above is a good list and should tackle most failures. However, it leaves the last mile lacking. Let's talk about the last mile.

APIs - What do we miss?

This post is all about what most teams learn in their RCA calls after hard failures in production. So, what's missing from the stellar list in the previous section?


Problem - Focus on a single service, not the system

Conway's Law will ensure that your service teams focus only on the services assigned to them. You are in especially fine trouble with "shared" or "legacy" services, where some good samaritans wrote the service and are no longer around, or where the service's concern grew large enough that multiple teams started to contribute.

Now you are dealing with orphaned services, or let's call them "step-services", since you can mandate that someone keep an eye on them even if they don't own all of the code, or any of it.

In any case, most teams ensure only that their own service works. Practically, it takes a focus on DevX to ensure that testing is not incredibly painful: creating test data, test environments, and data stores that play well with dev deployments.

Even when all of that exists, teams simply run out of time to take their integration tests beyond the happy path. Now you're left with individual silos that scale and work in isolation, but fail spectacularly when they try to operate in tandem.

The easiest and quickest failures most teams will find are resource exhaustion as multiple services scale at the same time, and data-store hot-spots.

So, testing in concert is NOT optional. If you are not doing rigorous horizontal tests, scaling to meet end-to-end customer journeys in concert with other cooperating services, then you will find failure quickly when production traffic rises.


Problem - Focus on Uptime, not Serviceability

During a panel discussion, one of my co-panelists pointed out very pithily that we all tend to focus on uptime, when we need to start focusing on "serviceability". This is the most eloquent framing of this smell: just because your service is UP does not mean it's doing its job.

In most incidents, when you bring the parties together on an incident response call, how often have all service owners said: my service is fine, my latencies are good, and my response-code splits are OK too?

However, when you ask whether the service's traffic levels are within band, you find that it isn't getting the right level of traffic for a variety of reasons. Therefore, it's important to watch traffic levels along with the usual metrics, so that at all times you are asking whether clients are continuing to harvest the benefit that the service provides.
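
A minimal sketch of such a "serviceability" check, comparing observed traffic against an expected band; the band, tolerance, and numbers are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class TrafficBand:
    """Expected request rate for a service at this time of day (hypothetical numbers)."""
    expected_rpm: float
    tolerance: float = 0.3  # allow +/-30% before we call it anomalous

def serviceability_check(observed_rpm: float, band: TrafficBand) -> str:
    lower = band.expected_rpm * (1 - band.tolerance)
    upper = band.expected_rpm * (1 + band.tolerance)
    if observed_rpm < lower:
        # The service may be "up" with clean 200s, yet clients are not reaching it.
        return f"ALERT: traffic {observed_rpm:.0f} rpm below band [{lower:.0f}, {upper:.0f}]"
    if observed_rpm > upper:
        return f"ALERT: traffic {observed_rpm:.0f} rpm above band [{lower:.0f}, {upper:.0f}]"
    return "OK: traffic within band"

# Example: uptime and latency look fine, but callers upstream are failing silently.
print(serviceability_check(observed_rpm=1_200, band=TrafficBand(expected_rpm=10_000)))
```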


Problem - All services have to run

Life is all about compromises, but system architects sometimes don't apply real-life lessons to machines. Graceful degradation is a very valuable construct when it comes to scale design. Some services can die; some services must die.

Culturally, this is tough to get the group to agree to because of Conway's Law. Teams are incentivised to ensure that their service is awesome and can never die. This fight alone causes so many failures.

The technique to use here is to ruthlessly prioritise as a CTO or Chief Architect. Write down the top three things that can never die, in rank order of failure. For example, in a video streaming service, the top three things for us are, in this order:

  • Video must play
  • Ad monetisation must work
  • Subscriptions must work

When you frame the problem like this, then, as scale rises, everything that does not serve these three items, in this order, will be shut down at a specific RPS / RPM, or whatever metric you follow. So, if everything is burning, we will make sure that at least video plays, while everything else is gracefully failed.
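
A sketch of what such a degradation ladder could look like in code; the feature names and RPS thresholds are hypothetical, and the priorities follow the video-streaming list above.

```python
# Hypothetical degradation ladder: feature -> RPS threshold beyond which it is shed.
# Video playback has no threshold; it is the last thing standing.
DEGRADATION_LADDER = [
    ("recommendations", 50_000),
    ("watch_history",   80_000),
    ("subscriptions",  120_000),
    ("ad_monetisation", 200_000),
    ("video_playback",  None),   # never shed
]

def enabled_features(current_rps: int) -> list[str]:
    """Return the features that should still be served at this load."""
    keep = []
    for feature, shed_above in DEGRADATION_LADDER:
        if shed_above is None or current_rps <= shed_above:
            keep.append(feature)
    return keep

print(enabled_features(60_000))   # recommendations already shed
print(enabled_features(500_000))  # only video_playback survives
```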


APIs - The Remaining Portions

Let's look at strategies that will help you ask these failure questions and build for failure.

  • Observability - where is it smoking?
  • Causality with topology - it's smoking here but broken there
  • Lazy origins - do less at the origin, create run-offs
  • Escape hatches - what's your plan B, C, X?


Strategy - Not Just Observability

A ton has been written about observability. The key goal of your observability is to be able to use it in an incident.

If you aren't using your observability tooling to

a) catch problems before your customers report them, and

b) reduce your MTTR (Mean Time To Recovery),

then you aren't doing it right.

Getting your entire topology "observed" is critical; oftentimes, it's not broken where it's smoking. Topologies should be easily visualised, with indicators of where the "flow" is broken or problematic. This helps you quickly narrow down where potential problems are, and can save valuable time in the recovery process.

Even before you get into observability products, a basic monitoring and alerting program is a must. The hard journey here is to review your topology and ensure that the right monitors are in place and that threshold values for alerts are defined. Without a doubt, each of these monitors will go through serious "noise" issues. Your team will need to persist through the noisy alerts and find a way to tune them to the right thresholds; resist the urge to turn them off!

Monitors are in place, alerts are in place, topologies are visualised; now the task at hand is to coach your teams through "fire-drills": what to do when an incident occurs, who responds, who "runs" the incident, and so on. Tools are meaningless if the teams leveraging them in an incident are not deeply familiar with them. Writing down and executing an incident SOP becomes critical.

A huge gap in your observability stack is assuming that your vendors are all working flawlessly. Vendor systems are brittle and will break before your systems do. Keep this paranoid mindset and ensure that you have some manner of visibility into all your vendors. This will force you to inform them when you have scale events, and to engage in conversations about their incident response and observability practices. It is always better to know the level of readiness you can expect from vendors when bad things happen.


Strategy - Protect Your Origin

Your origin strategy should be that of a snob. Service teams should regularly inspect the requests they are serving and answer why their origin should be serving those requests. Cache offload is a legitimate metric to monitor and alert on. Service owners believe that the cloud is infinitely scalable (it's not), and even if it were, they believe that the organisation has infinite money to spend on cloud (it does not). Purely on the basis of these two facts, one must be absolutely frugal when it comes to scaling the origin to answer client questions.
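
A minimal sketch of treating cache offload as a first-class metric; the counters and the 90% target are assumptions for illustration.

```python
# Hypothetical per-minute counters, e.g. derived from CDN logs and origin access logs.
edge_hits = 940_000      # requests answered by the CDN / cache layer
origin_hits = 60_000     # requests that made it through to the origin

OFFLOAD_SLO = 0.90       # assumed target: 90% of traffic never touches the origin

offload = edge_hits / (edge_hits + origin_hits)
print(f"cache offload: {offload:.1%}")

if offload < OFFLOAD_SLO:
    # Alert: the origin is answering questions the cache should be answering.
    print(f"ALERT: offload below {OFFLOAD_SLO:.0%}, inspect what is leaking to origin")
```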

At most organisations, security is looked at as a compliance item, or one of the last things to put in place after analytics. Bad idea. Security might be boring, but leveraging it as a way to reject traffic also works very well as an offensive strategy to keep your origin protected. A common pattern here is to "wall" your origin with a CDN, so that no client endpoint, legitimate or malicious, can talk to your origin directly. This alone absolves your origin of a lot of boilerplate heavy-lifting and gives your team time back to focus on business logic and other goodness.

Runaway trucks on a downhill slope are a real-world problem. If you have driven on highways, you will sometimes notice signs for a "Runaway Truck Ramp" or "Truck Runoff": an uphill section beside a downhill highway, intended to slow a speeding truck down safely.

APIs need run-offs too. Most services tackle this with surge queues when requests start to stack up. While that strategy is good and helps take the heat off after you breach certain thresholds, we use a strategy called "panic handling". Panic handling is about embracing mocks in production. Effectively, you "can" a version of the API response when things are not heated, and if your origin goes down or is overwhelmed, you temporarily route surplus traffic to this panic response, served from the CDN. Nobody is the wiser, except you. This strategy alone has helped MTTR in numerous incidents. It's not free, and there are client-side implications, but once you start to think like this, the strategy is far more potent in most use-cases than its cons.
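
A minimal sketch of the panic-handling idea, assuming a hypothetical origin URL and canned payload; in a real deployment the canned response would be pre-positioned on the CDN and served at the edge rather than inside application code.

```python
import json
import requests  # assumed HTTP client

# Hypothetical canned ("panic") response, generated during calm periods and
# refreshed so it can be served without touching the origin.
PANIC_RESPONSE = json.dumps(
    {"homepage_rails": ["continue_watching", "trending"], "stale": True}
)

ORIGIN_URL = "https://origin.example.com/v1/homepage"

def fetch_homepage() -> str:
    """Ask the origin; if it is down or overwhelmed, serve the canned response."""
    try:
        resp = requests.get(ORIGIN_URL, timeout=2)
        if resp.status_code == 200:
            return resp.text
    except requests.RequestException:
        pass
    # Origin error, timeout, or overload (e.g. 5xx / 429): the client still gets
    # a slightly stale but well-formed answer, and nobody is the wiser.
    return PANIC_RESPONSE
```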

You can do every strategy described herein; still, sometimes the most time-tested patterns are forgotten. Pre-warming has existed forever. Anybody who has built caches has heard of this technique and has hopefully applied it at some point. Don't let customers "warm" your caches: sometimes initialisations take longer than expected, create chains, or some minor wobble turns into a storm. Do whatever you can to be ready for your customers when they arrive. To take a store analogy: on the day of a big sale, you don't start with empty shelves, do you?
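
A minimal pre-warming sketch, assuming a hypothetical list of hot paths and CDN hostname; the cache-status header name varies by provider.

```python
import requests  # assumed HTTP client

# Hypothetical list of the hottest cacheable endpoints for the upcoming event.
HOT_PATHS = [
    "/v1/homepage",
    "/v1/match/today/scorecard",
    "/v1/playback/config",
]
EDGE_BASE = "https://cdn.example.com"

def prewarm() -> None:
    """Prime the edge caches before the doors open, instead of letting the
    first wave of customers pay the cache-miss penalty."""
    for path in HOT_PATHS:
        resp = requests.get(EDGE_BASE + path, timeout=5)
        cache_status = resp.headers.get("X-Cache", "unknown")  # header name varies by CDN
        print(f"{path}: {resp.status_code} cache={cache_status}")

if __name__ == "__main__":
    prewarm()
```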


Strategy - When to CDN?

As a high-scale designer, if you are not considering leveraging CDNs, I would urge you to reconsider. You might still not use one; yes, there are some high-performance use-cases where it doesn't fit. But if you can afford it (and one hopes that if you have the scale you are expecting, you can), then it's a great weapon in your arsenal.

Nothing is easy or free. Leveraging CDNs effectively requires a strong understanding of the underlying CDN; at no point is the system designer absolved of understanding basic caching rules and strategies. TTLs and evictions are as certain as death and taxes.

Primarily, I recommend using CDNs as shock absorbers. Take unwanted traffic away from my origin. Slow down traffic to my origin. Answer my repetitive questions. Build presence in areas where my customers are, so they are not enduring long round trips. We have also used CDNs to inject behaviours in the mid-tiers. Increasingly, you can solve interesting use-cases at the edge, which lines up well with the "don't go to origin" pattern.

Certain high-performance use-cases should skip leveraging CDNs: every "block" adds drag, so if drag reduction is your goal, don't use a CDN. CDNs themselves fail spectacularly, and require monitoring and alerting as well, in conjunction with your provider.


Scale Failure Scenarios

Speaking of spectacular failures, here are some of the common scale failure scenarios.


Row-of-houses fires - Cooperating APIs fail, mid-tiers fail

Why do builders keep a gap between adjacent homes? To break the spread of a fire if one starts, and perhaps for some earthquake resiliency as well. APIs are no different. Requests are intertwined by nature. This is where topology helps: the fire is not usually where the smoke is. Finding causality helps to triage which service is at fault and causing stress upstream. Spend time looking through sequence diagrams so that appropriate circuit breakers are put in place, and each service can protect its cooperating services from damage. Create that "gap".
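
A minimal circuit-breaker sketch to illustrate the "gap"; the thresholds are arbitrary, and a production system would typically rely on a maintained library or service-mesh feature rather than hand-rolled code.

```python
import time
from typing import Any, Callable, Optional

class CircuitBreaker:
    """Fail fast towards a struggling downstream instead of piling on load."""

    def __init__(self, max_failures: int = 5, reset_after_s: float = 30.0):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                # Open: reject immediately, protecting the cooperating service.
                raise RuntimeError("circuit open: downstream is being protected")
            # Half-open: let one trial request through.
            self.opened_at = None
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip (or re-trip) the breaker
            raise
        self.failures = 0  # success closes the circuit again
        return result
```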


Dam bursts - Complete failure coupled with untested "side-mitigations" brings a tsunami

Sometimes you will just have hard failures. Systems fail to come up. Well-meaning resiliencies that worked in the lab fail spectacularly in production. While these are rare, they are incredibly hard to recover from. If you are using CDNs, these can be mid-tier failures that simply send a slew of traffic back to the layers above them, including your origin. It's usually all over in minutes, meaning everything burns to the ground. Dangerous.

One way to mitigate this sort of "dam burst" is to identify what cooperating layers do when there are catastrophic failures. Exercising a dam burst on paper allows you to ask questions that you might otherwise only ask in an RCA after a very public failure. You might still have one, but a paper exercise gives you a shot at a faster recovery.


Poor cache strategies - Don't forget the basics

TTLs and evictions are as certain as death and taxes; that line is worth repeating. Simply adding a CDN or a cache does not give you scalability. Understanding innately what your caching strategy is, and how TTLs telescope through the layers of the system, is very important. Take the time to look at response TTLs and why they are what they are.

Using an extreme questioning technique might help. Pick a service and ask why the TTLs can't all be increased by 24h; it leads to lots of interesting conversations. Emotional TTLs are a thing.
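
A back-of-the-envelope sketch of how TTLs telescope; the layers and values are hypothetical, and the worst case assumes each layer can serve its cached copy right up to expiry.

```python
# Hypothetical TTLs (seconds) at each layer of a read path.
TTLS = {
    "client_cache": 60,
    "cdn_edge": 300,
    "cdn_mid_tier": 300,
    "origin_app_cache": 120,
}

# When TTLs telescope, each layer can hand out a response it cached just before
# expiry, so worst-case staleness is the sum of the layers, not the maximum.
worst_case_staleness = sum(TTLS.values())
print(f"worst-case staleness: {worst_case_staleness}s (~{worst_case_staleness / 60:.0f} min)")

# The "extreme question": what actually breaks if every layer added 24h?
for layer in TTLS:
    print(f"why can't {layer} hold this response for another 24h?")
```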


Failure Patterns - Takeaways

If you skipped everything and got here to see how it all ends, here are some parting patterns to take away.


Delegate Obsessively - Be Selective at the Origin

Cloud capacity is not infinite. No matter how good your service engineers and your DevOps talent are, your origins will go down. Be lazy, be snobby: don't let your origin do more than it really needs to do. Ringfence it, talk only to known actors, and even with them, create buffers.


Facades, Facades - Nobody needs to know your origin is down

Create escape hatches. If you were asked to design a system in which all your origins had crashed, you would come up with ways to run most of your product with some deftness. Especially in a scale event, when customers are funnelling into ONE thing only, this is a very useful pattern.


Cache Mandatorily - DRY

Yes, caching adds another block. It adds cost, coordination, latency. Yes. But unless you are targeting a very focused use-case where you cannot cache and you require the utmost performance, seriously consider why you would not add a caching block that takes traffic away from your origin. DRY is worth repeating to yourself.


Ear to the ground - Monitor as close to bare metal as you can, monitor mindfully

Any monitoring, APMs included, that does not tell you the moment smoke emanates is not valuable in production: delayed monitoring leads to delayed alerting, which leads to delayed MTTR. What you monitor also matters. Spray and pray does not work. Understand your services, classify each one by its function, then ask what the most important thing it does is. Serviceability, not uptime, is the thing.
