Don’t depend on good luck for reliability! Use these best practices instead.
How-tos and best practices
Not too long ago, a customer came to us with a high-value use case. The customer, a major apparel company with retail and e-commerce applications, needed to prove that a critical service of their payment applications could failover correctly between regions in case of an outage.
Fortunately, we had a solution. Using Failure Flags, they were able to get up and running and accurately run the failover test in less than 30 minutes.
Check out the blog to see how it all went down!
The AI (artificial intelligence) landscape wouldn’t be where it is today without AI-as-a-service (AIaaS) providers like OpenAI, AWS, and Google Cloud. These companies have made running AI models as easy as clicking a button.
However, under the shiny exterior, AI models are like any other software dependency: they can crash, throw errors, and lose network connectivity. Worse, these failures can cascade through the rest of your application and cause unexpected impacts.
How do you prepare for this, and more importantly, how do you verify that you can handle AI failures before they impact your customers? We’ll answer all of these questions in this blog.
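One way to verify your handling before an outage does it for you is to wrap the AI call in a fallback path and deliberately inject a failure in testing. The sketch below is a minimal illustration of that idea, not Gremlin's Failure Flags API; `call_model`, `answer`, and the `inject_failure` toggle are all hypothetical names standing in for a real provider call and a real fault-injection mechanism.

```python
class ModelUnavailableError(Exception):
    """Raised when the (hypothetical) AI provider fails."""

def call_model(prompt, inject_failure=False):
    # Stand-in for a real AIaaS request (e.g., an HTTP call to a provider).
    # The inject_failure flag simulates an outage so the fallback can be tested.
    if inject_failure:
        raise ModelUnavailableError("simulated provider outage")
    return f"model response to: {prompt}"

def answer(prompt, inject_failure=False):
    # Degrade gracefully instead of letting the failure cascade upward.
    try:
        return call_model(prompt, inject_failure=inject_failure)
    except ModelUnavailableError:
        return "Sorry, suggestions are temporarily unavailable."

# Happy path:
print(answer("recommend a jacket"))
# Exercise the fallback path on purpose, before a real outage does:
print(answer("recommend a jacket", inject_failure=True))
```

In a production setup, the injection toggle would come from a fault-injection tool rather than a function argument, but the test is the same: force the dependency to fail and assert that your application still responds sensibly.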
One of the biggest causes of outages and incidents is good old-fashioned human error. Despite all of our best intentions, we can still make mistakes. Fortunately, reliability testing can help you catch these errors before they cause outages. But we’ve recently seen the rise of a different source of failures: AI errors.
That doesn’t mean you have to avoid AI agents, but it is another variable you need to account for in your testing. Fortunately, the three best practices in this blog will help you catch AI (and human) errors before they cause outages or incidents that impact your customers.
It’s one thing to trust that your systems are set up correctly to be resilient to failure, but it’s another to have the systems and processes in place to know that they’re reliable. That’s where Chaos Engineering, Reliability Management, and Gremlin come in.
Rather than hoping and praying that systems don’t crash, reliability testing helps you know which services are resilient to failure—and which ones aren’t. In this short video, Gremlin Principal Engineer Sam Rossoff explains how this approach led to Chaos Engineering and Reliability Management.
——
Now GA!
Test serverless and application-level reliability with Failure Flags
It’s been a year and a half since Failure Flags was released. Since then, customers have used Failure Flags to run thousands of tests for applications running on serverless, container, and service meshes. (Check out this blog post to see how easy it was for a major retailer to set up and test a critical service on AWS Lambda in less than 30 minutes.)
We’ve also been hard at work improving Failure Flags’ capabilities and ease of use, which is why it’s time to officially announce that it’s out of Beta!
Read this blog post to take a look at Failure Flags, how it works, and some of the significant improvements from the last year.
——
AWS Webinar
ON-DEMAND
With more companies integrating GenAI into their systems and products, it’s essential to make sure GenAI workloads and applications are highly available to deliver an exceptional user experience.
But how do you actually do that? How is it different from standard resilience efforts, and how is it the same?
In this webinar with AWS and Gremlin, we’ll go over how customers are using GenAI workloads on AWS, how the reliability pillar best practices of the Well-Architected Framework apply, and what you can do to improve the resilience and uptime of your GenAI-related workloads.
——
Office Hours
Upcoming!
DATE: April 17th TIME: 11am PT/2pm ET
Most Kubernetes clusters have reliability risks lurking just below the surface. You could spend hours or even days manually finding these risks, but what if someone could find them for you?
In this Office Hours session, you’ll learn how Gremlin uses automatic risk detection to scan your Kubernetes clusters for reliability risks. You’ll also learn where to find your risks in the Gremlin web app, strategies for resolving risks, and how to generate a risk report for leadership.
ON-DEMAND
Reliability testing is ongoing work, and tracking it can be difficult in large organizations. According to our own product metrics, teams run an average of 200 to 500 tests each day! With so much happening, it’s hard to keep track of everything going on—unless you use Gremlin.
In this Office Hours session, you’ll learn how to track your reliability work using Gremlin’s “Now Running,” “What Ran,” and “What’s Scheduled” screens. You’ll learn what data each screen provides, how to access them, and how to use them to manage ongoing testing activities.
——