Don’t depend on good luck for reliability! Use these best practices instead.

How-tos and best practices

How a major retailer tested critical serverless systems with Failure Flags

Not too long ago, a customer came to us with a high-value use case. The customer, a major apparel company with retail and e-commerce applications, needed to prove that a critical service in their payment applications could fail over correctly between regions in case of an outage.

Fortunately, we had a solution. Using Failure Flags, they got up and running and completed an accurate failover test in less than 30 minutes.

Check out the blog to see how it all went down!

Simulating artificial intelligence service outages with Gremlin

The AI (artificial intelligence) landscape wouldn’t be where it is today without AI-as-a-service (AIaaS) providers like OpenAI, AWS, and Google Cloud. These companies have made running AI models as easy as clicking a button.

However, under the shiny exterior, AI models are like any other software dependency: they can crash, throw errors, and lose network connectivity. Worse, these failures can cascade through the rest of your application and cause unexpected impacts.

How do you prepare for this, and more importantly, how do you verify that you can handle AI failures before they impact your customers? We’ll answer all of these questions in this blog.

Three reliability best practices when using AI agents for coding

One of the biggest causes of outages and incidents is good old-fashioned human error. Despite all of our best intentions, we can still make mistakes. Fortunately, reliability testing can help you catch these errors before they cause outages. But we’ve recently seen the rise of a different source of failures: AI errors.

That doesn’t mean you have to avoid AI agents, but it is another variable you need to account for in your testing. Fortunately, the three best practices in this blog will help you catch AI (and human) errors before they cause outages or incidents that impact your customers.

VIDEO: Application reliability: Do you think or do you know?

It’s one thing to trust that your systems are set up correctly to be resilient to failure, but it’s another to have the systems and processes in place to know that they’re reliable. That’s where Chaos Engineering, Reliability Management, and Gremlin come in.

Rather than hoping and praying that systems don’t crash, reliability testing helps you know which services are resilient to failure and which ones aren’t. In this short video, Gremlin Principal Engineer Sam Rossoff explains how this approach led to Chaos Engineering and Reliability Management.

——

Now GA!

Test serverless and application-level reliability with Failure Flags

It’s been a year and a half since Failure Flags was released. Since then, customers have used Failure Flags to run thousands of tests for applications running on serverless platforms, containers, and service meshes. (Check out this blog post to see how easy it was for a major retailer to set up and test a critical service on AWS Lambda in less than 30 minutes.)

We’ve also been hard at work improving Failure Flags’ capabilities and ease of use, which is why it’s time to officially announce that it’s out of beta!

Read this blog post to take a look at Failure Flags, how it works, and some of the significant improvements from the last year.

——

AWS Webinar

Improving Resilience for GenAI Workloads on AWS

ON-DEMAND

With more companies integrating GenAI into their systems and products, it’s essential to make sure GenAI workloads and applications are highly available to deliver an exceptional user experience.

But how do you actually do that? How is it different from standard resilience efforts, and how is it the same?

In this webinar with AWS and Gremlin, we’ll go over how customers are using GenAI workloads on AWS, how the reliability pillar best practices of the Well-Architected Framework apply, and what you can do to improve the resilience and uptime of your GenAI-related workloads.

REGISTER NOW

——

Office Hours

Upcoming!

How to find Kubernetes reliability risks with Gremlin

DATE: April 17th TIME: 11am PT/2pm ET

Most Kubernetes clusters have reliability risks lurking just below the surface. You could spend hours or even days manually finding these risks, but what if someone could find them for you?

In this Office Hours session, you’ll learn how Gremlin uses automatic risk detection to scan your Kubernetes clusters for reliability risks. You’ll also learn where to find your risks in the Gremlin web app, strategies for resolving risks, and how to generate a risk report for leadership.

REGISTER HERE

How to keep track of what’s running in your Gremlin team

ON-DEMAND

Reliability testing is ongoing, and tracking that work can be difficult in large organizations. According to our own product metrics, teams run an average of 200 to 500 tests each day! With so much happening, it’s hard to keep track of everything going on—unless you use Gremlin.

In this Office Hours session, you’ll learn how to track your reliability work using Gremlin’s “Now Running,” “What Ran,” and “What’s Scheduled” screens. You’ll learn what data each screen provides, how to access them, and how to use them to manage ongoing testing activities.

WATCH NOW

——

