Don’t depend on good luck for reliability! Use these best practices instead.
How-tos and best practices
Not too long ago, a customer came to us with a high-value use case. The customer, a major apparel company with retail and e-commerce applications, needed to prove that a critical service of their payment applications could failover correctly between regions in case of an outage.
Fortunately, we had a solution. Using Failure Flags, they were able to get up and running and accurately run the failover test in less than 30 minutes.
Check out the blog to see how it all went down!
The AI (artificial intelligence) landscape wouldn’t be where it is today without AI-as-a-service (AIaaS) providers like OpenAI, AWS, and Google Cloud. These companies have made running AI models as easy as clicking a button.
However, under the shiny exterior, AI models are like any other software dependency: they can crash, throw errors, and lose network connectivity. Worse, these failures can cascade through the rest of your application and cause unexpected impacts.
How do you prepare for this, and more importantly, how do you verify that you can handle AI failures before they impact your customers? We’ll answer all of these questions in this blog.
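One way to verify your handling before an outage does it for you is to wrap the AI call in a fallback path and deliberately inject a failure in testing. The sketch below is a minimal illustration of that idea, not Gremlin's Failure Flags API; `call_model`, `answer`, and the `inject_failure` toggle are all hypothetical names standing in for a real provider call and a real fault-injection mechanism.

```python
class ModelUnavailableError(Exception):
    """Raised when the (hypothetical) AI provider fails."""

def call_model(prompt, inject_failure=False):
    # Stand-in for a real AIaaS request (e.g., an HTTP call to a provider).
    # The inject_failure flag simulates an outage so the fallback can be tested.
    if inject_failure:
        raise ModelUnavailableError("simulated provider outage")
    return f"model response to: {prompt}"

def answer(prompt, inject_failure=False):
    # Degrade gracefully instead of letting the failure cascade upward.
    try:
        return call_model(prompt, inject_failure=inject_failure)
    except ModelUnavailableError:
        return "Sorry, suggestions are temporarily unavailable."

# Happy path:
print(answer("recommend a jacket"))
# Exercise the fallback path on purpose, before a real outage does:
print(answer("recommend a jacket", inject_failure=True))
```

In a production setup, the injection toggle would come from a fault-injection tool rather than a function argument, but the test is the same: force the dependency to fail and assert that your application still responds sensibly.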
One of the biggest causes of outages and incidents is good old-fashioned human error. Despite all of our best intentions, we can still make mistakes. Fortunately, reliability testing can help you catch these errors before they cause outages. But we’ve recently seen the rise of a different source of failures: AI errors.
That doesn’t mean you have to avoid AI agents, but it is another variable you need to account for in your testing. Fortunately, the three best practices in this blog will help you catch AI (and human) errors before they cause outages or incidents that impact your customers.
It’s one thing to trust that your systems are set up correctly to be resilient to failure, but it’s another to have the systems and processes in place to know that they’re reliable. That’s where Chaos Engineering, Reliability Management, and Gremlin come in.
Rather than hoping and praying that systems don’t crash, reliability testing helps you know which services are resilient to failure—and which ones aren’t. In this short video, Gremlin Principal Engineer Sam Rossoff explains how this approach led to Chaos Engineering and Reliability Management.
——
Now GA!
Test serverless and application-level reliability with Failure Flags
It’s been a year and a half since Failure Flags was released. Since then, customers have used Failure Flags to run thousands of tests for applications running on serverless, container, and service meshes. (Check out this blog post to see how easy it was for a major retailer to set up and test a critical service on AWS Lambda in less than 30 minutes.)
We’ve also been hard at work improving Failure Flags’ capabilities and ease of use, which is why it’s time to officially announce that it’s out of Beta!
Read this blog post to take a look at Failure Flags, how it works, and some of the significant improvements from the last year.
——
AWS Webinar
ON-DEMAND
With more companies integrating GenAI into their systems and products, it’s essential to make sure GenAI workloads and applications are highly available to deliver an exceptional user experience.
But how do you actually do that? How is it different from standard resilience efforts, and how is it the same?
In this webinar with AWS and Gremlin, we’ll go over how customers are using GenAI workloads on AWS, how the reliability pillar best practices of the Well-Architected Framework apply, and what you can do to improve the resilience and uptime of your GenAI-related workloads.
——
Office Hours
Upcoming!
DATE: April 17th TIME: 11am PT/2pm ET
Most Kubernetes clusters have reliability risks lurking just below the surface. You could spend hours or even days manually finding these risks, but what if someone could find them for you?
In this Office Hours session, you’ll learn how Gremlin uses automatic risk detection to scan your Kubernetes clusters for reliability risks. You’ll also learn where to find your risks in the Gremlin web app, strategies for resolving risks, and how to generate a risk report for leadership.
ON-DEMAND
Reliability testing is ongoing work, and tracking it can be difficult in large organizations. According to our own product metrics, teams run an average of 200 to 500 tests each day! With so much happening, it’s hard to keep track of everything going on—unless you use Gremlin.
In this Office Hours session, you’ll learn how to track your reliability work using Gremlin’s “Now Running,” “What Ran,” and “What’s Scheduled” screens. You’ll learn what data each screen provides, how to access them, and how to use them to manage ongoing testing activities.
——