
Tips to help you avoid your worst reliability nightmares

Gremlin

The Reliability Management Platform for high-velocity engineering teams

Published: October 21, 2024


How-tos and best practices

Best Practices for Testing Zone Redundancy

Zone redundancy is an essential part of designing resilient, reliable architectures, but how do you go from believing that you’re resilient to knowing that your systems can handle a zone going down?

In this blog from Gremlin Principal Engineer Sam Rossoff, you’ll learn best practices for setting up zone redundancy and how to verify that redundancy using Fault Injection testing.
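To illustrate what zone redundancy means in practice, the sketch below checks whether a service keeps enough capacity when any single zone goes down. It is a minimal, static check that assumes you can list the zone each instance runs in. The function name, zone names, and capacity threshold are hypothetical. A check like this only confirms what you believe about your deployment. Fault Injection testing, such as cutting off traffic to one zone, is what shows whether the service actually survives.

```python
from collections import Counter


def survives_zone_loss(instance_zones, required_capacity):
    """Return True if at least `required_capacity` instances remain
    after losing any single zone.

    instance_zones: list of zone names, one entry per running instance
    (hypothetical input; gather it from your own inventory).
    """
    per_zone = Counter(instance_zones)
    if len(per_zone) < 2:
        # A single-zone deployment cannot survive losing that zone.
        return False
    total = sum(per_zone.values())
    # Worst case: the zone hosting the most instances goes down.
    worst_loss = max(per_zone.values())
    return total - worst_loss >= required_capacity


# Four instances across three zones, with two needed to serve traffic.
print(survives_zone_loss(["us-east-1a", "us-east-1a", "us-east-1b", "us-east-1c"], 2))
```

Sizing against the largest zone, not the average, matters here. Uneven placement is a common way for a "multi-zone" deployment to still fall short of capacity during a zone outage.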

Three serverless reliability risks you can solve today using Failure Flags

Fault Injection is a method of testing a system’s resilience by creating controlled failures. Most Fault Injection tools require an agent running on a host, but the design of serverless platforms makes this approach impossible.

Gremlin Failure Flags is a code-level Fault Injection solution that injects faults directly into your applications. In this blog, you’ll learn how it can help you uncover and address three common reliability risks in serverless applications.
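Here is a minimal Python sketch of code-level Fault Injection. A flag sits in application code and does nothing until an experiment activates it. This is not the Gremlin Failure Flags SDK. `failure_flag`, `ACTIVE_EXPERIMENTS`, `InjectedFault`, and the experiment fields are hypothetical names chosen for illustration. A real implementation would read experiment state from a control plane, not from a module-level dictionary.

```python
import random
import time

# Hypothetical in-memory experiment registry (NOT the Gremlin API).
# Maps flag name -> fault settings.
ACTIVE_EXPERIMENTS = {}


class InjectedFault(Exception):
    """Raised when an active experiment tells a flag to fail."""


def failure_flag(name):
    """Apply any active fault registered for `name`; otherwise do nothing."""
    experiment = ACTIVE_EXPERIMENTS.get(name)
    if experiment is None:
        return  # No experiment: the flag is a no-op.
    # Only affect a fraction of calls to limit the blast radius.
    if random.random() >= experiment.get("percentage", 1.0):
        return
    if "latency_ms" in experiment:
        time.sleep(experiment["latency_ms"] / 1000)
    if experiment.get("exception"):
        raise InjectedFault(experiment["exception"])


def handler(event):
    # Hypothetical serverless handler: place the flag right before
    # the dependency call whose failure you want to test.
    failure_flag("http-ingress")
    return {"status": 200}
```

The flag is a no-op when no experiment targets it, so it can stay in deployed code. That is what makes this approach workable on serverless platforms, where you cannot install a host agent.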

Interpreting your reliability test results

Gremlin’s default suite of reliability tests analyzes critical functions of modern services: scalability, redundancy, and resilience to dependency failures. Services that pass this suite of tests can be trusted to remain available during unexpected incidents. But what happens when a service fails a test? How do you take failed test results and turn them into actionable insights?

This blog aims to answer that question. We’ll walk through all seven tests in the Gremlin Recommended Test Suite and explain what they test, what happens if your service fails, and what actions you can take to turn that failure into success.

——

Customer Webinar On-Demand

How Visa Cross-Border Solutions Reduces Outages by Testing System Resilience in Their SDLC

ON-DEMAND

In this Gremlin-hosted webinar, Chris Kempster, a Sr. NFT Engineer at Visa Cross-Border Solutions, shares their journey from early Chaos Engineering experiments to integrating reliability test suites into their staging environments and build pipelines.

Chris shares the lessons he’s learned and best practices for building an effective testing process and rolling it out across the organization.


WATCH NOW

——

Office Hours

Upcoming! Integrating Gremlin with your observability tools

DATE: November 14th | TIME: 11am PT / 2pm ET

To get the most value out of Chaos Engineering and reliability testing, you need a way to observe your service’s behavior. Observability tools offer insight into how your systems are performing, but observability on its own isn’t enough. You need a way to monitor your systems while testing their reliability so you can determine whether your service passed or failed a test.

In this Office Hours session, we’ll show you how to connect Gremlin to your observability tool via Health Checks. We’ll also discuss which metrics to choose when creating Health Checks and why they’re important for reliability.

Have questions about observability and Gremlin? Just reply to this email and we’ll make sure to cover them in the live Q&A portion of the webinar.
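A Health Check gives a test its pass/fail decision. That decision can be sketched as threshold checks over metrics collected while the test runs. The sketch below is hypothetical. It is not Gremlin's Health Check API, and the metric names and thresholds are made up. In practice, the samples would come from queries against your observability tool.

```python
from dataclasses import dataclass


@dataclass
class HealthCheck:
    """Hypothetical threshold on one metric (not Gremlin's API).
    Assumes lower values are better, e.g. latency or error rate."""
    metric: str
    max_value: float


def evaluate(checks, samples):
    """Return (passed, failures) for metric samples gathered during a test.

    samples maps metric name -> list of observed values.
    """
    failures = []
    for check in checks:
        values = samples.get(check.metric)
        if not values:
            # An unobserved metric cannot prove the service stayed healthy.
            failures.append(f"{check.metric}: no data")
        elif max(values) > check.max_value:
            failures.append(f"{check.metric}: {max(values)} > {check.max_value}")
    return (not failures, failures)


checks = [HealthCheck("p99_latency_ms", 500), HealthCheck("error_rate", 0.01)]
print(evaluate(checks, {"p99_latency_ms": [120, 900], "error_rate": [0.0]}))
```

Treating missing data as a failure is deliberate. If the metric you rely on disappears during a test, you cannot claim the service passed.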

How to test serverless applications using Failure Flags

ON-DEMAND

Serverless platforms are ideal for deploying scalable applications without managing infrastructure, but the same abstraction makes those applications difficult to test for reliability.

Failure Flags is Gremlin’s answer to serverless reliability. In this Office Hours session, we’ll show you how to run Chaos Engineering experiments on AWS Lambda functions. You’ll see how to safely inject faults directly into your applications, how to scope experiments from individual functions to entire availability zones, and how to create your own custom faults.
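One way to think about scoping an experiment, from a single function up to an entire availability zone, is as matching the experiment's selector against labels attached to each invocation. The sketch below assumes hypothetical label names (`function`, `availability_zone`) and a hypothetical selector format. It is not the Failure Flags selector syntax. The `fault` argument stands in for a custom fault: any callable you want to run when an invocation is in scope.

```python
def matches(selector, labels):
    """Return True if an invocation's labels fall inside an experiment's scope.

    selector maps label name -> list of allowed values. Every key must
    match, so each added key narrows the scope; an empty selector
    matches every invocation.
    """
    return all(labels.get(key) in allowed for key, allowed in selector.items())


def run_experiment(selector, labels, fault):
    """Run `fault` (any callable, i.e. a custom fault) only when in scope."""
    if not matches(selector, labels):
        return False
    fault()
    return True


invocation = {"function": "checkout", "availability_zone": "us-east-1a"}
print(matches({"availability_zone": ["us-east-1a"]}, invocation))
```

Each key added to the selector narrows the blast radius. The same mechanism therefore covers a single function and every function in one zone.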

Have questions about Failure Flags? Just reply to this email and we’ll make sure to cover them in the live Q&A portion of the webinar.

WATCH NOW

——

Gremlin Reliability Newsletter