How did the world's largest auctioneer of commercial assets and vehicles cut downtime by 50%? Keep watching… 🚀 https://hubs.la/Q03f-0H10
Gremlin
Software Development
San Jose, California · 11,715 followers
The Reliability Management Platform for high-velocity engineering teams
About us
Gremlin’s Reliability Management Platform enables high-velocity engineering teams to standardize and automate reliability across their organizations without slowing down software delivery. Gremlin's Reliability Score sets the standard for reliability so there's no guesswork, and an automated suite of Reliability Management tools makes it easy to integrate reliability throughout the software lifecycle so there's no slowdown.
- Website
- http://www.gremlin.com
- Industry
- Software Development
- Company size
- 51-200 employees
- Headquarters
- San Jose, California
- Type
- Privately held
- Founded
- 2016
- Specialties
- Distributed Systems, Resilience, Failures as a Service, DevOps, and Chaos Engineering
Updates
-
Join us TOMORROW, April 17th, for our latest Office Hours session! We're covering how you can find Kubernetes reliability risks with Gremlin. You'll learn how Gremlin uses automatic risk detection to scan your Kubernetes clusters for reliability risks, where to find those risks in the Gremlin web app, strategies for resolving them, and how to generate a risk report for leadership. Register at the link in the comments. If you can't make it, no worries: we'll send you the recording afterwards. See you there!
-
It’s been a year and a half since Failure Flags was released. Since then, customers have used Failure Flags to run thousands of tests on applications running on serverless platforms, containers, and service meshes. We’ve also been hard at work improving Failure Flags' capabilities and ease of use, so let's take a brief look at what Failure Flags is, how it works, and some of the significant improvements from the last year. ⬇️ https://hubs.la/Q03f-1350
-
"The end to end complexity of an application has become complex beyond the capability of a human to understand it." -Josh Leslie, CEO
So how do you prepare your systems for an increasingly complex world? Gremlin's Reliability Management tools are made for enterprise teams to reduce friction and establish confidence, so you can build an organization-wide culture of reliability. Learn more at the link in the comments.
-
"You trusted best intentions the first time. That's what got you into this mess. The only way you're ever going to build your way out is by building a system."
Principal Software Engineer Samuel Rossoff explains the problem with relying on a culture of best intentions instead of establishing clear, defined systems for system-wide reliability.
-
AI-powered applications rely on more infrastructure, compute power, and dependencies than traditional software. That means more failure points and a higher chance of something going wrong. If your AI model is slow, unreliable, or inaccurate, users don’t care why; they just stop using it.
So what causes AI outages?
⚠️ GPU resource exhaustion: AI workloads are compute-heavy, and if GPU capacity isn’t managed properly, requests get delayed or dropped.
⚠️ Model degradation: AI performance doesn’t stay the same forever. It needs continuous validation to catch accuracy loss before it impacts customers.
⚠️ Third-party dependencies: Many AI applications rely on external AI APIs, cloud providers, or LLM services that can go down at any time.
How do we prevent AI failures?
✔️ Run AI-specific resilience tests: Stress-test GPU loads, API failures, and degraded predictions before they happen in production.
✔️ Monitor model accuracy, not just system health: Tracking response times and uptime isn’t enough. You need visibility into how the model is actually performing.
✔️ Build failover strategies: If an AI service becomes unavailable or unreliable, have a backup model or alternative processing method ready to go.
We’re entering a world where AI failures will impact customer trust as much as security breaches. How is your team preparing for AI outages?
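To make the failover idea concrete, here is a minimal Python sketch, not a Gremlin feature or API. It assumes each model client can be wrapped as a plain function that either returns an answer or raises an exception; `predict_with_failover`, `primary_model`, and `backup_model` are hypothetical names made up for this illustration.

```python
from typing import Callable, Sequence


class AllBackendsFailed(RuntimeError):
    """Raised when every configured backend has failed."""


def predict_with_failover(prompt: str, backends: Sequence[Callable[[str], str]]) -> str:
    """Try each backend in order and return the first successful answer."""
    errors = []
    for backend in backends:
        try:
            return backend(prompt)
        except Exception as exc:  # any failure triggers failover to the next backend
            name = getattr(backend, "__name__", repr(backend))
            errors.append(f"{name}: {exc}")
    # Nothing succeeded: surface every error so the outage is easy to diagnose.
    raise AllBackendsFailed("; ".join(errors))


# Hypothetical stand-ins for real model clients (assumptions for this sketch).
def primary_model(prompt: str) -> str:
    raise TimeoutError("primary LLM service unavailable")


def backup_model(prompt: str) -> str:
    return f"[backup] {prompt}"


if __name__ == "__main__":
    # The primary fails, so the answer comes from the backup model.
    print(predict_with_failover("hello", [primary_model, backup_model]))
```

A real setup would likely add timeouts and a check on answer quality before accepting a response, since an "unreliable" model may return a bad answer rather than raise an error.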
-
What happens when an AWS region fails? CEO Josh Leslie shares how Gremlin allows customers to define failure conditions for complex systems ⬇️
-
ICYMI, Failure Flags is now Generally Available for all Gremlin customers! 🚀
What can you do with this new feature?
➡️ Run resilience tests for serverless, containers, Kubernetes, and service mesh
➡️ Create application errors, latency, and data issues
➡️ Run Failure Flags in Node.js, Python, Java, Go, and .NET
➡️ Set up experiments in the Gremlin UI
Learn more at the link in the comments.