Integrating SRE into Daily Development

Jan Varga

Innovative Technology Leader | Automation, AI & Cloud Evangelist | Collaborative Leadership and Team Building

Published: November 28, 2023

In my last post, I discussed the importance of bringing SRE into the development process early, rather than as an afterthought. Operational excellence cannot be an add-on; it must be integrated into our systems from the outset.

When I made the case for early SRE integration, I didn't provide practical guidance on how to actually adopt these practices as an engineer. In this article, I will cover practical ways in which you, as a software engineer, can integrate core SRE principles into your individual workflow. This is less about reinventing the wheel than about deliberately augmenting your existing skills and habits to enhance reliability. By integrating aspects of SRE into your daily workflows, you can make a significant impact on system reliability and uptime.

Understanding the ethos of SRE

Before exploring practical integration, it is vital for engineers to understand the fundamental concepts of SRE. At its core, SRE focuses on ensuring system reliability, scalability, efficiency, and uptime through continuous monitoring, automated testing, and building operability into the software lifecycle. SRE elevates reliability to a first-class concern, rather than treating it as an afterthought.

Writing code with SRE in mind

The journey begins with the code we write. It's not just about functionality; it's about resilience. Consider potential failure scenarios like network issues or high traffic volumes right from the get-go. This foresight in coding can dramatically reduce headaches later on.

Building reliability in from scratch - When architecting systems and writing code, consider potential failure scenarios and design with resiliency in mind from the beginning. Stress test to uncover weaknesses.
Automated testing - Complement unit tests with automated tests that simulate real-world conditions at scale, like load tests, integration failure tests, and chaos engineering experiments.
Reliability-focused peer reviews - Expand the criteria of code reviews to include factors like graceful degradation, throttling, and retry mechanisms for subsystems.
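
To make the retry and graceful degradation points above concrete, here is a minimal Python sketch. It assumes a hypothetical helper, `call_with_retry`, that treats `ConnectionError` and `TimeoutError` as transient failures, retries them with exponential backoff and jitter, and falls back to a degraded response once retries are exhausted. The names, defaults, and choice of exceptions are illustrative assumptions, not a prescribed implementation.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    operation: Callable[[], T],
    fallback: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying transient failures, then degrade gracefully.

    `fallback` supplies the degraded result (for example, cached data).
    `sleep` is injectable so tests do not have to wait in real time.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts - 1:
                break  # retries exhausted, fall through to the fallback
            # Exponential backoff with full jitter to avoid synchronized retries.
            delay = random.uniform(0, base_delay * 2 ** attempt)
            sleep(delay)
    return fallback()
```

Errors that are not in the transient list (a `ValueError`, say) propagate immediately, so genuine bugs are not hidden behind retries. A reviewer applying the criteria above might ask exactly that: which failures are retried, how many times, and what the caller sees when retries run out.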

Participating in post-incident reviews

Post-incident reviews are more than administrative tasks; they are goldmines of learning. Actively participating in these sessions allows us to understand not just what went wrong, but why. This understanding is crucial in preventing future issues and enhancing system reliability.

Ask why - Delve into root causes, not merely quick fixes. Why did the failure happen, and how can it be systematically prevented?
Broaden perspective - Engineers should actively listen to operations teammates during these reviews to see the system holistically.
Evangelise learnings - Share post-incident takeaways across the org to multiply their impact and build institutional knowledge.

Gathering telemetry for continuous improvement

Incorporating real-time monitoring into our workflow does wonders. It’s about keeping a finger on the pulse of our applications, being alerted to anomalies, and reacting swiftly to prevent minor issues from becoming major disruptions.

Instrument code - Build in logging, metrics and tracing to gain visibility into all system components.
Review trends - Regularly analyse monitoring dashboards to spot anomalies and opportunities to improve reliability.
Tighten feedback loops - Set up alerts for key metrics and issues. Rapid detection and remediation are crucial.
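
As a rough illustration of the "instrument code" point, the sketch below wraps a function with a decorator that logs each call and records call counts, error counts, and latency. The in-memory dictionaries are stand-ins for whatever metrics backend you actually use, and every name here (`instrumented`, `error_rate`, `parse_quantity`) is hypothetical.

```python
import functools
import logging
import time
from collections import defaultdict
from typing import Any, Callable, DefaultDict, List

logger = logging.getLogger("telemetry_sketch")

# In-memory stand-ins for a real metrics backend (assumption for illustration).
call_counts: DefaultDict[str, int] = defaultdict(int)
error_counts: DefaultDict[str, int] = defaultdict(int)
latencies_ms: DefaultDict[str, List[float]] = defaultdict(list)


def instrumented(func: Callable[..., Any]) -> Callable[..., Any]:
    """Record call count, error count, and latency for `func`."""
    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        name = func.__name__
        start = time.perf_counter()
        call_counts[name] += 1
        try:
            return func(*args, **kwargs)
        except Exception:
            error_counts[name] += 1
            logger.exception("call failed: %s", name)
            raise  # telemetry must not swallow the original error
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            latencies_ms[name].append(elapsed_ms)
            logger.info("call=%s duration_ms=%.2f", name, elapsed_ms)
    return wrapper


def error_rate(name: str) -> float:
    """Fraction of recorded calls to `name` that raised an exception."""
    total = call_counts[name]
    return error_counts[name] / total if total else 0.0


@instrumented
def parse_quantity(text: str) -> int:
    """Example function under instrumentation."""
    return int(text)
```

An error rate or latency series like this is the kind of signal an alert could be set on; the threshold itself depends on your service and is deliberately left out.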

Conclusion

As software engineers, our role in integrating SRE principles is pivotal. By embedding these practices into our daily routines, we contribute not only to building more reliable systems but also to shaping an operational culture that values stability as much as innovation.

Next Steps

I encourage all engineers to take a closer look at SRE practices and think critically about how they can adopt these within their unique environments. Start small, focus on progress over perfection, and remain patient.

Set up a meeting with your SRE team to understand what practices they follow. Ask questions and find areas you can directly apply within your coding.
Analyse your recent incidents and outages. Identify at least three areas in your code or system architecture that could be made more resilient based on SRE best practices.
Implement one new SRE-aligned practice per sprint, whether it's adding throttles, expanding testing, enabling finer-grained metrics, or more. Measure the incremental reliability lift and demonstrate value.
Evangelise each success with your peers. Share practical examples of how SRE integration helped avoid incidents or made remediation easier. Offer to help them implement similar practices.
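
For the "adding throttles" example in the list above, one small practice that fits in a sprint is a token-bucket rate limiter. The sketch below is one possible shape, assuming a hypothetical `TokenBucket` class with an injectable clock so it can be tested without waiting. It is single-threaded; a production version would need locking and probably per-client buckets.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_per_sec` tokens per second."""

    def __init__(
        self,
        rate_per_sec: float,
        capacity: int,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start full so an initial burst is allowed
        self.last_refill = clock()

    def allow(self) -> bool:
        """Return True and consume a token if the request may proceed."""
        now = self.clock()
        elapsed = now - self.last_refill
        self.last_refill = now
        # Refill in proportion to elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Counting how often `allow()` returns False gives a simple before-and-after metric for demonstrating the effect of the change.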

By taking these initial steps towards SRE integration, you will not only enhance your own engineering practices but also contribute significantly to the overall resilience and reliability of your organisation's software systems.

Embracing reliability engineering may require upfront effort but pays invaluable dividends. Let's build a more resilient future together.

I’d love to hear your thoughts and experiences on integrating SRE into your development work. How have you adopted similar practices, and what impact have they had on your projects?

Ira Bailey

Founder at Coshop.nz, a community-led food platform. Cloud and Security Architect.

1 year ago

I think the core difference between “doing SRE” and “doing DevOps” is understanding the business goals and product. DevOps is usually “blindly automate to production” whereas with an SRE approach you design and run based on metrics that align with the actual business outcomes. Things like adding in a proxying layer or an additional DNS lookup become much more significant when you know that you’ve just added another point of failure to a critical business service.
