The Never-Ending Saga of Broken Staging Environments for Testing Microservices: Challenges, Risks, and Approaches
Samarth Handur, CSM®, CSPO®
Engineering Manager | Cloud | RESTful APIs | Python | Automation | CI/CD | QA | CSM® | CSPO®
It’s often noticed that after developing a feature, the development team deploys the code to the staging environment, only to discover that it’s broken. This enduring frustration with the perpetually faulty staging environment is as old as software development itself.
Why would anyone care about a staging environment?
The simplest justification is that the engineering team wants to make sure the code is bug-free when it is deployed to production. In an era where microservice architectures are growing more complex and vast, and digging in deeper day by day, the role of a staging environment cannot be overlooked.
According to the 2021 State of DevOps Report, elite-performing teams deploy code 973 times more frequently than low performers. That is a lot of code hitting the repository every day. At this scale, the reliability of these environments is the need of the hour. One wrong move and disaster is inevitable.
What are the consequences of broken staging environments?
Let’s look at some of the major bottlenecks developers face because of issues with lower environments.
The paradox of Staging Environments
Staging environments often present a paradoxical dilemma: they are essential for validating and testing new features, but their inherent instability and unreliability can undermine their own purpose. Here’s the catch
How are we dealing with them currently? (Or rather, working around them)
Over the past few years, teams have come up with a few smart workarounds to tackle this. Let's walk through them and understand their pros and cons:
Developer A: “Deploying JIRA-XXXX on stage-env. Don't destroy/touch!”
Developer B: “But I need to test my critical bug fix!”
Developer A: “Too bad, I was here first!”
This might work for a small team, but it does not scale. It's like having a single water fountain at a crowded event: long waits and frustration are bound to occur.
2. Feature Flags - This approach lets developers keep a feature turned off on the staging environment until it is ready for testing. Sooner or later, though, developers need to enable the flag to test the feature, and then we are back at square one, fighting for a stable environment to test in.
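To make the square-one problem concrete, here is a minimal sketch, assuming a single flag store shared by everyone on staging. The `FlagStore` class, the `new_discount_engine` flag, and the `checkout_total` function are all hypothetical names made up for illustration, not part of any real flag library.

```python
from dataclasses import dataclass, field


@dataclass
class FlagStore:
    """Hypothetical in-memory flag store; assume one instance per environment."""
    flags: dict = field(default_factory=dict)

    def is_enabled(self, name: str) -> bool:
        # Unknown flags default to off, so unfinished code stays dark.
        return self.flags.get(name, False)

    def set_flag(self, name: str, enabled: bool) -> None:
        self.flags[name] = enabled


def checkout_total(store: FlagStore, amount: float) -> float:
    # "new_discount_engine" is an illustrative flag name.
    if store.is_enabled("new_discount_engine"):
        return round(amount * 0.9, 2)
    return amount


staging = FlagStore()
baseline = checkout_total(staging, 100.0)  # flag off: existing behaviour

# Developer A enables the flag to test the feature on shared staging...
staging.set_flag("new_discount_engine", True)

# ...and Developer B, testing something unrelated, now gets A's behaviour too.
seen_by_b = checkout_total(staging, 100.0)
```

Because the flag is scoped to the environment rather than to a request or a developer, switching it on changes the result for everyone who shares that environment.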
3. Numerous Lower-Level Environments - To reduce time to market, a few organisations spin up multiple lower environments so that developers can get early feedback on their features. This approach might help initially but fails at a larger scale, and it has a few drawbacks that can't be overlooked. Lower environments lack full integration with third-party services, so developers mostly have to rely on mocks and stubs. Developers also face many coordination challenges across multiple environments.
4. Overuse of Mocks - Mocks often do not fully replicate the behavior of real services or components, which can lead to discrepancies between staging and production environments. Over-reliance on mocks can hide integration issues that only become apparent when interacting with actual services, potentially leading to unexpected problems in production. Mocks might not accurately simulate real-world conditions such as network latency, data variations, or service downtime, which can affect the validity of tests.
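The gap between a mock and the real service can be sketched with Python's standard `unittest.mock`. The `place_order` function, the `PaymentTimeout` exception, and the `charge` method are hypothetical and exist only for this example; the point is that a happy-path mock never exercises the failure mode.

```python
from unittest.mock import Mock


class PaymentTimeout(Exception):
    """Raised by the (hypothetical) payment client when a call times out."""


def place_order(payment_client, amount: float) -> str:
    # Code under test: note that it has no handling for PaymentTimeout.
    response = payment_client.charge(amount)
    return "confirmed" if response["status"] == "ok" else "declined"


# A typical happy-path mock: instant and always successful.
happy_mock = Mock()
happy_mock.charge.return_value = {"status": "ok"}
happy_result = place_order(happy_mock, 25.0)

# A mock that also simulates a failure the real service can exhibit.
flaky_mock = Mock()
flaky_mock.charge.side_effect = PaymentTimeout("gateway did not respond")

try:
    place_order(flaky_mock, 25.0)
    timeout_surfaced = False
except PaymentTimeout:
    timeout_surfaced = True
```

With only the first mock in the test suite, the missing timeout handling would go unnoticed until the code talks to a real payment service.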
How to resolve these paradoxes and find a sustainable solution?
Are we destined to always deal with broken staging environments? Traditional methods for managing staging are clearly problematic. To address these issues, we need to adopt a new approach.
Canary-Style Testing in Shared Environments
Canary-style testing in shared environments is a strategy for gradually rolling out changes to a subset of users or systems before a full-scale deployment. The idea is to test new features or updates in a controlled manner to minimise risk and identify potential issues early.
The main advantage is that developers can use the same environment without interfering with each other's work. When a developer needs to test a change, the system sets up a distinct path within the environment that incorporates their modified services, while keeping the existing versions of all other services intact.
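The "distinct path" idea can be sketched as a lookup keyed by a request header. This is a simplified illustration, not a description of any particular product: the header name `x-routing-key`, the service names, and the version labels below are all assumptions.

```python
# Baseline (stable) versions of every service in the shared environment.
BASELINE = {
    "cart": "cart-v12",
    "payments": "payments-v8",
    "inventory": "inventory-v5",
}

# Per-developer overrides, keyed by a hypothetical routing key.
SANDBOXES = {
    "dev-a": {"payments": "payments-pr-101"},
    "dev-b": {"cart": "cart-pr-202"},
}

ROUTING_HEADER = "x-routing-key"  # assumed header name


def resolve(service: str, headers: dict) -> str:
    """Pick the version of `service` that should handle this request."""
    key = headers.get(ROUTING_HEADER)
    overrides = SANDBOXES.get(key, {})
    # Use the developer's modified service if present, else the baseline.
    return overrides.get(service, BASELINE[service])


def call_path(headers: dict) -> list:
    # The header has to be forwarded on every hop for the path to hold.
    return [resolve(name, headers) for name in ("cart", "payments", "inventory")]


path_a = call_path({ROUTING_HEADER: "dev-a"})
path_b = call_path({ROUTING_HEADER: "dev-b"})
path_default = call_path({})
```

Developer A's requests reach the modified payments service, Developer B's reach the modified cart service, and untagged traffic sees only baseline versions, so neither change disturbs the other.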
Canary testing provides an opportunity to observe how new changes perform under real conditions. This helps assess whether the changes meet performance expectations and handle actual usage scenarios.
Successful canary testing builds confidence in the stability and functionality of new updates before they are fully deployed, reducing the likelihood of major issues in the broader release.
This technique delivers the required isolation for extensive testing without the burden of handling multiple full environments. It allows teams to perform thorough testing in a production-like setting, identify issues early, and keep a stable shared environment all at once.
Some real-world solutions
Let's look at some of the most innovative solutions that a few companies have developed to tackle this.
These industry leaders have understood the value of safely sharing environments to resolve staging environment issues. This has helped them improve developer productivity and reduce time to market while improving quality.
Benefits of Sharing Staging Environments Safely
Improved Collaboration
Cross-Functional Testing: Teams from different departments (e.g., development, QA, and operations) can collaborate more effectively by accessing the same staging environment.
Unified Testing: Shared environments allow multiple teams to test their changes in a consistent setup, improving coordination and reducing integration issues.
Cost Efficiency
Resource Optimization: Maintaining a single, shared staging environment reduces the need for multiple separate environments, which can be costly and resource-intensive.
Reduced Overhead: Fewer environments mean less overhead in terms of management, maintenance, and infrastructure costs.
Consistent Testing Conditions
Uniform Setup: A shared environment ensures that all teams are testing against the same configuration and data, leading to more consistent and reliable test results.
Accurate Replication: It closely mirrors the production environment, improving the accuracy of tests and increasing confidence in the deployment process.
Faster Feedback and Iteration
Immediate Testing: Teams can test their changes and receive feedback more quickly in a shared environment, accelerating the development and bug-fixing process.
Quicker Integration: Changes from different teams can be integrated and tested together more efficiently, reducing the time to identify and resolve integration issues.
Improved Risk Management
Early Issue Detection: Testing in a shared environment helps identify potential conflicts or issues that might arise from integrating changes from multiple teams, leading to early problem detection.
Controlled Access: Implementing proper access controls and monitoring within the shared environment helps mitigate the risk of unauthorized changes or interference.
Streamlined Deployment Process
Unified Approach: A shared staging environment facilitates a more streamlined deployment process by providing a common platform for final validation before going to production.
Consistent Release Practices: Teams can adhere to consistent release practices and standards, leading to more predictable and reliable deployments.
Best Practices for Safe Sharing
With the evolution of cloud-native technologies like Kubernetes and service meshes such as Istio and Linkerd, implementing advanced routing for securely sharing staging environments is more straightforward than ever. This setup allows teams to adopt a "canary on staging" approach, effectively identifying issues early while keeping a stable and reliable shared environment.
Furthermore, this method enables capabilities such as "feature previews," allowing development teams to build temporary, isolated versions of their services to demonstrate specific features or updates.
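As a rough illustration of what header-based routing on a service mesh might look like, the sketch below builds a manifest shaped like an Istio `VirtualService` as a plain Python dict. The service name, the sandbox host, the routing key, and the `x-routing-key` header are assumptions chosen for the example, and the `apiVersion` to use depends on the Istio release installed in the cluster.

```python
import json


def header_routed_virtual_service(service, routing_key, sandbox_host,
                                  header="x-routing-key"):
    """Build a VirtualService-shaped dict; every name here is illustrative."""
    return {
        "apiVersion": "networking.istio.io/v1beta1",  # may differ per Istio version
        "kind": "VirtualService",
        "metadata": {"name": f"{service}-routing"},
        "spec": {
            "hosts": [service],
            "http": [
                {
                    # Requests tagged with the developer's key go to the sandbox.
                    "match": [{"headers": {header: {"exact": routing_key}}}],
                    "route": [{"destination": {"host": sandbox_host}}],
                },
                {
                    # Everything else keeps hitting the stable baseline.
                    "route": [{"destination": {"host": service}}],
                },
            ],
        },
    }


manifest = header_routed_virtual_service("payments", "dev-a", "payments-pr-101")
print(json.dumps(manifest, indent=2))
```

The rule order matters in this sketch: the header match comes first and the unconditional route acts as the fallback, so the shared baseline is untouched for anyone who does not send the header. The same temporary sandbox host could also back a short-lived "feature preview".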
Conclusion
The new era of staging environments marks a transformative shift towards more efficient, scalable, and collaborative testing practices. By integrating cloud-native technologies, advanced automation, and sophisticated testing strategies, organizations can overcome traditional limitations and enhance their software development processes. This modern approach enables rapid feedback, cost-effective resource management, and improved coordination among teams, all while ensuring that staging environments are reliable and reflective of production conditions. Embracing these innovations allows for faster delivery of high-quality software, reduced risk of issues, and greater agility in adapting to evolving requirements.