The Never-Ending Saga of Broken Staging Environments for Testing Microservices: Challenges, Risks, and Approaches

It’s a familiar story: after developing a feature, the team deploys the code to the staging environment, only to discover that the environment is broken. This enduring frustration with perpetually faulty staging environments is as old as software development itself.

Why would anyone care about the staging environment?

The simplest justification is that the engineering team wants to be confident the code will be bug-free when it reaches production. In an era where microservices architectures are growing larger, more complex, and more deeply entrenched by the day, the role of a staging environment cannot be overstated.

According to the 2021 State of DevOps Report, an IT organisation deploys code around 973 times per day on average. That is a lot of code hitting the repository every day. At that scale, the reliability of these environments is very much the need of the hour; one wrong move and disaster is inevitable.

What are the consequences of broken staging environments?

Let’s look at some of the major bottlenecks developers face because of issues with lower environments.

  1. Time to Market - Careful release planning goes for a toss when the staging environment is unstable, which delays the release and, in turn, impacts GMS.
  2. Team Velocity - Few things undermine productivity faster than waiting for a staging environment to stabilize or become available. It’s like sitting in a traffic jam when you’re already running late for an important meeting.
  3. Quality challenges - This is perhaps the biggest challenge. How can you trust your test results when the staging environment is shaky? If staging doesn’t accurately reflect production, feedback from testing may not be useful for improving the final product. It’s like cooking a meal while the ingredients keep changing.
  4. Bug leakage - This is an obvious outcome of a faulty staging environment. The QA team can’t catch bugs in staging, and they slip into production.
  5. Frustration and Morale - Ongoing issues with staging can lead to frustration among team members, impacting overall morale and productivity.
  6. Difficulty in Replicating Issues - Debugging problems becomes harder when the staging environment is unreliable, making it difficult to replicate and fix issues.
  7. Wasted Resources - Time and effort spent dealing with a faulty staging environment can lead to inefficient use of resources and increased project costs.
  8. Coordination Challenges - Teams may struggle with coordination and communication when facing ongoing issues with the staging environment.

The paradox of Staging Environments

Staging environments present a paradox: they are essential for validating and testing new features, yet their inherent instability and unreliability can undermine that very purpose. Here’s the catch:

  1. Essential for Testing: Staging environments are crucial for simulating real-world conditions and identifying issues before production. They provide a critical checkpoint for ensuring that features work as intended.
  2. Inherently Unstable: Despite their importance, staging environments frequently suffer from instability and inconsistencies. This unreliability can lead to false negatives or missed bugs, complicating the validation process.

How are we dealing with them currently? (Or rather, working around them?)

Over the years, teams have come up with a few clever workarounds to tackle this. Let’s walk through them and understand their pros and cons.

  1. Slack Lock - Most teams use Slack as their main communication channel, and they “lock” the staging environment over chat. It goes something like this:

                Developer A: “Deploying JIRA-XXXX on stage-env. Dont destroy/touch!”
                Developer B: “But I need to test my critical bug fix!”
                Developer A: “Too bad, I was here first!”

This might work for a small team; at scale, however, it breaks down. It’s like having a single water fountain at a crowded event: long waits and frustration are bound to occur.

2. Feature Flags - This approach lets developers keep a feature switched off in the staging environment until it is ready for testing. But sooner or later someone has to enable the flag to test the feature, and guess what: we are back at square one, fighting for a stable environment to test on.
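
To make that limitation concrete, here is a minimal sketch of a feature flag guarding a new code path; the flag name and the in-memory flag store are hypothetical stand-ins for a real flag service. As long as the flag stays off, other teams on the shared environment are unaffected, but the moment someone enables it to test, the whole environment changes under everyone’s feet.

    # Minimal feature-flag sketch; the flag name and in-memory store are hypothetical.
    # While "new-checkout-flow" stays off, other teams on staging are unaffected;
    # the moment someone enables it to test, the whole shared environment changes.

    FLAG_STORE = {"new-checkout-flow": False}  # stand-in for a real flag service

    def is_enabled(flag_name: str) -> bool:
        """Look up a flag; default to 'off' so unfinished features stay hidden."""
        return FLAG_STORE.get(flag_name, False)

    def checkout(cart: dict) -> str:
        if is_enabled("new-checkout-flow"):
            return f"new flow: charging {sum(cart.values())} via revamped service"
        return f"old flow: charging {sum(cart.values())} via legacy service"

    if __name__ == "__main__":
        cart = {"book": 12, "pen": 3}
        print(checkout(cart))                   # old flow: staging stays stable
        FLAG_STORE["new-checkout-flow"] = True  # a developer flips it to test...
        print(checkout(cart))                   # ...and everyone on staging gets the new flow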

3. Numerous Lower Environments - To shorten time to market, some organisations spin up multiple lower environments so that developers can get early feedback on their features. This approach may help initially, but it fails at larger scale and has drawbacks that can’t be overlooked: lower environments lack full integration with third-party services, so developers mostly have to rely on mocks and stubs, and they face plenty of coordination challenges across multiple environments.

4. Overuse of Mocks - Mocks often do not fully replicate the behavior of real services or components, which can lead to discrepancies between staging and production. Over-reliance on mocks can hide integration issues that only become apparent when interacting with actual services, potentially leading to unexpected problems in production. Mocks also might not accurately simulate real-world conditions such as network latency, data variations, or service downtime, which affects the validity of tests.
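
As a hedged illustration, here is a small sketch using Python’s unittest.mock; the inventory service, its client, and the 10-item limit are invented for the example. The mock cheerfully accepts an order that a realistic downstream service would reject, so the test passes while the integration bug survives untouched.

    # Sketch of how an over-permissive mock hides an integration issue.
    # The inventory service, its client, and the 10-item limit are hypothetical.
    from unittest.mock import Mock

    class RealInventoryClient:
        """Stand-in for the real downstream service's behaviour."""
        def reserve(self, item: str, qty: int) -> dict:
            if qty > 10:  # the real service enforces a per-order limit
                return {"status": "rejected", "reason": "quantity over limit"}
            return {"status": "reserved"}

    def place_order(inventory, item: str, qty: int) -> str:
        result = inventory.reserve(item, qty)
        if result["status"] == "reserved":
            return f"order confirmed for {qty} x {item}"
        return f"order failed: {result['reason']}"

    if __name__ == "__main__":
        mock_inventory = Mock()
        mock_inventory.reserve.return_value = {"status": "reserved"}  # always happy

        # Against the mock, any quantity "passes", so the limit is never exercised.
        print(place_order(mock_inventory, "laptop", 500))        # order confirmed ...
        # Against realistic behaviour, the hidden issue surfaces immediately.
        print(place_order(RealInventoryClient(), "laptop", 500))  # order failed ...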

How to resolve these paradoxes and find a sustainable solution?

Are we destined to always deal with broken staging environments? Traditional methods for managing staging are clearly problematic. To address these issues, we need to adopt a new approach.

Canary-Style Testing in Shared Environments

Canary-style testing in shared environments is a strategy for gradually rolling out changes to a subset of users or systems before a full-scale deployment. The idea is to test new features or updates in a controlled manner to minimise risk and identify potential issues early.

The main advantage is that developers can use the same environment without interfering with each other's work. When a developer needs to test a change, the system sets up a distinct path within the environment that incorporates their modified services, while keeping the existing versions of all other services intact.
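
A minimal sketch of that routing idea, assuming a hypothetical X-Dev-Sandbox request header and an in-memory route table: requests tagged with a developer’s sandbox ID are sent to that developer’s version of the services they changed, while everything else falls through to the stable baseline.

    # Sketch of request-level routing in a shared staging environment.
    # The header name, sandbox IDs, and service URLs are all hypothetical.

    BASELINE = {
        "orders":   "http://orders.staging.svc:8080",
        "payments": "http://payments.staging.svc:8080",
    }

    # Per-developer overrides: only the services a developer changed get a new version.
    SANDBOX_OVERRIDES = {
        "alice-jira-1234": {"payments": "http://payments-alice.staging.svc:8080"},
        "bob-jira-5678":   {"orders":   "http://orders-bob.staging.svc:8080"},
    }

    def resolve(service: str, headers: dict) -> str:
        """Route to the sandbox version if the request is tagged for one, else baseline."""
        sandbox = headers.get("X-Dev-Sandbox")
        overrides = SANDBOX_OVERRIDES.get(sandbox, {})
        return overrides.get(service, BASELINE[service])

    if __name__ == "__main__":
        # Alice's test request only diverges for the service she modified.
        print(resolve("payments", {"X-Dev-Sandbox": "alice-jira-1234"}))  # her build
        print(resolve("orders",   {"X-Dev-Sandbox": "alice-jira-1234"}))  # baseline
        # Untagged traffic (everyone else) never leaves the stable baseline.
        print(resolve("payments", {}))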

Canary testing provides an opportunity to observe how new changes perform under real conditions. This helps assess whether the changes meet performance expectations and handle actual usage scenarios.

Successful canary testing builds confidence in the stability and functionality of new updates before they are fully deployed, reducing the likelihood of major issues in the broader release.

This technique delivers the required isolation for extensive testing without the burden of handling multiple full environments. It allows teams to perform thorough testing in a production-like setting, identify issues early, and keep a stable shared environment all at once.

Some real-world solutions!

Let’s look at some of the innovative solutions that a few companies have developed to tackle this.

  • DoorDash’s fast feedback loop is a practice designed to rapidly gather and act on feedback to improve their services and features. It was applied to Kubernetes product development to tighten the feedback loop within the production environment. Based on feedback and data analysis, the setup quickly makes adjustments to refine and improve their offerings, and this iterative process ensures that the product evolves in response to user needs. Read more here - https://doordash.engineering/2022/06/23/fast-feedback-loop-for-kubernetes-product-development-in-a-production-environment/
  • Uber's SLATE is a system designed to improve the efficiency and reliability of their software deployment and testing processes. SLATE stands for Short-Lived Application Testing Environment, a framework that aims to improve the management and performance of test environments. A SLATE environment is spun up for a limited duration to conduct specific tests and terminated once the testing is complete. These environments are temporary, created and destroyed as needed using automated tools such as Infrastructure as Code (IaC) or containerization technologies. Read more - https://www.uber.com/blog/simplifying-developer-testing-through-slate/
  • Lyft’s environment management refers to the strategies and tools Lyft uses to manage their development, testing, and production environments. The goal is to ensure consistency, efficiency, and stability across different stages of their software lifecycle. Lyft leverages containerization technologies such as Docker and orchestration platforms like Kubernetes to manage their environments. Lyft uses CI/CD pipelines to automate the process of building, testing, and deploying applications. This automation helps streamline the workflow from code commit to deployment, ensuring that changes are tested and deployed in a controlled manner.

These industry leaders have recognised the value of safely sharing environments to resolve staging issues, and doing so has helped them improve developer productivity and reduce time to market while raising quality.

Benefits of Sharing Staging Environments Safely

Improved Collaboration
  • Cross-Functional Testing: Teams from different departments (e.g., development, QA, and operations) can collaborate more effectively by accessing the same staging environment.
  • Unified Testing: Shared environments allow multiple teams to test their changes in a consistent setup, improving coordination and reducing integration issues.

Cost Efficiency
  • Resource Optimization: Maintaining a single, shared staging environment reduces the need for multiple separate environments, which can be costly and resource-intensive.
  • Reduced Overhead: Fewer environments mean less overhead in terms of management, maintenance, and infrastructure costs.

Consistent Testing Conditions
  • Uniform Setup: A shared environment ensures that all teams are testing against the same configuration and data, leading to more consistent and reliable test results.
  • Accurate Replication: It closely mirrors the production environment, improving the accuracy of tests and increasing confidence in the deployment process.

Faster Feedback and Iteration
  • Immediate Testing: Teams can test their changes and receive feedback more quickly in a shared environment, accelerating the development and bug-fixing process.
  • Quicker Integration: Changes from different teams can be integrated and tested together more efficiently, reducing the time to identify and resolve integration issues.

Improved Risk Management
  • Early Issue Detection: Testing in a shared environment helps identify potential conflicts or issues that might arise from integrating changes from multiple teams, leading to early problem detection.
  • Controlled Access: Implementing proper access controls and monitoring within the shared environment helps mitigate the risk of unauthorized changes or interference.

Streamlined Deployment Process
  • Unified Approach: A shared staging environment facilitates a more streamlined deployment process by providing a common platform for final validation before going to production.
  • Consistent Release Practices: Teams can adhere to consistent release practices and standards, leading to more predictable and reliable deployments.

Best Practices for Safe Sharing

  • Access Controls: Implement strict access controls and permissions to ensure that only authorized users can make changes to the shared environment.
  • Monitoring and Logging: Use monitoring and logging to track activities and changes within the environment, helping to detect and address issues promptly.
  • Clear Communication: Establish clear communication channels and protocols to coordinate activities and avoid conflicts in the shared environment.
  • Environment Management: Use automated tools and practices to manage the configuration, provisioning, and maintenance of the shared environment effectively.


With the evolution of cloud-native technologies like Kubernetes and service meshes such as Istio and Linkerd, implementing the advanced routing needed to safely share staging environments is more straightforward than ever. This setup allows teams to adopt a "canary on staging" approach, effectively identifying issues early while keeping the shared environment stable and reliable.

Furthermore, this method enables capabilities such as "feature previews," allowing development teams to build temporary, isolated versions of their services to demonstrate specific features or updates.
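
For this kind of request-level isolation to hold across multiple hops, each service must forward the routing header on its outbound calls, otherwise the "distinct path" is lost after the first service. Here is a minimal sketch of that propagation step, again assuming the hypothetical X-Dev-Sandbox header; in real setups this is typically handled by tracing or baggage-propagation libraries rather than hand-rolled code.

    # Sketch of forwarding the routing header across service hops so a tagged
    # request keeps reaching sandbox versions downstream. Header names are hypothetical.

    ROUTING_HEADERS = ("X-Dev-Sandbox", "X-Request-Id")

    def outbound_headers(incoming, extra=None):
        """Copy routing headers from the incoming request onto an outbound call."""
        headers = {k: v for k, v in incoming.items() if k in ROUTING_HEADERS}
        headers.update(extra or {})
        return headers

    def handle_checkout(incoming_headers):
        # The orders service calls payments next; if the sandbox tag were dropped
        # here, the mesh would route the call to the baseline payments service.
        return outbound_headers(incoming_headers, {"Content-Type": "application/json"})

    if __name__ == "__main__":
        incoming = {"X-Dev-Sandbox": "alice-jira-1234", "Accept": "application/json"}
        print(handle_checkout(incoming))
        # {'X-Dev-Sandbox': 'alice-jira-1234', 'Content-Type': 'application/json'}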

Conclusion

The new era of staging environments marks a transformative shift towards more efficient, scalable, and collaborative testing practices. By integrating cloud-native technologies, advanced automation, and sophisticated testing strategies, organizations can overcome traditional limitations and enhance their software development processes. This modern approach enables rapid feedback, cost-effective resource management, and improved coordination among teams, all while ensuring that staging environments are reliable and reflective of production conditions. Embracing these innovations allows for faster delivery of high-quality software, reduced risk of issues, and greater agility in adapting to evolving requirements.
