The Value of Prevention in Quality Assurance: QA Spend Is Always Justifiable After an “Incident”

Justifying spend on quality engineering and software testing is a constant struggle for technology-first organizations. The risks are significant, but the value is often misunderstood, and quality work is frequently prioritized below producing more code or features. As leaders and veterans in the software quality world, we see this firsthand every day, and we know that not everything can or should be tested. We also know that the process of quality assurance is not infallible, and defects will escape. So, what is an IT organization to do? We have long sought the right balance between risk and spend when it comes to quality. It is why QA Consultants has invested so much in building processes and helping our customers focus on engineering higher-quality software.

On July 19, 2024, the massive CrowdStrike defect caused a system crash that affected over 8 million computers running Windows operating systems and their software (TechRadar, 2024), resulting in major complications throughout healthcare, banking, and travel systems, among other organizations, at an estimated financial impact of $10 billion (ISG, 2024). At this time, investigations are continuing, but it is understood that the incident occurred due to a defect in an update specific to integrations. Many details still need to be investigated to confirm whether system integration testing was adequately completed before the release was automatically delivered to customers worldwide (ISG, 2024). It is highly possible that the testing that was done matched what had previously been determined as the right balance between risk and cost. In fact, that balance always works well...until it doesn’t, as in this case. Hindsight is always 20/20, and the many after-action reports on this incident may point to known and unknown decisions, incompetence, lack of clarity, and more. Questions of culpability and accountability will be large, as there are many cooks in this “kitchen.”

What can we learn from this situation? The time, effort, and money spent on quality assurance are far outweighed by the loss incurred in a catastrophic incident. But this becomes clear only after the incident has taken place. Prior to the incident, the lack of spend on this particular part of QA might be heralded as fiscal responsibility and efficient software development practice. It takes an incident to highlight the impact of those trade-offs. As part of ALTEN Technology, QA Consultants plays a significant role in helping customer organizations reduce the impact of integration issues, including the risk-versus-reward prioritization decisions that situations like CrowdStrike bring into focus.

Here are five ways organizations can better position themselves in the future:

1. Always have a "production beta": Whether it is termed beta (waterfall) or canary (DevOps), always have a small production group for deployments and analysis before rolling out to the larger or global group. In fact, CrowdStrike has since recognized that this very practice would have provided a brake to prevent the worldwide disaster. Customer environments are complex. Even with generous budgets, it becomes impractical to reproduce all possible combinations of software and environment configurations. This leaves an unknown production risk after quality assurance efforts are complete. In waterfall development shops, this remaining risk has historically been identified and mitigated through alpha and beta software deployments to limited audiences. With DevOps, this has become “the canary model” for rolling out software to groups of high complexity to check for performance issues. Your rollout to production should include progressively larger canary groups to limit customer risk in production and to identify and address remaining risk items before a catastrophic event occurs.

  • Example: Deploy with new features dormant behind feature flags. Enable the new features for a subset of the population, selected either by geography or by percentage. Avoid your highest-risk production customers in this first group. Leverage modern APM solutions to monitor this group for errors and performance in production. With your second canary, include a subset of your higher-risk production clients. The number of canary levels should match the risk to your clients should your software fail to function. Kernel-level software that has the potential to “brick” a client system will require more canary levels than user-level software with a new feature that fails to work as expected.
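
The percentage-based flag described in the example above could be sketched roughly as follows. This is a minimal illustration, not a description of any particular feature-flag product: the function names `rollout_bucket` and `is_enabled`, the flag name, and the customer IDs are all hypothetical. Hashing the customer ID gives each customer a stable bucket, so raising the percentage only ever adds customers to the canary group, and an exclusion set keeps the highest-risk customers out of early canaries.

```python
import hashlib


def rollout_bucket(customer_id: str, flag_name: str) -> int:
    """Map a customer to a stable bucket in the range 0..99 for one flag."""
    key = f"{flag_name}:{customer_id}".encode("utf-8")
    return int(hashlib.sha256(key).hexdigest(), 16) % 100


def is_enabled(customer_id: str, flag_name: str, percent: int,
               excluded: frozenset = frozenset()) -> bool:
    """Return True if the dormant feature should be switched on for this customer.

    `excluded` holds customers kept out of this canary level, for example
    the highest-risk production customers (a hypothetical policy).
    """
    if customer_id in excluded:
        return False
    return rollout_bucket(customer_id, flag_name) < percent


if __name__ == "__main__":
    high_risk = frozenset({"hospital-01"})
    for customer in ("cafe-17", "retail-42", "hospital-01"):
        print(customer, is_enabled(customer, "new-sensor-logic", 10, high_risk))
```

Because the bucket depends only on the flag name and customer ID, a customer who receives the feature at 10 percent still has it at 50 percent, which keeps each canary level a superset of the previous one.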

2. Adopt a Staggered Deployment Strategy Based on Impact: Roll out updates in phases, starting with regions or customers with lower impact and gradually progressing to larger, more critical customers only after ensuring stability.

  • Example: Deploy updates first to internal systems or smaller customers with less critical operations. Monitor the performance and stability of these initial deployments closely. Once the update is confirmed to be stable, proceed to roll out updates to larger customers or those whose business operations are more critical and would cause a greater impact if issues arise. This approach helps minimize potential disruptions and ensures that any problems are identified and resolved before affecting major customers.
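
One way to picture impact-based staggering is to assign each customer an impact score and group deployments into waves from lowest to highest. The sketch below is only an illustration under that assumption: the `Customer` type, the integer `impact` score (0 for internal systems), and the sample names are hypothetical, and a real plan would also gate each wave on the stability of the one before it.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Customer:
    name: str
    impact: int  # hypothetical score: 0 = internal systems, higher = more critical


def plan_waves(customers: Iterable[Customer]) -> List[List[str]]:
    """Group customers into deployment waves, lowest impact first."""
    by_impact: Dict[int, List[str]] = {}
    for customer in customers:
        by_impact.setdefault(customer.impact, []).append(customer.name)
    # One wave per impact level; names are sorted so the plan is reproducible.
    return [sorted(by_impact[level]) for level in sorted(by_impact)]


if __name__ == "__main__":
    fleet = [
        Customer("regional-bank", 3),
        Customer("internal-lab", 0),
        Customer("corner-shop", 1),
        Customer("cafe-chain", 1),
    ]
    for number, wave in enumerate(plan_waves(fleet), start=1):
        print(f"wave {number}: {wave}")
```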


3. Enhance Monitoring and Feedback Mechanisms with Automated Rollback and Update Pausing: Implement robust monitoring and logging systems to provide real-time feedback on updates and integrate automated rollback procedures. Additionally, ensure that updates can be paused in other regions until further investigation is completed if issues are detected.

  • Example: Utilize tools that provide real-time alerts and dashboards to track the health of systems receiving updates. If anomalies or issues are detected, the automated rollback system should immediately revert the affected systems to the previous stable version. Simultaneously, pause the rollout of updates to other regions to prevent further impact until a thorough investigation is conducted and the issue is resolved. This comprehensive approach ensures quick identification and mitigation of issues, and control over the update distribution process.
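
The rollback-and-pause behavior in this example might look something like the following control loop. It is a simplified sketch: `deploy`, `error_rate`, and `rollback` are hypothetical callbacks standing in for whatever deployment and monitoring tooling an organization actually uses, and the 1 percent error threshold is an arbitrary placeholder.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class RolloutResult:
    completed: List[str] = field(default_factory=list)
    rolled_back: List[str] = field(default_factory=list)
    paused: List[str] = field(default_factory=list)


def run_rollout(regions: Sequence[str],
                deploy: Callable[[str], None],
                error_rate: Callable[[str], float],
                rollback: Callable[[str], None],
                max_error_rate: float = 0.01) -> RolloutResult:
    """Deploy region by region; on an unhealthy region, roll it back and pause the rest."""
    result = RolloutResult()
    for index, region in enumerate(regions):
        deploy(region)
        if error_rate(region) > max_error_rate:
            rollback(region)                            # revert the affected region
            result.rolled_back.append(region)
            result.paused.extend(regions[index + 1:])   # remaining regions never receive the update
            break
        result.completed.append(region)
    return result


if __name__ == "__main__":
    observed = {"internal": 0.0, "eu-west": 0.2, "us-east": 0.0}  # fake monitoring data
    outcome = run_rollout(list(observed), lambda r: None, observed.__getitem__, lambda r: None)
    print(outcome)
```

In this sketch the paused regions stay paused until someone restarts the rollout, which mirrors the idea of holding distribution until the investigation is complete.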


4. Strengthen Cross-Functional Collaboration with System Integration Focus: In a heavily integrated environment, ensure that testing strategies include broad system integration testing, not just local validations. Collaboration between development, QA, and operations teams (DevOps practices) should be emphasized to understand and address the dependencies and interactions between different applications and the operating system. This approach helps identify potential issues that could arise from the integration points and ensure the stability of the entire system.

  • Example: Conduct end-to-end integration testing that includes all components interacting with the update, such as middleware, databases, and third-party services. Regular cross-functional reviews and joint testing sessions should be held to align on integration impacts and validation criteria. This comprehensive testing strategy helps detect and resolve issues that could affect the broader system, ensuring seamless operation and reducing the risk of widespread failures.


5. Increase Customer Control and Enhance QA Diligence: Provide customers with greater control over when updates are applied, allowing them to schedule updates during times of lower business impact. Additionally, encourage customers to implement their own QA strategies to validate changes within their IT landscape before updates are promoted to their production environment.

  • Example: Develop features that allow customers to define specific dates and times for updates, ensuring minimal disruption to their operations. This can be particularly useful for businesses that experience peak operational periods. Additionally, customers should invest in their own QA strategy and perform their own testing and validation of updates in staging environments. This extra layer of diligence helps identify any potential issues within the context of their unique IT setup before the updates are deployed to production.
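
A customer-defined update window could be modeled along these lines. This is an illustrative sketch only: the `MaintenanceWindow` type and its fields are hypothetical, and it assumes naive datetimes already expressed in the customer's local time. A window whose end time is earlier than its start time is treated as running past midnight and is attributed to the weekday on which it begins.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass(frozen=True)
class MaintenanceWindow:
    weekdays: frozenset  # days the window starts on; Monday=0, as in datetime.weekday()
    start: time
    end: time

    def allows(self, moment: datetime) -> bool:
        """Return True if an update may be applied at `moment`."""
        now, today = moment.time(), moment.weekday()
        if self.start <= self.end:
            # Window contained within a single day.
            return today in self.weekdays and self.start <= now < self.end
        if now >= self.start:
            # Evening portion of a window that runs past midnight.
            return today in self.weekdays
        if now < self.end:
            # Early-morning portion, belonging to the window that started yesterday.
            return (today - 1) % 7 in self.weekdays
        return False


if __name__ == "__main__":
    # Hypothetical customer preference: Saturdays from 22:00 until 02:00 on Sunday.
    window = MaintenanceWindow(frozenset({5}), time(22, 0), time(2, 0))
    print(window.allows(datetime(2024, 7, 20, 23, 0)))  # a Saturday night
    print(window.allows(datetime(2024, 7, 19, 12, 0)))  # a Friday at noon
```

An update service following this idea would check `allows` before applying an update and defer the update otherwise.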


Proper preventative testing, when seamlessly integrated, can save significant time and money and avert business losses. Proactive and concise testing is necessary to keep incidents like the Windows outage triggered by the CrowdStrike update at bay and keep your business running smoothly. Quality assurance experts like QA Consultants play an essential role in successful integration testing from the start. Don’t wait until a global outage causes flight delays, missed healthcare appointments, or significant financial repercussions.

Discover how QA Consultants specialists can address your integration quality assurance needs. Speak to an engineer today.
