The Value of Prevention in Quality Assurance: QA Spend Is Always Justifiable After an “Incident”
QA Consultants
Now an ALTEN Company, a global engineering and IT services leader.
Justifying spend on quality engineering and software testing is a constant struggle for technology-first organizations. The risks are significant, but the value is often misunderstood, and quality work is frequently prioritized below producing more code or features. As leaders and veterans in the software quality world, we see this firsthand every day, and we know that not everything can or should be tested. We also know that the process of quality assurance is not infallible, and defects will escape. So, what is an IT organization to do? We have long sought the right balance between risk and spend when it comes to quality. It is why QA Consultants has invested so much in building processes and helping our customers focus on engineering higher-quality software.
On July 19, 2024, the massive CrowdStrike defect caused a system crash that affected over 8 million computers running Windows operating systems and their software (TechRadar, 2024), resulting in major complications across healthcare, banking, and travel systems, among other organizations, at an estimated financial impact of $10 billion (ISG, 2024). At this time, investigations are continuing, but it is understood that the incident occurred due to a defect in an update specific to integrations. Many details still need to be investigated to confirm whether system integration testing was adequately completed before the release was automatically delivered to customers worldwide (ISG, 2024). It is highly possible that the testing that was done was what had previously been determined to be the right balance between risk and cost. In fact, that balance always works well… until it doesn’t, as in this case. Hindsight is always 20/20, and the many after-action reports on this incident may point to known and unknown decisions, incompetence, lack of clarity, and more. Questions of culpability and accountability will loom large, as there are many cooks in this “kitchen.”
What can we learn from this situation? The time, effort, and money spent on quality assurance are far outweighed by the loss incurred in a catastrophic incident. But this becomes clear only after the incident has taken place. Before the incident, the lack of spend on this particular part of QA might be heralded as fiscal responsibility and efficient software development practice. It takes an incident to highlight the impact of those trade-offs. As part of ALTEN Technology, QA Consultants plays a significant role in helping customer organizations reduce the impact of integration issues like the CrowdStrike situation, including the risk-versus-reward prioritization decisions behind them.
Here are 5 ways organizations can better position themselves in the future:
1. Always have a "production beta": Whether it is termed beta (waterfall) or canary (DevOps), always have a small production group for deployments and analysis before rolling out to the larger, global group. In fact, this very item is one that CrowdStrike has realized would have provided an air brake to prevent the worldwide disaster from occurring. Customer environments are complex. Even with generous budgets, it becomes impractical to reproduce all possible combinations of software and environment configurations. This leaves an unknown production risk after quality assurance efforts are complete. In waterfall development shops, this remaining risk has historically been identified and mitigated through alpha and beta software deployments to limited audiences. With DevOps, this has become “the canary model” for rolling out software to groups of high complexity to check for performance issues. Your rollout to production should include progressively larger canary groups to limit customer risk in production and to identify and address remaining risk items before a catastrophic event occurs.
2. Adopt a Staggered Deployment Strategy Based on Impact: Roll out updates in phases, starting with regions or customers with lower impact and gradually progressing to larger, more critical customers only after ensuring stability.
3. Enhance Monitoring and Feedback Mechanisms with Automated Rollback and Update Pausing: Implement robust monitoring and logging systems to provide real-time feedback on updates and integrate automated rollback procedures. Additionally, ensure that updates can be paused in other regions until further investigation is completed if issues are detected.
4. Strengthen Cross-Functional Collaboration with System Integration Focus: In a heavily integrated environment, ensure that testing strategies include broad system integration testing, not just local validations. Collaboration between development, QA, and operations teams (DevOps practices) should be emphasized to understand and address the dependencies and interactions between different applications and the operating system. This approach helps identify potential issues that could arise from the integration points and ensure the stability of the entire system.
5. Increase Customer Control and Enhance QA Diligence: Provide customers with greater control over when updates are applied, allowing them to schedule updates during times of lower business impact. Additionally, encourage customers to implement their own QA strategies to validate changes within their IT landscape before updates are promoted to their production environment.
Example: Develop features that allow customers to define specific dates and times for updates, ensuring minimal disruption to their operations. This can be particularly useful for businesses that experience peak operational periods. Additionally, customers should invest in their own QA strategy and perform their own testing and validation of updates in staging environments. This extra layer of diligence helps identify any potential issues within the context of their unique IT setup before the updates are deployed to production.
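To make the rollout ideas above more concrete, here is a minimal Python sketch of a staged deployment with a health gate, automated rollback, and pausing of later stages, plus a check for a customer-defined update window. It is an illustration under stated assumptions, not a description of how CrowdStrike or any particular vendor ships updates. Every name in it (`Ring`, `RolloutResult`, `staged_rollout`, `in_update_window`, the `deploy`, `rollback`, and `error_rate` callbacks, and the 1% threshold) is hypothetical and would be replaced by your own deployment and monitoring tooling.

```python
from dataclasses import dataclass, field
from datetime import time
from typing import Callable, List


@dataclass
class Ring:
    """One rollout stage (hypothetical type); list rings from lowest to highest impact."""
    name: str
    hosts: List[str]


@dataclass
class RolloutResult:
    completed: List[str] = field(default_factory=list)    # rings that passed the health gate
    rolled_back: List[str] = field(default_factory=list)  # rings reverted after failing it
    paused: List[str] = field(default_factory=list)       # later rings that were never touched


def staged_rollout(
    rings: List[Ring],
    deploy: Callable[[str], None],
    rollback: Callable[[str], None],
    error_rate: Callable[[Ring], float],
    max_error_rate: float = 0.01,  # assumed threshold; tune to your own risk tolerance
) -> RolloutResult:
    """Deploy ring by ring; on an unhealthy ring, roll it back and pause the rest."""
    result = RolloutResult()
    for index, ring in enumerate(rings):
        for host in ring.hosts:
            deploy(host)
        # Health gate: consult monitoring before widening the blast radius.
        if error_rate(ring) > max_error_rate:
            for host in ring.hosts:
                rollback(host)
            result.rolled_back.append(ring.name)
            result.paused = [later.name for later in rings[index + 1:]]
            break
        result.completed.append(ring.name)
    return result


def in_update_window(now: time, start: time, end: time) -> bool:
    """True if `now` is inside the customer's window, including windows that cross midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


# Small demo with in-memory stand-ins for real deployment and monitoring systems.
rings = [
    Ring("canary", ["c1"]),
    Ring("low-impact", ["a1", "a2"]),
    Ring("critical", ["b1"]),
]
deployed: List[str] = []
observed = {"canary": 0.0, "low-impact": 0.2, "critical": 0.0}

result = staged_rollout(
    rings,
    deploy=deployed.append,
    rollback=deployed.remove,
    error_rate=lambda ring: observed[ring.name],
)
print(result.completed, result.rolled_back, result.paused)
print(in_update_window(time(23, 30), start=time(22, 0), end=time(4, 0)))
```

In this demo the canary ring passes, the low-impact ring exceeds the error threshold and is rolled back, and the critical ring is never touched. The sketch reverts only the failing ring; whether earlier rings should also be reverted is a policy decision that depends on the update and on what monitoring shows.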
Proper preventative testing, when seamlessly integrated, can save significant time and money and prevent business losses. Proactive and concise testing is necessary to keep incidents like this Windows update failure at bay and keep your business running smoothly. Quality assurance experts like QA Consultants play an essential role in successful integration testing from the start. Don’t wait until a global outage causes flight delays, missed healthcare appointments, or significant financial repercussions.
Discover how QA Consultants specialists can address your integration quality assurance needs. Speak to an engineer today.