A little "What if?" can save a lot of "What on earth!"
Martyn Walmsley
I've transformed weekly releases into multiple daily releases in regulated FinTechs, becoming ISO27001 certified at the same time. What can I do for you?
I've just had a conversation with my MP about the National Air Traffic Services (NATS) system failure in August. As with all hindsight, being wise after the fact is easy; being wise before situations arise is where wisdom lies. I hope this will help some readers show that wisdom in their own situations.
The issue was caused by a flight plan coming into the NATS system which the software wasn't coded to handle. Simply that.
When the primary system found itself unable to process the flight plan, it raised a critical exception, placing itself into maintenance mode. The Control and Monitoring system picked up that the primary had failed and passed operations to the backup system. The backup system couldn't process the flight plan either. It is no surprise that less than 20 seconds after the initial receipt of the erroneous flight plan by NATS, both primary and backup systems had raised the same critical error against the same flight plan, and automatic processing of flight plans in the UK ceased. The same lack of software resilience was also evidenced when Ariane 5 flight 501 broke up and exploded in 1996. The inquiry board report can be found here for those who may be interested.
When I was running QA and Testing for HBOS Treasury services, later part of Lloyds Bank, we had processes for dealing with incoming data which couldn't be processed. We de-queued and stored the message, put the funds into a suspense account, raised an error with the system support team and processed the next inbound message. The support team would then review the erroneous message, contact the sender, diagnose the issue and arrange a data fix and re-submission. No downtime, and the funds still got to their intended destination.
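For readers who think in code, here is a minimal Python sketch of that "set it aside and carry on" pattern. It is not the HBOS or NATS implementation; the `Quarantine` class, the `parse_payment` function and the `ACCOUNT:AMOUNT` message format are all hypothetical names invented for illustration, and the suspense account step is left out to keep it short.

```python
from dataclasses import dataclass, field


@dataclass
class Quarantine:
    """Hypothetical holding area for messages that could not be processed."""
    held: list = field(default_factory=list)
    alerts: list = field(default_factory=list)

    def hold(self, message, error):
        # Store the message for later review and raise an alert for support.
        self.held.append(message)
        self.alerts.append(f"Could not process {message!r}: {error}")


def parse_payment(message):
    """Hypothetical parser: expects 'ACCOUNT:AMOUNT', e.g. 'ACC1:100.50'."""
    account, amount = message.split(":")
    return account, float(amount)


def process_all(messages, quarantine):
    """Process every message; a bad one is set aside and the run continues."""
    processed = []
    for message in messages:
        try:
            processed.append(parse_payment(message))
        except Exception as error:
            # One unprocessable message must not stop the whole system.
            quarantine.hold(message, error)
    return processed


quarantine = Quarantine()
inbound = ["ACC1:100.50", "garbled", "ACC2:20"]
processed = process_all(inbound, quarantine)
```

In this sketch the malformed message ends up in `quarantine.held` with an alert for the support team, while the two valid messages either side of it are processed as normal.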
The NATS rules around flight plans state that a flight plan has to be received 4 hours prior to the relevant aircraft entering UK airspace. Placing the erroneous plan into a holding location and sounding the alarm to get the attention of senior Air Traffic Controllers, whilst allowing automatic processing of other flight plans to continue, would have prevented
to mention just some of the impacts.
It can be argued that software will never be 100% perfect. I've written some and so have contributed to that body of imperfection. However, asking "What if ... ?", putting processes in place to handle the consequences and learning from the experiences of others over a quarter of a century can go a long way to preventing "What on earth!"