What can the UK Post Office Scandal tell us about AI?
The Horizon Scandal: A Lesson in Software Reliability
The UK Post Office Scandal serves as a grim reminder of the consequences of software failure. Due to flaws in the Horizon system, many Subpostmasters were falsely held responsible for shortfalls the software itself had produced, leading to prosecutions, imprisonment, financial ruin, and in some cases, suicide. This situation underscores the importance of reliable and transparent software systems.
In software engineering, financial systems like bank accounts are often used as examples. This is because the correctness of transactions in these systems is black and white. For instance, a money transfer should either complete fully or not at all. Despite its apparent simplicity, ensuring such atomicity in transactions is a significant challenge in computer science.
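Here is a minimal sketch of that all-or-nothing property, using Python's built-in sqlite3 module; the schema, account names, and amounts are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 0)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Debit src and credit dst as one unit: both happen, or neither does."""
    with conn:  # transaction: commits on success, rolls back on any exception
        cur = conn.execute(
            "UPDATE accounts SET balance = balance - ? "
            "WHERE id = ? AND balance >= ?",
            (amount, src, amount))
        if cur.rowcount == 0:
            # aborts the transaction: nothing is committed, no half-transfer
            raise ValueError("unknown account or insufficient funds")
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                     (amount, dst))

transfer(conn, "alice", "bob", 40)
print(dict(conn.execute("SELECT id, balance FROM accounts")))
# {'alice': 60, 'bob': 40}
```

The `with conn:` block is doing the real work: if anything fails between the debit and the credit, the database rolls back to the state before the transfer began.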
The Horizon system, while complex, essentially had a straightforward task: accurately record financial transactions. Yet, due to numerous bugs, some known to the developers and the Post Office, the Subpostmasters couldn't prove their innocence. They lacked the necessary expertise and system access. This points to a crucial flaw in the system's design: its lack of transparency and verifiability.
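Horizon offered its users nothing of the kind. For contrast, verifiability need not be exotic. In the illustrative sketch below (the record fields are invented, and this is not how Horizon worked), each ledger entry commits to the hash of the previous one, so any retroactive edit is detectable by anyone holding a copy of the log:

```python
import hashlib
import json

def append_entry(log, record):
    """Append a record that commits to the hash of the previous entry."""
    prev = log[-1]["hash"] if log else "0" * 64
    body = json.dumps({"record": record, "prev": prev}, sort_keys=True)
    log.append({"record": record, "prev": prev,
                "hash": hashlib.sha256(body.encode()).hexdigest()})

def verify(log):
    """Recompute the chain; any edited or reordered entry breaks it."""
    prev = "0" * 64
    for entry in log:
        body = json.dumps({"record": entry["record"], "prev": prev},
                          sort_keys=True)
        if entry["prev"] != prev or \
           entry["hash"] != hashlib.sha256(body.encode()).hexdigest():
            return False
        prev = entry["hash"]
    return True

log = []
append_entry(log, {"branch": "X", "amount": 100})
append_entry(log, {"branch": "X", "amount": -40})
assert verify(log)
log[0]["record"]["amount"] = 1000   # a silent back-office edit...
assert not verify(log)              # ...is now detectable
```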
No AI in Horizon: Implications for AI Systems
Interestingly, the Horizon system did not use AI. There were no complex judgments or subjective decisions to be made by the system. The reasons the issues weren't addressed earlier were human – fear of embarrassment, career impacts, and international politics. The technical flaws could have been corrected much earlier, preventing the ensuing tragedy.
This raises uncomfortable questions about AI in high-stakes applications. Horizon's task was simple; AI typically handles far less clear-cut problems, and evaluating the correctness of an AI system's decisions is rarely as black and white as checking a ledger. If we struggle with reliability in simple, deterministic systems, how do we ensure the trustworthiness of more complex AI systems?
The Dual Role of AI in Software Development and Evaluation
AI is undoubtedly speeding up the implementation of complex systems. It's also aiding in the verification of systems with known expected outcomes. Theoretically, AI could reduce the likelihood of errors as seen in the Horizon case. However, AI doesn't simplify the evaluation of complex systems, including AI systems themselves.
This paradox is creating a new challenge in the software industry. While AI boosts development productivity, it does not proportionally ease the validation burden. As we depend more on AI for complex applications, the need for thorough validation is increasing.
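One concrete form that thorough validation can take, for systems with known expected outcomes, is testing invariants rather than individual answers. Below is a sketch using the hypothesis property-based testing library (my choice of tooling, not something from the Horizon case), applied to the transfer() function sketched earlier: whatever amount is requested, money must be conserved and no balance may go negative.

```python
import sqlite3
from hypothesis import given, strategies as st

def fresh_db():
    """Test fixture: two accounts totalling 100."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                     [("alice", 100), ("bob", 0)])
    conn.commit()
    return conn

@given(amount=st.integers(min_value=0, max_value=200))
def test_transfer_conserves_money(amount):
    conn = fresh_db()
    try:
        transfer(conn, "alice", "bob", amount)  # transfer() from the sketch above
    except ValueError:
        pass  # a rejected transfer is acceptable...
    balances = dict(conn.execute("SELECT id, balance FROM accounts"))
    assert sum(balances.values()) == 100   # ...but money is never created or lost
    assert all(b >= 0 for b in balances.values())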
I anticipate a shift in the software industry's focus: from simply creating software to placing a much higher value on evaluating it. Validating that software, particularly AI-driven systems, performs as intended is crucial. The Post Office scandal exemplifies the need for software systems to be not just technologically advanced but also reliable, transparent, and verifiable. As AI evolves, so must our approach to ensuring these systems serve their purpose without causing harm.
Comment (Helping companies scale AI technology | Board Advisor | Coach | Consultant | ex-Amazon, Toshiba, University of Cambridge):
Absolutely right! With AI we can't forget the lessons that we already know from software development, and testing is a big one. As you say, spending proportionally more time to test and validate is likely the way forward.
Comment (Senior Software Engineer at Qualis Flow):
Very true. I think a key seam for assessing the quality of an AI system is the data it's trained on. (Not 'how much does this data breach copyright?', as in the latest words on the viability of OpenAI's business model.)

Does the data actually cover what's expected, or is it some kind of proxy for the desired data (which might be a poor approximation, encode dependencies that aren't in the original data, etc.)? Is it as complete as expected, in terms of rows and columns being non-null and containing valid values? Does the text in your training data have typos in it? Whichever the answer is, is that the correct one? (It could have too many or too few typos, depending on the context.) Do all parts of the domain show up in the data with the correct frequency? For example, you might have an awful lot of COBOL source code in your training data and only a tiny amount of Python source code, but hope to train an AI to act as a copilot for Python programmers. Or your training data is mostly white males in their 20s, and so your facial recognition software doesn't work so well for people of other ethnicities, ages, and genders. Without sound data, I wouldn't want to start building an AI system.
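That list of questions translates directly into executable checks. Here is a toy sketch of such a data audit, assuming pandas is available; the column names, dataset, and thresholds are all invented for illustration:

```python
import pandas as pd

def audit_training_data(df: pd.DataFrame) -> list[str]:
    """Flag the kinds of data problems the comment above describes."""
    problems = []
    # Completeness: columns with missing values.
    for col, rate in df.isna().mean().items():
        if rate > 0.0:
            problems.append(f"{col}: {rate:.0%} of rows missing")
    # Coverage: does the part of the domain we care about appear often enough?
    python_share = (df["language"] == "Python").mean()
    if python_share < 0.30:  # illustrative threshold for a Python copilot
        problems.append(f"Python is only {python_share:.0%} of the corpus")
    return problems

df = pd.DataFrame({
    "source":   ["repo_a", "repo_b", None, "repo_d"],
    "language": ["COBOL", "COBOL", "COBOL", "Python"],
})
print(audit_training_data(df))
# ['source: 25% of rows missing', 'Python is only 25% of the corpus']
```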