Conventional Software vs. Machine Learning Applications - A Tester's Perspective
Abstract
In this article, I briefly discuss the high-level ideas behind two major AI world views: Connectionism (Neural Network based Machine Learning and Deep Learning techniques), which learns about the world from real-world examples, and Symbolism (Expert Systems), which is encoded with human-defined specifications in the form of symbolic representations and programming logic.
I also list the differences between conventional software testing, which has relatively stable and clear test oracles, and machine learning testing, which must deal with a constantly moving target whose bugs may exist not just in the algorithm but also in the training data.
There are two main schools of thought in the field of Artificial Intelligence: Connectionism (Artificial Neural Networks) and Symbolism (Expert Systems).
Symbolism Vs. Connectionism
Symbolism, in which human experts encode prior knowledge into the IT system in the form of rule-based specifications and symbolic representations, dominated the news headlines and funding for several decades from the mid-1950s.
On the other hand, Artificial Neural Network techniques (e.g., Machine Learning and Deep Learning), which let the IT system grow its intelligence by learning from training data, have finally regained momentum and outperformed Expert Systems on many business fronts in recent years. This series of successes gradually unfolded after the team of Geoffrey Hinton, often called the Godfather of Deep Learning, won the annual ImageNet Large Scale Visual Recognition Challenge in 2012 by a wide margin with the convolutional neural network AlexNet.
In my opinion, the current forms of AI, whether Connectionism (Artificial Neural Networks) or Symbolism (Expert Systems), are outcomes of mapping human intelligence onto machine intelligence.
Classical Artificial Intelligence displays human-like intelligence programmed by software engineers through hand-crafted rules and logic based on subject matter experts' designs.
Artificial Neural Networks, more commonly known these days as Machine Learning and Deep Learning, learn to make sense of the world by processing large amounts of high-quality, domain-specific and context-driven data (real-world examples) carefully curated by semi-skilled workers and human experts. If human experts would not use certain information in their decision-making process, that data is unlikely to be valuable from a machine learning perspective.
Therefore, I believe that any quality assurance process, policy or standard that we intend to apply in the development life cycle of an AI-based information system should be human-centered, value-driven and collaboration-oriented, so that we can help cross-functional teams maximize human intelligence throughput in the form of efficient data pipelines, clean and accurate data, quality code (algorithms), insightful A/B testing and so on.
Software Testing Vs. Machine Learning Testing
Many of us software testers have spent years testing traditional applications developed by programmers from rule-based specifications, much like Expert Systems. We now need to learn how to conduct meaningful testing of non-deterministic, Neural Network based applications that keep adapting their responses to what they have learned from the most recent transactions.
Indeed, testing machine learning applications, which do not always return the same answers, requires new approaches.
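To illustrate what such an approach might look like, here is a minimal sketch (the model interface, the data and the 0.90 threshold are my own illustrative assumptions): instead of asserting an exact output, the test asserts an aggregate, team-agreed property of the model on a held-out set.

```python
# A minimal sketch (hypothetical model interface, data and threshold): rather than
# matching exact outputs, assert an aggregate property on a held-out set.
import numpy as np

def evaluate_accuracy(model, X_test, y_test):
    """Return the fraction of held-out examples the model predicts correctly."""
    predictions = model.predict(X_test)
    return float(np.mean(predictions == y_test))

def test_model_meets_agreed_accuracy(model, X_test, y_test, threshold=0.90):
    # The 0.90 threshold is an assumed, team-agreed acceptance bar, not a universal rule.
    accuracy = evaluate_accuracy(model, X_test, y_test)
    assert accuracy >= threshold, f"accuracy {accuracy:.3f} fell below the agreed {threshold}"
```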
According to Zhang et al. (2020), traditional software testing and machine learning testing differ in many aspects, as follows:
1) Component to test (where the bug may exist): traditional software testing detects bugs in the code, while ML testing detects bugs in the data, the learning program, and the framework, each of which plays an essential role in building an ML model.
2) Behaviours under test: the behaviours of traditional software code are usually fixed once the requirement is fixed, while the behaviours of an ML model may frequently change as the training data is updated.
3) Test input: the test inputs in traditional software testing are usually the input data when testing code; in ML testing, however, the test inputs may have more diverse forms. Note that we separate the definition of 'test input' and 'test data'. In particular, we use 'test input' to refer to inputs in any form that can be adopted to conduct machine learning testing, while 'test data' specially refers to the data used to validate ML model behaviour. Thus, test inputs in ML testing could be, but are not limited to, test data. When testing the learning program, a test case may be a single test instance from the test data or a toy training set; when testing the data, the test input could be a learning program.
4) Test oracle: traditional software testing usually assumes the presence of a test oracle. The output can be verified against the expected values by the developer, and thus the oracle is usually determined beforehand. Machine learning, however, is used to generate answers based on a set of input values after being deployed online. The correctness of the large number of generated answers is typically manually confirmed. Currently, the identification of test oracles remains challenging, because many desired properties are difficult to formally specify. Even for a concrete domain specific problem, the oracle identification is still time-consuming and labour-intensive, because domain-specific knowledge is often required. In current practices, companies usually rely on third-party data labelling companies to get manual labels, which can be expensive. Metamorphic relations are a type of pseudo oracle adopted to automatically mitigate the oracle problem in machine learning testing (a minimal sketch follows after this list).
5) Test adequacy criteria: test adequacy criteria provide a quantitative measure of the degree to which the target software has been tested. To date, many adequacy criteria have been proposed and widely adopted in industry, e.g., line coverage, branch coverage, dataflow coverage. However, due to fundamental differences in programming paradigm and logic representation between machine learning software and traditional software, new test adequacy criteria are required that take the characteristics of machine learning software into consideration.
6) False positives in detected bugs: due to the difficulty in obtaining reliable oracles, ML testing tends to yield more false positives in the reported bugs.
7) Roles of testers: the bugs in ML testing may exist not only in the learning program, but also in the data or the algorithm, and thus data scientists or algorithm designers could also play the role of testers.
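To make the metamorphic relation mentioned in point 4 concrete, here is a minimal sketch (the k-nearest-neighbour classifier and the chosen relation are my own illustrative assumptions, not a prescription from the cited survey): uniformly scaling every feature by the same positive constant scales all pairwise distances equally, so the predictions of a distance-based classifier should not change, and the follow-up run acts as a pseudo oracle for the original one.

```python
# A minimal metamorphic test sketch (illustrative example, not from the cited survey).
# Relation assumed: multiplying every feature by the same positive constant should not
# change the predictions of a k-nearest-neighbour classifier.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def test_knn_invariant_to_uniform_feature_scaling():
    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(100, 4))
    y_train = rng.integers(0, 2, size=100)
    X_test = rng.normal(size=(20, 4))

    model = KNeighborsClassifier(n_neighbors=3)
    original = model.fit(X_train, y_train).predict(X_test)

    scale = 10.0  # any positive constant should leave neighbour ordering unchanged
    follow_up = model.fit(X_train * scale, y_train).predict(X_test * scale)

    # The follow-up predictions serve as a pseudo oracle for the original run.
    assert np.array_equal(original, follow_up)
```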
The above constraints that we face when trying to validate the results produced by AI-driven applications reflect the ambiguous and uncertain nature of the real world: it is borderless, full of unexpected dynamics and unavoidable noise, and therefore cannot easily be confined within a conventional information system domain with clearly defined boundaries, relatively static conditions and design assumptions that always hold true.
Usefulness Vs. Correctness
Rather than becoming too obsessed with achieving absolute correctness, accuracy and consistency in modern AI applications through the lens of traditional software engineering, we could rethink how to make the most of AI predictive models when the ground is constantly shifting: do I want the system to be right, or do I want the system to be useful?
The way to think about this trade-off when using data-driven AI predictive models is no different from George Box's famous quote, first recorded in the "Science and Statistics" paper published in the Journal of the American Statistical Association in 1976:
"All models are wrong, but some are useful."
To a certain extent, we might conclude that machine learning and deep learning applications are largely programmed with meaningful data specific to a particular problem domain. The data-centric nature of machine learning applications means that these AI techniques are more vulnerable to bad data and suffer the most from the "Garbage In, Garbage Out" effect.
As such, the quality criteria for machine learning and predictive analytics driven applications can be very different from conventional software testing quality metrics. The central focus of machine learning applications should be on quality input data rather than solely on accurate output results, and more quality gates should be placed along the data pipeline as preventive measures against nonsensical output.
Put into perspective, the quality assurance processes and policies that we aim to design and apply to any big data driven AI application should be directed at what usually makes or breaks the predictive analytics models: the end-to-end data pipeline, made up of data acquisition, data cleansing, data ETL (Extract, Transform, Load) and data enrichment.
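As one concrete illustration of such a quality gate, here is a minimal sketch (the column names, value ranges and pandas-based checks are illustrative assumptions, not a prescribed standard): a batch that fails the checks is rejected before it ever reaches model training.

```python
# A minimal data quality gate sketch (column names and valid ranges are illustrative
# assumptions). A batch that fails these checks never reaches model training.
import pandas as pd

EXPECTED_COLUMNS = {"customer_id", "age", "monthly_spend"}

def validate_batch(df: pd.DataFrame):
    """Return a list of data quality issues found in the incoming batch."""
    issues = []
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        # Schema failure: no point running the remaining checks.
        return [f"missing columns: {sorted(missing)}"]
    if df["customer_id"].duplicated().any():
        issues.append("duplicate customer_id values")
    if df["age"].isna().any() or not df["age"].between(0, 120).all():
        issues.append("age values missing or outside the plausible 0-120 range")
    if (df["monthly_spend"] < 0).any():
        issues.append("negative monthly_spend values")
    return issues

# Usage: block the pipeline stage whenever the gate reports issues.
# problems = validate_batch(incoming_batch)
# if problems:
#     raise ValueError("; ".join(problems))
```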
Of course, testers, together with data scientists, machine learning engineers, subject matter experts, developers, product owners and others, also need to continuously evaluate the interpretability, explainability and reproducibility of the trained AI models (which are constantly challenged by fast-changing real-world dynamics) using meaningful, well-represented test data and up-to-date real-world data.
AI breakthroughs as the result of excellent engineering
We should also appreciate that, according to technology analysts, the human-like output of the most advanced data-driven AI models, such as Google's BERT and OpenAI's GPT, is the result of excellent engineering and the use of ever-increasing numbers of parameters. Therefore, the best practices of conventional Software Engineering should always serve as guiding principles for the development of AI-driven applications.
The June 11th 2020 edition of The Economist cited an estimate by Cognilytica, an AI-focused consultancy, that data wrangling takes up about 80% of the time consumed in a typical AI project. Data issues are one of the most common sticking points in AI projects, even though the world generated 33 zettabytes of data in 2018 alone, because the data required for rare cases may not exist at all, the data might be locked up in a competitor's vaults, or the data might not be suitable for feeding to computers.
One workaround for the under-representation of certain rare data classes is to generate synthetic data that simulates reality closely (e.g., through data augmentation) and include it in the AI model's training data. According to The Economist, the hope is that all this data-related faff will be a one-off: once trained, a machine-learning model will repay the effort over millions of automated decisions, at least until the next data drift happens.
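As a minimal sketch of that augmentation idea (the simple image transforms below are purely illustrative assumptions; suitable augmentations are always domain-specific), each rare example is expanded into a few plausible synthetic variants before training:

```python
# A minimal data augmentation sketch for a rare image class (illustrative only).
import numpy as np

def augment_image(image, rng):
    """Return simple synthetic variants of an (H, W, C) image array with values 0-255."""
    flipped = image[:, ::-1, :]                                      # horizontal flip
    noisy = np.clip(image + rng.normal(0, 5, image.shape), 0, 255)   # mild pixel noise
    darker = np.clip(image * 0.8, 0, 255)                            # brightness shift
    return [flipped, noisy, darker]

def oversample_rare_class(images, labels, rare_label, rng):
    """Append augmented copies of the rare class to rebalance the training set."""
    extra_images, extra_labels = [], []
    for img, lbl in zip(images, labels):
        if lbl == rare_label:
            variants = augment_image(img, rng)
            extra_images.extend(variants)
            extra_labels.extend([lbl] * len(variants))
    return images + extra_images, labels + extra_labels
```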
Unfortunately, the COVID-19 global pandemic has caused machine-learning models trained on normal data to show cracks and fail to make the desired accurate predictions. Andrew Ng, Coursera co-founder, Landing AI CEO and former head of Google Brain, reminded us that building practical machine learning systems almost always requires going beyond achieving high performance on a static test set.
In a recent issue of the deeplearning.ai weekly newsletter The Batch, Andrew Ng suggested that we may need to build an alert system to flag unusual ongoing changes in the world, use human-in-the-loop deployments to acquire new labels for machine learning models, and assemble a robust MLOps team that can perform post-deployment monitoring and alerting to make sure issues are fixed when they arise.
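One simple way such an alert could be sketched out (the single numeric feature, the two-sample Kolmogorov-Smirnov test and the 0.01 threshold are my own illustrative assumptions, not Andrew Ng's recipe):

```python
# A minimal drift-alert sketch (illustrative assumptions: a single numeric feature,
# a two-sample Kolmogorov-Smirnov test and a 0.01 significance threshold).
from scipy.stats import ks_2samp

def drift_alert(training_feature, live_feature, alpha=0.01):
    """Return True when the live feature distribution differs significantly
    from the distribution seen at training time."""
    _statistic, p_value = ks_2samp(training_feature, live_feature)
    return p_value < alpha

# Usage: compare a window of recent production inputs against the training data.
# if drift_alert(train_df["monthly_spend"].to_numpy(),
#                recent_df["monthly_spend"].to_numpy()):
#     notify_mlops_team("possible data drift on monthly_spend")  # hypothetical hook
```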
Conclusion
The current forms of AI, whether Connectionism (Artificial Neural Networks) or Symbolism (Expert Systems), are outcomes of mapping human intelligence onto machine intelligence.
AI's weakness is no different from humans': we do not know what we have no historical data about. It is therefore important to avoid falling into the trap of biased implementations orchestrated by machine algorithms fed with data skewed towards certain classes.
My humble opinion is that, at the current stage, the strengths of data-driven artificial intelligence lie in being broad and deep, down to minute detail, and occasionally displaying unexpected tactical and strategic brilliance; it is not necessarily accurate all the time, because no single AI model or application can own all the data for all domains. As such, Neural Network based applications respond poorly to data that is "out of distribution" and are unable to generalize far beyond their training data.
AI-driven predictive analytics models are normally better at predicting a range of possible outcomes based on historical data and changing variables than at giving definitive, deterministic answers that hold true under all circumstances. Therefore, A/B testing could be a better yardstick for gauging the acceptance of trained models than strictly adhering to the absolute passed/failed result-matching criteria of conventional test cases. Expert opinions from the algorithm/model designers and subject matter experts will also help explain why some differing results in the current training or test cycle are acceptable, given the evolution of the external conditions presented in the form of data parameters or variables.
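As a rough sketch of what such an A/B acceptance check might look like (the binary conversion metric, the two-proportion z-test from statsmodels and the traffic numbers are illustrative assumptions), the candidate model is accepted when its uplift over the incumbent is statistically significant, not when outputs match exactly:

```python
# A minimal A/B evaluation sketch (illustrative assumptions: a binary conversion metric
# and a two-proportion z-test on traffic split between incumbent model A and candidate B).
from statsmodels.stats.proportion import proportions_ztest

def accept_candidate_model(conversions_a, visitors_a, conversions_b, visitors_b, alpha=0.05):
    """Return True when candidate model B significantly outperforms incumbent model A."""
    _stat, p_value = proportions_ztest(
        count=[conversions_b, conversions_a],
        nobs=[visitors_b, visitors_a],
        alternative="larger",  # H1: model B's conversion rate is larger than model A's
    )
    return p_value < alpha

# Usage with made-up traffic numbers:
# accept_b = accept_candidate_model(conversions_a=480, visitors_a=10_000,
#                                   conversions_b=545, visitors_b=10_000)
```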
Rather than hoping to build a be-all, know-all Artificial General Intelligence (AGI) system as a single comprehensive AI model, which is virtually impossible, the industry should focus on better organizing and coordinating multiple specialized functional Neural Networks to automate operational workflows, and on optimizing resource planning, deployment and monitoring to achieve cost savings while serving the envisioned business goals and corporate strategies. An effective organization of multiple well-trained and rigorously tested machine learning models can provide enterprises with comprehensive business operation solutions.
References
Beck, M. and Libert, B. (2018). The Machine Learning Race Is Really a Data Race. [online] MIT Sloan Management Review. Available at: https://sloanreview.mit.edu/article/the-machine-learning-race-is-really-a-data-race/ [Accessed 11 July 2020].
deeplearning.ai. (2020). The Batch Newsletter Issue April 29, 2020. [online] Available at: https://blog.deeplearning.ai/blog/the-batch-tesla-parts-the-curtain-detecting-dangerous-bugs-mapping-disaster-zones-detecting-humans-from-wi-fi-toward-trustworthy-ai [Accessed 11 July 2020].
Ray, T. (2020). Devil's in the details in historic AI debate. [online] ZDNet. Available at: https://www.zdnet.com/article/devils-in-the-details-in-bengio-marcus-ai-debate/ [Accessed 25 July 2020].
The Economist. (2020). For AI, data are harder to come by than you think. [Online] (June 11th). Available at: https://www.economist.com/technology-quarterly/2020/06/11/for-ai-data-are-harder-to-come-by-than-you-think [Accessed 11 July 2020].
Varhol, P. (2019). How To Test Software In The Age Of Machine Learning. [online] TechBeacon. Available at: https://techbeacon.com/enterprise-it/moving-targets-testing-software-age-machine-learning [Accessed 27 March 2020].
Wang, J. (2017). Symbolism vs. Connectionism: A Closing Gap in Artificial Intelligence. [Blog] Jieshu's Blog, Available at: https://wangjieshu.com/2017/12/23/symbol-vs-connectionism-a-closing-gap-in-artificial-intelligence/ [Accessed 14 June 2020].
Zhang, J.M., Harman, M., Ma, L. and Liu, Y. (2020). Machine learning testing: Survey, landscapes and horizons. IEEE Transactions on Software Engineering. [Online]. Available at: https://arxiv.org/pdf/1906.10742.pdf [Accessed: 26 March 2020].