Simple, Elegant, Convincing, and Wrong: The fallacy of ‘Explainable AI’ and how to fix it, part 6
This week we have a series of blog posts written by our CEO & Founder Luuk van Dijk on the practicalities and impracticalities of certifying “AI”, running up to the AUVSI panel discussion on April 25.
6. Generalizability to save the day
I have just spent five blog posts explaining in some detail why we cannot hope to achieve certifiability of AI by adding a magic explainability sauce to make things DO-178C-style traceable. But I also argued that to solve the next generation of problems in vehicle or flight control, we have no choice but to use Machine Learning-based systems. So we had better come up with some way to verify which ones are good enough and which ones should not be allowed into operation.
Fortunately, there is a way. It requires us to take a step back and rethink what we actually need.
While the methods of DO-178C are an “acceptable means” to demonstrate the adequacy of software, they are not the end. The end is what rules like ‘14 CFR part 25.1309’ prescribe: the systems shall be designed to perform their intended functions under any foreseeable operating condition. This is a requirement on the system as a whole. So if the system contains a component with limited accuracy, it had better be designed to cope with that.
As an example, consider a neural network that, given a single image, decides if and where there is a runway in the picture, and does so sufficiently precisely only 96% of the time. If our system only ever looked at a single image during the flight and then blindly landed on whatever it thought was a runway, we would crash, or at least badly mess up, one in every 25 landings. That would clearly be terrible, and also a very bad design. But such a neural network component is part of a larger system that, if properly designed, can deal with this finite per-image error.
Daedalean’s Visual Landing Guidance system engages during an approach to an area where we know there should be a runway, because we carry a database of runways. During the approach, the probability of having at least one frame with bad guidance asymptotically approaches 1 (i.e. certainty). At the same time, we expect the neural network to become more and more consistently sure of what it sees. This is monitored by a separate subsystem. Only when we have a consistent reading of where we think the runway is does the system provide guidance; otherwise, it flags that it can’t lock on to the proper approach path, so a higher-level system can decide to abort the landing or use an alternate source of navigation.
By carefully analyzing the dependencies in this system, we can arrive at a design that copes with the 4% failure rate on a single image and that fails to properly identify the runway without a warning in fewer than one in 10⁶ landings. By combining this with other sensors, we can reduce this further.
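To make the arithmetic concrete, here is a minimal back-of-the-envelope sketch of the kind of calculation involved. The frame count and the independence assumption are illustrative, not a description of the actual Daedalean analysis.

```python
# Back-of-the-envelope sketch (illustrative numbers, not the real analysis):
# how a consistency requirement turns a 4% per-image error rate into a much
# smaller chance of giving bad guidance without a warning, assuming frames
# sampled sufficiently far apart are effectively independent.

PER_FRAME_ERROR = 0.04    # assumed single-image failure rate measured in the lab
CONSISTENT_FRAMES = 5     # hypothetical: frames that must agree before we lock on

# The monitor is only fooled if every frame in the window is wrong in a
# consistent way; otherwise it flags that it cannot lock on.
p_undetected = PER_FRAME_ERROR ** CONSISTENT_FRAMES
print(f"P(bad guidance without a warning) ≈ {p_undetected:.1e}")  # ≈ 1e-7 per window
```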
Similarly, the Traffic Detection function doesn’t get just a single chance to spot other traffic (which it would miss 3-4% of the time); instead, it has to detect an approaching aircraft before it is too close, and it gets a new opportunity with every frame from the camera. The probability of getting it right in one frame is not independent of the next frame, but it becomes effectively independent after some time has elapsed. Again, by carefully designing the system to deal with the finite failure rate per image, we can achieve acceptable performance for the overall system (and definitely a lot better than human performance). This dependency can be analyzed completely ‘classically’, as we saw in the NTSB report on the Uber crash, which perfectly described the design flaws stacked on top of an unreliable image recognition component.
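The same kind of sketch applies to traffic detection. Again, the numbers below (per-frame miss rate, decorrelation time, time available) are assumptions for illustration only.

```python
# Illustrative sketch: cumulative chance of spotting approaching traffic in time.
# All numbers are assumptions, not measured performance figures.

PER_LOOK_MISS = 0.04         # assumed per-frame miss rate (the 3-4% above)
DECORRELATION_TIME_S = 1.0   # assumed time after which two looks are effectively independent
TIME_AVAILABLE_S = 30.0      # assumed time between first visibility and "too close"

independent_looks = int(TIME_AVAILABLE_S / DECORRELATION_TIME_S)
p_never_seen = PER_LOOK_MISS ** independent_looks
print(f"P(traffic never detected in time) ≈ {p_never_seen:.0e}")
# Astronomically small on paper -- which is exactly why correlated failure
# modes (fog, sun glare, a systematic blind spot) dominate in practice.
```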
But being able to design systems to deal with finite failure rates at the single-shot image level in a component does require one very important property: that the, say, 4% failure rate we observed on our test dataset in the lab will also hold when we deploy the system in real flights. If the failure rate goes up to 40% when confronted with reality, for example, in fog or against the sun, we have a disaster in the making.
And this is the crux of the problem of certifying a machine-learned component. How can we be sure that if we measure a precision, recall, or accuracy number for our model on a dataset, it will hold “under all foreseeable operating conditions” in the operational domain?
This problem was widely studied from the 1970s onward in a field called ‘learning theory’, long before neural networks became popular. The property we are looking for is ‘generalizability’, and the means to tame it is a quantifiable domain gap.
When any quantity is computed over a sample dataset, it is an estimate of that quantity on the broader population the dataset was sampled from. When the quantity at hand is the distribution of an error metric, we get a probability distribution over the probability distribution of that error metric in the ‘reality’ we drew the sample from, provided the sample really was drawn from the same distribution.
That may sound a bit abstract. Say we measure 96% recall (the fraction of aircraft that were there that we actually saw) and 90% precision (the fraction of our detections that really were aircraft) on a dataset; then there are theorems that say that in reality we won’t be worse than, say, 90% and 89% respectively, depending on the size of our sample and on the capability of our model to fit (and overfit) anything that is thrown at it.
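For a feel of what such a theorem looks like in its simplest form, here is a sketch using a plain Hoeffding bound. It ignores the model-capacity terms the real theorems carry, and the sample sizes are made up.

```python
import math

# Simplest possible generalization bound (Hoeffding): if a rate was measured on
# n samples drawn i.i.d. from the operational distribution, how much worse could
# the true rate be, with probability 1 - delta? Model-capacity terms are ignored.

def worst_case_rate(measured: float, n_samples: int, delta: float) -> float:
    """Lower bound on the true rate, holding with probability 1 - delta."""
    epsilon = math.sqrt(math.log(1.0 / delta) / (2.0 * n_samples))
    return measured - epsilon

# Hypothetical numbers: 96% recall measured, 99% confidence wanted.
print(worst_case_rate(0.96, n_samples=10_000, delta=0.01))  # ≈ 0.945
print(worst_case_rate(0.96, n_samples=100,    delta=0.01))  # ≈ 0.81: small samples hurt
```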
These theorems are not all straightforward, and may produce unusable, so-called ‘vacuous’ bounds, like ‘the probability that you are wrong is smaller than 200%’ (we knew that!), or lead to the requirement that your dataset contain 100 billion samples. As in all fields of engineering, it helps if you know what you are doing.
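To see how a bound goes vacuous: a textbook uniform-convergence bound over a finite hypothesis class adds a capacity term, and with a crude parameter count for a large network the slack exceeds 1 unless the dataset is absurdly large. The numbers below are purely illustrative.

```python
import math

# Textbook uniform-convergence slack over a finite hypothesis class H:
#   true_error <= test_error + sqrt((ln|H| + ln(1/delta)) / (2n))
# With a crude ln|H| for a network with millions of parameters, the slack is
# far above 1 (a vacuous bound) unless n runs into the billions.

def slack(log_num_hypotheses: float, n_samples: int, delta: float) -> float:
    return math.sqrt((log_num_hypotheses + math.log(1.0 / delta)) / (2.0 * n_samples))

print(slack(log_num_hypotheses=1e7, n_samples=50_000, delta=0.01))         # ≈ 10: vacuous
print(slack(log_num_hypotheses=1e7, n_samples=5_000_000_000, delta=0.01))  # ≈ 0.03
```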
All such ‘generalization bounds’ crucially depend on the dataset being sampled from the same distribution you will encounter during operation; otherwise, it is impossible to make any meaningful statement at all. Conversely, a statement that the performance of the machine-learning component is X with confidence Y is meaningless without specifying on what dataset it was measured and how that dataset was drawn from reality.
Consequently, it is not possible to build something that just “simply always” works, an illusion we may have gotten away with for the simpler avionics systems that we have today. Instead, the requirements on the system will have to be traceable to requirements on the error function the machine learning algorithm is trying to minimize and on the dataset on which we evaluate it.
Where in the past our system requirements, high-level software requirements, and low-level software requirements told an assuring story of how each level explains the other, the machine-learning aerospace engineer will additionally have to put the same effort into explaining why the dataset is representative and sufficiently large, which starts with as precise as possible a characterisation of the system's operational domain.
This, also, is not entirely new to safety-critical engineering. In software, running the same unit test twice should give the same result, but in any other part of the aircraft, tests are usually of a statistical nature. A strut is put on a test bench and hammered 45,000 times to establish that, on average, it breaks after 35,000 hammerings, and the service manual will say “replace after 25,000 hammerings”. (My favorite example is the bird strike test for jet engines. Depending on the size of the inlet, there is a prescribed weight of bird that jet engine builders have to throw into a very expensive, running engine to see if that doesn’t destroy it. In practice, these tests are not even independent: you throw in one bird, and if it destroys the engine, you build another one and test it, until there is one that passes. But 70 years of building jet engines have apparently shown that this is a sufficiently rigorous test. Perhaps the fact that a failed test costs a complete jet engine makes the manufacturer over-engineer the machines rather than game the statistical flaw in the procedure.)
This is where ‘data science’ enters the stage, a skill set different from what is traditionally found in avionics. When using machine learning, we will have to come up with methods to characterize the operational domain and the datasets drawn from it. How to do that will be the subject of another series of blog posts someday.