The Bitter Lesson of Domain Knowledge
Since the beginning of the field, machine learning researchers have been tempted by the alluring idea of creating models with pre-existing domain knowledge baked into the architecture. This idea lives on today in applications of knowledge graphs, symbolic AI, and feature engineering. In general, I don’t think this is a good idea and, in fact, I think there’s a much better way to provide a model with domain knowledge.
In my opinion, there are at least three reasons why the idea of creating models with pre-existing domain knowledge is alluring. First, it feels like a model is more interpretable if you force a particular structure on it. Second, it makes you feel clever when you find a neat architecture trick that makes your model better. And, third, it does, in fact, usually make your model better — but only in the short run. The first two reasons are delusions; although, I guarantee that I, and every other machine learning researcher who’s been at it for a while, have fallen for them. The third reason, however, is worth exploring further.
To paraphrase Richard Sutton, the bitter lesson that has repeatedly presented itself over the past 70 years of AI research is that unconstrained architectures eventually surpass architectures that incorporate specific domain knowledge once the training dataset and compute are large enough. Therefore, an architecture that incorporates pre-existing domain knowledge may perform better today, but probably not in a year, and definitely not in a decade.
Creating architectures that incorporate pre-existing domain knowledge solves a today problem by creating a tomorrow problem.
Obviously, injecting domain knowledge into the architecture of a model is bad if that domain knowledge is actually wrong. Even though it’s obvious, this has happened frequently in the history of machine learning for biology, particularly in applications of knowledge graphs built from the literature. The literature says that increasing protein A decreases protein B, so I put that edge in my knowledge graph; later it turns out the experiment wasn’t performed correctly and the fact isn’t true at all. That’s a clear failure mode and, unfortunately, it can hide itself for a long time if that part of the knowledge graph isn’t needed for the examples encountered during training. I may not find out until my faulty knowledge graph surprises me and messes up my model’s predictions exactly when I needed them most.
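As a toy illustration of how such a bad edge can hide (the proteins, edges, and path logic below are my own hypothetical illustration, not from any real knowledge base):

```python
# Literature-derived triples; the first edge came from a flawed experiment,
# but the graph itself has no way to know that.
knowledge_graph = [
    ("protein_A", "decreases", "protein_B"),  # later retracted, yet baked in
    ("protein_B", "activates", "pathway_X"),
]

SIGNS = {"decreases": -1, "increases": +1, "activates": +1}

def effect(source: str, target: str) -> str:
    """Walk signed edges from source toward target (toy path reasoning)."""
    sign, node = +1, source
    while node != target:
        step = next(((rel, t) for s, rel, t in knowledge_graph if s == node), None)
        if step is None:
            return "unknown"
        sign *= SIGNS[step[0]]
        node = step[1]
    return "inhibits" if sign < 0 else "promotes"

# The bad edge stays invisible until a query actually traverses it:
print(effect("protein_A", "pathway_X"))  # "inhibits" -- inherited from the bad fact
```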
A second problem with incorporating domain knowledge into the model architecture is that even if the knowledge is correct, it may not be obvious how to encode it directly into the model. I know that at a four-way stop, the car that came to a stop first has the right of way. But that’s not true if one of the other vehicles is an ambulance with its siren on, for example. Or, what if I came to a stop first but there’s a group of pedestrians crossing the street in front of me? Can one of the other cars go instead? What if another car is speeding down the road like a madman and doesn’t look like it’s going to stop? Should I go anyway and tempt fate? The list of possibilities goes on. How do I code all of this up into some kind of machine learning architecture with the right pre-existing knowledge?
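To make the difficulty concrete, here is a minimal sketch of what hand-coding the four-way-stop rules starts to look like (all names and rules are my own illustration, not from the article; every new edge case adds another interacting branch):

```python
from dataclasses import dataclass

@dataclass
class Vehicle:
    arrival_order: int           # 1 = came to a stop first
    is_ambulance: bool = False   # siren on
    looks_like_stopping: bool = True

def i_may_proceed(me: Vehicle, others: list[Vehicle],
                  pedestrians_crossing: bool) -> bool:
    """Hand-coded four-way-stop rules: one branch per edge case,
    and the branches interact with each other."""
    if pedestrians_crossing:
        return False                       # always yield to pedestrians
    if any(v.is_ambulance for v in others):
        return False                       # yield to an active ambulance
    if any(not v.looks_like_stopping for v in others):
        return False                       # don't tempt fate with a speeder
    # ...and another branch for every scenario we manage to think of
    return me.arrival_order == 1           # otherwise, first to stop goes

# Example: I stopped first, but an ambulance is present.
print(i_may_proceed(Vehicle(1), [Vehicle(2, is_ambulance=True)], False))  # False
```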
Even though it’s difficult for me as a human to formulate my knowledge in a precise mathematical way that I can force onto a machine learning architecture, it’s actually quite easy to generate examples. I don’t know how to precisely describe all the possibilities in the four-way stop problem, but I can easily provide some examples that illustrate the concept. It’s by providing a model with these generated examples during training that I can teach it my domain knowledge without forcing it into the architecture. This is the right way to inject domain knowledge into a machine learning model, in my opinion.
By using synthetic data, I’m acting like a teacher and my model is acting like a student. I’m teaching it to make certain types of predictions that agree with my pre-existing domain knowledge, but I’m not constraining how it makes those predictions. This immediately circumvents the second problem of having to turn my knowledge into a precise algorithm; I no longer have to do that, because all I have to do is generate examples, which is usually pretty easy. It also mitigates the first problem, because I can always delete bad training examples and retrain the model if I have to. If I gain new knowledge, I can simply create new examples to add to the training set and, voilà, the model will have that knowledge too. Better still, if there’s a new architectural or methodological breakthrough in machine learning, I can leverage it immediately by training a new model on my synthetic examples, getting a state-of-the-art model that incorporates my domain knowledge. This way, I’m never left behind.
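Here is a minimal sketch of this teacher/student setup (the feature encoding, labeling rule, and scikit-learn model are my own assumptions for illustration, not the author’s code): the domain knowledge lives entirely in the labeling function, while the model architecture stays generic.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def teacher_label(arrived_first: int, ambulance_present: int,
                  pedestrians_crossing: int) -> int:
    """Domain knowledge expressed as a labeling rule: may I proceed?"""
    if pedestrians_crossing or ambulance_present:
        return 0
    return int(arrived_first)

# Generate synthetic scenarios (three binary features) and label them
# with the rule above -- the teacher producing examples for the student.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(5000, 3))
y = np.array([teacher_label(*row) for row in X])

# The student: an unconstrained model with no right-of-way logic wired in.
model = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
model.fit(X, y)

# New knowledge later? Generate new examples and retrain; the
# architecture never changes.
print(model.predict([[1, 0, 0], [1, 1, 0]]))  # expected: [1, 0] (go, yield)
```

Swapping in a better architecture tomorrow means rerunning the last four lines on the same synthetic examples; the teaching step is untouched.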
To be more succinct, I think it’s better to use domain knowledge to create synthetic examples for training machine learning models with unconstrained architectures than to try to directly incorporate that knowledge into the architecture itself.
Professor of Statistics at Arizona State University · 8 months ago
In old-fashioned Bayesian statistics, priors in exponential family models can be interpreted exactly as pseudo-data. Although modern machine learning loses that mathematical exactness, I agree with you that it’s the right analogy and the right approach. That said, it should also give us pause, because “prior elicitation” is notoriously difficult, especially in “unstructured” models. And, as you point out, an unconstrained architecture may require a lot of data, meaning it would require a lot of synthetic data as well. But here’s the cool part. Simulating from a complex constrained model is computationally easier than fitting the constrained model, both in terms of compute time and in terms of man-hours. All of our “modeling” effort can go into forward simulation, and we can lean on the full suite of neural network technology for the training and deployment. It’s still very, very hard and requires lots of deep thought and effort and revisions, but the virtue is in the division of labor. No longer is our modeling imagination hamstrung by what we are able to fit, nor do we have to re-invent a novel model-fitting procedure every time we branch out with new domain constraints.
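For concreteness, the textbook Beta-Bernoulli case makes the pseudo-data interpretation above exact (a standard result sketched here for reference, not part of the original comment):

```latex
% Beta-Bernoulli conjugacy: the prior behaves exactly like pseudo-data.
% With a Beta(\alpha, \beta) prior on a Bernoulli parameter \theta and
% n observations containing k successes, the posterior mean is
\[
  \mathbb{E}[\theta \mid x_{1:n}] \;=\; \frac{\alpha + k}{\alpha + \beta + n},
\]
% i.e. the prior contributes \alpha pseudo-successes and \beta
% pseudo-failures, as if they were extra training examples.
```

In that sense, synthetic examples added to a training set play the role that the prior’s pseudo-counts play in the conjugate setting.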
Industrial-Strength Intelligent Autonomous Agents · 8 months ago
Charles Fisher, there is a term for applying domain knowledge to guide the learning process: teaching. And yes, teaching is a skill too, so it can be done well, or it can be done poorly. Teaching draws bounding boxes around promising areas in a practice space. When done well, it leads to radically improved learning in machines and people.
Intern @ NHS Data Science, PhD Candidate in AI for Healthcare @ Imperial, ex-Intern @ Microsoft Research · 8 months ago
Your description of the problem of pseudo-precise interpretable models makes sense, and I think what you’re suggesting is to train unconstrained models on lots of synthetic data? But how do you know your synthetic data reflects the domain... don’t you end up with a similar problem? You’d still need domain knowledge to tell you what a “bad”/“good” synthetic training example is, right? In my head, the ideal solution would be collecting data with a higher signal-to-noise ratio, i.e., better RCT design. But maybe that’s impractical...
Head of AI, Data Science, & Digital Automation | Techstars Advisor · 8 months ago
Kence Anderson - Machine teaching!