What makes NLP hard (and fun).
Chris Pedder
Chief Data Officer @ OBRIZUM | Board advisor | Data transformation leader | Posting in a personal capacity.
So it's 2020, and the much-anticipated AI-powered robot uprising is still very much in the indiscernible mists of the (possibly unreachable) future. No need to look nervously at your IoT kettle; we still have a long way to go before ML-enabled subjugation of the human race is anything more than a distant nightmare. In fact, based on my experiences in the field over the last few years, it feels like we are further from machine learning nirvana than we were a year ago. So, what happened?
Well, like all fields that suffer from early success, we seem to have run out of low-hanging fruit. And to be fair, in the five years (wow, is it really only five years?!) since the publication of the now-famous Nature paper on deep learning, a lot has happened. We have ceased to be the best at recognising images on ImageNet, we have lost our primacy at Go, and even before Hinton's tour de force, we had been outclassed at Jeopardy. So why on earth were these low-hanging fruit, and where are we now?
The crucial point to remember is that machine learning intelligence is still very much narrow intelligence. Whilst machines can comprehensively outplay us at Go, we are able to build mental models in a way they cannot - playing Go by moving pieces on a board with wavy lines would be completely beyond modern machine intelligence, but well within the capabilities of a six-year-old. And this is where things become complicated.
Ultimately, what makes human beings capable of doing more with fewer (or at the very least the same number of) computational units is the ability to generalise efficiently. In particular, we are pretty good at building Bayesian or even causal models of our world, in which we infer the relatedness of patterns. To give a simple example, show a three-year-old five cats from the front, and they will be able to spot a cat from behind - something that most image recognition models really struggle with. How do we do this? We have a model for what a cat looks like, and we have another model for the world in which we live - we know how things are likely to look if we rotate them, move them away from or towards us, light them differently, and so on. The interplay of this knowledge of the generalities of physics and the specifics of cats allows us to deal with a much broader array of cases than the current state of the art in machine learning can manage.
The recently released language model from OpenAI, with the catchy name of GPT-3, is a case in point. It's a very impressive piece of technology, with 175bn free parameters trained on around 2TB of textual data, comprising a significant chunk of the entire internet via Common Crawl. It's also remarkably good at finishing your writing - much better than GPT-2 and its four-horned unicorns. But it still can't manage this interplay of common sense and knowledge: Kevin Lacker discovered that if you ask the innocuous question "How many eyes does my foot have?", GPT-3's best guess is "your foot has two eyes". Statistically, not a bad guess, but maybe not the best $4.6M humanity has ever spent...
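GPT-3 itself sits behind a paid API, but if you want to poke at this failure mode yourself, here is a minimal sketch using the freely available GPT-2 through Hugging Face's transformers library (the prompt format is my own; expect similarly confident nonsense):

```python
# A minimal sketch of probing a language model with an out-of-domain
# question. GPT-2 stands in for GPT-3 here, since the latter sits
# behind a paid API; the Q/A prompt format is illustrative.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "Q: How many eyes does my foot have?\nA:"
result = generator(prompt, max_length=30, num_return_sequences=1)
print(result[0]["generated_text"])
# There is no model of bodies anywhere in the network, so it simply
# continues with whatever is statistically plausible.
```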
So how do we fix this? Well, the issue clearly isn't data. There are 499bn tokens (roughly, words) in the training set for GPT-3 - about 4,000 years of continual reading for an average-speed reader. It's obviously also not compute power, since OpenAI have that in spades thanks to their backer Microsoft. So maybe it's what we're *doing* with the data. A lot of the techniques used in machine learning are inspired by how human beings appear to think. The attention mechanism in NLP is based on how we seem to read - we focus on particular passages, and give them more weight in our interpretation. If you do this over big enough spans of text, presumably you can compress whole books, right? Well, maybe, but the real problem is scale. Complexity is key here: if you have lots of objects interacting in different ways, your story can become factorially complex.
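For the curious, here is a minimal NumPy sketch of the scaled dot-product attention that sits at the heart of transformer models like GPT-3 - the weights computed below are exactly the "give some passages more weight" scores described above (shapes and values are purely illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Weight each value by how relevant its key is to each query."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # pairwise query-key similarity
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax -> attention weights
    return weights @ V                     # weighted sum of the values

# Toy example: a "sentence" of 4 tokens with 8-dimensional embeddings.
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```

Note the pairwise query-key scores: every token attends to every other token, which is exactly why the cost blows up as the text gets longer.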
How do you get around complexity? Speaking as a physicist: you build models. A good model helps you concentrate on what is essential to solving a problem, and ignore what is irrelevant detail. When working out how to catch a ball, we should be much more concerned with accurately estimating the pitch and velocity with which it was thrown, or the wind speed, than with the changing gravitational field due to uneven ground. And we learn such estimation behaviour from just a few hundred experiences.
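To put some illustrative numbers on that (all of them mine): under the simple constant-gravity projectile model, a 5% error in your estimate of the launch speed moves the landing point by about a metre, while even a generous variation in g over uneven ground moves it by about a centimetre - the model tells you which estimate deserves your attention.

```python
import numpy as np

def landing_distance(speed, pitch_deg, wind=0.0, g=9.81):
    """Range of a thrown ball under the simple constant-g projectile
    model, with a crude additive wind-drift correction."""
    theta = np.radians(pitch_deg)
    flight_time = 2 * speed * np.sin(theta) / g
    return speed * np.cos(theta) * flight_time + wind * flight_time

# A 5% error in the estimated launch speed moves the landing point...
print(landing_distance(10.0, 45.0) - landing_distance(9.5, 45.0))            # ~0.99 m
# ...about a hundred times more than a generous variation in g does:
print(landing_distance(10.0, 45.0) - landing_distance(10.0, 45.0, g=9.82))   # ~0.01 m
```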
Fundamentally, how we do this is very poorly understood - there are many competing schools of thought on how humans develop their understanding of the world, and little in the way of consensus between them. Unfortunately, one thing is very clear: the way we do it as we grow from babes in arms is very different from how we are doing machine learning. The cynic in me says this might have something to do with the companies at the forefront of machine learning research (and be in no doubt, it's companies, not universities) having enormous data centres that they would like to utilise. With their business models at stake, it's unlikely we will see a shift to model-based learning unless there is a great and certain advantage to be had.
In NLP, there are particular issues with this: mental models of the physical and emotional world that humans inhabit, along with models of how language itself tends to be constructed, all play a significant role in decoding a sentence in a given context. I'm afraid that if you came here hoping for answers, I have little to offer (although I strongly encourage you to read about Hopfield networks, which are currently enjoying something of a renaissance - there's a minimal sketch of one below), other than some guidance about what to look for in upcoming developments. If you see a new model like GPT-3 come out, ask questions like "can it infer properties accurately out of domain?" (like "how many eyes does my foot have?") or "can it deal accurately with causality?" on top of the usual "does it write amazing extended prose?" or "can it reach state of the art on XYZ dataset?". Those first two questions are much harder than the latter two (which could have been copied straight from this MIT Technology Review article), but when the answer changes from "no" to "maybe", it's time to start getting really excited...
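And, as promised, a concrete starting point for that Hopfield reading: a minimal sketch of the classical binary version - Hebbian storage plus iterative recall (synchronous updates for simplicity; the patterns and sizes are purely illustrative):

```python
import numpy as np

def train_hopfield(patterns):
    """Hebbian learning: store +/-1 patterns in a symmetric weight matrix."""
    n = patterns.shape[1]
    W = sum(np.outer(p, p) for p in patterns) / n
    np.fill_diagonal(W, 0)                    # no self-connections
    return W

def recall(W, state, steps=10):
    """Iteratively settle towards the nearest stored pattern."""
    for _ in range(steps):
        state = np.sign(W @ state)
        state[state == 0] = 1                 # break ties consistently
    return state

# Store one 8-unit pattern, then recover it from a corrupted copy.
pattern = np.array([1, -1, 1, 1, -1, -1, 1, -1])
W = train_hopfield(pattern[None, :])
noisy = pattern.copy()
noisy[:2] *= -1                               # flip two bits
print(np.array_equal(recall(W, noisy), pattern))  # True
```

The charm is that the memory is content-addressable: a partial or corrupted cue settles into the stored pattern, a flavour of recall rather closer to how we seem to retrieve things than a lookup table is.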