Large Language Models: From Prototype to Production
Thanks to everyone who came to my EuroPython keynote on LLMs from prototype to production! Here are my slides and a walkthrough of the talk.
I'm the co-founder of Explosion, best known for our open-source library spaCy. It's one of the most popular libraries for building NLP solutions, and it's been around quite a while now — long enough that ChatGPT is pretty good at writing code for it!
Our other project is our annotation tool Prodigy. Prodigy helps you label data to train or evaluate machine learning components. You can build fully scriptable workflows, using custom automation to make the tasks faster or to connect to your own data sources.
Before I dive in, it's worth giving an overview of what we mean by NLP, and of the distinction between generative and predictive tasks. LLMs do really well at generative tasks. Now that we're so much better at generative tasks, do we need predictive tasks less? How will all of this be used?
LLMs are making futurists of us all, and there are lots of different visions of how the technology will be deployed. I like to look at other periods of rapid technological change, and at examples of how people predicted the future then. Some revealing patterns emerge.
If you look around at work at any given point, what you see is a bunch of human-shaped tasks. So it’s tempting to imagine human-shaped solutions — some technology that will step in and do exactly the same thing as a human.
However, the work someone is doing isn’t the tasks they’re performing — it’s the value they’re providing. The history of technology is mostly the history of solutions which provide the same value, but differently.
So, bear that in mind when you imagine how tomorrow's systems will change today's tasks. Visual interfaces are really strong: if you want to book a meeting, talking to a virtual assistant will often be a worse user experience than just clicking a Calendly link.
The future is definitely anybody's guess. But short of AGI that kills us all, we can basically break the question down into two dimensions, as far as NLP goes. First, how disruptive will dialogue be? What percentage of human-computer interaction will be LLM-assisted dialogue?
Second, how will we build NLP things? Assuming we want a model that works in this structured sort of way — rather than just as part of a dialogue system — what approach will we use? Will we still label data and train models, or will we just use LLMs?
Here's a more concrete example. Let's say we've got an information extraction task like this. We want that information in some structured format, so that we can compute with it deterministically — put it in a database, display it in summaries, search for it predictably, etc.
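To make that concrete, here's a tiny sketch of the kind of structured target such an extraction step might produce — the schema and example text are hypothetical, not from the talk:

```python
# Hypothetical target schema for an information extraction task: once
# the text is mapped into this structure, we can compute with it
# deterministically (store it, aggregate it, search it).
from dataclasses import dataclass

@dataclass
class JobPosting:
    title: str
    company: str
    location: str

text = "Senior NLP Engineer at Acme Corp, based in Berlin."
# What the pipeline should extract from the text above:
expected = JobPosting(title="Senior NLP Engineer", company="Acme Corp", location="Berlin")
print(expected)
```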
One vision for NLP in the future is that we just won't really need to do this sort of thing anymore: if you have text data, a "chat with your data" experience will be fully sufficient. So in this vision, mapping text to structured data is sort of obsolete.
The second vision is that LLMs step in and take over the individual predictive tasks. We won't build machine learning models in the same way we did — we'll just prompt LLMs. This vision agrees that we need to do this sort of thing, but has LLMs totally transforming the mechanism.
The third vision is for LLMs to help us build ML systems. We'll get to the same end result — a pipeline of task-specific models — but LLMs will help us build it cheaper, better and more reliably. Here, the LLM is more like a compiler, while in vision two, it's the runtime.
LLMs have transformed our ability to do generative tasks, where the model should answer with text, images or some other piece of content. But we need to do predictive tasks as well — the two are more powerful in combination. LLMs do generative tasks "natively", but they can also be co-opted to do predictive tasks: you can give them a few examples, and parse the response out as structured data.
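As a minimal sketch of what that co-opting can look like — assuming the OpenAI Python client (v1 style) and an illustrative model name and prompt, none of which are from the talk:

```python
# Few-shot sentiment classification with an LLM, parsing the free-text
# reply back into a structured label. Assumes `pip install openai` and
# an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

PROMPT = """Classify the sentiment of the text as POSITIVE or NEGATIVE.

Text: I loved the keynote. -> POSITIVE
Text: The wifi kept dropping. -> NEGATIVE
Text: {text} ->"""

def classify(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": PROMPT.format(text=text)}],
    )
    answer = response.choices[0].message.content.strip().upper()
    # Parse the reply into one of our labels, with a conservative fallback.
    return "POSITIVE" if answer.startswith("POSITIVE") else "NEGATIVE"

print(classify("The talk was great!"))
```

So how well does this perform?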
LLMs can solve some text classification problems really well, even with few or no examples. Sentiment analysis is a good example of this. GPT-3 gets basically the same accuracy as spaCy's model, with pretty much no data. However, it's a really easy task.
Here are some results from another experiment on news data. By fine-tuning a transformer model, we can get better accuracy than the LLM with just 1% of the available data — around a thousand examples, which would take one person an hour or two to annotate. The supervised approach keeps improving steadily from there, and it hasn't topped out: if we kept annotating and doubled the size of the training corpus, we'd probably get to 95%.
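For a sense of what that workflow involves, here's a hedged sketch of turning those labelled examples into spaCy's training format — the label set and file names are made up:

```python
# Convert labelled (text, category) pairs into spaCy's binary training
# format, then fine-tune with the standard CLI, e.g.:
#   python -m spacy train config.cfg --paths.train train.spacy --paths.dev dev.spacy
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
LABELS = ["POLITICS", "SPORTS", "TECH"]  # illustrative label set

examples = [
    ("New GPU benchmarks were released today.", "TECH"),
    # ... roughly a thousand examples, an hour or two of annotation
]

db = DocBin()
for text, label in examples:
    doc = nlp.make_doc(text)
    doc.cats = {l: float(l == label) for l in LABELS}  # one-hot category scores
    db.add(doc)
db.to_disk("train.spacy")
```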
Here's the current SOTA in few-shot NER, published a few weeks ago. On CoNLL 2003, Ashok and Lipton get GPT-4 to 83.5% accuracy. This is great for a prototype, but it doesn't get close to today's SOTA — or even 2003's.
LLMs and task-specific models have different advantages. Task-specific models have less background knowledge, but you can give them hundreds or thousands of examples. We can use an LLM to help us create training data — and once we have a smaller model, we send that to production.
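One hedged sketch of that division of labour — the LLM proposes labels and a human reviews them before anything is trained on; the helper function below is a stand-in, not a real API:

```python
# LLM-assisted data creation: write the model's proposed labels to a
# JSONL file for human review, and train only on accepted examples.
import json

def llm_label(text: str) -> str:
    """Stand-in for an LLM call (see the few-shot sketch above)."""
    return "POSITIVE"  # dummy value so this sketch runs end to end

raw_texts = [
    "The new release fixed all my issues.",
    "Support never answered my ticket.",
]

with open("silver.jsonl", "w") as f:
    for text in raw_texts:
        record = {"text": text, "label": llm_label(text), "accepted": None}
        f.write(json.dumps(record) + "\n")
# A reviewer flips "accepted" to true/false; the accepted records become
# training data for the smaller task-specific model we ship to production.
```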
Mapping this back to our two questions before, the idea is that we do need to do these predictive tasks — dialogue won't be all you need. And no, prompting won't be all we need either. We're going to want to build task-specific models, and LLMs can help us get there.
So, what do we need for LLM-powered NLP? Explosion's vision is a collaborative data development environment. You can get LLMs to help out with annotation on the tasks where they're good enough — or send tasks to multiple LLMs, and integrate the answers to get better accuracy. Use LLMs to help you label faster, while maintaining the human view of the data to keep the quality high enough to train from. Tune prompts, and compare them empirically. Keep a strong, human evaluation methodology even when working on subjective generative tasks.
Here's an example of the annotation interface in our tool Prodigy. The data is sent to OpenAI for initial annotation, and then you get to correct it. You can also mark examples as significant and have them incorporated into the prompt.
You can also skip the annotation step and have an LLM power an NLP component directly via our library spacy-llm. The NER component constructs a prompt, calls into a local or remote LLM, parses the entities out of the response and sets them on the spaCy Doc object.
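A minimal sketch based on the spacy-llm documentation — registry names like "spacy.NER.v2" and "spacy.GPT-3-5.v1" vary between spacy-llm versions, so treat the exact strings as assumptions; an OPENAI_API_KEY is expected in the environment:

```python
# An LLM-powered NER component inside a regular spaCy pipeline.
# Assumes `pip install spacy-llm` alongside spaCy.
import spacy

nlp = spacy.blank("en")
nlp.add_pipe(
    "llm",
    config={
        "task": {"@llm_tasks": "spacy.NER.v2", "labels": "PERSON,ORG,LOCATION"},
        "model": {"@llm_models": "spacy.GPT-3-5.v1"},
    },
)
doc = nlp("Ines Montani is a co-founder of Explosion, based in Berlin.")
# The component builds the prompt, calls the LLM, parses the reply and
# sets the results on doc.ents like any other spaCy component would.
print([(ent.text, ent.label_) for ent in doc.ents])
```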
This lets you use LLM-powered components in the context of a larger NLP pipeline. You might use a rule-based approach for lemmatization, classify with a supervised model and use an LLM for NER — and later replace the LLM component with a task-specific model.
Much of the discussion has focused on how much easier LLMs make things. Just write a prompt! This is a really compelling advantage. But we should be asking for more. We shouldn't settle for an easier way to build systems that are worse than what we were building before.
If we can define a subtask that a statistical model should perform, we shouldn't have to call into a massive general-purpose model. We shouldn't have to worry that the model changes underneath us, or returns an invalid response. We should train and deploy a task-specific model.
We shouldn't have to worry about latency spikes into the seconds, or what capacity constraints a third-party provider is suddenly under. We should be able to deploy models ourselves that are a reasonable size for the specific task we're trying to do.
We shouldn't have to worry that our data is being sent to third-party providers, who might train on it and thereby expose it to end users. We should be able to deploy the solutions ourselves, without undue expense.
Finally, we should expect to be working on systems that are valuable enough to be worth building better. LLMs should not change our appetite for better solutions. We shouldn't be happy with good enough — we should be aiming for better.
Thank you!
Explosion: https://explosion.ai
spaCy: https://spacy.io
Prodigy: https://prodi.gy
Twitter: https://twitter.com/_inesmontani
Mastodon: https://sigmoid.social/@ines