GPT-3, the most complex AI in the world: Finally (slightly) useful
A few days ago, I was invited by OpenAI to test their new super-massive natural language model. If you are new to GPT-3, it is a much-hyped model able to write human-like language given a prompt. It has caused a big buzz in the media (link, link, link), as the sheer quality of its responses makes them hard to distinguish from human writing. This has caused concerns ranging from spam to fake news, but also much excitement.
Disclaimer: The license terms of the invite forbid me from sharing specific input/output publicly. Drop me a message if you have any specific questions and we'll figure something out!
Briefly, on the mechanics of GPT-3
What GPT-3 does is amazingly simple, and understanding it makes clear both its limits and why it is so impressive.
GPT-3 is trained on a large chunk of the raw text of the internet. Fed text word by word, it predicts the next one, like repeatedly pressing the middle suggestion on a predictive keyboard; in technical terms, it is an autoregressive model. What sets GPT-3 apart is its sheer scale: it can consider thousands of previous words, with their incredibly complex interactions and meanings, to produce the most plausible next word, then the word after that, and so on.
That is all. No external databases, no scripted rules.
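The generation loop above can be sketched with a toy stand-in. Here a tiny bigram counter plays the role of GPT-3's billions of parameters: it only looks at one previous word instead of thousands, but the mechanics (predict the most plausible next word, append it, repeat) are the same. The corpus and prompt are made up for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count, for each word, which words follow it: a minimal autoregressive model."""
    words = corpus.split()
    following = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    return following

def generate(model, prompt, n_words):
    """Repeatedly predict the most plausible next word and append it."""
    words = prompt.split()
    for _ in range(n_words):
        candidates = model.get(words[-1])
        if not candidates:
            break  # never saw this word during training; stop
        words.append(candidates.most_common(1)[0][0])  # greedy next-word choice
    return " ".join(words)

model = train_bigram("the cat sat on the mat and the cat slept")
print(generate(model, "the cat", 3))
```

GPT-3 replaces the bigram table with a huge neural network conditioned on the whole context, but the outer loop is the same.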
Test 1: Zero-shot learning on hard NLP tasks, aspect-based sentiment analysis
It has been documented that GPT-3 can work as a sort of "universal API": hand-write an input and its output, then a second input, and it will often try to produce a second output following the pattern of the first input-output pair.
I created a little aspect-based sentiment analysis test: I wrote a complicated text about Implement (my employer) and myself, with multiple sentiments associated with different aspects, and then wrote:
Aspect: “Implement”. Sentiment: “Positive” Aspect: “Adam”. Sentiment:
And hit autocomplete. It is by no means perfect, but in upwards of 70% of cases it gives the right response (positive or negative). Give it a few more examples of both texts and aspects, and that figure goes higher. Another fun observation: it was eager to identify other aspects and sentiments in the text as well (all correct, but not what I was going for).
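The prompt itself is just string assembly. A minimal sketch of how such a prompt could be put together, assuming the text and aspects are placeholders (I can't share my actual input, and the completion call to the API is left out):

```python
def sentiment_prompt(text, labelled_examples, query_aspect):
    """Assemble the hand-written text, the labelled example aspects, and the
    unlabelled query aspect into one prompt for the model to complete."""
    lines = [text, ""]
    for aspect, sentiment in labelled_examples:
        lines.append(f'Aspect: "{aspect}". Sentiment: "{sentiment}"')
    # The model's job is to fill in the sentiment after the final colon.
    lines.append(f'Aspect: "{query_aspect}". Sentiment:')
    return "\n".join(lines)

prompt = sentiment_prompt(
    "Working at Implement is great, though Adam keeps missing deadlines.",  # placeholder text
    [("Implement", "Positive")],
    "Adam",
)
print(prompt)
```

The model completes the dangling `Sentiment:` line; adding more labelled pairs to `labelled_examples` is exactly the "few more examples" step that pushed accuracy higher.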
This can be useful. As a zero-shot baseline for a complex task, it is very good.
Of note, it also performed well in Danish, which tells me there is an implicit translation happening somewhere inside GPT-3: it might have learned to generalize across languages.
Test 2: Knowledge intensive interviews
With help from a couple of experts I know, I set up knowledge-intensive interviews with GPT-3 around topics such as homelessness policy in Denmark, supply chain digitalization, the interaction between democratization and foreign direct investment, as well as data modelling and data strategy. And finally, Scottish whisky distilleries!
I would write an intro header for the system, letting it know the subject and the kind of expert I expected the system to be, and then conduct an interview around the topic.
At a high level, GPT-3 produces good, believable, and even factually correct texts. Especially when asked for opinions ("what are the three most critical areas of data strategy?") it did exceedingly well (my consulting friends should start watching out!).
When pushed further, such as into the specifics of individual Danish municipalities' homelessness policies, it mostly refused to answer. This, to me, is a healthy sign; refusing is better than producing outright lies.
When push came to shove ("what is the homelessness policy on Mars?"), however, it was willing to produce complete fabrications.
Test 3: Actual valuable work
Finally, has GPT-3 been useful?
I am surprised to say: YES!
It’s not much, but in the few days I’ve had it, I’ve used it twice to produce actual valuable help.
One case was generating interview questions. I needed 8-10 questions and quickly hammered out the first two. I provided those to GPT-3 along with a headline for the interview, and it happily produced seven additional questions. Three of these were on topic and good, so I kept them, removed the rest, and hit the API once again, now with five reference questions. This pattern of using "cleaned" results from the API to get more of the content I was looking for became common.
Fundamentally, it was actually helpful at brainstorming.
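That keep-the-good, regenerate loop has a simple shape. A sketch of the pattern, with a stubbed-out `complete` function standing in for the actual API call (which I can't reproduce here) and a `keep` predicate standing in for my human judgment:

```python
def expand_list(seed_items, complete, keep, rounds=2):
    """Grow a list by feeding the kept items back as the next prompt,
    so each round's 'cleaned' examples steer the model's next batch.

    complete: prompt string -> list of suggested new items (the model).
    keep: item -> bool (the human filter).
    """
    items = list(seed_items)
    for _ in range(rounds):
        prompt = "\n".join(items)          # kept items become the new few-shot prompt
        suggestions = complete(prompt)
        items.extend(s for s in suggestions if keep(s) and s not in items)
    return items

# Stub model for illustration: "suggests" numbered variants of the last item.
def fake_complete(prompt):
    last = prompt.splitlines()[-1]
    return [f"{last} (variant {i})" for i in range(1, 4)]

questions = expand_list(
    ["What drew you to this field?", "What does a typical week look like?"],
    complete=fake_complete,
    keep=lambda q: "variant 3" not in q,  # pretend we discard the weakest suggestion
)
print(len(questions))
```

The same loop covers both use cases here: interview questions seeded with two hand-written ones, and company names seeded with a short list.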
A second case was when I needed a list of random company names. They could even be fictional; I just needed company-sounding names to train a basic algorithm. Given a short list of companies, GPT-3 happily supplied an additional 100 examples.
Final verdict
I am very impressed, but there are still key challenges. GPT-3 is good; it is even very good. It can supply reasonably correct explanations and facts on a wide variety of topics, at a high-school level. This is a major achievement, and it will take time before such a technology finds its place in our society (for a very cool example, check AI Dungeon or AI|Writer).
On the flip side, the core issue with GPT-2 is not fixed in GPT-3: it does not know what it is doing. There are no guarantees, and with insufficient context I have had it be both comically wrong (it went on a tangent about pizzas when I tried to emulate a job interview) and downright nasty (the amount of erotica this model has consumed must be truly staggering). The riskiness of these issues currently severely limits the usability of GPT-3.
If you are interested in knowing more about the technology, its potential, and its limits, don't hesitate to shoot me a message. I'm always happy to chat!