Spot the difference - how do you know if LLMs deliver?
Laszlo K Varga
Lead Researcher and Senior Consultant @ Nimdzi. Strategic thinker, practical doer, forever curious and hungry to learn.
#GenAI is so easy to access and use. Many adopt the “shut up and take my money and data” approach to get some new efficiencies from these tools.
But how do you know if #LLMs are doing a good job? Especially with open-ended NLP tasks such as summarization, which by nature may have ambiguous outcomes?
You can adopt various strategies. Here's my little case study on summarization with GenAI, with some clear takeaways.
The headline takeaway: LLMs are far from ready to replace the human element in mission-critical tasks that have no single, easily verifiable answer.
Looking forward to your success stories (and failures) in the comments!
The AI summarization experiment
Working with #DuetAI in Google Docs, I asked it to summarize a 1-page document. The first result, 3 sentences, was not bad as a brief gist. A custom prompt of “Create a 5 sentence summary” didn’t get me more detail: it returned the same 3 sentences (see the screenshot above).
So I used the 'Elaborate' function to get a longer summary. To my surprise, the elaboration was relevant and on-point, but completely made up (i.e., none of the additions were part of the original document). Duet AI didn’t ‘remember’ that the 3 sentences came from a summary it had just created; instead, it creatively (?) added new content.
Lucky for me (or, rather, by design) I used a 1-pager for the experiment, so I could easily check whether the summary and its elaboration were up to my requirements - and they weren’t, quite.
But what if this had been a 50-page document? How would I tell then whether the AI summary was accurate? How much time would I need to spend on verification?
This is where it gets interesting, as machines checking machines becomes a tempting option at scale.
I used the opportunity to compare 3 popular, free LLMs: Google’s #Bard (also based on #PaLM 2, just like Duet AI), #Anthropic’s #Claude2 (via poe.com, as it’s US- and UK-only) and #OpenAI’s #ChatGPT (the free version, with GPT-3.5 under the hood). For the experiment, I took the original document and Duet AI’s elaborated hallucination as the summary, and asked each LLM to evaluate whether the summary was accurate, missed key items, or included items not in the document (hint: the last option was the right answer).
The upshot:
Sure, with some elaborate prompt engineering - which would need to be slightly different for each model - the results could be improved. But if you’re lazy and greedy like I was for this experiment (why else would you use GenAI?), you’ll probably stop after 2-3 attempts.
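If you'd rather script this "machine checks machine" pass than paste documents into chat windows, here is roughly what a baseline version could look like. This is a minimal sketch, not the setup I used: it assumes the openai Python package and an API key in your environment, and the model name and prompt wording are illustrative.

```python
# Sketch: ask one LLM whether a summary produced elsewhere is faithful
# to the source document. Assumes the `openai` package (v1.x) and an
# OPENAI_API_KEY in the environment; prompt wording is illustrative.
from openai import OpenAI

client = OpenAI()

def check_summary(document: str, summary: str) -> str:
    prompt = (
        "Compare the SUMMARY to the DOCUMENT. State whether the summary is "
        "accurate, misses key items, or includes items not in the document. "
        "Quote any summary sentence that has no support in the document.\n\n"
        f"DOCUMENT:\n{document}\n\nSUMMARY:\n{summary}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep the judge as deterministic as possible
    )
    return response.choices[0].message.content

# verdict = check_summary(original_text, elaborated_summary)
```

Even scripted, the fundamental problem stays the same: you are trusting one model to catch another model's mistakes.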
Since I was there, I also asked the 3 LLMs to summarize the same piece of content in separate chat threads. The results:
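As an aside: if you wanted to collect the three summaries programmatically instead of via separate chat threads, a loop like the one below would do it. Again a sketch: the OpenAI-hosted model names are stand-ins, since Bard and Claude each require their own vendor SDK and credentials.

```python
# Sketch: run the same summarization prompt against several models and
# collect the outputs side by side. Model names are stand-ins; Bard and
# Claude would each need their own vendor SDK and API key.
from openai import OpenAI

client = OpenAI()
PROMPT = "Summarize the following document in 5 sentences:\n\n{doc}"

def collect_summaries(document: str, models: list[str]) -> dict[str, str]:
    summaries = {}
    for model in models:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": PROMPT.format(doc=document)}],
        )
        summaries[model] = response.choices[0].message.content
    return summaries

# summaries = collect_summaries(original_text, ["gpt-3.5-turbo", "gpt-4"])
```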
What were the key takeaways?