Spot the difference - how do you know if LLMs deliver?
The 5-sentence summary in 3 sentences.

#GenAI is so easy to access and use. Many adopt the “shut up and take my money and data” approach to get some new efficiencies from these tools.

But how do you know if #LLMs are doing a good job - especially with open-ended NLP tasks such as summarization that, by their nature, may have ambiguous outcomes?

You can adopt various strategies:

  1. the traditional 2 pairs of eyes,
  2. machine first with human check,
  3. human first with machine verification, or…
  4. why not let one machine check another?

Here goes my little case study on summarization with GenAI and some clear takeaways.

The headline takeaway: essentially, LLMs are far from replacing the human element in mission-critical tasks that have no single, easily verifiable answer.

Looking forward to your success stories (and failures) in the comments!

The AI summarization experiment

Working with #DuetAI in Google Docs, I asked it to summarize a 1-page document. The first result - 3 sentences - was not bad for a brief gist. But the custom prompt “Create a 5 sentence summary” didn’t get me any further: it produced the same 3 sentences (see the screenshot above).

So I used the 'Elaborate' function to get a longer summary. To my surprise, the elaboration was relevant and on-point, but completely made up (i.e., none of the additions were part of the original document). Duet AI didn’t ‘remember’ that the 3 sentences came from a summary it had just created, and instead creatively (?) added new content.

Lucky for me (or rather, by design), I used a 1-pager for the experiment, so I could easily check whether the summary and its elaboration were up to my requirements - and they weren’t quite there.

But what if this had been a 50-page document? How would I tell whether the AI summary was accurate? How much time would I need to spend on verification?

This is where it gets interesting, as machines checking machines becomes a tempting option at scale.

I used the opportunity to compare 3 popular, free LLMs: Google’s #Bard (also based on #PaLM 2, just like Duet AI), #Anthropic’s #Claude2 (via poe.com, as it’s available in the US and UK only) and #OpenAI’s #ChatGPT (the free version, with GPT-3.5 running under the hood). For the experiment, I took the original document and Duet AI’s elaborated hallucination as the summary, and asked each of the LLMs to evaluate whether the summary was accurate, missed key items, or included items not in the document (hint: the last option was the right answer).
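For reference, the check looked roughly like the sketch below. I ran it by hand in the three chat UIs; the code just shows the same ‘machine checks machine’ pattern wired up to one API (the OpenAI Python client). The prompt wording, the check_summary helper and the model choice are my own illustrative assumptions, not part of the original experiment.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative verification prompt - not the exact wording used in the experiment.
CHECK_PROMPT = """You will verify a summary against its source document.
Answer three questions:
1. Is the summary accurate?
2. Does it miss any key items from the document?
3. Does it include items that are NOT in the document? If so, list them.

DOCUMENT:
{document}

SUMMARY:
{summary}
"""


def check_summary(document: str, summary: str, model: str = "gpt-3.5-turbo") -> str:
    """Ask a model to verify a candidate summary against its source text."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": CHECK_PROMPT.format(document=document, summary=summary)}],
        temperature=0,  # keep the verification as repeatable as possible
    )
    return response.choices[0].message.content
```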

The upshot:

  • Bard didn't do a great job with the summarization assessment at all. It even kept insisting that Duet AI’s creative additions were indeed part of the article (they weren’t).
  • Claude2 fell short the same way as Bard on the first prompt. However, at least it corrected itself upon reflection (I mean, reflexion).
  • ChatGPT did reasonably well, pinpointing which items in the summary did not come from the article.

Sure, with some elaborate prompt engineering - which would have to be slightly different for each model - the results could be improved. But if you’re lazy and greedy like I was for the experiment (why else would you use GenAI?), you will probably stop after 2-3 attempts.

While I was at it, I also asked the 3 LLMs to summarize the same piece of content in separate chat threads (a quick sketch of the setup follows the list). The results:

  • Bard hallucinated a lot (it’s Google's tool, just like Duet AI),
  • Claude created a solid list of bullet points,
  • ChatGPT pretty much nailed it.
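For completeness, here is a minimal sketch of that head-to-head run: the same summarization prompt sent to each model in a fresh context, with the outputs collected for a human to compare. The compare_summaries helper and the ask_* callables are hypothetical placeholders - in practice I simply pasted the prompt into three separate chat threads.

```python
from typing import Callable, Dict

SUMMARIZE_PROMPT = "Create a 5 sentence summary of the following document:\n\n{document}"


def compare_summaries(document: str, models: Dict[str, Callable[[str], str]]) -> Dict[str, str]:
    """Send the same summarization prompt to each model and collect the outputs.

    `models` maps a model name to a callable that takes a prompt and returns
    that model's reply (e.g. a thin wrapper around whichever client or UI you use).
    """
    prompt = SUMMARIZE_PROMPT.format(document=document)
    return {name: ask(prompt) for name, ask in models.items()}


# Hypothetical usage - ask_chatgpt, ask_claude and ask_bard would be your own wrappers:
# results = compare_summaries(doc_text,
#                             {"ChatGPT": ask_chatgpt, "Claude 2": ask_claude, "Bard": ask_bard})
```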

What were the key takeaways?

  1. For summarization tasks with simple prompting, out-of-the-box ChatGPT did best. Claude may work fine, but Bard probably needs a lot of 'prompt engineering', and Duet AI is of limited use at this point.
  2. That said, both Bard and Duet AI were relevant and creative in their "hallucinations" - though that’s not really useful for summarization.
  3. If I hadn't read the source text beforehand, I would not have been able to evaluate the outputs. For a long document, I may end up spending more time and effort on experimenting with the prompts, and on verifying and correcting the GenAI output than creating my own summary (which has the added benefit of really getting to understand the source content).
  4. GenAI in its current state may be useful in non-mission-critical NLP tasks such as gisting, but for summarization, a verifying pair of human eyes is essential. With practice, LLMs may help you become more efficient in some tasks - at least in those that you could and would perform yourself anyway.
  5. You are still the much needed human-in-the-loop.
