Spot the difference - how do you know if LLMs deliver?
Laszlo K Varga
Lead Researcher and Senior Consultant @ Nimdzi. Strategic thinker, practical doer, forever curious and hungry to learn.
#GenAI is so easy to access and use. Many adopt the “shut up and take my money and data” approach to get some new efficiencies from these tools.
But how do you know if #LLMs are doing a good job? Especially with open-ended NLP tasks such as summarization, which by nature may have ambiguous outcomes?
You can adopt various strategies. Here's my little case study on summarization with GenAI, with some clear takeaways.
The headline takeaway: LLMs are far from ready to replace the human element in mission-critical tasks that have no single, easily verifiable answer.
Looking forward to your success stories (and failures) in the comments!
The AI summarization experiment
Working with #DuetAI in Google Docs, I asked it to summarize a 1-page document. The first result, 3 sentences, was not bad as a brief gist. A custom prompt of “Create a 5 sentence summary” didn’t get me more detail: it returned the same 3 sentences (see the screenshot above).
So I used the 'Elaborate' function to get a longer summary. To my surprise, the elaboration was relevant and on-point, but completely made up (i.e., none of the additions were part of the original document). Duet AI didn’t ‘remember’ that the 3 sentences came from a summary it had just created; instead, it creatively (?) added new content.
Lucky for me (or, rather, by design) I used a 1-pager for the experiment, so I could easily check whether the summary and its elaboration were up to my requirements - and they weren’t, quite.
But what if this had been a 50-page document? How would I tell then whether the AI summary was accurate? How much time would I need to spend on verification?
This is where it gets interesting, as machines checking machines becomes a tempting option at scale.
I used the opportunity to compare 3 popular, free LLMs: Google’s #Bard (also based on #PaLM 2, just like Duet AI), #Anthropic’s #Claude2 (via poe.com, as it’s US- and UK-only) and #OpenAI’s #ChatGPT (the free version, with GPT-3.5 under the hood). For the experiment, I took the original document and Duet AI’s elaborated hallucination as the summary, and asked each LLM to evaluate whether the summary was accurate, missed key items, or included items not in the document (hint: the last option was the right answer).
The upshot:
Sure, with some elaborate prompt engineering - which would need to be slightly different for each model - the results could be improved. But if you’re lazy and greedy like I was for this experiment (why else would you use GenAI?), you’ll probably stop after 2-3 attempts.
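If you'd rather script this "machine checks machine" pass than paste documents into chat windows, here is roughly what a baseline version could look like. This is a minimal sketch, not the setup I used: it assumes the openai Python package and an API key in your environment, and the model name and prompt wording are illustrative.

```python
# Sketch: ask one LLM whether a summary produced elsewhere is faithful
# to the source document. Assumes the `openai` package (v1.x) and an
# OPENAI_API_KEY in the environment; prompt wording is illustrative.
from openai import OpenAI

client = OpenAI()

def check_summary(document: str, summary: str) -> str:
    prompt = (
        "Compare the SUMMARY to the DOCUMENT. State whether the summary is "
        "accurate, misses key items, or includes items not in the document. "
        "Quote any summary sentence that has no support in the document.\n\n"
        f"DOCUMENT:\n{document}\n\nSUMMARY:\n{summary}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep the judge as deterministic as possible
    )
    return response.choices[0].message.content

# verdict = check_summary(original_text, elaborated_summary)
```

Even scripted, the fundamental problem stays the same: you are trusting one model to catch another model's mistakes.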
Since I was there, I also asked the 3 LLMs to summarize the same piece of content in separate chat threads. The results:
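As an aside: if you wanted to collect the three summaries programmatically instead of via separate chat threads, a loop like the one below would do it. Again a sketch: the OpenAI-hosted model names are stand-ins, since Bard and Claude each require their own vendor SDK and credentials.

```python
# Sketch: run the same summarization prompt against several models and
# collect the outputs side by side. Model names are stand-ins; Bard and
# Claude would each need their own vendor SDK and API key.
from openai import OpenAI

client = OpenAI()
PROMPT = "Summarize the following document in 5 sentences:\n\n{doc}"

def collect_summaries(document: str, models: list[str]) -> dict[str, str]:
    summaries = {}
    for model in models:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": PROMPT.format(doc=document)}],
        )
        summaries[model] = response.choices[0].message.content
    return summaries

# summaries = collect_summaries(original_text, ["gpt-3.5-turbo", "gpt-4"])
```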
What were the key takeaways?