Reasoning About Reasoning - III
1. The Setup
We’re back, and we’re now going to talk about reasoning evaluations in scenarios involving black-box language models.
A quick refresher: a model is black-box if its internal details (activations, weights, and so on) are unknown to you.
Black-box contexts are the backbone of modern AI-based applications.
And down the line, these will be the bedrock that agentic AIs rest on. As a consequence, we’re forced to deal with the matter of reasoning indirectly at the application level.
2. Tennis For Two
Compared to the framework discussed in the previous issue, we’re going to have to take a radically different approach here. We need a language model to evaluate our language model’s reasoning!
You might recall from the last issue that we strongly discouraged the use of a language model for generating synthetic reasoning data — that’s because we had access to activations and could rely on those.
We have no such luck here. We’re forced to get another language model to help us out.
Here’s a diagrammatic overview of how things’ll work out:
It’s not all too different in structure from what we discussed in the last issue. The differences are the inability to view activations and the use of an evaluator agent.
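To make that structure concrete, here’s a minimal sketch of the pipeline in Python. The endpoint URL, API key, model names, and the `chat` helper are all placeholders of our own, and we’re assuming both models sit behind an OpenAI-compatible chat-completions API; none of this is required by the approach itself.

```python
import requests

# Placeholder endpoint and credentials -- swap in whichever black-box
# provider(s) you actually use. Both models are assumed to expose an
# OpenAI-compatible /v1/chat/completions API.
API_URL = "https://api.example.com/v1/chat/completions"
API_KEY = "YOUR_KEY_HERE"

def chat(model: str, system_prompt: str, user_prompt: str) -> str:
    """Send one system + user message pair to a black-box model."""
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": model,
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt},
            ],
        },
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

# 1. The model under test produces an answer, reasoning included.
answer = chat(
    model="generator-model",
    system_prompt="You are a blog writer. Think step by step, then write the article.",
    user_prompt="Write a short article on black-box reasoning evaluation.",
)

# 2. The evaluator agent (also a black box) assesses that answer.
verdict = chat(
    model="evaluator-model",
    system_prompt=(
        "You are a strict reviewer. Judge the draft below for factual "
        "correctness and for whether its reasoning actually supports its "
        "claims. Reply with PASS or FAIL plus a one-paragraph justification."
    ),
    user_prompt=answer,
)
print(verdict)
```

The point of the sketch is the shape of the thing: two opaque endpoints, with all of the evaluation logic living in the prompts rather than in anything we can inspect.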
3. Enforcers
The evaluator agent will be responsible for assessing reasoning, and the key assumption we’re making is that this model, too, is a black box.
Otherwise, unless you happen to have a high-end model of your own, you’d be asking a low-resource local model to assess a far more capable external one, which is unwise.
Our modus operandi will be carefully constructed system prompts fed to the evaluator. Here are the essential steps involved:
In addition, we can do something nifty and set up a feedback loop: this lets your evaluator repeatedly re-request an answer from the language model until it’s satisfied.
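Here’s a rough sketch of what that loop could look like, reusing the hypothetical `chat` helper and placeholder model names from the earlier snippet. The round cap and the PASS/FAIL convention are our own choices for the example, not part of any standard.

```python
MAX_ROUNDS = 3  # arbitrary cap so the loop can't run forever

def generate_until_satisfied(task: str) -> str:
    """Re-request the generator until the evaluator is satisfied (or we give up)."""
    feedback = ""
    draft = ""
    for _ in range(MAX_ROUNDS):
        # Fold the evaluator's latest critique into the next request.
        prompt = task if not feedback else f"{task}\n\nReviewer feedback to address:\n{feedback}"
        draft = chat(
            model="generator-model",
            system_prompt="You are a blog writer. Think step by step, then write the article.",
            user_prompt=prompt,
        )
        verdict = chat(
            model="evaluator-model",
            system_prompt=(
                "You are a strict reviewer. Start your reply with PASS or FAIL, "
                "then explain what (if anything) must be fixed."
            ),
            user_prompt=draft,
        )
        if verdict.strip().upper().startswith("PASS"):
            return draft
        feedback = verdict  # FAIL: carry the critique into the next round
    return draft  # best effort after MAX_ROUNDS attempts
```

The cap matters: without it, a fussy evaluator and a stubborn generator can chase each other indefinitely, and every round costs you tokens.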
Of course, we can’t do much without having a purpose in mind. Why do we care about reasoning at all?
Let’s suppose we had an LLM write up blog articles for us (spoiler alert: ours aren’t). We’d care not just about the correctness of the article but also about whether it clicks with our readerbase.
To that end, we might want to see a couple of jokes and a tone in line with our blog’s general vibe.
To explain all of that to an evaluator model, we’ll need to give it a system prompt. In essence, a system prompt is akin to roleplaying like you would on RuneScape.
You write up a long, well-structured prompt that reads out like a character backstory and hand it over to your evaluator.
The key thing is to be highly specific about every single facet involved in the process — you don’t want your evaluator being forced to guess when faced with the unknown.
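To make “highly specific” concrete, here’s an illustrative evaluator system prompt for the blog scenario above, written as a Python constant so it could slot straight into the earlier sketches. Every detail in it (the persona, the criteria, the output format) is made up for the example, not a recommended template.

```python
# Illustrative only -- the persona, criteria, and output format below are
# invented for this example, not a recommended template.
EVALUATOR_SYSTEM_PROMPT = """
You are the senior editor of a casual, slightly irreverent tech blog.

Your job: review a draft article and decide whether it should be published.

Evaluate the draft against ALL of the following, in order:
1. Correctness: every technical claim must be accurate; flag anything dubious.
2. Reasoning: conclusions must actually follow from the arguments given.
3. Tone: conversational, first-person plural, no corporate phrasing.
4. Humour: at least two light jokes or asides, none at a reader's expense.
5. Length: between 800 and 1,200 words.

Output format (exactly):
VERDICT: PASS or FAIL
ISSUES: a numbered list of every criterion violated, or "none"
FEEDBACK: concrete, actionable fixes for the writer
"""
```

Notice there’s nothing left to guess at: the evaluator knows who it is, what it’s judging, what counts as a violation, and exactly how to report back.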
There’s more to this system prompt than purpose, though; we’ll also need to talk about the evaluation criteria that get outlined in it.
Interested in reading the next 5 sections? Please visit the beehiiv version.