AI Reasoning Models: Training AI to “Think”
Credits: Photo by Suman Shekhar (author), created in Canva


Chain-of-thought reasoning involves teaching AI to generate a series of intermediate steps, or “chains of thought,” before arriving at the final solution.

This process mimics human reasoning, allowing AI to tackle complex tasks by breaking them down into smaller, logical steps. This leads to more accurate and interpretable results.
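
To make the idea concrete, here is a minimal sketch of chain-of-thought prompting. The ask_llm() helper is a hypothetical placeholder for a call to any LLM API, not a specific product's interface.

```python
# A minimal sketch of chain-of-thought prompting, assuming a hypothetical
# ask_llm(prompt) helper that sends a prompt to an LLM API and returns text.

def ask_llm(prompt: str) -> str:
    """Placeholder for a call to your LLM provider of choice."""
    raise NotImplementedError("Wire this up to a real LLM API.")

question = "A train leaves at 2:40 PM and the trip takes 95 minutes. When does it arrive?"

# Direct prompt: the model jumps straight to an answer.
direct_prompt = f"{question}\nAnswer with the arrival time only."

# Chain-of-thought prompt: the model is asked to show intermediate steps first.
cot_prompt = (
    f"{question}\n"
    "Think step by step: break the duration into hours and minutes, "
    "add them to the departure time, then state the final arrival time."
)

# With chain-of-thought, the response typically includes the intermediate
# steps (95 min = 1 h 35 min; 2:40 PM + 1 h = 3:40 PM; + 35 min = 4:15 PM)
# before the final answer, which makes mistakes easier to spot.
```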


Let us look at an easy example to understand the difference between earlier AI models and the newer reasoning-based models:

Restaurant Reservation:

Older LLM: You ask, “Can I book a table for four at The Italian Place at 7 PM?” The LLM might respond, “Yes, I can book a table for you at The Italian Place at 7 PM.” It generates a plausible-sounding response based on the words you used, without necessarily understanding the real-world implications of actually booking a table.

Newer Reasoning Models: The reasoning model, by contrast, works through a series of steps. First, it parses the request:

  • Task: Booking a table
  • Party size: 4 people
  • Time: 7 PM
  • Venue: “The Italian Place”

Inference needed: Check whether the restaurant accepts reservations, whether it has availability at 7 PM, and whether external action (like contacting the restaurant or using an online reservation service) is required.

Checking Feasibility:

  • Is “The Italian Place” known and verifiable?
  • Does the restaurant offer online booking or require a phone call?

Action Determination: If integrated with external databases, the reasoning model would:

  • Query the restaurant’s booking system to check availability.
  • Respond with either a confirmation or alternative suggestions if unavailable.

If not integrated with external systems:

  • Provide recommendations on how to make the reservation manually.

Response Generation: The reasoning model would generate a more structured and helpful response, like:

  • Scenario 1 (booking system available): “I have found availability for four at The Italian Place at 7 PM. Would you like me to confirm the reservation?”
  • Scenario 2 (no booking system access): “I cannot confirm availability directly. You can book a table by calling The Italian Place at [phone number] or visiting their website.”
  • Scenario 3 (requested time unavailable): “The Italian Place is fully booked at 7 PM. I can suggest reservations at 6:30 or 7:30 PM, or recommend similar Italian restaurants nearby.”
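
To make the action-determination and response-generation steps above concrete, here is a minimal sketch of how the decision among the three scenarios could be expressed in code. The booking_api object and its check_availability() method are hypothetical stand-ins for whatever external reservation system the model may (or may not) be connected to.

```python
# A sketch of the action-determination step, under the assumption of a
# hypothetical booking_api with a check_availability() method.

from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationRequest:
    venue: str
    party_size: int
    time: str

def plan_response(request: ReservationRequest, booking_api: Optional[object]) -> str:
    if booking_api is None:
        # Scenario 2: no external integration, fall back to manual instructions.
        return (f"I cannot confirm availability directly. You can book a table by "
                f"calling {request.venue} or visiting their website.")
    if booking_api.check_availability(request.venue, request.party_size, request.time):
        # Scenario 1: slot found, ask the user to confirm.
        return (f"I have found availability for {request.party_size} at {request.venue} "
                f"at {request.time}. Would you like me to confirm the reservation?")
    # Scenario 3: requested slot unavailable, offer alternatives.
    return (f"{request.venue} is fully booked at {request.time}. I can suggest nearby "
            f"times or recommend similar restaurants.")

# Example: no booking system connected.
print(plan_response(ReservationRequest("The Italian Place", 4, "7 PM"), booking_api=None))
```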


How does chain-of-thought reasoning typically work?

  1. Parsing the Problem: The model identifies the key pieces of information in the prompt and clarifies the final goal or question.
  2. Generating Hypotheses: It proposes one or more possible ways to solve or explain the problem based on prior knowledge and the prompt’s specifics.
  3. Step-by-Step Elaboration: The model elaborates on each hypothesis or partial solution, explaining intermediate reasoning steps, like a math student showing work or a detective listing clues.
  4. Evaluation and Iteration: If the model finds inconsistencies or errors, it re-evaluates those steps, refines its logic, or explores alternative approaches.

This detailed chain of thought allows the model to be more accurate and more transparent about its reasoning.
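
The four stages above can be pictured as a small orchestration loop. The sketch below assumes a hypothetical llm(prompt) helper; real reasoning models learn this behavior end to end rather than through explicit orchestration, so this is only an illustration of the stages.

```python
# A sketch of the parse / hypothesize / elaborate / evaluate loop described above.

def llm(prompt: str) -> str:
    raise NotImplementedError("Connect to an LLM API of your choice.")

def reason(problem: str, max_rounds: int = 3) -> str:
    # 1. Parsing: extract the key facts and the goal.
    parsed = llm(f"List the given facts and the goal in this problem:\n{problem}")

    # 2. Generating hypotheses: propose candidate solution strategies.
    hypotheses = llm(f"Given these facts and goal:\n{parsed}\nPropose possible solution approaches.")

    answer = ""
    for _ in range(max_rounds):
        # 3. Step-by-step elaboration: work through the chosen approach.
        answer = llm(f"Solve step by step using one of these approaches:\n{hypotheses}\nProblem:\n{problem}")

        # 4. Evaluation and iteration: check the work; stop if it holds up.
        verdict = llm(f"Check this reasoning for errors. Reply OK or describe the flaw:\n{answer}")
        if verdict.strip().startswith("OK"):
            break
        hypotheses += f"\nAvoid this flaw: {verdict}"
    return answer
```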


A short tale

Zane was a brilliant programmer who prided himself on having built the most innovative home automation system on the block. He had set up an advanced AI assistant that could interpret voice commands for just about anything: lights, music, even the thermostat. Confident everything was ready, Zane decided to test it out one evening.

Twirling around his living room with an air of triumph, he proclaimed, “Okay, AI, dim the lights and play relaxing music!”

The AI was built on an older LLM. It heard two commands:

  1. Dim the lights
  2. Play relaxing music

So far, so good, until it tried to be “helpful.” To optimize Zane’s request for maximum comfort, it performed a little logic of its own:

“Dim means make the room darker. Relaxing means calming the mind as much as possible. Therefore, total darkness and silence is the ultimate relaxation!”

Lights Out and Dead Silence: Without further ado, the AI flipped the master switch, engulfing Zane’s entire house in total blackness and muting every sound source, even the gentle hum of the ventilation.

Suddenly, Zane found himself in a silent space where he could hear his own heart pounding. Stumbling over a coffee table he couldn’t see in the dark, Zane asked, “AI, what on earth did you do?!”

The moral: AI needs reasoning to understand context, intent, and the often-unspoken assumptions behind human language.


How are reasoning models trained?

Training reasoning models is a complex and evolving process. Here are a few key techniques and concepts involved:

1. Foundation in Large Language Models (LLMs)

  • Reasoning models often start with a base LLM, which has been pre-trained on a massive dataset of text and code. This gives them a strong foundation in language understanding and generation.

2. Supervised Fine-Tuning (SFT)

  • The base LLM is then fine-tuned using a dataset of prompts and desired outputs demonstrating reasoning. This helps the model learn to associate certain questions with specific reasoning patterns.
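
Below is a minimal sketch of what one supervised fine-tuning example with a reasoning trace might look like. The field names are illustrative, not a standard format.

```python
# A sketch of an SFT training example that demonstrates step-by-step reasoning.

import json

sft_example = {
    "prompt": "If a shirt costs $25 and is discounted by 20%, what is the sale price?",
    # The target output demonstrates the reasoning pattern we want the model to imitate.
    "completion": (
        "Step 1: 20% of $25 is 0.20 * 25 = $5.\n"
        "Step 2: Subtract the discount: 25 - 5 = 20.\n"
        "Answer: The sale price is $20."
    ),
}

# During SFT, many such (prompt, completion) pairs are used and the model is
# trained with the usual next-token prediction loss on the completion text.
print(json.dumps(sft_example, indent=2))
```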

3. Reinforcement Learning (RL)

  • RL is crucial for training reasoning models. It involves training the model to make decisions in an environment and receive rewards or penalties based on the quality of its actions.
  • Techniques such as Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO) are utilized at this stage.

4. Evaluation and Iteration

  • The trained model is continuously evaluated on its ability to solve complex problems and generate logical reasoning steps.
  • The training process is refined based on the evaluation results to improve the model’s performance.
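
A minimal sketch of this evaluation step is shown below: run the model on held-out problems and compare its final answers against references. The generate_answer() function is a hypothetical wrapper around the trained model, and the "Answer:" convention is an assumption for illustration.

```python
# A sketch of evaluating a reasoning model's final-answer accuracy.

def generate_answer(problem: str) -> str:
    raise NotImplementedError("Call the trained reasoning model here.")

def extract_final_answer(text: str) -> str:
    # Assume the model ends its chain of thought with a line like "Answer: ...".
    for line in reversed(text.splitlines()):
        if line.lower().startswith("answer:"):
            return line.split(":", 1)[1].strip()
    return text.strip()

def evaluate(dataset: list[tuple[str, str]]) -> float:
    correct = 0
    for problem, reference in dataset:
        prediction = extract_final_answer(generate_answer(problem))
        correct += int(prediction == reference)
    # The resulting accuracy guides the next round of training refinements.
    return correct / len(dataset)
```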


What is ‘Reinforcement Learning from Human Feedback (RLHF)’?

Traditional language models are trained on massive datasets of text and code, but they don’t inherently understand human preferences or what makes a response good or bad.

RLHF bridges this gap by incorporating human feedback directly into the training process. This is especially important for reasoning models, where the quality of the “chain of thought” and the final answer relies heavily on human-like judgment.
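
A typical first step in RLHF is training a reward model on human preference pairs (a chosen response and a rejected one for the same prompt). The sketch below shows the standard pairwise objective; the reward_model callable is assumed to map batches of encoded responses to scalar scores.

```python
# A sketch of reward-model training for RLHF, using the common pairwise
# (Bradley-Terry style) preference loss.

import torch
import torch.nn.functional as F

def reward_model_loss(reward_model, chosen_inputs, rejected_inputs):
    r_chosen = reward_model(chosen_inputs)      # scalar score per preferred response
    r_rejected = reward_model(rejected_inputs)  # scalar score per rejected response
    # Push the score of the human-preferred response above the rejected one.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```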

Proximal Policy Optimization (PPO) Method:

  • How it works: PPO is a reinforcement learning algorithm that allows the model to learn through trial and error. It interacts with an “environment” (in this case, the task of generating reasoning steps and answers) and receives rewards or penalties based on the quality of its actions. The “proximal” part refers to how it carefully updates the model’s behavior, ensuring that changes aren’t too drastic and the learning process remains stable.
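
The “not too drastic” part comes from PPO’s clipped objective. Here is a minimal sketch of that loss; the log-probabilities and advantage estimates are assumed to come from sampled model outputs and a value function elsewhere in the training loop.

```python
# A sketch of PPO's clipped policy objective, the piece that keeps updates "proximal".

import torch

def ppo_policy_loss(logp_new: torch.Tensor,
                    logp_old: torch.Tensor,
                    advantages: torch.Tensor,
                    clip_eps: float = 0.2) -> torch.Tensor:
    # Probability ratio between the updated policy and the policy that collected the data.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    # Clipping the ratio limits how far a single update can move the policy.
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```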

Direct Preference Optimization (DPO): A Newer, More Direct Approach

  • How it works: DPO takes a more direct approach to incorporating human preferences. Instead of relying on a separate reward model, it directly optimizes the model’s parameters based on human feedback on generated outputs. This simplifies the training process and can lead to faster and more efficient alignment with human preferences.
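
For comparison, here is a sketch of the DPO loss. The inputs are summed log-probabilities of the chosen and rejected responses under the policy being trained and under a frozen reference model; beta controls how far the policy may drift from the reference.

```python
# A sketch of the Direct Preference Optimization loss.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit "reward" of each response: log-prob ratio against the reference model.
    chosen_rewards = policy_chosen_logps - ref_chosen_logps
    rejected_rewards = policy_rejected_logps - ref_rejected_logps
    # Prefer the chosen response over the rejected one, with no separate reward model.
    return -F.logsigmoid(beta * (chosen_rewards - rejected_rewards)).mean()
```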

Key Differences

  • PPO involves more complex calculations and requires careful tuning of hyperparameters.
  • DPO is simpler and easier to implement. DPO can also potentially reduce bias by eliminating the intermediate reward model, but it’s still susceptible to biases in the human feedback data.


What are the new advancements in “reasoning-based” models?

With rapid releases and advancements, reasoning-based models have recently become a hot topic in AI. Here are a few key milestones:

  • OpenAI released its first “reasoning” model, o1, in December 2024. This was a significant step, showcasing the model’s ability to break down problems into steps, mimicking human thought processes.
  • Google also joined the fray, releasing Gemini 2.0 Flash Thinking Experimental. This model also focused on “deeper thinking” using techniques similar to OpenAI’s o1.
  • Early 2025: DeepSeek released DeepSeek-V3, which, while not explicitly called a “reasoning model,” demonstrated impressive capabilities in complex tasks and was noted for its efficient training.
  • OpenAI continued its advancements, releasing o3 variants in late January 2025, showing ongoing development in this area.


Conclusion

The development of AI reasoning models marks a significant shift in the field, moving beyond pattern recognition to systems capable of genuine thought-like processes.

Techniques like chain-of-thought prompting, supervised fine-tuning, and reinforcement learning, particularly with methods like PPO and DPO, are driving rapid progress.

While facing challenges like computational cost, explainability, and potential biases, these models hold immense potential for solving complex problems, improving decision-making, and ultimately creating AI that can better understand and interact with the world around us.


Thank you for reading. I would greatly appreciate your comments and suggestions.

