Demystifying Reasoning Models
Aymen LABIDI
Architect & Engineering Manager @Inetum | AI Engineer | Startup Founder (Hiring!)
Introduction
An LLM reasoning model is a specialized architecture that enables large language models (LLMs) like OpenAI's o1 or DeepSeek's R1 to perform structured, multi-step reasoning.
Unlike traditional LLMs that generate responses in a single pass, thinking models break down complex tasks by generating thought tokens across iterative steps—planning, acting, observing, and refining—to mimic systematic human problem-solving.
Reasoning models are pre-trained on text containing human-written thoughts, so these patterns of reasoning are encoded into the model.
Reasoning models introduce explicit reasoning loops, tool integration, and stateful memory to improve accuracy and adaptability. These frameworks bridge the gap between raw generative potential and goal-oriented execution, enabling LLMs to tackle tasks like coding, advanced math, or strategic planning with human-like deliberation.
How are reasoning models trained?
One of the techniques used is Thought Preference Optimization.
The process starts by prompting the LLM to generate thoughts before its response. After sampling several candidate outputs, only the response parts are fed to a judge (which can be another model, or a tool such as code execution), which determines the best and worst ones. The corresponding full outputs—thoughts plus response—are then used as chosen/rejected pairs for DPO optimization. Multiple iterations of this training are performed.
This technique may cause the evaluated model to converge towards the judge model, which can lead to a lack of diversity in the model and potentially result in overfitting to the judge model’s biases.
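As a rough sketch of that loop (the `model` sampler and the `judge` scorer below are stand-in assumptions, not a real API), one TPO round could look like:

```python
def tpo_iteration(model, judge, prompts, n_samples=4):
    """One round of Thought Preference Optimization (sketch).

    For each prompt: sample several thought+response outputs, let the
    judge score only the response part, and keep the best/worst full
    outputs as a chosen/rejected pair for DPO training.
    """
    dpo_pairs = []
    for prompt in prompts:
        # Each candidate is a dict like {"thought": ..., "response": ...}.
        candidates = [model(prompt) for _ in range(n_samples)]
        # The judge never sees the hidden thoughts, only the responses.
        scored = sorted(candidates, key=lambda c: judge(prompt, c["response"]))
        rejected, chosen = scored[0], scored[-1]
        dpo_pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return dpo_pairs
```

The key design point is that preference is judged on the response alone, while the full output (including the thoughts that produced it) is what gets reinforced or penalized.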
How do they actually work (Reasoning Loops)?
Reasoning models operate in loops rather than linear sequences.
When the user runs a prompt, a prepended thought prompt instructs the model to generate reasoning (or “thought”) tokens. These thoughts represent intermediate steps or considerations made by the model while solving a problem or approaching a task. The LLM processes these thoughts to generate a response.
The response is then evaluated against the task’s goal or expected outcome. If the response doesn’t align with the task goal, the reasoning model iterates. During each iteration, the model generates new thoughts (or refines the previous ones) and re-runs the reasoning process. This continues until the generated response satisfies the task’s goal or requirement.
Here’s an example of how this iterative reasoning process unfolds:
Plan: Break a goal into subtasks (e.g., “Solve this equation step-by-step”).
Act: Use tools (code execution, web search) or internal reasoning to address each subtask.
Observe: Evaluate results for errors or inconsistencies.
Refine: Adjust the approach and repeat until the goal is met.
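The four steps above can be sketched as a generic loop, assuming hypothetical `plan`/`act`/`observe`/`refine` callbacks that stand in for model or tool calls:

```python
def reasoning_loop(task, plan, act, observe, refine, max_iters=5):
    """Sketch of the plan-act-observe-refine cycle (callbacks are assumptions).

    `plan` splits the task into subtasks, `act` attempts a result,
    `observe` returns a list of issues found, and `refine` updates
    the subtasks in response to those issues.
    """
    subtasks = plan(task)
    result = None
    for _ in range(max_iters):
        result = act(subtasks)
        issues = observe(result)
        if not issues:            # goal satisfied: stop iterating
            return result
        subtasks = refine(subtasks, issues)
    return result                 # best effort after max_iters
```

In a real system the `max_iters` cap (or a token budget) is what keeps the loop from running indefinitely on tasks the model cannot satisfy.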
OpenAI’s o1 model, for instance, reportedly explores multiple reasoning paths in parallel in the spirit of a Tree of Thought (ToT) search, pruning incorrect branches and retaining viable solutions.
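Assuming such a search behaves like a beam search over candidate thoughts, a minimal sketch (with hypothetical `expand` and `score` callbacks) might look like:

```python
def tree_of_thought(root, expand, score, beam_width=2, depth=3):
    """Breadth-first Tree-of-Thought-style search (sketch).

    `expand(state)` proposes candidate next thoughts; `score(state)`
    rates how promising a partial reasoning path is. At each level
    only the `beam_width` best branches survive; the rest are pruned.
    """
    frontier = [root]
    for _ in range(depth):
        candidates = [child for state in frontier for child in expand(state)]
        if not candidates:
            break
        # Prune: retain only the most promising branches.
        frontier = sorted(candidates, key=score, reverse=True)[:beam_width]
    return max(frontier, key=score)
```

The pruning step is what distinguishes this from exhaustively chaining thoughts: bad branches are cut early instead of being followed to completion.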
Demo
Standard model: When asked to solve “2x + 4 = 12,” it may directly output “x = 4” without showing steps.
Reasoning Model: Breaks the problem down step by step:
<think>
To solve the equation 2x+4=12, I'll start by isolating the term with the variable.
First, I'll subtract 4 from both sides of the equation to eliminate the constant term on the left side.
This gives me 2x=8.
Next, I'll divide both sides by 2 to solve for x.
Finally, I find that x=4.
</think>
And when we ask it to solve a slightly more complex equation (6x + 4 = 12), it generates many more thinking tokens.
The tokens (words) inside the <think>...</think> block are the internal thoughts of the model and are not considered part of the response.
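A small helper (an illustrative sketch, assuming the single <think> block convention DeepSeek R1 uses) can separate those hidden thoughts from the user-facing answer:

```python
import re

# Non-greedy match across newlines, so each block is captured separately.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)

def split_thoughts(raw_output):
    """Separate hidden reasoning from the final answer.

    Returns (thoughts, answer): the stripped contents of each
    <think>...</think> block, and the output with those blocks removed.
    """
    thoughts = [m[len("<think>"):-len("</think>")].strip()
                for m in THINK_BLOCK.findall(raw_output)]
    answer = THINK_BLOCK.sub("", raw_output).strip()
    return thoughts, answer
```

This is essentially what chat UIs like OpenWebUI do when they render the reasoning in a collapsible panel separate from the answer.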
Here’s a demo running DeepSeek R1 locally with Ollama and OpenWebUI.
Tool Integration
These reasoning models enhance raw text generation by incorporating external tools, enabling more sophisticated and context-aware outputs.
Code execution: Build and execute code within specified runtimes.
Web search: Crawl data on behalf of the user, incorporating relevant information into the context to generate more precise and grounded responses.
Memory: Retain context across interactions, such as tracking variables or states in multi-step problems.
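Putting these together, a minimal tool-dispatch loop might look like this (the `llm_step` protocol and the tool names are illustrative assumptions, not a real API):

```python
def run_with_tools(llm_step, tools, query, max_steps=5):
    """Minimal tool-use loop (sketch).

    `llm_step(history)` returns either {"tool": name, "args": ...}
    when the model wants a tool, or {"answer": text} when it is done.
    `tools` maps a tool name (e.g. "code", "search") to a callable.
    The growing `history` list is the stateful memory between steps.
    """
    history = [{"role": "user", "content": query}]
    for _ in range(max_steps):
        step = llm_step(history)
        if "answer" in step:
            return step["answer"]
        # Execute the requested tool and feed the observation back in.
        result = tools[step["tool"]](step["args"])
        history.append({"role": "tool", "name": step["tool"], "content": result})
    return None  # gave up after max_steps
```

Each tool result is appended to the conversation state, which is what lets a later reasoning step ground its answer in the tool's output.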
Difference between a thinking model and traditional LLMs
A standard LLM generates content in a single forward pass, whereas a reasoning model iterates until it is satisfied with the result, requiring state management between iterations. This process reduces hallucinations by letting the model self-correct its response.
In this process, tools can be involved while generating the response, with the drawback of adding more delay to the request.
Conclusion
LLM reasoning models like OpenAI’s o1 and DeepSeek’s R1 represent a paradigm shift—from generating text to engineering thought. However, challenges such as latency, unpredictability, and tool dependency arise.
Latency is a concern as these models are slower due to iterative loops, while unpredictable outcomes may occur in complex tasks, leading to inconsistent reasoning paths.
Performance also heavily depends on the quality of integration, such as code execution environments. These models trade speed for precision, making them ideal for technical domains like coding, data analysis, or research.
Their effectiveness relies on carefully designed prompts, robust tooling, and controlled environments.
As they evolve, LLMs will blur the line between human and machine reasoning, but mastery will require understanding their “cognitive” architecture—how they process information and make decisions.