Large Reasoning Models (I) - a technical overview


The AI industry is no stranger to hype cycles, but the recent wave surrounding OpenAI’s o3 model has escalated to unprecedented levels.


"AI will take your job" has infected headlines and spread across entire professions. I'm not sure about the fear factor, but uncertainty is definitely present.


Amid this hysteria, it's worth shifting the focus to the real advancements and limitations of today's frontier AI.


For example, o3’s much-discussed 87.5% accuracy on the ARC-AGI benchmark is impressive, but the staggering costs and inefficiencies it incurs, like spending $5,000 per task, paint a more complex picture. This kind of raw performance is, to put it mildly, far from practical for widespread adoption.


So, let's cut through the noise and discuss the actual outcomes of building these Large Reasoning Models (LRMs) and what they mean for the industry.


Let's start with a technical overview.


Note: It's quite a deep dive, so if you're new to LLMs, we explain what they are here and here. You might want to read it first and come back equipped with new knowledge.


The reality of new AI models

The announcement of the o3 model sent shockwaves through the tech community. Social media influencers and tech commentators tend to swing between proclaiming a revolution and predicting dystopia, with little balance in between.


What we actually see is genuine but incremental progress, of the kind AI has made many times before.


While o3 undoubtedly nails benchmark after benchmark, insiders report that each task it solved averaged around 57 million tokens, equating to roughly $5,000 per problem.


Groundbreaking, maybe, but definitely prohibitively expensive for widespread deployment.


Pareto on steroids

Let's consider the analogy of two children taking a test. One completes it in 20 minutes with an 80% score, while the other takes two months to achieve 90%. Which child is truly more intelligent?


To answer that question, we must consider efficiency, not just raw accuracy.


In the case of o3, its high performance stems from leveraging immense computing power, which is a fancy way of saying it relies on brute force. The scalability and practicality of such systems in everyday use cases are questionable at best.


As top AI researcher Miles Cranmer notes, models like o3 often exacerbate user experience issues by doubling down on mistakes, further complicating their integration into real-world workflows.


Traditional LLMs were evaluated primarily on accuracy. Large Reasoning Models have to be evaluated on their 'intelligence efficiency': a metric that assesses how effectively the model uses computational resources to produce high-quality outputs. We call this new criterion Bits per Byte (BpB).


In simple terms, it measures the amount of meaningful information generated per token.


If we take BpB into account, the sheer volume of tokens o3 needs to solve complex tasks diminishes its efficiency, making it far less practical despite its impressive accuracy.
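To make the intuition concrete, here's a toy, hypothetical calculation of "useful output per unit of compute." The numbers below are made up, and the BpB used in language-modeling literature is formally derived from cross-entropy loss; this simplified ratio only illustrates the idea of weighing accuracy against the tokens spent to reach it.

```python
def intelligence_efficiency(score: float, tokens_used: int) -> float:
    """Toy 'useful output per token' ratio: benchmark score divided by
    the number of tokens the model spent to reach it."""
    return score / tokens_used

# Made-up numbers in the spirit of the o3 discussion:
# a model that scores 87.5% but burns ~57M tokens per task vs.
# a model that scores 75% with ~50k tokens per task.
brute_force = intelligence_efficiency(score=0.875, tokens_used=57_000_000)
lean_model = intelligence_efficiency(score=0.75, tokens_used=50_000)
print(f"brute force: {brute_force:.2e} score/token")
print(f"lean model:  {lean_model:.2e} score/token")
```

The brute-force model wins on raw accuracy but loses by roughly three orders of magnitude on this efficiency view, which is exactly the trade-off the BpB framing is meant to expose.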


No doubt that future advancements in AI must focus on improving this metric to ensure that new models are not only powerful but also cost-effective and scalable.


From LLMs to LRMs

First and foremost, the limitations of LLMs and the current transformer architecture have become evident. We've nearly depleted the internet's training data and built enormous data centers, and that has only taken us so far.


Simply increasing model size no longer delivers proportional gains in performance.


Another critical drawback is the static nature of LLMs’ knowledge. Once pre-trained, these models cannot adapt to new information or learn from their environment. Sure, you can use RAG, but it's good only for certain tasks. We're talking about situations requiring real-time reasoning or dynamic adaptation (ongoing learning).


What exactly are LRMs

I know the intro might have been a bit long, but it's important to review the news calmly and with a (large) grain of salt.


So, LRMs are a new paradigm that emphasizes inference-time compute. What we (humans, AI folks, you name it) figured out is that we want to shift from so-called System 1 thinking to System 2, following the late Daniel Kahneman's framework for describing human cognitive processing.


System 1 is making decisions on the spot, immediately, intuitively, with little effort.


System 2 is where deliberate reasoning and problem-solving come into play.


LRMs are designed to address tasks that require deeper reasoning, such as advanced mathematics, coding, and logical deduction. They achieve this by employing various strategies, including generating multiple solution paths, self-correcting errors, and leveraging search algorithms to explore many possibilities.


In short, LRMs aren't just models; they're systems.


They're composed of interconnected components, usually:

  1. Generators - LLMs that produce “thought” tokens or potential solutions.
  2. Verifiers - Reward models that evaluate and score the generated thoughts for accuracy and coherence. These may also involve symbolic engines or domain-specific tools, such as math solvers or code interpreters.
  3. Search algorithms - Advanced LRMs integrate search mechanisms to iteratively refine their reasoning paths in real-time, significantly improving their ability to handle complex tasks.
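As a rough sketch of how the generator and verifier fit together, here's a minimal best-of-N loop in Python. The `generate_candidates` and `verify` callables are hypothetical stand-ins for the generator LLM and the reward model; real LRMs layer search algorithms on top of this basic pattern.

```python
import random
from typing import Callable, List, Tuple

def best_of_n(
    problem: str,
    generate_candidates: Callable[[str, int], List[str]],  # generator LLM (assumed)
    verify: Callable[[str, str], float],                    # verifier / reward model (assumed)
    n: int = 8,
) -> Tuple[str, float]:
    """Minimal generator + verifier loop: sample n candidate solutions,
    score each with the verifier, and return the highest-scoring one."""
    candidates = generate_candidates(problem, n)
    scored = [(c, verify(problem, c)) for c in candidates]
    return max(scored, key=lambda pair: pair[1])

# Toy stand-ins so the sketch runs end to end:
if __name__ == "__main__":
    gen = lambda p, n: [f"candidate {i} for: {p}" for i in range(n)]
    ver = lambda p, c: random.random()  # a real verifier would check correctness
    answer, score = best_of_n("What is 17 * 24?", gen, ver)
    print(answer, score)
```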


LRMs also differ from traditional LLMs in terms of their size and operational approach. They're often built on much smaller base models: instead of relying on sheer parameter count, they leverage inference-time compute to achieve comparable or superior results.


In short, LRMs can outperform larger models in specific tasks by thinking longer or attempting multiple solutions.


Building an LRM

Here's an overview of some key techniques:


Tree of Thoughts

Instead of generating a single output, word after word at a steady pace, LRMs explore multiple paths of potential solutions.


Additionally, these paths are divided into separate steps to make the search more efficient. Each step is judged by how likely it is to lead to the correct answer, and based on that judgment the path is either continued or terminated at the right moment.


When the process is drawn as a diagram, it looks like the branches of a tree, hence the name.

This approach mirrors how humans deliberate complex problems, considering various strategies and their consequences before making a decision.
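Here's a compact, hypothetical sketch of that branch-and-prune idea: each partial "thought" is scored, weak branches are dropped, and promising ones are expanded further. The `expand_thought` and `score_thought` callables stand in for LLM calls and a value/reward model; this illustrates the mechanism, not any particular lab's implementation.

```python
from typing import Callable, List

def tree_of_thoughts(
    problem: str,
    expand_thought: Callable[[str], List[str]],  # LLM proposes next partial steps (assumed)
    score_thought: Callable[[str], float],       # value model rates a partial path (assumed)
    beam_width: int = 3,
    depth: int = 4,
) -> str:
    """Breadth-first Tree-of-Thoughts sketch: keep the top-k partial
    reasoning paths at each depth and expand only those."""
    frontier = [problem]
    for _ in range(depth):
        # Expand every surviving path into several candidate continuations.
        children = [child for path in frontier for child in expand_thought(path)]
        if not children:
            break
        # Prune: keep only the most promising branches (the 'beam').
        children.sort(key=score_thought, reverse=True)
        frontier = children[:beam_width]
    # Return the best complete path found.
    return max(frontier, key=score_thought)
```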


Monte Carlo Tree Search

MCTS was originally developed for game-playing AI and was famously used in AlphaGo and AlphaZero.


It's crucial for the Tree of Thoughts, as it allows for simulation and evaluation of different solution paths.


Here's how MCTS works:

  1. Selection: The algorithm starts at the root of the tree (the initial problem state) and selects a promising branch based on predefined criteria, such as past performance or exploration potential.
  2. Simulation: The selected branch is expanded by simulating possible next steps, generating new nodes in the tree.
  3. Evaluation: Each node is evaluated, often using a reward model to score its quality and potential.
  4. Backpropagation: The evaluation scores are propagated back up the tree, updating the parent nodes to reflect the outcomes of their children.


The process is iterative and lets the model focus computational resources on the most promising branches, balancing the exploration of new paths against the exploitation of known high-quality solutions.
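Below is a stripped-down sketch of those four phases. The `propose_steps` and `evaluate` callables are assumed stand-ins for the generator and the reward model, the UCT-style selection formula is just one common choice, and real systems add refinements such as batched rollouts and early termination.

```python
import math
import random
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    state: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    value: float = 0.0

def mcts(root_state, propose_steps, evaluate, iterations=100, c=1.4):
    """Minimal MCTS loop: select -> expand/simulate -> evaluate -> backpropagate."""
    root = Node(root_state)
    for _ in range(iterations):
        # 1. Selection: walk down the tree using a UCT-style score.
        node = root
        while node.children:
            node = max(
                node.children,
                key=lambda ch: (ch.value / (ch.visits + 1e-9))
                + c * math.sqrt(math.log(node.visits + 1) / (ch.visits + 1e-9)),
            )
        # 2. Expansion / simulation: generate possible next steps as new nodes.
        for step in propose_steps(node.state):
            node.children.append(Node(step, parent=node))
        leaf = random.choice(node.children) if node.children else node
        # 3. Evaluation: score the new node with a reward model.
        reward = evaluate(leaf.state)
        # 4. Backpropagation: push the score back up toward the root.
        while leaf is not None:
            leaf.visits += 1
            leaf.value += reward
            leaf = leaf.parent
    # Return the most-visited first step as the chosen reasoning move.
    return max(root.children, key=lambda ch: ch.visits).state if root.children else root_state
```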


Post-training - a key difference between LLMs and LRMs

Building a Large Reasoning Model involves more than just scaling up pre-training. While traditional LLMs undergo Supervised Fine-Tuning (SFT) and Preference Fine-Tuning to align the model with human values and improve utility, LRMs add new layers of complexity to the post-training process. These include real-time memory updates, active test-time training, and advanced search mechanisms.


Memory updates and retrieval

One of the key advancements in LRMs is the ability to update their memory in real time. Unlike LLMs, which remain static after pre-training, LRMs can incorporate new information dynamically, enabling them to adapt to changes and improve over time.
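What such a memory component might look like in its simplest form is sketched below: an external store that the system writes new facts into during inference and queries by embedding similarity. This is purely illustrative; the `embed` function is assumed, and production systems typically rely on vector databases and learned retrievers.

```python
import numpy as np

class EpisodicMemory:
    """Toy external memory: store embeddings of new facts at inference time
    and retrieve the most similar ones for later queries."""

    def __init__(self, embed):           # embed: text -> np.ndarray (assumed)
        self.embed = embed
        self.keys: list = []
        self.values: list = []

    def write(self, fact: str) -> None:
        # Incorporate new information dynamically, after pre-training.
        self.keys.append(self.embed(fact))
        self.values.append(fact)

    def read(self, query: str, k: int = 3) -> list:
        if not self.keys:
            return []
        q = self.embed(query)
        sims = [float(q @ key / (np.linalg.norm(q) * np.linalg.norm(key) + 1e-9))
                for key in self.keys]
        top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
        return [self.values[i] for i in top]
```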


Test-Time Training (TTT)

It's a technique that allows models to partially update their parameters during inference, effectively learning on the fly.

When faced with unfamiliar problems, the model generates small datasets of variations, trains itself on these examples, and applies the learned insights to the original task.
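A hypothetical sketch of that loop, assuming a Hugging Face-style causal language model and an `augment` function that produces the self-generated variations; the essential idea is a handful of gradient steps on a cloned model before it answers the original task.

```python
import copy
import torch

def test_time_train(model, tokenizer, problem: str, augment, steps: int = 8, lr: float = 1e-4):
    """Test-Time Training sketch: clone the model, generate a tiny dataset of
    variations of the problem, take a few gradient steps on it, then answer
    the original task with the adapted copy."""
    adapted = copy.deepcopy(model)          # keep the base model untouched
    adapted.train()
    optimizer = torch.optim.AdamW(adapted.parameters(), lr=lr)

    variations = augment(problem)           # self-generated problem variations (assumed)
    for _ in range(steps):
        for text in variations:
            batch = tokenizer(text, return_tensors="pt")
            out = adapted(**batch, labels=batch["input_ids"])  # causal LM loss
            out.loss.backward()
            optimizer.step()
            optimizer.zero_grad()

    adapted.eval()
    with torch.no_grad():
        prompt = tokenizer(problem, return_tensors="pt")
        gen = adapted.generate(**prompt, max_new_tokens=256)
    return tokenizer.decode(gen[0], skip_special_tokens=True)
```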


Reinforcement Learning from Verifiable Rewards (RLVR)

A technique that trains models to optimize their reasoning processes by providing direct feedback on correctness. For tasks with clear verification criteria, such as coding or mathematical problem-solving, RLVR uses automated verifiers to score the model’s outputs. This feedback loop refines the model’s reasoning, improving both accuracy and efficiency.
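The defining feature is that the reward comes from an automatic check rather than a learned preference model. A minimal, hypothetical verifier for problems with a single exact answer might look like this (for code tasks, the check would instead run the program against unit tests):

```python
import re

def verifiable_reward(problem: str, model_output: str, expected_answer: str) -> float:
    """RLVR-style reward: 1.0 if the final answer in the model's output
    matches the ground-truth answer exactly, else 0.0."""
    # Assume the model is prompted to end its output with 'Answer: <value>'.
    match = re.search(r"Answer:\s*(.+)", model_output)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == expected_answer.strip() else 0.0

# These scores then feed a standard RL update over sampled rollouts:
rollouts = ["...reasoning... Answer: 408", "...reasoning... Answer: 42"]
rewards = [verifiable_reward("What is 17 * 24?", r, "408") for r in rollouts]
print(rewards)  # [1.0, 0.0]
```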


GFlowNets

Generative Flow Networks (GFlowNets) are a novel approach to AI reasoning. Unlike traditional reinforcement learning, which maximizes a single reward, GFlowNets generate solutions proportional to their rewards, enabling a more nuanced exploration of potential answers.
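To see the difference in miniature: a reward maximizer always returns the single highest-reward answer, while a GFlowNet is trained so that it samples answers with probability proportional to their reward. The toy example below only illustrates that target distribution with made-up rewards; an actual GFlowNet learns a sequential construction policy to achieve it.

```python
import random

rewards = {"answer_a": 8.0, "answer_b": 4.0, "answer_c": 1.0}

# Reward maximization (classic RL objective): always pick the argmax.
best = max(rewards, key=rewards.get)

# GFlowNet-style target distribution: sample proportional to reward,
# so answer_a is drawn ~8/13 of the time, answer_b ~4/13, answer_c ~1/13.
total = sum(rewards.values())
samples = random.choices(list(rewards), weights=list(rewards.values()), k=10_000)
frequencies = {k: round(samples.count(k) / len(samples), 2) for k in rewards}

print("argmax:", best)
print("target proportions:", {k: round(v / total, 2) for k, v in rewards.items()})
print("empirical proportions:", frequencies)
```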

This innovative method has garnered significant attention, with AI luminary Yoshua Bengio describing it as a transformative research direction.



In the next chapter, we'll cover where we're at with LRMs, where they can be used, and which limitations we have to conquer before widespread use.



For more ML and AI insights, subscribe or follow Sparkbit on LinkedIn.

If you're looking to start an AI project, you can book a free consultation with our CTO here: https://lnkd.in/gj2kTChR



Author: Kornel Kania, AI Delivery Consultant at Sparkbit

