Search-R1, Gemini Embeddings & Controlled Reasoning with L1

In this issue:

  1. Emergent search behavior in LLMs
  2. Stopping reasoning models from “overthinking”
  3. The best embeddings - for everything?


Accelerate your AI projects with Prolific. Claim $50 free credits and get quality human data in minutes from 200,000+ taskers.

No setup cost, no subscription, no delay—get started, top up your account to claim your free credit, and test Prolific for yourself now.

Use code: LLM-WATCH-50



1. Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

Watching: Search-R1 (paper)

What problem does it solve? Large Language Models (LLMs) face significant challenges when performing complex reasoning tasks that require up-to-date or specialized external knowledge. Current approaches to integrating LLMs with search engines have notable limitations. Retrieval-Augmented Generation (RAG) typically follows a one-round retrieval pattern that lacks flexibility for multi-turn, multi-query interactions needed for complex reasoning. Meanwhile, tool-use approaches either struggle with generalization (when using prompting) or require extensive high-quality labeled data of search-and-reasoning interactions (when using supervised fine-tuning). The research community has been searching for more flexible and data-efficient methods to teach LLMs how to optimally interact with search engines.

How does it solve the problem? The researchers introduced Search-R1, a new reinforcement learning framework that teaches LLMs to interleave step-by-step reasoning with search engine queries. Unlike previous approaches, Search-R1 requires no supervised demonstration data - the model learns solely through RL with a simple outcome-based reward function. The system uses special tokens to structure interactions: reasoning steps are wrapped in <think>...</think> tags, search queries in <search>...</search> tags, and retrieved information in <information>...</information> tags. To keep training stable, the authors apply "retrieved token masking", which excludes tokens returned by the search engine from the policy-gradient loss so the model is only optimized on the tokens it generated itself. The framework is compatible with various RL algorithms, including PPO and GRPO, allowing the LLM to develop sophisticated multi-turn retrieval strategies through experience.
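To make the interaction structure concrete, here is a minimal Python sketch of such a rollout loop with retrieved-token masking. The `generate_until` and `retriever.search` interfaces and the whitespace tokenizer are illustrative placeholders, not the paper's actual implementation:

```python
# Minimal sketch of a Search-R1-style rollout loop with retrieved-token masking.
# `model.generate_until` and `retriever.search` are hypothetical stand-ins for
# whatever generation/retrieval interfaces the training framework provides.
import re

def tokenize(text: str):
    # Placeholder tokenizer; a real setup would use the model's own tokenizer.
    return text.split()

def rollout(model, retriever, prompt: str, max_turns: int = 4):
    """Interleave <think>/<search> steps; mask retrieved tokens out of the RL loss."""
    text, loss_mask = prompt, []                    # 1 = trained on, 0 = masked
    for _ in range(max_turns):
        chunk = model.generate_until(text, stop=["</search>", "</answer>"])
        text += chunk
        loss_mask += [1] * len(tokenize(chunk))     # model-generated tokens stay in the loss

        match = re.search(r"<search>(.*?)</search>", chunk, re.S)
        if match is None:                           # no query -> model gave its final answer
            break
        docs = retriever.search(match.group(1))     # call the external search engine
        info = f"<information>{docs}</information>"
        text += info
        loss_mask += [0] * len(tokenize(info))      # retrieved tokens are excluded from the loss
    return text, loss_mask                          # outcome reward is computed on the final answer
```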

What are the key findings? Experiments across seven question-answering datasets demonstrated that Search-R1 significantly outperforms state-of-the-art baselines, with impressive relative improvements of 26% for Qwen2.5-7B, 21% for Qwen2.5-3B, and 10% for LLaMA3.2-3B models. The researchers found that while instruction-tuned models initially converged faster, base models eventually achieved comparable performance through reinforcement learning. Interestingly, GRPO generally outperformed PPO for optimization, though PPO provided greater training stability. The case studies revealed that Search-R1 enables LLMs to perform iterative self-verification through additional search steps, even when they already have sufficient information - a sophisticated behavior that emerged naturally from RL training.

Why does it matter? By relying solely on reinforcement learning without extensive supervised data, Search-R1 offers a more scalable approach to teaching LLMs complex search and reasoning behaviors. This is particularly valuable as models encounter increasingly complex tasks requiring specialized or time-sensitive knowledge beyond their training data. The framework's effectiveness across different model families suggests broad applicability, while the multi-turn retrieval capability enables more sophisticated problem-solving than traditional RAG approaches. As LLMs continue to be deployed in knowledge-intensive applications, this work points toward more autonomous and effective information-seeking behaviors that could significantly enhance their real-world utility.


2. L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning

Watching: L1 (paper)

What problem does it solve? LLMs with reasoning capabilities have shown they can solve increasingly complex problems by "thinking longer" - generating extended chain-of-thought sequences. However, these models lack control over their reasoning length, which creates a fundamental inefficiency. Without length control, it's impossible to allocate computing resources effectively: models might waste tokens on simple problems or cut reasoning short on complex ones. Previous attempts at controlling reasoning length, like the S1 method, have resulted in significant performance degradation compared to uncontrolled models.

How does it solve the problem? Length Controlled Policy Optimization (LCPO) is a new reinforcement learning approach that trains models to satisfy two objectives simultaneously: providing correct answers and adhering to user-specified length constraints given in the prompt. The researchers trained two variants of their model, L1: L1-Exact, which generates reasoning of exactly the specified length, and L1-Max, which ensures reasoning doesn't exceed a maximum token budget. Unlike previous heuristic-based approaches, L1 learns to adaptively allocate tokens within these constraints while maintaining reasoning quality.
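As a rough illustration, a length-aware reward of this kind could look like the sketch below. The penalty coefficient `alpha` and the exact functional forms are assumptions for illustration, not the values used in the paper:

```python
# Minimal sketch of an LCPO-style reward that trades off answer correctness against
# a prompt-specified length constraint. `alpha` and the linear penalty shape are
# illustrative choices, not the paper's tuned formulation.

def lcpo_exact_reward(correct: bool, n_generated: int, n_target: int,
                      alpha: float = 0.0003) -> float:
    """L1-Exact style: reward correctness, penalize deviation from the target length."""
    return float(correct) - alpha * abs(n_generated - n_target)

def lcpo_max_reward(correct: bool, n_generated: int, n_budget: int,
                    alpha: float = 0.0003) -> float:
    """L1-Max style: only penalize generations that exceed the token budget."""
    overshoot = max(0, n_generated - n_budget)
    return float(correct) - alpha * overshoot

# Example: a correct 900-token answer against a 1,000-token target/budget.
print(lcpo_exact_reward(True, 900, 1000))  # 0.97 (penalized for undershooting the target)
print(lcpo_max_reward(True, 900, 1000))    # 1.0  (within budget, no penalty)
```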

What are the key findings? L1 demonstrated precise control over reasoning length while significantly outperforming previous methods - beating the S1 length-control approach by up to 100% relative and 20% absolute on math reasoning tasks. The models maintained strong performance across various domains, even generalizing to out-of-distribution tasks like logical reasoning and general knowledge benchmarks. Perhaps most surprisingly, the researchers discovered that models trained to generate long reasoning chains become unexpectedly good at short-form reasoning, with their 1.5B parameter model matching GPT-4o's performance when using identical token budgets.

Why does it matter? These findings offer a solution to the compute efficiency problem in reasoning LLMs by enabling precise control of the performance-compute tradeoff. This has significant practical implications for deploying reasoning models in resource-constrained environments where computational costs must be carefully managed. The ability to dynamically allocate tokens based on problem difficulty creates more efficient reasoning systems overall. Furthermore, the discovery that smaller models can match much larger ones at identical token budgets suggests a promising approach for efficient reasoning that could reduce the need for extremely large parameter counts in certain applications.


3. Gemini Embedding: Generalizable Embeddings from Gemini

Watching: Gemini Embedding (paper)

What problem does it solve? Text embeddings - vector representations that capture the meaning of words, sentences, and documents - are fundamental building blocks for many NLP applications. However, until now, developers faced a frustrating trade-off: they could either optimize for strong multilingual capabilities, good code understanding, or superior English performance, but not all three simultaneously. This limitation has forced organizations to maintain multiple specialized embedding models, increasing complexity and computational costs. The Gemini Embedding paper tackles this fragmentation by attempting to create a truly universal embedding model that maintains high quality across languages, domains, and tasks without sacrificing performance in any area.

How does it solve the problem? The researchers leverage Google's powerful Gemini LLM as their foundation and employ a sophisticated two-stage training approach. First, they "pre-finetune" the model on billions of text pairs to adapt it from generation to encoding. Then, they fine-tune on diverse task-specific datasets with carefully optimized batch sizes. What's particularly clever is their bootstrapping approach - using Gemini itself to improve the training data by generating synthetic examples, filtering low-quality data, and mining optimal positive/negative pairs for contrastive learning. The final model benefits from "Model Soup," a technique that averages parameters from multiple fine-tuned checkpoints to enhance generalization across languages and tasks.
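For intuition, the "Model Soup" step amounts to averaging the parameters of several fine-tuned checkpoints into one model. The toy sketch below shows the idea on plain Python dicts rather than real model weights:

```python
# Toy sketch of "Model Soup": uniformly averaging the weights of several fine-tuned
# checkpoints. Checkpoints are represented as dicts mapping parameter names to lists
# of floats; a real implementation would operate on framework tensors.

def model_soup(checkpoints):
    """Return the uniform parameter average across fine-tuned checkpoints."""
    n = len(checkpoints)
    return {
        name: [sum(ckpt[name][i] for ckpt in checkpoints) / n
               for i in range(len(checkpoints[0][name]))]
        for name in checkpoints[0]
    }

# Example: three checkpoints of a toy two-parameter model.
ckpts = [
    {"w": [0.9, 0.1], "b": [0.0]},
    {"w": [1.1, 0.3], "b": [0.2]},
    {"w": [1.0, 0.2], "b": [0.1]},
]
print(model_soup(ckpts))  # approximately {'w': [1.0, 0.2], 'b': [0.1]}
```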

What are the key findings? Gemini Embedding achieves remarkable across-the-board performance, establishing new state-of-the-art benchmarks on MTEB Multilingual (68.32 task mean score, +5.09 over the previous best model), MTEB English, and MTEB Code simultaneously with a single model. It particularly excels at classification (+9.6 points), clustering (+3.7), and retrieval (+9.0) compared to previous top performers. Cross-lingual retrieval sees dramatic improvements, with the model achieving 90.42% on XOR-Retrieve and 64.33% on XTREME-UP, supporting over 250 languages. Perhaps most impressively, their ablation studies show the model can generalize to multiple languages effectively even when trained primarily on English data, suggesting it's leveraging Gemini's inherent multilingual understanding.

Why does it matter? This is a big deal for embeddings: the long-assumed trade-off between multilingual support, code understanding, and English performance no longer appears unavoidable. By creating a single unified embedding space that works well across all these dimensions, Gemini Embedding enables more inclusive and effective AI applications that can serve users regardless of language. The ability to precompute these high-quality representations also makes Gemini's capabilities accessible in compute- and latency-sensitive settings where running a full LLM would be impractical. Consequently, we can expect more powerful and equitable information retrieval systems, recommendation engines, and semantic search applications that work equally well for users worldwide.


Papers of the Week:


If you enjoyed this article, give it a like and share it with your peers.

