Search-R1, Gemini Embeddings & Controlled Reasoning with L1
In this issue:
1. Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning
2. L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning
3. Gemini Embedding: Generalizable Embeddings from Gemini
Accelerate your AI projects with Prolific. Claim $50 free credits and get quality human data in minutes from 200,000+ taskers.
No setup cost, no subscription, no delay—get started, top up your account to claim your free credit, and test Prolific for yourself now.
Use code: LLM-WATCH-50
1. Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning
Watching: Search-R1 (paper)
What problem does it solve? Large Language Models (LLMs) face significant challenges when performing complex reasoning tasks that require up-to-date or specialized external knowledge. Current approaches to integrating LLMs with search engines have notable limitations. Retrieval-Augmented Generation (RAG) typically follows a one-round retrieval pattern that lacks flexibility for multi-turn, multi-query interactions needed for complex reasoning. Meanwhile, tool-use approaches either struggle with generalization (when using prompting) or require extensive high-quality labeled data of search-and-reasoning interactions (when using supervised fine-tuning). The research community has been searching for more flexible and data-efficient methods to teach LLMs how to optimally interact with search engines.
How does it solve the problem? The researchers introduced Search-R1, a new reinforcement learning framework that teaches LLMs to interleave step-by-step reasoning with search engine queries. Unlike previous approaches, Search-R1 requires no supervised demonstration data - the model learns solely through RL with a simple outcome-based reward function. The system uses special tokens to structure interactions: reasoning steps are wrapped in <think></think> tags, search queries in <search></search> tags, and retrieved passages appear between <information></information> tags. To keep training stable, the authors apply "retrieved token masking," which excludes tokens returned by the search engine from the loss so the policy is only optimized on its own generations. The framework is compatible with various RL algorithms, including PPO and GRPO, allowing the LLM to develop sophisticated multi-turn retrieval strategies through experience.
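To make the interleaving and masking idea concrete, here is a minimal sketch of how such a rollout might be assembled. The `generate_step` and `search_engine` functions and the whitespace tokenization are placeholder assumptions for illustration, not the paper's implementation; only the structure (think/search/information turns plus a loss mask over retrieved tokens) reflects the described approach.

```python
# Minimal sketch of a Search-R1-style rollout with retrieved-token masking.
# generate_step and search_engine are hypothetical stand-ins for the policy
# model and the retriever; whitespace "tokens" stand in for real tokenizer ids.
import re

def generate_step(context: str) -> str:
    # Hypothetical policy output: think, then either search or answer.
    if "<information>" not in context:
        return "<think>I need external facts.</think><search>capital of France</search>"
    return "<think>The retrieved passage answers it.</think><answer>Paris</answer>"

def search_engine(query: str) -> str:
    # Hypothetical retriever returning a passage for the query.
    return "Paris is the capital and largest city of France."

def rollout(question: str, max_turns: int = 4):
    context = f"Question: {question}\n"
    loss_mask = [0] * len(context.split())          # prompt tokens: never optimized
    for _ in range(max_turns):
        out = generate_step(context)
        context += out
        loss_mask += [1] * len(out.split())         # model-generated tokens: optimized
        query = re.search(r"<search>(.*?)</search>", out)
        if query:
            info = f"<information>{search_engine(query.group(1))}</information>"
            context += info
            loss_mask += [0] * len(info.split())    # retrieved tokens: masked out
        if "<answer>" in out:
            break
    return context, loss_mask

trajectory, mask = rollout("What is the capital of France?")
print(trajectory)
print(mask)  # zeros mark prompt and retrieved tokens excluded from the RL loss
```

The key point is the mask: the outcome-based reward still depends on the full trajectory, but gradient updates only flow through tokens the model itself produced.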
What are the key findings? Experiments across seven question-answering datasets demonstrated that Search-R1 significantly outperforms state-of-the-art baselines, with impressive relative improvements of 26% for Qwen2.5-7B, 21% for Qwen2.5-3B, and 10% for LLaMA3.2-3B models. The researchers found that while instruction-tuned models initially converged faster, base models eventually achieved comparable performance through reinforcement learning. Interestingly, GRPO generally outperformed PPO for optimization, though PPO provided greater training stability. The case studies revealed that Search-R1 enables LLMs to perform iterative self-verification through additional search steps, even when they already have sufficient information - a sophisticated behavior that emerged naturally from RL training.
Why does it matter? By relying solely on reinforcement learning without extensive supervised data, Search-R1 offers a more scalable approach to teaching LLMs complex search and reasoning behaviors. This is particularly valuable as models encounter increasingly complex tasks requiring specialized or time-sensitive knowledge beyond their training data. The framework's effectiveness across different model families suggests broad applicability, while the multi-turn retrieval capability enables more sophisticated problem-solving than traditional RAG approaches. As LLMs continue to be deployed in knowledge-intensive applications, this work points toward more autonomous and effective information-seeking behaviors that could significantly enhance their real-world utility.
2. L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning
Watching: L1 (paper)
What problem does it solve? LLMs with reasoning capabilities have shown they can solve increasingly complex problems by "thinking longer" - generating extended chain-of-thought sequences. However, these models lack control over their reasoning length, which creates a fundamental inefficiency. Without length control, it's impossible to allocate computing resources effectively: models might waste tokens on simple problems or cut reasoning short on complex ones. Previous attempts at controlling reasoning length, like the S1 method, have resulted in significant performance degradation compared to uncontrolled models.
How does it solve the problem? Length Controlled Policy Optimization (LCPO) is a new reinforcement learning approach that trains models to satisfy two objectives simultaneously: producing correct answers and adhering to a user-specified length constraint given in the prompt. The researchers trained two variants of their model, called L1: L1-Exact, which generates reasoning of exactly the specified length, and L1-Max, which ensures reasoning doesn't exceed a maximum token budget. Unlike previous heuristic-based approaches, L1 learns to adaptively allocate tokens within the constraint while maintaining reasoning quality.
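As a rough illustration of how correctness and length adherence can be folded into a single scalar reward, here is a hedged sketch. The coefficient value and the hard-budget treatment of the Max variant are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of an LCPO-style reward combining answer correctness with a
# length-adherence term. Coefficients and the exact L1-Max formulation
# are illustrative assumptions.

def lcpo_exact_reward(is_correct: bool, n_generated: int, n_target: int,
                      alpha: float = 0.0003) -> float:
    """Reward correctness and penalize deviation from the requested length."""
    return float(is_correct) - alpha * abs(n_target - n_generated)

def lcpo_max_reward(is_correct: bool, n_generated: int, n_budget: int) -> float:
    """Simplified Max variant: only grant the reward if the budget is respected."""
    return float(is_correct) if n_generated <= n_budget else 0.0

# Example: a correct 900-token answer when 1,000 tokens were requested.
print(lcpo_exact_reward(True, 900, 1000))  # slightly below 1.0 due to the length gap
print(lcpo_max_reward(True, 900, 1000))    # 1.0, since the budget was respected
```

The length term is what pushes the policy to plan its reasoning budget up front rather than stopping abruptly when it runs out of tokens.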
What are the key findings? L1 demonstrated precise control over reasoning length while significantly outperforming previous methods - beating the S1 length-control approach by up to 100% relative and 20% absolute on math reasoning tasks. The models maintained strong performance across various domains, even generalizing to out-of-distribution tasks like logical reasoning and general knowledge benchmarks. Perhaps most surprisingly, the researchers discovered that models trained to generate long reasoning chains become unexpectedly good at short-form reasoning, with their 1.5B parameter model matching GPT-4o's performance when using identical token budgets.
Why does it matter? These findings offer a solution to the compute efficiency problem in reasoning LLMs by enabling precise control of the performance-compute tradeoff. This has significant practical implications for deploying reasoning models in resource-constrained environments where computational costs must be carefully managed. The ability to dynamically allocate tokens based on problem difficulty creates more efficient reasoning systems overall. Furthermore, the discovery that smaller models can match much larger ones at identical token budgets suggests a promising approach for efficient reasoning that could reduce the need for extremely large parameter counts in certain applications.
3. Gemini Embedding: Generalizable Embeddings from Gemini
Watching: Gemini Embedding (paper)
What problem does it solve? Text embeddings - vector representations that capture the meaning of words, sentences, and documents - are fundamental building blocks for many NLP applications. However, until now, developers faced a frustrating trade-off: they could either optimize for strong multilingual capabilities, good code understanding, or superior English performance, but not all three simultaneously. This limitation has forced organizations to maintain multiple specialized embedding models, increasing complexity and computational costs. The Gemini Embedding paper tackles this fragmentation by attempting to create a truly universal embedding model that maintains high quality across languages, domains, and tasks without sacrificing performance in any area.
How does it solve the problem? The researchers leverage Google's powerful Gemini LLM as their foundation and employ a sophisticated two-stage training approach. First, they "pre-finetune" the model on billions of text pairs to adapt it from generation to encoding. Then, they fine-tune on diverse task-specific datasets with carefully optimized batch sizes. What's particularly clever is their bootstrapping approach - using Gemini itself to improve the training data by generating synthetic examples, filtering low-quality data, and mining optimal positive/negative pairs for contrastive learning. The final model benefits from "Model Soup," a technique that averages parameters from multiple fine-tuned checkpoints to enhance generalization across languages and tasks.
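For readers unfamiliar with the technique, a model soup is simply a parameter-wise average of several fine-tuned checkpoints. Below is a minimal PyTorch sketch of uniform checkpoint averaging, assuming all checkpoints share the same architecture; it illustrates the general idea rather than Gemini Embedding's exact recipe.

```python
# Minimal sketch of "model soup" checkpoint averaging (uniform soup).
# Assumes all checkpoints come from the same architecture; this shows the
# general technique, not Gemini Embedding's specific procedure.
import torch

def average_checkpoints(state_dicts):
    """Uniformly average parameters across several fine-tuned checkpoints."""
    souped = {}
    for name in state_dicts[0]:
        souped[name] = torch.stack(
            [sd[name].float() for sd in state_dicts]
        ).mean(dim=0)
    return souped

# Usage (paths are hypothetical): load N fine-tuned checkpoints and average them.
# checkpoints = [torch.load(p, map_location="cpu") for p in paths]
# model.load_state_dict(average_checkpoints(checkpoints))
```

Averaging checkpoints fine-tuned on different task mixtures tends to smooth out per-task overfitting, which is why it helps cross-lingual and cross-task generalization.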
What are the key findings? Gemini Embedding achieves remarkable across-the-board performance, establishing new state-of-the-art benchmarks on MTEB Multilingual (68.32 task mean score, +5.09 over the previous best model), MTEB English, and MTEB Code simultaneously with a single model. It particularly excels at classification (+9.6 points), clustering (+3.7), and retrieval (+9.0) compared to previous top performers. Cross-lingual retrieval sees dramatic improvements, with the model achieving 90.42% on XOR-Retrieve and 64.33% on XTREME-UP, supporting over 250 languages. Perhaps most impressively, their ablation studies show the model can generalize to multiple languages effectively even when trained primarily on English data, suggesting it's leveraging Gemini's inherent multilingual understanding.
Why does it matter? This is a big deal for embeddings - the trade-offs between multilingual support, code understanding, and task-specific performance no longer look unavoidable. By creating a single unified embedding space that works well across all these dimensions, Gemini Embedding enables more inclusive and effective AI applications that can serve users regardless of language. The ability to precompute these high-quality representations also makes Gemini's capabilities accessible in compute- and latency-sensitive settings where running a full LLM would be impractical. Consequently, we can expect more powerful and equitable information retrieval systems, recommendation engines, and semantic search applications that work equally well for users worldwide.
Papers of the Week:
If you enjoyed this article, give it a like and share it with your peers.