Here are the top AI papers of the week:
1). s1: Simple test-time scaling
Researchers from Stanford, UW, and others introduce s1, a method to boost LLM performance by using extra compute at inference (“test-time scaling”). Key ideas include:
- Small yet powerful dataset – They curated s1K, only 1,000 challenging questions with detailed reasoning traces, to fine-tune a 32B model. Despite the tiny dataset, these examples provide strong reasoning exemplars.
- “Budget forcing” for reasoning – A new decoding trick appends the token “Wait” when the model tries to stop, forcing it to think longer. This leads the model to double-check and fix its reasoning steps (a minimal sketch of the idea follows this list). By also cutting off overly long reasoning traces, they keep inference time under control.
- Big gains over OpenAI’s o1 – The resulting model, s1-32B (a fine-tuned version of Qwen2.5-32B-Instruct), outperforms OpenAI’s o1-preview model by up to +27% on competition-level math questions (MATH & AIME24). Notably, with test-time scaling, it boosts accuracy on AIME24 from 50% to 57%, surpassing what it achieves without budget forcing.
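Here is a minimal sketch of the budget-forcing idea, assuming a Hugging Face-style generation loop. The model name matches the base model the paper fine-tunes, but the token budgets, the “ Wait” string, and the EOS handling are illustrative assumptions rather than the authors’ exact implementation (the paper controls an end-of-thinking delimiter):

```python
# Budget forcing (sketch): suppress early stops by appending "Wait", and cut
# off reasoning that exceeds an upper token budget. Budgets are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-32B-Instruct"  # base model that s1-32B fine-tunes
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

def generate_with_budget(prompt, min_think=512, max_think=2048):
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    ids = input_ids
    while True:
        thought = ids.shape[1] - input_ids.shape[1]
        if thought >= max_think:        # upper budget reached: cut reasoning off
            break
        ids = model.generate(ids, max_new_tokens=max_think - thought, do_sample=False)
        if ids[0, -1].item() == tokenizer.eos_token_id:
            ids = ids[:, :-1]           # drop the stop token so generation can continue
        if ids.shape[1] - input_ids.shape[1] >= min_think:
            break                       # thought long enough; let the model answer
        # The model tried to stop too early: append "Wait" and force more thinking.
        wait = tokenizer(" Wait", return_tensors="pt",
                         add_special_tokens=False).input_ids.to(model.device)
        ids = torch.cat([ids, wait], dim=-1)
    return tokenizer.decode(ids[0, input_ids.shape[1]:], skip_special_tokens=True)
```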
2). OmniHuman-1: Scaling One-Stage Human Animation
A team at ByteDance AI Lab unveiled OmniHuman-1, a diffusion-transformer model that can generate highly realistic human videos from just a single image plus motion input (audio or video). Highlights:
- End-to-end human video generation – OmniHuman takes one image (any aspect ratio, from face only to full-body) and an audio clip or video motion and produces a lifelike video of that person speaking, singing, or performing actions. The outputs are remarkably realistic in motion, lighting, and texture detail.
- Mixed modality training – A key innovation is Omni-Conditions Training: mixing various motion modalities during training (audio-driven, video-driven, pose, etc.). This greatly expands the usable training data and overcomes the usual scarcity of high-quality talking-head video data, and the model learns to handle diverse inputs (speech, song, instruments) and challenging poses (a toy sketch of the mixing idea follows this list).
- Outperforms prior methods – Compared to earlier one-stage models (e.g. audio-driven talking heads), OmniHuman generates more realistic videos and is more flexible in input types. It can even handle cartoons or animal figures as input, transferring motion naturally to each style.
- Broader support – The approach supports any portrait content (face close-up, half-body, full-body) and multiple driving signals simultaneously. This generality is a first for end-to-end human animation models.
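To make the mixed-modality idea concrete, here is a toy sketch of per-sample condition mixing. The condition names, ratios, and selection rule are assumptions for illustration only, not ByteDance’s actual training configuration:

```python
# Toy sketch of omni-conditions training: for each training sample, randomly
# choose which driving signals (audio, pose, driving video) are fed to the
# model alongside the reference image, so data collected for one modality can
# also train the others. Ratios and condition names are illustrative only.
import random

CONDITION_RATIOS = {"audio": 0.5, "pose": 0.3, "video": 0.2}  # assumed mixing ratios

def sample_conditions(example):
    """Return the subset of available driving conditions used for this step."""
    active = {"reference_image": example["reference_image"]}  # always keep identity
    for name, ratio in CONDITION_RATIOS.items():
        if name in example and random.random() < ratio:
            active[name] = example[name]
    return active

# Example: a clip that has both audio and pose annotations available.
sample = {"reference_image": "img.png", "audio": "clip.wav", "pose": "pose.json"}
print(sample_conditions(sample))
```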
3). LIMO: Less Is More for Reasoning
Can a handful of examples teach complex math reasoning to LLMs? This new LIMO paper challenges the notion that we need huge fine-tuning datasets for tough reasoning tasks. Key findings:
- Surprisingly few examples – With only 817 carefully curated training samples, the LIMO model achieves 57.1% accuracy on the AIME math competition and 94.8% on MATH. This is a giant leap over prior SFT-based models (which scored 6.5% and 59.2%, respectively) while using just 1% of the training data those earlier approaches needed.
- Generalization with less data? – LIMO shows impressive OOD generalization: a +40.5% absolute improvement on average across 10 diverse benchmarks, even outperforming models trained on 100× more data. This challenges the assumptions that complex skills always require more data and that fine-tuning mainly leads to memorization.
- “Less-Is-More” Hypothesis – The authors propose that if an LLM’s pre-training has already endowed it with rich knowledge, then only a minimal set of carefully designed examples (which they call “cognitive templates”) is needed to unlock advanced reasoning. Essentially, the model just needs to see how to use its knowledge, not thousands of repetitive problems (a hypothetical curation sketch follows this list).
- Open-source suite – The complete LIMO training suite is released for the community, supporting further research on data-efficient reasoning. This work hints that small, high-quality datasets might yield state-of-the-art reasoning, lowering the barrier to fine-tuning powerful LLMs.
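As a rough illustration of the data-curation step, here is a hypothetical sketch of selecting a small set of “cognitive templates” for SFT. The scoring heuristic (difficulty via pass rate, trace length) and field names are assumptions, not LIMO’s published pipeline:

```python
# Hypothetical sketch of "less is more" curation: keep a few hundred hard
# problems with long, well-structured reasoning traces to act as "cognitive
# templates" for SFT. Scoring heuristic and field names are assumptions.
def quality(example):
    # Favor hard problems (low solve rate) with detailed, multi-step traces.
    difficulty = 1.0 - example["pass_rate"]          # fraction of attempts that succeed
    detail = 0.001 * len(example["reasoning"].split())
    return difficulty + detail

def curate(candidates, target_size=817):
    ranked = sorted(candidates, key=quality, reverse=True)
    return ranked[:target_size]

def to_sft_example(example):
    return {
        "prompt": example["question"],
        "completion": example["reasoning"] + "\n\nFinal answer: " + example["answer"],
    }
```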
4). CoAT: Chain-of-Associated-Thoughts for LLM Reasoning
This work introduces CoAT, a new “slow thinking” inference framework that enables an LLM to reason more like a human by exploring and updating its thoughts. Main components:
- MCTS + associative memory – CoAT marries Monte Carlo Tree Search (MCTS) with an associative memory mechanism. MCTS lets the model systematically explore different reasoning branches (possible solutions), while the associative memory dynamically injects new relevant information into the context as needed, mimicking how humans recall facts mid-thought (a toy sketch of the search loop follows this list).
- Iterative, self-improving reasoning – The framework can expand the search space of solutions and revisit or refine earlier intermediate conclusions. As it evaluates branches, it can incorporate new clues or correct itself, ensuring the final answer is more accurate and comprehensive. This is in contrast to standard one-pass LLM reasoning, which can’t easily backtrack or gather new info on the fly.
- Improved accuracy and diversity – In experiments across various generation and reasoning tasks, CoAT outperformed conventional single-pass inference on metrics like accuracy, coherence of reasoning steps, and solution diversity. The ability to iteratively broaden the search while keeping relevant context yields better results than “fast thinking” alone.
- Closer to human thought – CoAT is inspired by how humans solve problems: we iteratively consider alternatives, recall facts, and refine our thinking. It points toward LLM agents that can use search algorithms and memory to achieve more reliable reasoning.
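Here is a toy sketch of what such a search loop could look like. `call_llm`, `associative_memory`, and `score` are placeholders for a real model, retrieval store, and evaluator; the constants and scoring are illustrative, not CoAT’s exact algorithm:

```python
# Toy sketch of a CoAT-style loop: MCTS over partial reasoning chains, with an
# associative-memory lookup injecting recalled facts before each expansion.
import math
import random

def call_llm(context):            # placeholder: propose the next reasoning step
    return context + f"\n-> thought_{random.randint(0, 999)}"

def associative_memory(state):    # placeholder: recall facts relevant to this state
    return f"[recalled fact relevant to: {state[-30:]!r}]"

def score(state):                 # placeholder self-evaluation of a partial solution
    return random.random()

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

def uct(node, c=1.4):
    if node.visits == 0:
        return float("inf")
    exploit = node.value / node.visits
    explore = c * math.sqrt(math.log(node.parent.visits) / node.visits)
    return exploit + explore

def coat_search(question, iterations=50):
    root = Node(question)
    for _ in range(iterations):
        node = root
        while node.children:                     # selection: descend by UCT
            node = max(node.children, key=uct)
        memory = associative_memory(node.state)  # inject associated knowledge
        child = Node(call_llm(node.state + "\n" + memory), parent=node)
        node.children.append(child)              # expansion
        reward = score(child.state)              # evaluation
        while child is not None:                 # backpropagation
            child.visits += 1
            child.value += reward
            child = child.parent
    return max(root.children, key=lambda n: n.visits).state

print(coat_search("Prove that the sum of two even numbers is even."))
```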
5). Syntriever: Training Retrievers with LLM-Generated Data
How can we build a high-quality text retriever without large labeled datasets or access to an LLM’s internals? Syntriever presents a two-stage framework to distill knowledge from a black-box LLM into a retrieval model using synthetic data. Steps:
- Stage 1 – Distillation via synthetic Q&A: Given a query, they prompt a powerful LLM (e.g. GPT-4) to generate a relevant passage (answer) as well as plausible but incorrect passages, using chain-of-thought to ensure variety. The LLM then self-verifies these generated passages to filter out hallucinations and low-quality data. The result is a synthetic dataset of queries with positive and negative passages. A retriever is trained on this with a contrastive loss that pulls embeddings of relevant passages closer to the query than irrelevant ones (a sketch of such a loss follows this list).
- Stage 2 – Alignment with LLM preferences: They further align the retriever to prefer results the LLM would prefer. Using a partial Plackett-Luce ranking method, the retriever learns to rank passages similarly to the LLM’s judgments, with regularization to not drift too far from the Stage 1 model. This step fine-tunes the retriever to mimic the black-box LLM’s preferences.
- State-of-the-art results – Syntriever achieves new SOTA on several retrieval benchmarks across domains. This was achieved without any real training queries: all training data was synthetically generated by the LLM.
- No logits needed – Prior LLM-to-retriever distillation needed model logits or probabilities (not available from closed APIs). Syntriever gets around this by using only generated text and LLM scoring, making it applicable even to closed models.
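For the Stage 1 objective, a standard InfoNCE-style contrastive loss over the LLM-generated positives and negatives would look roughly like the following; Syntriever’s exact loss and encoder may differ:

```python
# Sketch of a Stage-1 retriever objective: an InfoNCE-style contrastive loss
# that pulls the query toward the LLM-generated positive passage and away from
# the generated negatives. Syntriever's exact loss and encoder may differ.
import torch
import torch.nn.functional as F

def contrastive_loss(q_emb, pos_emb, neg_emb, temperature=0.05):
    """
    q_emb:   (B, D) query embeddings
    pos_emb: (B, D) embeddings of LLM-generated relevant passages
    neg_emb: (B, N, D) embeddings of plausible-but-incorrect passages
    """
    q = F.normalize(q_emb, dim=-1)
    pos = F.normalize(pos_emb, dim=-1)
    neg = F.normalize(neg_emb, dim=-1)
    pos_sim = (q * pos).sum(dim=-1, keepdim=True)          # (B, 1)
    neg_sim = torch.einsum("bd,bnd->bn", q, neg)           # (B, N)
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)                 # positive is class 0
```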
6). Demystifying Long Chain-of-Thought Reasoning in LLMs
This work investigates how LLMs develop extended CoT reasoning, focusing on RL and compute scaling. Key insights include:
- Supervised fine-tuning (SFT) boosts performance – While not strictly necessary, SFT simplifies training and increases efficiency. Models fine-tuned with long CoT data achieve higher accuracy than those using short CoT sequences.
- Reward shaping is crucial for stable RL – The study finds that naive RL approaches don’t always extend CoT length effectively. To address this, the authors introduce a cosine length-scaling reward with repetition penalties, which balances reasoning depth and prevents meaningless length increases (a hedged sketch of this kind of reward follows this list).
- Scaling verifiable reward signals – RL models trained with noisy, web-extracted “silver” supervision signals can generalize better to OOD tasks, such as STEM reasoning. Filtering such data is crucial to maintaining training stability.
- Emergent reasoning abilities in base models – Skills like error correction and backtracking exist in base models but require careful RL incentives to be effectively utilized in complex tasks.
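To illustrate the reward-shaping idea, here is a hedged sketch of a cosine length-scaling reward plus an n-gram repetition penalty. The constants and the exact functional form are assumptions; the paper’s formulation may differ:

```python
# Hedged sketch of a cosine length-scaling reward plus an n-gram repetition
# penalty. Constants and functional form are assumptions, not the paper's exact formula.
import math

def cosine_length_reward(correct, cot_len, max_len=4096,
                         r_correct=(2.0, 1.0),   # reward at (short, long) CoT if correct
                         r_wrong=(-2.0, -1.0)):  # reward at (short, long) CoT if wrong
    short_r, long_r = r_correct if correct else r_wrong
    t = min(cot_len / max_len, 1.0)
    # Cosine interpolation from the short-CoT value to the long-CoT value:
    # correct answers earn most when concise, while wrong answers are penalized
    # less if the model at least reasoned for longer.
    return long_r + (short_r - long_r) * (1 + math.cos(math.pi * t)) / 2

def repetition_penalty(tokens, n=4, weight=0.05):
    """Penalize repeated n-grams, which often signal degenerate length growth."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return -weight * (len(ngrams) - len(set(ngrams)))

reward = cosine_length_reward(correct=True, cot_len=1200) + \
         repetition_penalty("let us check let us check let us check".split())
```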
This paper provides a structured roadmap for researchers looking to refine CoT training strategies for LLMs, highlighting how RL and reward tuning impact reasoning depth.
7). Rethinking Mixture-of-Agents: Ensemble One Strong LLM
Ensembling multiple models (Mixture-of-Agents, MoA) is a popular way to boost performance. This paper asks: is mixing different LLMs actually helpful, or are we better off ensembling one top model’s outputs? The surprising answer: “Self-MoA” (single-model ensemble) often wins over multi-model ensembles. Key points:
- Self-MoA vs. MoA – The authors propose Self-MoA, which simply generates multiple outputs from the single best model and then aggregates them (e.g., by majority voting or ranking), instead of combining outputs from various models. This gains diversity via multiple attempts without introducing weaker models (a minimal sketch follows this list).
- Better performance – Extensive tests show Self-MoA outperforms a mixture of different LLMs in many cases. For example, using one strong model, Self-MoA achieved +6.6% higher score than a mixed-model MoA on the AlpacaEval 2.0 benchmark, and on average +3.8% across tasks like MMLU, CRUX, and MATH. In fact, applying Self-MoA to a top AlpacaEval model set a new state-of-the-art on the leaderboard.
- Why it works – Mixing models can hurt because the overall quality is limited by the weaker members. The study finds MoA’s benefit is highly sensitive to the quality of each model – adding a weaker model dilutes performance. Unless all models are very strong and complementary, you’re better off with one model’s outputs. They do identify niche scenarios where diverse models help, but those are exceptions.
- Sequential aggregation – They also introduce a sequential version of Self-MoA that can combine a large number of outputs over multiple rounds (rather than all at once). This sequential Self-MoA is as effective as one-shot aggregation, scaling ensembling to many outputs efficiently.
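A minimal sketch of the Self-MoA loop, with `call_model` standing in for whatever LLM client you use and an illustrative aggregation prompt (the paper’s exact prompts and aggregation strategy may differ):

```python
# Minimal sketch of Self-MoA: sample several outputs from one strong model,
# then aggregate them into a final answer. `call_model(prompt, temperature)`
# is a placeholder for your LLM client; the aggregation prompt is illustrative.
def self_moa(question, call_model, n_samples=6):
    # Diversity comes from repeated sampling of the *same* strong model,
    # not from mixing in weaker models.
    proposals = [call_model(question, temperature=0.7) for _ in range(n_samples)]
    aggregation_prompt = (
        "You are given several candidate responses to the same question.\n"
        f"Question: {question}\n\n"
        + "\n\n".join(f"Candidate {i + 1}:\n{p}" for i, p in enumerate(proposals))
        + "\n\nSynthesize the single best response."
    )
    return call_model(aggregation_prompt, temperature=0.0)
```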
8). MaAS: Multi-agent Architecture Search (Agentic Supernet)
Building multi-agent systems of LLMs (where multiple agents collaborate, each with specific roles or tools) is powerful but usually requires hand-designing a single complex pipeline. MaAS (Multi-agent Architecture Search) instead learns a universal “agentic supernet” from which it can spawn an optimal agent team on the fly for each query. It automates designing the agent workflow per task:
- Agentic supernet – The authors define a continuous space of possible agent architectures (chains of LLM calls, tool uses, etc.). Rather than picking one static architecture, they train a supernet that encompasses many configurations. Each query can trigger a different sub-network of agents tailored to that query’s domain and difficulty (a toy sampling sketch follows this list).
- Dynamic resource allocation – Because the system adapts per query, it can allocate resources efficiently. Easy questions might use a simple, fast agent chain; hard problems invoke a more elaborate reasoning team. This avoids the one-size-fits-all cost of a monolithic agent system.
- Huge cost savings – On six benchmarks, MaAS used only 6–45% of the inference cost of existing multi-agent pipelines, yet still outperformed them by ~0.5–11.8% in accuracy. It finds cheaper ways to reach equal or better performance by tuning the agent configuration to the task.
- Robust and transferable – The agentic supernet approach showed strong generalization: architectures found effective on one task transferred well to new domains and even with different LLM backbones, outperforming static designs. This suggests the method learns general principles of how to orchestrate LLM agents optimally.
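As a toy illustration of per-query sampling from an agentic supernet: a learned distribution over operators is sampled layer by layer, with harder queries drawing deeper workflows. The operator names, difficulty proxy, and sampling rule are assumptions for illustration, not MaAS’s learned controller:

```python
# Toy sketch of sampling a per-query agent workflow from an "agentic supernet".
import random

OPERATORS = ["direct_answer", "chain_of_thought", "self_refine",
             "tool_call", "multi_agent_debate"]

def estimate_difficulty(query):
    # Crude proxy: longer queries are treated as harder (MaAS learns this signal).
    return min(len(query.split()) / 50.0, 1.0)

def sample_workflow(query, layer_probs):
    """layer_probs[i][op] is the probability of choosing `op` at layer i."""
    depth = 1 + round(estimate_difficulty(query) * (len(layer_probs) - 1))
    workflow = []
    for layer in layer_probs[:depth]:
        ops, weights = zip(*layer.items())
        workflow.append(random.choices(ops, weights=weights, k=1)[0])
    return workflow

# Example: a 3-layer supernet with uniform probabilities (i.e., before training).
uniform = [{op: 1 / len(OPERATORS) for op in OPERATORS} for _ in range(3)]
print(sample_workflow("What is 2 + 2?", uniform))
print(sample_workflow("Prove that the sum of the first n odd numbers equals n squared, "
                      "and explain each step carefully.", uniform))
```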
9). Advancing Reasoning in LLMs
This survey paper provides a timely overview of emerging methods to enhance reasoning capabilities in LLMs. It organizes the literature into several key approach categories:
- Prompting strategies – Techniques that guide the model’s reasoning via clever prompts, e.g. Chain-of-Thought prompting (having the model generate step-by-step solutions), Self-Consistency (sampling multiple reasoning paths and taking the majority answer; see the sketch after this list), Tree-of-Thought strategies, etc. These methods improve logical deduction and multi-step solutions without changing the model’s architecture.
- Architectural innovations – Modifications to the model or its context to better facilitate reasoning. This includes retrieval-augmented models (LLMs that can fetch external facts), modular reasoning networks (systems that break a problem into sub-tasks handled by different modules or experts), and neuro-symbolic integration (combining neural nets with symbolic logic or tools). Such changes aim to give LLMs access to either more knowledge or more structured reasoning processes.
- Learning paradigms – New training methods to instill reasoning skills: fine-tuning on reasoning-specific datasets (e.g. math word problems), reinforcement learning approaches (rewarding correct reasoning chains), and self-supervised objectives that train the model to reason (like predicting masked steps in a proof). These improve the model’s inherent reasoning ability beyond what general pre-training provides.
- Evaluation & challenges – The survey also reviews how we evaluate reasoning in LLMs (benchmarks for logic, math, commonsense, etc.) and identifies open challenges. Key issues include hallucinations (the model fabricating illogical or untrue intermediate steps), brittleness to small changes (robustness), and generalization of reasoning methods across different tasks and domains. Addressing these will be crucial for the next generation of reasoning-augmented LLMs.
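As an example of the prompting-strategy family, here is a brief sketch of Self-Consistency. `call_model` and `extract_answer` are placeholders for an LLM client and an answer parser; the prompt and sample count are illustrative:

```python
# Sketch of Self-Consistency: sample several reasoning paths and return the
# majority-vote answer. `call_model` and `extract_answer` are placeholders.
from collections import Counter

def self_consistency(question, call_model, extract_answer, n_paths=10):
    prompt = f"{question}\nLet's think step by step."
    answers = []
    for _ in range(n_paths):
        reasoning = call_model(prompt, temperature=0.8)  # diverse reasoning paths
        answers.append(extract_answer(reasoning))        # e.g. the final number
    return Counter(answers).most_common(1)[0][0]          # majority vote
```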
10). Survey: Text Data Augmentation for LLMs
This comprehensive survey covers text data augmentation techniques for LLMs. As LLMs demand massive training data, augmenting datasets with synthetic or transformed text is vital. In this paper:
- Classifies augmentation methods – It defines four categories: (1) Simple augmentation – basic text manipulations like synonym replacement, cropping, etc.; (2) Prompt-based augmentation – using an LLM with specific prompts to generate new training examples (taking advantage of the LLM’s own generative power); (3) Retrieval-based augmentation – pulling in external knowledge or contexts (via search or databases) to ground the generated text in facts; and (4) Hybrid augmentation – combinations of the above, or multi-step strategies.
- LLMs as data generators – A key insight is that modern LLMs can create high-quality synthetic data to improve themselves. By carefully prompting an LLM to produce variations of a task (for example, asking ChatGPT to come up with new math word problems), one can dramatically expand a training set (see the sketch after this list). The survey discusses prompt design for this purpose and how to ensure the generated data is diverse and useful.
- Post-processing and filtering – Augmented data isn’t always perfect. The survey covers techniques to refine and filter generated data, for instance verifying facts with a secondary model or removing examples that might introduce errors. This step is crucial to prevent “garbage in, garbage out” when augmenting data.
- Evaluation and future directions – It outlines common tasks where data augmentation is used (like low-resource language translation, QA, etc.) and how to evaluate the impact (improvement in accuracy, robustness, etc.). Finally, it discusses challenges (e.g. ensuring augmentation doesn’t distort data distribution, avoiding model bias reinforcement) and opportunities for new research.
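An illustrative sketch of prompt-based augmentation with a simple filtering step. `call_model` is a placeholder for an LLM client, and the prompt wording and near-duplicate threshold are assumptions:

```python
# Illustrative sketch of prompt-based augmentation plus a filtering step:
# ask an LLM to rewrite existing examples, then drop near-duplicates.
import difflib

AUGMENT_PROMPT = (
    "Rewrite the following training example as a new, different example that "
    "tests the same skill. Keep the label/answer consistent.\n\nExample:\n{example}"
)

def augment(examples, call_model, per_example=3, max_similarity=0.9):
    augmented = []
    for ex in examples:
        for _ in range(per_example):
            candidate = call_model(AUGMENT_PROMPT.format(example=ex), temperature=0.9)
            # Filtering: discard near-verbatim copies of the source example.
            if difflib.SequenceMatcher(None, ex, candidate).ratio() < max_similarity:
                augmented.append(candidate)
    return examples + augmented
```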