??Top ML Papers of the Week
Welcome to the Top ML Papers of the Week (November 11 - 17).
1). Impacts of AI on Innovation - suggests that top scientists leverage their domain knowledge to prioritize promising AI suggestions, while others waste significant resources testing false positives; finds that implementing AI materials discovery technology leads to substantial increases in productivity, with 44% more materials discovered, 39% more patent filings, and 17% more product innovation; reports that these gains came with concerning tradeoffs, as 82% of scientists reported reduced job satisfaction due to decreased creativity and skill underutilization. (paper | tweet )
2). Scaling Laws for Precision - introduces "precision-aware" scaling laws that predict how model performance is affected by both training and inference precision in LLMs; key findings include: 1) post-training quantization becomes more harmful as models are trained on more data, eventually making additional pretraining actively detrimental, 2) training in lower precision requires increasing model size to maintain performance, and 3) when jointly optimizing model size, data, and precision, the compute-optimal training precision is around 7-8 bits and independent of compute; also reports that when the model size is fixed, compute-optimal precision increases approximately logarithmically with data; the authors validate their predictions on models up to 1.7B parameters trained on up to 26B tokens, showing that both very high (16-bit) and very low (sub 4-bit) training precisions may be suboptimal. (paper | tweet )
Sponsor Message
Learn how to build with LLMs with our new hands-on courses on advanced prompt engineering and building AI agents .
We are excited to offer our subscribers a 35% discount to join the academy now. Use promo code BLACKFRIDAY at checkout. The offer expires on 11/29.
3). Evo - a 7B parameter AI model designed to understand and generate DNA sequences across multiple biological scales; the model, trained on 2.7 million prokaryotic and phage genomes, can process sequences up to 131 kilobases long while maintaining single-nucleotide resolution, enabling it to understand both molecular-level interactions and genome-wide patterns; Evo demonstrates superior performance in predicting and generating functional DNA, RNA, and protein sequences, including the first successful AI-generated CRISPR-Cas complexes and transposable systems that have been experimentally validated. (paper | tweet )
4). OpenCoder - introduces OpenCoder, a fully open-source LLM specialized for code generation and understanding; the authors identify several critical factors for building high-performing code LLMs: (1) effective data cleaning with code-optimized heuristic rules for deduplication, (2) recall of relevant text corpus related to code, and (3) high-quality synthetic in both annealing and supervised fine-tuning stages; OpenCoder surpasses previous fully open models at the 6B+ parameter scale and releases not just the model weights but also the complete training pipeline, datasets, and protocols to enable reproducible research. (paper | tweet )
5). The Surprising Effectiveness of Test-Time Training for Abstract Reasoning - explores test-time training (TTT) - updating model parameters temporarily during inference - for improving an LLM's abstract reasoning capabilities using the ARC benchmark; identifies three crucial components: initial fine-tuning on similar tasks, auxiliary task format and augmentations, and per-instance training; TTT significantly improves performance, achieving up to 6x improvement in accuracy compared to base fine-tuned models; when applying TTT to an 8B LLM, they achieve 53% accuracy on ARC's public validation set, improving the state-of-the-art for neural approaches by nearly 25%; by ensembling their method with program generation approaches, they achieve state-of-the-art public validation accuracy of 61.9%, matching average human performance; the findings suggest that explicit symbolic search is not the only path to improved abstract reasoning in LLMs; test-time training applied to continued training on few-shot examples can be highly effective. (paper | tweet )
6). A Taxonomy of AgentOps for Enabling Observability of Foundation Model-based Agents - analyzes AgentOps platforms and tools, highlighting the need for comprehensive observability and traceability features to ensure reliability in foundation model-based autonomous agent systems across their development and production lifecycle. (paper | tweet )
7). Toward Optimal Search and Retrieval for RAG - examines how retrieval affects performance in RAG pipelines for QA tasks; conducts experiments using BGE-base and ColBERT retrievers with LLaMA and Mistral, finding that including more gold (relevant) documents improves QA accuracy; finds that using approximate nearest neighbor search with lower recall only minimally impacts performance while potentially improving speed and memory efficiency; reports that adding noisy or irrelevant documents consistently degrades performance, contradicting previous research claims; concludes that optimizing retrieval of gold documents is crucial for RAG performance, and that operating at lower search accuracy levels can be a viable approach for practical applications. (paper | tweet )
8). Mitigating LLM Jailbreaks with Few Examples - introduces a new approach called for defending LLMs against jailbreak attacks, focusing on quickly adapting defenses after detecting new attacks rather than aiming for perfect adversarial upfront robustness; using a new benchmark, the most effective method, based on fine-tuning an input classifier, reduced attack success rates by over 240x for known attack types and 15x for novel variations after seeing just one example of each attack strategy; demonstrates that rapidly responding to new jailbreaks can be an effective alternative to traditional static defenses. (paper | tweet )
9). Mixture of Transformers - introduce Mixture-of-Transformers (MoT), a new sparse multi-modal transformer architecture that matches the performance of traditional models while using only about half the computational resources for text and image processing; MoT matches a dense baseline's performance using only 55.8% of the FLOPs. (paper )
10). HtmlRAG - a novel approach that proposes using HTML instead of plain text as the format for building RAG systems; the key finding is that preserving HTML structure provides richer semantic and structural information compared to plain text conversion, which typically loses important formatting like headings, tables, and semantic tags; to address the challenge of HTML documents being too long for LLM context windows, the authors develop a two-step pruning method: first cleaning unnecessary HTML elements (reducing length by 94%), then using a block-tree-based pruning approach that combines embedding-based and generative pruning to further reduce the content while maintaining important information; experiments across six different QA datasets demonstrate that HtmlRAG outperforms existing plain-text based methods, validating the advantages of preserving HTML structure in RAG systems. (paper | tweet )