登录查看更多内容

??Top ML Papers of the Week

DAIR.AI

Democratizing Artificial Intelligence Research, Education, and Technologies

发布日期: 2024年11月10日

Welcome to the Top ML Papers of the Week (November 4 - 10).

1). Many-agent Simulations toward AI Civilization - demonstrates how 10-1000+ AI agents behave and progress with agent societies; proposes PIANO, an architecture that enables agents to interact with humans and other agents in real-time; shows that agents can autonomously develop specialized roles, adhere to and change collective rules, and engage in cultural and religious transmissions. (paper | tweet )

2). A Comprehensive Survey of Small Language Models - a survey on small language models (SLMs) and discussion on issues related to definitions, applications, enhancements, reliability, and more. (paper | tweet )

3). Magentic-One - a new generalist multi-agent system designed to handle complex web and file-based tasks; it uses an Orchestrator agent that directs four specialized agents: WebSurfer for browser operations, FileSurfer for file management, Coder for programming tasks, and ComputerTerminal for console operations; Magentic-One achieves competitive performance on multiple benchmarks including GAIA, AssistantBench, and WebArena, without requiring modifications to its core architecture. (paper | tweet )

4). Mixtures of In-Context Learners - uses subsets of demonstrations to train experts via in-context learning; given a training set, a trainable weighting function is used to combine the experts' next-token predictions; this approach applies to black-box LLMs since access to the internal parameters of the LLM is not required. Good properties include the following: 1) competitive with standard ICL while being significantly more data, memory, and computationally efficient, and 2) resilient to noisy demonstrations and label imbalance. (paper | tweet )

5). Attacking Vision-Language Agents via Pop-ups - shows that integrating adversarial pop-ups into existing agent testing environments leads to an attack success rate of 86%; this decreases the agents' task success rate by 47%; they also add that basic defense techniques (e.g., instructing the agent to ignore pop-ups) are ineffective. (paper | tweet )

6). Multi-expert Prompting with LLMs - improves LLM responses by simulating multiple experts and aggregating their responses; it guides an LLM to fulfill input instructions by simulating multiple experts and selecting the best response among individual and aggregated views; it achieves a new state-of-the-art on TruthfulQA-Generation with ChatGPT, surpassing the current SOTA of 87.97%; it also improves performance across factuality and usefulness while reducing toxicity and hurtfulness. (paper | tweet )

7). Number Understanding of LLMs - provides a comprehensive analysis of the numerical understanding and processing ability (NUPA) of LLMs; finds that naive finetuning can improve NUPA a lot on many but not all tasks; it also reports that techniques designed to enhance NUPA prove ineffective for finetuning pretrained models; explores chain-of-thought techniques applied to NUPA and suggests that chain-of-thought methods face scalability challenges, making them difficult to apply in practical scenarios. (paper | tweet )

8). WebRL - proposes a self-evolving online curriculum RL framework to bridge the gap between open and proprietary LLM-based web agents; it improves the success rate of Llama-3.1-8B from 4.8% to 42.4%, and from 6.1% to 43% for GLM4-9B; the open models significantly surpass the performance of GPT-4-Turbo (17.6%) and GPT-4o (13.9%); the self-evolving curriculum addresses the scarcity of web agent training tasks; this is underpinned by a robust outcome-supervised reward model to evaluate task success; an adaptive RL strategy helps to deal with distribution drift in online learning and ensures consistent improvements. (paper | tweet )

9). Adapting while Learning - proposes a two-part fine-tuning approach that first helps LLMs learn from tool-generated solutions and then trains them to determine when to solve problems directly versus when to use tools; testing on math, climate science, and epidemiology benchmarks shows significant improvements, with a 28% boost in accuracy and 14% better tool usage precision compared to leading models like GPT-4 and Claude-3.5; the two-stage approach helps the LLM to adaptively solve scientific problems of varying complexity. (paper | tweet )

10). Personalization of LLMs - presents a comprehensive framework for understanding personalized LLMs; introduces taxonomies for different aspects of personalization and unifying existing research across personalized text generation and downstream applications. (paper | tweet )

Vandana Sinha

2 周

Great to see such a diverse range of ML papers in this week's top picks! I'm particularly interested in the Many-agent Simulations toward AI Civilization paper, as it explores the potential for AI agents to interact with humans and other agents in real-time and develop specialized roles and cultural transmissions. It's exciting to see how AI is evolving and becoming more integrated into our society. Looking forward to reading more about these papers and their implications for the future of AI.