Top ML Papers of the Week

Welcome to The Top ML Papers of the Week (December 16 - 22).

1). Genesis - a new universal physics simulation platform that combines a high-performance physics engine with generative AI capabilities; it enables natural language-driven creation of robotic simulations, character animations, and interactive 3D environments at speeds up to 430,000 times faster than real time. (paper | tweet)


2). Alignment Faking in LLMs - demonstrates that the Claude model can engage in "alignment faking": it can strategically comply with harmful requests to avoid retraining while preserving its original safety preferences; this raises concerns about the reliability of AI safety training methods. (paper | tweet)


3). TheAgentCompany - a new benchmark for evaluating AI agents on real-world professional tasks in a simulated software company environment; tasks span multiple professional roles including software engineering, project management, finance, and HR; tests with various LLMs, including API-based models like Claude-3.5-Sonnet and open-source models like Llama 3.1, show the current limitations of AI agents. The best-performing model, Claude-3.5-Sonnet, fully completed only 24% of tasks and scored 34.4% when partial progress is counted. (paper | tweet)



Editor Message

We’ve launched a new course, Cursor: Coding with AI. It covers everything you need to know about coding with Cursor’s AI assistants and agents.

Use CURSOR20 for a 20% discount on our entire course bundle. The offer ends in 24 hrs.

Students and teams can reach out to [email protected] for special discounts.




4). Graphs to Text-Attributed Graphs - automatically generates textual descriptions for nodes in a graph, effectively transforming an ordinary graph into a text-attributed graph; evaluates the approach on text-rich, text-limited, and text-free graphs, demonstrating that it enables a single GNN to operate across diverse graphs. (paper | tweet)


5). Qwen-2.5 Technical Report - Alibaba releases Qwen2.5, a new series of LLMs trained on 18T tokens, offering both open-weight models like Qwen2.5-72B and proprietary MoE variants that achieve competitive performance against larger models like Llama-3 and GPT-4. (paper | tweet)


6). PAE (Proposer-Agent-Evaluator) - a learning system that enables AI agents to autonomously discover and practice skills through web navigation, using reinforcement learning and context-aware task proposals to achieve state-of-the-art performance on real-world benchmarks. (paper)


7). DeepSeek-VL2 - a new series of vision-language models featuring dynamic tiling for high-resolution images and an efficient MoE architecture; achieves competitive or state-of-the-art performance across visual tasks with similar or fewer activated parameters compared to existing open-source dense and MoE-based models. (paper | tweet)


8). AutoFeedback - a two-agent AI system that generates more accurate and pedagogically sound feedback for student responses in science assessments, significantly reducing common errors like over-praise compared to single-agent models. (paper)


9). A Survey of Mathematical Reasoning in the Era of Multimodal LLMs - presents a comprehensive survey analyzing mathematical reasoning capabilities in multimodal large language models (MLLMs), covering benchmarks, methodologies, and challenges across 200+ studies since 2021. (paper | tweet)


10). Precise Length Control in LLMs - adapts a pre-trained decoder-only LLM to produce responses of a desired length; integrates a secondary length-difference positional encoding into the input embeddings, which lets the model count down to a user-set target response length; claims to achieve mean errors of fewer than 3 tokens without compromising quality. (paper | tweet)
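
To make the countdown idea in item 10 concrete, here is a minimal sketch of a length-difference encoding added to response embeddings. It is an illustration under assumptions, not the paper's implementation: the sinusoidal form, the function names `sinusoidal_encoding` and `add_countdown_encoding`, and the plain element-wise addition are all hypothetical choices made for this example.

```python
import math


def sinusoidal_encoding(position, dim):
    """Sinusoidal encoding of one integer position (an assumed choice of encoding)."""
    vec = []
    for i in range(dim):
        angle = position / (10000 ** (2 * (i // 2) / dim))
        # Even indices use sine, odd indices use cosine.
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec


def add_countdown_encoding(response_embeddings, target_len):
    """Add an encoding of the remaining token budget to each response embedding.

    Step 0 is encoded with `target_len` tokens remaining, step 1 with
    `target_len - 1`, and so on, so the signal counts down toward the
    user-set target length.
    """
    dim = len(response_embeddings[0])
    out = []
    for step, emb in enumerate(response_embeddings):
        remaining = target_len - step
        enc = sinusoidal_encoding(remaining, dim)
        out.append([e + p for e, p in zip(emb, enc)])
    return out


if __name__ == "__main__":
    # Three dummy 4-dimensional embeddings and a target length of 3 tokens.
    embeddings = [[0.0] * 4 for _ in range(3)]
    for row in add_countdown_encoding(embeddings, target_len=3):
        print([round(x, 3) for x in row])
```

In this sketch the countdown signal is separate from the model's ordinary positional encoding, which is what "secondary" suggests in the summary; how the paper combines the two is not stated here.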
