Top ML Papers of the Week

Welcome to The Top ML Papers of the Week (September 23 - September 29).

1). Llama 3.2 - presents small and medium-sized vision LLMs (11B and 90B parameters) and lightweight, text-only models (1B and 3B); the text-only models support a context length of 128K tokens and outperform other models in their class on a range of tasks; the vision models exceed models such as Claude 3 Haiku on image understanding tasks. (paper | tweet )


2). Molmo - presents a family of open, state-of-the-art multimodal AI models; the 72B model in the Molmo family outperforms other models in the open-weight, open-data class and compares favorably against proprietary models like GPT-4o, Claude 3.5, and Gemini 1.5 on several benchmarks. (paper | tweet )



Sponsor message

DAIR.AI is excited to introduce a new catalog of self-paced courses in prompt engineering and LLMs. Join the academy to learn how to build effectively with AI.

Use code PROMPTING20 to get an extra 20% discount. Only valid for the first 500 enrollments.

Join Now!



3). AlphaChip - a reinforcement learning-based method trained to design the physical layout of chips; AlphaChip has reportedly been used in three additional generations of Google’s TPU; the release includes an open-source implementation of the method that can be pre-trained on a variety of chip blocks and then applied to new blocks, along with a model checkpoint pre-trained on 20 TPU blocks. (paper | tweet )


4). LLMs Still Can’t Plan - evaluates whether large reasoning models such as o1 can plan; finds that a domain-independent planner can solve all instances of Mystery Blocksworld while LLMs struggle even on small instances; o1-preview is effective on the task but tends to degrade in performance as plan length increases; concludes that while o1 shows progress on more challenging planning problems, the accuracy gains cannot be considered general or robust. (paper | tweet )
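The contrast the paper draws is easy to see concretely: a domain-independent planner is just exhaustive search over states, with no reliance on pattern matching, so it cannot be fooled by obfuscated predicate names the way an LLM can. Below is a minimal sketch of such a planner for a Blocksworld-style domain (plain breadth-first search; the state encoding and move names are my own illustration, not the paper's setup):

```python
from collections import deque

def successors(state):
    """Yield (move, next_state) pairs. A state is a frozenset of stacks
    (tuples of blocks, bottom to top); a move lifts the top block of one
    stack onto another stack or onto the table."""
    stacks = sorted(state)
    for i, src in enumerate(stacks):
        rest = stacks[:i] + stacks[i + 1:]
        block, reduced = src[-1], src[:-1]
        base = list(rest) + ([reduced] if reduced else [])
        # move the top block onto the table as a new one-block stack
        yield (f"put {block} on table", frozenset(base + [(block,)]))
        # move the top block onto the top of every other stack
        for j, dst in enumerate(base):
            others = base[:j] + base[j + 1:]
            yield (f"stack {block} on {dst[-1]}", frozenset(others + [dst + (block,)]))

def bfs_plan(start, goal):
    """Domain-independent breadth-first search: complete, and always
    returns a shortest plan, regardless of how blocks are named."""
    frontier, seen = deque([(start, [])]), {start}
    while frontier:
        state, plan = frontier.popleft()
        if state == goal:
            return plan
        for move, nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, plan + [move]))
    return None  # goal unreachable
```

Renaming the blocks (the "Mystery" obfuscation) changes nothing for this search, which is precisely why the classical planner solves all instances while models that rely on surface patterns degrade.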


5). Scaled-up Instructable Models Become Less Reliable - suggests that larger and more instructable LLMs may become less reliable; investigates LLMs along three dimensions: difficulty concordance, task avoidance, and prompting stability; finds that early models often avoid user questions, whereas scaled-up, shaped-up models much more often give an apparently sensible yet wrong answer, including errors on difficult questions that human supervisors frequently overlook. (paper | tweet )


6). Logic-of-Thought - proposes a new prompting technique called Logic-of-Thought (LoT), which uses propositional logic to generate and inject expanded logical information from the input context; it enhances CoT performance on the ReClor dataset by +4.35%, improves CoT + Self-Consistency's performance on LogiQA by +5%, and boosts ToT's performance on the ProofWriter dataset by +8%. (paper | tweet )


7). RAG and Beyond - presents a survey introducing a RAG task categorization method that classifies user queries into four levels according to the type of external data required and the focus of the task; summarizes key challenges in building robust data-augmented LLM applications and the most effective techniques for addressing them. (paper | tweet )


8). A Preliminary Study of o1 in Medicine - provides a preliminary exploration of the o1-preview model in medical scenarios; shows that o1 surpasses GPT-4 in accuracy by an average of 6.2% and 6.6% across 19 datasets and two newly created complex QA scenarios; identifies weaknesses including hallucination, inconsistent multilingual ability, and discrepant evaluation metrics. (paper | tweet )


9). Small Language Models Survey - a comprehensive survey of small language models (SLMs) covering architectures, training datasets, and training algorithms; analyzes 59 state-of-the-art open-source SLMs and capabilities such as reasoning, in-context learning, math, and coding; also discusses on-device runtime costs, latency, and memory footprint, and distills valuable insights. (paper | tweet )


10). Minstrel - a multi-generative agent system with reflection capabilities that automates structural prompt generation; the paper also presents LangGPT, an extensible framework for designing structural prompts, on top of which Minstrel is built; experiments demonstrate that structural prompts, whether generated by Minstrel or written manually, perform better at guiding LLMs through tasks. (paper | tweet )
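For a sense of what a "structural prompt" looks like, here is a minimal LangGPT-style skeleton: instead of one free-form instruction, the prompt is organized into named sections. The section names follow LangGPT's Role/Profile/Rules/Workflow convention, but the rendering function itself is a hypothetical sketch, not code from the paper.

```python
def structural_prompt(role, description, rules, workflow):
    """Render a LangGPT-style structural prompt as a fixed markdown
    skeleton with named sections rather than a free-form instruction."""
    lines = [f"# Role: {role}", "", "## Profile",
             f"- Description: {description}", "", "## Rules"]
    lines += [f"- {rule}" for rule in rules]          # behavioral constraints
    lines += ["", "## Workflow"]
    lines += [f"{i}. {step}" for i, step in enumerate(workflow, 1)]  # ordered steps
    return "\n".join(lines)

prompt = structural_prompt(
    "Proofreader",
    "Fixes grammar without changing meaning.",
    ["Preserve the author's voice."],
    ["Read the text.", "Return corrections."],
)
```

Minstrel's contribution is then to have cooperating agents draft, critique, and refine the content of each section automatically, rather than requiring a human to fill in the template.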


Reach out to [email protected] if you would like to promote with us. Our newsletter is read by over 90K AI Researchers, Engineers, and Developers.

Anzhou Zhang

Principal Data Scientist @ Genentech | LLM, Generative AI

1 mo
Peter Bellen

Blog for AI Articles

1 mo

A more technical article: "Self-correcting LLM". Read it and leave a comment or question on the article site; any interaction there is welcome. If you have an idea for a new article, tell me. Thanks. English: https://aifornoobsandexperts.com/self-correcting-llm/ Dutch: https://aivoorjanenalleman.nl/zelfcorrigerende-llm/

Habiba Zaman

Sales And Marketing Specialist at Amazon virtual assistant and freelancer

1 mo

Interesting

Leandro Cunha

Senior Consultant @ PSC Consulting | Data & AI Engineer | Advanced Analytics | ex-McKinsey QuantumBlack

1 mo

Incredible week for ML breakthroughs! Big thanks to DAIR.AI for the roundup

Godwin Josh

Co-Founder of Altrosyn and Director at CDTECH | Inventor | Manufacturer

1 mo

Llama 3.2's fine-tuning approach on diverse datasets is notable. AlphaChip's hardware design for AI inference presents a compelling path toward efficiency gains. The "Logic-of-Thought" paper delves into the intriguing realm of reasoning within LLMs, raising questions about symbolic representation integration. How might we bridge the gap between symbolic reasoning and the statistical nature of current LLMs?
