Paper Review: Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning

Andrey Lukyanenko

Senior Data Scientist @ Careem. Kaggle Competition Master, Notebooks Top-1.

发布日期: 2024年7月1日

Husky is an open-source language agent designed to handle diverse complex tasks, including numerical, tabular, and knowledge-based reasoning. Unlike many existing agents that are proprietary or task-specific, Husky operates in a unified action space, alternating between generating actions and executing them with expert models to solve tasks. It uses a comprehensive set of actions and high-quality training data for expert models. Experiments demonstrate that Husky outperforms previous agents across 14 datasets and performs exceptionally well on a new evaluation set, HuskyQA, which tests mixed-tool reasoning and numerical tasks. Notably, Husky’s performance is comparable to leading models like GPT-4, even with 7B size.

The approach

Training

Husky lacks existing training data for next step prediction, tool calls, and generation of code, math, or search queries as it is a new framework. To address this, it leverages a teacher language model to create tool-integrated solution trajectories for training tasks. These trajectories are used to build training sets for various modules within Husky. The framework consists of an Action Generator, which determines the next step and tool to use, and Expert Models for code, math, and query generation.

Each module is trained on data extracted and formatted from the solution trajectories, using the task instruction, solution history, and current step as inputs. The modules are fine-tuned independently using a standard next token prediction objective.

Inference

Husky performs inference in the following steps:

Action generator determines the next step and appropriate tool for each stage of problem-solving
Based on the selected tool, the corresponding Expert Model is activated: Code Generator for coding tasks, Math Reasoner for mathematical problems, Query Generator for search queries, Commonsense Reasoner for general reasoning and output rewriting
The chosen Expert Model generated the output. For code and search outputs, the result is rewritten into natural language
The step and its output are added to the solution history, which is used as input for the next iteration.
This process repeats until the action Generator identifies the final answer.

Experiments

Base models for the modules:

Action generator - LLAMA-2-7B and 13B, and LLAMA-3-8B.
Code generator - DeepSeekCoder-7B-Instruct-V1.5
Math reasoner - DeepSeelMath-7B-Instruct.
Query generator - LLAMA-2-7B.

Vlad Bogolin 3 个月前

Building ReAct (Reason + Act) Agent Using Large…

Rambaksh Prajapati 1 个月前

Paper Review: Think before you speak: Training…

Andrey Lukyanenko 11 个月前

Numerical Reasoning Tasks: Husky outperforms other agents by 10-20 points on both seen and unseen tasks. Its iterative stepwise execution proves effective for unseen math tasks, even when compared to state-of-the-art CoT prompts.

Tabular Reasoning Tasks: Husky consistently outperforms baselines on both seen and unseen tasks, benefiting from its exposure to tabular data during training.
Knowledge-based Reasoning Tasks: Husky outperforms other agents, including task-specific ones, even on unseen tasks. It achieves this without using highly specialized tools like Wikipedia passage retrievers.
Mixed Reasoning Tasks: Husky outperforms most baselines and competes with or outperforms advanced proprietary models like GPT-3.5 and GPT-4 variants. It demonstrates strong generalization across various domains in multi-tool reasoning strategies.

Analysis

Cross-task Generalization

Husky’s action generator, trained jointly across different reasoning tasks, shows comparable performance to domain-specific training. While some tasks slightly benefit from domain-specific training, the differences are generally small, indicating that joint training preserves performance across domains and suggests potential for scaling to more diverse tasks.

Tool Choice

Testing different models for code generation and math reasoning showed that specialized models outperform general language models. However, strong general models like Llama-3-8B also showed competitive performance, especially in coding tasks.

Husky Llama3-8B-all

A version of Husky using Llama-3-8B for all components demonstrated similar performance to the specialized version in most tasks, except for numerical reasoning. This suggests that fine-tuning all modules from a single, capable base model can yield robust performance across various tasks while simplifying development.

Ali Naderi

Mechatronics Engineer Artificial Intelligence (Data Scientist, Machine/Deep Learning, Machine Vision, Image Processing)

2 个月

Very helpful!

要查看或添加评论，请登录

查看全部

Paper Review: Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning

Andrey Lukyanenko

Senior Data Scientist @ Careem. Kaggle Competition Master, Notebooks Top-1.

The approach

Training

Inference

Experiments

领英推荐

Analysis

Cross-task Generalization

Tool Choice

Husky Llama3-8B-all

更多精彩文章

社区洞察

其他会员也浏览了

Fine tuning LLMs for Memorization

Evaluating Retrieval Augmented Generation Pipelines with LlamaIndex and TruLens

LLMs for Industrial Engineers: Custom GPTs

Solving Complex Problems Using FastAPI, LangChain, and GPT-4 Enhanced by OCR and Graph-Based Tools

Three techniques to adapt LLMs for any use case

RAG Foundry: A Framework for Enhancing LLMs for Retrieval Augmented Generation

OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework

The Altar of Embeddings

Exploring RAG (Retrieval Augmented Generation)

Building a simple Large Language Model (LLM) using text from books - Replicate Project

The approach

Training

Inference

Experiments

领英推荐

Analysis

Cross-task Generalization

Tool Choice

Husky Llama3-8B-all

Paper Review: Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency

2024年9月16日

Paper Review: Agentic Retrieval-Augmented Generation for Time Series?Analysis

2024年9月4日

Paper Review: Winning Amazon KDD Cup24

2024年8月19日

Paper Review: Wolf: Captioning Everything with a World Summarization Framework

2024年8月12日

Paper Review: Diffusion Feedback Helps CLIP See?Better

2024年8月5日

Paper Review: Masked Attention is All You Need for Graphs

2024年7月29日

Paper Review: RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs

2024年7月22日

Paper Review: Unveiling Encoder-Free Vision-Language Models

2024年7月15日

Paper Review: Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling

2024年6月17日

Paper Review: σ-GPTs: A New Approach to Autoregressive Models

2024年6月10日

社区洞察

其他会员也浏览了

Fine tuning LLMs for Memorization

Evaluating Retrieval Augmented Generation Pipelines with LlamaIndex and TruLens

LLMs for Industrial Engineers: Custom GPTs

Solving Complex Problems Using FastAPI, LangChain, and GPT-4 Enhanced by OCR and Graph-Based Tools

Three techniques to adapt LLMs for any use case

RAG Foundry: A Framework for Enhancing LLMs for Retrieval Augmented Generation

OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework

The Altar of Embeddings

Exploring RAG (Retrieval Augmented Generation)

Building a simple Large Language Model (LLM) using text from books - Replicate Project