Paper Review: Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning
Andrey Lukyanenko
Senior Data Scientist @ Careem. Kaggle Competition Master, Notebooks Top-1.
Husky is an open-source language agent designed to handle diverse complex tasks, including numerical, tabular, and knowledge-based reasoning. Unlike many existing agents, which are proprietary or task-specific, Husky operates in a unified action space: it alternates between generating the next action and executing it with an expert model until the task is solved. It relies on a comprehensive set of actions and on high-quality training data for the expert models. Experiments show that Husky outperforms previous agents across 14 datasets and performs exceptionally well on a new evaluation set, HuskyQA, which tests mixed-tool reasoning and numerical tasks. Notably, Husky’s performance is comparable to that of leading models like GPT-4, even though it is built on 7B-parameter models.
The approach
Training
Because Husky is a new framework, there is no existing training data for next-step prediction, tool calls, or the generation of code, math, and search queries. To address this, the authors use a teacher language model to create tool-integrated solution trajectories for the training tasks. These trajectories are then used to build the training sets for Husky's modules. The framework consists of an Action Generator, which determines the next step and the tool to use, and Expert Models for code, math, and query generation.
Each module is trained on data extracted and formatted from the solution trajectories, using the task instruction, solution history, and current step as inputs. The modules are fine-tuned independently using a standard next token prediction objective.
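To make the data construction more concrete, here is a minimal sketch of how one teacher trajectory could be split into per-module training examples. This is my own illustration, not the authors' code: the `Step` fields, the tool names, the module names, and the prompt templates are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical mapping from tool name to the expert module trained for it.
TOOL_TO_MODULE = {
    "code": "code_generator",
    "math": "math_reasoner",
    "search": "query_generator",
}


@dataclass
class Step:
    description: str  # natural-language step written by the teacher model
    tool: str         # assumed tool label, e.g. "code", "math", "search"
    tool_input: str   # code, math solution, or search query from the teacher
    output: str       # result obtained after executing the tool


def build_training_sets(instruction: str, trajectory: List[Step]) -> Dict[str, List[dict]]:
    """Turn one solution trajectory into prompt/completion pairs per module."""
    datasets: Dict[str, List[dict]] = {
        "action_generator": [],
        "code_generator": [],
        "math_reasoner": [],
        "query_generator": [],
    }
    history: List[str] = []
    for step in trajectory:
        history_text = "\n".join(history)
        # Action generator: task + history -> next step and tool.
        datasets["action_generator"].append({
            "prompt": f"Task: {instruction}\nHistory:\n{history_text}\nNext step:",
            "completion": f"[{step.tool}] {step.description}",
        })
        # Expert model: task + history + current step -> tool input.
        module = TOOL_TO_MODULE.get(step.tool)
        if module is not None:
            datasets[module].append({
                "prompt": (
                    f"Task: {instruction}\nHistory:\n{history_text}\n"
                    f"Current step: {step.description}\nOutput:"
                ),
                "completion": step.tool_input,
            })
        # The executed step becomes part of the history for later steps.
        history.append(f"{step.description} -> {step.output}")
    return datasets
```

Each resulting list can then be used to fine-tune its module separately with the usual next-token prediction loss on the completion.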
Inference
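Husky performs inference by repeating two steps until the task is solved: the action generator predicts the next step and the tool it requires, and the corresponding expert model carries that step out, with the result added to the solution history.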
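The alternation between the action generator and the expert models can be sketched as a short loop. This is a simplified illustration under my own assumptions: the callable signatures, the `"finish"` tool label, and the `max_steps` cap are hypothetical and not taken from the paper's implementation.

```python
from typing import Callable, Dict, List, Tuple

# Assumed interface: (task, history) -> (step description, tool name).
ActionGenerator = Callable[[str, List[str]], Tuple[str, str]]
# Assumed interface: (task, history, step) -> result of executing the step.
Expert = Callable[[str, List[str], str], str]


def run_agent(
    task: str,
    action_generator: ActionGenerator,
    experts: Dict[str, Expert],
    max_steps: int = 10,
) -> str:
    """Alternate between predicting an action and executing it with an expert."""
    history: List[str] = []
    for _ in range(max_steps):
        step, tool = action_generator(task, history)
        if tool == "finish":
            # By assumption, the step text carries the final answer here.
            return step
        result = experts[tool](task, history, step)
        history.append(f"{step} -> {result}")
    # Step budget exhausted: fall back to the last intermediate result.
    return history[-1] if history else ""


# Toy stand-ins for the fine-tuned models, only to show the control flow.
def toy_action_generator(task: str, history: List[str]) -> Tuple[str, str]:
    if not history:
        return "Add the two numbers", "math"
    return history[-1].split(" -> ")[-1], "finish"


def toy_math_expert(task: str, history: List[str], step: str) -> str:
    return str(2 + 3)


answer = run_agent("What is 2 + 3?", toy_action_generator, {"math": toy_math_expert})
```

In the real system each callable would wrap a fine-tuned model plus the tool that runs its output, for example a code interpreter or a search engine.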
Experiments
Base models for the modules:
Analysis
Cross-task Generalization
Husky’s action generator, trained jointly across the different reasoning tasks, performs comparably to action generators trained on a single domain. Some tasks benefit slightly from domain-specific training, but the differences are generally small. This indicates that joint training preserves performance across domains and suggests the approach could scale to more diverse tasks.
Tool Choice
Testing different models for code generation and math reasoning showed that specialized models outperform general language models. However, strong general models like Llama-3-8B also showed competitive performance, especially in coding tasks.
Husky Llama3-8B-all
A version of Husky using Llama-3-8B for all components demonstrated similar performance to the specialized version in most tasks, except for numerical reasoning. This suggests that fine-tuning all modules from a single, capable base model can yield robust performance across various tasks while simplifying development.