CoSTA*: Cost-Sensitive Toolpath Agent for Multi-turn Image Editing
Credit: https://arxiv.org/pdf/2503.10613

Today's paper introduces CoSTA* (Cost-Sensitive Toolpath Agent), a novel approach for multi-turn image editing that combines the strengths of large language models (LLMs) and graph search algorithms. The method decomposes complex image editing tasks into subtasks and finds optimal paths of tool usage that balance quality and cost. CoSTA* outperforms existing image editing models by efficiently selecting the right tools for each subtask.

Method Overview

CoSTA* combines the planning capabilities of large language models with the optimization power of A* search to find efficient toolpaths for multi-turn image editing. The approach follows a three-stage process that balances quality and computational cost.

In the first stage, an LLM analyzes the editing instruction and creates a "subtask tree" that decomposes the complex task into simpler subtasks. For example, the instruction "recolor the person red, replace the cat with a dog, and change 'ABC' to 'DEF'" might be broken down into separate subtasks for object detection, recoloring, object replacement, and text modification. This decomposition leverages the LLM's commonsense reasoning while avoiding its limitations in tool selection.
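
As a rough illustration of what a subtask tree could look like in code, here is a minimal Python sketch. The `Subtask` class, the `flatten` helper, and the subtask names are hypothetical and chosen only to mirror the example instruction above; the paper's actual data structures and LLM prompts are not shown here.

```python
from dataclasses import dataclass, field


@dataclass
class Subtask:
    """One node of a (hypothetical) subtask tree."""
    name: str      # kind of edit, e.g. "recolor"
    target: str    # what the edit acts on
    children: list["Subtask"] = field(default_factory=list)


def flatten(node):
    """Return subtasks in depth-first order, parents before children."""
    out = [node]
    for child in node.children:
        out.extend(flatten(child))
    return out


# Hand-written decomposition of the example instruction; in CoSTA* an LLM
# would produce this tree, which is not reproduced here.
root = Subtask("edit_image", "input image", [
    Subtask("recolor", "person -> red"),
    Subtask("replace_object", "cat -> dog"),
    Subtask("replace_text", "'ABC' -> 'DEF'"),
])

names = [s.name for s in flatten(root)]
print(names)
```

Keeping the decomposition as a plain tree makes it easy to hand the list of required subtasks to the later stages.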

The second stage involves pruning a Tool Dependency Graph (TDG) based on the subtask tree. The TDG is a directed graph where nodes represent AI tools (like YOLO for object detection or Stable Diffusion for image generation) and edges indicate which tools can accept outputs from others. By focusing only on the subgraph relevant to the identified subtasks, CoSTA* significantly reduces the search space for the next stage.
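
The pruning step can be sketched as a simple filter over a graph. In the sketch below, the dictionary layout, the `prune_tdg` function, and the tool names other than YOLO and Stable Diffusion are assumptions made for illustration, not the paper's implementation.

```python
def prune_tdg(edges, tool_subtasks, needed):
    """Keep only tools that serve a needed subtask, plus the edges between them."""
    keep = {tool for tool, subs in tool_subtasks.items() if subs & needed}
    return {tool: [nxt for nxt in edges.get(tool, []) if nxt in keep]
            for tool in sorted(keep)}


# Hypothetical TDG: an edge A -> B means B can accept A's output.
edges = {
    "yolo": ["stable_diffusion", "recolor_tool"],
    "ocr_tool": ["text_writer"],
    "stable_diffusion": [],
    "recolor_tool": [],
    "text_writer": [],
}
# Hypothetical mapping from each tool to the subtasks it can serve.
tool_subtasks = {
    "yolo": {"object_detection"},
    "ocr_tool": {"text_detection"},
    "stable_diffusion": {"object_replacement"},
    "recolor_tool": {"recoloring"},
    "text_writer": {"text_replacement"},
}

needed = {"object_detection", "recoloring"}
pruned = prune_tdg(edges, tool_subtasks, needed)
print(pruned)
```

With only detection and recoloring requested, the text tools and the replacement tool drop out, so the search in the next stage has fewer nodes to consider.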

In the final stage, CoSTA* performs A* search on the pruned subgraph to find the optimal toolpath. The search is guided by a heuristic function that combines both quality and cost metrics for each tool on every subtask. The quality-cost trade-off can be adjusted using a coefficient α to match user preferences. As each subtask is executed, a vision-language model evaluates the output, and if a failure is detected, the tool's cost and quality estimates are updated, allowing the search to quickly recover and explore alternative paths.
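
To make the role of α concrete, here is a small A* sketch over an ordered list of subtasks. It is not the paper's formulation: the linear step cost `alpha * cost + (1 - alpha) * (1 - quality)`, the normalized cost and quality numbers, and all tool names are assumptions for illustration. The heuristic is the sum of the cheapest possible step for each remaining subtask, which never overestimates and so keeps A* optimal in this toy setting.

```python
import heapq


def step_cost(cost, quality, alpha):
    """Assumed trade-off: higher alpha weights cost, lower alpha weights quality."""
    return alpha * cost + (1 - alpha) * (1 - quality)


def astar_toolpath(subtasks, tools, alpha):
    """subtasks: ordered names; tools: {subtask: {tool: (cost, quality)}}."""
    best = [min(step_cost(c, q, alpha) for c, q in tools[s].values())
            for s in subtasks]
    # h[i] is an optimistic estimate of finishing subtasks i..end.
    h = [0.0] * (len(subtasks) + 1)
    for i in range(len(subtasks) - 1, -1, -1):
        h[i] = h[i + 1] + best[i]

    frontier = [(h[0], 0.0, 0, ())]  # (f, g, next subtask index, path so far)
    while frontier:
        f, g, i, path = heapq.heappop(frontier)
        if i == len(subtasks):
            return list(path), g
        for tool, (c, q) in tools[subtasks[i]].items():
            g2 = g + step_cost(c, q, alpha)
            heapq.heappush(frontier, (g2 + h[i + 1], g2, i + 1, path + (tool,)))
    return None, float("inf")


# Hypothetical tools with (cost, quality), both normalized to [0, 1].
tools = {
    "detect": {"fast_detector": (0.1, 0.7), "accurate_detector": (0.6, 0.95)},
    "recolor": {"cheap_recolor": (0.2, 0.6), "diffusion_recolor": (0.9, 0.9)},
}
subtasks = ["detect", "recolor"]

cheap_path, _ = astar_toolpath(subtasks, tools, alpha=0.9)
quality_path, _ = astar_toolpath(subtasks, tools, alpha=0.1)
print(cheap_path, quality_path)
```

In this sketch a high α selects the cheaper tools and a low α selects the higher-quality ones. A failure reported by the vision-language model could be modeled by lowering the failed tool's quality estimate or raising its cost and rerunning the search.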

The method also incorporates prior knowledge about tools from benchmark evaluations, which helps in making better initial estimates for the A* search. This combination of LLM planning and cost-sensitive search allows CoSTA* to automatically switch between modalities (image and text) across subtasks for optimal results.

Results

CoSTA* demonstrates superior performance compared to state-of-the-art image editing models and agents on a novel benchmark of challenging multi-turn image editing tasks. The method achieves Pareto optimality, dominating baselines on both cost and quality metrics.

By adjusting the trade-off coefficient α, CoSTA* can produce different solutions that prioritize either quality or efficiency according to user preferences. This flexibility allows it to push the Pareto frontier beyond what existing methods can achieve.
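
For readers unfamiliar with the term, a method lies on the Pareto frontier if no other method is at least as good on both cost and quality and strictly better on one. The sketch below computes that set; the method names and numbers are made up for illustration and are not results from the paper.

```python
def pareto_front(points):
    """points: {name: (cost, quality)}; lower cost and higher quality are better."""
    front = []
    for name, (c, q) in points.items():
        dominated = any(
            (c2 <= c and q2 >= q) and (c2 < c or q2 > q)
            for other, (c2, q2) in points.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)


# Hypothetical (cost, quality) pairs, not measurements from the paper.
points = {
    "A": (1.0, 0.90),
    "B": (2.0, 0.80),  # dominated by A: costs more and scores lower
    "C": (0.5, 0.60),
    "D": (3.0, 0.95),
}
front = pareto_front(points)
print(front)
```

Sweeping α and plotting each resulting (cost, quality) pair is one way to trace out such a frontier for a single method.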

Visual comparisons show that CoSTA* successfully completes complex editing tasks where other methods fail. For example, when asked to detect a bench and recolor it pink, remove a cat, and recolor a wall yellow, CoSTA* produces accurate results while maintaining image coherence, whereas competing methods such as GenArtist, MagicBrush, CLOVA, and InstructPix2Pix each struggle with some part of the task.

Conclusion

The paper introduces CoSTA*, an approach that addresses the challenges of multi-turn image editing by combining LLM-based subtask planning with cost-sensitive A* search. By leveraging prior knowledge of tools and enabling flexible quality-cost trade-offs, CoSTA* outperforms existing methods on complex editing tasks. For more information, please consult the full paper.

Congrats to the authors for their work!

Gupta, Advait, et al. "CoSTA*: Cost-Sensitive Toolpath Agent for Multi-turn Image Editing." arXiv preprint arXiv:2503.10613 (2025).
