
CoSTA*: Cost-Sensitive Toolpath Agent for Multi-turn Image Editing

Vlad Bogolin

AI/ML Engineer & Researcher | Large Language Models (LLMs)

Published: March 16, 2025

Today's paper introduces CoSTA* (Cost-Sensitive Toolpath Agent), a novel approach for multi-turn image editing that combines the strengths of large language models (LLMs) and graph search algorithms. The method decomposes complex image editing tasks into subtasks and finds optimal paths of tool usage that balance quality and cost. CoSTA* outperforms existing image editing models by efficiently selecting the right tools for each subtask.

Method Overview

CoSTA* combines the planning capabilities of large language models with the optimization power of A* search to find efficient toolpaths for multi-turn image editing. The approach follows a three-stage process that balances quality and computational cost.

In the first stage, an LLM analyzes the editing instruction and creates a "subtask tree" that decomposes the complex task into simpler subtasks. For example, the instruction "recolor the person red, replace the cat with a dog, and change 'ABC' to 'DEF'" might be broken down into separate subtasks for object detection, recoloring, object replacement, and text modification. This decomposition leverages the LLM's commonsense reasoning while avoiding its limitations in tool selection.
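To make the decomposition concrete, here is a minimal sketch of how such a subtask tree could be represented. The `Subtask` class, its fields, and the tree layout are hypothetical and chosen for illustration; the paper's actual schema may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Subtask:
    """One node of a (hypothetical) subtask tree."""
    name: str                       # kind of subtask, e.g. "recoloring"
    target: str                     # what the subtask acts on
    children: list["Subtask"] = field(default_factory=list)


def flatten(node: Subtask) -> list[str]:
    """Return subtask names in depth-first order."""
    names = [node.name]
    for child in node.children:
        names.extend(flatten(child))
    return names


# Assumed decomposition of the example instruction from the text.
tree = Subtask("root", "image", [
    Subtask("object_detection", "person", [
        Subtask("recoloring", "person -> red"),
    ]),
    Subtask("object_detection", "cat", [
        Subtask("object_replacement", "cat -> dog"),
    ]),
    Subtask("text_modification", "'ABC' -> 'DEF'"),
])

subtask_names = flatten(tree)
```

Here each editing step hangs under the detection step it depends on, so a depth-first walk yields an order in which the subtasks can be executed.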

The second stage involves pruning a Tool Dependency Graph (TDG) based on the subtask tree. The TDG is a directed graph where nodes represent AI tools (like YOLO for object detection or Stable Diffusion for image generation) and edges indicate which tools can accept outputs from others. By focusing only on the subgraph relevant to the identified subtasks, CoSTA* significantly reduces the search space for the next stage.
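The pruning step can be sketched as a simple filter over an adjacency list. Everything below is an assumption made for illustration: the tiny graph, the `text_detector` and `text_writer` placeholder tools, and the mapping from tools to the subtasks they support are not taken from the paper.

```python
# Toy Tool Dependency Graph: tool -> tools that can consume its output.
TDG = {
    "YOLO": ["Stable Diffusion", "text_writer"],
    "text_detector": ["text_writer"],
    "Stable Diffusion": [],
    "text_writer": [],
}

# Assumed mapping from each tool to the subtasks it can handle.
TOOL_SUBTASKS = {
    "YOLO": {"object_detection"},
    "Stable Diffusion": {"object_replacement", "recoloring"},
    "text_detector": {"text_detection"},
    "text_writer": {"text_modification"},
}


def prune_tdg(tdg, tool_subtasks, needed):
    """Keep only tools that serve a needed subtask, and edges between them."""
    keep = {tool for tool, subs in tool_subtasks.items() if subs & needed}
    return {
        tool: [nxt for nxt in successors if nxt in keep]
        for tool, successors in tdg.items()
        if tool in keep
    }


needed = {"object_detection", "object_replacement"}
pruned = prune_tdg(TDG, TOOL_SUBTASKS, needed)
```

With no text subtasks requested, both text tools drop out, and the search in the next stage only has to consider the detection-to-generation path.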

In the final stage, CoSTA* performs A* search on the pruned subgraph to find the optimal toolpath. The search is guided by a heuristic function that combines both quality and cost metrics for each tool on every subtask. The quality-cost trade-off can be adjusted using a coefficient α to match user preferences. As each subtask is executed, a vision-language model evaluates the output, and if a failure is detected, the tool's cost and quality estimates are updated, allowing the search to quickly recover and explore alternative paths.
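The search loop can be sketched as follows. This is not the paper's scoring function: the per-step score `alpha * cost + (1 - alpha) * (1 - quality)`, the one-step lookahead heuristic, the `detector_b` and `editor_b` placeholder tools, all the numbers, and the worst-case penalty applied after a failure are assumptions made for illustration. Here a larger `alpha` puts more weight on cost.

```python
import heapq

# Assumed (normalized cost, quality) estimates per tool, both in [0, 1].
STATS = {
    "YOLO": (0.1, 0.9),
    "detector_b": (0.3, 0.98),
    "Stable Diffusion": (0.8, 0.9),
    "editor_b": (0.2, 0.6),
}
GRAPH = {
    "input": ["YOLO", "detector_b"],
    "YOLO": ["Stable Diffusion", "editor_b"],
    "detector_b": ["Stable Diffusion", "editor_b"],
    "Stable Diffusion": ["output"],
    "editor_b": ["output"],
}


def step_cost(tool, alpha, stats):
    """Blend cost and quality loss; unknown nodes (input/output) are free."""
    cost, quality = stats.get(tool, (0.0, 1.0))
    return alpha * cost + (1 - alpha) * (1 - quality)


def make_heuristic(graph, stats, alpha, goal):
    """Admissible lower bound: the cheapest single next step."""
    def h(node):
        if node == goal:
            return 0.0
        return min((step_cost(s, alpha, stats) for s in graph.get(node, [])),
                   default=0.0)
    return h


def a_star(graph, stats, start, goal, alpha):
    h = make_heuristic(graph, stats, alpha, goal)
    frontier = [(h(start), 0.0, start, [start])]   # (f, g, node, path)
    best_g = {start: 0.0}
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if g > best_g.get(node, float("inf")):
            continue                                # stale queue entry
        for nxt in graph.get(node, []):
            g2 = g + step_cost(nxt, alpha, stats)
            if g2 < best_g.get(nxt, float("inf")):
                best_g[nxt] = g2
                heapq.heappush(frontier, (g2 + h(nxt), g2, nxt, path + [nxt]))
    return None


cheap_path = a_star(GRAPH, STATS, "input", "output", alpha=0.9)
quality_path = a_star(GRAPH, STATS, "input", "output", alpha=0.1)

# Assumed feedback rule: a failed tool is re-scored as worst case, then replan.
after_failure = {**STATS, "editor_b": (1.0, 0.0)}
recovered_path = a_star(GRAPH, after_failure, "input", "output", alpha=0.9)
```

In this toy setup the cost-weighted plan picks the cheap editor, the quality-weighted plan picks the stronger detector and Stable Diffusion, and after the cheap editor is marked as failed the cost-weighted plan reroutes through Stable Diffusion.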

The method also incorporates prior knowledge about tools from benchmark evaluations, which helps in making better initial estimates for the A* search. This combination of LLM planning and cost-sensitive search allows CoSTA* to automatically switch between modalities (image and text) across subtasks for optimal results.

Results

CoSTA* demonstrates superior performance compared to state-of-the-art image editing models and agents on a novel benchmark of challenging multi-turn image editing tasks. The method achieves Pareto optimality, dominating baselines on both cost and quality metrics.

By adjusting the trade-off coefficient α, CoSTA* can produce different solutions that prioritize either quality or efficiency according to user preferences. This flexibility allows it to push the Pareto frontier beyond what existing methods can achieve.
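For readers less familiar with the term, the sketch below shows what it means for a set of methods to lie on a Pareto frontier. The method names and the (cost, quality) numbers are invented purely for illustration and are not results from the paper.

```python
def pareto_front(points):
    """Return names of points not dominated on (lower cost, higher quality)."""
    front = []
    for name, cost, quality in points:
        dominated = any(
            c <= cost and q >= quality and (c < cost or q > quality)
            for _, c, q in points
        )
        if not dominated:
            front.append(name)
    return front


# Made-up (name, cost, quality) triples, NOT measurements from the paper.
points = [
    ("baseline_1", 10.0, 0.60),
    ("baseline_2", 20.0, 0.50),
    ("agent_cost_focused", 5.0, 0.70),
    ("agent_quality_focused", 15.0, 0.90),
]

front = pareto_front(points)
```

A method is on the frontier when no other method is at least as good on both axes and strictly better on one. Sweeping the trade-off coefficient produces several such points, which is how a single agent can trace out a frontier rather than a single operating point.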

Visual comparisons show that CoSTA* successfully completes complex editing tasks where other methods fail. For example, when asked to detect a bench, recolor it pink, remove a cat, and recolor a wall yellow, CoSTA* produces accurate results while maintaining image coherence, whereas competing methods such as GenArtist, MagicBrush, CLOVA, and InstructPix2Pix struggle with various aspects of the task.

Conclusion

The paper introduces CoSTA*, an approach that addresses the challenges of multi-turn image editing by combining LLM-based subtask planning with cost-sensitive A* search. By leveraging prior knowledge of tools and enabling flexible quality-cost trade-offs, CoSTA* outperforms existing methods on complex editing tasks. For more information, please consult the full paper.

Congrats to the authors for their work!

Gupta, Advait, et al. "CoSTA*: Cost-Sensitive Toolpath Agent for Multi-turn Image Editing." arXiv preprint arXiv:2503.10613 (2025).

AI Paper of the Day
