OmniPaint: Mastering Object-Oriented Editing via Disentangled Insertion-Removal Inpainting
Credit: https://arxiv.org/pdf/2503.08677

Today's paper introduces OmniPaint, a unified framework for object-oriented image editing that reconceptualizes object removal and insertion as interdependent processes rather than isolated tasks. The method leverages a pre-trained diffusion model and a progressive training pipeline to achieve high-fidelity object removal and insertion while preserving scene geometry and intrinsic properties like shadows and reflections.

Method Overview

OmniPaint builds upon a pre-trained diffusion model (FLUX) and introduces a novel training pipeline that treats object removal and insertion as complementary inverse problems. The framework takes an image with a binary mask indicating the region to be edited and operates on the masked input to either remove an object or insert a new one.
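The masked-input setup can be illustrated with a minimal sketch. This is not OmniPaint's actual preprocessing code (the paper builds on FLUX's latent-space inpainting); it simply shows the standard way a binary mask designates the editable region, with `make_masked_input` being a hypothetical helper name.

```python
import numpy as np

def make_masked_input(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Zero out the region to be edited.

    image: (H, W, C) float array in [0, 1]
    mask:  (H, W) binary array, where 1 marks the region to edit
    """
    return image * (1.0 - mask[..., None])

# Toy 2x2 RGB image with the top-left pixel masked out.
image = np.ones((2, 2, 3))
mask = np.array([[1, 0], [0, 0]], dtype=np.float64)
masked = make_masked_input(image, mask)
```

The model then fills the zeroed region conditioned on the surrounding pixels, either erasing the object there or synthesizing a new one.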

For object removal, the model suppresses semantic traces within the masked region while ensuring smooth boundary transitions and preventing unintended artifacts or hallucinations. For object insertion, it integrates a new object while maintaining global coherence and context-aware realism, including physical effects like shadows and reflections.

The training pipeline consists of three phases. First, an inpainting pretext training phase initializes the model with basic inpainting abilities. Second, a paired warmup phase uses 3,000 real-world paired samples to train the model for effect-aware object removal and insertion. Finally, a CycleFlow unpaired post-training phase leverages large-scale unpaired data to enhance object insertion capabilities.
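The three phases above can be summarized as a simple sequential schedule. The phase names and the `run_pipeline` helper below are illustrative stand-ins, not the authors' code; only the 3,000-sample paired warmup figure comes from the paper.

```python
# Hypothetical schedule mirroring the three phases described above.
TRAINING_PHASES = [
    {"name": "inpainting_pretext", "data": "masked images", "paired": False},
    {"name": "paired_warmup", "data": "3,000 real-world pairs", "paired": True},
    {"name": "cycleflow_post_training", "data": "large-scale unpaired", "paired": False},
]

def run_pipeline(phases, train_fn, state=None):
    """Run each training phase in order, threading model state through."""
    for phase in phases:
        state = train_fn(state, phase)
    return state

# Trivial train_fn stand-in that just records the phase order.
history = run_pipeline(TRAINING_PHASES, lambda s, p: (s or []) + [p["name"]])
```

The point of the progressive design is that each phase inherits the state of the previous one, so the cheap unpaired data in the final phase refines, rather than replaces, what the paired warmup established.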

The training process enforces cycle consistency between removal and insertion. This allows the model to learn from unpaired data by ensuring that reinserting a removed object approximately restores the original image. The model uses two separate sets of parameters for object removal and insertion, selected at inference time via task-specific embeddings.
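The cycle-consistency idea can be sketched as a reconstruction objective: run removal, then insertion, and penalize deviation from the original image. The `remove_fn`, `insert_fn`, and the plain L2 penalty below are illustrative assumptions, not the paper's CycleFlow formulation (which operates within a flow-matching framework).

```python
import numpy as np

def cycle_loss(image, mask, remove_fn, insert_fn, object_ref):
    """Toy cycle-consistency objective: removing an object and then
    reinserting it should approximately restore the original image."""
    background = remove_fn(image, mask)                  # object removed
    restored = insert_fn(background, mask, object_ref)   # object reinserted
    return float(np.mean((image - restored) ** 2))

# Identity stand-ins: a perfect remove->insert round trip gives zero loss.
img = np.arange(48, dtype=np.float64).reshape(4, 4, 3) / 48.0
msk = np.zeros((4, 4))
loss = cycle_loss(img, msk, lambda x, m: x, lambda x, m, o: x, object_ref=None)
```

Because the loss only needs the original image back at the end of the round trip, no ground-truth "object removed" version of the scene is required, which is what lets the final phase train on large-scale unpaired data.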

Another significant contribution is the Context-Aware Feature Deviation (CFD) score, a novel metric for evaluating object removal quality. CFD consists of two components: a hallucination penalty that detects unwanted object-like structures in the removed region, and a context coherence term that evaluates how well the inpainted region blends with the surrounding background.
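The two CFD components can be illustrated with a toy feature-space score. This is not the paper's exact formulation (which the authors define over learned features); it is a minimal sketch, assuming per-pixel feature maps, where a coherent hallucinated object shows up as the inpainted region's mean feature drifting away from the background statistics.

```python
import numpy as np

def cfd_score(features, mask, alpha=1.0):
    """Toy CFD-style score: lower is better.

    features: (H, W, D) per-pixel feature map of the edited image
    mask:     (H, W) binary mask of the removed region
    """
    inside = features[mask > 0]    # features in the inpainted region
    outside = features[mask == 0]  # surrounding background features
    bg_mean = outside.mean(axis=0)
    # Context coherence: average per-pixel feature distance to the background.
    context = float(np.linalg.norm(inside - bg_mean, axis=1).mean())
    # Hallucination penalty: an object-like structure shows up as the
    # region's mean feature shifting away from the background.
    hallucination = float(np.linalg.norm(inside.mean(axis=0) - bg_mean))
    return context + alpha * hallucination

# A clean inpaint that matches the background should score lower than one
# containing a deviating, object-like blob.
H, W, D = 4, 4, 2
mask = np.zeros((H, W)); mask[1:3, 1:3] = 1
clean = np.zeros((H, W, D))
hallucinated = clean.copy(); hallucinated[1:3, 1:3] = 5.0
```

Standard reconstruction metrics miss this failure mode because a plausibly textured but hallucinated object can still match natural-image statistics; CFD targets it directly.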

Results

OmniPaint demonstrates superior performance in both object removal and insertion tasks compared to existing methods. For object removal, it achieves the lowest FID, CMMD, LPIPS, and CFD scores while maintaining high PSNR, SSIM, and ReMOVE scores across multiple datasets. Qualitative results show that OmniPaint successfully removes objects and their associated effects like reflections and shadows, which other methods struggle with.

For object insertion, OmniPaint outperforms all baselines in object identity preservation metrics (CLIP-I, DINOv2, CUTE, and DreamSim) and overall image quality metrics (MUSIQ and MANIQA). Visual comparisons reveal that OmniPaint generates inserted objects with more accurate shape, texture, and lighting consistency while preserving fine details and ensuring natural alignment with scene geometry and illumination.

Conclusion

OmniPaint presents a unified approach to object-oriented image editing by reconceptualizing object removal and insertion as interdependent tasks. Through its progressive training pipeline and CycleFlow mechanism, it achieves precise foreground elimination and seamless object integration while preserving scene geometry and intrinsic properties. For more information, please consult the full paper.

Congrats to the authors for their work!

Yu, Yongsheng, et al. "OmniPaint: Mastering Object-Oriented Editing via Disentangled Insertion-Removal Inpainting." arXiv preprint arXiv:2503.08677 (2025).