Image Editing As Programs with Diffusion Models

We introduce IEAP, a unified framework built on the DiT backbone that uses chain-of-thought (CoT) reasoning to parse free-form instructions into a sequence of atomic operations, which a neural program interpreter then executes in order, enabling robust handling of layout-altering and complex edits.

Abstract

While diffusion models have achieved remarkable success in text-to-image generation, they encounter significant challenges with instruction-driven image editing. Our research highlights a key challenge: these models particularly struggle with structurally inconsistent edits that involve substantial layout changes. To address this gap, we introduce Image Editing As Programs (IEAP), a unified image editing framework built upon the Diffusion Transformer (DiT) architecture. At its core, IEAP approaches instructional editing through a reductionist lens, decomposing complex editing instructions into sequences of atomic operations. Each operation is implemented via a lightweight adapter sharing the same DiT backbone and is specialized for a specific type of edit. Programmed by a vision-language model (VLM)-based agent, these operations collaboratively support arbitrary and structurally inconsistent transformations. By modularizing and sequencing edits in this way, IEAP generalizes robustly across a wide range of editing tasks, from simple adjustments to substantial structural changes. Extensive experiments demonstrate that IEAP significantly outperforms state-of-the-art methods on standard benchmarks across various editing scenarios. In these evaluations, our framework delivers superior accuracy and semantic fidelity, particularly for complex, multi-step instructions.
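
The decompose-then-execute idea described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration, not the IEAP implementation: the `Op` type, the `plan` and `run` functions, the `ADAPTERS` registry, and the operation names `remove` and `add` are all hypothetical. An "image" is modeled as a plain list of object labels, and a string splitter stands in for the VLM-based agent, so the example only shows the control flow of planning a program and interpreting it step by step.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in for an image: a list of object labels.
# A real system would pass image tensors through a diffusion backbone.
Image = List[str]


@dataclass(frozen=True)
class Op:
    """One atomic operation in an edit program (illustrative only)."""
    name: str
    arg: str


def plan(instruction: str) -> List[Op]:
    """Toy planner standing in for the VLM agent (an assumption).

    Splits the instruction on ' then ' and treats the first word of
    each step as the operation name and the rest as its argument.
    """
    program = []
    for step in instruction.split(" then "):
        name, _, arg = step.strip().partition(" ")
        program.append(Op(name.lower(), arg))
    return program


# Hypothetical registry: one specialized handler per operation type,
# mirroring the idea of one lightweight adapter per kind of edit.
ADAPTERS: Dict[str, Callable[[Image, str], Image]] = {
    "remove": lambda img, arg: [obj for obj in img if obj != arg],
    "add": lambda img, arg: img + [arg],
}


def run(image: Image, program: List[Op]) -> Image:
    """Interpret the program: apply each operation to the previous result."""
    for op in program:
        if op.name not in ADAPTERS:
            raise KeyError(f"no adapter registered for operation {op.name!r}")
        image = ADAPTERS[op.name](image, op.arg)
    return image


program = plan("remove cat then add dog")
print(run(["cat", "tree"], program))  # ['tree', 'dog']
```

The point of the sketch is the separation of concerns: the planner only produces a sequence of typed operations, and the interpreter only dispatches each one to a specialized handler, so a new edit type would need a new registry entry rather than a change to the loop.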

Comparison with State-of-the-Art Methods

More Visualization Results

Disclaimer

We open-source this project for academic research. The vast majority of images used in this project are either generated or licensed. If you have any concerns, please contact us, and we will promptly remove any inappropriate content. Our code is released under the Apache 2.0 License, while our models are released under the CC BY-NC 4.0 License. Any models related to the FLUX.1-dev base model must adhere to its original licensing terms.

This research aims to advance the field of generative AI. Users are free to create images using this tool, provided they comply with local laws and exercise responsible usage. The developers are not liable for any misuse of the tool by users.

BibTeX

@article{hu2025ieap,
  title     = {Image Editing As Programs with Diffusion Models},
  author    = {Hu, Yujia and Liu, Songhua and Tan, Zhenxiong and Yang, Xingyi and Wang, Xinchao},
  year      = {2025},
  eprint    = {2506.04158},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

Yujia Hu, Songhua Liu, Zhenxiong Tan, Xingyi Yang, Xinchao Wang*
National University of Singapore
