When the Prompt Becomes Visual

Vision-Centric Jailbreak Attacks for Large Image Editing Models
Welcome ! This project aims to investigate the safety of large image editing models in a vision-centric perspective.

* Equal Contribution; # Correspondance

1 Tsinghua University    2 Peng Cheng Laboratory, Shenzhen    3 Central South University   

🎬Examples

Browse a curated set of attack inputs: each image encodes an editing request through visual annotations, serving as the prompt in our evaluation. The left side shows the original image, and the right side shows the image output from the edited image model.

🌟 Overview

Recent advances in large image editing models have shifted the paradigm from text-driven instructions to vision-prompt editing, where user intent is inferred directly from visual inputs such as marks, arrows, and visual–text prompts. While this paradigm greatly expands usability, it also introduces a critical and underexplored safety risk: the attack surface itself becomes visual. To mitigate the safety gap, this project aims to systematically investigate the safety of large image editing models from a vision-centric perspective, with new jailbreak attack method, benchmark and a training-free defense approach.

🏆 LeaderBoard on 15 risky categories of IESBench

We adopt an automated evaluation protocol with MLLM-as-a-judge, and report multiple metrics to capture not only jailbreak success but also whether the output is meaningfully edited and harmful:

MetricDescription
Attack Success Rate (ASR)The ratio of attacks that can bypass the guard models.
Harmfulness Score (HS)The harmfulness level of the edited image in a scale of 1–5.
Editing Validity (EV)The attack that can successfully bypass the guard, but the edited content is invalid (e.g., garbled text).
High Risk Ratio (HRR)The proportion of effective and high-risk attacks (e.g., HS≥4), used to measure "true high-risk output".

Commercial Models Open-Source Models
We sort the models using the ASR by default, and give the 🥇🥈🥉 to the first, second and third models.
Model I1 I2 I3 I4 I5 I6 I7 I8 I9 I10 I11 I12 I13 I14 I15 ALL
Qwen-Image-Edit (Online version) 100.0 93.0 99.1 100.0 98.1 100.0 100.0 94.9 96.8 80.0 97.8 88.7 100.0 100.0 100.0 97.5
Nano Banana Pro (🥉) 60.4 95.3 88.3 30.8 92.5 100.0 90.5 95.8 84.2 100.0 41.3 74.2 100.0 83.8 100.0 80.9
GPT Image 1.5 (🥈) 48.9 87.6 44.1 39.8 54.7 97.2 94.0 91.6 38.9 60.0 95.7 32.3 92.3 82.4 100.0 70.3
Seedream 4.5 98.6 92.2 86.5 100.0 100.0 100.0 100.0 96.3 86.3 100.0 97.8 83.9 100.0 83.8 100.0 94.1
BAGEL 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0
Flux2.0 [dev] 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0
Qwen-Image-Edit* (Local version) 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0
Qwen-Image-Edit-Safe (Ours) (🥇) 87.0 77.3 87.4 88.7 81.1 72.2 69.0 53.4 71.8 100.0 28.3 8.1 61.5 72.1 55.3 66.9

🔱 Risky Catergory Definition and Examples

To facilitate standardized evaluation, we construct the IESBench, a vision-centric benchmark for evaluating the safety of large image editing models, which contains 1054 visually-prompted images, spanning across 15 safety catergories, 116 attributes and 9 actions.

A

🎬 More Interesting Examples of Failed Attacks

🎓 BibTex

If you find our work can be helpful, we would appreciate your citation and star:

@misc{hou2026vja,
      title={When the Prompt Becomes Visual: Vision-Centric Jailbreak Attacks for Large Image Editing Models}, 
      author={Jiacheng Hou and Yining Sun and Ruochong Jin and Haochen Han and Fangming Liu and Wai Kin Victor Chan and Alex Jinpeng Wang},
      year={2026},
      eprint={xxx},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/xxx}, 
}