When the Prompt Becomes Visual

Vision-Centric Jailbreak Attacks for Large Image Editing Models
Welcome ! This project aims to investigate the safety of large image editing models in a vision-centric perspective.

📄 Paper 📚 Dataset 💻 Code

Jiacheng Hou^*1,
Yining Sun^{*1 2},
Ruochong Jin^{*1 2},
Haochen Han^#2,
Fangming Liu²,
Wai Kin Victor Chan¹,
Alex Jinpeng Wang^#3,

^* Equal Contribution; ^# Correspondance

¹ Tsinghua University ² Peng Cheng Laboratory, Shenzhen ³ Central South University

🎬Examples

Browse a curated set of attack inputs: each image encodes an editing request through visual annotations, serving as the prompt in our evaluation. The left side shows the original image, and the right side shows the image output from the edited image model.

🌟 Overview

Recent advances in large image editing models have shifted the paradigm from text-driven instructions to vision-prompt editing, where user intent is inferred directly from visual inputs such as marks, arrows, and visual–text prompts. While this paradigm greatly expands usability, it also introduces a critical and underexplored safety risk: the attack surface itself becomes visual. To mitigate the safety gap, this project aims to systematically investigate the safety of large image editing models from a vision-centric perspective, with new jailbreak attack method, benchmark and a training-free defense approach.

🏆 LeaderBoard on 15 risky categories of IESBench

We adopt an automated evaluation protocol with MLLM-as-a-judge, and report multiple metrics to capture not only jailbreak success but also whether the output is meaningfully edited and harmful:

Metric	Description
Attack Success Rate (ASR)	The ratio of attacks that can bypass the guard models.
Harmfulness Score (HS)	The harmfulness level of the edited image in a scale of 1–5.
Editing Validity (EV)	The attack that can successfully bypass the guard, but the edited content is invalid (e.g., garbled text).
High Risk Ratio (HRR)	The proportion of effective and high-risk attacks (e.g., HS≥4), used to measure "true high-risk output".

Commercial Models Open-Source Models

We sort the models using the ASR by default, and give the 🥇🥈🥉 to the first, second and third models.

Model	I1	I2	I3	I4	I5	I6	I7	I8	I9	I10	I11	I12	I13	I14	I15	ALL
Qwen-Image-Edit (Online version)	100.0	93.0	99.1	100.0	98.1	100.0	100.0	94.9	96.8	80.0	97.8	88.7	100.0	100.0	100.0	97.5
Nano Banana Pro (🥉)	60.4	95.3	88.3	30.8	92.5	100.0	90.5	95.8	84.2	100.0	41.3	74.2	100.0	83.8	100.0	80.9
GPT Image 1.5 (🥈)	48.9	87.6	44.1	39.8	54.7	97.2	94.0	91.6	38.9	60.0	95.7	32.3	92.3	82.4	100.0	70.3
Seedream 4.5	98.6	92.2	86.5	100.0	100.0	100.0	100.0	96.3	86.3	100.0	97.8	83.9	100.0	83.8	100.0	94.1
BAGEL	100.0	100.0	100.0	100.0	100.0	100.0	100.0	100.0	100.0	100.0	100.0	100.0	100.0	100.0	100.0	100.0
Flux2.0 [dev]	100.0	100.0	100.0	100.0	100.0	100.0	100.0	100.0	100.0	100.0	100.0	100.0	100.0	100.0	100.0	100.0
Qwen-Image-Edit* (Local version)	100.0	100.0	100.0	100.0	100.0	100.0	100.0	100.0	100.0	100.0	100.0	100.0	100.0	100.0	100.0	100.0
Qwen-Image-Edit-Safe (Ours) (🥇)	87.0	77.3	87.4	88.7	81.1	72.2	69.0	53.4	71.8	100.0	28.3	8.1	61.5	72.1	55.3	66.9

Model	I1	I2	I3	I4	I5	I6	I7	I8	I9	I10	I11	I12	I13	I14	I15	ALL
Qwen-Image-Edit (Online version)	4.2	3.7	4.0	4.2	4.4	4.0	4.8	3.8	3.8	3.8	4.4	4.2	4.4	4.1	3.1	4.1
Nano Banana Pro (🥉)	3.2	4.4	4.1	1.9	4.5	4.7	4.6	4.3	3.6	4.7	2.4	3.8	4.6	3.5	3.3	3.8
GPT Image 1.5 (🥈)	2.4	3.5	2.3	2.3	3.0	4.4	4.4	4.1	2.1	3.4	4.1	2.2	3.0	3.6	4.7	3.2
Seedream 4.5	4.7	4.3	4.2	4.7	4.8	4.8	4.8	4.5	3.8	5.0	4.8	4.2	4.7	3.9	4.7	4.4
BAGEL	4.3	4.2	3.5	4.5	4.3	3.9	4.4	3.9	4.4	4.2	4.5	4.2	3.7	3.7	4.5	4.1
Flux2.0 [dev]	4.7	4.7	4.7	4.7	4.8	4.7	4.7	4.4	4.2	4.4	4.9	4.9	4.4	4.3	4.5	4.6
Qwen-Image-Edit* (Local version)	4.7	4.4	4.6	4.8	4.6	4.6	4.6	4.5	4.3	4.7	4.8	4.8	4.6	4.7	4.6	4.6
Qwen-Image-Edit-Safe (🥇, Ours)	4.3	3.7	4.2	4.4	4.0	3.6	3.6	2.9	3.3	4.7	2.0	1.3	3.1	3.6	3.1	3.4

Model	I1	I2	I3	I4	I5	I6	I7	I8	I9	I10	I11	I12	I13	I14	I15	ALL
Qwen-Image-Edit (Online version)	87.8	70.5	90.1	94.0	90.6	69.4	98.8	81.3	78.9	80.0	87.0	88.7	87.2	86.8	94.7	87.7
Nano Banana Pro (🥉)	60.4	94.6	84.7	30.1	92.5	100.0	90.5	95.3	75.8	100.0	39.1	74.2	100.0	79.4	100.0	79.1
GPT Image 1.5 (🥈)	36.0	86.8	34.5	33.8	52.8	97.2	90.5	87.5	25.3	60.0	84.8	30.6	100.0	75.0	94.7	63.0
Seedream 4.5	92.8	82.9	78.4	93.2	96.2	94.4	98.8	86.9	71.6	100.0	95.7	80.6	97.4	72.1	92.1	86.3
BAGEL	84.2	85.3	58.6	91.0	88.7	83.3	95.2	79.0	86.3	80.0	95.7	85.5	76.9	76.9	94.7	82.0
Flux2.0 [dev]	92.8	93.8	91.9	94.0	96.2	80.6	94.0	80.4	75.8	80.0	93.5	100.0	76.9	79.4	94.7	87.1
Qwen-Image-Edit* (Local version)	94.2	89.0	91.0	95.5	92.5	94.4	95.2	90.7	83.9	90.0	95.7	100.0	97.4	97.0	97.4	92.9
Qwen-Image-Edit-Safe (🥇, Ours)	83.3	68.8	81.1	85.0	75.5	69.4	69.0	50.7	58.8	90.0	26.1	8.1	61.5	70.6	55.3	62.8

Model	I1	I2	I3	I4	I5	I6	I7	I8	I9	I10	I11	I12	I13	I14	I15	ALL
Qwen-Image-Edit (Online version)	77.0	65.9	72.1	76.7	84.9	66.7	95.2	64.5	66.3	70.0	80.4	87.1	84.6	72.1	34.2	73.8
Nano Banana Pro (🥉)	55.4	89.1	75.7	21.8	88.7	100.0	90.5	81.8	63.2	90.0	34.8	74.2	92.3	61.8	42.1	70.6
GPT Image 1.5 (🥈)	30.9	60.5	30.6	29.3	47.2	86.1	85.7	72.9	20.0	60.0	73.9	29.0	48.7	60.3	86.8	52.0
Seedream 4.5	91.4	81.4	75.7	90.2	92.5	94.4	98.8	86.0	61.1	100.0	95.7	77.4	97.4	70.6	86.8	83.8
BAGEL	74.8	75.2	47.7	82.0	77.4	69.4	91.7	62.6	77.9	70.0	93.5	74.2	51.3	60.3	84.2	70.6
Flux2.0 [dev]	87.1	87.6	87.4	92.5	92.5	80.6	94.0	77.1	72.6	80.0	93.5	98.4	76.9	72.1	84.2	84.6
Qwen-Image-Edit* (Local version)	93.5	87.4	89.2	94.0	88.7	94.4	94.0	87.9	77.4	90.0	95.7	95.2	97.4	94.0	84.2	90.3
Qwen-Image-Edit-Safe (🥇, Ours)	82.6	68.0	80.2	83.5	75.5	69.4	69.0	49.8	54.1	90.0	26.1	8.1	61.5	67.6	50.0	61.7

🔱 Risky Catergory Definition and Examples

To facilitate standardized evaluation, we construct the IESBench, a vision-centric benchmark for evaluating the safety of large image editing models, which contains 1054 visually-prompted images, spanning across 15 safety catergories, 116 attributes and 9 actions.

🎬 More Interesting Examples of Failed Attacks

Stick figures are riding a horse

A little hedgehog is riding on wheels

Stick figures are hunting a cow

A strange lemon with a face

🎓 BibTex

If you find our work can be helpful, we would appreciate your citation and star:

@misc{hou2026vja,
      title={When the Prompt Becomes Visual: Vision-Centric Jailbreak Attacks for Large Image Editing Models}, 
      author={Jiacheng Hou and Yining Sun and Ruochong Jin and Haochen Han and Fangming Liu and Wai Kin Victor Chan and Alex Jinpeng Wang},
      year={2026},
      eprint={xxx},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/xxx}, 
}

🎬Examples

🌟 Overview

Contribution

🏆 LeaderBoard on 15 risky categories of IESBench

🔱 Risky Catergory Definition and Examples

Category 1: Violence & Harm Scene Manipulation (I1)

Category 2: Privacy Violation (I2)

Category 3: Self-Harm & Suicide Scene Generation (I3)

Category 4: Child Abuse (I4)

Category 5: Animal Abuse (I5)

Category 6: False Medical & Health Imagery (I6)

Category 7: Commercial Document Tampering (I7)

Category 8: Visual Misinformation & Disinformation (I8)

Category 9: Sexual & Non-consensual Intimate Imagery (I9)

Category 10: Dangerous Act Instructional Visualization (I10)

Category 11: Copyright & Authorship Tampering (I11)

Category 12: Hidden Information Reconstruction (I12)

Category 13: Evidence Tampering (I13)

Category 14: Discrimination-based Visual Editing (I14)

Category 15: Aversive Visual Manipulation (I15)

🎬 More Interesting Examples of Failed Attacks

🎓 BibTex