VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation

Kevin Qinghong Lin¹*, Yuhao Zheng²*, Hangyu Ran³*, Dantong Zhu³, Dongxing Mao³, Linjie Li⁴, Philip Torr¹, Alex Jinpeng Wang³✉
¹University of Oxford ²University of Science and Technology of China ³Central South University ⁴Microsoft Research
* Equal contribution ✉ Corresponding author

[Panels: Orig | Gemini-3-Pro | VCoder(Claude-4-Opus) | Claude-4-Opus | GPT-5 | Qwen3-VL]

Q:What is the spatial relation between the frisbee and the man?

A:above the man OR top left

Q:Which one of these two jets is larger in size?

A:same

Q:Is the person bigger than the car?

A:no

Q:Why are the persons wearing black clothes? Answer it and give the rationale.

A:They are going to a funeral. I think so because it is traditional for people to wear black to a funeral.

Q:Can you explain this meme?

A:This meme is a humorous take on procrastination and the tendency to delay tasks until a specific time. The person in the meme plans to do something at 8 o'clock, but when they miss that deadline by a few minutes, they decide to wait until 9 o'clock instead. The image of Kermit the Frog lying in bed represents the person's laziness and lack of motivation to complete the task.

Q:Can you explain this meme?

A:This meme is a commentary on the current situation of online learning due to the COVID-19 pandemic. It shows how different parties are affected by it.

The first two images show the Zoom company and the teachers throwing money in the air, which implies that they are profiting from the situation. Zoom, in particular, has seen a surge in usage and revenue due to the pandemic.

The third image shows the WiFi company also throwing money in the air, which implies that they are also profiting from the situation as more people are using the internet for online learning.

The last image shows a student with a sad clown face, which implies that students are not benefiting from the situation and are struggling with online learning. The "Put on a happy face" sign in the background further emphasizes the idea that students are putting on a brave face despite their struggles.

Q:What is funny about this image?

A:It is a cartoon of a rhinoceros painting a picture and each picture has its rhino horn because the rhino horn grows in front of its eyes. The caption "You see the world as you are!" is a playful commentary on how our perspective shapes our perception of the world.

Q:What is funny about this image?

A:There are a few things that people might find funny about this image:

1. The contrast between the two images: On the left, we see two women who are very emotional and angry, while on the right, we see a cat who is calmly sitting at a dinner table.

2. The absurdity of the cat's situation: It is not common to see a cat sitting at a dinner table with a plate of food in front of it, so this image is unexpected and ridiculous.

3. The cat's expression: The cat appears to be frightened by the women's yelling, which adds to the humor of the image.

Q:Can you give a short introduction to this movie?

A:The Lord of the Rings: The Fellowship of the Ring is a 2001 epic fantasy adventure film directed by Peter Jackson from a screenplay by Fran Walsh, Philippa Boyens, and Jackson, based on 1954's The Fellowship of the Ring, the first volume of the novel The Lord of the Rings by J. R. R. Tolkien. The film is the first installment in The Lord of the Rings trilogy. It features an ensemble cast including Elijah Wood, Ian McKellen, Liv Tyler, Viggo Mortensen, Sean Astin, Cate Blanchett, John Rhys-Davies, Billy Boyd, Dominic Monaghan, Orlando Bloom, Christopher Lee, Hugo Weaving, Sean Bean, Ian Holm, and Andy Serkis.

Q:Can you give a short introduction to this painting?

A:Emperor Taizong Receiving the Tibetan Envoy (also called 步辇图, Bùniǎn Tú) is a painting on silk by Yan Liben to show the friendly encounter between the Tang dynasty and Tibet. The painting is 129 centimetres (51 in) long by 38.5 centimetres (15.2 in) wide. Bunian Tu is in The Palace Museum in Beijing.

Q:How many cars are in the image? Select from the following choices. (A) 5 (B) 3 (C) 2 (D) 1 (E) 4 (F) 0 Answer with the option's letter from the given choices directly.

A:(B)

Q:Considering the relative positions of the person and the boat in the image provided, where is the person located with respect to the boat? Select from the following choices. (A) above (B) below Answer with the option's letter from the given choices directly.

A:(B)

Q:Considering the relative positions of the person (annotated by the red box) and the cell phone in the image provided, where is the person (annotated by the red box) located with respect to the cell phone? Select from the following choices. (A) above (B) below Answer with the option's letter from the given choices directly.

A:(A)

Q:Considering the relative positions of the sheep and the horse in the image provided, where is the sheep located with respect to the horse? Select from the following choices. (A) left (B) right Answer with the option's letter from the given choices directly.

A:(B)

Q:How many benches are in the image? Select from the following choices. (A) 1 (B) 3 (C) 2 (D) 0 Answer with the option's letter from the given choices directly.

A:(A)

Q:Considering the relative positions of the bottle and the suitcase in the image provided, where is the bottle located with respect to the suitcase? Select from the following choices. (A) left (B) right Answer with the option's letter from the given choices directly.

A:(B)

Q:Estimate the real-world distances between objects in this image. Which object is closer to the pedestrian (highlighted by a red box), the truck (highlighted by a blue box) or the bus (highlighted by a green box)? (A) truck (B) bus Answer with the option's letter from the given choices directly.

A:(A)

Q:Estimate the real-world distances between objects in this image. Which object is closer to the motorcycle (highlighted by a red box), the car (highlighted by a blue box) or the pedestrian (highlighted by a green box)? (A) car (B) pedestrian Answer with the option's letter from the given choices directly.

A:(A)

Q:Which object is closer to the camera taking this photo, the books (highlighted by a red box) or the computer (highlighted by a blue box)? (A) books (B) computer Answer with the option's letter from the given choices directly.

A:(A)

Q: What among the listed issues would not be the cause for the petioles of this rhubarb splitting? (A) Physiological problems (B) Phytoplasma infection (C) I don't know and don't want to guess. (D) Animal damage (E) Bacteria Answer with the option's letter from the given choices directly.

A:E

🏆 Leaderboard

#	Model Name	VCode Score ↓	General	Professional	Vision-centric	SigLip Score	Code Token (K)	Success Rate