V-MAGE Logo

V-MAGE: A Game Evaluation Framework for Assessing Visual-Centric Capabilities in MLLMs

ACL 2026 Findings

Xiangxi Zheng1    Linjie Li2    Zhengyuan Yang2    Ping Yu1    Alex Jinpeng Wang3    Rui Yan1    Yuan Yao1    Lijuan Wang2   
1Nanjing University     2Microsoft Research     3Central South University    
V-MAGE overview

We present V-MAGE, a benchmark built on video game environments designed to evaluate the comprehensive performance of MLLMs, with a focus on visual-centric capabilities. V-MAGE consists of five distinct video game environments, each containing manually crafted levels of varying difficulty to holistically assess the visual perception and reasoning abilities of MLLMs. The evaluation employs a dynamic Elo-based framework with statistical stabilization, iteratively refining models' relative capabilities through randomized pairwise comparisons across multi-round interactions.
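The core of such a pairwise Elo ranking can be sketched as follows. This is a minimal illustration, not the framework's actual implementation: the `K = 32` step size, the 1500 starting rating, and the model names are assumptions, and the real system additionally applies statistical stabilization.

```python
import random

K = 32  # update step size (illustrative; the paper's constant may differ)

def expected(r_a, r_b):
    """Expected score of player A under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings, a, b, score_a):
    """score_a: 1.0 if A beats B, 0.5 for a draw, 0.0 for a loss."""
    e_a = expected(ratings[a], ratings[b])
    ratings[a] += K * (score_a - e_a)
    ratings[b] += K * ((1.0 - score_a) - (1.0 - e_a))

# Randomized pairwise comparisons over many rounds refine the relative ranking.
ratings = {m: 1500.0 for m in ["model_A", "model_B", "model_C"]}
for _ in range(1000):
    a, b = random.sample(list(ratings), 2)
    score_a = random.choice([0.0, 0.5, 1.0])  # placeholder: compare game scores here
    update(ratings, a, b, score_a)
```

Because each update is zero-sum, the rating pool is conserved; repeated randomized pairings let relative strengths converge even when models never play every opponent directly.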

Abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in visual-text processing. However, existing static image-text benchmarks are insufficient for evaluating their dynamic perception and interactive reasoning abilities. We introduce Vision-centric Multiple Abilities Game Evaluation (V-MAGE), a novel game-based evaluation framework designed to systematically assess MLLMs’ visual reasoning in interactive, continuous-space environments. V-MAGE features five distinct video games comprising over 30 carefully constructed evaluation scenarios. These scenarios are set in free-form, visually complex environments that require models to interpret dynamic game states and make decisions based solely on visual input, thereby closely reflecting the conditions encountered by human players. To ensure robust and interpretable comparisons across models, V-MAGE employs a dynamic Elo-based ranking system that accounts for varying difficulty levels and task diversity. Benchmarking state-of-the-art MLLMs against human baselines reveals that while leading models approach human-level performance in simple tasks, their performance drops significantly in complex scenarios requiring advanced reasoning and task orchestration. This persistent performance gap highlights fundamental limitations in current MLLMs’ ability to perform real-time, vision-grounded interactions. Through extensive analyses, we demonstrate the utility of V-MAGE in uncovering these limitations and providing actionable insights for improving the visual and reasoning capabilities of MLLMs in dynamic, interactive settings.

Pipeline

V-MAGE Evaluation Pipeline. We select five games and design several levels for each to decompose the evaluation of model performance: FlappyBird, RaceGame, SuperMario, PongGame, and TempestRun.
During evaluation, the Agent module receives visual game state information from the Game module in the form of game screenshots. The Agent module structures these screenshots into model inputs: in the baseline agent of V-MAGE, inputs to the MLLM combine the three most recent frames with a prompt containing the game rules. The model's output is parsed by the Agent module into a response action, which is then sent back to the Game module.
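The agent-game loop above can be sketched in a few lines. This is a rough illustration under stated assumptions: the class, method, and parameter names (`BaselineAgent`, `step`, `parse_action`, the `model(images=..., prompt=...)` call signature, and the substring-matching action parser) are hypothetical, not V-MAGE's actual API.

```python
from collections import deque

FRAME_WINDOW = 3  # the baseline agent feeds the model the three most recent screenshots

class BaselineAgent:
    """Minimal sketch of the screenshot -> model -> action loop (names illustrative)."""

    def __init__(self, model, rules_prompt, action_space):
        self.model = model                # callable wrapping the MLLM
        self.rules_prompt = rules_prompt  # prompt containing the game rules
        self.action_space = action_space  # discrete actions the game accepts
        self.frames = deque(maxlen=FRAME_WINDOW)  # rolling window of screenshots

    def step(self, screenshot):
        """Receive one screenshot from the Game module, return one action."""
        self.frames.append(screenshot)
        raw = self.model(images=list(self.frames), prompt=self.rules_prompt)
        return self.parse_action(raw)

    def parse_action(self, raw):
        # Map the model's free-form text output onto the game's action space.
        for action in self.action_space:
            if action.lower() in raw.lower():
                return action
        return self.action_space[0]  # fallback when no valid action is recognized
```

The `deque(maxlen=3)` keeps the visual context bounded: once a fourth frame arrives, the oldest is dropped automatically, matching the three-frame input described above.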

Game Demos

OpenAI GPT-4o

Google Gemini 2.0 Flash

Qwen2.5VL-72B


Game playing examples of MLLMs sampled from the evaluation process in V-MAGE.

Leaderboard

Performance comparison across different games based on the Elo rating system. The Random baseline selects actions uniformly at random from the predefined action space during decision-making phases. Avg. Ratio is the model's score as an average percentage of the human baseline score.

| Model | FlappyBird | Pong | Race | SuperMario | TempestRun | Avg. Elo Score | Avg. Ratio (%) |
|---|---|---|---|---|---|---|---|
| **Closed-Source Models** | | | | | | | |
| GPT-5-2025-08-07 | 1572 | 1939 | 1710 | 1584 | 1743 | 1710 | 43.4 |
| Gemini-2.5-Pro | 1526 | 1602 | 1660 | 1758 | 1474 | 1604 | 36.3 |
| Claude-3.7-Sonnet | 1560 | 1570 | 1633 | 1582 | 1369 | 1543 | 30.8 |
| Gemini-2.5-Flash | 1578 | 1524 | 1520 | 1531 | 1489 | 1528 | 23.8 |
| GPT-4o | 1557 | 1449 | 1581 | 1518 | 1527 | 1526 | 26.6 |
| Gemini-2.0-Flash (Thinking) | 1517 | 1479 | 1503 | 1564 | 1516 | 1516 | 22.6 |
| GPT-5.1-2025-11-13 | 1552 | 1514 | 1507 | 1449 | 1411 | 1486 | 20.1 |
| Gemini-2.0-Flash | 1494 | 1461 | 1437 | 1499 | 1530 | 1484 | 16.7 |
| **Open-Source Models** | | | | | | | |
| Qwen3-VL-235B-A22B-Instruct | 1567 | 1441 | 1517 | 1556 | 1496 | 1515 | 24.3 |
| Qwen2.5-VL-72B-Instruct | 1556 | 1442 | 1506 | 1541 | 1530 | 1515 | 22.8 |
| InternVL2.5-78B | 1463 | 1462 | 1465 | 1543 | 1528 | 1492 | 19.2 |
| Qwen2-VL-72B-Instruct | 1426 | 1445 | 1442 | 1505 | 1547 | 1473 | 16.5 |
| InternVL2.5-8B | 1459 | 1448 | 1431 | 1373 | 1495 | 1441 | 12.9 |
| Qwen2.5-VL-7B-Instruct | 1457 | 1446 | 1423 | 1354 | 1517 | 1439 | 12.1 |
| Qwen2-VL-7B-Instruct | 1470 | 1447 | 1408 | 1362 | 1501 | 1438 | 11.4 |
| Keye-VL-8B-Preview | 1419 | 1444 | 1428 | 1381 | 1499 | 1434 | 13.1 |
| Phi-4-multimodal-instruct | 1404 | 1454 | 1420 | 1482 | 1385 | 1429 | 13.7 |
| **Baseline** | | | | | | | |
| Random | 1422 | 1434 | 1410 | 1417 | 1445 | 1426 | 11.0 |

BibTeX

@article{zheng2025vmagebenchmark,
  title={V-MAGE: A Game Evaluation Framework for Assessing Visual-Centric Capabilities in Multimodal Large Language Models},
  author={Xiangxi Zheng and Linjie Li and Zhengyuan Yang and Ping Yu and Alex Jinpeng Wang and Rui Yan and Yuan Yao and Lijuan Wang},
  journal={arXiv preprint arXiv:2504.06148},
  year={2025}
}