V-MAGE Logo

V-MAGE: A Game Evaluation Framework for Assessing Visual-Centric Capabilities in MLLMs

ACL 2026 Findings

Xiangxi Zheng1    Linjie Li2    Zhengyuan Yang2    Ping Yu1    Alex Jinpeng Wang3    Rui Yan1    Yuan Yao1    Lijuan Wang2   
1Nanjing University     2Microsoft Research     3Central South University    
V-MAGE overview

We present V-MAGE, a benchmark built on video game environments designed to evaluate the comprehensive performance of MLLMs, with a focus on visual-centric capabilities. V-MAGE consists of five distinct video game environments, each containing manually crafted levels of varying difficulty to holistically assess the visual perception and reasoning abilities of MLLMs. The evaluation employs a dynamic Elo-based framework with statistical stabilization, iteratively refining models' relative capabilities through randomized pairwise comparisons across multi-round interactions.
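The core of such a pairwise Elo ranking can be sketched as follows. This is a minimal illustration, not the framework's actual implementation: the `K = 32` step size, the 1500 starting rating, and the model names are assumptions, and the real system additionally applies statistical stabilization.

```python
import random

K = 32  # update step size (illustrative; the paper's constant may differ)

def expected(r_a, r_b):
    """Expected score of player A under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings, a, b, score_a):
    """score_a: 1.0 if A beats B, 0.5 for a draw, 0.0 for a loss."""
    e_a = expected(ratings[a], ratings[b])
    ratings[a] += K * (score_a - e_a)
    ratings[b] += K * ((1.0 - score_a) - (1.0 - e_a))

# Randomized pairwise comparisons over many rounds refine the relative ranking.
ratings = {m: 1500.0 for m in ["model_A", "model_B", "model_C"]}
for _ in range(1000):
    a, b = random.sample(list(ratings), 2)
    score_a = random.choice([0.0, 0.5, 1.0])  # placeholder: compare game scores here
    update(ratings, a, b, score_a)
```

Because each update is zero-sum, the rating pool is conserved; repeated randomized pairings let relative strengths converge even when models never play every opponent directly.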

Abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in visual-text processing. However, existing static image-text benchmarks are insufficient for evaluating their dynamic perception and interactive reasoning abilities. We introduce Vision-centric Multiple Abilities Game Evaluation (V-MAGE), a novel game-based evaluation framework designed to systematically assess MLLMs’ visual reasoning in interactive, continuous-space environments. V-MAGE features five distinct video games comprising over 30 carefully constructed evaluation scenarios. These scenarios are set in free-form, visually complex environments that require models to interpret dynamic game states and make decisions based solely on visual input, thereby closely reflecting the conditions encountered by human players. To ensure robust and interpretable comparisons across models, V-MAGE employs a dynamic Elo-based ranking system that accounts for varying difficulty levels and task diversity. Benchmarking state-of-the-art MLLMs against human baselines reveals that while leading models approach human-level performance in simple tasks, their performance drops significantly in complex scenarios requiring advanced reasoning and task orchestration. This persistent performance gap highlights fundamental limitations in current MLLMs’ ability to perform real-time, vision-grounded interactions. Through extensive analyses, we demonstrate the utility of V-MAGE in uncovering these limitations and providing actionable insights for improving the visual and reasoning capabilities of MLLMs in dynamic, interactive settings.

Pipeline

V-MAGE Evaluation Pipeline. We select five games and design several levels for each to decompose the evaluation of model performance: FlappyBird, RaceGame, SuperMario, PongGame, and TempestRun.
During evaluation, the Agent module receives visual game state information from the Game module in the form of game screenshots. The Agent module structures these screenshots into model inputs: in the baseline agent of V-MAGE, inputs to the MLLM combine the three most recent frames with a prompt containing the game rules. The model's output is parsed by the Agent module into a response action, which is then sent back to the Game module.
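The agent-game loop above can be sketched in a few lines. This is a rough illustration under stated assumptions: the class, method, and parameter names (`BaselineAgent`, `step`, `parse_action`, the `model(images=..., prompt=...)` call signature, and the substring-matching action parser) are hypothetical, not V-MAGE's actual API.

```python
from collections import deque

FRAME_WINDOW = 3  # the baseline agent feeds the model the three most recent screenshots

class BaselineAgent:
    """Minimal sketch of the screenshot -> model -> action loop (names illustrative)."""

    def __init__(self, model, rules_prompt, action_space):
        self.model = model                # callable wrapping the MLLM
        self.rules_prompt = rules_prompt  # prompt containing the game rules
        self.action_space = action_space  # discrete actions the game accepts
        self.frames = deque(maxlen=FRAME_WINDOW)  # rolling window of screenshots

    def step(self, screenshot):
        """Receive one screenshot from the Game module, return one action."""
        self.frames.append(screenshot)
        raw = self.model(images=list(self.frames), prompt=self.rules_prompt)
        return self.parse_action(raw)

    def parse_action(self, raw):
        # Map the model's free-form text output onto the game's action space.
        for action in self.action_space:
            if action.lower() in raw.lower():
                return action
        return self.action_space[0]  # fallback when no valid action is recognized
```

The `deque(maxlen=3)` keeps the visual context bounded: once a fourth frame arrives, the oldest is dropped automatically, matching the three-frame input described above.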

Game Demos

OpenAI GPT-4o

Google Gemini 2.0 Flash

Qwen2.5VL-72B


Game playing examples of MLLMs sampled from the evaluation process in V-MAGE.

Leaderboard

Performance comparison across different games based on the Elo rating system. The Random baseline selects actions uniformly at random from the predefined action space during decision-making phases. Avg. Ratio is the model's score as an average percentage of the human baseline score.

| Model | FlappyBird | Pong | Race | SuperMario | TempestRun | Avg. Elo Score | Avg. Ratio (%) |
|---|---|---|---|---|---|---|---|
| **Closed-Source Models** | | | | | | | |
| GPT-5-2025-08-07 | 1572 | 1939 | 1710 | 1584 | 1743 | 1710 | 43.4 |
| Gemini-2.5-Pro | 1526 | 1602 | 1660 | 1758 | 1474 | 1604 | 36.3 |
| Claude-3.7-Sonnet | 1560 | 1570 | 1633 | 1582 | 1369 | 1543 | 30.8 |
| Gemini-2.5-Flash | 1578 | 1524 | 1520 | 1531 | 1489 | 1528 | 23.8 |
| GPT-4o | 1557 | 1449 | 1581 | 1518 | 1527 | 1526 | 26.6 |
| Gemini-2.0-Flash (Thinking) | 1517 | 1479 | 1503 | 1564 | 1516 | 1516 | 22.6 |
| GPT-5.1-2025-11-13 | 1552 | 1514 | 1507 | 1449 | 1411 | 1486 | 20.1 |
| Gemini-2.0-Flash | 1494 | 1461 | 1437 | 1499 | 1530 | 1484 | 16.7 |
| **Open-Source Models** | | | | | | | |
| Qwen3-VL-235B-A22B-Instruct | 1567 | 1441 | 1517 | 1556 | 1496 | 1515 | 24.3 |
| Qwen2.5-VL-72B-Instruct | 1556 | 1442 | 1506 | 1541 | 1530 | 1515 | 22.8 |
| InternVL2.5-78B | 1463 | 1462 | 1465 | 1543 | 1528 | 1492 | 19.2 |
| Qwen2-VL-72B-Instruct | 1426 | 1445 | 1442 | 1505 | 1547 | 1473 | 16.5 |
| InternVL2.5-8B | 1459 | 1448 | 1431 | 1373 | 1495 | 1441 | 12.9 |
| Qwen2.5-VL-7B-Instruct | 1457 | 1446 | 1423 | 1354 | 1517 | 1439 | 12.1 |
| Qwen2-VL-7B-Instruct | 1470 | 1447 | 1408 | 1362 | 1501 | 1438 | 11.4 |
| Keye-VL-8B-Preview | 1419 | 1444 | 1428 | 1381 | 1499 | 1434 | 13.1 |
| Phi-4-multimodal-instruct | 1404 | 1454 | 1420 | 1482 | 1385 | 1429 | 13.7 |
| **Baseline** | | | | | | | |
| Random | 1422 | 1434 | 1410 | 1417 | 1445 | 1426 | 11.0 |

BibTeX

@article{zheng2025vmagebenchmark,
  title={V-MAGE: A Game Evaluation Framework for Assessing Visual-Centric Capabilities in Multimodal Large Language Models},
  author={Xiangxi Zheng and Linjie Li and Zhengyuan Yang and Ping Yu and Alex Jinpeng Wang and Rui Yan and Yuan Yao and Lijuan Wang},
  journal={arXiv preprint arXiv:2504.06148},
  year={2025}
}