V-MAGE Logo

V-MAGE: A Game Evaluation Framework for Assessing Visual-Centric Capabilities in MLLMs

Xiangxi Zheng1    XXXX1    XXXX1    XXXX1    XXXX1    XXXX1   
Microsoft Research     Nanjing University     Central South University
* Project lead.    First authors.    Second authors.    Leadership

We present V-MAGE, a benchmark built on video game environments designed to evaluate the comprehensive performance of MLLMs, with a focus on visual-centric capabilities. V-MAGE consists of five distinct video game environments, each containing manually crafted levels of varying difficulty to holistically assess the visual perception and reasoning abilities of MLLMs. The evaluation employs a dynamic Elo-based framework with statistical stabilization, iteratively refining models' relative capabilities through randomized pairwise comparisons across multi-round interactions.
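The pairwise Elo mechanism above can be sketched in a few lines. This is a minimal illustration of a standard Elo update applied to a randomized model-vs-model comparison, not the framework's actual implementation; the function names and the K-factor of 32 are assumptions for the example.

```python
def expected_score(r_a, r_b):
    """Probability that player A beats player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, score_a, k=32):
    """Update both ratings after one pairwise comparison.

    score_a is 1.0 if A wins, 0.5 for a draw, 0.0 if A loses.
    The update is zero-sum: whatever A gains, B loses.
    """
    e_a = expected_score(r_a, r_b)
    r_a_new = r_a + k * (score_a - e_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return r_a_new, r_b_new
```

Repeating such updates over many randomized pairings is what lets the relative rankings stabilize across multi-round interactions.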

Abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have led to significant improvements across various multimodal benchmarks. However, as evaluations shift from static datasets to open-world, dynamic environments, current game-based benchmarks remain inadequate—they lack visual-centric tasks and fail to assess the diverse reasoning skills required for real-world decision-making. To address this, we introduce Visual-centric Multiple Abilities Game Evaluation (V-MAGE), a game-based evaluation framework designed to assess MLLMs’ visual reasoning capabilities. V-MAGE features five diverse games with 30+ handcrafted levels, testing models on core visual skills such as positioning, trajectory tracking, timing, and visual memory, alongside higher-level reasoning like long-term planning and deliberation. We use V-MAGE to evaluate leading MLLMs, revealing significant challenges in their visual perception and reasoning. In all game environments, the top-performing MLLMs (including models such as GPT-4o and Gemini-2.0-flash-exp), as determined by Elo rating comparisons, exhibit a substantial performance gap compared to humans. Our findings highlight critical limitations, including various types of perceptual errors made by the models, and suggest potential avenues for improvement from an agent-centric perspective, such as refining agent strategies and addressing perceptual inaccuracies.

Pipeline

V-MAGE Evaluation Pipeline. We selected five different games and designed several levels for each game to decompose the evaluation of model performance. The games used are FlappyBird, RaceGame, SuperMario, PongGame, and Tempest Run.
During the evaluation process, the Agent module receives visual game state information from the Game module, specifically in the form of game screenshots. Within the Agent module, these screenshots are structured into inputs for the model. In the baseline agent of V-MAGE, inputs to the MLLM are constructed by combining screenshots of the three most recent frames with a prompt containing the game rules. The model's output is parsed by the Agent module into response actions, which are then sent back to the Game module.
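The baseline agent loop described above can be sketched as follows. This is a simplified illustration, not the actual V-MAGE code: the class name, the `model` callable, and the action vocabulary (`UP`/`DOWN`/`LEFT`/`RIGHT`/`NOOP`) are hypothetical stand-ins.

```python
from collections import deque

class BaselineAgent:
    """Sketch of a baseline agent: keep the three most recent screenshots
    and combine them with a game-rules prompt to form the model input."""

    def __init__(self, model, rules_prompt, history_len=3):
        self.model = model            # callable: dict -> raw text response
        self.rules_prompt = rules_prompt
        self.frames = deque(maxlen=history_len)  # sliding window of frames

    def act(self, screenshot):
        # Append the newest frame; the deque drops the oldest automatically.
        self.frames.append(screenshot)
        model_input = {"prompt": self.rules_prompt,
                       "images": list(self.frames)}
        raw_output = self.model(model_input)
        return self.parse_action(raw_output)

    def parse_action(self, raw_output):
        # Map the model's free-form text to a discrete game action
        # (hypothetical action vocabulary for illustration).
        for action in ("UP", "DOWN", "LEFT", "RIGHT", "NOOP"):
            if action in raw_output.upper():
                return action
        return "NOOP"
```

In use, the game loop would call `agent.act(screenshot)` once per frame and feed the returned action back into the game environment.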

Comparison

Comparison of V-MAGE with existing game-based evaluation benchmarks. *In V-MAGE, text is used only for game-rule instructions and output-format specifications.

Model Performance

OpenAI GPT-4o

Google Gemini 2.0 Flash

Qwen2.5VL-72B

Model performance across different games: gameplay examples from top-performing MLLMs, sampled from the V-MAGE evaluation process.

Results

Evaluation results (Elo rating). Performance comparison across games based on the Elo rating system.

Evaluation results (Score). The MLLMs trail humans by a large margin in all five games.

Error Case

Error analysis in RaceGame and FlappyBird.