MIND: Benchmarking Memory Consistency and Action Control in World Models
TL;DR: The first open-domain, closed-loop revisit benchmark for evaluating memory consistency and action control in world models
Yixuan Ye1*, Xuanyu Lu1*, Yuxin Jiang2*, Yuchao Gu2, Rui Zhao2, Qiwei Liang3, Jiachun Pan2,
Fengda Zhang4, Weijia Wu2†, Alex Jinpeng Wang1†
1 CSU-JPG, Central South University    2 National University of Singapore
3 Hong Kong University of Science and Technology (Guangzhou)    4 Nanyang Technological University
🌐 Homepage | 👉 Dataset | 📄 Paper | 💻 Code | 🏆 Leaderboard (coming soon)
Project Teaser

Abstract

World models aim to understand, remember, and predict dynamic visual environments, yet a unified benchmark for evaluating their fundamental abilities remains lacking. To address this gap, we introduce MIND, the first open-domain benchmark for evaluating Memory consIstency and action coNtrol in worlD models. MIND contains 250 high-quality videos at 1080p and 24 FPS: 100 first-person and 100 third-person clips under a shared action space, plus 25 + 25 clips across varied action spaces, covering eight diverse scenes. We design an efficient evaluation framework to measure two core abilities, memory consistency and action control, capturing temporal stability and contextual coherence across viewpoints. Furthermore, we design varied action spaces, with different character movement speeds and camera rotation angles, to evaluate action generalization across action spaces within shared scenes. To facilitate future benchmarking on MIND, we introduce MIND-World, a novel interactive Video-to-World baseline. Extensive experiments demonstrate the completeness of MIND and reveal key challenges for current world models, including the difficulty of maintaining long-term memory consistency and of generalizing across action spaces.
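To make the benchmark composition concrete, the sketch below shows one possible in-memory representation of a single MIND clip together with its per-step action annotations (movement, movement speed, camera rotation). All field names, value ranges, and defaults are illustrative assumptions, not the released dataset schema.

```python
# Hypothetical sketch of a MIND clip record; field names and defaults are
# assumptions for illustration, not the official dataset format.
from dataclasses import dataclass, field
from typing import List, Literal, Tuple

@dataclass
class ActionStep:
    move: Literal["forward", "backward", "left", "right", "idle"]  # character movement command
    move_speed: float        # relative movement speed, e.g. 0.8x or 1.0x of the base speed
    yaw_delta_deg: float     # camera rotation applied at this step
    pitch_delta_deg: float

@dataclass
class MindClip:
    clip_id: str
    scene: str                                            # one of the eight scene categories
    perspective: Literal["first_person", "third_person"]
    actions: List[ActionStep] = field(default_factory=list)  # per-step action sequence driving the rollout
    resolution: Tuple[int, int] = (1920, 1080)            # 1080p
    fps: int = 24
```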

Overview

MIND Overview Diagram

Overview of MIND. We build and collect the first open-domain benchmark using Unreal Engine 5, supporting both first-person and third-person perspectives at 1080p resolution and 24 FPS.

MIND Dataset

Distribution of Scene Categories and Action Spaces in the MIND Dataset. MIND supports open-domain scenarios with diverse and well-balanced action spaces.
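As a rough illustration of what the varied action spaces can look like in practice (the abstract mentions different character movement speeds and camera rotation angles), a configuration along the lines below could describe them. The names and numeric values are assumptions for illustration only, not the benchmark's actual settings.

```python
# Illustrative (hypothetical) action-space configurations. The shared space is
# used for the 100 first-person + 100 third-person clips; the varied spaces
# change movement speed and camera rotation step to probe generalization.
SHARED_ACTION_SPACE = {
    "move_speed": 1.0,           # relative character movement speed
    "camera_yaw_step_deg": 15,   # camera rotation applied per control step
}

VARIED_ACTION_SPACES = [
    {"move_speed": 0.8, "camera_yaw_step_deg": 15},  # slower movement (cf. the 0.8x clips)
    {"move_speed": 1.0, "camera_yaw_step_deg": 30},  # same speed, larger rotation step
]
```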

Demo comparison: Ground Truth vs. MIND-World (Ours) w/ Memory vs. MIND-World (Ours) w/o Memory vs. Matrix-game-2.0

Challenge 1: Open Domain

Comparison: Origin (ground-truth first frame) vs. model trained on the MIND dataset vs. model trained on a Minecraft dataset

How to achieve open-domain generalization with readily available large-scale data (e.g., Minecraft Datasets)?

Challenge 2: Action Space Generalization

Comparison at 0.8x and 1.0x movement speed: w/o Memory vs. Ground Truth vs. w/ Memory

Challenge 2 Diagram

How to overcome the interference caused by the mismatch between the implicit action space in context memory and the training distribution, thereby preventing catastrophic performance degradation of memory-augmented models during action-space generalization?

Challenge 3: Precise Action Control

Comparison: MIND-World (Ours) vs. Matrix-game-2.0

How to effectively decouple visual prompts from action dynamics, thereby preventing world models from being disturbed by the initial image/video when executing precise action sequences and achieving truly precise action control?

Challenge 4: Long-term Memory Consistency

Comparison: w/o Memory vs. Ground Truth vs. w/ Memory

How to effectively maintain and leverage long-context memory, thereby enabling world models to achieve and sustain high consistency with ground-truth trajectories in ultra-long-horizon rollouts?
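One concrete way to make "consistency with ground-truth trajectories" measurable is to score each generated frame against the ground-truth frame at the same timestep and watch how the score decays over the rollout. The sketch below uses PSNR purely as an illustrative per-frame metric; it is not necessarily the metric used in MIND's evaluation framework.

```python
# Minimal sketch: per-step consistency of a rollout against ground truth.
# PSNR is an illustrative choice, not MIND's official protocol.
import numpy as np

def psnr(pred: np.ndarray, target: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio between two frames of identical shape."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * float(np.log10(max_val ** 2 / mse))

def per_step_consistency(generated, ground_truth):
    """PSNR at every timestep of a rollout; a steep drop signals memory drift."""
    assert len(generated) == len(ground_truth), "rollout and ground truth must align"
    return [psnr(g, t) for g, t in zip(generated, ground_truth)]
```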

Challenge 5: Generated Scene Consistency

Comparisons: Matrix-game-2.0 vs. MIND-World (Ours)

How to ensure high consistency in world models when repeatedly generating the same region?
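A simple way to operationalize "repeatedly generating the same region" is to pair frames from the first visit with frames rendered on the return pass (e.g., aligned by camera pose) and measure how much they differ. The pairing strategy and the L1 distance below are illustrative assumptions, not MIND's evaluation protocol.

```python
# Hypothetical sketch of revisit consistency between pose-aligned frame pairs.
import numpy as np

def revisit_consistency(first_visit, revisit) -> float:
    """Mean absolute pixel difference between pose-aligned first-visit and
    revisit frames; lower means the regenerated region matches the earlier one."""
    assert len(first_visit) == len(revisit), "frame pairs must be pose-aligned"
    diffs = [
        np.mean(np.abs(a.astype(np.float64) - b.astype(np.float64)))
        for a, b in zip(first_visit, revisit)
    ]
    return float(np.mean(diffs))
```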

Challenge 6: Third-person Perspective

Comparisons: Matrix-game-2.0 vs. MIND-World (Ours)

How to endow world models with the ability to correctly model physical interactions and spatial relationships between agents and the environment?