Seeing is Not Reasoning: MVPBench for Graph-based Evaluation of Multi-path Visual Physical CoT

1Central South University,  2University of Electronic Science and Technology of China
3Peng Cheng Laboratory,  4Nanjing University,  5Microsoft

*Indicates Equal Contribution; Corresponding Author
 
▶️ Overview Video

Figure 1: A one-minute sanity check shatters the illusion of spatial reasoning in MLLMs. Red arrows indicate target objects, and multiple reasoning chains are provided to capture diverse yet valid solution strategies.
Abstract: Understanding the physical world—governed by laws of motion, spatial relations, and causality—poses a fundamental challenge for multimodal large language models (MLLMs). While recent advances such as OpenAI o3 and GPT-4o demonstrate impressive perceptual and reasoning capabilities, our investigation reveals that these models struggle profoundly with visual physical reasoning, failing to grasp basic physical laws, spatial interactions, and causal effects in complex scenes. More importantly, they often fail to follow coherent reasoning chains grounded in visual evidence, especially when multiple steps are needed to arrive at the correct answer. To rigorously evaluate this capability, we introduce MVPBench, a curated benchmark that assesses visual physical reasoning through the lens of visual chain-of-thought (CoT). Each of the 1,211 examples features interleaved multi-image inputs and demands not only the correct final answer but also a coherent, step-by-step reasoning path grounded in evolving visual cues. This setup mirrors how humans reason through real-world physical processes over time. To ensure fine-grained evaluation, we introduce a graph-based CoT consistency metric that verifies whether a model's reasoning path adheres to valid physical logic. Additionally, we minimize shortcut exploitation from text priors, encouraging models to rely on visual understanding. Experimental results reveal a concerning trend: even cutting-edge MLLMs exhibit poor visual reasoning accuracy and weak image-text alignment in physical domains. Surprisingly, RL-based post-training alignment—commonly believed to improve visual reasoning performance—often harms spatial reasoning, suggesting a need to rethink current fine-tuning practices.
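
To give a concrete sense of the graph-based CoT consistency metric mentioned above, the sketch below is an illustrative approximation rather than the benchmark's actual implementation. It assumes each example's valid solution strategies are annotated as a directed graph over reasoning states, and it scores a model's extracted chain-of-thought by the fraction of consecutive step transitions that correspond to edges in that graph. All names here (ReasoningGraph, cot_consistency, the toy states) are hypothetical.

from collections import defaultdict
from typing import Iterable

class ReasoningGraph:
    """Directed graph over annotated reasoning states for one example.
    (Hypothetical helper; the benchmark's real data format may differ.)"""

    def __init__(self, edges: Iterable[tuple[str, str]]):
        self.adj = defaultdict(set)
        for src, dst in edges:
            self.adj[src].add(dst)

    def has_edge(self, src: str, dst: str) -> bool:
        return dst in self.adj[src]

def cot_consistency(graph: ReasoningGraph, steps: list[str]) -> float:
    """Fraction of consecutive CoT step transitions that are valid edges
    in the example's reasoning graph."""
    if len(steps) < 2:
        return 0.0
    valid = sum(graph.has_edge(a, b) for a, b in zip(steps, steps[1:]))
    return valid / (len(steps) - 1)

if __name__ == "__main__":
    # Two valid strategies for a toy "ball rolls off the table" example:
    #   observe -> estimate_velocity -> apply_gravity -> predict_landing
    #   observe -> compare_heights  -> apply_gravity -> predict_landing
    g = ReasoningGraph([
        ("observe", "estimate_velocity"),
        ("observe", "compare_heights"),
        ("estimate_velocity", "apply_gravity"),
        ("compare_heights", "apply_gravity"),
        ("apply_gravity", "predict_landing"),
    ])
    model_chain = ["observe", "estimate_velocity", "apply_gravity", "predict_landing"]
    print(cot_consistency(g, model_chain))  # 1.0: every transition follows a valid edge

Because multiple annotated paths share the same graph, any of the diverse yet valid solution strategies scores as fully consistent, while a chain that skips or reorders physically necessary steps is penalized.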
Dataset Overview

Figure 2: Examples from MVPBench across four major categories. Each example includes an initial scene followed by multiple reasoning steps. Target objects are marked with red arrows and labeled with letters to reduce textual bias.
Qualitative Results
Performance comparison between single-image and multi-image inputs on CoT evaluation
CoT performance of MLLMs with versus without post-training

BibTeX

@article{dong2025mvpbench,
  title={Seeing is Not Reasoning: MVPBench for Graph-based Evaluation of Multi-path Visual Physical CoT},
  author={Dong, Zhuobai and Yi, Junchao and Zheng, Ziyuan and Han, Haochen and Zheng, Xiangxi and Wang, Alex Jinpeng and Liu, Fangming and Li, Linjie and others},
  year={2025}
}
            