VLA-Reasoner: Empowering Vision-Language-Action Models with Reasoning via Online Monte Carlo Tree Search

Abstract

Vision-Language-Action models (VLAs) achieve strong performance in general robotic manipulation tasks by scaling imitation learning. However, existing VLAs predict only the next action, a short-sighted strategy that struggles on long-horizon tasks because small deviations accumulate along the trajectory. To address this problem, we propose VLA-Reasoner, a plug-in framework that equips off-the-shelf VLAs with the ability to foresee future states via test-time scaling.

Specifically, VLA-Reasoner samples and rolls out possible action trajectories, in which the sampled actions act as rationales for generating future states via a world model. This lets VLA-Reasoner foresee and reason about potential outcomes and search for the optimal actions. We further leverage Monte Carlo Tree Search (MCTS) to improve search efficiency in large action spaces, with stepwise VLA predictions seeding the root. We also introduce a confidence sampling mechanism based on Kernel Density Estimation (KDE), which enables efficient exploration in MCTS without redundant VLA queries. Intermediate states in MCTS are evaluated with an offline reward shaping strategy that scores predicted futures and corrects deviations with long-term feedback.
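To make the KDE-based confidence sampling idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a small set of reference actions (here called `data`), a fixed Gaussian bandwidth, and hypothetical helper names (`kde_density`, `kde_sample`, `propose_actions`). Candidates are drawn from the estimated density instead of querying the VLA again, and the density value is used as a confidence score.

```python
import math
import random

def kde_density(x, data, bandwidth):
    """Gaussian KDE density at point x; x and each entry of data are lists of floats."""
    d = len(x)
    norm = (2.0 * math.pi * bandwidth ** 2) ** (d / 2.0)
    total = 0.0
    for p in data:
        sq_dist = sum((xi - pi) ** 2 for xi, pi in zip(x, p))
        total += math.exp(-0.5 * sq_dist / bandwidth ** 2)
    return total / (len(data) * norm)

def kde_sample(data, bandwidth, rng):
    """Draw one sample from the KDE: pick a kernel centre, then add Gaussian noise."""
    centre = rng.choice(data)
    return [c + rng.gauss(0.0, bandwidth) for c in centre]

def propose_actions(data, bandwidth, n, rng):
    """Return n (confidence, action) pairs sorted by descending KDE density."""
    candidates = [kde_sample(data, bandwidth, rng) for _ in range(n)]
    scored = [(kde_density(a, data, bandwidth), a) for a in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored

# Hypothetical 2-D reference actions, used for illustration only.
reference_actions = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]]
proposals = propose_actions(reference_actions, 0.05, 5, random.Random(0))
```

In a tree search, the highest-confidence proposals could be expanded first, so that exploration stays close to expert-like behaviour without extra policy calls.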

We conduct extensive experiments in both simulation and the real world, demonstrating that VLA-Reasoner achieves significant improvements over state-of-the-art VLAs. Our method highlights a potential pathway toward scalable test-time computation for robotic manipulation.

Method Overview

The overall pipeline of VLA-Reasoner. At test time, a lightweight, modified MCTS searches for the optimal action conditioned on the VLA prediction. The search is steered by expert-like sampling and dense reward shaping, which guide expansion and backup throughout the tree. The method is plug-and-play: it can be attached to any VLA-based manipulation policy, and it consistently improves performance across tasks, environments, and robot embodiments.
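The search loop described above can be sketched on a toy problem. This is an illustrative sketch under strong assumptions, not the paper's algorithm: the state is a single float, `world_model` and `reward` are hypothetical placeholders for the learned world model and the shaped reward, and Gaussian perturbation of the VLA action stands in for the expert-like sampling step.

```python
import math
import random

class Node:
    def __init__(self, state, action=None, parent=None):
        self.state = state
        self.action = action      # action that led to this node
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value_sum = 0.0

    def mean_value(self):
        return self.value_sum / self.visits if self.visits else 0.0

def world_model(state, action):
    # Toy placeholder: a 1-D point that moves by `action`.
    return state + action

def reward(state, goal):
    # Toy dense reward: negative distance to the goal.
    return -abs(goal - state)

def ucb(child, parent_visits, c=1.0):
    if child.visits == 0:
        return float("inf")
    return child.mean_value() + c * math.sqrt(math.log(parent_visits) / child.visits)

def search(state, vla_action, goal, rng, iters=200, width=4, depth=3, noise=0.2):
    root = Node(state)
    for _ in range(iters):
        node, level = root, 0
        while level < depth:
            if len(node.children) < width:
                # Expansion: the root's first child is the VLA's own proposal,
                # every other child is a perturbed variant of it.
                if node is root and not node.children:
                    action = vla_action
                else:
                    action = vla_action + rng.gauss(0.0, noise)
                child = Node(world_model(node.state, action), action, node)
                node.children.append(child)
                node = child
                break
            # Selection: descend to the child with the highest UCB score.
            node = max(node.children, key=lambda ch: ucb(ch, node.visits))
            level += 1
        value = reward(node.state, goal)   # evaluate the predicted future state
        while node is not None:            # backup along the path to the root
            node.visits += 1
            node.value_sum += value
            node = node.parent
    # Return the first-step action that the search visited most often.
    return max(root.children, key=lambda ch: ch.visits).action

refined_action = search(0.0, 0.3, 1.0, random.Random(0))
```

Only the first action of the most-visited branch is returned, so the sketch can wrap a policy step by step: the policy proposes `vla_action`, and the search returns a possibly adjusted action to execute.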

Experimental Results

Results in Simulation

Average success rates over 500 episodes for LIBERO and 100 episodes for SimplerEnv. Our method outperforms OpenVLA-SFT on all 4 direction tasks and Octo-Small/SpatialVLA on 4 tasks. Bold entries mark the highest success rates, and underlined entries mark the second-best. Asterisks mark the chosen baselines, which we evaluated locally for fairness.

Results in the Real World

Average success rates on 5 tasks in different scenarios. Each task is evaluated 20 times. Our method clearly improves over OpenVLA and π0-FAST on all tasks.

Performance Visualization

Comparison between VLA-Reasoner and baseline methods on the stack-cube task. Our method demonstrates more robust and successful manipulation.

VLA-Reasoner (Ours)

π0-FAST (Baseline)

Long Horizon Manipulation

VLA-Reasoner successfully handles complex multi-step manipulation tasks that require careful planning and sequential reasoning.

Spatial Generalization

VLA-Reasoner demonstrates strong spatial generalization capabilities, adapting to varied object positions.

BibTeX

@article{guo2025vla,
  title={{VLA-Reasoner}: Empowering Vision-Language-Action Models with Reasoning via Online {Monte Carlo} Tree Search},
  author={Guo, Wenkai and Lu, Guanxing and Deng, Haoyuan and Wu, Zhenyu and Tang, Yansong and Wang, Ziwei},
  journal={arXiv preprint arXiv:2509.22643},
  year={2025}
}