We propose ChatVLA-2, a novel VLA model that enables generalized open-world embodied reasoning and reasoning-following abilities.
We argue that a generalizable VLA model should retain and expand upon the VLM's core competencies: 1) Open-world embodied reasoning – the VLA should inherit the knowledge of the VLM, i.e., recognize anything the VLM can recognize, solve math problems, and possess visual-spatial intelligence; 2) Reasoning following – effectively translating that open-world reasoning into actionable steps for the robot. In this work, we introduce ChatVLA-2, a novel mixture-of-experts VLA model coupled with a specialized two-stage training pipeline designed to preserve the VLM's original strengths while enabling actionable reasoning.
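To make the mixture-of-experts idea concrete, the sketch below is an illustrative, simplified reading rather than the paper's implementation: the names and choices (MoEVLAHead, understanding_expert, action_expert, the per-token router, hidden and action dimensions) are our own assumptions, intended only to show how features from a pretrained VLM backbone could be routed between an understanding expert and an action expert.

```python
# Illustrative sketch only: module names, sizes, and routing are hypothetical,
# not the ChatVLA-2 architecture; they show the general mixture-of-experts shape.
import torch
import torch.nn as nn


class MoEVLAHead(nn.Module):
    """Routes shared VLM features to an understanding expert (reasoning tokens)
    and an action expert (robot control), mixing them with per-token gates."""

    def __init__(self, hidden_dim: int = 1024, action_dim: int = 7):
        super().__init__()
        # Expert intended to preserve multimodal understanding / reasoning.
        self.understanding_expert = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, hidden_dim)
        )
        # Expert that maps reasoning-conditioned features to robot actions.
        self.action_expert = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, action_dim)
        )
        # Lightweight router producing per-token mixing weights over the two experts.
        self.router = nn.Linear(hidden_dim, 2)

    def forward(self, vlm_features: torch.Tensor):
        # vlm_features: (batch, seq_len, hidden_dim) from a pretrained VLM backbone.
        gates = torch.softmax(self.router(vlm_features), dim=-1)  # (B, T, 2)
        reasoning_out = self.understanding_expert(vlm_features)   # (B, T, hidden_dim)
        action_out = self.action_expert(vlm_features)             # (B, T, action_dim)
        # Weight each expert's contribution by its gate.
        return gates[..., 0:1] * reasoning_out, gates[..., 1:2] * action_out


# Minimal usage with random features standing in for the VLM backbone output.
features = torch.randn(1, 16, 1024)
reasoning_feats, actions = MoEVLAHead()(features)
print(reasoning_feats.shape, actions.shape)
```

In this hedged reading, the gating keeps the reasoning pathway and the action pathway separate enough that the VLM's original competencies are not overwritten, while the action expert can still condition on the same shared features.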
To validate our approach, we design a math-matching task wherein a robot interprets math problems written on a whiteboard and picks corresponding number cards from a table to solve equations. Remarkably, our method exhibits exceptional mathematical reasoning and OCR capabilities, despite these abilities not being explicitly trained within the VLA. Furthermore, we demonstrate that the VLA possesses strong spatial reasoning skills, enabling it to interpret novel directional instructions involving previously unseen objects. Overall, our method showcases reasoning and comprehension abilities that significantly surpass state-of-the-art imitation learning methods such as OpenVLA, DexVLA, and \(\pi_0\). This work represents a substantial advancement toward developing truly generalizable robotic foundation models endowed with robust reasoning capacities.
Below are examples of the math-matching game, demonstrating open-world reasoning and reasoning-following ability. None of these examples appear in the training data.
Below are examples of the toy-placement task, demonstrating open-world reasoning and reasoning-following ability. None of these examples appear in the training data.
User: "Pick the [obj] and place it to [place] of the [target]"
@misc{zhou2025visionlanguageactionmodelopenworldembodied,
  title={Vision-Language-Action Model with Open-World Embodied Reasoning from Pretrained Knowledge},
  author={Zhongyi Zhou and Yichen Zhu and Junjie Wen and Chaomin Shen and Yi Xu},
  year={2025},
  eprint={2505.21906},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2505.21906},
}
@misc{zhou2025chatvlaunifiedmultimodalunderstanding,
  title={ChatVLA: Unified Multimodal Understanding and Robot Control with Vision-Language-Action Model},
  author={Zhongyi Zhou and Yichen Zhu and Minjie Zhu and Junjie Wen and Ning Liu and Zhiyuan Xu and Weibin Meng and Ran Cheng and Yaxin Peng and Chaomin Shen and Feifei Feng},
  year={2025},
  eprint={2502.14420},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2502.14420},
}