DepthVLA: Enhancing Vision-Language-Action Models with Depth-Aware Spatial Reasoning

Yuan, Liu, Lu et al.
Vision-Language-Action (VLA) models have recently shown impressive generalization and language-guided manipulation capabilities. However, their performance degrades on tasks requiring precise spatial reasoning due to limited spatial reasoning inherited from Vision-Language Models (VLMs). Existing VLAs rely on extensive action-data pretraining to ground VLMs in 3D space, which reduces training efficiency and is still insufficient for accurate spatial understanding. In this work, we present DepthVLA, a simple yet effective VLA architecture that explicitly incorporates spatial awareness through a pretrained depth prediction module. DepthVLA adopts a mixture-of-transformers design that unifies a VLM, a depth transformer, and an action expert with fully shared attentions, forming an end-to-end model with enhanced spatial reasoning. Extensive evaluations in both real-world and simulated environments show that DepthVLA outperforms state-of-the-art approaches, achieving 78.5% vs. 65.0% progress in real-world tasks, 94.9% vs. 93.6% in the LIBERO simulator, and 74.8% vs. 58.8% in the Simpler simulator. Our code will be made publicly available.

Basic Information

  • Paper ID: 2510.13375
  • Title: DepthVLA: Enhancing Vision-Language-Action Models with Depth-Aware Spatial Reasoning
  • Authors: Tianyuan Yuan, Yicheng Liu, Chenhao Lu, Zhuoguang Chen, Tao Jiang, Hang Zhao
  • Institutions: IIIS, Tsinghua University & Galaxea AI
  • Category: cs.CV (Computer Vision)
  • Publication Date: October 15, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.13375

Abstract

Vision-Language-Action (VLA) models generalize well and handle language-guided manipulation, but their performance degrades on tasks requiring precise spatial reasoning because they inherit limited spatial reasoning from vision-language models (VLMs). Existing VLAs rely on large-scale action-data pretraining to ground VLMs in 3D space, which reduces training efficiency and remains insufficient for accurate spatial understanding. This paper proposes DepthVLA, a simple yet effective VLA architecture that explicitly incorporates spatial awareness through a pretrained depth prediction module. DepthVLA employs a mixture-of-transformers design that unifies a VLM, a depth transformer, and an action expert through fully shared attention, forming an end-to-end model with enhanced spatial reasoning. Extensive evaluations in real-world and simulated environments show that DepthVLA surpasses state-of-the-art methods, achieving 78.5% vs 65.0% progress on real-world tasks, 94.9% vs 93.6% on the LIBERO simulator, and 74.8% vs 58.8% on the Simpler simulator.

Research Background and Motivation

Core Problem

Existing Vision-Language-Action (VLA) models perform poorly on robotic manipulation tasks requiring precise spatial reasoning, primarily due to:

  1. Limited spatial reasoning capabilities: VLAs inherit spatial reasoning limitations from VLMs, underperforming on precise manipulation tasks
  2. Low training efficiency: Existing methods rely on large-scale action data pretraining to ground VLMs in 3D space, yet still fail to sufficiently understand spatial information
  3. Practical application challenges: VLAs frequently fail at grasping small objects, executing precise operations, or avoiding collisions

Problem Significance

Precise spatial reasoning is critical for robotic manipulation, particularly in:

  • Grasping small objects or fine-grained manipulation
  • Collision-aware path planning
  • Stacking tasks requiring precise position estimation
  • Multi-step operations in complex environments

Limitations of Existing Methods

  1. Generative world model approaches: Lack explicit 3D knowledge and offer only limited improvement to the encoding of the current scene
  2. Chain-of-Thought reasoning: Introduces significant latency (over 2 seconds) because it autoregressively generates hundreds of spatial tokens
  3. External depth estimators: Methods like SpatialVLA use off-the-shelf depth estimators that are not optimized end-to-end with the VLA, which caps achievable performance

Core Contributions

  1. DepthVLA Architecture: Proposes a novel VLA model that integrates a pretrained depth prediction expert into a mixture-of-transformers framework, achieving explicit spatial reasoning while preserving the VLM's semantic foundation
  2. Per-Expert Pretraining Strategy: The mixture-of-transformers design allows each expert (VLM and depth) to be pretrained separately on different datasets, improving training efficiency and letting training scale beyond embodied action data
  3. Comprehensive Real-World and Simulated Validation: Shows that DepthVLA outperforms state-of-the-art VLAs in real-world and simulated environments (LIBERO, Simpler), with gains in grasping accuracy, collision avoidance, and overall task success

Method Details

Task Definition

Following the standard end-to-end VLA setup, the policy $\pi_\theta$ predicts a k-length action chunk $A_t$ from the current observation $o_t$ (from one or more cameras), the language instruction $l$, and the proprioceptive state $s_t$:

$$A_t = \pi_\theta(o_t, l, s_t)$$
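
As a rough illustration of this interface, the sketch below models the policy as a callable that maps an observation, instruction, and proprioceptive state to a chunk of k actions. The names `PolicyInput`, `make_dummy_policy`, and `rollout` are hypothetical, the 7-dimensional state is arbitrary, and executing a whole chunk before re-querying is an assumption rather than something the summary specifies.

```python
from dataclasses import dataclass
from typing import Callable, List

Action = List[float]


@dataclass
class PolicyInput:
    images: List[str]      # placeholder for camera frames o_t (one or more views)
    instruction: str       # language instruction l
    proprio: List[float]   # proprioceptive state s_t


Policy = Callable[[PolicyInput], List[Action]]


def make_dummy_policy(k: int, action_dim: int) -> Policy:
    """Return a stand-in policy that always emits k zero actions."""
    def policy(inp: PolicyInput) -> List[Action]:
        return [[0.0] * action_dim for _ in range(k)]
    return policy


def rollout(policy: Policy, steps: int, k: int) -> int:
    """Query the policy once per chunk and count the executed actions."""
    executed = 0
    state = [0.0] * 7  # arbitrary placeholder state
    while executed < steps:
        chunk = policy(PolicyInput(["frame"], "stack the blocks", state))
        assert len(chunk) == k
        for action in chunk[: steps - executed]:
            executed += 1  # a real system would send `action` to the robot here
    return executed
```

With k = 4 and a 10-step episode, this loop queries the policy three times and executes only the first two actions of the last chunk.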

Model Architecture

DepthVLA employs a Mixture-of-Transformers (MoT) architecture integrating three experts:

1. Overall Design

  • VLM Expert: Encodes observations and language instructions, capturing semantic and linguistic foundational features
  • Depth Expert: Processes observations to infer geometric information
  • Action Expert: Generates continuous actions based on combined features from semantic and geometric experts

2. Depth Expert Design

  • Encoder-Decoder Architecture: Encoder based on DINOv2, initialized from Depth Anything V2 pretrained checkpoint
  • Decoder Structure: Matches VLM's Transformer structure, outputs depth predictions through linear head
  • Intermediate Feature Utilization: Performs spatial reasoning at all intermediate layers, providing rich geometric cues for action prediction

3. Attention Mechanism

DepthVLA employs a block-level masking strategy:

  • VLM and depth expert tokens attend only to themselves
  • Action tokens can attend to all streams
  • Preserves learning capacity of pretrained modules while fusing semantic and spatial cues
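
The masking rule above can be sketched as a boolean matrix over the concatenated token sequence. This is a minimal sketch: the function name `build_block_mask`, the token ordering (VLM, then depth, then action), and full bidirectional attention inside each block are assumptions, not details confirmed by the paper.

```python
def build_block_mask(n_vlm: int, n_depth: int, n_action: int):
    """Return mask[i][j], True when query token i may attend to key token j."""
    blocks = ["vlm"] * n_vlm + ["depth"] * n_depth + ["action"] * n_action
    allowed = {
        "vlm": {"vlm"},                        # VLM tokens see only VLM tokens
        "depth": {"depth"},                    # depth tokens see only depth tokens
        "action": {"vlm", "depth", "action"},  # action tokens see every stream
    }
    return [[key in allowed[query] for key in blocks] for query in blocks]


mask = build_block_mask(n_vlm=2, n_depth=2, n_action=1)
for row in mask:
    print("".join("1" if m else "0" for m in row))
# prints:
# 11000
# 11000
# 00110
# 00110
# 11111
```

Because no VLM or depth row has a 1 in an action column, gradients from noisy action tokens cannot alter how the pretrained streams attend, which is one way to read the "preserves pretrained modules" claim.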

Technical Innovations

1. Explicit Spatial Reasoning

Unlike implicit methods, DepthVLA provides explicit 3D geometric understanding through a dedicated depth expert, avoiding reliance on large-scale action data.

2. Mixture-of-Transformers Design

  • Allows different experts to be pretrained on their respective optimal datasets
  • Achieves effective fusion through shared attention layers
  • Maintains specialized capabilities of each expert
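
To make the "separate experts, shared attention" idea concrete, here is a toy single-head attention layer in plain Python. Each token is projected with the Q/K/V weights of its own expert, and one softmax then runs over all tokens under a block mask. The function `mot_attention`, the parameter layout, and the 2-dimensional tokens are illustrative assumptions; the actual model uses full transformer layers.

```python
import math


def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def mot_attention(tokens, expert_ids, params, mask):
    """Single-head attention with per-expert projections and one joint softmax.

    tokens:     list of d-dim vectors, all streams concatenated
    expert_ids: expert name for each token
    params:     {expert: {"q": W, "k": W, "v": W}} with d x d matrices
    mask:       mask[i][j] is True if token i may attend to token j
    """
    q = [matvec(params[e]["q"], x) for x, e in zip(tokens, expert_ids)]
    k = [matvec(params[e]["k"], x) for x, e in zip(tokens, expert_ids)]
    v = [matvec(params[e]["v"], x) for x, e in zip(tokens, expert_ids)]
    d = len(tokens[0])
    out = []
    for i in range(len(tokens)):
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
            if mask[i][j] else -math.inf
            for j in range(len(tokens))
        ]
        top = max(scores)  # each token can attend to itself, so top is finite
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w / total * v[j][c] for j, w in enumerate(weights))
            for c in range(d)
        ])
    return out


identity = [[1.0, 0.0], [0.0, 1.0]]
params = {e: {"q": identity, "k": identity, "v": identity}
          for e in ("vlm", "depth", "action")}
expert_ids = ["vlm", "depth", "action"]
mask = [[True, False, False], [False, True, False], [True, True, True]]
out = mot_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], expert_ids, params, mask)
```

In this toy run the VLM and depth outputs equal their own value vectors, while the action output is a weighted mix of all three streams.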

3. End-to-End Optimization

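The depth expert is trained jointly with the rest of the VLA using a combined loss:

$$\mathcal{L} = \mathcal{L}_{\text{si}} + \mathcal{L}_{\text{flow}}$$

where $\mathcal{L}_{\text{si}}$ is the scale-invariant depth loss and $\mathcal{L}_{\text{flow}}$ is the flow-matching loss on actions.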
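
The summary does not give the exact form of either term, so the following is only a sketch under stated assumptions. The depth term uses the classic scale-invariant log-depth error (mean squared log difference minus `lam` times the squared mean log difference), and the flow term uses a mean squared error against a straight-line velocity target `action - noise`. Sign and time conventions for flow matching differ between papers, and the paper may weight the two terms differently.

```python
import math


def scale_invariant_log_loss(pred, target, lam=1.0):
    """Scale-invariant log-depth error over paired positive depth values."""
    diffs = [math.log(p) - math.log(t) for p, t in zip(pred, target)]
    n = len(diffs)
    mean_sq = sum(d * d for d in diffs) / n
    sq_mean = (sum(diffs) / n) ** 2
    return mean_sq - lam * sq_mean


def flow_matching_loss(pred_velocity, noise, action):
    """MSE between a predicted velocity and the assumed target (action - noise)."""
    target = [a - e for a, e in zip(action, noise)]
    return sum((p - t) ** 2 for p, t in zip(pred_velocity, target)) / len(target)


def total_loss(l_si, l_flow):
    """Unweighted sum of the two terms, as written in the summary."""
    return l_si + l_flow


depth_gt = [1.0, 2.0, 4.0]
depth_pred = [2.0, 4.0, 8.0]  # same scene, off by a global factor of 2
l_si = scale_invariant_log_loss(depth_pred, depth_gt)
```

With `lam=1.0`, a prediction that differs from the target only by a global scale factor incurs (numerically) zero depth loss, which is the property that makes the term usable with depth labels of unknown scale.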

Experimental Setup

Datasets

  1. Pretraining Datasets:
    • Depth Expert: WildRGB-D, ScanNet, ScanNet++, HyperSim
    • VLA: Galaxea Open-World Dataset (100k trajectories), BridgeData V2 (60k trajectories)
  2. Evaluation Datasets:
    • Simpler WidowX: 4 task suites, 120 trials
    • LIBERO: 4 task suites (Spatial/Object/Goal/Long), 2000 trials
    • Real-World: 3 benchmark tasks, 20 runs per task

Evaluation Metrics

  • Success Rate: Percentage of completed tasks
  • Progress Score: Each successful substep contributes one point, averaged across all runs
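
A small sketch of how the two metrics could be computed from per-run records follows. The `(completed, total)` record format, the sample data, and the normalization by the number of substeps (so that the score reads as a percentage) are assumptions for illustration.

```python
def success_rate(runs):
    """Fraction of runs in which every substep was completed."""
    return sum(done == total for done, total in runs) / len(runs)


def progress_score(runs):
    """Mean fraction of substeps completed per run."""
    return sum(done / total for done, total in runs) / len(runs)


# Hypothetical (completed_substeps, total_substeps) records for four runs.
runs = [(4, 4), (2, 4), (3, 4), (4, 4)]
print(success_rate(runs))    # 0.5
print(progress_score(runs))  # 0.8125
```

The progress score gives partial credit to runs that fail late, which is why it can be well above the success rate on multi-step tasks.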

Comparison Methods

  • Diffusion Policy
  • Octo-Base
  • SpatialVLA
  • π0 (reimplemented)
  • OpenVLA
  • CoT-VLA
  • MolmoACT
  • DreamVLA

Implementation Details

  • Model: Paligemma-3B as VLM backbone, DINOv2-L as depth encoder
  • Training: 32 NVIDIA H100 GPUs, AdamW optimizer
  • Inference: NVIDIA 4090 GPU, BF16 mixed precision, 210ms latency

Experimental Results

Main Results

1. Simpler WidowX Benchmark

Model              | Pretrain | Put Spoon | Put Carrot | Stack Block | Pick Eggplant | Average
π0 (reimplemented) | ×        | 81.7%     | 64.2%      | 30.0%       | 59.2%         | 58.8%
DepthVLA           | ×        | 75.8%     | 71.7%      | 62.5%       | 89.2%         | 74.8%

2. LIBERO Benchmark

Model              | Pretrain | Spatial | Object | Goal  | Long  | Average
π0 (reimplemented) | ×        | 95.8%   | 96.4%  | 94.8% | 87.4% | 93.6%
DepthVLA           | ×        | 96.4%   | 98.0%  | 95.8% | 89.2% | 94.9%

3. Real-World Benchmark

  • Overall Performance: DepthVLA achieves a 78.5% average progress score vs 65.0% for the baseline
  • Microwave Operation: Demonstrates excellent collision avoidance performance
  • Block Stacking: Exhibits superior spatial awareness capabilities
  • Desktop Organization: Performs comparably on small object grasping tasks

Ablation Studies

Setting                    | Spoon | Carrot | Block | Eggplant | Average
Depth Expert Random Init   | 60.0% | 60.8%  | 43.3% | 40.0%    | 51.0%
Remove Depth Loss          | 69.2% | 60.0%  | 28.3% | 70.0%    | 56.9%
Freeze Depth Expert        | 65.8% | 69.2%  | 74.2% | 78.3%    | 71.9%
Remove Block-Level Masking | 66.7% | 65.0%  | 2.5%  | 88.3%    | 55.6%
DepthVLA Complete          | 75.8% | 71.7%  | 62.5% | 89.2%    | 74.8%
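
As a quick consistency check, the reported averages can be recomputed from the per-task success rates in the ablation table. The snippet assumes the Average column is the unweighted mean of the four Simpler tasks; the dictionary names are illustrative.

```python
# Per-task success rates (Spoon, Carrot, Block, Eggplant) from the table above.
ablation = {
    "Depth Expert Random Init": [60.0, 60.8, 43.3, 40.0],
    "Remove Depth Loss": [69.2, 60.0, 28.3, 70.0],
    "Freeze Depth Expert": [65.8, 69.2, 74.2, 78.3],
    "Remove Block-Level Masking": [66.7, 65.0, 2.5, 88.3],
    "DepthVLA Complete": [75.8, 71.7, 62.5, 89.2],
}
reported = {
    "Depth Expert Random Init": 51.0,
    "Remove Depth Loss": 56.9,
    "Freeze Depth Expert": 71.9,
    "Remove Block-Level Masking": 55.6,
    "DepthVLA Complete": 74.8,
}

recomputed = {name: sum(s) / len(s) for name, s in ablation.items()}
for name, mean in recomputed.items():
    # Each recomputed mean should match the reported value after rounding.
    print(f"{name}: recomputed {mean:.3f}, reported {reported[name]}")
```

Every row agrees with the reported average to within rounding. The per-task columns also show that the gap between variants is concentrated in Stack Block, where removing block-level masking drops success to 2.5%.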

Key Findings

  1. Depth Pretraining is Critical: Randomly initialized depth expert shows significantly degraded performance
  2. Depth Loss is Necessary: Removing depth loss leads to performance degradation
  3. Block-Level Masking is Effective: Maintaining expert independence is crucial for performance
  4. Prediction Outperforms Direct Input: Predicting depth performs better than directly using ground-truth depth

Related Work

General-Purpose Robotic Manipulation Policies

General-purpose manipulation policies have evolved from single-task specialists to generalist models, driven by advances in large language models, vision-language models, and large-scale robotic action datasets. Early VLAs generate action tokens by fine-tuning a VLM, while more recent VLAs employ diffusion-based action experts.

Spatial-Aware VLAs

  • Early Methods: Use additional 3D inputs like LiDAR or RGB-D cameras, but reduce cross-platform generality
  • SpatialVLA: Uses off-the-shelf depth estimators to generate pseudo point clouds, but lacks end-to-end optimization
  • Generative World Models: Predict future frames, keypoints, or semantic states, but provide limited improvements to current scene encoding
  • CoT Reasoning: Autoregressively generates depth tokens, but introduces high latency

3D Geometric Awareness

Recent advances in 3D perception demonstrate strong capabilities in inferring geometry from monocular or multi-view images, offering potential for improving VLA spatial reasoning.

Conclusions and Discussion

Main Conclusions

  1. Explicit Spatial Reasoning is Effective: Significantly improves VLA performance on precise manipulation tasks through pretrained depth experts
  2. Mixture-of-Transformers Design is Superior: Allows different experts to be pretrained on their respective optimal datasets, improving efficiency
  3. End-to-End Optimization is Key: Joint optimization of depth prediction and action generation is more effective than using external depth estimators

Limitations

  1. Monocular Depth Prediction Challenges: May still fail in difficult scenarios (tiny edges, reflective or transparent objects, textureless surfaces)
  2. Computational Overhead: Adds 600M parameters and 20ms inference latency
  3. Dependence on Depth Labels: Requires generating pseudo depth labels for training

Future Directions

  1. Multi-View Depth Prediction: Explore multi-view depth or point map prediction to enhance spatial precision and robustness
  2. More Efficient Architectures: Reduce computational overhead while maintaining performance
  3. Unsupervised Spatial Learning: Reduce dependence on depth labels

In-Depth Evaluation

Strengths

  1. Strong Method Innovation: First to effectively integrate pretrained depth experts into VLA, providing explicit spatial reasoning
  2. Comprehensive Experiments: Covers real-world and multiple simulated environments with detailed ablation studies
  3. Significant Performance Improvements: Improves the average score on all three benchmarks (real-world, LIBERO, Simpler), although on the Simpler Put Spoon task it trails the π0 baseline (75.8% vs 81.7%)
  4. Reasonable Design: The mixture-of-transformers architecture preserves the specialized capabilities of each expert while achieving effective fusion
  5. Strong Practicality: Minimal inference latency increase, suitable for real-time deployment

Weaknesses

  1. Depth Quality Dependency: Performance limited by depth prediction quality, may fail in challenging scenarios
  2. Label Generation Cost: Requires generating pseudo depth labels for training data, increasing data preparation costs
  3. Insufficient Theoretical Analysis: Lacks in-depth theoretical analysis of why depth prediction outperforms direct depth input
  4. Limited Generalization Validation: Primarily validated on specific types of manipulation tasks; generalization to other task types requires further verification

Impact

  1. Field Contribution: Provides a new, effective method for enhancing VLA spatial reasoning, potentially influencing future research directions
  2. Practical Value: Simple and effective method, easily implementable in existing VLA systems
  3. Reproducibility: Authors commit to releasing code, facilitating research reproduction and further development

Applicable Scenarios

  1. Precise Manipulation Tasks: Particularly suitable for robotic manipulation tasks requiring precise spatial reasoning
  2. Multi-Modal Robotic Systems: Applicable to various robotic platforms with RGB cameras
  3. Industrial Applications: Potential applications in manufacturing, service robots, and other scenarios requiring precise manipulation

References

The paper cites extensive related work including:

  • VLA Models: OpenVLA, π0, Octo, etc.
  • Spatial-Aware Methods: SpatialVLA, CoT-VLA, etc.
  • 3D Perception Models: Depth Anything V2, DINOv2, etc.
  • Evaluation Benchmarks: LIBERO, Simpler, BridgeData V2, etc.

Overall Assessment: This is a high-quality research paper proposing a simple yet effective method to enhance VLA spatial reasoning capabilities. The experimental design is comprehensive, results are convincing, and the work has significant practical value and research significance for the robotic manipulation field.