DepthVLA: Enhancing Vision-Language-Action Models with Depth-Aware Spatial Reasoning

Yuan, Liu, Lu et al.

Vision-Language-Action (VLA) models have recently shown impressive generalization and language-guided manipulation capabilities. However, their performance degrades on tasks requiring precise spatial reasoning due to limited spatial reasoning inherited from Vision-Language Models (VLMs). Existing VLAs rely on extensive action-data pretraining to ground VLMs in 3D space, which reduces training efficiency and is still insufficient for accurate spatial understanding. In this work, we present DepthVLA, a simple yet effective VLA architecture that explicitly incorporates spatial awareness through a pretrained depth prediction module. DepthVLA adopts a mixture-of-transformers design that unifies a VLM, a depth transformer, and an action expert with fully shared attentions, forming an end-to-end model with enhanced spatial reasoning. Extensive evaluations in both real-world and simulated environments show that DepthVLA outperforms state-of-the-art approaches, achieving 78.5% vs. 65.0% progress in real-world tasks, 94.9% vs. 93.6% in the LIBERO simulator, and 74.8% vs. 58.8% in the Simpler simulator. Our code will be made publicly available.

Basic Information

Paper ID: 2510.13375
Title: DepthVLA: Enhancing Vision-Language-Action Models with Depth-Aware Spatial Reasoning
Authors: Tianyuan Yuan, Yicheng Liu, Chenhao Lu, Zhuoguang Chen, Tao Jiang, Hang Zhao
Institutions: IIIS, Tsinghua University & Galaxea AI
Category: cs.CV (Computer Vision)
Publication Date: October 15, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.13375

Abstract

Vision-Language-Action (VLA) models generalize well and perform strongly on language-guided manipulation, but their performance degrades on tasks requiring precise spatial reasoning because they inherit the limited spatial reasoning of vision-language models (VLMs). Existing VLAs rely on large-scale action-data pretraining to ground VLMs in 3D space, which reduces training efficiency and remains insufficient for accurate spatial understanding. This paper proposes DepthVLA, a simple yet effective VLA architecture that explicitly incorporates spatial awareness through a pretrained depth prediction module. DepthVLA employs a mixture-of-transformers design that unifies a VLM, a depth transformer, and an action expert through fully shared attention, forming an end-to-end model with enhanced spatial reasoning. Extensive evaluations in real-world and simulated environments show that DepthVLA surpasses state-of-the-art methods, achieving 78.5% vs. 65.0% progress on real-world tasks, 94.9% vs. 93.6% in the LIBERO simulator, and 74.8% vs. 58.8% in the Simpler simulator.

Research Background and Motivation

Core Problem

Existing Vision-Language-Action (VLA) models perform poorly on robotic manipulation tasks requiring precise spatial reasoning, primarily due to:

Limited spatial reasoning capabilities: VLAs inherit spatial reasoning limitations from VLMs, underperforming on precise manipulation tasks
Low training efficiency: Existing methods rely on large-scale action data pretraining to ground VLMs in 3D space, yet still fail to sufficiently understand spatial information
Practical application challenges: VLAs frequently fail at grasping small objects, executing precise operations, or avoiding collisions

Problem Significance

Precise spatial reasoning is critical for robotic manipulation, particularly in:

Grasping small objects or fine-grained manipulation
Collision-aware path planning
Stacking tasks requiring precise position estimation
Multi-step operations in complex environments

Limitations of Existing Methods

Generative world model approaches: Lack explicit 3D knowledge and offer only limited improvements to the encoding of the current scene
Chain-of-Thought reasoning: Introduces significant latency (over 2 seconds) because it requires autoregressive generation of hundreds of spatial tokens
External depth estimators: Methods like SpatialVLA use off-the-shelf depth estimators but are not optimized end-to-end with the VLA, which limits the performance ceiling

Core Contributions

DepthVLA Architecture: Proposes a novel VLA model that integrates a pretrained depth prediction expert into a mixture-of-transformers framework, achieving explicit spatial reasoning while preserving the VLM's semantic foundation
Per-Expert Pretraining Strategy: The mixture-of-transformers design allows each expert (VLM and depth) to be pretrained separately on different datasets, improving training efficiency and scaling beyond embodied action data
Comprehensive Real-World and Simulated Validation: Shows that DepthVLA outperforms state-of-the-art VLAs in real-world and simulated environments (LIBERO, Simpler), with improvements in grasping accuracy, collision avoidance, and overall task success rates

Method Details

Task Definition

Following the standard end-to-end VLA setup, the policy π_θ predicts an action chunk A_t of length k from the current observation o_t (from one or multiple cameras), the language instruction l, and the proprioceptive state s_t:

A_t = π_θ(o_t, l, s_t)
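
To make the interface concrete, the following is a minimal sketch of a chunk-predicting policy. The names `Observation` and `ZeroChunkPolicy`, the chunk length of 4, and the 7-dimensional action are illustrative assumptions, not taken from the paper; the stand-in policy returns zeros rather than running any model.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Observation:
    """Inputs to the policy at time t (all field names are hypothetical)."""
    images: List[str]          # placeholder for one or more camera frames o_t
    instruction: str           # language instruction l
    proprio: Sequence[float]   # proprioceptive state s_t


class ZeroChunkPolicy:
    """Stand-in for pi_theta: returns a k-step chunk of zero actions."""

    def __init__(self, chunk_len: int, action_dim: int):
        self.chunk_len = chunk_len
        self.action_dim = action_dim

    def __call__(self, obs: Observation) -> List[List[float]]:
        # A real policy would run the VLM, depth, and action experts here.
        return [[0.0] * self.action_dim for _ in range(self.chunk_len)]


policy = ZeroChunkPolicy(chunk_len=4, action_dim=7)
obs = Observation(images=["cam0"], instruction="stack the blocks", proprio=[0.0] * 7)
chunk = policy(obs)
print(len(chunk), len(chunk[0]))
```

The point of the sketch is only the shape of the mapping: one observation, instruction, and state go in, and k future actions come out in a single call.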

Model Architecture

DepthVLA employs a Mixture-of-Transformers (MoT) architecture integrating three experts:

1. Overall Design

VLM Expert: Encodes observations and language instructions, capturing semantic and linguistic foundational features
Depth Expert: Processes observations to infer geometric information
Action Expert: Generates continuous actions based on combined features from semantic and geometric experts

2. Depth Expert Design

Encoder-Decoder Architecture: Encoder based on DINOv2, initialized from Depth Anything V2 pretrained checkpoint
Decoder Structure: Matches VLM's Transformer structure, outputs depth predictions through linear head
Intermediate Feature Utilization: Performs spatial reasoning at all intermediate layers, providing rich geometric cues for action prediction

3. Attention Mechanism

Employs a block-level masking strategy:

VLM and depth expert tokens attend only to themselves
Action tokens can attend to all streams
Preserves learning capacity of pretrained modules while fusing semantic and spatial cues
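
The masking rules above can be sketched as a boolean matrix. This is an illustration under stated assumptions: the token order [VLM | depth | action], the function name `block_attention_mask`, and full (non-causal) attention inside each block are all choices made here, not details confirmed by the summary.

```python
def block_attention_mask(n_vlm, n_depth, n_action):
    """Return mask[q][k] = True when query token q may attend to key token k.

    Assumed token order: [VLM tokens | depth tokens | action tokens].
    """
    blocks = ["vlm"] * n_vlm + ["depth"] * n_depth + ["action"] * n_action
    allowed = {
        "vlm": {"vlm"},                        # VLM tokens see only VLM tokens
        "depth": {"depth"},                    # depth tokens see only depth tokens
        "action": {"vlm", "depth", "action"},  # action tokens see every stream
    }
    return [[k in allowed[q] for k in blocks] for q in blocks]


mask = block_attention_mask(n_vlm=2, n_depth=2, n_action=1)
for row in mask:
    print("".join("1" if ok else "0" for ok in row))
# 11000
# 11000
# 00110
# 00110
# 11111
```

Because no VLM or depth row has an entry in the action columns, gradients from the action stream cannot change what the pretrained experts attend to, which is one way to read the claim that the masking preserves their learned capacity.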

Technical Innovations

1. Explicit Spatial Reasoning

Unlike implicit methods, DepthVLA provides explicit 3D geometric understanding through a dedicated depth expert, avoiding reliance on large-scale action data.

2. Mixture-of-Transformers Design

Allows different experts to be pretrained on their respective optimal datasets
Achieves effective fusion through shared attention layers
Maintains specialized capabilities of each expert

3. End-to-End Optimization

The depth expert is trained jointly with the rest of the VLA using a combined loss:

L = L_si + L_flow

where L_si is the scale-invariant depth loss and L_flow is the flow matching loss.
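
As a rough illustration of how the two terms might be computed, the sketch below uses a common log-depth form of the scale-invariant loss and a straight-line flow matching target. Both forms, the weight `lam`, the sign convention `action - noise`, and all function names are assumptions made for illustration; the summary does not state the exact formulation used in the paper.

```python
import math


def scale_invariant_log_loss(pred, target, lam=0.5):
    """Scale-invariant loss on log depth; lam is an assumed weight."""
    d = [math.log(p) - math.log(t) for p, t in zip(pred, target)]
    n = len(d)
    mean_sq = sum(x * x for x in d) / n
    sq_mean = (sum(d) / n) ** 2
    return mean_sq - lam * sq_mean


def flow_matching_loss(pred_velocity, action, noise):
    """MSE between a predicted velocity and the assumed target (action - noise)."""
    target = [a - e for a, e in zip(action, noise)]
    return sum((v - u) ** 2 for v, u in zip(pred_velocity, target)) / len(target)


def total_loss(l_si, l_flow):
    """Unweighted sum of the two terms, matching L = L_si + L_flow."""
    return l_si + l_flow


target_depth = [1.0, 2.0, 4.0]
pred_depth = [2.0, 4.0, 8.0]  # correct up to a global scale factor of 2
l_si = scale_invariant_log_loss(pred_depth, target_depth, lam=1.0)
l_flow = flow_matching_loss([1.0, 1.0], action=[1.0, 2.0], noise=[0.0, 1.0])
print(total_loss(l_si, l_flow))
```

With `lam=1.0` the depth term ignores a global scale error entirely, which is the property the name "scale-invariant" refers to; smaller values of `lam` penalize scale errors partially.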

Experimental Setup

Datasets

Pretraining Datasets:
- Depth Expert: WildRGB-D, ScanNet, ScanNet++, HyperSim
- VLA: Galaxea Open-World Dataset (100k trajectories), BridgeData V2 (60k trajectories)
Evaluation Datasets:
- Simpler WidowX: 4 task suites, 120 trials
- LIBERO: 4 task suites (Spatial/Object/Goal/Long), 2000 trials
- Real-World: 3 benchmark tasks, 20 runs per task

Evaluation Metrics

Success Rate: Percentage of completed tasks
Progress Score: Each successful substep contributes one point, averaged across all runs
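
The two metrics can be sketched as follows. Normalizing the progress score by the number of substeps, so that it can be reported as a percentage, is an assumption; the run data and function names are hypothetical.

```python
def success_rate(successes):
    """Percentage of runs that completed the whole task."""
    return 100.0 * sum(successes) / len(successes)


def progress_score(substeps_done, substeps_total):
    """Mean fraction of completed substeps across runs, as a percentage."""
    per_run = [done / substeps_total for done in substeps_done]
    return 100.0 * sum(per_run) / len(per_run)


# Hypothetical example: a 4-substep task evaluated over 4 runs.
runs = [4, 2, 3, 4]
print(progress_score(runs, substeps_total=4))   # 81.25
print(success_rate([done == 4 for done in runs]))  # 50.0
```

The example shows why the progress score is the more forgiving metric: runs that fail late still earn partial credit, while the success rate counts them as zero.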

Comparison Methods

Diffusion Policy
Octo-Base
SpatialVLA
π0 (reimplemented)
OpenVLA
CoT-VLA
MolmoACT
DreamVLA

Implementation Details

Model: PaliGemma-3B as VLM backbone, DINOv2-L as depth encoder
Training: 32 NVIDIA H100 GPUs, AdamW optimizer
Inference: NVIDIA 4090 GPU, BF16 mixed precision, 210ms latency

Experimental Results

Main Results

1. Simpler WidowX Benchmark

Model	Pretrain	Put Spoon	Put Carrot	Stack Block	Pick Eggplant	Average
π0 (reimplemented)	×	81.7%	64.2%	30.0%	59.2%	58.8%
DepthVLA	×	75.8%	71.7%	62.5%	89.2%	74.8%

2. LIBERO Benchmark

Model	Pretrain	Spatial	Object	Goal	Long	Average
π0 (reimplemented)	×	95.8%	96.4%	94.8%	87.4%	93.6%
DepthVLA	×	96.4%	98.0%	95.8%	89.2%	94.9%
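
As a sanity check, the Average columns in the two simulator tables can be recomputed as unweighted means of the four per-suite scores. The dictionary layout and names below are illustrative; the numbers are copied from the tables.

```python
results = {
    "Simpler / pi0 (reimplemented)": ([81.7, 64.2, 30.0, 59.2], 58.8),
    "Simpler / DepthVLA": ([75.8, 71.7, 62.5, 89.2], 74.8),
    "LIBERO / pi0 (reimplemented)": ([95.8, 96.4, 94.8, 87.4], 93.6),
    "LIBERO / DepthVLA": ([96.4, 98.0, 95.8, 89.2], 94.9),
}


def mean(xs):
    return sum(xs) / len(xs)


# Gap between recomputed mean and reported average, per row.
gaps = {}
for name, (scores, reported) in results.items():
    gaps[name] = abs(mean(scores) - reported)
    print(f"{name}: mean={mean(scores):.3f} reported={reported}")
```

All four reported averages agree with the recomputed means to within rounding at one decimal place.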

3. Real-World Benchmark

Overall Performance: DepthVLA achieves an average progress score of 78.5% vs. 65.0% for the baseline
Microwave Operation: Demonstrates excellent collision avoidance performance
Block Stacking: Exhibits superior spatial awareness capabilities
Desktop Organization: Performs comparably on small object grasping tasks

Ablation Studies

Setting	Spoon	Carrot	Block	Eggplant	Average
Depth Expert Random Init	60.0%	60.8%	43.3%	40.0%	51.0%
Remove Depth Loss	69.2%	60.0%	28.3%	70.0%	56.9%
Freeze Depth Expert	65.8%	69.2%	74.2%	78.3%	71.9%
Remove Block-Level Masking	66.7%	65.0%	2.5%	88.3%	55.6%
DepthVLA Complete	75.8%	71.7%	62.5%	89.2%	74.8%
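
To read the ablation table more easily, the sketch below computes each setting's change relative to the complete model, both on average and for the task that suffers most. The variable and function names are illustrative; the numbers are copied from the table.

```python
TASKS = ["Spoon", "Carrot", "Block", "Eggplant"]
full = dict(zip(TASKS, [75.8, 71.7, 62.5, 89.2]))
ablations = {
    "Depth Expert Random Init": dict(zip(TASKS, [60.0, 60.8, 43.3, 40.0])),
    "Remove Depth Loss": dict(zip(TASKS, [69.2, 60.0, 28.3, 70.0])),
    "Freeze Depth Expert": dict(zip(TASKS, [65.8, 69.2, 74.2, 78.3])),
    "Remove Block-Level Masking": dict(zip(TASKS, [66.7, 65.0, 2.5, 88.3])),
}


def average(row):
    return sum(row.values()) / len(row)


def worst_task(row):
    """Task with the largest drop relative to the complete model."""
    return min(TASKS, key=lambda t: row[t] - full[t])


for name, row in ablations.items():
    delta = average(row) - average(full)
    task = worst_task(row)
    print(f"{name}: avg {delta:+.1f} pts, worst on {task} ({row[task] - full[task]:+.1f} pts)")
```

Every ablation lowers the average, but not every task: freezing the depth expert scores higher on Stack Block (74.2% vs. 62.5%) while losing ground elsewhere.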

Key Findings

Depth Pretraining is Critical: A randomly initialized depth expert lowers the Simpler average from 74.8% to 51.0%
Depth Loss is Necessary: Removing the depth loss lowers the average to 56.9%
Block-Level Masking is Effective: Removing the masking lowers the average to 55.6% and collapses Stack Block to 2.5%, indicating that keeping the experts independent is crucial
Prediction Outperforms Direct Input: Predicting depth performs better than directly using ground-truth depth

Related Work

General-Purpose Robotic Manipulation Policies

General-purpose policies have evolved from single-task experts to generalist models, driven by advances in large language models, vision-language models, and large-scale robotic action datasets. Early VLAs generate action tokens by fine-tuning a VLM, while the latest VLAs employ diffusion-based action experts.

Spatial-Aware VLAs

Early Methods: Use additional 3D inputs like LiDAR or RGB-D cameras, but reduce cross-platform generality
SpatialVLA: Uses off-the-shelf depth estimators to generate pseudo point clouds, but lacks end-to-end optimization
Generative World Models: Predict future frames, keypoints, or semantic states, but provide limited improvements to current scene encoding
CoT Reasoning: Autoregressively generates depth tokens, but introduces high latency

3D Geometric Awareness

Recent advances in 3D perception demonstrate strong capabilities in inferring geometry from monocular or multi-view images, offering potential for improving VLA spatial reasoning.

Conclusions and Discussion

Main Conclusions

Explicit Spatial Reasoning is Effective: Significantly improves VLA performance on precise manipulation tasks through pretrained depth experts
Mixture-of-Transformers Design is Superior: Allows different experts to be pretrained on their respective optimal datasets, improving efficiency
End-to-End Optimization is Key: Joint optimization of depth prediction and action generation is more effective than using external depth estimators

Limitations

Monocular Depth Prediction Challenges: May still fail in difficult scenarios (tiny edges, reflective or transparent objects, textureless surfaces)
Computational Overhead: Adds 600M parameters and 20ms inference latency
Dependence on Depth Labels: Requires generating pseudo depth labels for training

Future Directions

Multi-View Depth Prediction: Explore multi-view depth or point map prediction to enhance spatial precision and robustness
More Efficient Architectures: Reduce computational overhead while maintaining performance
Unsupervised Spatial Learning: Reduce dependence on depth labels

In-Depth Evaluation

Strengths

Strong Method Innovation: First to effectively integrate pretrained depth experts into VLA, providing explicit spatial reasoning
Comprehensive Experiments: Covers real-world and multiple simulated environments with detailed ablation studies
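Significant Performance Improvements: Achieves higher average scores in all three evaluation settings (real-world, LIBERO, Simpler), though not on every individual task (e.g., Put Spoon on Simpler: 75.8% vs. 81.7%)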
Reasonable Design: The mixture-of-transformers architecture preserves the specialized capabilities of each expert while achieving effective fusion
Strong Practicality: Minimal inference latency increase, suitable for real-time deployment

Weaknesses

Depth Quality Dependency: Performance limited by depth prediction quality, may fail in challenging scenarios
Label Generation Cost: Requires generating pseudo depth labels for training data, increasing data preparation costs
Insufficient Theoretical Analysis: Lacks in-depth theoretical analysis of why depth prediction outperforms direct depth input
Limited Generalization Validation: Primarily validated on specific types of manipulation tasks; generalization to other task types requires further verification

Impact

Field Contribution: Provides new effective method for enhancing VLA spatial reasoning, potentially influencing future research directions
Practical Value: Simple and effective method, easily implementable in existing VLA systems
Reproducibility: Authors commit to releasing code, facilitating research reproduction and further development

Applicable Scenarios

Precise Manipulation Tasks: Particularly suitable for robotic manipulation tasks requiring precise spatial reasoning
Multi-Modal Robotic Systems: Applicable to various robotic platforms with RGB cameras
Industrial Applications: Potential applications in manufacturing, service robots, and other scenarios requiring precise manipulation

References

The paper cites extensive related work including:

VLA Models: OpenVLA, π0, Octo, etc.
Spatial-Aware Methods: SpatialVLA, CoT-VLA, etc.
3D Perception Models: Depth Anything V2, DINOv2, etc.
Evaluation Benchmarks: LIBERO, Simpler, BridgeData V2, etc.

Overall Assessment: This is a high-quality research paper proposing a simple yet effective method to enhance VLA spatial reasoning capabilities. The experimental design is comprehensive, results are convincing, and the work has significant practical value and research significance for the robotic manipulation field.