Vision-Language-Action (VLA) models have recently shown impressive generalization and language-guided manipulation capabilities. However, their performance degrades on tasks requiring precise spatial reasoning due to limited spatial reasoning inherited from Vision-Language Models (VLMs). Existing VLAs rely on extensive action-data pretraining to ground VLMs in 3D space, which reduces training efficiency and is still insufficient for accurate spatial understanding. In this work, we present DepthVLA, a simple yet effective VLA architecture that explicitly incorporates spatial awareness through a pretrained depth prediction module. DepthVLA adopts a mixture-of-transformers design that unifies a VLM, a depth transformer, and an action expert with fully shared attentions, forming an end-to-end model with enhanced spatial reasoning. Extensive evaluations in both real-world and simulated environments show that DepthVLA outperforms state-of-the-art approaches, achieving 78.5% vs. 65.0% progress in real-world tasks, 94.9% vs. 93.6% in the LIBERO simulator, and 74.8% vs. 58.8% in the Simpler simulator. Our code will be made publicly available.
- Paper ID: 2510.13375
- Title: DepthVLA: Enhancing Vision-Language-Action Models with Depth-Aware Spatial Reasoning
- Authors: Tianyuan Yuan, Yicheng Liu, Chenhao Lu, Zhuoguang Chen, Tao Jiang, Hang Zhao
- Institutions: IIIS, Tsinghua University & Galaxea AI
- Category: cs.CV (Computer Vision)
- Publication Date: October 15, 2025 (arXiv preprint)
- Paper Link: https://arxiv.org/abs/2510.13375
Vision-Language-Action (VLA) models generalize well and follow language instructions for manipulation, but their performance degrades on tasks requiring precise spatial reasoning, a limitation inherited from vision-language models (VLMs). Existing VLAs rely on large-scale action-data pretraining to ground VLMs in 3D space, which reduces training efficiency and still falls short of accurate spatial understanding. The paper proposes DepthVLA, a simple yet effective VLA architecture that explicitly incorporates spatial awareness through a pretrained depth prediction module. DepthVLA employs a mixture-of-transformers (MoT) design that unifies a VLM, a depth transformer, and an action expert through fully shared attention, forming an end-to-end model with enhanced spatial reasoning. Extensive evaluations in real-world and simulated environments show that DepthVLA surpasses state-of-the-art methods, achieving 78.5% vs. 65.0% progress on real-world tasks, 94.9% vs. 93.6% on the LIBERO simulator, and 74.8% vs. 58.8% on the Simpler simulator.
Existing Vision-Language-Action (VLA) models perform poorly on robotic manipulation tasks requiring precise spatial reasoning, primarily due to:
- Limited spatial reasoning capabilities: VLAs inherit spatial reasoning limitations from VLMs, underperforming on precise manipulation tasks
- Low training efficiency: Existing methods rely on large-scale action data pretraining to ground VLMs in 3D space, yet still fail to sufficiently understand spatial information
- Practical application challenges: VLAs frequently fail at grasping small objects, executing precise operations, or avoiding collisions
Precise spatial reasoning is critical for robotic manipulation, particularly in:
- Grasping small objects or fine-grained manipulation
- Collision-aware path planning
- Stacking tasks requiring precise position estimation
- Multi-step operations in complex environments
- Generative world model approaches: Lack explicit 3D knowledge with limited improvements to current scene encoding
- Chain-of-Thought reasoning: Introduces significant latency (over 2 seconds) because it requires autoregressive generation of hundreds of spatial tokens
- External depth estimators: Methods like SpatialVLA use off-the-shelf depth estimators but lack end-to-end optimization with VLA, limiting performance ceiling
- DepthVLA Architecture: Proposes a novel VLA model integrating a pretrained depth prediction expert into a mixture-of-transformers (MoT) framework, adding explicit spatial reasoning while preserving the VLM's semantic foundation
- Per-Expert Pretraining Strategy: The MoT design allows each expert (VLM and depth) to be pretrained separately on different datasets, improving training efficiency and letting pretraining scale beyond embodied action data
- Comprehensive Real-World and Simulated Validation: Shows that DepthVLA outperforms state-of-the-art VLAs in real-world and simulated environments (LIBERO, Simpler), with improvements in grasping accuracy, collision avoidance, and overall task success rates
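Following the standard end-to-end VLA setup, the policy $\pi_\theta$ predicts a $k$-step action chunk $A_t = [a_t, a_{t+1}, \dots, a_{t+k-1}]$ from the current observation $o_t$ (from one or multiple cameras), the language instruction $l$, and the proprioceptive state $s_t$, that is, it models $\pi_\theta(A_t \mid o_t, l, s_t)$.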
DepthVLA employs a Mixture-of-Transformers (MoT) architecture integrating three experts:
- VLM Expert: Encodes observations and language instructions, capturing semantic and linguistic foundational features
- Depth Expert: Processes observations to infer geometric information
- Action Expert: Generates continuous actions based on combined features from semantic and geometric experts
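The idea of per-expert weights combined with shared attention can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names (`make_expert`, `mot_attention`), the single attention head, the toy dimensions, and the random weights are all assumptions chosen to keep the example small. Each token is projected with the parameters of the expert that owns it, while attention is computed jointly over the full token sequence.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def make_expert(dim, rng):
    """Hypothetical per-expert parameters: separate Q, K, V projections."""
    def rand_mat():
        return [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]
    return {"q": rand_mat(), "k": rand_mat(), "v": rand_mat()}

def mot_attention(tokens, expert_ids, experts, mask=None):
    """One shared-attention step over tokens owned by different experts.

    tokens: list of vectors; expert_ids[i] names the expert that owns token i.
    mask[i][j] is True when token i may attend to token j (default: all True).
    """
    n = len(tokens)
    dim = len(tokens[0])
    # Each token is projected with the weights of its own expert ...
    q = [matvec(experts[e]["q"], t) for t, e in zip(tokens, expert_ids)]
    k = [matvec(experts[e]["k"], t) for t, e in zip(tokens, expert_ids)]
    v = [matvec(experts[e]["v"], t) for t, e in zip(tokens, expert_ids)]
    out = []
    for i in range(n):
        # ... but attention is computed jointly over the whole sequence.
        allowed = [j for j in range(n) if mask is None or mask[i][j]]
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(dim)
                  for j in allowed]
        weights = softmax(scores)
        out.append([sum(w * v[j][d] for w, j in zip(weights, allowed))
                    for d in range(dim)])
    return out

rng = random.Random(0)
experts = {name: make_expert(4, rng) for name in ("vlm", "depth", "action")}
tokens = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(5)]
expert_ids = ["vlm", "vlm", "depth", "depth", "action"]
fused = mot_attention(tokens, expert_ids, experts)
```

In a full model each expert would also own its feed-forward and normalization layers; the sketch keeps only the attention step, where the three streams exchange information.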
- Encoder-Decoder Architecture: Encoder based on DINOv2, initialized from Depth Anything V2 pretrained checkpoint
- Decoder Structure: Matches VLM's Transformer structure, outputs depth predictions through linear head
- Intermediate Feature Utilization: Performs spatial reasoning at all intermediate layers, providing rich geometric cues for action prediction
The shared attention uses a block-level masking strategy:
- VLM tokens and depth expert tokens each attend only to tokens in their own stream
- Action tokens can attend to all streams
- Preserves learning capacity of pretrained modules while fusing semantic and spatial cues
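The masking rules above can be written down as a small boolean matrix. The sketch below assumes a token order of VLM, then depth, then action, and assumes that tokens attend bidirectionally within their own stream; neither detail is stated in this summary, and `build_block_mask` is a hypothetical helper name.

```python
def build_block_mask(n_vlm, n_depth, n_action):
    """Return mask[i][j] == True when token i may attend to token j.

    Token order is assumed to be [VLM | depth | action].
    """
    streams = ["vlm"] * n_vlm + ["depth"] * n_depth + ["action"] * n_action
    mask = []
    for query_stream in streams:
        row = []
        for key_stream in streams:
            if query_stream == "action":
                row.append(True)  # action tokens see every stream
            else:
                row.append(query_stream == key_stream)  # stay inside own stream
        mask.append(row)
    return mask

def show(mask):
    """Render the mask with 'x' for allowed and '.' for blocked."""
    return "\n".join("".join("x" if m else "." for m in row) for row in mask)

print(show(build_block_mask(2, 2, 1)))
# xx...
# xx...
# ..xx.
# ..xx.
# xxxxx
```

Only the last row (the action token) is fully populated, which is how the action expert can fuse semantic and geometric features without the VLM and depth streams disturbing each other.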
Unlike implicit methods, DepthVLA provides explicit 3D geometric understanding through a dedicated depth expert, avoiding reliance on large-scale action data.
- Allows different experts to be pretrained on their respective optimal datasets
- Achieves effective fusion through shared attention layers
- Maintains specialized capabilities of each expert
The depth expert is trained jointly with the VLA under a combined loss built from two terms: $L_{si}$, a scale-invariant depth loss, and $L_{flow}$, a flow matching loss.
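A rough sketch of the two loss terms follows. The exact formulations and the weighting used in the paper are not given in this summary, so the code makes explicit assumptions: the depth term uses the scale-invariant log error of Eigen et al. (2014), the flow matching term uses a straight-line path between noise and action, and `depth_weight` is a placeholder rather than a reported value.

```python
import math

def scale_invariant_log_loss(pred, target, lam=0.5):
    """Scale-invariant log-depth error in the form of Eigen et al. (2014).

    pred, target: flat lists of positive depths for valid pixels.
    lam = 1.0 makes the loss fully invariant to a global depth scale.
    """
    diffs = [math.log(p) - math.log(t) for p, t in zip(pred, target)]
    n = len(diffs)
    mean_sq = sum(d * d for d in diffs) / n
    mean = sum(diffs) / n
    return mean_sq - lam * mean * mean

def flow_matching_loss(velocity_fn, action, noise, t):
    """Mean squared error between predicted and target velocity at time t.

    Assumes a straight-line path x_t = (1 - t) * noise + t * action,
    whose target velocity is (action - noise).
    """
    x_t = [(1 - t) * n + t * a for n, a in zip(noise, action)]
    target_v = [a - n for n, a in zip(noise, action)]
    pred_v = velocity_fn(x_t, t)
    return sum((p - u) ** 2 for p, u in zip(pred_v, target_v)) / len(action)

def combined_loss(l_flow, l_si, depth_weight=1.0):
    """Weighted sum; depth_weight is a placeholder, not a value from the paper."""
    return l_flow + depth_weight * l_si
```

With `lam=1.0`, multiplying every predicted depth by the same constant leaves the depth term unchanged, which is the property that makes such a loss usable with depth labels of unknown absolute scale.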
- Pretraining Datasets:
- Depth Expert: WildRGB-D, ScanNet, ScanNet++, HyperSim
- VLA: Galaxea Open-World Dataset (100k trajectories), BridgeData V2 (60k trajectories)
- Evaluation Datasets:
- Simpler WidowX: 4 task suites, 120 trials
- LIBERO: 4 task suites (Spatial/Object/Goal/Long), 2000 trials
- Real-World: 3 benchmark tasks, 20 runs per task
- Success Rate: Percentage of completed tasks
- Progress Score: Each successful substep contributes one point, averaged across all runs
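The two metrics can be computed as in the short sketch below. The run data is made up for illustration, and normalizing the progress score by the number of substeps (so that it reads as a percentage) is an assumption, since the summary only states that each successful substep contributes one point.

```python
def success_rate(successes):
    """Percentage of runs in which the whole task was completed."""
    return 100.0 * sum(1 for s in successes if s) / len(successes)

def progress_score(substeps_done, total_substeps):
    """Mean completed substeps per run, as a percentage of total_substeps.

    Normalizing by total_substeps is an assumption made so that the score
    is comparable across tasks with different numbers of substeps.
    """
    mean_points = sum(substeps_done) / len(substeps_done)
    return 100.0 * mean_points / total_substeps

runs_success = [True, False, True, True]
runs_substeps = [4, 2, 4, 4]  # hypothetical task with 4 substeps
print(success_rate(runs_success))        # 75.0
print(progress_score(runs_substeps, 4))  # 87.5
```

The example shows why the two numbers differ: the failed run still earns partial credit for the two substeps it completed.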
- Diffusion Policy
- Octo-Base
- SpatialVLA
- π0 (reimplemented)
- OpenVLA
- CoT-VLA
- MolmoAct
- DreamVLA
- Model: PaliGemma-3B as VLM backbone, DINOv2-L as depth encoder
- Training: 32 NVIDIA H100 GPUs, AdamW optimizer
- Inference: NVIDIA RTX 4090 GPU, BF16 mixed precision, 210 ms latency
| Model | Pretrain | Put Spoon | Put Carrot | Stack Block | Pick Eggplant | Average |
|---|---|---|---|---|---|---|
| π0 (reimplemented) | × | 81.7% | 64.2% | 30.0% | 59.2% | 58.8% |
| DepthVLA | × | 75.8% | 71.7% | 62.5% | 89.2% | 74.8% |
| Model | Pretrain | Spatial | Object | Goal | Long | Average |
|---|---|---|---|---|---|---|
| π0 (reimplemented) | × | 95.8% | 96.4% | 94.8% | 87.4% | 93.6% |
| DepthVLA | × | 96.4% | 98.0% | 95.8% | 89.2% | 94.9% |
- Overall Performance: DepthVLA achieves a 78.5% average progress score vs 65.0% for the baseline
- Microwave Operation: Demonstrates excellent collision avoidance performance
- Block Stacking: Exhibits superior spatial awareness capabilities
- Desktop Organization: Performs comparably on small object grasping tasks
| Setting | Spoon | Carrot | Block | Eggplant | Average |
|---|---|---|---|---|---|
| Depth Expert Random Init | 60.0% | 60.8% | 43.3% | 40.0% | 51.0% |
| Remove Depth Loss | 69.2% | 60.0% | 28.3% | 70.0% | 56.9% |
| Freeze Depth Expert | 65.8% | 69.2% | 74.2% | 78.3% | 71.9% |
| Remove Block-Level Masking | 66.7% | 65.0% | 2.5% | 88.3% | 55.6% |
| DepthVLA Complete | 75.8% | 71.7% | 62.5% | 89.2% | 74.8% |
- Depth Pretraining is Critical: A randomly initialized depth expert drops the average success rate from 74.8% to 51.0%
- Depth Loss is Necessary: Removing the depth loss lowers the average to 56.9%, with Stack Block falling from 62.5% to 28.3%
- Block-Level Masking is Effective: Removing the mask lowers the average to 55.6% and collapses Stack Block to 2.5%, so keeping the pretrained experts independent is crucial
- Prediction Outperforms Direct Input: Predicting depth performs better than directly using ground-truth depth
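The averages in the ablation table can be re-derived from the per-task numbers, which also makes the size of each drop relative to the complete model explicit. The snippet below is only a small arithmetic check on the reported values; it does not rerun any experiment.

```python
# Per-task success rates (%) copied from the ablation table above,
# in the order Spoon, Carrot, Block, Eggplant.
ablation = {
    "Depth Expert Random Init": [60.0, 60.8, 43.3, 40.0],
    "Remove Depth Loss": [69.2, 60.0, 28.3, 70.0],
    "Freeze Depth Expert": [65.8, 69.2, 74.2, 78.3],
    "Remove Block-Level Masking": [66.7, 65.0, 2.5, 88.3],
    "DepthVLA Complete": [75.8, 71.7, 62.5, 89.2],
}

def average(values):
    return sum(values) / len(values)

full = average(ablation["DepthVLA Complete"])
for name, rates in ablation.items():
    avg = average(rates)
    # delta is the change in average success rate versus the complete model
    print(f"{name:28s} avg={avg:5.1f}  delta={avg - full:+6.1f}")
```

Looking at the per-task columns rather than only the averages is informative here: freezing the depth expert costs about three points on average, yet its Stack Block score (74.2%) is higher than that of the complete model (62.5%).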
VLA research has evolved from single-task experts to generalist models, driven by advances in large language models, vision-language models, and large-scale robotic action datasets. Early VLAs generate action tokens by fine-tuning a VLM, while the latest VLAs employ diffusion-based action experts.
- Early Methods: Use additional 3D inputs like LiDAR or RGB-D cameras, but reduce cross-platform generality
- SpatialVLA: Uses off-the-shelf depth estimators to generate pseudo point clouds, but lacks end-to-end optimization
- Generative World Models: Predict future frames, keypoints, or semantic states, but provide limited improvements to current scene encoding
- CoT Reasoning: Autoregressively generates depth tokens, but introduces high latency
Recent advances in 3D perception demonstrate strong capabilities in inferring geometry from monocular or multi-view images, offering potential for improving VLA spatial reasoning.
- Explicit Spatial Reasoning is Effective: Significantly improves VLA performance on precise manipulation tasks through pretrained depth experts
- Mixture-of-Transformers Design is Superior: Allows different experts to be pretrained on their respective optimal datasets, improving efficiency
- End-to-End Optimization is Key: Joint optimization of depth prediction and action generation is more effective than using external depth estimators
- Monocular Depth Prediction Challenges: May still fail in difficult scenarios (tiny edges, reflective or transparent objects, textureless surfaces)
- Computational Overhead: Adds 600M parameters and 20ms inference latency
- Dependence on Depth Labels: Requires generating pseudo depth labels for training
- Multi-View Depth Prediction: Explore multi-view depth or point map prediction to enhance spatial precision and robustness
- More Efficient Architectures: Reduce computational overhead while maintaining performance
- Unsupervised Spatial Learning: Reduce dependence on depth labels
- Strong Method Innovation: First to effectively integrate pretrained depth experts into VLA, providing explicit spatial reasoning
- Comprehensive Experiments: Covers real-world and multiple simulated environments with detailed ablation studies
- Significant Performance Improvements: Improves the average score in all three evaluation settings (real world, LIBERO, Simpler), although Put Spoon on Simpler is lower than the π0 baseline (75.8% vs 81.7%)
- Reasonable Design: Mixture-of-transformers architecture preserves specialized capabilities of each expert while achieving effective fusion
- Strong Practicality: Adds only about 20 ms of inference latency, keeping the model suitable for real-time deployment
- Depth Quality Dependency: Performance limited by depth prediction quality, may fail in challenging scenarios
- Label Generation Cost: Requires generating pseudo depth labels for training data, increasing data preparation costs
- Insufficient Theoretical Analysis: Lacks in-depth theoretical analysis of why depth prediction outperforms direct depth input
- Limited Generalization Validation: Primarily validated on specific types of manipulation tasks; generalization to other task types requires further verification
- Field Contribution: Provides new effective method for enhancing VLA spatial reasoning, potentially influencing future research directions
- Practical Value: Simple and effective method, easily implementable in existing VLA systems
- Reproducibility: Authors commit to releasing code, facilitating research reproduction and further development
- Precise Manipulation Tasks: Particularly suitable for robotic manipulation tasks requiring precise spatial reasoning
- Multi-Modal Robotic Systems: Applicable to various robotic platforms with RGB cameras
- Industrial Applications: Potential applications in manufacturing, service robots, and other scenarios requiring precise manipulation
The paper cites extensive related work including:
- VLA Models: OpenVLA, π0, Octo, etc.
- Spatial-Aware Methods: SpatialVLA, CoT-VLA, etc.
- 3D Perception Models: Depth Anything V2, DINOv2, etc.
- Evaluation Benchmarks: LIBERO, Simpler, BridgeData V2, etc.
Overall Assessment: This is a high-quality research paper proposing a simple yet effective method to enhance VLA spatial reasoning capabilities. The experimental design is comprehensive, results are convincing, and the work has significant practical value and research significance for the robotic manipulation field.