Phys2Real: Fusing VLM Priors with Interactive Online Adaptation for Uncertainty-Aware Sim-to-Real Manipulation
Wang, Tian, Swann et al.
Learning robotic manipulation policies directly in the real world can be expensive and time-consuming. While reinforcement learning (RL) policies trained in simulation present a scalable alternative, effective sim-to-real transfer remains challenging, particularly for tasks that require precise dynamics. To address this, we propose Phys2Real, a real-to-sim-to-real RL pipeline that combines vision-language model (VLM)-inferred physical parameter estimates with interactive adaptation through uncertainty-aware fusion. Our approach consists of three core components: (1) high-fidelity geometric reconstruction with 3D Gaussian splatting, (2) VLM-inferred prior distributions over physical parameters, and (3) online physical parameter estimation from interaction data. Phys2Real conditions policies on interpretable physical parameters, refining VLM predictions with online estimates via ensemble-based uncertainty quantification. On planar pushing tasks of a T-block with varying center of mass (CoM) and a hammer with an off-center mass distribution, Phys2Real achieves substantial improvements over a domain randomization baseline: 100% vs 79% success rate for the bottom-weighted T-block, 57% vs 23% in the challenging top-weighted T-block, and 15% faster average task completion for hammer pushing. Ablation studies indicate that the combination of VLM and interaction information is essential for success. Project website: https://phys2real.github.io/ .
This paper proposes Phys2Real, a real-to-sim-to-real reinforcement learning pipeline that tackles sim-to-real transfer in robotic manipulation by fusing vision-language model (VLM) estimates of physical parameters with interactive online adaptation, weighting each source by its uncertainty. The method comprises three core components: (1) high-fidelity geometric reconstruction with 3D Gaussian splatting, (2) prior distributions over physical parameters inferred by a VLM, and (3) online physical parameter estimation from interaction data. On planar pushing of a T-block with varying center of mass and a hammer with an off-center mass distribution, Phys2Real reports substantial improvements over a domain randomization baseline: 100% vs 79% success rate for the bottom-weighted T-block, 57% vs 23% for the top-weighted T-block, and 15% faster average task completion for hammer pushing.
Transferring robotic manipulation policies from simulation to the real world remains a fundamental challenge, particularly for tasks that require precise dynamics. Traditional domain randomization (DR) provides robustness, but the resulting policies typically settle on behavior averaged over the randomized range and cannot adapt to the physical properties of a specific object.
Humans handle novel objects in two stages: they first form an initial judgment about physical properties from visual appearance, then refine that estimate through interaction. Inspired by this, the paper aims to give robots a similar capability by combining visual physical reasoning with interactive learning to improve real-world manipulation performance.
Uncertainty-Aware Fusion of VLM Priors with Interactive Adaptation: Presented as the first demonstration that VLM-provided physical parameter estimates (e.g., center of mass) can be combined with interaction-based parameter estimates for real-time, low-level closed-loop control
Ensemble-Based Uncertainty Quantification: Decomposes uncertainty into epistemic and aleatoric components, fusing VLM priors and interactive estimates through inverse-variance weighting
Physics-Informed Digital Twin: Combines 3D Gaussian splatting reconstruction with online physical property estimation to create digital twins containing both geometric and physical information
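The inverse-variance weighting named in the list above can be illustrated with a minimal sketch. It assumes each source (the VLM prior and the interaction-based estimate) reports a scalar parameter as a Gaussian mean and variance. The names `Estimate` and `fuse_inverse_variance` are hypothetical and are not taken from the paper, whose exact formulation may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Estimate:
    """Scalar Gaussian estimate of one physical parameter (hypothetical type)."""
    mean: float
    var: float  # must be strictly positive


def fuse_inverse_variance(prior: Estimate, online: Estimate) -> Estimate:
    """Fuse two independent Gaussian estimates by inverse-variance weighting.

    Each source is weighted by its precision (1 / variance), so the more
    certain source dominates the fused mean.
    """
    if prior.var <= 0.0 or online.var <= 0.0:
        raise ValueError("variances must be strictly positive")
    w_prior = 1.0 / prior.var
    w_online = 1.0 / online.var
    fused_var = 1.0 / (w_prior + w_online)
    fused_mean = fused_var * (w_prior * prior.mean + w_online * online.mean)
    return Estimate(fused_mean, fused_var)


# Equal confidence: the fused mean is the midpoint and the variance halves.
balanced = fuse_inverse_variance(Estimate(0.0, 1.0), Estimate(1.0, 1.0))

# Confident prior, uncertain online estimate: the result stays near the prior.
prior_heavy = fuse_inverse_variance(Estimate(0.0, 0.01), Estimate(1.0, 1.0))
```

With equal variances the fused mean sits halfway between the two sources. When one source is far more certain, the fused mean stays close to it and the fused variance is never larger than the smaller of the two input variances.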
The paper studies nonprehensile (non-grasping) manipulation, in which the robot must push objects with varying physical properties (e.g., center of mass, friction coefficient) to a target position and orientation. Policy inputs are the object pose, the robot end-effector position, and the estimated physical parameters; the output is a change in end-effector position.
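The policy interface described above can be sketched as follows. This is an illustration only: the planar layout (x, y, yaw), the single physical parameter, the per-step clipping limit `max_step`, and all names are assumptions rather than details from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PushObservation:
    """Hypothetical policy input for planar pushing."""
    object_pose: Tuple[float, float, float]  # assumed layout: (x, y, yaw)
    ee_position: Tuple[float, float]         # assumed layout: (x, y)
    physical_params: Tuple[float, ...]       # e.g. an estimated CoM offset

    def to_vector(self) -> List[float]:
        # Concatenate all fields into one flat list for the policy network.
        return [*self.object_pose, *self.ee_position, *self.physical_params]


def apply_action(
    ee_position: Tuple[float, float],
    delta: Tuple[float, float],
    max_step: float = 0.01,  # assumed per-step limit in metres
) -> Tuple[float, float]:
    """Apply a position-change action, clipping each component to +/- max_step."""
    clipped = tuple(max(-max_step, min(max_step, d)) for d in delta)
    return (ee_position[0] + clipped[0], ee_position[1] + clipped[1])


obs = PushObservation((0.1, 0.2, 0.5), (0.0, 0.0), (0.03,))
vector = obs.to_vector()
next_ee = apply_action(obs.ee_position, (0.05, -0.002))
```

Keeping the physical parameters as explicit entries in the observation, rather than as a learned latent code, is what lets an externally produced estimate be written straight into the policy input.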
Interpretable Physical Parameters: Direct conditioning on physical parameters rather than learned latent variables enables direct fusion of VLM estimates
Dual-Source Uncertainty Fusion: The fused estimate leans on the VLM prior when the interaction-based estimate is highly uncertain, and on the interaction-based estimate when it is confident
Ensemble Uncertainty Decomposition: Separates model (epistemic) uncertainty from data (aleatoric) uncertainty, so that each contribution to the total uncertainty can be accounted for explicitly
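One common way to perform such a decomposition, shown here as a hedged sketch rather than as the paper's implementation, is the law-of-total-variance split used with deep ensembles. It assumes each ensemble member predicts a mean and a variance for the parameter; the function name `decompose_ensemble` is hypothetical.

```python
from statistics import fmean, pvariance
from typing import Sequence, Tuple


def decompose_ensemble(
    means: Sequence[float], variances: Sequence[float]
) -> Tuple[float, float, float]:
    """Return (mean, epistemic_var, aleatoric_var) for an ensemble.

    epistemic: disagreement between members (variance of their means)
    aleatoric: average noise each member predicts (mean of their variances)
    """
    if not means or len(means) != len(variances):
        raise ValueError("need equally sized, non-empty inputs")
    ensemble_mean = fmean(means)
    epistemic = pvariance(means, mu=ensemble_mean)
    aleatoric = fmean(variances)
    return ensemble_mean, epistemic, aleatoric


# Two members that disagree on the mean and on the noise level.
mean, epistemic, aleatoric = decompose_ensemble([0.0, 2.0], [1.0, 3.0])
total_var = epistemic + aleatoric
```

Under this assumption the total predictive variance is the sum of the two terms. Epistemic variance shrinks as the members come to agree, while aleatoric variance reflects noise that more data would not remove.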
Domain randomization trains robust policies by randomizing simulation dynamics, but the resulting policies often adopt averaged behavior at the cost of performance. System identification methods require manual parameter tuning and produce static models.
Methods such as RMA perform well in continuous-contact settings (e.g., locomotion) but struggle with the intermittent contact typical of general manipulation tasks. This paper addresses that gap through VLM priors and uncertainty-aware fusion.
NeRF and 3D Gaussian splatting enable high-fidelity 3D scene reconstruction, but existing digital twins focus on visual fidelity and neglect physical properties. This paper builds physics-informed digital twins that carry both.
Recent work demonstrates that VLMs can reason about physical properties, but it applies this ability primarily to high-level planning. This paper presents itself as the first to feed VLM physical parameter estimates directly into a low-level control policy.
Phys2Real demonstrates that combining VLM visual reasoning with interactive adaptation is effective, outperforming a domain randomization baseline on both T-block and hammer pushing. The uncertainty-aware fusion mechanism lets the system adjust the weight given to each information source according to its reliability.
This work contributes to robot learning by demonstrating the potential of foundation models in low-level control. It is expected to inspire further research that combines visual reasoning with interactive learning to advance sim-to-real transfer.
1 Kumar et al. "RMA: Rapid Motor Adaptation for Legged Robots." RSS 2021.
2 Chi et al. "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion." IJRR 2024.
3 Kerbl et al. "3D Gaussian Splatting for Real-Time Radiance Field Rendering." ACM TOG 2023.
Overall Assessment: This is a strong robot learning paper that combines 3D Gaussian splatting reconstruction, VLM-inferred physical priors, and online parameter estimation into a single sim-to-real pipeline. Despite certain limitations, its real-world results on T-block and hammer pushing support the central claim that fusing VLM priors with interaction data outperforms domain randomization, giving the work both academic value and practical promise.