Phys2Real: Fusing VLM Priors with Interactive Online Adaptation for Uncertainty-Aware Sim-to-Real Manipulation
Wang, Tian, Swann et al.
Learning robotic manipulation policies directly in the real world can be expensive and time-consuming. While reinforcement learning (RL) policies trained in simulation present a scalable alternative, effective sim-to-real transfer remains challenging, particularly for tasks that depend on precise dynamics. To address this, we propose Phys2Real, a real-to-sim-to-real RL pipeline that combines vision-language model (VLM)-inferred physical parameter estimates with interactive adaptation through uncertainty-aware fusion. Our approach consists of three core components: (1) high-fidelity geometric reconstruction with 3D Gaussian splatting, (2) VLM-inferred prior distributions over physical parameters, and (3) online physical parameter estimation from interaction data. Phys2Real conditions policies on interpretable physical parameters and refines the VLM predictions with online estimates, using ensemble-based uncertainty quantification to weigh the two sources. On planar pushing tasks with a T-block of varying center of mass (CoM) and a hammer with an off-center mass distribution, Phys2Real achieves substantial improvements over a domain randomization baseline: 100% vs 79% success rate for the bottom-weighted T-block, 57% vs 23% for the more challenging top-weighted T-block, and 15% faster average task completion for hammer pushing. Ablation studies indicate that combining VLM and interaction information is essential for success. Project website: https://phys2real.github.io/
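This paper proposes Phys2Real, a real-to-sim-to-real reinforcement learning pipeline that combines physical parameter estimation by a vision-language model (VLM) with interactive online adaptation, addressing the challenge of sim-to-real transfer in robot manipulation through uncertainty-aware fusion. The method consists of three core components: (1) high-fidelity geometric reconstruction based on 3D Gaussian splatting, (2) VLM-inferred prior distributions over physical parameters, and (3) online physical parameter estimation from interaction data. On planar pushing tasks with a T-block and a hammer, Phys2Real achieved marked improvements over a domain randomization baseline: 100% vs 79% success rate for the bottom-weighted T-block, 57% vs 23% for the top-weighted T-block, and a 15% faster average completion time on the hammer pushing task.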
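The abstract does not spell out the fusion rule, so the following is only a minimal sketch of one common way to combine a prior with an ensemble-based estimate: inverse-variance (precision-weighted) fusion of two Gaussians. The Gaussian assumption, the function name `fuse_gaussian`, the use of ensemble spread as the estimate's variance, and all numbers are illustrative assumptions, not the authors' implementation.

```python
from statistics import fmean, pvariance


def fuse_gaussian(prior_mean, prior_var, ensemble_preds, min_var=1e-6):
    """Fuse a Gaussian prior with an ensemble estimate (hypothetical helper).

    prior_mean, prior_var: assumed Gaussian prior, e.g. a VLM's guess of a
        center-of-mass offset and how uncertain that guess is.
    ensemble_preds: per-member predictions of the same parameter from
        interaction data; their spread stands in for the estimate's variance.
    """
    # Ensemble mean and spread summarize the online estimate.
    est_mean = fmean(ensemble_preds)
    est_var = max(pvariance(ensemble_preds), min_var)
    prior_var = max(prior_var, min_var)

    # Precision (inverse variance) acts as the weight of each source.
    w_prior = 1.0 / prior_var
    w_est = 1.0 / est_var

    # Product of two Gaussians: precisions add, means are precision-weighted.
    fused_var = 1.0 / (w_prior + w_est)
    fused_mean = fused_var * (w_prior * prior_mean + w_est * est_mean)
    return fused_mean, fused_var


if __name__ == "__main__":
    # Made-up numbers: a vague prior centered at 0.0 and a tight ensemble near 0.10.
    mean, var = fuse_gaussian(0.0, 0.04, [0.10, 0.12, 0.08, 0.10])
    print(f"fused mean={mean:.4f}, fused variance={var:.6f}")
```

Under this rule a confident ensemble pulls the fused value toward the interaction-based estimate, while a widely disagreeing ensemble leaves the result close to the prior. That matches the qualitative behavior the abstract describes, though the actual weighting used in Phys2Real may differ.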