Phys2Real: Fusing VLM Priors with Interactive Online Adaptation for Uncertainty-Aware Sim-to-Real Manipulation

Wang, Tian, Swann et al.

Learning robotic manipulation policies directly in the real world can be expensive and time-consuming. While reinforcement learning (RL) policies trained in simulation present a scalable alternative, effective sim-to-real transfer remains challenging, particularly for tasks that require precise dynamics. To address this, we propose Phys2Real, a real-to-sim-to-real RL pipeline that combines vision-language model (VLM)-inferred physical parameter estimates with interactive adaptation through uncertainty-aware fusion. Our approach consists of three core components: (1) high-fidelity geometric reconstruction with 3D Gaussian splatting, (2) VLM-inferred prior distributions over physical parameters, and (3) online physical parameter estimation from interaction data. Phys2Real conditions policies on interpretable physical parameters, refining VLM predictions with online estimates via ensemble-based uncertainty quantification. On planar pushing tasks of a T-block with varying center of mass (CoM) and a hammer with an off-center mass distribution, Phys2Real achieves substantial improvements over a domain randomization baseline: 100% vs 79% success rate for the bottom-weighted T-block, 57% vs 23% in the challenging top-weighted T-block, and 15% faster average task completion for hammer pushing. Ablation studies indicate that the combination of VLM and interaction information is essential for success. Project website: https://phys2real.github.io/ .

Basic Information

Paper ID: 2510.11689
Title: Phys2Real: Fusing VLM Priors with Interactive Online Adaptation for Uncertainty-Aware Sim-to-Real Manipulation
Authors: Maggie Wang¹, Stephen Tian¹, Aiden Swann¹, Ola Shorinwa², Jiajun Wu¹, Mac Schwager¹
Affiliations: ¹Stanford University, ²Princeton University
Categories: cs.RO (Robotics), cs.AI (Artificial Intelligence)
Publication Date: October 13, 2025
Paper Link: https://arxiv.org/abs/2510.11689v1

Abstract

This paper proposes Phys2Real, a real-to-sim-to-real reinforcement learning pipeline that combines vision language model (VLM) physical parameter estimation with interactive online adaptation to address sim-to-real transfer challenges in robotic manipulation through uncertainty-aware fusion. The method comprises three core components: (1) high-fidelity geometric reconstruction based on 3D Gaussian splatting, (2) prior distributions of physical parameters inferred by VLMs, and (3) online physical parameter estimation based on interactive data. In planar pushing tasks with T-shaped blocks and hammers, Phys2Real achieves significant improvements over domain randomization baselines: 100% vs 79% success rate for bottom-weighted T-blocks, 57% vs 23% for top-weighted T-blocks, and 15% faster average completion time for hammer pushing tasks.

Research Background and Motivation

Core Problem

Transferring robotic manipulation policies from simulation to the real world remains a fundamental challenge, particularly for tasks that require precise dynamics. Traditional domain randomization (DR) provides robustness, but the resulting policies tend to average their behavior over the randomized dynamics and cannot adapt to the physical properties of a specific object.

Research Motivation

Humans explore effectively when manipulating novel objects: they first form an initial judgment about physical properties from visual appearance, then refine that estimate through interaction. Inspired by this, the paper aims to give robots a similar capability by combining visual physical reasoning with interactive learning to improve real-world manipulation performance.

Limitations of Existing Methods

Domain Randomization: Trains robust policies but sacrifices performance; cannot adapt to object-specific variations
System Identification: Requires manual parameter tuning; produces static models
Online Policy Adaptation: Faces challenges in intermittent contact scenarios; lacks external prior information
Digital Twins: Focuses on visual fidelity while neglecting physical properties

Core Contributions

Uncertainty-Aware Fusion of VLM Priors with Interactive Adaptation: First demonstration that VLMs can provide physical parameter estimation (e.g., center of mass) and can be combined with interaction-based parameter estimation for real-time low-level closed-loop control
Ensemble-Based Uncertainty Quantification: Decomposes uncertainty into epistemic and aleatoric components, fusing VLM priors and interactive estimates through inverse-variance weighting
Physics-Informed Digital Twin: Combines 3D Gaussian splatting reconstruction with online physical property estimation to create digital twins containing both geometric and physical information

Methodology Details

Task Definition

The paper studies nonprehensile (non-grasping) manipulation: the robot pushes objects with varying physical properties (e.g., center of mass, friction coefficient) to a target position and orientation. The policy takes the object pose, the robot end-effector position, and the estimated physical parameters as input, and outputs changes in end-effector position.

Model Architecture

1. Real-to-Sim Scene Reconstruction

Segment target objects using SAM-2
Train 3D Gaussian splatting (GSplat) models
Extract surface-aligned meshes via SuGaR
Generate simulation-ready watertight mesh assets

2. Physics-Conditioned Policy Learning

Employs a three-stage training paradigm:

Phase 1: Policy trained conditioned on ground-truth physical parameters
Phase 1.5: Fine-tune policy with noisy physical parameters to establish robustness to downstream noisy estimates
Phase 2: Train ensemble of N=10 adaptation models predicting physical parameters from observation-action history
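
The Phase 1.5 idea of conditioning on noisy parameters can be illustrated with a minimal sketch. The additive Gaussian noise model, the `noise_std` value, and the function name `perturb_physical_params` are assumptions made for illustration; this summary does not state how the noise is generated.

```python
import numpy as np

def perturb_physical_params(true_params, noise_std, rng):
    """Return a noisy copy of ground-truth physical parameters.

    Hypothetical noise model: zero-mean Gaussian noise added per parameter.
    """
    true_params = np.asarray(true_params, dtype=float)
    noise = rng.normal(loc=0.0, scale=noise_std, size=true_params.shape)
    return true_params + noise

rng = np.random.default_rng(0)
com_true_cm = np.array([6.1])   # CoM offset of the top-weighted T-block, in cm
com_noisy_cm = perturb_physical_params(com_true_cm, noise_std=1.0, rng=rng)
```

During fine-tuning, the policy would receive `com_noisy_cm` in place of `com_true_cm`, so that it learns to tolerate imperfect estimates at deployment time.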

3. Uncertainty Quantification and Fusion

VLM Estimation (θ_vlm, σ_vlm):

Query GPT-5 to estimate task-relevant physical parameters
Query M times for each of N images; compute aggregated mean and uncertainty
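
One simple way to turn repeated VLM responses into a prior (θ_vlm, σ_vlm) is to pool all responses and take their mean and sample standard deviation. This pooling rule is an assumption for illustration; the summary states only that a mean and an uncertainty are aggregated. The function name and the sample values are hypothetical.

```python
import numpy as np

def aggregate_vlm_estimates(samples):
    """Pool scalar VLM estimates into a mean and a standard deviation.

    samples: array-like of shape (n_images, m_queries).
    Assumed aggregation: mean and sample std over all pooled responses.
    """
    samples = np.asarray(samples, dtype=float)
    theta_vlm = samples.mean()
    sigma_vlm = samples.std(ddof=1)
    return theta_vlm, sigma_vlm

# Two images, three queries each (made-up CoM estimates in metres).
theta_vlm, sigma_vlm = aggregate_vlm_estimates([[0.04, 0.05, 0.06],
                                                [0.05, 0.05, 0.05]])
```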

RMA Estimation (θ_rma, σ_rma):

Epistemic uncertainty: σ²_epistemic = (1/N)∑(θᵢ - θ_rma)²
Aleatoric uncertainty: σ²_aleatoric = (1/N)∑σᵢ²
Total RMA uncertainty: σ²_rma = σ²_epistemic + σ²_aleatoric
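
The decomposition above can be sketched in a few lines. The sketch assumes that each of the N ensemble members outputs a predicted mean θᵢ and a predicted variance σᵢ², and that θ_rma is the mean of the member predictions; the function name is hypothetical.

```python
import numpy as np

def ensemble_uncertainty(member_means, member_variances):
    """Split ensemble uncertainty into epistemic and aleatoric parts.

    member_means:     per-member predicted parameter values (theta_i)
    member_variances: per-member predicted variances (sigma_i^2)
    """
    member_means = np.asarray(member_means, dtype=float)
    member_variances = np.asarray(member_variances, dtype=float)
    theta_rma = member_means.mean()                      # assumed: ensemble mean
    var_epistemic = np.mean((member_means - theta_rma) ** 2)  # disagreement between members
    var_aleatoric = member_variances.mean()              # average predicted noise
    var_rma = var_epistemic + var_aleatoric
    return theta_rma, var_epistemic, var_aleatoric, var_rma

theta_rma, var_epi, var_ale, var_rma = ensemble_uncertainty([1.0, 3.0], [0.5, 1.5])
```

In this toy case the two members disagree by 2.0, so the epistemic term is 1.0, the aleatoric term is the average of 0.5 and 1.5, and the total variance is 2.0.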

Inverse-Variance Weighted Fusion:

θ̂ = (θ_vlm/σ²_vlm + θ_rma/σ²_rma) / (1/σ²_vlm + 1/σ²_rma)
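
A minimal sketch of the fusion rule follows. The fused variance returned here, 1/(1/σ²_vlm + 1/σ²_rma), is the standard result for combining two independent Gaussian estimates; it is included as an assumption and is not stated in this summary. The function name is hypothetical.

```python
def fuse_inverse_variance(theta_vlm, var_vlm, theta_rma, var_rma):
    """Fuse two scalar estimates, weighting each by the inverse of its variance."""
    w_vlm = 1.0 / var_vlm
    w_rma = 1.0 / var_rma
    theta_hat = (w_vlm * theta_vlm + w_rma * theta_rma) / (w_vlm + w_rma)
    var_hat = 1.0 / (w_vlm + w_rma)  # assumes independent Gaussian errors
    return theta_hat, var_hat

# Equal confidence: the fused estimate is the midpoint.
equal = fuse_inverse_variance(0.0, 1.0, 2.0, 1.0)

# Very uncertain interaction estimate: the result stays close to the VLM prior.
prior_dominated = fuse_inverse_variance(0.0, 0.01, 1.0, 100.0)
```

The second call shows the behavior described in the text: when the interaction-based variance is large, the fused value is dominated by the VLM prior, and the weighting reverses as contact data reduces σ²_rma.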

Technical Innovations

Interpretable Physical Parameters: Direct conditioning on physical parameters rather than learned latent variables enables direct fusion of VLM estimates
Dual-Source Uncertainty Fusion: Relies more on VLM estimates when interaction history uncertainty is high, and vice versa
Ensemble Uncertainty Decomposition: Separates model uncertainty from data uncertainty, providing more precise uncertainty estimates

Experimental Setup

Experimental Tasks

T-Shaped Block Pushing: Modify center of mass by placing 143-gram metal weights at different positions; test two configurations
- Weight on top: center of mass +6.1cm, more challenging
- Weight on bottom: center of mass -0.7cm, relatively simple
Hammer Pushing: Center of mass near hammer head, producing complex motion dynamics

Evaluation Metrics

Success rate: position error <3cm and orientation error <20°
Final position error (cm)
Final orientation error (degrees)
Task completion time (seconds)
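
The success criterion can be written as a small check. The 3 cm and 20° thresholds come from the text; the planar pose representation `(x_cm, y_cm, yaw_deg)`, the Euclidean position error, the yaw wrapping, and the function names are assumptions for illustration.

```python
import math

def planar_errors(pose, goal):
    """Return (position error in cm, absolute orientation error in degrees).

    pose, goal: tuples of (x_cm, y_cm, yaw_deg).
    """
    pos_err = math.hypot(pose[0] - goal[0], pose[1] - goal[1])
    # Wrap the yaw difference into [-180, 180) before taking its magnitude.
    dyaw = (pose[2] - goal[2] + 180.0) % 360.0 - 180.0
    return pos_err, abs(dyaw)

def is_success(pos_err_cm, ori_err_deg, pos_tol_cm=3.0, ori_tol_deg=20.0):
    """Success requires both errors to be strictly below their thresholds."""
    return pos_err_cm < pos_tol_cm and ori_err_deg < ori_tol_deg

near = planar_errors((1.0, 1.0, 5.0), (0.0, 0.0, 0.0))
far = planar_errors((3.0, 4.0, 350.0), (0.0, 0.0, 10.0))
```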

Comparison Methods

Domain Randomization (DR): Standard domain randomization baseline
Diffusion Policy: Supervised imitation-learning baseline
RMA-only: Adaptation model only
Physics-conditioned VLM: VLM estimate only
Physics-conditioned privileged: Privileged baseline using ground-truth physical parameters

Implementation Details

6-DOF UFactory xArm robotic arm
PPO training with 4096 parallel environments
Asymmetric actor-critic architecture
Motion capture system for precise object pose acquisition

Experimental Results

Main Results

T-Shaped Block Pushing (Weight on Bottom):

Phys2Real: 100% success rate, 1.76±0.54cm position error
DR baseline: 79.17% success rate, 7.14±11.34cm position error
Privileged baseline: 95.83% success rate, 1.92±0.50cm position error

T-Shaped Block Pushing (Weight on Top, More Challenging):

Phys2Real: 57.14% success rate, 2.60±0.90cm position error
DR baseline: 23.81% success rate, 6.00±5.78cm position error
Privileged baseline: 90.48% success rate, 1.90±0.98cm position error

Hammer Pushing:

Both Phys2Real and DR achieve 100% success rate
Phys2Real average completion time: 77.79±44.08 seconds
DR average completion time: 90.65±42.03 seconds
Phys2Real is therefore 14.2% faster on average (reported as 15% in the abstract)
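
The relative improvement follows directly from the two mean completion times reported for hammer pushing. A short check, using a hypothetical helper name:

```python
def relative_speedup(t_method, t_baseline):
    """Fraction of the baseline time saved by the method."""
    return (t_baseline - t_method) / t_baseline

speedup = relative_speedup(77.79, 90.65)
print(f"{speedup:.1%}")  # prints 14.2%
```

Note that the standard deviations of both methods (about 42 to 44 seconds) are large relative to the 12.86-second difference in means.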

Ablation Studies

VLM vs RMA Individual Usage:

VLM estimate only: 4.76% success rate (weight on top)
RMA only: 14.29% success rate (weight on top)
Phys2Real fusion: 57.14% success rate

Results demonstrate that combining VLM and interactive information is crucial; neither alone achieves satisfactory performance.

Case Analysis

Figure 6 illustrates parameter estimation evolution during typical execution:

Initially, RMA estimates exhibit high uncertainty and deviate from ground truth
As contact continues, uncertainty decreases and fused estimates converge toward ground truth
After contact ends, uncertainty rises again due to lack of new information

Experimental Findings

Value of Physical Parameter Estimation: Accurate physical parameter estimation significantly improves manipulation performance
Necessity of Fusion: VLM and interactive information are both essential; individual usage causes dramatic performance degradation
Importance of Uncertainty Awareness: Effective information fusion achieved through uncertainty-weighted combination
Robustness: Exhibits strong robustness to inaccurate VLM estimates

Related Work

Domain Randomization and System Identification

Traditional methods train robust policies by randomizing simulation dynamics, but the resulting policies tend to average their behavior over the randomized range, which sacrifices performance. System identification methods require manual parameter tuning and produce static models.

Online Policy Adaptation

Methods like RMA perform well in continuous contact scenarios (e.g., locomotion) but face challenges in intermittent contact of general manipulation tasks. This paper addresses this through VLM priors and uncertainty-aware fusion.

Digital Twins and Rendering

NeRF and GSplat enable high-fidelity 3D scene reconstruction, but existing digital twins focus on visual fidelity while neglecting physical properties. This paper creates physics-informed digital twins.

VLM Physical Reasoning

Recent work demonstrates VLMs' physical reasoning capabilities, primarily for high-level planning. This paper is the first to directly integrate VLM physical parameter estimation into low-level control policies.

Conclusions and Discussion

Main Conclusions

Phys2Real successfully demonstrates the effectiveness of combining VLM visual reasoning with interactive adaptation, significantly outperforming domain randomization baselines across multiple manipulation tasks. The uncertainty-aware fusion mechanism enables the system to dynamically adjust weights based on reliability of each information source.

Limitations

Symmetry Assumption: Reconstruction pipeline performs best on approximately symmetric objects; mirroring may distort true shapes of asymmetric objects
VLM Estimation Bias: VLMs tend to shift toward geometric centers, potentially producing physically inconsistent estimates
Task Complexity: Current validated tasks are relatively simple; generalization to complex operations remains uncertain
Perception Dependency: Relies on motion capture systems; transition to pure visual perception is a future direction

Future Directions

Extend reconstruction strategies to asymmetric objects
Replace motion capture with perception-based tracking
Validate performance on more complex manipulation tasks
Explore estimation of other physical parameters (friction, stiffness)

In-Depth Evaluation

Strengths

Strong Novelty: First integration of VLM physical reasoning with RMA-style adaptation, opening new research directions
Sound Technical Approach: Uncertainty decomposition and inverse-variance weighted fusion have theoretical foundations
Comprehensive Experiments: Multi-task, multi-configuration evaluation with ablation studies revealing component contributions
High Practical Value: Provides new solutions for sim-to-real transfer

Weaknesses

Limited Task Scope: Validates only planar pushing tasks; generalization to more complex manipulation is unknown
VLM Dependency: Heavily relies on VLM physical reasoning capabilities; potential systematic biases
Computational Overhead: Ensemble methods and VLM queries may introduce additional computational costs
Insufficient Theoretical Analysis: Lacks convergence analysis of fusion strategy

Impact

This work makes important contributions to robot learning, demonstrating the potential of foundation models in low-level control. It is expected to inspire more research combining visual reasoning with interactive learning, advancing sim-to-real transfer technology.

Applicable Scenarios

Manipulation tasks requiring precise physical modeling
Scenarios with unknown or varying object physical properties
Nonprehensile (non-grasping) manipulation with intermittent contact
Applications requiring rapid adaptation to novel objects

Overall Assessment: This is a high-quality robot learning paper that combines several recent techniques into a novel and effective approach to the sim-to-real transfer problem. Despite certain limitations, its technical contributions and experimental validation meet a high standard and have significant academic value and application prospects.