iMoWM: Taming Interactive Multi-Modal World Model for Robotic Manipulation

Zhang, Wu, Lu et al.
Learned world models hold significant potential for robotic manipulation, as they can serve as a simulator for real-world interactions. While extensive progress has been made in 2D video-based world models, these approaches often lack geometric and spatial reasoning, which is essential for capturing the physical structure of the 3D world. To address this limitation, we introduce iMoWM, a novel interactive world model designed to generate color images, depth maps, and robot arm masks in an autoregressive manner conditioned on actions. To overcome the high computational cost associated with three-dimensional information, we propose MMTokenizer, which unifies multi-modal inputs into a compact token representation. This design enables iMoWM to leverage large-scale pretrained VideoGPT models while maintaining high efficiency and incorporating richer physical information. With its multi-modal representation, iMoWM not only improves the visual quality of future predictions but also serves as an effective simulator for model-based reinforcement learning (MBRL) and facilitates real-world imitation learning. Extensive experiments demonstrate the superiority of iMoWM across these tasks, showcasing the advantages of multi-modal world modeling for robotic manipulation. Homepage: https://xingyoujun.github.io/imowm/

Basic Information

  • Paper ID: 2510.09036
  • Title: iMoWM: Taming Interactive Multi-Modal World Model for Robotic Manipulation
  • Authors: Chuanrui Zhang¹, Zhengxian Wu², Guanxing Lu², Yansong Tang², Ziwei Wang¹
  • Affiliations: ¹Nanyang Technological University, ²Tsinghua University
  • Category: cs.RO (Robotics)
  • Publication Date: October 10, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.09036
  • Project Homepage: https://xingyoujun.github.io/imowm/

Abstract

Learned world models hold great potential for robotic manipulation because they can serve as simulators for real-world interactions. While significant progress has been made in 2D video-based world models, these methods often lack the geometric and spatial reasoning needed to capture the physical structure of the 3D world. To address this limitation, the authors propose iMoWM, an interactive world model that autoregressively generates RGB images, depth maps, and robot arm masks conditioned on actions. To overcome the high computational cost of 3D information, the authors introduce MMTokenizer, which unifies multi-modal inputs into a compact token representation. This design lets iMoWM reuse large-scale pre-trained VideoGPT models while remaining efficient and incorporating richer physical information.

Research Background and Motivation

Problem Definition

Robotic manipulation tasks require accurate prediction of physical dynamics in 3D environments, but existing world models suffer from the following issues:

  1. Lack of Geometric Understanding: Most methods rely solely on RGB video prediction, lacking explicit representations of 3D spatial structure
  2. High Computational Cost: Direct processing of 3D information (e.g., 3D Gaussians) incurs substantial computational overhead
  3. Limited Generalization: Absence of action-conditional constraints makes it difficult to adapt to diverse robotic manipulation scenarios

Research Motivation

Robotic manipulation takes place in three-dimensional space, and models that rely solely on RGB information are prone to errors under visual variations and complex object interactions. Existing 3D methods such as GWM use a 3D Gaussian Splatting (3DGS) representation, but they depend on high-quality 3DGS reconstruction, which is hard to obtain in monocular scenarios and difficult to scale.

Core Contributions

  1. Proposes iMoWM Framework: The first interactive multi-modal world model capable of simultaneously predicting RGB images, depth maps, and robot arm masks
  2. Designs MMTokenizer: An innovative multi-modal tokenizer that unifies heterogeneous inputs into compact token representations, significantly reducing computational costs
  3. Enables Multi-Task Applications: Supports action-conditioned video generation, model-based reinforcement learning (MBRL), and real-world imitation learning
  4. Demonstrates Superior Performance: Achieves state-of-the-art results on public benchmarks and real-world experiments

Methodology Details

Task Definition

Given an initial observation O₁ (containing RGB image, depth map, and robot arm mask) and an action sequence {aₜ}ᵀₜ₌₁, iMoWM must predict future multi-modal observation sequences {Oₜ}ᵀₜ₌₂.
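
The task definition amounts to an action-conditioned rollout loop. The sketch below is illustrative only: `Observation`, `rollout`, and `toy_step` are hypothetical names that do not come from the paper, and `toy_step` is a stand-in for the learned model. Each action consumed yields one predicted frame, so T-1 actions produce O₂ through O_T.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

import numpy as np


@dataclass
class Observation:
    """One multi-modal frame O_t: RGB image, depth map, robot arm mask."""
    rgb: np.ndarray    # shape (H, W, 3)
    depth: np.ndarray  # shape (H, W)
    mask: np.ndarray   # shape (H, W), boolean


# Hypothetical interface: (observation history, action) -> next observation.
StepFn = Callable[[List[Observation], np.ndarray], Observation]


def rollout(first: Observation, actions: Sequence[np.ndarray],
            step_fn: StepFn) -> List[Observation]:
    """Predict future frames autoregressively, one frame per action."""
    history = [first]
    for action in actions:
        # Each prediction is fed back in as context for the next step.
        history.append(step_fn(history, action))
    return history[1:]


def toy_step(history: List[Observation], action: np.ndarray) -> Observation:
    """Stand-in dynamics (assumption): shift depth by the first action value."""
    last = history[-1]
    return Observation(last.rgb, last.depth + action[0], last.mask)


o1 = Observation(rgb=np.zeros((4, 4, 3)),
                 depth=np.ones((4, 4)),
                 mask=np.zeros((4, 4), dtype=bool))
actions = [np.array([0.1]), np.array([0.2]), np.array([0.3])]
future = rollout(o1, actions, toy_step)
print(len(future), float(future[-1].depth[0, 0]))
```

In the real model the step function would be the tokenizer plus transformer; the loop structure is the only part this sketch is meant to convey.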

Model Architecture

MMTokenizer Design

MMTokenizer is the core innovation, employing a dual encoder-decoder framework {(Ec, Dc), (Ed, Dd)}:

  1. Context Encoding: The context pair (Ec, Dc) tokenizes and reconstructs the initial frames
    Zᶜₜ = Ec(Oₜ),  Ôₜ = Dc(Zᶜₜ),  t = 1, ..., T₀
    
  2. Dynamic Encoding: The conditional pair (Ed, Dd) encodes later frames conditioned on the context frames, focusing on dynamic regions
    Zᵈₜ = Ed(Oₜ | O₁:T₀),  Ôₜ = Dd(Zᵈₜ | O₁:T₀),  t = T₀+1, ..., T
    
  3. Modality Adaptation: Replicates first and last layers to handle feature distribution differences across modalities, introducing modality-specific embeddings
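
To see why splitting frames into context and dynamic encodings shortens the token sequence, the sketch below compares sequence lengths. All numbers (16 frames, 2 context frames, 256 and 16 tokens per frame) are assumed values chosen for illustration and are not reported in this summary; `sequence_length` is a hypothetical helper.

```python
def sequence_length(num_frames: int, context_frames: int,
                    tokens_per_context_frame: int,
                    tokens_per_dynamic_frame: int) -> int:
    """Total tokens when the first `context_frames` frames use the full
    context tokenization and the remaining frames use the compact one."""
    if not 0 < context_frames <= num_frames:
        raise ValueError("context_frames must be in 1..num_frames")
    dynamic_frames = num_frames - context_frames
    return (context_frames * tokens_per_context_frame
            + dynamic_frames * tokens_per_dynamic_frame)


# Illustrative values only (assumptions, not taken from the paper).
T, T0 = 16, 2
full = sequence_length(T, T, 256, 16)      # every frame tokenized at full size
compact = sequence_length(T, T0, 256, 16)  # only T0 context frames at full size
print(full, compact, round(full / compact, 2))
```

Under these assumed sizes the compact scheme needs 736 tokens instead of 4096. Because self-attention cost grows roughly quadratically with sequence length, the saving in compute is larger than the saving in tokens.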

Autoregressive Transformer

Employs a LLaMA-style transformer architecture, including:

  • RMSNorm normalization
  • SwiGLU activation function
  • Rotary position encoding
  • Action-conditional slot token injection mechanism
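
For readers unfamiliar with the first two components, here is a minimal NumPy sketch of RMSNorm and a SwiGLU feed-forward layer using their standard definitions. The dimensions and random weights are placeholders (assumptions), rotary embeddings are omitted for brevity, and this is not the authors' implementation.

```python
import numpy as np


def rms_norm(x: np.ndarray, weight: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Scale each feature vector by its root mean square, then apply a gain."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * weight


def silu(x: np.ndarray) -> np.ndarray:
    """SiLU / swish activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))


def swiglu_ffn(x: np.ndarray, w_gate: np.ndarray, w_up: np.ndarray,
               w_down: np.ndarray) -> np.ndarray:
    """SwiGLU feed-forward: a SiLU-gated projection followed by a down-projection."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down


rng = np.random.default_rng(0)
d_model, d_hidden = 8, 16                 # placeholder sizes (assumption)
x = rng.normal(size=(2, 5, d_model))      # (batch, tokens, features)
normed = rms_norm(x, np.ones(d_model))
out = swiglu_ffn(normed,
                 rng.normal(size=(d_model, d_hidden)),
                 rng.normal(size=(d_model, d_hidden)),
                 rng.normal(size=(d_hidden, d_model)))
print(out.shape)
```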

Action conditioning is implemented via slot tokens:

[Sₜ] = [S] + Linear(aₜ)
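
The slot-token formula can be sketched in a few lines of NumPy. The sizes, the random initialization, and the name `slot_tokens` are assumptions for illustration; in the model, [S] and the linear layer would be learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, action_dim = 8, 4                  # placeholder sizes (assumption)

slot = rng.normal(size=d_model)             # shared slot embedding [S]
W = rng.normal(size=(action_dim, d_model))  # weight of Linear(.)
b = np.zeros(d_model)                       # bias of Linear(.)


def slot_tokens(actions: np.ndarray) -> np.ndarray:
    """Map actions of shape (T, action_dim) to slot tokens of shape (T, d_model)
    via [S_t] = [S] + Linear(a_t); the slot embedding is broadcast over time."""
    return slot + actions @ W + b


tokens = slot_tokens(rng.normal(size=(3, action_dim)))
print(tokens.shape)
```

With a zero action and zero bias the slot token reduces to [S] itself, so the action acts as an additive offset on a shared frame-separator embedding.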

Training objective uses cross-entropy loss:

L_transformer = -∑ᵀₜ₌T₀₊₁ log p(Xₜ | Xₜ₋₁)
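
As a concrete reading of this objective, the sketch below computes a summed negative log-likelihood over discrete tokens from raw logits. The function name `token_nll` and the toy shapes are assumptions; how tokens are grouped per frame in the actual model is not specified here.

```python
import numpy as np


def token_nll(logits: np.ndarray, targets: np.ndarray) -> float:
    """Summed cross-entropy. logits: (N, V) scores over a vocabulary of size V;
    targets: (N,) integer indices of the ground-truth tokens."""
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].sum())


# With uniform logits, each token costs log(V) nats.
uniform_loss = token_nll(np.zeros((5, 10)), np.array([0, 1, 2, 3, 4]))
print(uniform_loss)
```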

Technical Innovations

  1. Unified Multi-Modal Representation: First to systematically encode RGB, depth, and masks together, avoiding information loss between modalities
  2. Computational Efficiency Optimization: Dynamic encoder focuses only on changing regions, significantly reducing token count
  3. Pre-trained Model Reuse: Design compatible with existing VideoGPT pre-trained weights, accelerating convergence

Experimental Setup

Datasets

  1. BAIR Robot Pushing Dataset: 43K training videos, 256 test videos, 64×64 resolution
  2. RoboNet Dataset: 19K training video subset, 256 test videos
  3. Self-Collected Dataset: 1K training videos, 150 test videos, 256×256 high resolution
  4. Meta-World Benchmark: 6 robotic manipulation tasks for reinforcement learning evaluation

Evaluation Metrics

  • Visual Quality: FVD, PSNR, SSIM, LPIPS
  • Depth Accuracy: AbsRel (Absolute Relative Error)
  • Manipulation Performance: Task success rate
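
Two of these metrics are simple enough to state in code. The sketch below uses the standard definitions of PSNR and AbsRel; the pixel range (`max_val=1.0`) and the masking of non-positive ground-truth depth are assumptions, and the paper's exact evaluation protocol may differ.

```python
import numpy as np


def psnr(pred: np.ndarray, target: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB; higher is better."""
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))


def abs_rel(pred_depth: np.ndarray, gt_depth: np.ndarray,
            eps: float = 1e-8) -> float:
    """Absolute relative depth error, mean(|pred - gt| / gt); lower is better."""
    valid = gt_depth > eps  # skip pixels without valid ground-truth depth
    return float(np.mean(np.abs(pred_depth[valid] - gt_depth[valid])
                         / gt_depth[valid]))


example_psnr = psnr(np.full((4, 4), 0.1), np.zeros((4, 4)))
example_abs_rel = abs_rel(np.full((4, 4), 2.1), np.full((4, 4), 2.0))
print(example_psnr, example_abs_rel)
```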

Baseline Methods

  • MaskViT, SVG, GHVAE (video prediction baselines)
  • iVideoGPT (strongest RGB baseline)
  • GWM (3D Gaussian method)

Implementation Details

  • Video Depth Anything used for depth map generation
  • Grounding DINO + SAM2 for robot arm mask extraction
  • Pre-trained weights for transformer initialization
  • 4 rollouts for fair comparison

Experimental Results

Main Results

Video Generation Performance

On BAIR dataset:

  • FVD: 60.9 (vs iVideoGPT 65.01)
  • PSNR: 23.82 (vs iVideoGPT 23.40)
  • SSIM: 0.896 (vs iVideoGPT 0.882)
  • LPIPS: 0.051 (vs iVideoGPT 0.058)
  • AbsRel: 0.045 (vs iVideoGPT 0.059)

iMoWM also outperforms the baselines on the RoboNet dataset, and it reaches a PSNR of 38.33 on the high-resolution real-world data.

Reinforcement Learning Performance

Outperforms iVideoGPT and GWM on all 6 Meta-World tasks, with faster convergence and higher final success rates. Geometry-aware rollouts significantly improve RL performance.

Real-World Deployment

On a GALAXEA A1 robot, for cup-stacking and bread-picking tasks:

  • Overall success rate: 29/35 (vs iVideoGPT 13/35, GT 27/35)
  • Slightly exceeds the ground-truth (GT) reference on these counts (29 vs 27 of 35), which supports the claim of high-fidelity multi-modal rollouts
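
Converting the reported counts to percentages makes the gap easier to read. The snippet only restates the figures listed above.

```python
trials = 35
successes = {"iMoWM": 29, "iVideoGPT": 13, "GT": 27}

# Success rate per method, as a fraction of the 35 trials.
rates = {name: count / trials for name, count in successes.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")
```

This gives roughly 82.9% for iMoWM, 37.1% for iVideoGPT, and 77.1% for the GT reference. With 35 trials, the two-trial difference between iMoWM and GT is small and should be read with that sample size in mind.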

Ablation Studies

  1. MMTokenizer Effect: Compared to original tokenizer, inference time reduced from 860s to 10s while improving all visual metrics
  2. Modality Contribution Analysis:
    • RGB+Depth+Mask (complete method): FVD 67.6
    • RGB only: FVD 70.2
    • RGB+Mask: FVD 70.6
    • RGB+Depth: FVD 67.5

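FVD is a lower-is-better metric. On FVD alone, depth accounts for essentially all of the gain (70.2 to 67.5 when added to RGB), while adding the mask changes FVD by at most 0.4 and does not lower it in either configuration.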
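
The ablation figures can be tabulated relative to the RGB-only variant, together with the tokenizer speedup implied by the reported inference times. The snippet uses only the numbers listed above.

```python
fvd = {
    "RGB only": 70.2,
    "RGB+Mask": 70.6,
    "RGB+Depth": 67.5,
    "RGB+Depth+Mask": 67.6,
}

# FVD change relative to RGB only; negative values are improvements.
baseline = fvd["RGB only"]
deltas = {name: round(value - baseline, 1) for name, value in fvd.items()}

# Inference time: 860 s with the original tokenizer, 10 s with MMTokenizer.
speedup = 860 / 10

print(deltas)
print(f"speedup: {speedup:.0f}x")
```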

Experimental Findings

  1. Resolution Sensitivity: High-resolution inputs significantly improve performance by providing more precise depth and mask information
  2. Importance of Geometric Information: Depth maps provide richer geometric constraints than masks
  3. Computational Efficiency: MMTokenizer substantially improves inference speed while maintaining performance

Related Work

Learning World Models

Early methods inspired by VideoGPT tokenize RGB video, while recent diffusion models advance prediction in latent space. GWM employs 3DGS but is limited by reconstruction quality in monocular scenes.

4D Video Prediction

Methods like TesserAct explore RGB-D generation but lack explicit action conditioning, limiting robotic applications.

Robotic Manipulation World Models

Existing world models for manipulation are used primarily for data augmentation and RL simulation, but they generally lack 3D information, which limits their effectiveness as simulators and data generators.

Conclusions and Discussion

Main Conclusions

  1. Multi-modal world models significantly outperform pure RGB methods
  2. MMTokenizer achieves a good balance between efficiency and performance
  3. Geometric information is crucial for robotic manipulation tasks
  4. The method performs well in both simulation and real environments

Limitations

  1. Pre-training Dependency: Still requires large-scale pre-training to fully leverage multi-modal world model generalization
  2. Computational Resources: While more efficient than 3DGS methods, still more computationally intensive than pure RGB approaches
  3. Depth Quality Dependency: Performance is affected by depth estimation quality

Future Directions

  1. Explore larger-scale multi-modal pre-training
  2. Investigate more efficient 3D representation methods
  3. Extend to more robotic platforms and task types

In-Depth Evaluation

Strengths

  1. Strong Innovation: First systematic introduction of multi-modal information into world models, using a novel technical approach
  2. Engineering Completeness: Forms a complete closed loop from theoretical design to practical deployment
  3. Comprehensive Experiments: Covers simulation, benchmark testing, and real robot validation
  4. Significant Performance Gains: Achieves notable improvements across multiple metrics

Weaknesses

  1. Insufficient Theoretical Analysis: Lacks in-depth theoretical explanation for why multi-modal information improves performance
  2. Limited Generalization Verification: Primarily validated on specific robot platforms; cross-platform generalization requires further verification
  3. Incomplete Computational Analysis: While efficiency improvements are mentioned, detailed computational complexity analysis is lacking

Impact

  1. Academic Value: Provides a new multi-modal direction for world model research
  2. Practical Value: Directly applicable to real robotic systems
  3. Reproducibility: Provides implementation details and open-source commitments

Applicable Scenarios

  1. Robotic manipulation tasks requiring precise geometric understanding
  2. Data-scarce robotic learning scenarios
  3. Reinforcement learning applications requiring high-fidelity simulation

References

This paper cites 63 references covering important works in world models, video prediction, robotic learning, and other domains, providing a solid foundation for the research.


Overall Assessment: This is a high-quality robotics learning paper making important contributions to multi-modal world models. The technical innovations are clear, experimental validation is comprehensive, and it possesses strong academic and practical value.