
iMoWM: Taming Interactive Multi-Modal World Model for Robotic Manipulation

Zhang, Wu, Lu et al.

Learned world models hold significant potential for robotic manipulation, as they can serve as simulators for real-world interactions. While extensive progress has been made in 2D video-based world models, these approaches often lack geometric and spatial reasoning, which is essential for capturing the physical structure of the 3D world. To address this limitation, we introduce iMoWM, a novel interactive world model designed to generate color images, depth maps, and robot arm masks in an autoregressive manner conditioned on actions. To overcome the high computational cost associated with three-dimensional information, we propose MMTokenizer, which unifies multi-modal inputs into a compact token representation. This design enables iMoWM to leverage large-scale pretrained VideoGPT models while maintaining high efficiency and incorporating richer physical information. With its multi-modal representation, iMoWM not only improves the visual quality of future predictions but also serves as an effective simulator for model-based reinforcement learning (MBRL) and facilitates real-world imitation learning. Extensive experiments demonstrate the superiority of iMoWM across these tasks, showcasing the advantages of multi-modal world modeling for robotic manipulation. Homepage: https://xingyoujun.github.io/imowm/


Basic Information

Paper ID: 2510.09036
Title: iMoWM: Taming Interactive Multi-Modal World Model for Robotic Manipulation
Authors: Chuanrui Zhang¹, Zhengxian Wu², Guanxing Lu², Yansong Tang², Ziwei Wang¹
Affiliations: ¹Nanyang Technological University, ²Tsinghua University
Category: cs.RO (Robotics)
Publication Date: October 10, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.09036
Project Homepage: https://xingyoujun.github.io/imowm/

Abstract

Learning world models holds tremendous potential for robotic manipulation, serving as simulators for real-world interactions. While significant progress has been made in 2D video-based world models, these methods often lack geometric and spatial reasoning capabilities essential for capturing the physical structure of 3D worlds. To address this limitation, the authors propose iMoWM, a novel interactive world model capable of autoregressively generating RGB images, depth maps, and robot arm masks conditioned on actions. To overcome the high computational costs associated with 3D information, the authors introduce MMTokenizer, which unifies multi-modal inputs into compact token representations. This design enables iMoWM to leverage large-scale pre-trained VideoGPT models while maintaining efficiency and incorporating richer physical information.

Research Background and Motivation

Problem Definition

Robotic manipulation tasks require accurate prediction of physical dynamics in 3D environments, but existing world models suffer from the following issues:

Lack of Geometric Understanding: Most methods rely solely on RGB video prediction, lacking explicit representations of 3D spatial structure
High Computational Cost: Direct processing of 3D information (e.g., 3D Gaussians) incurs substantial computational overhead
Limited Generalization: Absence of action-conditional constraints makes it difficult to adapt to diverse robotic manipulation scenarios

Research Motivation

Robotic manipulation occurs in three-dimensional space, and relying solely on RGB information is prone to errors under visual variations and complex object interactions. Existing 3D methods such as GWM build on 3D Gaussian Splatting (3DGS), but they depend on high-quality 3DGS reconstruction, which is hard to obtain in monocular scenarios and difficult to scale.

Core Contributions

Proposes iMoWM Framework: The first interactive multi-modal world model capable of simultaneously predicting RGB images, depth maps, and robot arm masks
Designs MMTokenizer: An innovative multi-modal tokenizer that unifies heterogeneous inputs into compact token representations, significantly reducing computational costs
Enables Multi-Task Applications: Supports action-conditioned video generation, model-based reinforcement learning (MBRL), and real-world imitation learning
Demonstrates Superior Performance: Achieves state-of-the-art results on public benchmarks and real-world experiments

Methodology Details

Task Definition

Given an initial observation O₁ (containing RGB image, depth map, and robot arm mask) and an action sequence {aₜ}ᵀₜ₌₁, iMoWM must predict future multi-modal observation sequences {Oₜ}ᵀₜ₌₂.
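The task definition above amounts to an autoregressive rollout loop. The following is a minimal sketch of that interface only, not the authors' implementation: the `Observation` container, the `StepFn` signature, and the convention that action aₜ drives the transition from Oₜ to Oₜ₊₁ are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence


@dataclass
class Observation:
    """One multi-modal observation O_t; payload types are placeholders."""
    rgb: Any
    depth: Any
    arm_mask: Any


# Hypothetical one-step predictor: (O_1..O_t, a_t) -> O_{t+1}.
StepFn = Callable[[List[Observation], Sequence[float]], Observation]


def rollout(first_obs: Observation,
            actions: Sequence[Sequence[float]],
            step_fn: StepFn) -> List[Observation]:
    """Predict O_2..O_T from O_1 and the action sequence a_1..a_T."""
    history = [first_obs]
    # Assumption: a_t drives the transition O_t -> O_{t+1}, so predicting
    # O_2..O_T consumes a_1..a_{T-1} and leaves a_T unused.
    for action in actions[:-1]:
        history.append(step_fn(history, action))
    return history[1:]
```

Any learned model can be plugged in as `step_fn`; each prediction is appended to the history, so later steps condition on earlier predictions rather than on ground truth.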

Model Architecture

MMTokenizer Design

MMTokenizer is the core innovation, employing a dual encoder-decoder framework {(Ec, Dc), (Ed, Dd)}:

Context Encoding: Processes initial frames using context encoder
```
Zᶜₜ = Ec(Oₜ), Ôₜ = Dc(Zᶜₜ) t = 1,...,T₀
```

Dynamic Encoding: Conditional encoder focuses on dynamic regions

Zᵈₜ = Ed(Oₜ | O₁:T₀), Ôₜ = Dd(Zᵈₜ | O₁:T₀), t = T₀+1,...,T

Modality Adaptation: Replicates first and last layers to handle feature distribution differences across modalities, introducing modality-specific embeddings
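The split between context and dynamic encoding can be sketched as a simple routing rule: the first T₀ frames go through the context encoder, and later frames go through a conditional encoder that also sees the context frames. The sketch below is illustrative only. The encoder callables are toy stand-ins, and the per-frame token budgets (16 and 4) are hypothetical numbers chosen to show why conditional encoding shortens the token sequence.

```python
from typing import Callable, List, Sequence

Tokens = List[int]


def tokenize_sequence(frames: Sequence[object],
                      t0: int,
                      context_encoder: Callable[[object], Tokens],
                      dynamic_encoder: Callable[[object, List[object]], Tokens],
                      ) -> List[Tokens]:
    """Route frames 1..T0 to the context encoder and T0+1..T to the dynamic one."""
    if not 1 <= t0 <= len(frames):
        raise ValueError("t0 must be between 1 and the number of frames")
    context = list(frames[:t0])
    tokens = [context_encoder(frame) for frame in context]
    # Later frames are encoded conditioned on the context frames.
    tokens += [dynamic_encoder(frame, context) for frame in frames[t0:]]
    return tokens


# Toy stand-ins with hypothetical token budgets (16 vs. 4 tokens per frame).
def toy_context_encoder(frame: object) -> Tokens:
    return [0] * 16


def toy_dynamic_encoder(frame: object, context: List[object]) -> Tokens:
    return [1] * 4


per_frame = tokenize_sequence(["O1", "O2", "O3", "O4", "O5"], 2,
                              toy_context_encoder, toy_dynamic_encoder)
token_counts = [len(t) for t in per_frame]
```

With these toy budgets, five frames cost 16 + 16 + 4 + 4 + 4 tokens instead of 5 × 16, which is the kind of saving the dynamic encoder is meant to provide.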

Autoregressive Transformer

Employs a LLaMA-style transformer architecture, including:

RMSNorm normalization
SwiGLU activation function
Rotary position encoding
Action-conditional slot token injection mechanism

Action conditioning is implemented via slot tokens:

[Sₜ] = [S] + Linear(aₜ)
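The slot-token rule above adds a linear projection of the action to a shared slot embedding. A minimal sketch in plain Python follows; the dimensions, weights, and function names are hypothetical, and a real model would use learned parameters in a tensor library.

```python
from typing import List, Sequence


def linear(x: Sequence[float],
           weight: Sequence[Sequence[float]],
           bias: Sequence[float]) -> List[float]:
    """Affine map y = W x + b, with W stored as one row per output dimension."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def action_slot_token(slot: Sequence[float],
                      action: Sequence[float],
                      weight: Sequence[Sequence[float]],
                      bias: Sequence[float]) -> List[float]:
    """Compute [S_t] = [S] + Linear(a_t) for a single time step."""
    projected = linear(action, weight, bias)
    if len(projected) != len(slot):
        raise ValueError("projection must match the slot embedding size")
    return [s + p for s, p in zip(slot, projected)]


# Hypothetical sizes: 2-D action, 3-D embedding.
slot_t = action_slot_token(
    slot=[0.5, -0.5, 0.0],
    action=[1.0, 2.0],
    weight=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    bias=[0.0, 0.0, 0.5],
)
```

Because the same slot embedding [S] is reused at every step, only the action projection varies over time.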

Training objective uses cross-entropy loss:

L_transformer = -∑_{t=T₀+1}^{T} log p(Xₜ | Xₜ₋₁)
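As a worked illustration of the cross-entropy objective, the sketch below sums the negative log-probability of each target token under the model's predicted distribution. It assumes the model emits one vector of unnormalized logits per prediction step. The function names and toy values are hypothetical, and only steps after the context frames would be passed in.

```python
import math
from typing import Sequence


def log_softmax_at(logits: Sequence[float], index: int) -> float:
    """Numerically stable log-probability of `index` under softmax(logits)."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return logits[index] - log_norm


def next_token_nll(logits_per_step: Sequence[Sequence[float]],
                   targets: Sequence[int]) -> float:
    """Cross-entropy summed over prediction steps: -sum_t log p(target_t)."""
    if len(logits_per_step) != len(targets):
        raise ValueError("need exactly one target per prediction step")
    return -sum(log_softmax_at(logits, target)
                for logits, target in zip(logits_per_step, targets))


# Two steps over a toy 4-token vocabulary with uniform predictions.
loss = next_token_nll([[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]], [1, 3])
```

With uniform predictions over four tokens, each step contributes log 4 to the loss, so the loss only drops below that when the model puts more mass on the correct token.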

Technical Innovations

Unified Multi-Modal Representation: First to systematically encode RGB, depth, and masks together, avoiding information loss between modalities
Computational Efficiency Optimization: Dynamic encoder focuses only on changing regions, significantly reducing token count
Pre-trained Model Reuse: Design compatible with existing VideoGPT pre-trained weights, accelerating convergence

Experimental Setup

Datasets

BAIR Robot Pushing Dataset: 43K training videos, 256 test videos, 64×64 resolution
RoboNet Dataset: 19K training video subset, 256 test videos
Self-Collected Dataset: 1K training videos, 150 test videos, 256×256 high resolution
Meta-World Benchmark: 6 robotic manipulation tasks for reinforcement learning evaluation

Evaluation Metrics

Visual Quality: FVD, PSNR, SSIM, LPIPS
Depth Accuracy: AbsRel (Absolute Relative Error)
Manipulation Performance: Task success rate
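Two of the listed metrics have simple closed forms: AbsRel is the mean of |d_pred - d_gt| / d_gt, and PSNR is 10·log₁₀(MAX² / MSE). The sketch below computes both on flat lists of values. Skipping pixels with non-positive ground-truth depth and defaulting the peak value to 1.0 are assumptions, not details taken from the paper.

```python
import math
from typing import Sequence


def abs_rel(pred_depth: Sequence[float], gt_depth: Sequence[float]) -> float:
    """Absolute relative depth error, averaged over valid pixels."""
    # Assumption: pixels with non-positive ground-truth depth are invalid.
    pairs = [(p, g) for p, g in zip(pred_depth, gt_depth) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth depth values")
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def psnr(pred: Sequence[float], target: Sequence[float],
         max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for values in [0, max_val]."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_val ** 2 / mse)
```

Lower AbsRel and higher PSNR are better. FVD, SSIM, and LPIPS need reference implementations with pretrained networks or windowed statistics, so they are not sketched here.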

Baseline Methods

MaskViT, SVG, GHVAE (video prediction baselines)
iVideoGPT (strongest RGB baseline)
GWM (3D Gaussian method)

Implementation Details

Video Depth Anything used for depth map generation
Grounding DINO + SAM2 for robot arm mask extraction
Pre-trained weights for transformer initialization
4 rollouts for fair comparison

Experimental Results

Main Results

Video Generation Performance

On BAIR dataset:

FVD: 60.9 (vs iVideoGPT 65.01)
PSNR: 23.82 (vs iVideoGPT 23.40)
SSIM: 0.896 (vs iVideoGPT 0.882)
LPIPS: 0.051 (vs iVideoGPT 0.058)
AbsRel: 0.045 (vs iVideoGPT 0.059)

The model also outperforms the baselines on the RoboNet dataset and reaches a PSNR of 38.33 on high-resolution real data.
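To put the BAIR numbers above on a common scale, the relative gap to iVideoGPT can be computed per metric. This is a small arithmetic sketch using only the values reported above; the `relative_gain` helper and the direction flags are illustrative, following the usual convention that lower is better for FVD, LPIPS, and AbsRel.

```python
# metric: (iMoWM, iVideoGPT, higher_is_better)
BAIR = {
    "FVD": (60.9, 65.01, False),
    "PSNR": (23.82, 23.40, True),
    "SSIM": (0.896, 0.882, True),
    "LPIPS": (0.051, 0.058, False),
    "AbsRel": (0.045, 0.059, False),
}


def relative_gain(ours: float, baseline: float, higher_is_better: bool) -> float:
    """Improvement over the baseline, as a percentage of the baseline value."""
    delta = ours - baseline if higher_is_better else baseline - ours
    return 100.0 * delta / baseline


gains = {name: relative_gain(*values) for name, values in BAIR.items()}

for name, gain in gains.items():
    print(f"{name}: {gain:.2f}% better than iVideoGPT")
```

The relative gaps are largest on the error-style metrics (LPIPS, AbsRel) and smallest on PSNR and SSIM.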

Reinforcement Learning Performance

Outperforms iVideoGPT and GWM on all 6 Meta-World tasks, with faster convergence and higher final success rates. Geometry-aware rollouts significantly improve RL performance.

Real-World Deployment

On GALAXEA A1 robot for cup stacking and bread picking tasks:

Overall success rate: 29/35 (vs iVideoGPT 13/35, GT 27/35)
On par with the ground-truth (GT) reference (29/35 vs 27/35), supporting the fidelity of the multi-modal rollouts
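The success counts above convert to percentages as follows. This is a small arithmetic sketch over the reported counts only; the dictionary layout is illustrative.

```python
# (successes, trials) as reported for the real-world experiments.
RESULTS = {
    "iMoWM": (29, 35),
    "iVideoGPT": (13, 35),
    "GT": (27, 35),
}

success_rates = {name: 100.0 * wins / trials
                 for name, (wins, trials) in RESULTS.items()}

for name, rate in success_rates.items():
    print(f"{name}: {rate:.1f}%")
```

With 35 trials per setting, a difference of two successes (iMoWM vs GT) is about 5.7 percentage points, so small gaps should be read with caution.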

Ablation Studies

MMTokenizer Effect: Compared to original tokenizer, inference time reduced from 860s to 10s while improving all visual metrics
Modality Contribution Analysis:
- RGB+Depth+Mask (complete method): FVD 67.6
- RGB only: FVD 70.2
- RGB+Mask: FVD 70.6
- RGB+Depth: FVD 67.5

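By FVD (lower is better), depth accounts for nearly all of the gain: adding depth to RGB lowers FVD from 70.2 to 67.5, whereas adding the mask changes FVD only marginally (70.6 with RGB alone, 67.6 with RGB+Depth).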
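The ablation numbers can be compared directly by computing FVD differences (lower FVD is better) and the tokenizer speedup. The sketch below uses only the values reported above; the variable names are illustrative.

```python
# FVD per modality combination, as reported in the ablation.
FVD = {
    "RGB": 70.2,
    "RGB+Mask": 70.6,
    "RGB+Depth": 67.5,
    "RGB+Depth+Mask": 67.6,
}

# Positive values mean the added modality lowered (improved) FVD.
depth_gain = FVD["RGB"] - FVD["RGB+Depth"]
mask_gain = FVD["RGB"] - FVD["RGB+Mask"]
mask_gain_given_depth = FVD["RGB+Depth"] - FVD["RGB+Depth+Mask"]

# Inference time with the original tokenizer vs. MMTokenizer, in seconds.
speedup = 860 / 10

print(f"depth: {depth_gain:+.1f}, mask: {mask_gain:+.1f}, "
      f"mask given depth: {mask_gain_given_depth:+.1f}, speedup: {speedup:.0f}x")
```

FVD is only one of the reported visual metrics, so these differences describe the ablation along that single axis.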

Experimental Findings

Resolution Sensitivity: High-resolution inputs significantly improve performance by providing more precise depth and mask information
Importance of Geometric Information: Depth maps provide richer geometric constraints than masks
Computational Efficiency: MMTokenizer substantially improves inference speed while maintaining performance

Related Work

Learning World Models

Early methods inspired by VideoGPT perform RGB video tokenization, while recent diffusion models advance latent space prediction. GWM employs 3DGS but is limited by monocular scene quality.

4D Video Prediction

Methods like TesserAct explore RGB-D generation but lack explicit action conditioning, limiting robotic applications.

Robotic Manipulation World Models

Primarily used for data augmentation and RL simulation, but generally lack 3D information, limiting effectiveness as simulators and data generators.

Conclusions and Discussion

Main Conclusions

Multi-modal world models significantly outperform pure RGB methods
MMTokenizer achieves good balance between efficiency and performance
Geometric information is crucial for robotic manipulation tasks
The method performs well in both simulation and real environments

Limitations

Pre-training Dependency: Still requires large-scale pre-training to fully leverage multi-modal world model generalization
Computational Resources: While more efficient than 3DGS methods, still more computationally intensive than pure RGB approaches
Depth Quality Dependency: Performance is affected by depth estimation quality

Future Directions

Explore larger-scale multi-modal pre-training
Investigate more efficient 3D representation methods
Extend to more robotic platforms and task types

In-Depth Evaluation

Strengths

Strong Innovation: First systematic introduction of multi-modal information into world models with novel technical approach
Engineering Completeness: Forms complete closed-loop from theoretical design to practical deployment
Comprehensive Experiments: Covers simulation, benchmark testing, and real robot validation
Significant Performance Gains: Achieves notable improvements across multiple metrics

Weaknesses

Insufficient Theoretical Analysis: Lacks in-depth theoretical explanation for why multi-modal information improves performance
Limited Generalization Verification: Primarily validated on specific robot platforms; cross-platform generalization requires further verification
Incomplete Computational Analysis: While efficiency improvements are mentioned, detailed computational complexity analysis is lacking

Impact

Academic Value: Provides new multi-modal direction for world model research
Practical Value: Directly applicable to real robotic systems with strong practicality
Reproducibility: Provides detailed implementation details and open-source commitments

Applicable Scenarios

Robotic manipulation tasks requiring precise geometric understanding
Data-scarce robotic learning scenarios
Reinforcement learning applications requiring high-fidelity simulation

References

This paper cites 63 relevant references covering important works in world models, video prediction, robotic learning, and other domains, providing solid theoretical foundation for the research.

Overall Assessment: This is a high-quality robotics learning paper making important contributions to multi-modal world models. The technical innovations are clear, experimental validation is comprehensive, and it possesses strong academic and practical value.