iMoWM: Taming Interactive Multi-Modal World Model for Robotic Manipulation
Zhang, Wu, Lu et al.
Learned world models hold significant potential for robotic manipulation, as they can serve as simulators for real-world interactions. While extensive progress has been made in 2D video-based world models, these approaches often lack geometric and spatial reasoning, which is essential for capturing the physical structure of the 3D world. To address this limitation, we introduce iMoWM, a novel interactive world model designed to generate color images, depth maps, and robot arm masks in an autoregressive manner conditioned on actions. To overcome the high computational cost associated with three-dimensional information, we propose MMTokenizer, which unifies multi-modal inputs into a compact token representation. This design enables iMoWM to leverage large-scale pretrained VideoGPT models while maintaining high efficiency and incorporating richer physical information. With its multi-modal representation, iMoWM not only improves the visual quality of future predictions but also serves as an effective simulator for model-based reinforcement learning (MBRL) and facilitates real-world imitation learning. Extensive experiments demonstrate the superiority of iMoWM across these tasks, showcasing the advantages of multi-modal world modeling for robotic manipulation. Homepage: https://xingyoujun.github.io/imowm/
Learned world models hold great potential for robotic manipulation because they can serve as simulators for real-world interactions. While significant progress has been made in 2D video-based world models, these methods often lack the geometric and spatial reasoning needed to capture the physical structure of the 3D world. To address this limitation, the authors propose iMoWM, an interactive world model that autoregressively generates RGB images, depth maps, and robot arm masks conditioned on actions. To reduce the high computational cost of handling 3D information, the authors introduce MMTokenizer, which unifies multi-modal inputs into a compact token representation. This design lets iMoWM build on large-scale pre-trained VideoGPT models while staying efficient and incorporating richer physical information.
Robotic manipulation tasks require accurate prediction of physical dynamics in 3D environments, but existing world models suffer from the following issues:
Lack of Geometric Understanding: Most methods rely solely on RGB video prediction, lacking explicit representations of 3D spatial structure
High Computational Cost: Direct processing of 3D information (e.g., 3D Gaussians) incurs substantial computational overhead
Limited Generalization: Absence of action-conditional constraints makes it difficult to adapt to diverse robotic manipulation scenarios
Robotic manipulation takes place in three-dimensional space, so relying solely on RGB information is error-prone under visual variations and complex object interactions. Existing 3D methods such as GWM represent scenes with 3D Gaussians, but they depend on high-quality 3D Gaussian Splatting (3DGS) reconstruction, which is hard to obtain in monocular settings and difficult to scale.
Proposes iMoWM Framework: The first interactive multi-modal world model capable of simultaneously predicting RGB images, depth maps, and robot arm masks
Designs MMTokenizer: An innovative multi-modal tokenizer that unifies heterogeneous inputs into compact token representations, significantly reducing computational costs
Enables Multi-Task Applications: Supports action-conditioned video generation, model-based reinforcement learning (MBRL), and real-world imitation learning
Demonstrates Superior Performance: Achieves state-of-the-art results on public benchmarks and real-world experiments
Given an initial observation $O_1$ (containing an RGB image, a depth map, and a robot arm mask) and an action sequence $\{a_t\}_{t=1}^{T}$, iMoWM must predict the future multi-modal observation sequence $\{O_t\}_{t=2}^{T}$.
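The problem statement above can be read as an autoregressive rollout loop. The following is a minimal sketch of that interface, not the authors' implementation: the `Observation` container, the `StepFn` signature, and the `toy_step` dynamics are hypothetical names introduced for illustration, and the indexing convention (action a_t drives the transition from O_t to O_{t+1}, so the last action is not consumed) is an assumption the summary does not confirm.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Observation:
    """One multi-modal observation O_t (placeholder containers, not real images)."""
    rgb: List[float]
    depth: List[float]
    arm_mask: List[int]


# Hypothetical one-step predictor: (history O_1..O_t, action a_t) -> O_{t+1}.
StepFn = Callable[[List[Observation], Sequence[float]], Observation]


def rollout(step_fn: StepFn, first_obs: Observation,
            actions: Sequence[Sequence[float]]) -> List[Observation]:
    """Autoregressively predict O_2..O_T from O_1 and actions a_1..a_T.

    Assumption: a_t drives the transition O_t -> O_{t+1}, so the final
    action a_T has no frame left to predict and is not consumed here.
    """
    history = [first_obs]
    for action in actions[:-1]:
        # Each prediction is appended and becomes context for the next step.
        history.append(step_fn(history, action))
    return history[1:]


def toy_step(history: List[Observation], action: Sequence[float]) -> Observation:
    """Toy dynamics (illustrative only): shift every depth value by action[0]."""
    last = history[-1]
    return Observation(
        rgb=list(last.rgb),
        depth=[d + action[0] for d in last.depth],
        arm_mask=list(last.arm_mask),
    )
```

Calling `rollout(toy_step, first_obs, actions)` with T actions returns T-1 predicted observations. In the actual model the step function would be the token-level transformer plus the tokenizer's decoder; the loop structure is the only part this sketch is meant to convey.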
MMTokenizer is the core innovation, employing a dual encoder-decoder framework $\{(E_c, D_c), (E_d, D_d)\}$:
Context Encoding: The context encoder processes the initial frames
$Z^c_t = E_c(O_t), \quad \hat{O}_t = D_c(Z^c_t), \quad t = 1, \dots, T_0$
Dynamic Encoding: The conditional encoder focuses on dynamic regions, conditioned on the context frames
$Z^d_t = E_d(O_t \mid O_{1:T_0}), \quad \hat{O}_t = D_d(Z^d_t \mid O_{1:T_0}), \quad t = T_0+1, \dots, T$
Modality Adaptation: The first and last layers are replicated per modality to handle differences in feature distributions across modalities, and modality-specific embeddings are introduced
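To make the split between context and dynamic encoding concrete, the sketch below dispatches the first T0 frames to a context encoder and the remaining frames to an encoder conditioned on that context. It is an illustration under stated assumptions rather than MMTokenizer itself: `tokenize_sequence` and both toy encoders are hypothetical, and the idea that a dynamic frame yields fewer tokens than a context frame is an assumption suggested by the phrase "compact token representation", not a figure from the paper.

```python
from typing import Callable, List, Sequence

Frame = List[float]   # stand-in for one multi-modal observation O_t
Tokens = List[int]    # stand-in for a frame's discrete token ids


def tokenize_sequence(
    frames: Sequence[Frame],
    num_context: int,
    encode_context: Callable[[Frame], Tokens],
    encode_dynamic: Callable[[Frame, List[Frame]], Tokens],
) -> List[Tokens]:
    """Tokenize frames 1..T0 with E_c and frames T0+1..T with E_d given the context."""
    context = [list(f) for f in frames[:num_context]]
    tokens = [encode_context(f) for f in context]
    # Later frames are encoded relative to the context frames O_{1:T0}.
    tokens += [encode_dynamic(list(f), context) for f in frames[num_context:]]
    return tokens


def toy_context_encoder(frame: Frame) -> Tokens:
    """Toy E_c (illustrative only): one token per value in the frame."""
    return [int(round(v * 10)) for v in frame]


def toy_dynamic_encoder(frame: Frame, context: List[Frame]) -> Tokens:
    """Toy E_d (illustrative only): a single token marking where the frame
    differs most from the last context frame."""
    last = context[-1]
    changed = max(range(len(frame)), key=lambda i: abs(frame[i] - last[i]))
    return [changed]
```

With two static context frames of four values each and one later frame that changes at index 1, this yields token lists of length 4, 4, and 1. The point of the toy is only that conditioning on the context lets later frames be described by what changed, which is how a dual tokenizer can keep the sequence fed to the autoregressive model short.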
In model-based reinforcement learning, iMoWM outperforms iVideoGPT and GWM on all 6 Meta-World tasks, with faster convergence and higher final success rates. Geometry-aware rollouts significantly improve RL performance.
Early methods inspired by VideoGPT tokenize RGB video, while recent diffusion models advance prediction in latent space. GWM employs 3DGS but is limited by reconstruction quality in monocular scenes.
This paper cites 63 relevant references covering important works in world models, video prediction, robotic learning, and other domains, providing solid theoretical foundation for the research.
Overall Assessment: This is a high-quality robotics learning paper making important contributions to multi-modal world models. The technical innovations are clear, experimental validation is comprehensive, and it possesses strong academic and practical value.