We introduce 3D4D, an interactive 4D visualization framework that integrates WebGL with Supersplat rendering. It transforms static images and text into coherent 4D scenes through four core modules and employs a foveated rendering strategy for efficient, real-time multi-modal interaction. This framework enables adaptive, user-driven exploration of complex 4D environments. The project page and code are available at https://yunhonghe1021.github.io/NOVA/.
Paper ID : 2511.08536
Title : 3D4D: An Interactive, Editable, 4D World Model via 3D Video Generation
Authors : Yunhong He (Lehigh University), Zhengqing Yuan (University of Notre Dame), Zhengzhong Tu (Texas A&M University), Yanfang Ye (University of Notre Dame), Lichao Sun (Lehigh University)
Category : cs.CV (Computer Vision)
Publication Date : November 11, 2025 (arXiv v1)
Paper Link : https://arxiv.org/abs/2511.08536
Project Homepage : https://yunhonghe1021.github.io/NOVA/
Existing 4D content generation and visualization systems face three core challenges:
- Insufficient Real-time Rendering Capability : Traditional WebGL frameworks struggle with real-time 4D rendering and fine-grained temporal navigation
- High Computational Cost : High compute cost, latency, and scalability issues limit practical applications
- Lack of Interactivity : Existing systems lack truly interactive 4D environments and cannot seamlessly combine high-performance rendering with user interaction

Advances in generative models and multimodal learning have made text-driven and multimodal interactive generation more intuitive. However, the lack of efficient 4D visualization and interaction frameworks severely limits the practical value of 4D content. True 4D interactive environments are crucial for virtual reality, digital twins, and film production.

- WonderJourney, LucidDreamer, etc. : Primarily focus on 3D scene generation and do not handle the dynamic temporal dimension
- SV4D, 4D-fy and other 4D generation methods : Can generate 4D content, but do not support real-time interaction and run at lower frame rates (16-40 fps)
- Traditional WebGL frameworks : Do not support fine-grained temporal interaction or efficient 4D scene editing

The research goal is to develop a 4D visualization framework that simultaneously satisfies high-performance rendering, real-time interaction, and user editing requirements, so that users can explore and manipulate complex 4D environments in a natural manner.
- Proposed 3D4D Framework : The first interactive 4D visualization system integrating WebGL and Supersplat rendering, supporting end-to-end generation from static images and text to 4D scenes
- Foveated Rendering Strategy : Inspired by human peripheral vision, a VLM-guided adaptive rendering strategy reduces GPU memory usage and latency while maintaining semantic alignment and visual consistency
- Real-time Interaction Capability : Renders at 60 fps, making it the first 4D scene generation system to support truly real-time interaction
- Complete Editing Toolset : Provides rectangular, brush, polygon, lasso, and sphere selection tools for precise object and region operations
- Superior Performance : Achieves the highest CLIP Consistency (30.40) and CLIP Score (0.9951) among the compared methods

Input :
- Single static panoramic or ordinary image
- Natural language text description (prompts for dynamic changes in the scene)

Output :
- Interactive 4D scene (3D space + temporal dimension)
- Visualization environment supporting real-time rendering, editing, and navigation

Constraints :
- Maintain temporal coherence and visual consistency
- Meet real-time interaction requirements (≥60 fps)
- Operate within limited computational resources

The 3D4D system comprises a backend generation pipeline and a frontend rendering system :
3D Scene Reconstruction Module :
- Converts input static images into 3D architectural models
- Extracts the geometric structure and spatial information of scenes

Image-to-Video Synthesis Module :
- Generates temporally coherent video sequences based on text prompts
- Ensures generated videos conform to user-specified dynamic changes

Video-to-Frame Decomposition Module :
- Decomposes generated videos into continuous frame sequences
- Extracts the necessary visual information for each frame

4D Scene Generation Module :
- Fuses continuous frames and 3D architectural models
- Generates a complete 4D scene representation (multiple PLY point cloud files)

Core Technology Stack :
- WebGL : Provides the underlying graphics rendering capability
- Supersplat : High-performance 3D Gaussian point cloud rendering engine

Key Functionalities :

Real-time 4D Visualization :
- Streams multiple PLY point cloud files to the frontend
- Sequential rendering or looping playback forms a continuous 4D video
- Supports dynamic adjustment of camera pose, playback speed, and frame rate

Interactive Timeline :
- Fine-grained temporal navigation control
- Users can balance visual quality against efficiency

Scene Editing Tools :
- Rectangular selection, brush, polygon, lasso, sphere selection
- Precise object and region operations
- All interactions synchronized with the backend via API

The foveated rendering strategy is the paper's central technical innovation, inspired by the foveal characteristics of the human visual system:
Workflow :
Input PLY Point Cloud → VLM Analysis → Generate Importance Map → Adaptive Resource Allocation → Rendering Output
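The allocation step of this workflow can be illustrated with a minimal sketch. It is not the paper's implementation: the importance map is a hard-coded grid standing in for VLM output, and the names `Point`, `allocate_detail`, and the `threshold` parameter are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Point:
    """A projected point in normalized screen coordinates, each in [0, 1]."""
    x: float
    y: float


def allocate_detail(
    points: Sequence[Point],
    importance: Sequence[Sequence[float]],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> List[str]:
    """Label each point 'full' (foveal) or 'reduced' (peripheral).

    `importance` is a row-major grid of scores in [0, 1]. In the described
    system such a map would come from a VLM; here it is supplied by hand.
    """
    rows = len(importance)
    cols = len(importance[0])
    levels = []
    for p in points:
        # Map the normalized position to a grid cell, clamping the upper edge.
        r = min(int(p.y * rows), rows - 1)
        c = min(int(p.x * cols), cols - 1)
        levels.append("full" if importance[r][c] >= threshold else "reduced")
    return levels


# Hand-written 2x2 importance map: top-left and bottom-right are "important".
importance_map = [
    [0.9, 0.1],
    [0.2, 0.8],
]
sample_points = [Point(0.1, 0.1), Point(0.9, 0.1), Point(0.9, 0.9)]
detail_levels = allocate_detail(sample_points, importance_map)
```

A renderer could then spend full-precision shading only on the points labeled `full` and use a cheaper path for the rest, which is the trade-off the workflow describes.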
Specific Implementation :
- VLM Analysis : Uses a vision-language model such as Qwen2.5-VL to analyze each frame
- Importance Map Generation : Identifies semantically critical regions (e.g., people, moving objects)
- Adaptive Rendering :
  - Foveal region (important areas): Full-precision rendering
  - Peripheral region (background): Blur, low-cost shading
- Resource Optimization : WebGL shaders dynamically allocate GPU resources

Advantages :
- Reduces GPU load without compromising perceptual quality
- Maintains semantic alignment and visual consistency
- Achieves real-time performance (60 fps)

Video Rendering Functionality :
- Users upload PLY scenes and define keyframes
- The system automatically interpolates camera trajectories
- The VLM analyzes frames in real time and generates importance maps
- Frame buffer capture, temporal smoothing, real-time encoding
- Output as .webm or .mp4 video

Technical Characteristics :
- Entirely client-side processing, no server computation required
- Semantically-aware real-time 4D video generation
- Balances visual fidelity and computational efficiency

Since standard WebGL does not support fine-grained temporal interaction, the team developed multiple custom features:
- Precise temporal dimension control
- Seamless switching between multiple point cloud files
- Efficient memory management mechanisms

| Feature | Traditional Methods | 3D4D |
| --- | --- | --- |
| Rendering Strategy | Uniform rendering | Semantically-aware foveated rendering |
| Interactivity | Offline or limited interaction | Fully real-time interaction |
| Frame Rate | 16-40 fps | 60 fps |
| Editing Capability | Unsupported or limited | Complete editing toolset |
| Resource Efficiency | High GPU load | Adaptive resource allocation |
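The custom temporal features (temporal control, switching between point cloud files, memory management) are not specified in detail. One plausible shape for them is sketched below: playback time is mapped to a PLY frame index, and a small least-recently-used cache bounds how many decoded frames stay in memory. The names `frame_index` and `FrameCache`, the `loader` callback, and the LRU policy itself are assumptions for illustration, not the paper's mechanism.

```python
from collections import OrderedDict
from typing import Any, Callable


def frame_index(t_seconds: float, fps: float, n_frames: int, loop: bool = True) -> int:
    """Map a playback time to the index of the PLY frame to display."""
    idx = max(0, int(t_seconds * fps))
    if loop:
        return idx % n_frames       # wrap around for looping playback
    return min(idx, n_frames - 1)   # hold the last frame otherwise


class FrameCache:
    """Keep at most `capacity` loaded frames, evicting the least recently used."""

    def __init__(self, capacity: int, loader: Callable[[int], Any]) -> None:
        self.capacity = capacity
        self.loader = loader        # hypothetical: fetches and parses one PLY file
        self._frames: "OrderedDict[int, Any]" = OrderedDict()

    def get(self, idx: int) -> Any:
        if idx in self._frames:
            self._frames.move_to_end(idx)     # mark as most recently used
            return self._frames[idx]
        frame = self.loader(idx)
        self._frames[idx] = frame
        if len(self._frames) > self.capacity:
            self._frames.popitem(last=False)  # drop the least recently used
        return frame


# Usage: record which frames actually hit the (stand-in) loader.
loaded = []

def fake_loader(idx: int) -> str:
    loaded.append(idx)
    return f"frame-{idx}.ply"

cache = FrameCache(capacity=2, loader=fake_loader)
for i in (0, 1, 0, 2, 1):
    cache.get(i)
```

With a capacity of two, requesting frame 2 evicts frame 1 (frame 0 was used more recently), so the final request for frame 1 triggers a reload. A real viewer would likely also prefetch upcoming frames, which this sketch omits.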
The paper does not describe its training datasets in detail, but the evaluation setup indicates the following:
- Uses panoramic images as input
- Inputs are accompanied by natural language prompts for scene generation
- Evaluation involves multi-view consistency checks

CLIP Score (CS) :
- Definition: CLIP similarity between text scene prompts and rendered images
- Significance: Evaluates semantic alignment quality; higher values indicate better alignment with text descriptions

CLIP Consistency (CC) :
- Definition: Cosine similarity of CLIP embeddings between each novel view image and the center reference view
- Significance: Evaluates visual consistency across different viewpoints; higher values indicate better multi-view consistency

FPS (Frames Per Second) :
- Measures rendering speed
- Key indicator for real-time interaction

Real-time Interaction :
- Binary indicator: Whether real-time interaction is supported
- Criterion: Immediate responsiveness to user operations

The paper compares against the following methods:
3D Scene Generation Methods :
- WonderJourney (Yu et al. 2024)
- LucidDreamer
- Text2Room (Höllein et al. 2023)
- WonderWorld

4D Content Generation Methods :
- SV4D (Xie et al. 2024)
- 4D-fy (Bahmani et al. 2024)

- Frontend developed based on WebGL and Supersplat
- VLM uses Qwen2.5-VL
- Point cloud format: PLY
- Video encoding: .webm or .mp4
- Rendering target: 60 fps real-time performance

| Model | CLIP Consistency (CC) | CLIP Score (CS) |
| --- | --- | --- |
| WonderJourney | 27.34 | 0.9544 |
| LucidDreamer | 26.72 | 0.8972 |
| Text2Room | 24.50 | 0.9035 |
| WonderWorld | 29.47 | 0.9948 |
| SV4D | 30.29 | 0.8856 |
| 4D-fy | 11.23 | 0.6147 |
| 3D4D (Ours) | 30.40 | 0.9951 |
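As a reading aid for the CLIP Consistency definition (cosine similarity between each novel-view embedding and the center reference view), the sketch below computes that quantity on plain Python lists. It assumes the embeddings have already been extracted by a CLIP image encoder, and it assumes the per-view similarities are averaged. The function names are hypothetical, and any scaling applied to the reported numbers is not reproduced here.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def clip_consistency(
    center_embedding: Sequence[float],
    novel_view_embeddings: Sequence[Sequence[float]],
) -> float:
    """Mean cosine similarity of each novel view to the center view.

    Averaging over views is an assumption of this sketch.
    """
    sims = [cosine_similarity(center_embedding, v) for v in novel_view_embeddings]
    return sum(sims) / len(sims)


# Toy 2-D "embeddings": one view identical to the center, one orthogonal to it.
toy_score = clip_consistency([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

The toy example averages a similarity of 1 and a similarity of 0; real CLIP embeddings are high-dimensional and views of the same scene tend to score much closer to each other.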
Key Findings :
- 3D4D achieves 30.40 on the CC metric, slightly outperforming SV4D's 30.29
- 3D4D achieves 0.9951 on the CS metric, the highest among all methods
- 4D-fy shows the poorest performance, possibly due to methodological limitations
- 3D4D achieves the best balance of semantic alignment and visual consistency

| Model | FPS | Real-time Interaction |
| --- | --- | --- |
| SV4D | 40 | ✗ |
| 4D-fy | 16 | ✗ |
| 3D4D (Ours) | 60 | ✓ |
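The frame rates in the table can be converted into relative speedups and per-frame time budgets with simple arithmetic. The sketch below does this for the three reported values; the helper names are hypothetical and the only inputs are the FPS figures from the table.

```python
def relative_speedup_percent(new_fps: float, baseline_fps: float) -> float:
    """Percentage by which new_fps exceeds baseline_fps."""
    if baseline_fps <= 0:
        raise ValueError("baseline_fps must be positive")
    return (new_fps - baseline_fps) / baseline_fps * 100.0


def frame_budget_ms(fps: float) -> float:
    """Time available to produce one frame, in milliseconds."""
    return 1000.0 / fps


# Frame rates as reported in the comparison table.
REPORTED_FPS = {"SV4D": 40, "4D-fy": 16, "3D4D": 60}

speedups = {
    name: relative_speedup_percent(REPORTED_FPS["3D4D"], fps)
    for name, fps in REPORTED_FPS.items()
    if name != "3D4D"
}
# speedups == {"SV4D": 50.0, "4D-fy": 275.0}
```

At 60 fps the per-frame budget is about 16.7 ms, compared with 25 ms at 40 fps and 62.5 ms at 16 fps, which gives a sense of how little time the renderer has per frame.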
Key Findings :
- 3D4D achieves 60 fps, 50% faster than SV4D (40 fps) and 275% faster than 4D-fy (16 fps)
- 3D4D is the only compared method that supports truly real-time interaction
- The frame rate advantage translates directly into a smoother user experience

The paper provides examples (Figure 2) demonstrating:
- Input : Single panoramic photograph + natural language prompt
- Evaluation Dimensions : Controllability, Quality, Dynamics
- Multi-view Consistency : Shows scene consistency when viewed from different angles

The examples also demonstrate the effectiveness of the adaptive rendering strategy:
- Semantically important regions are rendered at high resolution
- Peripheral regions use color approximation and background processing
- Quality loss is visually imperceptible while computational cost drops significantly

- Effectiveness of Semantically-Aware Rendering : The VLM-guided foveated rendering strategy significantly improves performance while maintaining visual quality
- Importance of Real-time Interaction : 60 fps and real-time interaction capability are key differentiators in user experience
- Advantages of Multimodal Integration : Combining text, image, and 4D rendering enables better understanding and generation of complex scenes
- Scalability : The system runs client-side, with good scalability and deployment convenience

- Text-to-image generation: Stable Diffusion (Rombach et al. 2022)
- Visual instruction tuning: LLaVA (Liu et al. 2023)
- Multimodal large language models: TinyGPT-V (Yuan et al. 2023)
- Video generation: MORA (Yuan et al. 2024a), BORA (Sun et al. 2024)

- Text2Room (Höllein et al. 2023): Extracts textured 3D meshes from 2D text-to-image models
- WonderJourney (Yu et al. 2024): 3D scene exploration
- LucidDreamer: 3D scene reconstruction

- Text2-4D (Singer et al. 2023): Text-to-4D dynamic scene generation
- SV4D (Xie et al. 2024): Dynamic 3D content with multi-frame and multi-view consistency
- 4D-fy (Bahmani et al. 2024): Text-to-4D generation using hybrid score distillation sampling
- SC4D (Wu et al. 2024): Sparse-controlled video-to-4D generation
- 4K4D (Xu et al. 2024): 4K resolution real-time 4D view synthesis

- Supersplat: Browser-based 3D Gaussian point cloud editing tool

- First Truly Interactive 4D System : Existing methods lack either 4D support or real-time interaction
- End-to-End Solution : Complete pipeline from input to rendering
- Semantically-Aware Optimization : Intelligent resource allocation using a VLM
- Strong Practicality : Based on Web technology, easy to deploy and use

- Technical Feasibility : Demonstrates the feasibility of high-performance 4D interactive visualization in browser environments
- Performance Superiority : Surpasses the compared methods in semantic alignment, visual consistency, and rendering speed
- Enhanced User Experience : 60 fps and real-time interaction capability significantly improve the 4D content exploration experience
- Resource Efficiency : The foveated rendering strategy effectively balances visual quality and computational cost

Insufficient Experimental Details :
- Training datasets and data scale not clearly specified
- Lacks detailed ablation studies validating component contributions
- Absence of user study data

Simplified Method Description :
- Implementation details of backend modules are insufficient
- Technical details of VLM importance map generation are missing
- Lacks algorithm pseudocode and mathematical formulas

Limited Evaluation Scope :
- Only uses CLIP-related metrics, lacking diverse evaluation approaches
- Does not evaluate applicability across different scene types
- Lacks failure case analysis

Computational Resource Requirements :
- Client-side hardware requirements not clearly specified
- Performance on different devices unknown

Scene Complexity Limitations :
- Maximum scene complexity the system can handle not specified
- Performance under extreme conditions unknown

While not explicitly stated in the paper, the following research directions can be inferred:
- Higher Resolution Support : Extend to 8K or higher resolution 4D rendering
- More Complex Interactions : Support physics simulation, collision detection, and other advanced interactions
- Multi-user Collaboration : Enable multiple users to simultaneously edit and explore the same 4D scene
- Mobile Optimization : Adapt to mobile device performance and interaction paradigms
- AI-Assisted Editing : Leverage AI to automatically optimize scene layout and animation

- Foveated Rendering Strategy : Cleverly applies human visual system characteristics to computer graphics
- VLM-Guided Resource Allocation : First application of vision-language models to rendering optimization, opening new directions
- Real-time 4D Interaction : Achieves an important technical breakthrough

- Easy Deployment : Based on Web technology, no complex installation required
- User-Friendly : Intuitive interaction interface and editing tools
- Broad Applications : Applicable to virtual reality, digital twins, film production, and other domains
- Open-Source Friendly : Provides project homepage and code

- SOTA Performance : Achieves the best results on the CC and CS metrics
- High Frame Rate : 60 fps far exceeds competing methods
- Real-time Interaction : Only system supporting truly real-time interaction

- Provides a complete pipeline from input to output
- Integrates generation, rendering, and editing functionalities
- Well-coordinated frontend and backend design

- Missing Experimental Details : Training data, hyperparameters, and implementation details are insufficient
- Missing Ablation Studies : Does not independently validate component contributions
- Missing User Studies : Lacks real user experience evaluation

- Backend modules described too briefly
- Lacks algorithm pseudocode and mathematical formulas
- VLM importance map generation mechanism unclear

- Single evaluation metric family (only CLIP-related)
- Lacks testing across diverse scenarios
- No failure case analysis
- Limited comparison with additional baselines

- Hardware requirements unclear
- Scalability boundaries unknown
- Performance under extreme conditions unevaluated

- Pioneering Work : First truly real-time interactive 4D visualization system
- Methodological Inspiration : The foveated rendering strategy is applicable to other graphics tasks
- Technology Integration : Demonstrates effective integration of WebGL, Gaussian point clouds, and a VLM

- Immediately Usable : Provides online demonstrations and code
- Commercial Potential : Directly applicable to multiple commercial scenarios
- Educational Value : Provides user-friendly tools for 4D content creation

- Strengths : Provides project homepage and code commitment
- Weaknesses : Insufficient paper details may affect reproducibility
- Dependencies : Requires specific tools like Supersplat

- Virtual Reality : Creating interactive VR environments
- Digital Twins : Real-time visualization and editing of digital twin scenes
- Film Production : Quick preview and editing of 4D scenes
- Architectural Visualization : Demonstrating architectural changes over time
- Educational Training : Creating interactive teaching scenarios

- High-Precision Requirements : Such as precise measurements in scientific visualization
- Complex Physics Simulation : The system lacks an integrated physics engine
- Extremely Large-Scale Scenes : Performance boundaries unknown
- Low-End Devices : Requires a certain level of GPU performance

| Dimension | Score | Explanation |
| --- | --- | --- |
| Innovation | 8/10 | Foveated rendering and VLM-guided optimization are important innovations |
| Technical Depth | 6/10 | Complete system implementation but insufficient paper description |
| Experimental Sufficiency | 5/10 | Lacks ablation studies and user research |
| Practical Value | 9/10 | Highly practical, easy to deploy and use |
| Writing Quality | 6/10 | Clear structure but insufficient details |
| Overall | 7.5/10 | Excellent systems work, but paper completeness needs improvement |
- Rombach et al. (2022) : High-resolution image synthesis with latent diffusion models - Foundational work for Stable Diffusion
- Xie et al. (2024) : SV4D: Dynamic 3d content generation with multi-frame and multi-view consistency - Primary competing method
- Bahmani et al. (2024) : 4d-fy: Text-to-4d generation using hybrid score distillation sampling - Alternative 4D generation baseline
- Wang et al. (2024) : Qwen2-VL: Enhancing Vision-Language Model's Perception - VLM used in this work
- PlayCanvas and Contributors (2025) : SuperSplat Online Editor - Core rendering engine

Suitable Readers :
- Computer graphics researchers
- Virtual reality developers
- 4D content creators
- Web graphics technology engineers

Reading Focus :
- Design philosophy of the foveated rendering strategy
- Integration methods of WebGL and Gaussian point clouds
- Application of VLMs in graphics rendering
- Implementation techniques for real-time 4D interaction

Supplementary Reading Recommended :
- Supersplat technical documentation
- 3D Gaussian point cloud related papers
- WebGL performance optimization best practices