2025-11-11T15:34:09.265833

A 3D Generation Framework from Cross Modality to Parameterized Primitive

Liang, Yu, Wang et al.
Recent advancements in AI-driven 3D model generation have leveraged cross-modal inputs, yet generating models with smooth surfaces and minimizing storage overhead remain challenges. This paper introduces a novel multi-stage framework for generating 3D models composed of parameterized primitives, guided by textual and image inputs. Within the framework, a model generation algorithm based on parameterized primitives is proposed; it identifies the shape features of the model's constituent elements and replaces those elements with parameterized primitives that have high-quality surfaces. In addition, a corresponding model storage method is proposed that preserves the original surface quality of the model while retaining only the parameters of the parameterized primitives. Experiments on a virtual scene dataset and a real scene dataset demonstrate the effectiveness of the method, achieving a Chamfer Distance of 0.003092, a VIoU of 0.545, an F1-Score of 0.9139, and an NC of 0.8369, with primitive parameter files approximately 6KB in size. The approach is particularly suitable for rapid prototyping of simple models.

Basic Information

  • Paper ID: 2510.08656
  • Title: A 3D Generation Framework from Cross Modality to Parameterized Primitive
  • Authors: Yiming Liang, Huan Yu, Zili Wang, Shuyou Zhang, Guodong Yi, Jin Wang, Jianrong Tan (Zhejiang University)
  • Classification: cs.GR (Computer Graphics), cs.AI (Artificial Intelligence), cs.CV (Computer Vision)
  • Publication Date: October 9, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.08656

Abstract

This paper addresses the challenges of surface quality and storage overhead in AI-driven 3D model generation by proposing a multi-stage 3D generation framework based on parameterized primitives. The framework generates 3D models composed of parameterized primitives from text and image inputs, replacing original elements with high-quality surface parameterized primitives by identifying shape characteristics of model components. Experimental results demonstrate superior performance on both virtual and real-world scene datasets, achieving a Chamfer distance of 3.092×10⁻³, VIoU of 0.545, F1-Score of 0.9139, NC of 0.8369, with primitive parameter file sizes approximately 6KB.

Research Background and Motivation

Problem Definition

Traditional 3D model generation techniques face two core challenges:

  1. High Storage Requirements: Existing methods typically extract explicit mesh representations from implicit 3D representations using the Marching Cubes algorithm, which leads to very large storage demands. For example, a 256³ voxel grid holds over 16 million voxels and consumes up to 0.54GB of memory.
  2. Model Surface Quality: Constrained by resolution and topological structure limitations, low-resolution voxels (such as 32³) lead to loss of detail, while mesh-based methods relying on initial template deformation cannot flexibly handle complex topologies.
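The voxel figures in item 1 can be checked with a little arithmetic. The sketch below assumes 32 bytes of stored data per voxel and decimal gigabytes; both are assumptions chosen here because they reproduce the quoted 0.54GB, and the source does not state the per-voxel layout.

```python
# Back-of-the-envelope check of the dense voxel grid figures quoted above.
# BYTES_PER_VOXEL is an illustrative assumption, not a value from the paper.
RESOLUTION = 256
BYTES_PER_VOXEL = 32


def voxel_count(resolution: int) -> int:
    """Number of cells in a cubic grid with `resolution` samples per axis."""
    return resolution ** 3


def grid_size_gb(resolution: int, bytes_per_voxel: int) -> float:
    """Dense storage in decimal gigabytes (1 GB = 10**9 bytes)."""
    return voxel_count(resolution) * bytes_per_voxel / 1e9


if __name__ == "__main__":
    print(voxel_count(RESOLUTION))
    print(round(grid_size_gb(RESOLUTION, BYTES_PER_VOXEL), 2))
```

Because the cell count grows with the cube of the resolution, doubling the resolution multiplies dense storage by eight, which is the pressure a parameter-only representation is meant to relieve.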

Research Motivation

With rapid advances in AI generation technology and computer graphics, 3D model representation techniques have broad applications in virtual reality, medical image processing, industrial design and manufacturing, and game development. Traditional methods typically require substantial prior knowledge and assumptions, limiting their applicability in real-world scenarios. Therefore, there is an urgent need for a generation method that improves model surface quality while reducing storage requirements.

Core Contributions

  1. Proposed primitive fitting and matching algorithms: Capable of replacing superquadric elements constituting models with parameterized geometric primitives of higher surface quality, thereby improving overall 3D model quality.
  2. Proposed a 3D model storage method: Reduces model storage requirements by retaining only primitive element parameters, achieving three orders of magnitude reduction in storage space.
  3. Constructed a three-stage 3D model generation method based on multi-modal information: Takes text and image information as input to generate 3D models composed of parameterized primitives under zero-shot conditions.

Methodology Details

Task Definition

Input: Text description or single image
Output: 3D model composed of parameterized primitives
Constraints: Zero-shot generation, improved surface quality, reduced storage overhead

Model Architecture

The framework consists of three main stages:

Stage One: Multi-view Depth Image Synthesis and Superquadric Iterative Fitting

  1. Multi-view Depth Image Synthesis:
    • Uses pre-trained ImageDream model to generate multi-view images of target models
    • Guides neural radiance field optimization through Score Distillation Sampling (SDS) loss function
    • Employs NeRFStudio sampling method to sample depth images from 48 different viewpoints of the optimized implicit neural radiance field
  2. Superquadric Iterative Fitting:
    • Constructs truncated signed distance field (TSDF)
    • Defines decreasing signed distance threshold sequence: $T^c = \{t_1^c, t_2^c, \ldots, t_m^c, t_{m+1}^c\}$
    • Initial threshold setting: $t_1^c = \min_{x_i \in V} t(x_i)$, decay formula: $t_{m+1}^c = \alpha t_m^c$
    • Superquadric parameters: $\theta = (\varepsilon_1, \varepsilon_2, T, R, S)$
    • Implicit equation: $f(x) = \left((x/a)^{2/\varepsilon_2} + (y/b)^{2/\varepsilon_2}\right)^{\varepsilon_2/\varepsilon_1} + (z/c)^{2/\varepsilon_1} = 1$
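As a minimal sketch of the two formulas in the fitting step, the code below evaluates the left-hand side of the superquadric implicit equation and builds the threshold sequence from the decay rule. It assumes the size parameters are S = (a, b, c), that points are already expressed in the superquadric's local frame, and that 0 < alpha < 1; the absolute values are added here so that fractional powers stay real for negative coordinates. The function names are hypothetical and this is not the authors' implementation.

```python
def superquadric_f(x, y, z, a, b, c, eps1, eps2):
    """Left-hand side of the implicit equation in the local frame.

    Values below 1 are inside the surface, 1 is on it, above 1 is outside.
    abs() is an addition for this sketch so negative coordinates are valid.
    """
    xy = abs(x / a) ** (2.0 / eps2) + abs(y / b) ** (2.0 / eps2)
    return xy ** (eps2 / eps1) + abs(z / c) ** (2.0 / eps1)


def threshold_sequence(t1, alpha, steps):
    """Return [t_1, ..., t_{steps+1}] using t_{m+1} = alpha * t_m."""
    seq = [t1]
    for _ in range(steps):
        seq.append(alpha * seq[-1])
    return seq


if __name__ == "__main__":
    # Unit sphere: eps1 = eps2 = 1 and a = b = c = 1.
    print(superquadric_f(1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0))
    # A negative starting value (minimum of the field) shrinks toward zero.
    print(threshold_sequence(-1.0, 0.5, 3))
```

With a negative initial threshold, each step moves the threshold closer to zero, so successive passes look at progressively shallower interior regions of the signed distance field.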

Stage Two: Superquadric Shape Feature Identification

Classifies each superquadric shape parameter, $\varepsilon_1$ and $\varepsilon_2$, into one of three numerical intervals:

  • $(0, 0.5)$: Cylindrical features
  • $[0.5, 2]$: Ellipsoidal features
  • $(2, +\infty)$: Star-shaped features

Combining the three possible shape features in the z-direction with the three in the xy-plane yields nine superquadric types.
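The interval rule can be sketched as a small lookup. The code assumes that ε1 governs the z-direction and ε2 the xy-plane, which matches the roles the two exponents play in the implicit equation; the function names and label strings are illustrative only.

```python
from itertools import product


def shape_feature(eps: float) -> str:
    """Map one shape parameter to its interval label."""
    if eps <= 0:
        raise ValueError("shape parameter must be positive")
    if eps < 0.5:
        return "cylindrical"
    if eps <= 2:
        return "ellipsoidal"
    return "star-shaped"


def superquadric_type(eps1: float, eps2: float) -> tuple:
    """Return (z-direction feature, xy-plane feature).

    Assumes eps1 controls the z-direction and eps2 the xy-plane.
    """
    return (shape_feature(eps1), shape_feature(eps2))


if __name__ == "__main__":
    print(superquadric_type(0.1, 1.0))
    # One representative value per interval gives all 3 x 3 combinations.
    samples = [0.25, 1.0, 3.0]
    print(len({superquadric_type(e1, e2) for e1, e2 in product(samples, samples)}))
```

The interval boundaries 0.5 and 2 both fall in the ellipsoidal class, following the closed interval given in the list.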

Stage Three: Primitive Fitting and Matching Algorithm

Represents parameterized primitives using polar coordinate equations:

  • Z-direction: Cylindrical coordinates, spherical coordinates, and polar coordinate equations of star curves
  • XY-plane: Rectangular base, elliptical base, and star-shaped base polar coordinate equations

Applies the rotation vector R and translation vector T of each superquadric to its matched primitive, so that the rotated and translated primitives fit the target 3D model.
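The placement step can be illustrated with a rigid transform. This sketch assumes that R is an axis-angle rotation vector (direction gives the axis, length gives the angle in radians) and that rotation is applied before translation; the source does not spell out either convention, and the function names are hypothetical.

```python
import math


def rotate(point, rotvec):
    """Rotate a 3D point by an axis-angle vector using Rodrigues' formula."""
    angle = math.sqrt(sum(r * r for r in rotvec))
    if angle < 1e-12:
        return tuple(point)  # no rotation
    kx, ky, kz = (r / angle for r in rotvec)
    px, py, pz = point
    # Cross product k x p and dot product k . p.
    cross = (ky * pz - kz * py, kz * px - kx * pz, kx * py - ky * px)
    dot = kx * px + ky * py + kz * pz
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    return tuple(
        p * cos_a + c * sin_a + k * dot * (1.0 - cos_a)
        for p, c, k in zip((px, py, pz), cross, (kx, ky, kz))
    )


def place_primitive(points, rotvec, translation):
    """Rotate, then translate, every sampled point of a primitive."""
    placed = []
    for p in points:
        rx, ry, rz = rotate(p, rotvec)
        tx, ty, tz = translation
        placed.append((rx + tx, ry + ty, rz + tz))
    return placed


if __name__ == "__main__":
    quarter_turn_about_z = (0.0, 0.0, math.pi / 2)
    print(place_primitive([(1.0, 0.0, 0.0)], quarter_turn_about_z, (1.0, 2.0, 3.0)))
```

Because the primitive inherits R and T from the superquadric it replaces, no separate pose search is needed in this sketch; only the surface shape changes.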

Technical Innovations

  1. Shape Feature Analysis: Through systematic analysis of superquadric parameter effects on shape, establishes mapping relationships from superquadrics to parameterized primitives.
  2. Parameterized Representation: Achieves model storage by saving only primitive parameters (size parameters S, shape parameters $\varepsilon_1$ and $\varepsilon_2$, translation vector T, rotation vector R).
  3. Zero-shot Generation: Combines implicit diffusion models and primitive decomposition to achieve zero-shot 3D generation across modalities.
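To make the parameter-only storage idea concrete, the sketch below serializes a list of primitives as JSON and reads it back. The JSON layout, the `PrimitiveParams` class, and its field names are assumptions made for illustration; the source lists which parameters are kept but does not describe the file format.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class PrimitiveParams:
    """One primitive: 3 + 2 + 3 + 3 = 11 numbers (hypothetical layout)."""
    size: tuple         # S, assumed to be (a, b, c)
    eps1: float         # shape parameter for the z-direction
    eps2: float         # shape parameter for the xy-plane
    translation: tuple  # T
    rotation: tuple     # R, assumed axis-angle


def save_model(primitives) -> str:
    """Serialize only the parameters; no vertices or faces are stored."""
    return json.dumps([asdict(p) for p in primitives])


def load_model(text: str):
    """Rebuild the parameter list; surfaces would be regenerated from it."""
    return [
        PrimitiveParams(
            tuple(d["size"]), d["eps1"], d["eps2"],
            tuple(d["translation"]), tuple(d["rotation"]),
        )
        for d in json.loads(text)
    ]


if __name__ == "__main__":
    part = PrimitiveParams((1.0, 2.0, 3.0), 0.3, 1.0, (0.0, 0.0, 0.0), (0.0, 0.0, 1.57))
    text = save_model([part])
    print(len(text.encode("utf-8")), "bytes for one primitive")
```

File size in such a scheme scales with the number of primitives rather than with surface resolution, which is why the surface can be re-sampled at any density after loading.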

Experimental Setup

Datasets

  1. Virtual Scene Dataset:
    • Primarily based on ShapeNet dataset, containing over 3000 object categories and 220,000 models
    • Includes test images and text from ImageDream, One-2-3-45++, Wonder3D, MVDream, TripoSR and other models
  2. Real-world Scene Dataset:
    • Primarily based on CO3D dataset, providing abundant real-world 3D data
    • Includes partial images from AKB-48 and OmniObject 3D

Evaluation Metrics

  • Chamfer Distance (CD): Measures similarity between two point clouds
  • Volumetric Intersection over Union (VIoU): Evaluates overlap degree of 3D models
  • F1-Score: Comprehensively considers surface reconstruction precision and recall
  • Normal Consistency (NC): Evaluates consistency of surface normal vectors
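Three of these metrics can be sketched in a few lines for small inputs. Conventions vary between papers (squared versus unsquared distances, sums versus means, choice of F1 threshold), so the definitions below are one common variant assumed for illustration and not necessarily the ones used in the experiments. The brute-force nearest-neighbour search is only suitable for tiny point sets.

```python
def _sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def _nearest_sq(p, cloud):
    """Squared distance from p to its nearest neighbour in cloud."""
    return min(_sq_dist(p, q) for q in cloud)


def chamfer_distance(a, b):
    """Mean squared nearest-neighbour distance, summed over both directions."""
    a_to_b = sum(_nearest_sq(p, b) for p in a) / len(a)
    b_to_a = sum(_nearest_sq(q, a) for q in b) / len(b)
    return a_to_b + b_to_a


def f1_score(pred, gt, tau):
    """Harmonic mean of precision and recall at distance threshold tau."""
    precision = sum(_nearest_sq(p, gt) <= tau ** 2 for p in pred) / len(pred)
    recall = sum(_nearest_sq(q, pred) <= tau ** 2 for q in gt) / len(gt)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def volumetric_iou(occupied_a, occupied_b):
    """IoU of two sets of occupied voxel indices."""
    a, b = set(occupied_a), set(occupied_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


if __name__ == "__main__":
    pred = [(0, 0, 0), (1, 0, 0)]
    gt = [(0, 0, 0), (1, 0, 1)]
    print(chamfer_distance(pred, gt), f1_score(pred, gt, 0.5), volumetric_iou(pred, gt))
```

Normal Consistency is omitted here because it needs surface normals in addition to point positions.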

Comparison Methods

  • EMS
  • SuperDec
  • Marching-Primitives (MP)

Implementation Details

  • Hardware: AMD Ryzen 7 9700X CPU, NVIDIA GeForce RTX 5060Ti
  • Software: Windows 11, Python 3.10
  • TSDF parameters: Voxel space size -13,13, 100 uniform samples per dimension, total 10⁶ voxels
  • Mesh resolution: 100

Experimental Results

Main Results

Virtual Scene Dataset Results

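| Method | CD (×10⁻³) ↓ | VIoU ↑ | F1-Score ↑ | NC ↑ |
| --- | --- | --- | --- | --- |
| EMS | 13.1 | 0.218 | 0.8572 | 0.6607 |
| SuperDec | 6.38 | 0.246 | 0.8629 | 0.7101 |
| MP | 4.95 | 0.390 | 0.8193 | 0.7284 |
| Proposed Method | 3.09 | 0.545 | 0.9139 | 0.8369 |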

Relative to the MP method, the proposed method reduces CD by 37.6%, increases VIoU by 39.7%, improves F1-Score by 11.5%, and improves NC by 14.9%.
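These percentages are relative changes with MP as the baseline, and they can be reproduced from the virtual scene results. The short check below uses the MP and proposed-method rows as reported; the dictionary and function names are illustrative.

```python
# Rows taken from the virtual scene results (CD in units of 1e-3).
MP = {"CD": 4.95, "VIoU": 0.390, "F1": 0.8193, "NC": 0.7284}
PROPOSED = {"CD": 3.09, "VIoU": 0.545, "F1": 0.9139, "NC": 0.8369}


def relative_change(baseline: float, value: float) -> float:
    """Magnitude of the change as a percentage of the baseline."""
    return abs(value - baseline) / baseline * 100.0


def improvements(baseline: dict, proposed: dict) -> dict:
    """Relative change per metric, rounded to one decimal place."""
    return {
        name: round(relative_change(baseline[name], proposed[name]), 1)
        for name in baseline
    }


if __name__ == "__main__":
    print(improvements(MP, PROPOSED))
```

For CD the change is a reduction (lower is better), while for the other three metrics it is an increase.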

Real-world Scene Dataset Results

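| Method | CD (×10⁻³) ↓ | VIoU ↑ | F1-Score ↑ | NC ↑ |
| --- | --- | --- | --- | --- |
| EMS | 15.1 | 0.141 | 0.8917 | 0.7539 |
| SuperDec | 4.40 | 0.301 | 0.8383 | 0.6759 |
| MP | 4.32 | 0.492 | 0.7771 | 0.5882 |
| Proposed Method | 2.52 | 0.673 | 0.9183 | 0.7752 |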

ShapeNet Dataset Detailed Results

On six categories (bench, table, plane, cabinet, bottle, rifle), the proposed method achieves average CD of 0.503×10⁻³, VIoU of 0.742, F1-Score of 0.8896, and NC of 0.4511, demonstrating superior performance across all metrics.

Storage Capacity Comparison Experiment

| Input Type | Mesh Storage | Primitive Storage |
| --- | --- | --- |
| Text | 4.56MB | 5KB |
| Image | 5.76MB | 6KB |
| All | 5.36MB | 6KB |

Storage drops from the MB level to the KB level, a reduction of roughly three orders of magnitude.
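The size of the reduction can be worked out directly from the storage comparison. The sketch below assumes 1 MB = 1000 KB (a binary convention would raise each ratio by about 2%); the names are illustrative.

```python
import math

# (mesh size in MB, primitive file size in KB) per input type, as reported.
STORAGE = {"Text": (4.56, 5), "Image": (5.76, 6), "All": (5.36, 6)}
KB_PER_MB = 1000  # assumption: decimal units


def reduction_factor(mesh_mb: float, primitive_kb: float) -> float:
    """How many times smaller the primitive file is than the mesh."""
    return mesh_mb * KB_PER_MB / primitive_kb


def orders_of_magnitude(factor: float) -> float:
    """Base-10 logarithm of the reduction factor."""
    return math.log10(factor)


if __name__ == "__main__":
    for name, (mesh_mb, prim_kb) in STORAGE.items():
        factor = reduction_factor(mesh_mb, prim_kb)
        print(name, round(factor), round(orders_of_magnitude(factor), 2))
```

Under this assumption each ratio falls a little below 1000, so "three orders of magnitude" is a rounded description of a reduction of around 900 times.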

Ablation Study

Ablation experiments on the real-world scene dataset show that the full method achieves the best performance on the VIoU, F1-Score, and NC metrics, validating the effectiveness of the four polar coordinate equations.

Related Work

Implicit Diffusion Models

Early 3D model generation techniques were primarily based on supervised learning, requiring substantial annotated data. Implicit diffusion models provide new perspectives for single-image 3D reconstruction through Score Distillation Sampling techniques and guidance from pre-trained 2D diffusion models for 3D representation optimization.

Primitive-based 3D Models

Existing research primarily achieves shape representation by decomposing 3D models into multiple simple primitives, including superellipsoids, anisotropic Gaussians, and convex hulls. Related methods such as Marching-Primitives extend the range of generatable models through iterative fitting of truncated signed distance fields.

Conclusions and Discussion

Main Conclusions

The proposed multi-stage cross-modal parameterized primitive generation framework can:

  1. Generate diverse 3D base models responsive to various conditional inputs
  2. Surpass state-of-the-art algorithms on CD, VIoU, F1-Score, and NC metrics
  3. Generate parameterized primitive composite models with improved aesthetic appeal
  4. Achieve significant storage space savings

Limitations

  1. Toroidal Cylinder Fitting Issue: Since superquadrics lack penetrating surfaces, the method cannot effectively match or fit toroidal cylinders
  2. Parameterized Representation Advantages: Fails to sufficiently demonstrate advantages compared to alternative schemes such as NURBS
  3. Complex Model Quality: Limited model quality in invisible viewpoints of complex models due to multi-view generation quality constraints

Future Directions

  1. Use variational autoencoders to encode point clouds of complex primitives for toroidal cylinder primitive matching
  2. Employ other types of surface fitting models for model components to demonstrate advantages of parameterized representation
  3. Better utilize multi-modal information to describe target model features or perform fine-tuning training on downstream tasks

In-depth Evaluation

Strengths

  1. Strong Method Innovation: First to propose systematic mapping from superquadrics to parameterized primitives
  2. Comprehensive Experiments: Thorough validation on both virtual and real-world scene datasets
  3. High Practical Value: Significantly reduces storage requirements, suitable for rapid prototyping
  4. Clear Technical Roadmap: Well-designed three-stage framework with clearly defined module functions

Weaknesses

  1. Limited Applicability: Primarily suitable for simple models with limited capability for handling complex topologies
  2. Dependence on Pre-trained Models: Relies on quality of pre-trained models such as ImageDream
  3. Insufficient Theoretical Analysis: Lacks theoretical analysis of parameterized primitive representation capacity
  4. Limited Evaluation Metrics: Primarily focuses on geometric similarity, lacking subjective assessment of visual quality

Impact

  1. Academic Contribution: Provides new parameterized representation perspectives for 3D generation field
  2. Practical Value: Demonstrates significant improvements in storage efficiency and surface quality
  3. Reproducibility: Detailed method description and clear experimental setup

Applicable Scenarios

  • Rapid prototyping in industrial design
  • Simple 3D asset generation in game development
  • Lightweight 3D content creation for virtual reality scenes
  • 3D model storage and transmission on mobile devices

References

The paper cites 38 related references covering key works in 3D generation, implicit diffusion models, primitive decomposition, and other critical domains, providing a solid theoretical foundation for this research.