Image-based Facial Rig Inversion

Yang, Volino, Mustafa et al.
We present an image-based rig inversion framework that leverages two modalities: RGB appearance and RGB-encoded normal maps. Each modality is processed by an independent Hiera transformer backbone, and the extracted features are fused to regress 102 rig parameters derived from the Facial Action Coding System (FACS). Experiments on synthetic and scanned datasets demonstrate that the method generalizes to scanned data, producing faithful reconstructions.

Basic Information

  • Paper ID: 2510.13933
  • Title: Image-based Facial Rig Inversion
  • Authors: Tianxiang Yang, Marco Volino, Armin Mustafa, Greg Maguire, Robert Kosk
  • Institutions: University of Surrey & Humain Ltd.
  • Classification: eess.IV (Image and Video Processing)
  • Publication Date: October 15, 2025
  • Paper Link: https://arxiv.org/abs/2510.13933v1

Abstract

This paper proposes an image-based facial rig inversion framework that leverages two modalities: RGB appearance images and RGB-encoded normal maps. Each modality is processed through an independent Hiera transformer backbone network, and the extracted features are fused to regress 102 rig parameters based on the Facial Action Coding System (FACS). Experiments on synthetic and scanned datasets demonstrate that the method generalizes to scanned data and produces faithful reconstruction results.

Research Background and Motivation

Problem Definition

Facial rig inversion is the process of accurately recovering rig control parameters from visual input, which plays a critical role in animation production, virtual avatars, and performance capture pipelines, enabling direct control of production assets.

Research Significance

  1. Animation Production Requirements: Precise control of facial expressions is key to achieving realistic character animation in modern animation production
  2. Virtual Avatar Applications: With the development of metaverse and virtual reality technologies, real-time and accurate facial expression capture becomes increasingly important
  3. Performance Capture Pipeline: Provides high-quality facial animation production tools for the entertainment industry, including film and gaming

Limitations of Existing Methods

  1. Early Approaches: Rely on statistical or regression models trained on animator-created data with limited generalization capability
  2. Mesh-based Methods: While information-rich, they are limited to well-structured topologies and show poor adaptability to scanned data
  3. Lack of Image Domain Exploration: Most prior work relies on mesh-level features, while the image-input direction remains insufficiently explored

Research Motivation

Image domain input offers advantages in generalizing to scanned data. This direction has important practical value but remains understudied; therefore, this paper focuses on developing image-based facial rig inversion methods.

Core Contributions

  1. Dual-Modality Image Processing Framework: First to propose a dual-branch network architecture combining RGB appearance images and RGB-encoded normal maps
  2. Hiera Transformer Application: Applies the Hiera hierarchical vision transformer to the facial rig inversion task
  3. Multi-Supervision Learning Strategy: Provides supervision in both rig parameter space and 3D mesh space, ensuring numerical accuracy and geometric consistency
  4. Scanned Data Generalization: Validates the method's generalization capability on real scanned data, filling a research gap

Method Details

Task Definition

Given an appearance image $I_a$ and a normal map $I_n$, learn a function $f_\theta : (I_a, I_n) \mapsto p \in \mathbb{R}^{102}$, where $p$ represents the control parameters of the target rig.

Model Architecture

Overall Design

As shown in Figure 1, the proposed dual-branch network architecture contains the following core components:

  1. Dual-Branch Feature Extraction:
    • RGB branch processes appearance images, capturing texture and illumination information
    • Normal map branch processes geometric information, describing per-pixel surface orientation
  2. Hiera Backbone Network:
    • Each branch uses an independent Hiera transformer backbone network
    • Input resolution is increased from the pre-trained 224×224 to 512×512, preserving fine-grained facial features
    • First three encoding stages are frozen to preserve low-level features, while the final stage is trainable
  3. Feature Fusion and Regression:
    • Extracted features are concatenated and fed into a multi-layer perceptron (MLP) regression head
    • Outputs 102 FACS-derived rig control parameters
  4. Procedural Rig Decoding:
    • Uses a procedural rig implemented in PyTorch to decode the parameters into a 3D mesh
    • The procedural rig mirrors a custom Maya facial rig, so the decoded mesh can be used for mesh reconstruction
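
The fusion and regression step can be illustrated with a minimal sketch. It assumes each branch has already produced a flat feature vector; the feature size, hidden width, and random initialisation are hypothetical placeholders, and the pure-Python MLP only stands in for the actual regression head, whose layout the summary does not specify.

```python
import random

NUM_RIG_PARAMS = 102  # from the text: 102 FACS-derived rig parameters


def linear(x, weights, bias):
    """Apply y = W x + b, with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero bias; purely illustrative initialisation."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def regress_rig_params(rgb_feat, normal_feat, layers):
    """Concatenate the two branch features and run them through an MLP head."""
    x = list(rgb_feat) + list(normal_feat)  # feature fusion by concatenation
    for i, (weights, bias) in enumerate(layers):
        x = linear(x, weights, bias)
        if i < len(layers) - 1:  # no activation after the output layer
            x = relu(x)
    return x


rng = random.Random(0)
FEAT_DIM = 8     # hypothetical per-branch feature size
HIDDEN_DIM = 16  # hypothetical hidden width
layers = [
    init_layer(2 * FEAT_DIM, HIDDEN_DIM, rng),
    init_layer(HIDDEN_DIM, NUM_RIG_PARAMS, rng),
]

rgb_feat = [rng.random() for _ in range(FEAT_DIM)]     # stand-in for the RGB branch output
normal_feat = [rng.random() for _ in range(FEAT_DIM)]  # stand-in for the normal branch output
params = regress_rig_params(rgb_feat, normal_feat, layers)
print(len(params))  # 102
```

Concatenation keeps the two modalities separate until the head, so the MLP is free to weight texture cues and surface-orientation cues differently for each rig parameter.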

Technical Details

  • Image Preprocessing: All images are resized to 512×512 pixels, center-cropped, and normalized using ImageNet statistics
  • Normal Map Encoding: Encoded in tangent space, mapping surface normals in the [-1, 1] range to the [0, 255] RGB range
  • Rendering Setup: Fixed resolution, constant camera pose, and consistent three-point lighting
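
As a hedged illustration of the preprocessing above, the sketch below shows the usual linear mapping of a normal from [-1, 1] to 8-bit RGB and the standard per-channel ImageNet normalisation. The function names are hypothetical, and rounding to the nearest integer is an assumption; the summary does not state how the encoding is quantised.

```python
# Widely used ImageNet channel statistics for inputs scaled to [0, 1].
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def encode_normal(normal):
    """Map a normal with components in [-1, 1] to 8-bit RGB in [0, 255]."""
    return tuple(int(round((c + 1.0) * 0.5 * 255.0)) for c in normal)


def decode_normal(rgb):
    """Inverse mapping, exact up to 8-bit quantisation error."""
    return tuple(v / 255.0 * 2.0 - 1.0 for v in rgb)


def normalize_pixel(rgb):
    """Scale an 8-bit RGB pixel to [0, 1], then standardise each channel."""
    return tuple((v / 255.0 - m) / s for v, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD))


# A tangent-space normal pointing straight out of the surface.
flat = encode_normal((0.0, 0.0, 1.0))
print(flat)  # (128, 128, 255)
print(normalize_pixel(flat))
```

The flat normal maps to the familiar light-blue colour of tangent-space normal maps, which is a quick sanity check that the encoding direction is right.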

Technical Innovations

  1. Multi-Modal Fusion Strategy: Combines complementary appearance and geometric information
  2. High-Resolution Processing: 512×512 input preserves fine-grained texture and geometric cues necessary for capturing subtle expression changes
  3. Partial Freezing Strategy: Freezes low-level feature layers of the pre-trained model, retaining universal visual representations while adapting to the specific task
  4. Dual Supervision Mechanism: Joint supervision in parameter space and mesh space keeps predictions both numerically accurate and geometrically consistent
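
The partial freezing strategy can be sketched in a framework-free way. The four stage names and the parameter counts below are hypothetical placeholders; only the split (first three stages frozen, final stage trainable) comes from the text.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    num_params: int
    trainable: bool = True


def freeze_early_stages(stages, num_frozen=3):
    """Mark the first `num_frozen` stages as frozen and the rest as trainable."""
    for index, stage in enumerate(stages):
        stage.trainable = index >= num_frozen
    return stages


def trainable_param_count(stages):
    return sum(stage.num_params for stage in stages if stage.trainable)


# Hypothetical four-stage backbone with made-up parameter counts.
backbone = [Stage("stage1", 100), Stage("stage2", 200), Stage("stage3", 400), Stage("stage4", 800)]
freeze_early_stages(backbone)
print([stage.trainable for stage in backbone])  # [False, False, False, True]
print(trainable_param_count(backbone))          # 800
```

In PyTorch the same effect is usually obtained by setting `requires_grad` to `False` on the parameters of the frozen stages, so the optimizer only updates the final stage and the regression head.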

Experimental Setup

Datasets

Training Set

  • Synthetic Data: Generated using deformation transfer (DT) blend shape rigging
  • Parameter Activation Strategy: Each rig parameter is independently activated, plus 20 manually combined standard expressions
  • Data Augmentation:
    • Random parameter dropout, addition, or replacement to simulate real performance variations
    • Parameter values sampled from normal distributions to create different intensities
    • Rigid transformation augmentation to improve robustness to subtle misalignments in scanned data
  • Scale: 22,575 training samples
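
One possible reading of the parameter-level augmentation is sketched below. The probabilities, the normal distribution's mean and standard deviation, and the [0, 1] parameter range are all assumptions made for illustration; the rigid transformation augmentation is applied to the geometry and is not covered here.

```python
import random

NUM_RIG_PARAMS = 102  # from the text


def clamp01(value):
    """Keep a sampled intensity inside the assumed [0, 1] parameter range."""
    return min(1.0, max(0.0, value))


def augment(params, rng, p_drop=0.3, p_add=0.3, p_replace=0.3, mean=0.5, std=0.2):
    """Return a perturbed copy of one rig-parameter vector."""
    out = list(params)

    def indices(active):
        return [i for i, v in enumerate(out) if (v > 0.0) == active]

    # Dropout: switch one active parameter off.
    if indices(True) and rng.random() < p_drop:
        out[rng.choice(indices(True))] = 0.0
    # Addition: switch one inactive parameter on at a sampled intensity.
    if indices(False) and rng.random() < p_add:
        out[rng.choice(indices(False))] = clamp01(rng.gauss(mean, std))
    # Replacement: resample the intensity of one active parameter.
    if indices(True) and rng.random() < p_replace:
        out[rng.choice(indices(True))] = clamp01(rng.gauss(mean, std))
    return out


rng = random.Random(0)
base = [0.0] * NUM_RIG_PARAMS
base[3] = 1.0  # a single activated parameter, as in the per-parameter poses
augmented = augment(base, rng)
print(sum(1 for v in augmented if v > 0.0), "active parameters after augmentation")
```

Passing a seeded `random.Random` instance keeps the augmentation reproducible, which helps when comparing runs.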

Validation Set

  • Real Scanned Data: Contains scan sequences of actors performing 20 expressions
  • Purpose: Evaluates model generalization capability on real data

Training Details

  • Optimizer: AdamW with learning rate 1×10^-4
  • Training Epochs: 200 epochs with batch size 32
  • Hardware: Single NVIDIA 4080 Laptop GPU
  • Training Steps: Approximately 141k steps (706 iterations per epoch)
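
The step count follows directly from the dataset size, batch size, and number of epochs. The short check below assumes the final partial batch of each epoch is kept rather than dropped.

```python
import math

num_samples = 22_575  # training samples
batch_size = 32
epochs = 200

# 22,575 / 32 = 705.47, so the last batch of each epoch is only partially filled.
iters_per_epoch = math.ceil(num_samples / batch_size)
total_steps = iters_per_epoch * epochs

print(iters_per_epoch, total_steps)  # 706 141200
```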

Loss Function

The combined loss function includes:

  1. Parameter Space Loss: Mean squared error (MSE) between predicted and ground truth rig parameters
  2. Mesh Space Loss: L1 loss on the mesh reconstructed from the predicted parameters through the procedural rig
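
A minimal sketch of how the two terms could be combined is given below. The linear blend shape decoder, the flattened vertex layout, and the `mesh_weight` factor are assumptions for illustration; the summary does not state how the two terms are weighted, and the actual procedural rig may be more complex than a linear model.

```python
def mse(pred, target):
    """Mean squared error between two equally sized vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def l1(pred, target):
    """Mean absolute error between two equally sized vectors."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def decode_mesh(params, neutral, deltas):
    """Toy linear blend shape rig: neutral + sum_k params[k] * deltas[k]."""
    mesh = list(neutral)
    for weight, delta in zip(params, deltas):
        for i, d in enumerate(delta):
            mesh[i] += weight * d
    return mesh


def combined_loss(pred, target, neutral, deltas, mesh_weight=1.0):
    """Parameter-space MSE plus weighted mesh-space L1."""
    param_loss = mse(pred, target)
    mesh_loss = l1(decode_mesh(pred, neutral, deltas), decode_mesh(target, neutral, deltas))
    return param_loss + mesh_weight * mesh_loss


# One vertex (flattened x, y, z) and two hypothetical blend shapes.
neutral = [0.0, 0.0, 0.0]
deltas = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
loss = combined_loss([0.5, 0.0], [1.0, 0.0], neutral, deltas)
print(round(loss, 4))  # 0.2917
```

The mesh term penalises parameter errors in proportion to how much they move the surface, so parameters with large geometric effect are weighted more heavily than the parameter-space MSE alone would weight them.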

Experimental Results

Main Results

The model is evaluated on scanned data, with predicted parameters applied to the DT blend shape rig used during training for mesh reconstruction.

Reconstruction Quality Analysis

Reconstruction results shown in Figure 2 demonstrate:

  1. Excellent Mouth Region Performance: Predictions are particularly strong in the mouth region, accurately capturing complex mouth expressions
  2. Eye Movement Challenges: Upward, downward, or lateral gaze directions are relatively more challenging for rig inversion
  3. Overall Fidelity: Reconstruction results are visually faithful to the input scanned expressions

Generalization Capability

Experiments demonstrate good generalization of the method from synthetic training data to real scanned data, which is an important advantage of image-based methods over mesh-based approaches.

Related Work

Traditional Methods

  1. Statistical Regression Models: Early methods rely on statistical or regression models trained on animator-created data
  2. Inverse Kinematics Learning: Holden et al.'s character pose inverse kinematics learning approach
  3. Neural Rigging: RigNet and other neural rigging methods providing automatic rigging for skeletal characters

Modern Learning Methods

  1. Differentiable Rigging: Bolduc and Phan achieve rig inversion through training differentiable rig functions
  2. Mesh-Level Supervision: Learning methods implementing mesh-level supervision through differentiable rig approximation
  3. Vision Transformers: Applications of hierarchical vision transformers like Hiera in computer vision

Positioning of This Work

This paper is the first to systematically explore image-based facial rig inversion methods, filling an important gap in the field.

Conclusions and Discussion

Main Conclusions

  1. Effectiveness Validation: The image-based facial rig inversion framework effectively combines appearance and normal inputs to recover rig parameters
  2. Generalization Capability: The method successfully generalizes to scanned data, producing faithful reconstruction results
  3. Practical Value: Provides a new technical pathway for animation production and performance capture

Limitations

  1. Partial Freezing Strategy: The current partial freezing strategy may limit model adaptation capability
  2. Gaze Direction Challenges: Complex eye movements remain challenging
  3. Data Dependency: Method performance depends on the quality and diversity of training data

Future Directions

The paper suggests that extending fine-tuning to the entire network may further improve adaptation to the rig inversion setting.

In-Depth Evaluation

Strengths

  1. Technical Innovation:
    • First systematic exploration of image-based facial rig inversion
    • Clever dual-modality fusion design
    • High-resolution processing preserves detailed information
  2. Experimental Sufficiency:
    • Comprehensive evaluation on synthetic and real data
    • Clear experimental setup and implementation details
    • Detailed analysis of performance across different facial regions
  3. Practical Value:
    • Addresses actual industry needs
    • Provides end-to-end solution from image directly to rig parameters
    • Good generalization capability on scanned data

Shortcomings

  1. Missing Quantitative Evaluation: Paper lacks detailed quantitative evaluation metrics and numerical results
  2. Insufficient Comparative Experiments: Lacks sufficient comparison with other baseline methods
  3. Lack of Ablation Studies: No detailed analysis of individual component contributions
  4. Dataset Scale: Validation set scale and diversity may be limited

Impact

  1. Academic Contribution: Opens new direction for image-based facial rig inversion
  2. Industrial Application: Provides practical technology for animation, gaming, virtual reality, and other industries
  3. Technology Adoption: Demonstrates a successful application of the Hiera transformer in a specialized domain

Applicable Scenarios

  1. Animation Production: Rapidly generate facial animation from reference images
  2. Performance Capture: Real-time facial expression capture and reconstruction
  3. Virtual Avatars: Real-time mapping of user expressions to virtual characters
  4. Film Post-Production: Precise control and adjustment of facial expressions

References

Key references include:

  1. Bolduc & Phan (2022): Rig inversion through differentiable rig function training
  2. Ryali et al. (2023): Hiera hierarchical vision transformer
  3. Sumner & Popović (2004): Classical method for triangular mesh deformation transfer
  4. Holden et al. (2015): Character pose inverse kinematics learning
  5. Xu et al. (2020): Neural rigging RigNet for skeletal characters

Overall Assessment: This is a pioneering work in the field of facial rig inversion. While there is room for improvement in the completeness of experimental evaluation, its technical innovation and practical value make it an important contribution to the field. The paper provides a new technical pathway for image-based facial animation production with good industrial application prospects.