
Image-based Facial Rig Inversion

Yang, Volino, Mustafa et al.

We present an image-based rig inversion framework that leverages two modalities: RGB appearance and RGB-encoded normal maps. Each modality is processed by an independent Hiera transformer backbone, and the extracted features are fused to regress 102 rig parameters derived from the Facial Action Coding System (FACS). Experiments on synthetic and scanned datasets demonstrate that the method generalizes to scanned data, producing faithful reconstructions.


Basic Information

Paper ID: 2510.13933
Title: Image-based Facial Rig Inversion
Authors: Tianxiang Yang, Marco Volino, Armin Mustafa, Greg Maguire, Robert Kosk
Institutions: University of Surrey & Humain Ltd.
Classification: eess.IV (Image and Video Processing)
Publication Date: October 15, 2025
Paper Link: https://arxiv.org/abs/2510.13933v1

Abstract

This paper proposes an image-based facial rig inversion framework that leverages two modalities: RGB appearance images and RGB-encoded normal maps. Each modality is processed through an independent Hiera transformer backbone network, and the extracted features are fused to regress 102 rig parameters based on the Facial Action Coding System (FACS). Experiments on synthetic and scanned datasets demonstrate that the method generalizes to scanned data and produces faithful reconstruction results.

Research Background and Motivation

Problem Definition

Facial rig inversion is the process of recovering rig control parameters from visual input. It plays a critical role in animation production, virtual avatars, and performance capture pipelines because the recovered parameters give direct control over production assets.

Research Significance

Animation Production Requirements: Precise control of facial expressions is key to achieving realistic character animation in modern animation production
Virtual Avatar Applications: With the development of metaverse and virtual reality technologies, real-time and accurate facial expression capture becomes increasingly important
Performance Capture Pipeline: Provides high-quality facial animation production tools for the entertainment industry, including film and gaming

Limitations of Existing Methods

Early Approaches: Rely on statistical or regression models trained on animator-created data with limited generalization capability
Mesh-based Methods: While information-rich, they are limited to well-structured topologies and show poor adaptability to scanned data
Lack of Image Domain Exploration: Most prior work relies on mesh-level features, while the image-input direction remains insufficiently explored

Research Motivation

Image domain input offers advantages in generalizing to scanned data. This direction has important practical value but remains understudied; therefore, this paper focuses on developing image-based facial rig inversion methods.

Core Contributions

Dual-Modality Image Processing Framework: Proposes a dual-branch network architecture combining RGB appearance images and RGB-encoded normal maps
Hiera Transformer Application: Applies the Hiera hierarchical vision transformer to the facial rig inversion task
Multi-Supervision Learning Strategy: Provides supervision in both rig parameter space and 3D mesh space, ensuring numerical accuracy and geometric consistency
Scanned Data Generalization: Validates the method's generalization capability on real scanned data, filling a research gap

Method Details

Task Definition

Given an appearance image $I_a$ and a normal map $I_n$, learn a function $f_\theta : (I_a, I_n) \to p \in \mathbb{R}^{102}$, where $p$ represents the control parameters of the target rig.

Model Architecture

Overall Design

As shown in Figure 1, the proposed dual-branch network architecture contains the following core components:

Dual-Branch Feature Extraction:
- RGB branch processes appearance images, capturing texture and illumination information
- Normal map branch processes geometric information, describing per-pixel surface orientation
Hiera Backbone Network:
- Each branch uses an independent Hiera transformer backbone network
- Input resolution is increased from the pre-trained 224×224 to 512×512, preserving fine-grained facial features
- First three encoding stages are frozen to preserve low-level features, while the final stage is trainable
Feature Fusion and Regression:
- Extracted features are concatenated and fed into a multi-layer perceptron (MLP) regression head
- Outputs 102 FACS-derived rig control parameters
Procedural Rig Decoding:
- Uses a procedural rig implemented in PyTorch to decode the predicted parameters into a 3D mesh
- The procedural rig mirrors a custom Maya facial rig, so decoded meshes can be used for mesh reconstruction
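
The procedural rig decoding step above can be illustrated with a minimal sketch. It assumes a purely linear blend shape rig in which each parameter scales one per-vertex offset field added to a neutral mesh; the summary does not say whether the actual Maya-derived rig is this simple. The function name `decode_blendshape_rig` and the toy data are hypothetical, and plain Python lists stand in for the PyTorch tensors the paper uses.

```python
from typing import List, Sequence

Vertex = List[float]  # [x, y, z]


def decode_blendshape_rig(
    neutral: Sequence[Vertex],
    deltas: Sequence[Sequence[Vertex]],
    params: Sequence[float],
) -> List[Vertex]:
    """Decode rig parameters into vertex positions.

    Assumption: a linear blend shape rig, i.e.
    mesh = neutral + sum_i params[i] * deltas[i].
    """
    if len(deltas) != len(params):
        raise ValueError("need exactly one delta field per rig parameter")
    mesh = [list(v) for v in neutral]  # copy so the neutral mesh is untouched
    for weight, delta in zip(params, deltas):
        for vertex, offset in zip(mesh, delta):
            for axis in range(3):
                vertex[axis] += weight * offset[axis]
    return mesh


# Toy example: 2 vertices and 2 hypothetical controls (not the paper's 102).
neutral = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
deltas = [
    [[0.0, 1.0, 0.0], [0.0, 0.0, 0.0]],  # control 0 lifts vertex 0
    [[0.0, 0.0, 0.0], [0.0, 0.0, 2.0]],  # control 1 pushes vertex 1 along z
]
decoded = decode_blendshape_rig(neutral, deltas, [0.5, 0.25])
```

Because every operation here is an addition or a multiplication, the same computation written with tensors is differentiable, which is what allows a mesh-space loss to propagate gradients back to the predicted parameters.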

Technical Details

Image Preprocessing: All images are resized to 512×512 pixels, center-cropped, and normalized using ImageNet statistics
Normal Map Encoding: Encoded in tangent space, mapping surface normals in the [-1, 1] range to the [0, 255] RGB range
Rendering Setup: Fixed resolution, constant camera pose, and consistent three-point lighting
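
As a rough illustration of the preprocessing listed above, the sketch below maps one normal vector to 8-bit RGB and standardizes one pixel with the commonly used ImageNet channel statistics. The rounding and clamping behaviour and the helper names `encode_normal` and `normalize_pixel` are assumptions; the summary only gives the value ranges.

```python
from typing import Tuple

# Widely used ImageNet channel statistics for inputs scaled to [0, 1].
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def encode_normal(normal: Tuple[float, float, float]) -> Tuple[int, ...]:
    """Map normal components in [-1, 1] to 8-bit RGB values in [0, 255]."""
    return tuple(
        max(0, min(255, round((c + 1.0) * 0.5 * 255.0))) for c in normal
    )


def normalize_pixel(rgb: Tuple[int, int, int]) -> Tuple[float, ...]:
    """Scale 8-bit RGB to [0, 1], then standardize per channel."""
    return tuple(
        (v / 255.0 - m) / s for v, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )


# A tangent-space normal pointing straight out of the surface.
flat = encode_normal((0.0, 0.0, 1.0))
standardized = normalize_pixel(flat)
```

An undisturbed tangent-space normal lands near the middle of the red and green channels and at the top of the blue channel, which is why such normal maps look predominantly blue.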

Technical Innovations

Multi-Modal Fusion Strategy: Combines appearance and geometric information, which carry complementary cues
High-Resolution Processing: 512×512 input preserves fine-grained texture and geometric cues necessary for capturing subtle expression changes
Partial Freezing Strategy: Freezes low-level feature layers of the pre-trained model, retaining universal visual representations while adapting to the specific task
Dual Supervision Mechanism: Joint supervision in parameter space and mesh space encourages predictions that are both numerically accurate and geometrically consistent

Experimental Setup

Datasets

Training Set

Synthetic Data: Generated using deformation transfer (DT) blend shape rigging
Parameter Activation Strategy: Each rig parameter is independently activated, plus 20 manually combined standard expressions
Data Augmentation:
- Random parameter dropout, addition, or replacement to simulate real performance variations
- Parameter values sampled from normal distributions to create different intensities
- Rigid transformation augmentation to improve robustness to subtle misalignments in scanned data
Scale: 22,575 training samples
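
The augmentation steps listed above might look roughly like the following sketch. The summary does not give the sampling distribution's mean and standard deviation, the valid parameter range, or how many controls are altered per sample, so the values below (mean 0.7, standard deviation 0.2, clamping to [0, 1], one altered control per sample) and the function names are illustrative assumptions only.

```python
import random
from typing import List, Sequence

NUM_PARAMS = 102  # number of rig controls reported in the summary


def sample_intensity(rng: random.Random, mean: float = 0.7, std: float = 0.2) -> float:
    """Draw an activation from a normal distribution, clamped to [0, 1] (assumed range)."""
    return min(1.0, max(0.0, rng.gauss(mean, std)))


def augment_expression(base_controls: Sequence[int], rng: random.Random) -> List[float]:
    """Drop, add, or replace one active control, then sample all intensities."""
    original = set(base_controls)
    active = set(original)
    op = rng.choice(["dropout", "addition", "replacement"])
    if op in ("dropout", "replacement") and active:
        active.remove(rng.choice(sorted(active)))
    if op in ("addition", "replacement"):
        candidates = [i for i in range(NUM_PARAMS) if i not in original]
        if candidates:
            active.add(rng.choice(candidates))
    params = [0.0] * NUM_PARAMS
    for index in sorted(active):
        params[index] = sample_intensity(rng)
    return params


rng = random.Random(0)  # fixed seed so the sketch is reproducible
sample = augment_expression([3, 10], rng)  # hypothetical two-control expression
```

Rigid transformation augmentation is applied to the geometry rather than to the parameter vector, so it is not part of this sketch.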

Validation Set

Real Scanned Data: Contains scan sequences of actors performing 20 expressions
Purpose: Evaluates model generalization capability on real data

Training Details

Optimizer: AdamW with learning rate 1×10^-4
Training Epochs: 200 epochs with batch size 32
Hardware: Single NVIDIA 4080 Laptop GPU
Training Steps: Approximately 141k steps (706 iterations per epoch)
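
The reported step count follows directly from the dataset size, batch size, and epoch count. The short check below assumes the final partial batch of each epoch is kept rather than dropped, which is what reproduces 706 iterations per epoch.

```python
import math

num_samples = 22_575  # training samples
batch_size = 32
epochs = 200

# 22,575 / 32 = 705.47, so the last batch of each epoch is partial.
iters_per_epoch = math.ceil(num_samples / batch_size)
total_steps = iters_per_epoch * epochs
```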

Loss Function

The combined loss function includes:

Parameter Space Loss: Mean squared error (MSE) between predicted and ground truth rig parameters
Mesh Space Loss: L1 loss between the ground-truth mesh and the mesh reconstructed from the predicted parameters through the procedural rig
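
One plausible way to write the combined objective is given below. The summary does not state how the two terms are weighted or normalized, so the weight $\lambda$, the per-element averaging, and the use of $R(p)$ as the ground-truth mesh are assumptions. Here $R$ denotes the procedural rig decoder, $V$ the number of mesh vertices, and $\hat{p} = f_\theta(I_a, I_n)$ the predicted parameters.

```latex
\[
\mathcal{L}(\theta)
  = \underbrace{\frac{1}{102} \sum_{i=1}^{102} \left( \hat{p}_i - p_i \right)^2}_{\text{parameter-space MSE}}
  + \lambda \,
    \underbrace{\frac{1}{3V} \sum_{v=1}^{V} \left\lVert R(\hat{p})_v - R(p)_v \right\rVert_1}_{\text{mesh-space L1}},
\qquad \hat{p} = f_\theta(I_a, I_n)
\]
```

The first term penalizes errors on each control directly, while the second penalizes only those errors that change the decoded geometry, so two parameter vectors that produce nearly the same mesh are treated as close.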

Experimental Results

Main Results

The model is evaluated on scanned data, with predicted parameters applied to the DT blend shape rig used during training for mesh reconstruction.

Reconstruction Quality Analysis

Reconstruction results shown in Figure 2 demonstrate:

Excellent Mouth Region Performance: Predictions are particularly strong in the mouth region, accurately capturing complex mouth expressions
Eye Movement Challenges: Upward, downward, or lateral gaze directions are relatively more challenging for rig inversion
Overall Fidelity: Reconstruction results are visually faithful to the input scanned expressions

Generalization Capability

Experiments demonstrate good generalization of the method from synthetic training data to real scanned data, which is an important advantage of image-based methods over mesh-based approaches.

Related Work

Traditional Methods

Statistical Regression Models: Early methods rely on statistical or regression models trained on animator-created data
Inverse Rig Mapping: Holden et al. learn a mapping from character poses back to the rig controls that produce them
Neural Rigging: RigNet and other neural rigging methods providing automatic rigging for skeletal characters

Modern Learning Methods

Differentiable Rigging: Bolduc and Phan achieve rig inversion through training differentiable rig functions
Mesh-Level Supervision: Learning methods implementing mesh-level supervision through differentiable rig approximation
Vision Transformers: Applications of hierarchical vision transformers like Hiera in computer vision

Positioning of This Work

Most prior work on rig inversion operates on mesh-level features. This paper instead explores image-based facial rig inversion, a direction that has received comparatively little attention.

Conclusions and Discussion

Main Conclusions

Effectiveness Validation: The image-based facial rig inversion framework effectively combines appearance and normal inputs to recover rig parameters
Generalization Capability: The method successfully generalizes to scanned data, producing faithful reconstruction results
Practical Value: Provides a new technical pathway for animation production and performance capture

Limitations

Partial Freezing Strategy: The current partial freezing strategy may limit model adaptation capability
Gaze Direction Challenges: Complex eye movements remain challenging
Data Dependency: Method performance depends on the quality and diversity of training data

Future Directions

The paper suggests that extending fine-tuning to the entire network may further improve adaptation to the rig inversion setting.

In-Depth Evaluation

Strengths

Technical Innovation:
- Explores image-based facial rig inversion, a direction that prior work has largely left to mesh-based methods
- Dual-modality fusion of appearance and surface-normal cues
- High-resolution processing preserves detailed information
Experimental Sufficiency:
- Qualitative evaluation on real scanned data after training on synthetic data
- Clear experimental setup and implementation details
- Qualitative observations on per-region performance (strong in the mouth region, weaker for gaze)
Practical Value:
- Addresses actual industry needs
- Provides end-to-end solution from image directly to rig parameters
- Good generalization capability on scanned data

Shortcomings

Missing Quantitative Evaluation: Paper lacks detailed quantitative evaluation metrics and numerical results
Insufficient Comparative Experiments: Lacks sufficient comparison with other baseline methods
Lack of Ablation Studies: No detailed analysis of individual component contributions
Dataset Scale: Validation set scale and diversity may be limited

Impact

Academic Contribution: Opens new direction for image-based facial rig inversion
Industrial Application: Provides practical technology for animation, gaming, virtual reality, and other industries
Technology Transfer: Demonstrates an application of the Hiera transformer to a specialized production task

Applicable Scenarios

Animation Production: Rapidly generate facial animation from reference images
Performance Capture: Facial expression capture and reconstruction (the paper reports no runtime figures, so real-time use is untested)
Virtual Avatars: Mapping of user expressions to virtual characters
Film Post-Production: Precise control and adjustment of facial expressions

References

Key references include:

Bolduc & Phan (2022): Rig inversion through differentiable rig function training
Ryali et al. (2023): Hiera hierarchical vision transformer
Sumner & Popović (2004): Classical method for triangular mesh deformation transfer
Holden et al. (2015): Learning an inverse rig mapping for character animation
Xu et al. (2020): RigNet, neural rigging for skeletal characters

Overall Assessment: This is a pioneering work in the field of facial rig inversion. While there is room for improvement in the completeness of experimental evaluation, its technical innovation and practical value make it an important contribution to the field. The paper provides a new technical pathway for image-based facial animation production with good industrial application prospects.