Motion Capture from Inertial and Vision Sensors

Chen, Liu, Bao et al.
Human motion capture is the foundation for many computer vision and graphics tasks. While industrial motion capture systems with complex camera arrays or expensive wearable sensors have been widely adopted in movie and game production, consumer-affordable and easy-to-use solutions for personal applications are still far from mature. To utilize a mixture of a monocular camera and very few inertial measurement units (IMUs) for accurate multi-modal human motion capture in daily life, we contribute MINIONS in this paper, a large-scale Motion capture dataset collected from INertial and visION Sensors. MINIONS has several featured properties: 1) large scale of over five million frames and 400 minutes duration; 2) multi-modality data of IMUs signals and RGB videos labeled with joint positions, joint rotations, SMPL parameters, etc.; 3) a diverse set of 146 fine-grained single and interactive actions with textual descriptions. With the proposed MINIONS dataset, we propose a SparseNet framework to capture human motion from IMUs and videos by discovering their supplementary features and exploring the possibilities of consumer-affordable motion capture using a monocular camera and very few IMUs. The experiment results emphasize the unique advantages of inertial and vision sensors, showcasing the promise of consumer-affordable multi-modal motion capture and providing a valuable resource for further research and development.

Basic Information

  • Paper ID: 2407.16341
  • Title: Motion Capture from Inertial and Vision Sensors
  • Authors: Xiaodong Chen, Wu Liu, Qian Bao, Xinchen Liu, Ruoli Dai, Yongdong Zhang, Tao Mei
  • Category: cs.CV (Computer Vision)
  • Publication Date: July 2024 (arXiv preprint, version v3 updated October 11, 2025)
  • Paper Link: https://arxiv.org/abs/2407.16341

Abstract

Human motion capture is fundamental to numerous computer vision and graphics tasks. While industrial-grade motion capture systems are widely deployed in film and game production, consumer-grade, user-friendly personal application solutions remain immature. To achieve accurate multimodal human motion capture using a monocular camera and minimal inertial measurement units (IMUs), this paper proposes the MINIONS dataset—a large-scale motion capture dataset collected from inertial and vision sensors. The dataset features three distinctive characteristics: 1) Large-scale: over 5.5 million frames and 440 minutes of duration; 2) Multimodal: containing IMU signals and RGB video, annotated with joint positions, joint rotations, SMPL parameters, etc.; 3) Diverse: encompassing 146 fine-grained single-person and interactive actions. Based on the MINIONS dataset, the SparseNet framework is proposed, which captures human motion by discovering complementary features between IMU and video, exploring the feasibility of consumer-grade motion capture using monocular cameras and minimal IMUs.

Research Background and Motivation

Problem Definition

The core problem addressed by this research is: How can consumer-grade devices (monocular camera + minimal IMUs) achieve accurate and stable human motion capture to meet daily application requirements?

Problem Significance

  1. Cost Issues: Industrial-grade systems require dozens of synchronized cameras or expensive wearable sensors, with costs reaching thousands of dollars
  2. Portability Issues: Existing systems have complex configurations, limiting usage scenarios
  3. Application Demands: Consumer-level applications such as XR, mobile video production, and live streaming have urgent needs for low-cost motion capture

Limitations of Existing Methods

  1. Marker-based Systems: Require special clothing or numerous IMUs, inconvenient for natural motion
  2. Multi-camera Systems: Require complex calibration, limiting activity range
  3. Monocular Vision Methods: Affected by depth ambiguity, occlusion, and rapid motion, exhibiting temporal jitter
  4. IMU Methods: Suffer from global position drift, limiting long-duration motion capture

Research Motivation

Existing datasets such as TotalCapture are limited in scale, cover a single capture scenario, and require tight-fitting clothing, which creates a distribution gap with daily-life data. This work aims to construct a large-scale, diverse dataset and explore a consumer-grade motion capture solution through vision-inertial fusion.

Core Contributions

  1. Construction of MINIONS Dataset: Contains 5.5 million frames, 440 minutes of multimodal motion capture data, covering 146 fine-grained actions, providing rich annotation information
  2. Proposal of SparseNet Framework: A dual-branch architecture based on Bayesian theory, effectively fusing visual and inertial information for motion capture
  3. Systematic Experimental Analysis: In-depth exploration of performance across different sensor configurations, demonstrating the effectiveness of 4-6 IMUs combined with monocular cameras
  4. Multi-task Benchmark Testing: Provides benchmark results on 2D-3D pose estimation, fine-grained action recognition, and other tasks

Methodology Details

Task Definition

  • Input: Monocular RGB video sequence $V = \{V_i\}_{i=1}^{L}$ and sparse IMU signals $I = \{I_i\}_{i=0}^{L}$
  • Output: SMPL parameters (shape $\beta$, pose $\theta$, global displacement $t$) and 3D joint positions
  • Constraints: Consumer-grade devices, with a minimum of 4 IMU sensors

Model Architecture

Theoretical Foundation

Following a Bayesian fusion strategy, the joint rotation $\theta$ is modeled as a latent variable:

$$p(\theta \mid d_v, D_I) \propto p(\theta) \cdot p(d_v \mid \theta) \cdot p(D_I \mid \theta)$$

Where:

  • $p(\theta)$: Prior distribution of joint rotations (Matrix Fisher distribution)
  • $p(d_v \mid \theta)$: von Mises-Fisher distribution of visual bone direction observations
  • $p(D_I \mid \theta)$: Distribution of IMU rotation observations
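One way to see how the three factors combine is to write each in matrix Fisher form, where a density $p(R) \propto \exp(\mathrm{tr}(F^\top R))$ over rotations is summarized by a 3×3 parameter $F$. Multiplying such densities adds their parameters, and the most likely rotation follows from an SVD of the sum. The sketch below is a minimal NumPy illustration for a single joint, not the paper's implementation: the function names, the concentration values, the rest-pose bone direction, and the choice to model the IMU term as a matrix Fisher density centered on the IMU reading are all assumptions made for the example.

```python
import numpy as np


def matrix_fisher_mode(F):
    """Most likely rotation of p(R) ∝ exp(tr(F^T R)), computed via SVD."""
    U, _, Vt = np.linalg.svd(F)
    # Flip the last axis if needed so the result is a proper rotation (det = +1).
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    return U @ D @ Vt


def fuse_joint_rotation(F_prior, bone_obs, bone_rest, kappa, R_imu, k_imu):
    """Add the natural parameters of the prior, vision, and IMU terms."""
    # vMF bone term: kappa * d^T R l equals tr((kappa * d l^T)^T R).
    F_vision = kappa * np.outer(bone_obs, bone_rest)
    # IMU term (assumption): matrix Fisher centered on the measured rotation.
    F_imu = k_imu * R_imu
    F_post = F_prior + F_vision + F_imu
    return matrix_fisher_mode(F_post), F_post


def rot_z(deg):
    """Rotation about the z axis by `deg` degrees."""
    a = np.deg2rad(deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def geodesic_deg(Ra, Rb):
    """Angle in degrees of the relative rotation between Ra and Rb."""
    cos_angle = (np.trace(Ra.T @ Rb) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))


# Hypothetical single-joint example: weak identity prior, consistent sensors.
R_true = rot_z(30.0)
bone_rest = np.array([0.0, 1.0, 0.0])
R_fused, F_post = fuse_joint_rotation(
    F_prior=np.eye(3),
    bone_obs=R_true @ bone_rest,
    bone_rest=bone_rest,
    kappa=50.0,
    R_imu=R_true,
    k_imu=100.0,
)
print("angular error (deg):", geodesic_deg(R_fused, R_true))
```

Because the prior here is weak relative to the two observation terms, the fused rotation lands very close to the observed one; raising the prior's weight pulls the estimate back toward the prior mode.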

Network Structure

1. Visual Branch

  • Vision Mamba encoder for extracting visual features
  • Shape decoder: Regressing SMPL shape parameters $\beta$
  • Pose decoder: Estimating pose prior distribution $p(\theta)$
  • Bone decoder: Estimating bone direction distribution $p(d_v \mid \theta)$

2. Sparse IMU Branch

  • Joint Mamba encoder: Predicting bone positions $d_{0:i}$ from IMU signals
  • IMU Mamba encoder: Processing sparse inertial signals
  • Rotation decoder: Estimating rotation distribution $p(D_I \mid \theta)$
  • Translation decoder: Estimating global translation $t_I$

3. Post-processing Branch

  • Posterior fusion module: Integrating probability distributions from both branches
  • Smooth Mamba encoder: Smoothing final pose sequences
  • PnP solver: Computing global translation
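The global-translation step can be illustrated with a translation-only variant of the PnP problem: if the 3D joints are already known up to a translation in the camera frame, the pinhole projection equations are linear in the unknown translation and can be solved by least squares. The sketch below assumes this simplified setting, a pinhole camera with hypothetical intrinsics `f`, `cx`, `cy`, and noise-free 2D keypoints; the paper's actual solver may differ.

```python
import numpy as np


def solve_translation(joints_3d, keypoints_2d, f, cx, cy):
    """Least-squares camera-frame translation from 2D-3D joint correspondences.

    From u - cx = f * (X + tx) / (Z + tz):  f*tx - (u - cx)*tz = (u - cx)*Z - f*X
    (and likewise for v), which is linear in t = (tx, ty, tz).
    """
    X, Y, Z = joints_3d[:, 0], joints_3d[:, 1], joints_3d[:, 2]
    u = keypoints_2d[:, 0] - cx
    v = keypoints_2d[:, 1] - cy
    zeros = np.zeros_like(u)
    fs = np.full_like(u, f)
    A = np.concatenate(
        [np.stack([fs, zeros, -u], axis=1), np.stack([zeros, fs, -v], axis=1)],
        axis=0,
    )
    b = np.concatenate([u * Z - f * X, v * Z - f * Y])
    t, *_ = np.linalg.lstsq(A, b, rcond=None)
    return t


def project(points, f, cx, cy):
    """Pinhole projection of camera-frame 3D points to pixel coordinates."""
    return np.stack(
        [f * points[:, 0] / points[:, 2] + cx, f * points[:, 1] / points[:, 2] + cy],
        axis=1,
    )


# Synthetic check with hypothetical values: 17 root-relative joints, known translation.
rng = np.random.default_rng(0)
joints = rng.uniform(-0.5, 0.5, size=(17, 3))
t_true = np.array([0.2, -0.1, 3.0])
keypoints = project(joints + t_true, f=1000.0, cx=256.0, cy=256.0)
t_est = solve_translation(joints, keypoints, f=1000.0, cx=256.0, cy=256.0)
print("recovered translation:", t_est)
```

With noisy keypoints the same system still applies, but the residual of the least-squares fit is no longer zero and per-joint confidence weights would typically be added.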

Technical Innovations

  1. Probabilistic Fusion Framework: Bayesian fusion based on Matrix Fisher prior with solid theoretical foundation
  2. Dual-branch Complementary Design: Visual branch provides shape and position information, IMU branch provides rotation and high-frequency motion information
  3. Sparse Sensor Support: Flexible configuration supporting 4-10 IMUs
  4. End-to-end Training: Unified probabilistic framework supporting joint optimization

Experimental Setup

Dataset

MINIONS Dataset Statistics:

  • Scale: 5.5 million frames, 440 minutes of video
  • Modalities: eight 2K cameras, 17 nine-axis IMUs, and an RGB-D scanner
  • Actions: 146 fine-grained actions (121 single-person + 25 multi-person interactive)
  • Participants: 36 actor groups (20 single-person + 16 multi-person)
  • Annotations: 2D/3D joints, SMPL parameters, action categories, texture information

Data Split:

  • Training set: 12 actors, 3.2 million frames
  • Validation set: 3 actors, 0.9 million frames
  • Test set: 5 actors, 1.4 million frames

Evaluation Metrics

  1. $\mu_{glo}$: Mean global rotation error (degrees)
  2. $\sigma_{glo}$: Global rotation error variance (degrees)
  3. MPJPE: Mean per-joint position error (millimeters)
  4. Jitter: Average joint acceleration jitter ($10^2\,m/s^3$)
  5. PA-MPJPE: Per-joint position error after Procrustes alignment
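These metrics can be sketched in a few lines of NumPy. The definitions below follow common practice in the pose-estimation and sparse-IMU literature and are assumptions rather than the paper's evaluation code: the rotation error is taken as the geodesic angle between rotation matrices, PA-MPJPE uses a similarity (scale, rotation, translation) alignment, and jitter is taken as the mean jerk magnitude computed by finite differences. All function names are hypothetical.

```python
import numpy as np


def mpjpe(pred, gt):
    """Mean per-joint position error; pred and gt have shape (..., J, 3)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()


def procrustes_align(pred, gt):
    """Similarity-align pred (J, 3) to gt (J, 3): scale, rotation, translation."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    P, G = pred - mu_p, gt - mu_g
    U, S, Vt = np.linalg.svd(P.T @ G)
    # Guard against reflections so that R is a proper rotation.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T
    scale = (S * np.diag(D)).sum() / (P ** 2).sum()
    return scale * P @ R.T + mu_g


def pa_mpjpe(pred, gt):
    """MPJPE after Procrustes alignment of a single pose."""
    return mpjpe(procrustes_align(pred, gt), gt)


def global_rotation_error_deg(R_pred, R_gt):
    """Mean geodesic angle in degrees between rotations of shape (..., 3, 3)."""
    R_rel = np.swapaxes(R_pred, -1, -2) @ R_gt
    cos_angle = (np.trace(R_rel, axis1=-2, axis2=-1) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))).mean()


def jitter(positions, fps):
    """Mean jerk magnitude in units of 10^2 m/s^3; positions (T, J, 3) in meters."""
    jerk = np.diff(positions, n=3, axis=0) * fps ** 3
    return np.linalg.norm(jerk, axis=-1).mean() / 100.0


# Small synthetic checks with made-up poses.
rng = np.random.default_rng(0)
gt_pose = rng.normal(size=(17, 3))
shifted = gt_pose + np.array([3.0, 4.0, 0.0])      # every joint is off by 5 units
rot90 = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
similar = 2.0 * gt_pose @ rot90.T + np.array([1.0, 2.0, 3.0])
linear_motion = np.arange(10)[:, None, None] * 0.01 * np.ones((1, 17, 3))

print(mpjpe(shifted, gt_pose), pa_mpjpe(similar, gt_pose))
print(global_rotation_error_deg(np.eye(3), rot90), jitter(linear_motion, fps=60))
```

The functions are unit-agnostic for positions, so reporting MPJPE in millimeters only requires that the inputs be expressed in millimeters.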

Comparison Methods

  • IMU Methods: PIP, PnP, IMU-based baseline methods
  • Vision Methods: TokenHMR, PromptHMR
  • Multimodal Methods: DiffCap, VIP, Liu et al.

Implementation Details

  • Training Strategy: Pre-train visual branch (20 epochs), then train IMU and post-processing branches (200 epochs)
  • Optimizer: Adam, learning rate 0.001
  • Batch Size: Visual branch 64, others 512
  • Input Resolution: 512×512
  • Hardware: NVIDIA A100 GPU

Experimental Results

Main Results

Multimodal Motion Capture Performance Comparison:

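| Method Type | #IMUs | #Cams | $\mu_{glo}$ (°) | $\sigma_{glo}$ (°) | MPJPE (mm) ↓ | Jitter ($10^2\,m/s^3$) ↓ |
|---|---|---|---|---|---|---|
| IMU-based | 6 | 0 | 11.67 | 8.65 | 57.93 | 1.17 |
| Vision-based | 0 | 1 | 10.27 | 7.20 | 45.61 | 13.02 |
| Multi-modal | 6 | 1 | 9.20 | 6.19 | 39.99 | 1.57 |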

Key Findings:

  1. 4-6 IMU Configuration Optimal: Achieves best balance between cost and performance
  2. Complementary Advantages Evident: Vision methods exhibit large jitter, IMU methods suffer severe position drift, fusion significantly improves both
  3. Diminishing Returns Beyond 8 IMUs: Increased cost with limited performance gains

TotalCapture Dataset Comparison

| Method | MPJPE ↓ | PA-MPJPE ↓ |
|---|---|---|
| DiffCap | 46.2 | 29.9 |
| VIP | - | 26.0 |
| Liu et al. | 45.8 | - |
| Ours | 36.7 | 21.6 |

Ablation Studies

Performance Analysis with Different IMU Counts:

  • 4 IMUs: $\mu_{glo} = 9.75°$, MPJPE = 41.53 mm
  • 6 IMUs: $\mu_{glo} = 9.20°$, MPJPE = 39.99 mm
  • 8 IMUs: $\mu_{glo} = 8.86°$, MPJPE = 39.39 mm
  • 10 IMUs: $\mu_{glo} = 8.81°$, MPJPE = 39.43 mm

MPJPE is lowest with 8 IMUs, and most of the improvement over 4 IMUs is already achieved with 6. Adding IMUs beyond 8 yields no further MPJPE gain.
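The diminishing returns can be made concrete by computing the MPJPE reduction per added IMU from the four ablation figures listed above. The short script below is only arithmetic on those reported numbers; the helper name `marginal_gains` is hypothetical.

```python
# Reported ablation values: number of IMUs and the corresponding MPJPE in mm.
imu_counts = [4, 6, 8, 10]
mpjpe_mm = [41.53, 39.99, 39.39, 39.43]


def marginal_gains(counts, errors):
    """MPJPE reduction per added IMU between consecutive configurations."""
    gains = []
    for i in range(len(counts) - 1):
        added = counts[i + 1] - counts[i]
        reduction = errors[i] - errors[i + 1]
        gains.append((counts[i], counts[i + 1], round(reduction / added, 2)))
    return gains


gains = marginal_gains(imu_counts, mpjpe_mm)
for start, end, gain in gains:
    print(f"{start} -> {end} IMUs: {gain:+.2f} mm MPJPE reduction per added IMU")
```

Each IMU added between 4 and 6 reduces MPJPE by about 0.77 mm, between 6 and 8 by about 0.30 mm, and between 8 and 10 the error increases slightly (about 0.02 mm per IMU).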

Other Task Benchmarks

2D-3D Pose Estimation:

  • MotionBERT: MPJPE=18.75mm, PA-MPJPE=13.44mm
  • Dual-Aug (243 frames): MPJPE=19.22mm, PA-MPJPE=13.95mm

Fine-grained Action Recognition:

  • UniFormerV2: Top-1=75.88%, Top-5=96.87%
  • VideoMAE: Top-1=73.75%, Top-5=96.01%

MINIONS is more challenging than Kinetics-400 for these models.

Case Analysis

Visualization results demonstrate:

  1. IMU Method: Accumulates position drift over time, but maintains stable rotation
  2. Vision Method: Accurate position but exhibits temporal jitter
  3. Fusion Method: Combines advantages of both, achieving both stability and accuracy

Related Work

IMU Motion Capture

  • Industrial Solutions: Perception Neuron, Xsens MVN systems using 17 IMUs
  • Sparse IMU Methods: Optimization and regression paradigms
  • Limitations: Long-term position drift issues

Monocular Vision Motion Capture

  • Optimization Methods: Fitting SMPL parameters to video frames
  • Regression Methods: End-to-end learning of SMPL parameters
  • Challenges: Depth ambiguity, occlusion, rapid motion

Multimodal Fusion

  • Existing Work: Small-scale datasets like TotalCapture
  • Advantages of This Work: Larger scale, greater diversity, everyday clothing

Conclusions and Discussion

Main Conclusions

  1. Technical Feasibility: 4-6 IMUs combined with monocular cameras can achieve stable consumer-grade motion capture
  2. Complementary Value: Visual and inertial sensors exhibit clear complementary advantages
  3. Dataset Contribution: MINIONS provides important data resources for the field
  4. Practical Utility: Method demonstrates good generalization across multiple tasks

Limitations

  1. Sensor Dependency: Still requires multiple IMU sensors, increasing system complexity
  2. Real-time Performance: Paper lacks detailed discussion of real-time performance
  3. Environmental Adaptability: Primarily tested in indoor environments; robustness in complex outdoor environments remains unverified
  4. Clothing Effects: While using everyday clothing, the impact of loose-fitting garments on IMU accuracy requires further investigation

Future Directions

  1. Fewer Sensors: Exploring possibilities with even fewer IMUs
  2. Real-time Optimization: Improving real-time processing capabilities
  3. Environmental Robustness: Enhancing performance in complex environments
  4. Application Expansion: Extending to more practical application scenarios

In-depth Evaluation

Strengths

  1. Significant Dataset Contribution: MINIONS is currently the largest-scale multimodal motion capture dataset, filling an important gap in the field
  2. Solid Theoretical Foundation: Fusion framework based on Bayesian theory has strong mathematical basis
  3. Comprehensive Experimental Design: Experiments cover diverse sensor configurations and multi-task evaluation
  4. High Practical Value: Provides viable technical pathways for consumer-grade motion capture
  5. Reasonable Technical Innovation: Dual-branch design fully leverages advantages of different modalities

Shortcomings

  1. Insufficient Computational Complexity Analysis: Lacks detailed analysis of computational overhead and real-time performance
  2. Limited Failure Case Analysis: Insufficient discussion of method performance in extreme situations
  3. Missing User Studies: Lacks evaluation of real user experience
  4. Long-term Stability: Insufficient verification of stability for prolonged usage

Impact

  1. Academic Value: Provides important data and benchmarks for multimodal motion capture research
  2. Industrial Value: Offers technical reference for consumer-grade motion capture product development
  3. Reproducibility: Clear method description facilitates reproduction and improvement by other researchers
  4. Community Contribution: Large-scale dataset will accelerate development of the field

Applicable Scenarios

  1. Personal Creation: Motion capture needs of video and other content creators
  2. Fitness Monitoring: Exercise posture analysis and correction
  3. Gaming and Entertainment: Motion-sensing games, virtual reality applications
  4. Education and Training: Action teaching, skill training
  5. Medical Rehabilitation: Movement function assessment and rehabilitation training

References

The paper cites 75 related references, primarily including:

  • Classic motion capture datasets: Human3.6M, TotalCapture, 3DPW, etc.
  • SMPL human body model related work
  • Deep learning pose estimation methods
  • IMU motion capture technology
  • Multimodal fusion methods

Overall Assessment: This is a high-quality computer vision research paper with important contributions in both dataset construction and multimodal fusion methods. The MINIONS dataset's scale and quality will have significant impact on the field, while the SparseNet framework provides an effective technical solution for consumer-grade motion capture. The paper features comprehensive experimental design, credible conclusions, and possesses high academic and practical value.