Spatio-Temporal Multi-Subgraph GCN for 3D Human Motion Prediction

Wang, Guo, Su
Human motion prediction (HMP) involves forecasting future human motion based on historical data. Graph Convolutional Networks (GCNs) have garnered widespread attention in this field for their proficiency in capturing relationships among joints in human motion. However, existing GCN-based methods tend to focus on either temporal-domain or spatial-domain features, or they combine spatio-temporal features without fully leveraging the complementarity and cross-dependency of these two features. In this paper, we propose the Spatial-Temporal Multi-Subgraph Graph Convolutional Network (STMS-GCN) to capture complex spatio-temporal dependencies in human motion. Specifically, we decouple the modeling of temporal and spatial dependencies, enabling cross-domain knowledge transfer at multiple scales through a spatio-temporal information consistency constraint mechanism. Besides, we utilize multiple subgraphs to extract richer motion information and enhance the learning associations of diverse subgraphs through a homogeneous information constraint mechanism. Extensive experiments on the standard HMP benchmarks demonstrate the superiority of our method.

Basic Information

  • Paper ID: 2501.00317
  • Title: Spatio-Temporal Multi-Subgraph GCN for 3D Human Motion Prediction
  • Authors: Jiexin Wang, Yiju Guo, Bing Su (School of Artificial Intelligence, Renmin University of China)
  • Categories: cs.CV (Computer Vision), cs.LG (Machine Learning)
  • Publication Date: December 31, 2024 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2501.00317

Abstract

Human motion prediction (HMP) involves predicting future human motion based on historical data. Graph Convolutional Networks (GCNs) have gained widespread attention in this field due to their ability to capture inter-joint relationships in human motion. However, existing GCN-based methods often focus solely on temporal or spatial features, or fail to fully exploit the complementarity and cross-dependencies between these two modalities when combining spatio-temporal features. This paper proposes Spatio-Temporal Multi-Subgraph Graph Convolutional Network (STMS-GCN) to capture complex spatio-temporal dependencies in human motion. Specifically, we decouple the modeling of temporal and spatial dependencies and achieve multi-scale cross-domain knowledge transfer through a spatio-temporal information consistency constraint mechanism. Furthermore, we leverage multiple subgraphs to extract richer motion information and enhance learning associations across different subgraphs through a homogeneity information constraint mechanism. Extensive experiments on standard HMP benchmarks demonstrate the superiority of our approach.

Research Background and Motivation

Problem Definition

3D skeleton-based human motion prediction aims to predict future motion sequences given a historical motion sequence. This research is crucial for understanding human motion behavior and has broad applications in robot collaboration, autonomous driving, action recognition, and other domains.

Limitations of Existing Methods

  1. Single-domain modeling limitations: Most GCN methods focus only on temporal or spatial feature modeling, neglecting the complementarity between spatio-temporal features
  2. Insufficient feature fusion: Some methods integrate spatio-temporal relationships through mixed convolution kernels but struggle to extract unique temporal and spatial information
  3. Underutilized cross-domain dependencies: Existing decoupled modeling methods primarily focus on complex structural design while overlooking hidden cross-dependencies in spatio-temporal relationships

Research Motivation

To address these issues, this paper proposes modeling temporal and spatial information separately through orthogonal spatio-temporal branches, fully exploiting the uniqueness of spatio-temporal information, and promoting spatio-temporal information interweaving and cross-domain knowledge transfer through consistency constraints.

Core Contributions

  1. Proposes STMS-GCN architecture: Considers the independence and complementarity of spatio-temporal information, utilizing diverse learnable subgraphs to capture richer motion patterns
  2. Cross-domain information contrastive mechanism: Enhances cross-domain information interaction at multiple scales between spatial and temporal information
  3. Homogeneity information constraint mechanism: Constrains the learned subgraphs to stay similar to one another, strengthening the learning associations among diverse subgraphs
  4. Experimental validation: Conducts extensive experiments on standard HMP benchmarks, demonstrating the effectiveness and superiority of the method in accurately predicting human motion across various scenarios

Methodology Details

Task Definition

Let $X = [X_1, \cdots, X_{T_p}] \in \mathbb{R}^{T_p \times J \times D}$ denote the given historical poses, and $Y = [X_{T_p+1}, \cdots, X_{T_p+T_f}] \in \mathbb{R}^{T_f \times J \times D}$ denote the predicted motion sequence for the next $T_f$ time steps. Each pose $X_t \in \mathbb{R}^{J \times D}$ describes the $D$-dimensional human pose with $J$ joints at time $t$.
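The shapes in this definition can be illustrated with a small sketch. The sizes below ($T_p = 10$, $T_f = 25$, $J = 22$, $D = 3$) and the helper `split_sequence` are assumptions chosen only for illustration; the summary does not state the values used in the paper.

```python
# Hypothetical sizes, chosen only for illustration (not taken from the paper).
T_P, T_F, J, D = 10, 25, 22, 3


def split_sequence(motion, t_p):
    """Split a list of frames into history X (first t_p frames) and target Y (the rest)."""
    return motion[:t_p], motion[t_p:]


# A dummy motion: (T_P + T_F) frames, each a J x D nested list of joint coordinates.
motion = [[[0.0] * D for _ in range(J)] for _ in range(T_P + T_F)]
X, Y = split_sequence(motion, T_P)

print(len(X), len(X[0]), len(X[0][0]))  # 10 22 3
print(len(Y), len(Y[0]), len(Y[0][0]))  # 25 22 3
```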

Model Architecture

Spatio-Temporal Multi-Subgraph Block (STMSB)

STMSB comprises two key modules:

  1. Spatio-temporal dual-branch: Separately models temporal and spatial domains
  2. Multi-subgraph learning: Leverages multiple subgraphs to extract richer motion information

Spatio-Temporal Dual-Branch Design

Temporal Modeling:

  • Reshape input $X$ to $X^T = \{X^{T,i}\}_{i=1}^{T_p+T_f} \in \mathbb{R}^{(T_p+T_f) \times J \cdot D}$
  • Project $X^T$ to a $C$-dimensional feature space through frame embedding: $\hat{X}^{T,i} = W_2 \cdot (\sigma(W_1 \cdot X^{T,i} + b_1)) + b_2$
  • Use GCN to capture inter-frame temporal dependencies
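The frame embedding step above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the activation $\sigma$ is assumed to be ReLU, the hidden width `HIDDEN` is a made-up value, and the weights are random placeholders rather than trained parameters.

```python
import random


def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def relu(vector):
    """Element-wise ReLU; ASSUMPTION: the summary does not say which sigma is used."""
    return [max(0.0, a) for a in vector]


def frame_embedding(frame, w1, b1, w2, b2):
    """Map one J x D frame to a C-dim vector: W2 * sigma(W1 * x + b1) + b2."""
    x = [coord for joint in frame for coord in joint]  # flatten to length J*D
    hidden = relu([a + b for a, b in zip(matvec(w1, x), b1)])
    return [a + b for a, b in zip(matvec(w2, hidden), b2)]


# Hypothetical toy sizes: 2 joints, 3 coordinates, hidden width 5, C = 4.
J, D, HIDDEN, C = 2, 3, 5, 4
rng = random.Random(0)
w1 = [[rng.uniform(-1, 1) for _ in range(J * D)] for _ in range(HIDDEN)]
b1 = [0.0] * HIDDEN
w2 = [[rng.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(C)]
b2 = [0.0] * C

frame = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
embedded = frame_embedding(frame, w1, b1, w2, b2)
print(len(embedded))  # 4
```

Applying `frame_embedding` to every frame yields a $(T_p+T_f) \times C$ matrix, which is the input a temporal graph convolution over frames would operate on.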

Spatial Modeling:

  • Reshape $X$ to spatial form $X^S = \{X^{S,n}\}_{n=1}^{J \times D} \in \mathbb{R}^{(J \times D) \times (T_p+T_f)}$
  • Apply discrete cosine transform and joint embedding to obtain joint representations
  • Use GCN to capture spatial dependencies
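The discrete cosine transform in the spatial branch can be illustrated as follows. The sketch assumes, as is common in GCN-based motion prediction, that the DCT is applied along the time axis of each joint-coordinate trajectory (one row of $X^S$) and uses the orthonormal DCT-II; the summary does not state which variant or how many coefficients the paper keeps.

```python
import math


def dct(trajectory):
    """Orthonormal DCT-II of a 1D trajectory (one joint coordinate over time)."""
    n = len(trajectory)
    coeffs = []
    for k in range(n):
        s = sum(
            trajectory[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i in range(n)
        )
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * s)
    return coeffs


def idct(coeffs):
    """Inverse of dct above (orthonormal DCT-III)."""
    n = len(coeffs)
    signal = []
    for i in range(n):
        s = coeffs[0] * math.sqrt(1.0 / n)
        s += sum(
            coeffs[k] * math.sqrt(2.0 / n)
            * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for k in range(1, n)
        )
        signal.append(s)
    return signal


# A joint coordinate that does not move has only a DC component.
static = dct([1.0, 1.0, 1.0, 1.0])
# The transform is invertible, so no information is lost.
restored = idct(dct([0.0, 1.0, 4.0, 9.0]))
```

A smooth trajectory concentrates its energy in the low-frequency coefficients, which is the usual motivation for representing joint trajectories in DCT space.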

Spatio-Temporal Information Interaction

Promote cross-domain knowledge transfer through Mean Per-Joint Position Error (MPJPE) as a constraint: $L_{ST} = \sum_{l=1}^{L} \frac{1}{(T_p + T_f) \cdot J} \sum_{t=1}^{T_p+T_f} \sum_{j=1}^{J} \|Y_{T,t,j}^l - Y_{S,t,j}^l\|_2$
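The constraint $L_{ST}$ can be written out as a short sketch. It assumes that $Y_T^l$ and $Y_S^l$ are the layer-$l$ outputs of the temporal and spatial branches, each mapped back to joint coordinates of shape $(T_p+T_f) \times J \times D$; the function name `consistency_loss` and the nested-list layout are hypothetical.

```python
import math


def consistency_loss(y_temporal, y_spatial):
    """Sum over layers of the mean per-joint L2 distance between branch outputs.

    Both arguments are lists over layers; each layer is a [frames][joints][coords]
    nested list of the same shape in both branches.
    """
    total = 0.0
    for layer_t, layer_s in zip(y_temporal, y_spatial):
        n_frames, n_joints = len(layer_t), len(layer_t[0])
        dist_sum = 0.0
        for frame_t, frame_s in zip(layer_t, layer_s):
            for joint_t, joint_s in zip(frame_t, frame_s):
                dist_sum += math.dist(joint_t, joint_s)  # Euclidean norm per joint
        total += dist_sum / (n_frames * n_joints)
    return total


# One layer, one frame, two joints: the branches disagree by 5 units on joint 0.
y_t = [[[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]]]
y_s = [[[[3.0, 4.0, 0.0], [0.0, 0.0, 0.0]]]]
print(consistency_loss(y_t, y_s))  # 2.5
```

The loss is zero only when both branches produce the same poses at every layer, which is how the constraint pushes information across the two domains.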

Multi-Subgraph Learning

Employ $K$ graph convolution kernels $\Upsilon_T^l = \{\Upsilon_{T}^{l,1}, \Upsilon_{T}^{l,2}, \cdots, \Upsilon_{T}^{l,K}\}$ for feature learning: $M_T^l = \text{Ave}(H_T^{l,1}, H_T^{l,2}, \cdots, H_T^{l,K})$
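The averaging over $K$ subgraph outputs can be sketched as below. The summary does not define $H_T^{l,k}$, so the form used here, $H^{k} = A^{k} X W$ with one learnable adjacency matrix per kernel and a shared weight matrix, is an assumption made for illustration only.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def multi_subgraph_layer(x, adjacencies, weight):
    """Average the outputs of K graph convolutions.

    x: N x C node features; adjacencies: K matrices of shape N x N;
    weight: C x C_out. ASSUMPTION: each kernel computes H_k = A_k @ x @ weight.
    """
    xw = matmul(x, weight)
    outputs = [matmul(adj, xw) for adj in adjacencies]
    k = len(outputs)
    return [
        [sum(h[i][j] for h in outputs) / k for j in range(len(xw[0]))]
        for i in range(len(xw))
    ]


# Two nodes, one feature channel, two subgraphs (identity and swap).
x = [[1.0], [3.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]
merged = multi_subgraph_layer(x, [identity, swap], [[1.0]])
print(merged)  # [[2.0], [2.0]]
```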

To prevent excessive differentiation between kernels, propose homogeneity information learning enhancement strategy: $L_{con}^T = \sum_{l=1}^{L} \sum_{k=1}^{K} \sum_{u=k+1}^{K} \|A_T^{l,k} - A_T^{l,u}\|_2^2$
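The homogeneity term sums a squared distance over every unordered pair of adjacency matrices within a layer. A minimal sketch, assuming the matrix norm is the element-wise (Frobenius) norm and that each $A_T^{l,k}$ is available as a plain nested list; `homogeneity_loss` is a hypothetical name:

```python
from itertools import combinations


def homogeneity_loss(adjacencies_per_layer):
    """Sum of squared differences over all pairs (k < u) of adjacency matrices.

    adjacencies_per_layer: list over layers, each a list of K square matrices.
    ASSUMPTION: the squared norm is taken element-wise (Frobenius).
    """
    total = 0.0
    for adjacencies in adjacencies_per_layer:
        for a, b in combinations(adjacencies, 2):  # pairs with k < u
            total += sum(
                (x - y) ** 2
                for row_a, row_b in zip(a, b)
                for x, y in zip(row_a, row_b)
            )
    return total


eye = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
# One layer with K = 3 kernels: pairs (eye, zeros), (eye, eye), (zeros, eye).
print(homogeneity_loss([[eye, zeros, eye]]))  # 4.0
```

Minimizing this term pulls the subgraphs toward one another, so it acts against the kernels drifting apart rather than encouraging diversity.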

Technical Innovations

  1. Decoupled modeling: Separately model spatio-temporal dependencies through orthogonal branches, avoiding feature confusion
  2. Cross-domain constraints: Multi-scale consistency constraints enable effective cross-domain knowledge transfer
  3. Multi-subgraph mechanism: Inspired by mixture-of-experts models, use multiple trainable subgraphs to capture different motion patterns
  4. Homogeneity constraints: Ensure consistent information propagation across subgraphs through adjacency matrix similarity constraints

Experimental Setup

Datasets

  • Human3.6M (H3.6M): Large-scale indoor motion capture dataset of everyday activities, a standard HMP benchmark
  • CMU Motion Capture (CMU Mocap): Motion capture dataset released by Carnegie Mellon University

Evaluation Metrics

Use Mean Per-Joint Position Error (MPJPE) to evaluate performance, with lower values indicating better prediction performance.
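For reference, MPJPE at a given prediction horizon can be computed as in the sketch below. It assumes 40 ms between frames (25 fps) and that the error is reported for the single frame at each horizon; both are common conventions but are not stated in this summary, and `mpjpe_at` is a hypothetical helper.

```python
import math

FRAME_MS = 40  # ASSUMPTION: 25 fps sequences, i.e. 40 ms per predicted frame


def mpjpe_at(pred, target, horizon_ms, frame_ms=FRAME_MS):
    """Mean per-joint Euclidean error of the frame at the given horizon.

    pred, target: [future frames][joints][coords] nested lists.
    """
    idx = horizon_ms // frame_ms - 1  # 80 ms is the 2nd future frame, index 1
    frame_p, frame_t = pred[idx], target[idx]
    return sum(math.dist(p, t) for p, t in zip(frame_p, frame_t)) / len(frame_p)


# Toy data: 4 future frames, 2 joints; the prediction drifts by (i + 1) units at frame i.
target = [[[0.0, 0.0, 0.0] for _ in range(2)] for _ in range(4)]
pred = [[[0.0, 0.0, float(i + 1)] for _ in range(2)] for i in range(4)]

print(mpjpe_at(pred, target, 80))   # 2.0
print(mpjpe_at(pred, target, 160))  # 4.0
```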

Comparison Methods

Include mainstream GCN methods such as Traj-GCN, DMGNN, STS-GCN, MSR-GCN, SPGSN, PGBIG, STBMP, and others.

Implementation Details

  • Number of network layers: $L = 4$
  • Number of graph convolution kernels: $K = 4$
  • Hyperparameter: $\lambda = 0.1$
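These settings could be collected in a small configuration object. The summary does not say how $\lambda$ enters the objective, so the combination below, a prediction loss plus $\lambda$ times the sum of the two constraint terms, is purely an assumption for illustration; `STMSConfig` and `total_loss` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class STMSConfig:
    num_layers: int = 4    # L, number of network layers
    num_kernels: int = 4   # K, number of graph convolution kernels
    lam: float = 0.1       # lambda, weight of the constraint terms


def total_loss(pred_loss, l_st, l_con, cfg):
    """ASSUMPTION: lambda scales both constraint terms equally.

    The actual weighting used in the paper may differ.
    """
    return pred_loss + cfg.lam * (l_st + l_con)


cfg = STMSConfig()
print(cfg.num_layers, cfg.num_kernels, cfg.lam)  # 4 4 0.1
```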

Experimental Results

Main Results

H3.6M Dataset Results:

  • At 80ms prediction, MPJPE is 9.61, a 3.71% improvement over the best baseline (STBMP's 9.98)
  • At 160ms prediction, MPJPE is 21.63, a 3.13% improvement over the best baseline
  • Achieves best performance across multiple time steps
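The 80ms improvement figure follows from the relative reduction in error, (baseline - ours) / baseline. A quick check using the two values quoted above:

```python
def relative_improvement(baseline, ours):
    """Relative error reduction in percent (lower MPJPE is better)."""
    return (baseline - ours) / baseline * 100.0


improvement_80ms = relative_improvement(9.98, 9.61)
print(round(improvement_80ms, 2))  # 3.71
```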

CMU Mocap Dataset Results:

  • Average MPJPE of 32.43, significantly outperforming all comparison methods
  • Achieves best performance across all prediction time steps

Ablation Studies

  1. Module Contribution Analysis:
    • Spatio-temporal dual-branch: Both branches contribute to performance
    • Constraint mechanisms: Both $L_{con}$ and $L_{ST}$ improve performance
    • Complete model achieves best performance (33.80)
  2. Hyperparameter Impact:
    • Performance is optimal at $\lambda = 0.1$
    • Larger $\lambda$ values (1.0) limit branch information uniqueness
  3. Network Structure Impact:
    • Increasing layer count $L$ and kernel count $K$ generally improves performance
    • $L=4, K=4$ is the optimal configuration

Experimental Findings

  1. Constraint mechanism effectiveness: Adjacency matrix constraints are more effective than weight parameter constraints
  2. Consistency vs. diversity: Enforcing graph construction similarity outperforms diversity constraints
  3. Branch selection: Using spatial branch output as final prediction yields best results

Related Work

Main Research Directions

  1. CNN/RNN methods: Early approaches based on convolutional and recurrent networks, which suffer from filter dependency and error accumulation
  2. GCN methods: The current mainstream approach, which excels at modeling kinematic dependencies between joints
  3. Transformer methods: A recently emerging line of work that demonstrates strong performance in sequence modeling

Advantages of This Work

Compared to existing GCN methods, this paper better exploits the complementarity and cross-dependencies of spatio-temporal features through decoupled spatio-temporal modeling, cross-domain constraints, and multi-subgraph learning.

Conclusions and Discussion

Main Conclusions

  1. Decoupled spatio-temporal modeling better captures unique information in each domain
  2. Cross-domain consistency constraints effectively promote knowledge transfer
  3. Multi-subgraph learning enhances motion pattern capture capability
  4. Achieves state-of-the-art performance on standard benchmarks

Limitations

  1. Relatively high model complexity, requiring balance between performance and computational efficiency
  2. Hyperparameter $\lambda$ requires tuning for different datasets
  3. Performance on extremely long-term prediction requires further verification

Future Directions

  1. Explore more efficient spatio-temporal feature fusion mechanisms
  2. Investigate adaptive subgraph quantity selection strategies
  3. Extend to more diverse human motion scenarios

In-Depth Evaluation

Strengths

  1. Strong novelty: The decoupled spatio-temporal modeling approach is novel, with ingeniously designed cross-domain constraint mechanisms
  2. Solid theoretical foundation: GCN-based spatial and temporal modeling has sufficient theoretical support
  3. Comprehensive experiments: Includes detailed ablation studies and parameter analysis
  4. Excellent performance: Achieves state-of-the-art results on multiple benchmark datasets
  5. Clear presentation: Well-structured paper with accurate technical descriptions

Weaknesses

  1. Computational complexity: Multi-branch and multi-subgraph designs increase model complexity
  2. Parameter sensitivity: Hyperparameter $\lambda$ significantly affects performance and requires careful tuning
  3. Generalization analysis: Lacks analysis of generalization capability to different motion types (e.g., dance, gymnastics)
  4. Real-time considerations: Does not discuss model inference speed and potential for real-time applications

Impact

  1. Academic contribution: Provides new decoupled perspective for spatio-temporal feature modeling
  2. Practical value: Has application prospects in robotics, gaming, and gesture interaction
  3. Reproducibility: Provides detailed implementation details and parameter settings

Applicable Scenarios

  1. High-precision requirements: Suitable for applications demanding high prediction accuracy
  2. Standard motion prediction: Performs well in predicting standardized motions such as daily activities and sports
  3. Short to medium-term prediction: Demonstrates excellent performance in prediction tasks within 1000ms

References

The paper cites over 60 relevant references, covering major methods in human motion prediction, including CNN, RNN, LSTM, Transformer, and GCN approaches, providing readers with comprehensive background knowledge.


Overall Evaluation: This is a high-quality computer vision paper that proposes an innovative solution for the important task of human motion prediction. The core idea of decoupled spatio-temporal modeling is reasonably general, and the experimental results are convincing. Although model complexity and hyperparameter tuning remain challenges, the overall contribution is significant and merits attention and further research.