Spatio-Temporal Multi-Subgraph GCN for 3D Human Motion Prediction

Wang, Guo, Su

Human motion prediction (HMP) involves forecasting future human motion based on historical data. Graph Convolutional Networks (GCNs) have garnered widespread attention in this field for their proficiency in capturing relationships among joints in human motion. However, existing GCN-based methods tend to focus on either temporal-domain or spatial-domain features, or they combine spatio-temporal features without fully leveraging the complementarity and cross-dependency of these two features. In this paper, we propose the Spatial-Temporal Multi-Subgraph Graph Convolutional Network (STMS-GCN) to capture complex spatio-temporal dependencies in human motion. Specifically, we decouple the modeling of temporal and spatial dependencies, enabling cross-domain knowledge transfer at multiple scales through a spatio-temporal information consistency constraint mechanism. Besides, we utilize multiple subgraphs to extract richer motion information and enhance the learning associations of diverse subgraphs through a homogeneous information constraint mechanism. Extensive experiments on the standard HMP benchmarks demonstrate the superiority of our method.

Basic Information

Paper ID: 2501.00317
Title: Spatio-Temporal Multi-Subgraph GCN for 3D Human Motion Prediction
Authors: Jiexin Wang, Yiju Guo, Bing Su (School of Artificial Intelligence, Renmin University of China)
Categories: cs.CV (Computer Vision), cs.LG (Machine Learning)
Publication Date: December 31, 2024 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2501.00317

Abstract

Human motion prediction (HMP) involves predicting future human motion based on historical data. Graph Convolutional Networks (GCNs) have gained widespread attention in this field due to their ability to capture inter-joint relationships in human motion. However, existing GCN-based methods often focus solely on temporal or spatial features, or fail to fully exploit the complementarity and cross-dependencies between the two domains when combining spatio-temporal features. This paper proposes the Spatio-Temporal Multi-Subgraph Graph Convolutional Network (STMS-GCN) to capture complex spatio-temporal dependencies in human motion. Specifically, the model decouples the modeling of temporal and spatial dependencies and achieves multi-scale cross-domain knowledge transfer through a spatio-temporal information consistency constraint mechanism. Furthermore, it leverages multiple subgraphs to extract richer motion information and enhances learning associations across different subgraphs through a homogeneity information constraint mechanism. Extensive experiments on standard HMP benchmarks demonstrate the superiority of the approach.

Research Background and Motivation

Problem Definition

3D skeleton-based human motion prediction aims to forecast a future motion sequence from an observed historical motion sequence. The task is central to understanding human motion behavior and has broad applications in human-robot collaboration, autonomous driving, action recognition, and other domains.

Limitations of Existing Methods

Single-domain modeling limitations: Most GCN methods focus only on temporal or spatial feature modeling, neglecting the complementarity between spatio-temporal features
Insufficient feature fusion: Some methods integrate spatio-temporal relationships through mixed convolution kernels but struggle to extract unique temporal and spatial information
Underutilized cross-domain dependencies: Existing decoupled modeling methods primarily focus on complex structural design while overlooking hidden cross-dependencies in spatio-temporal relationships

Research Motivation

To address these issues, the paper models temporal and spatial information separately in two orthogonal branches, so that each branch can exploit the information unique to its domain. Consistency constraints then tie the branches together, promoting the interweaving of spatio-temporal information and cross-domain knowledge transfer.

Core Contributions

Proposes STMS-GCN architecture: Considers the independence and complementarity of spatio-temporal information, utilizing diverse learnable subgraphs to capture richer motion patterns
Cross-domain information constraint mechanism: Applies a spatio-temporal consistency constraint at multiple scales to strengthen information exchange between the spatial and temporal branches
Homogeneity information constraint mechanism: Constrains the learnable subgraphs toward shared (homogeneous) information, strengthening the learning associations among them
Experimental validation: Conducts extensive experiments on standard HMP benchmarks, demonstrating the effectiveness and superiority of the method in accurately predicting human motion across various scenarios

Methodology Details

Task Definition

Let $X = [X_1, \cdots, X_{T_p}] \in \mathbb{R}^{T_p \times J \times D}$ denote the given historical poses, and $Y = [X_{T_p+1}, \cdots, X_{T_p+T_f}] \in \mathbb{R}^{T_f \times J \times D}$ denote the predicted motion sequence for the next $T_f$ time steps. Each pose $X_t \in \mathbb{R}^{J \times D}$ describes the $D$-dimensional human pose with $J$ joints at time $t$.

Model Architecture

Spatio-Temporal Multi-Subgraph Block (STMSB)

STMSB comprises two key modules:

Spatio-temporal dual-branch: Separately models temporal and spatial domains
Multi-subgraph learning: Leverages multiple subgraphs to extract richer motion information

Spatio-Temporal Dual-Branch Design

Temporal Modeling:

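Reshape input $X$ to $X^T = \{X^{T,i}\}_{i=1}^{T_p+T_f} \in \mathbb{R}^{(T_p+T_f) \times (J \cdot D)}$, so that each frame becomes one vector of length $J \cdot D$
Project $X^T$ to a $C$-dimensional feature space through frame embedding: $\hat{X}^{T,i} = W_2 \cdot (\sigma(W_1 \cdot X^{T,i} + b_1)) + b_2$, where $W_1, W_2, b_1, b_2$ are learnable parameters and $\sigma$ is an activation function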
Use GCN to capture inter-frame temporal dependencies
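
The reshape and frame-embedding steps above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the toy sizes, the random initialization, the hidden width (set equal to $C$), and the choice of ReLU for $\sigma$ are all assumptions, since the summary does not specify them.

```python
import random

def linear(W, x, b):
    """Return W @ x + b for a weight matrix W (list of rows) and a vector x."""
    return [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]

def relu(x):
    # Assumption: sigma is ReLU; the summary does not name the activation.
    return [max(0.0, v) for v in x]

def frame_embedding(frame, W1, b1, W2, b2):
    """Two-layer embedding W2 . sigma(W1 . x + b1) + b2 for one flattened frame."""
    return linear(W2, relu(linear(W1, frame, b1)), b2)

# Assumed toy sizes: T frames, J joints, D coordinates, C embedding channels.
T, J, D, C = 4, 3, 3, 8
rng = random.Random(0)
poses = [[[rng.uniform(-1, 1) for _ in range(D)] for _ in range(J)]
         for _ in range(T)]

# Reshape: each frame becomes one vector of length J * D.
frames = [[coord for joint in pose for coord in joint] for pose in poses]

def rand_matrix(rows, cols):
    return [[rng.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

W1, b1 = rand_matrix(C, J * D), [0.0] * C
W2, b2 = rand_matrix(C, C), [0.0] * C
embedded = [frame_embedding(f, W1, b1, W2, b2) for f in frames]
```

After this step every frame is a node with a $C$-dimensional feature vector, which is the input a temporal graph convolution would operate on.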

Spatial Modeling:

Reshape $X$ to spatial form $X^S = \{X^{S,n}\}_{n=1}^{J \cdot D} \in \mathbb{R}^{(J \cdot D) \times (T_p+T_f)}$, so that each joint coordinate becomes one trajectory over time
Apply discrete cosine transform and joint embedding to obtain joint representations
Use GCN to capture spatial dependencies
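
To make the discrete cosine transform step concrete, the sketch below applies a DCT along the time axis of a single joint-coordinate trajectory. The orthonormal DCT-II variant is an assumption (the summary does not say which variant or how many coefficients are kept), and the helper names are hypothetical.

```python
import math

def _scale(k, n):
    """Orthonormal scaling factor for DCT-II coefficient k of an n-point signal."""
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

def dct_ii(signal):
    """Orthonormal DCT-II of a 1-D sequence (one coordinate trajectory over time)."""
    n = len(signal)
    return [
        _scale(k, n) * sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                           for i, x in enumerate(signal))
        for k in range(n)
    ]

def idct_ii(coeffs):
    """Inverse of dct_ii (the transpose of the orthonormal transform)."""
    n = len(coeffs)
    return [
        sum(_scale(k, n) * c * math.cos(math.pi * (i + 0.5) * k / n)
            for k, c in enumerate(coeffs))
        for i in range(n)
    ]

# A smooth trajectory concentrates its energy in the low-frequency coefficients.
trajectory = [0.0, 1.0, 4.0, 9.0, 16.0]
coefficients = dct_ii(trajectory)
reconstructed = idct_ii(coefficients)
```

Working in the frequency domain gives each joint coordinate a compact description of its whole trajectory, which is then embedded and passed to the spatial graph convolution.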

Spatio-Temporal Information Interaction

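A consistency loss, written in the same form as the Mean Per-Joint Position Error (MPJPE), aligns the outputs of the two branches at every layer and thereby promotes cross-domain knowledge transfer: $L_{ST} = \sum_{l=1}^L \frac{1}{(T_p + T_f) \cdot J} \sum_{t=1}^{T_p+T_f} \sum_{j=1}^J \|Y_{T,t,j}^l - Y_{S,t,j}^l\|_2$, where $Y_T^l$ and $Y_S^l$ denote the layer-$l$ outputs of the temporal and spatial branches, and $t$ and $j$ index frames and joints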
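
The constraint $L_{ST}$ can be computed directly from its definition. The sketch below is a dependency-free illustration under the assumption that each branch exposes, for every layer, an output shaped as frames by joints by coordinates; the function name and the nested-list layout are hypothetical.

```python
import math

def consistency_loss(temporal_outputs, spatial_outputs):
    """Sum over layers of the mean per-joint L2 distance between branch outputs.

    Each argument is a list over layers; each layer output is a nested list
    indexed as [frame][joint][coordinate].
    """
    total = 0.0
    for y_t, y_s in zip(temporal_outputs, spatial_outputs):
        n_frames, n_joints = len(y_t), len(y_t[0])
        distance_sum = sum(
            math.dist(y_t[t][j], y_s[t][j])
            for t in range(n_frames)
            for j in range(n_joints)
        )
        total += distance_sum / (n_frames * n_joints)
    return total

# Toy example: two layers, one frame, one joint.
layer_t = [[[0.0, 0.0, 0.0]]]
layer_s = [[[3.0, 4.0, 0.0]]]
loss = consistency_loss([layer_t, layer_t], [layer_s, layer_s])
```

The loss is zero only when the two branches agree at every layer, which is what pulls their intermediate predictions toward each other.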

Multi-Subgraph Learning

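Each layer $l$ of the temporal branch employs $K$ graph convolution kernels $\Upsilon_T^l = \{\Upsilon_{T}^{l,1}, \Upsilon_{T}^{l,2}, \cdots, \Upsilon_{T}^{l,K}\}$ for feature learning. Kernel $k$ produces a feature map $H_T^{l,k}$, and the $K$ feature maps are averaged: $M_T^l = \text{Ave}(H_T^{l,1}, H_T^{l,2}, \cdots, H_T^{l,K})$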
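
A minimal sketch of the averaging step follows. It assumes that each kernel consists of a learnable adjacency matrix $A^k$ and a weight matrix $W^k$ and computes $H^k = A^k X W^k$ with no nonlinearity; this is a common graph convolution form, not one confirmed by the summary, and the function names are hypothetical.

```python
def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def multi_subgraph_layer(features, adjacencies, weights):
    """Average the outputs of K graph convolutions A_k @ features @ W_k."""
    outputs = [matmul(matmul(a, features), w)
               for a, w in zip(adjacencies, weights)]
    k = len(outputs)
    n_rows, n_cols = len(outputs[0]), len(outputs[0][0])
    return [[sum(o[i][j] for o in outputs) / k for j in range(n_cols)]
            for i in range(n_rows)]

# Toy example: 2 nodes, 2 channels, K = 2 subgraphs.
features = [[1.0, 2.0], [3.0, 4.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]  # second subgraph connects each node to the other
merged = multi_subgraph_layer(features, [identity, swap], [identity, identity])
```

Here the first subgraph keeps each node's own features and the second exchanges them, so the average mixes both views of the graph.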

To prevent the kernels from diverging excessively, the paper proposes a homogeneity information learning enhancement strategy that penalizes pairwise differences between the subgraph adjacency matrices $A_T^{l,k}$: $L_{con}^T = \sum_{l=1}^L \sum_{k=1}^K \sum_{u=k+1}^K \|A_T^{l,k} - A_T^{l,u}\|_2^2$
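
The pairwise penalty can be sketched as below. The matrix norm is interpreted here as the squared Frobenius norm (sum of squared entries), which is an assumption about the notation; the function name is hypothetical.

```python
import itertools

def homogeneity_loss(adjacencies_per_layer):
    """Sum of squared differences over all unordered pairs of subgraph
    adjacency matrices, accumulated over layers."""
    total = 0.0
    for adjacencies in adjacencies_per_layer:
        # combinations(..., 2) enumerates each pair (k, u) with k < u exactly once.
        for a_k, a_u in itertools.combinations(adjacencies, 2):
            total += sum(
                (x - y) ** 2
                for row_k, row_u in zip(a_k, a_u)
                for x, y in zip(row_k, row_u)
            )
    return total

identity = [[1.0, 0.0], [0.0, 1.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]
two_kernels = homogeneity_loss([[identity, swap]])
three_kernels = homogeneity_loss([[identity, swap, identity]])
```

Minimizing this term pulls the $K$ adjacency matrices of a layer toward each other, so the subgraphs stay related rather than drifting apart.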

Technical Innovations

Decoupled modeling: Separately model spatio-temporal dependencies through orthogonal branches, avoiding feature confusion
Cross-domain constraints: Multi-scale consistency constraints enable effective cross-domain knowledge transfer
Multi-subgraph mechanism: Inspired by mixture-of-experts models, use multiple trainable subgraphs to capture different motion patterns
Homogeneity constraints: Ensure consistent information propagation across subgraphs through adjacency matrix similarity constraints

Experimental Setup

Datasets

Human3.6M (H3.6M): Large-scale motion capture benchmark containing about 3.6 million 3D human poses
CMU Motion Capture (CMU Mocap): Motion capture dataset released by Carnegie Mellon University

Evaluation Metrics

Use Mean Per-Joint Position Error (MPJPE) to evaluate performance, with lower values indicating better prediction performance.
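
For reference, MPJPE averages the Euclidean distance between predicted and ground-truth joint positions over all frames and joints. The sketch below is a plain-Python illustration with a hypothetical function name and toy values; it is not taken from the paper's evaluation code.

```python
import math

def mpjpe(predicted, target):
    """Mean Euclidean distance between matching joints.

    Both arguments are nested lists indexed as [frame][joint][coordinate].
    """
    distances = [
        math.dist(p_joint, t_joint)
        for p_frame, t_frame in zip(predicted, target)
        for p_joint, t_joint in zip(p_frame, t_frame)
    ]
    return sum(distances) / len(distances)

# One frame with two joints: the per-joint errors are 1 and 3.
predicted = [[[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]]
target = [[[0.0, 0.0, 1.0], [1.0, 0.0, 3.0]]]
error = mpjpe(predicted, target)
```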

Comparison Methods

Baselines include mainstream GCN methods such as Traj-GCN, DMGNN, STS-GCN, MSR-GCN, SPGSN, PGBIG, STBMP, and others.

Implementation Details

Number of network layers: $L = 4$
Number of graph convolution kernels: $K = 4$
Hyperparameter (constraint weight): $\lambda = 0.1$

Experimental Results

Main Results

H3.6M Dataset Results:

At 80 ms prediction, MPJPE is 9.61, achieving 3.71% improvement over the best baseline (STBMP's 9.98)
At 160 ms prediction, MPJPE is 21.63, achieving 3.13% improvement over the best baseline
Achieves best performance across multiple time steps
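
The 80 ms improvement figure can be checked with a one-line calculation. The helper below is illustrative, and it assumes the percentage is the reduction in MPJPE relative to the baseline value.

```python
def relative_improvement(baseline, ours):
    """Percentage reduction in error relative to the baseline (lower is better)."""
    return (baseline - ours) / baseline * 100.0

# 80 ms on H3.6M: STBMP reports 9.98, STMS-GCN reports 9.61.
improvement_80ms = round(relative_improvement(9.98, 9.61), 2)
print(improvement_80ms)  # 3.71
```

The result matches the 3.71% quoted above.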

CMU Mocap Dataset Results:

Average MPJPE of 32.43, significantly outperforming all comparison methods
Achieves best performance across all prediction time steps

Ablation Studies

Module Contribution Analysis:

Spatio-temporal dual-branch: Both branches contribute to performance
Constraint mechanisms: Both $L_{con}$ and $L_{ST}$ improve performance
Complete model achieves best performance (33.80)

Hyperparameter Impact:

Performance is optimal at $\lambda = 0.1$
Larger $\lambda$ values (1.0) limit branch information uniqueness

Network Structure Impact:

Increasing layer count $L$ and kernel count $K$ generally improves performance
$L=4, K=4$ is the optimal configuration

Experimental Findings

Constraint mechanism effectiveness: Adjacency matrix constraints are more effective than weight parameter constraints
Consistency vs. diversity: Enforcing graph construction similarity outperforms diversity constraints
Branch selection: Using spatial branch output as final prediction yields best results

Main Research Directions

CNN/RNN methods: Early approaches based on convolutional and recurrent networks; they suffer from filter dependency and error accumulation
GCN methods: The current mainstream approach; they excel at modeling kinematic dependencies between joints
Transformer methods: A more recent line of work that demonstrates strong performance in sequence modeling

Advantages of This Work

Compared to existing GCN methods, this paper better exploits the complementarity and cross-dependencies of spatio-temporal features through decoupled spatio-temporal modeling, cross-domain constraints, and multi-subgraph learning.

Conclusions and Discussion

Main Conclusions

Decoupled spatio-temporal modeling better captures unique information in each domain
Cross-domain consistency constraints effectively promote knowledge transfer
Multi-subgraph learning enhances motion pattern capture capability
Achieves state-of-the-art performance on standard benchmarks

Limitations

Relatively high model complexity, requiring balance between performance and computational efficiency
Hyperparameter $\lambda$ requires tuning for different datasets
Performance on extremely long-term prediction requires further verification

Future Directions

Explore more efficient spatio-temporal feature fusion mechanisms
Investigate adaptive subgraph quantity selection strategies
Extend to more diverse human motion scenarios

In-Depth Evaluation

Strengths

Strong novelty: The decoupled spatio-temporal modeling approach is novel, with ingeniously designed cross-domain constraint mechanisms
Solid theoretical foundation: GCN-based spatial and temporal modeling has sufficient theoretical support
Comprehensive experiments: Includes detailed ablation studies and parameter analysis
Excellent performance: Achieves state-of-the-art results on multiple benchmark datasets
Clear presentation: Well-structured paper with accurate technical descriptions

Weaknesses

Computational complexity: Multi-branch and multi-subgraph designs increase model complexity
Parameter sensitivity: Hyperparameter $\lambda$ significantly affects performance and requires careful tuning
Generalization analysis: Lacks analysis of generalization capability to different motion types (e.g., dance, gymnastics)
Real-time considerations: Does not discuss model inference speed and potential for real-time applications

Impact

Academic contribution: Provides new decoupled perspective for spatio-temporal feature modeling
Practical value: Has application prospects in robotics, gaming, and gesture interaction
Reproducibility: Provides detailed implementation details and parameter settings

Applicable Scenarios

High-precision requirements: Suitable for applications demanding high prediction accuracy
Standard motion prediction: Performs well in predicting standardized motions such as daily activities and sports
Short to medium-term prediction: Demonstrates excellent performance in prediction tasks within 1000 ms

References

The paper cites over 60 relevant references, covering major methods in human motion prediction, including CNN, RNN, LSTM, Transformer, and GCN approaches, providing readers with comprehensive background knowledge.

Overall Evaluation: This is a high-quality computer vision paper that proposes an innovative solution to the important task of human motion prediction. The core idea of decoupled spatio-temporal modeling may generalize to related tasks, and the experimental results are convincing. Although model complexity and hyperparameter tuning remain challenges, the overall contribution is significant and merits attention and further research.