Human motion prediction (HMP) involves forecasting future human motion based on historical data. Graph Convolutional Networks (GCNs) have garnered widespread attention in this field for their proficiency in capturing relationships among joints in human motion. However, existing GCN-based methods tend to focus on either temporal-domain or spatial-domain features, or they combine spatio-temporal features without fully leveraging the complementarity and cross-dependency of these two features. In this paper, we propose the Spatial-Temporal Multi-Subgraph Graph Convolutional Network (STMS-GCN) to capture complex spatio-temporal dependencies in human motion. Specifically, we decouple the modeling of temporal and spatial dependencies, enabling cross-domain knowledge transfer at multiple scales through a spatio-temporal information consistency constraint mechanism. Besides, we utilize multiple subgraphs to extract richer motion information and enhance the learning associations of diverse subgraphs through a homogeneous information constraint mechanism. Extensive experiments on the standard HMP benchmarks demonstrate the superiority of our method.
- Paper ID: 2501.00317
- Title: Spatio-Temporal Multi-Subgraph GCN for 3D Human Motion Prediction
- Authors: Jiexin Wang, Yiju Guo, Bing Su (School of Artificial Intelligence, Renmin University of China)
- Categories: cs.CV (Computer Vision), cs.LG (Machine Learning)
- Publication Date: December 31, 2024 (arXiv preprint)
- Paper Link: https://arxiv.org/abs/2501.00317
3D skeleton-based human motion prediction aims to predict future motion sequences given a historical motion sequence. This research is crucial for understanding human motion behavior and has broad applications in robot collaboration, autonomous driving, action recognition, and other domains.
- Single-domain modeling limitations: Most GCN methods focus only on temporal or spatial feature modeling, neglecting the complementarity between spatio-temporal features
- Insufficient feature fusion: Some methods integrate spatio-temporal relationships through mixed convolution kernels but struggle to extract unique temporal and spatial information
- Underutilized cross-domain dependencies: Existing decoupled modeling methods primarily focus on complex structural design while overlooking hidden cross-dependencies in spatio-temporal relationships
To address these issues, this paper proposes modeling temporal and spatial information separately through orthogonal spatio-temporal branches, fully exploiting the uniqueness of spatio-temporal information, and promoting spatio-temporal information interweaving and cross-domain knowledge transfer through consistency constraints.
- Proposes STMS-GCN architecture: Considers the independence and complementarity of spatio-temporal information, utilizing diverse learnable subgraphs to capture richer motion patterns
- Cross-domain information consistency constraint mechanism: Enables cross-domain knowledge transfer between the spatial and temporal branches at multiple scales
- Homogeneity information constraint mechanism: Strengthens the learned associations among the diverse subgraphs by constraining them toward homogeneous information
- Experimental validation: Conducts extensive experiments on standard HMP benchmarks, demonstrating the effectiveness and superiority of the method in accurately predicting human motion across various scenarios
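Let $X = [X_1, \cdots, X_{T_p}] \in \mathbb{R}^{T_p \times J \times D}$ denote the given historical poses, and $Y = [X_{T_p+1}, \cdots, X_{T_p+T_f}] \in \mathbb{R}^{T_f \times J \times D}$ denote the predicted motion sequence for the next $T_f$ time steps. Each pose $X_t \in \mathbb{R}^{J \times D}$ describes the $D$-dimensional human pose with $J$ joints at time $t$.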
STMSB comprises two key modules:
- Spatio-temporal dual-branch: Separately models temporal and spatial domains
- Multi-subgraph learning: Leverages multiple subgraphs to extract richer motion information
Temporal Modeling:
- Reshape input $X$ to $X_T = \{X_{T,i}\}_{i=1}^{T_p+T_f} \in \mathbb{R}^{(T_p+T_f) \times J \cdot D}$
- Project $X_T$ to a $C$-dimensional feature space through frame embedding:
$$\hat{X}_{T,i} = W_2 \cdot \sigma(W_1 \cdot X_{T,i} + b_1) + b_2$$
- Use GCN to capture inter-frame temporal dependencies
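The temporal branch can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the sizes, the choice of tanh for σ, the padding of the input to T_p + T_f frames by repeating the last observed pose, and the single-layer GCN form `tanh(A H W)` are all assumptions made for the example. The code uses the row-vector convention, so `x @ w1` plays the role of W1·X.

```python
import numpy as np

def frame_embedding(x_t, w1, b1, w2, b2):
    """Apply W2 * sigma(W1 * x + b1) + b2 to every frame (row) of x_t."""
    hidden = np.tanh(x_t @ w1 + b1)  # sigma is assumed to be tanh here
    return hidden @ w2 + b2

def temporal_gcn_layer(h, adj, weight):
    """Graph convolution whose nodes are frames: adj mixes rows (frames)."""
    return np.tanh(adj @ h @ weight)

rng = np.random.default_rng(0)
T_p, T_f, J, D, C = 10, 25, 22, 3, 16  # illustrative sizes, not from the paper
T = T_p + T_f

x = rng.standard_normal((T_p, J, D))  # observed poses
# Assumption: extend to T_p + T_f frames by repeating the last observed pose.
x_padded = np.concatenate([x, np.repeat(x[-1:], T_f, axis=0)], axis=0)
x_t = x_padded.reshape(T, J * D)  # one row per frame

w1, b1 = 0.1 * rng.standard_normal((J * D, C)), np.zeros(C)
w2, b2 = 0.1 * rng.standard_normal((C, C)), np.zeros(C)
h = frame_embedding(x_t, w1, b1, w2, b2)  # (T, C)

adj_t = 0.1 * rng.standard_normal((T, T))  # stands in for a learnable frame adjacency
w_gcn = 0.1 * rng.standard_normal((C, C))
out = temporal_gcn_layer(h, adj_t, w_gcn)  # (T, C)
print(x_t.shape, h.shape, out.shape)
```

With these sizes the three printed shapes are `(35, 66)`, `(35, 16)` and `(35, 16)`: the graph has one node per frame, so the adjacency matrix is T × T.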
Spatial Modeling:
- Reshape $X$ to spatial form $X_S = \{X_{S,n}\}_{n=1}^{J \times D} \in \mathbb{R}^{(J \times D) \times (T_p+T_f)}$
- Apply discrete cosine transform and joint embedding to obtain joint representations
- Use GCN to capture spatial dependencies
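The spatial branch can be sketched the same way. The example below is a hedged illustration: it assumes an orthonormal DCT-II applied along the time axis of each joint-coordinate trajectory, a plain linear joint embedding, and a single GCN layer of the form `tanh(A H W)`. The helper names and sizes are hypothetical, and the summary does not say which DCT variant or how many coefficients the paper keeps.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis; row k holds the k-th cosine basis function."""
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    basis = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    basis[0, :] = np.sqrt(1.0 / n)  # the constant (k = 0) row has its own scale
    return basis

def spatial_gcn_layer(h, adj, weight):
    """Graph convolution whose nodes are joint coordinates."""
    return np.tanh(adj @ h @ weight)

rng = np.random.default_rng(0)
T, J, D, C = 35, 22, 3, 16  # illustrative sizes, not from the paper

x_padded = rng.standard_normal((T, J, D))  # already padded to T_p + T_f frames
x_s = x_padded.transpose(1, 2, 0).reshape(J * D, T)  # one row per joint coordinate

dct = dct_matrix(T)
coeffs = x_s @ dct.T  # DCT of every trajectory, shape (J*D, T)

w_embed = 0.1 * rng.standard_normal((T, C))  # assumed linear joint embedding
h = coeffs @ w_embed  # (J*D, C)

adj_s = 0.1 * rng.standard_normal((J * D, J * D))  # stands in for a learnable adjacency
w_gcn = 0.1 * rng.standard_normal((C, C))
out = spatial_gcn_layer(h, adj_s, w_gcn)  # (J*D, C)

recovered = coeffs @ dct  # inverse DCT, since the basis is orthonormal
print(out.shape, np.allclose(recovered, x_s))
```

Because the basis is orthonormal, multiplying the coefficients by `dct` inverts the transform exactly, which is how a prediction in the frequency domain would be mapped back to poses in this sketch.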
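Cross-domain knowledge transfer is promoted by a consistency constraint that takes the form of a Mean Per-Joint Position Error (MPJPE) between the temporal-branch and spatial-branch outputs at each of the $L$ layers:
$$L_{ST} = \sum_{l=1}^{L} \frac{1}{(T_p+T_f) \cdot J} \sum_{t=1}^{T_p+T_f} \sum_{j=1}^{J} \left\| Y_{T,t,j}^{l} - Y_{S,t,j}^{l} \right\|_2$$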
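The spatio-temporal consistency term can be computed as in the sketch below. It assumes that each branch exposes one pose-space output per layer, shaped (T_p + T_f, J, D); how those per-layer outputs are produced is not described in this summary, so the function simply takes them as inputs. The function name is hypothetical.

```python
import numpy as np

def consistency_loss(y_temporal, y_spatial):
    """Sum over layers of the mean per-joint distance between branch outputs.

    y_temporal, y_spatial: sequences of L arrays, each shaped (T, J, D).
    """
    total = 0.0
    for y_t, y_s in zip(y_temporal, y_spatial):
        per_joint = np.linalg.norm(y_t - y_s, axis=-1)  # (T, J) Euclidean distances
        total += per_joint.mean()  # 1 / (T * J) times the double sum
    return float(total)

# Two layers whose branch outputs differ by the offset (3, 4, 0) at every joint.
y_t_layers = [np.zeros((35, 22, 3)) for _ in range(2)]
y_s_layers = [y + np.array([3.0, 4.0, 0.0]) for y in y_t_layers]
print(consistency_loss(y_t_layers, y_s_layers))  # 10.0
```

Every joint is 5 units apart in both layers, so the two per-layer means of 5.0 add up to 10.0. The term is zero only when the two branches agree at every layer.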
Employ $K$ graph convolution kernels $\Upsilon_T^{l} = \{\Upsilon_T^{l,1}, \Upsilon_T^{l,2}, \cdots, \Upsilon_T^{l,K}\}$ for feature learning:
$$M_T^{l} = \mathrm{Ave}(H_T^{l,1}, H_T^{l,2}, \cdots, H_T^{l,K})$$
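The averaging over K kernels can be sketched as below. This is an illustration under assumptions: each kernel is taken to own both an adjacency matrix and a weight matrix, and each kernel output is `tanh(A H W)`. The summary only states that the K outputs are averaged, so the per-kernel form is a guess.

```python
import numpy as np

def multi_subgraph_layer(h, adjs, weights):
    """Run K graph convolution kernels on the same input and average them.

    h: (N, C) node features; adjs: K arrays (N, N); weights: K arrays (C, C).
    """
    outputs = [np.tanh(a @ h @ w) for a, w in zip(adjs, weights)]
    return np.mean(outputs, axis=0)  # element-wise average over the K kernels

rng = np.random.default_rng(0)
N, C, K = 35, 16, 4  # K = 4 matches the setting reported in this summary
h = rng.standard_normal((N, C))
adjs = [0.1 * rng.standard_normal((N, N)) for _ in range(K)]
weights = [0.1 * rng.standard_normal((C, C)) for _ in range(K)]

m = multi_subgraph_layer(h, adjs, weights)
print(m.shape)  # (35, 16)
```

Averaging keeps the output the same size as a single kernel's output, so the layer can be stacked without changing downstream shapes.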
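To prevent the kernels from diverging excessively, a homogeneity information learning enhancement strategy penalizes pairwise differences between the adjacency matrices of the kernels within each layer:
$$L_{con}^{T} = \sum_{l=1}^{L} \sum_{k=1}^{K} \sum_{u=k+1}^{K} \left\| A_T^{l,k} - A_T^{l,u} \right\|_2^2$$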
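A small sketch of the homogeneity term follows. It assumes the squared norm is the squared Frobenius norm of the difference between two adjacency matrices, and that the adjacency matrices are stacked in one array shaped (L, K, N, N); both are assumptions made for the example, and the function name is hypothetical.

```python
import numpy as np

def homogeneity_loss(adjs):
    """Sum of squared differences over all kernel pairs (k < u) in each layer.

    adjs: array shaped (L, K, N, N) holding one adjacency matrix per kernel.
    """
    total = 0.0
    for layer_adjs in adjs:
        num_kernels = len(layer_adjs)
        for k in range(num_kernels):
            for u in range(k + 1, num_kernels):
                total += np.sum((layer_adjs[k] - layer_adjs[u]) ** 2)
    return float(total)

# One layer, three 2x2 kernels filled with 0, 1 and 2 respectively.
toy = np.stack([np.full((2, 2), float(v)) for v in range(3)])[None]
print(homogeneity_loss(toy))  # 24.0
```

The pairs (0, 1) and (1, 2) each contribute 4 and the pair (0, 2) contributes 16, giving 24. The term vanishes only when all kernels in a layer share the same adjacency matrix, so in training it would be weighted against the prediction loss rather than driven to zero.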
- Decoupled modeling: Separately model spatio-temporal dependencies through orthogonal branches, avoiding feature confusion
- Cross-domain constraints: Multi-scale consistency constraints enable effective cross-domain knowledge transfer
- Multi-subgraph mechanism: Inspired by mixture-of-experts models, use multiple trainable subgraphs to capture different motion patterns
- Homogeneity constraints: Ensure consistent information propagation across subgraphs through adjacency matrix similarity constraints
- Human3.6M (H3.6M): Standard human motion dataset
- CMU Motion Capture (CMU Mocap): CMU motion capture dataset
Use Mean Per-Joint Position Error (MPJPE) to evaluate performance, with lower values indicating better prediction performance.
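For reference, MPJPE can be computed as in the sketch below: the Euclidean distance between predicted and ground-truth joint positions, averaged over joints and, for a single summary number, over frames. The function names are illustrative.

```python
import numpy as np

def mpjpe_per_frame(pred, target):
    """Mean joint error for each frame; pred and target are (T_f, J, D)."""
    return np.linalg.norm(pred - target, axis=-1).mean(axis=-1)

def mpjpe(pred, target):
    """Single MPJPE value averaged over all frames and joints."""
    return float(mpjpe_per_frame(pred, target).mean())

# Toy case: the error grows linearly with the prediction horizon.
target = np.zeros((4, 5, 3))
offsets = np.arange(4)[:, None, None] * np.array([3.0, 4.0, 0.0])  # (4, 1, 3)
pred = np.tile(offsets, (1, 5, 1))  # (4, 5, 3)

print(mpjpe_per_frame(pred, target))  # [ 0.  5. 10. 15.]
print(mpjpe(pred, target))  # 7.5
```

Benchmarks usually report the per-frame value at selected horizons (for example 80 ms and 160 ms) as well as an average, which is why the per-frame form is kept separate here.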
Baselines include mainstream GCN methods such as Traj-GCN, DMGNN, STS-GCN, MSR-GCN, SPGSN, PGBIG, and STBMP, among others.
- Number of network layers: L=4
- Number of graph convolution kernels: K=4
- Hyperparameter: λ=0.1
H3.6M Dataset Results:
- At 80ms prediction, MPJPE is 9.61, achieving 3.71% improvement over the best baseline (STBMP's 9.98)
- At 160ms prediction, MPJPE is 21.63, achieving 3.13% improvement over the best baseline
- Achieves best performance across multiple time steps
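The reported percentages are relative reductions in MPJPE with respect to the baseline. A short check of the 80 ms figure, using only the two values quoted above:

```python
def relative_improvement(baseline, ours):
    """Relative MPJPE reduction in percent (lower MPJPE is better)."""
    return (baseline - ours) / baseline * 100.0

gain_80ms = relative_improvement(baseline=9.98, ours=9.61)
print(f"{gain_80ms:.2f}%")  # 3.71%
```

The 160 ms figure cannot be checked the same way from this summary, because the baseline MPJPE at that horizon is not quoted.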
CMU Mocap Dataset Results:
- Average MPJPE of 32.43, significantly outperforming all comparison methods
- Achieves best performance across all prediction time steps
- Module Contribution Analysis:
  - Spatio-temporal dual-branch: Both branches contribute to performance
  - Constraint mechanisms: Both $L_{con}$ and $L_{ST}$ improve performance
  - Complete model achieves best performance (33.80)
- Hyperparameter Impact:
  - Performance is optimal at $\lambda = 0.1$
  - Larger $\lambda$ values (1.0) limit branch information uniqueness
- Network Structure Impact:
  - Increasing layer count $L$ and kernel count $K$ generally improves performance
  - $L = 4$, $K = 4$ is the optimal configuration
- Constraint mechanism effectiveness: Adjacency matrix constraints are more effective than weight parameter constraints
- Consistency vs. diversity: Enforcing graph construction similarity outperforms diversity constraints
- Branch selection: Using spatial branch output as final prediction yields best results
- CNN/RNN methods: Early approaches based on convolutional and recurrent networks, which suffer from filter dependency and error accumulation
- GCN methods: The current mainstream approach, which excels at modeling kinematic dependencies between joints
- Transformer methods: A more recent line of work that demonstrates strong performance in sequence modeling
Compared to existing GCN methods, this paper better exploits the complementarity and cross-dependencies of spatio-temporal features through decoupled spatio-temporal modeling, cross-domain constraints, and multi-subgraph learning.
- Decoupled spatio-temporal modeling better captures unique information in each domain
- Cross-domain consistency constraints effectively promote knowledge transfer
- Multi-subgraph learning enhances motion pattern capture capability
- Achieves state-of-the-art performance on standard benchmarks
- Relatively high model complexity, requiring balance between performance and computational efficiency
- Hyperparameter λ requires tuning for different datasets
- Performance on extremely long-term prediction requires further verification
- Explore more efficient spatio-temporal feature fusion mechanisms
- Investigate adaptive subgraph quantity selection strategies
- Extend to more diverse human motion scenarios
- Strong novelty: The decoupled spatio-temporal modeling approach is novel, with ingeniously designed cross-domain constraint mechanisms
- Solid theoretical foundation: GCN-based spatial and temporal modeling has sufficient theoretical support
- Comprehensive experiments: Includes detailed ablation studies and parameter analysis
- Excellent performance: Achieves state-of-the-art results on multiple benchmark datasets
- Clear presentation: Well-structured paper with accurate technical descriptions
- Computational complexity: Multi-branch and multi-subgraph designs increase model complexity
- Parameter sensitivity: Hyperparameter λ significantly affects performance and requires careful tuning
- Generalization analysis: Lacks analysis of generalization capability to different motion types (e.g., dance, gymnastics)
- Real-time considerations: Does not discuss model inference speed and potential for real-time applications
- Academic contribution: Provides new decoupled perspective for spatio-temporal feature modeling
- Practical value: Has application prospects in robotics, gaming, and gesture interaction
- Reproducibility: Provides detailed implementation details and parameter settings
- High-precision requirements: Suitable for applications demanding high prediction accuracy
- Standard motion prediction: Performs well in predicting standardized motions such as daily activities and sports
- Short to medium-term prediction: Demonstrates excellent performance in prediction tasks within 1000ms
The paper cites over 60 relevant references, covering major methods in human motion prediction, including CNN, RNN, LSTM, Transformer, and GCN approaches, providing readers with comprehensive background knowledge.
Overall Evaluation: This is a high-quality computer vision paper that proposes an innovative solution for the important task of human motion prediction. The core idea of decoupled spatio-temporal modeling is likely to generalize beyond this task, and the experimental results are convincing. Although model complexity and hyperparameter tuning remain challenges, the overall contribution is significant and merits attention and further research.