Graph convolutional network (GCN)-based methods have shown strong performance in 3D human pose estimation by leveraging the natural graph structure of the human skeleton. However, their local receptive field limits their ability to capture long-range dependencies essential for handling occlusions and depth ambiguities. They also exhibit spectral bias, which prioritizes low-frequency components while struggling to model high-frequency details. In this paper, we introduce PoseKAN, an adaptive graph Kolmogorov-Arnold Network (KAN) framework that extends KANs to graph-based learning for 2D-to-3D pose lifting from a single image. Unlike GCNs that use fixed activation functions, KANs employ learnable functions on graph edges, allowing data-driven, adaptive feature transformations. This enhances the model's adaptability and expressiveness in learning complex pose variations. Our model employs multi-hop feature aggregation, ensuring the body joints can leverage information from both local and distant neighbors, leading to improved spatial awareness. It also incorporates residual PoseKAN blocks for deeper feature refinement, and a global response normalization for improved feature selectivity and contrast. Extensive experiments on benchmark datasets demonstrate the competitive performance of our model against state-of-the-art methods.
- Paper ID: 2511.08809
- Title: Adaptive Graph Kolmogorov-Arnold Network for 3D Human Pose Estimation
- Authors: Abu Taib Mohammed Shahjahan and A. Ben Hamza (Concordia University, Montreal, Canada)
- Category: cs.CV (Computer Vision)
- Submission Date: November 11, 2025 (arXiv)
- Paper Link: https://arxiv.org/abs/2511.08809
- Code Link: https://github.com/shahjahan0275/PoseKAN
This paper proposes PoseKAN, an adaptive graph Kolmogorov-Arnold network framework for 3D human pose estimation. The method addresses three core limitations of traditional graph convolutional networks (GCNs): restricted local receptive fields, spectral bias, and the limited expressiveness of fixed activation functions. PoseKAN replaces fixed activation functions with learnable function transformations on graph edges and combines them with multi-hop feature aggregation, enabling effective modeling of both local and long-range joint dependencies. Experiments on the Human3.6M and MPI-INF-3DHP benchmark datasets demonstrate that the method achieves performance comparable to state-of-the-art approaches.
3D human pose estimation aims to infer 3D coordinates of body joints from 2D images or videos, which is crucial for understanding human motion. However, it remains highly challenging due to inherent depth ambiguity and occlusion in input data.
- Broad Applications: Human-computer interaction, action recognition, sports analysis, medical rehabilitation, etc.
- Technical Challenges: Missing depth information in monocular images, self-occlusion, complex pose variations
Three Major Limitations of GCN Methods:
- Restricted Local Receptive Field: Primarily relies on one-hop neighbor aggregation, making it difficult to capture long-range dependencies between distant joints
- Spectral Bias Problem: Due to using MLPs as core components, tends to learn low-frequency components while struggling to capture high-frequency details (e.g., rapid motion, fine joint interactions)
- Insufficient Expressiveness: Uses predefined fixed activation functions and trainable weight matrices, lacking dynamic adaptability and interpretability
Inspired by the Kolmogorov-Arnold representation theorem, KAN networks replace fixed activation functions with learnable univariate functions, providing stronger function approximation capability and interpretability. This paper extends KAN to graph learning, specifically targeting the 2D-to-3D lifting task for 3D pose estimation.
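For reference, the representation theorem mentioned above is usually stated as follows for a continuous function f on the unit cube [0,1]ⁿ. The symbols Φ_q and φ_{q,p} are the standard textbook notation and do not come from this summary; they denote continuous univariate functions.

```latex
f(x_1, \dots, x_n) \;=\; \sum_{q=0}^{2n} \Phi_q\!\left( \sum_{p=1}^{n} \phi_{q,p}(x_p) \right)
```

KAN layers can be read as a trainable relaxation of this form: the inner and outer univariate functions are replaced by learnable ones and stacked to arbitrary depth and width.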
- Proposes PoseKAN Framework: First extension of Kolmogorov-Arnold networks to graph-structured data for 3D human pose estimation, enhancing model adaptability and generalization through learnable function-based transformations
- Designs Multi-hop Feature Propagation Mechanism: Introduces scaling parameter s to balance local and global feature aggregation, with propagation matrix P = (1-s)Â + sÂ² considering both one-hop and two-hop neighbors, improving robustness to occlusion and depth ambiguity
- Innovative Architecture Design:
  - Residual PoseKAN blocks for deep feature refinement
  - Global Response Normalization (GRN) enhancing feature selectivity and contrast
  - Integration of GELU nonlinearity for enhanced expressiveness
- Comprehensive Experimental Validation: Detailed comparative experiments and ablation studies on Human3.6M and MPI-INF-3DHP datasets, demonstrating method effectiveness
Given training set D = {(xᵢ, yᵢ)}ᴺᵢ₌₁, where:
- Input: xᵢ ∈ ℝ² represents 2D joint positions (provided by off-the-shelf 2D pose detectors)
- Output: yᵢ ∈ ℝ³ represents corresponding ground-truth 3D joint positions
- Objective: Learn regression model parameters ω for fω: X → Y
Human skeleton represented as graph G = (V, E, X):
- V = {1,...,J} represents J nodes (joints)
- E ⊆ V × V represents edge set
- X ∈ ℝᴶˣᶠ represents node feature matrix
- A is adjacency matrix, Â = D⁻¹/²AD⁻¹/² is normalized adjacency matrix
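A minimal NumPy sketch of the graph construction above may help fix the notation. The 5-joint skeleton and the helper name `normalized_adjacency` are hypothetical, and the sketch assumes no self-loops are added before normalization, since the formula above is written without them.

```python
import numpy as np

def normalized_adjacency(edges, num_joints):
    """Build A and the symmetric normalization D^{-1/2} A D^{-1/2}."""
    A = np.zeros((num_joints, num_joints))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0                      # undirected skeleton edge
    deg = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))  # guard against isolated joints
    A_hat = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    return A, A_hat

# Hypothetical 5-joint toy skeleton (not the Human3.6M joint layout).
edges = [(0, 1), (1, 2), (1, 3), (3, 4)]
A, A_hat = normalized_adjacency(edges, num_joints=5)
eigenvalues = np.linalg.eigvalsh(A_hat)              # all lie in [-1, 1]
```

The eigenvalues of Â staying in [-1, 1] is what makes repeated multiplication by Â numerically well behaved during propagation.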
The core of KAN layers is learnable activation functions, defined as:
ϕ(x) = wᵦb(x) + wₛspline(x)
where:
- b(x) = SiLU(x) = x/(1+e⁻ˣ) is sigmoid linear unit
- spline(x) = Σᵢ cᵢBᵢ(x) is weighted sum of B-spline basis functions
- wᵦ, wₛ, cᵢ are learnable parameters
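The activation above can be sketched in NumPy as follows. This is an illustrative implementation, not the authors' code: the function names, the input range [-1, 1], and the uniform knot vector extended by k knots on each side are assumptions that follow common KAN implementations. The order k = 3 and grid size G = 5 match the spline settings reported later in this summary.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def bspline_basis(x, grid, k):
    """Cox-de Boor recursion: order-k B-spline basis values at points x.

    grid is an increasing 1D knot vector; the result has shape
    (len(x), len(grid) - k - 1).
    """
    x = np.asarray(x, dtype=float)[:, None]
    # Order 0: indicator of each knot interval [t_i, t_{i+1}).
    B = ((x >= grid[:-1]) & (x < grid[1:])).astype(float)
    for d in range(1, k + 1):
        left = (x - grid[:-(d + 1)]) / (grid[d:-1] - grid[:-(d + 1)])
        right = (grid[d + 1:] - x) / (grid[d + 1:] - grid[1:-d])
        B = left * B[:, :-1] + right * B[:, 1:]
    return B

def phi(x, w_b, w_s, c, grid, k=3):
    """phi(x) = w_b * SiLU(x) + w_s * sum_i c_i * B_i(x)."""
    return w_b * silu(x) + w_s * (bspline_basis(x, grid, k) @ c)

G, k = 5, 3                                   # grid size and spline order
h = 2.0 / G                                   # knot spacing on the assumed range [-1, 1]
grid = np.linspace(-1 - k * h, 1 + k * h, G + 2 * k + 1)
rng = np.random.default_rng(0)
c = 0.1 * rng.normal(size=G + k)              # G + k spline coefficients
xs = np.linspace(-0.9, 0.9, 7)
ys = phi(xs, w_b=1.0, w_s=1.0, c=c, grid=grid, k=k)
```

With all coefficients cᵢ set to zero the function reduces to a scaled SiLU, which is why the base term gives a sensible starting point before the spline part is learned.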
The proposed innovative spectral modulation filter:
hₛ(λ) = 1/((1+s)λ - sλ²)
where s ∈ (0,1) is scaling parameter controlling filter attenuation behavior for different frequency components. The filter exhibits adaptive low-pass characteristics.
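To see the low-pass behaviour numerically, the filter can be evaluated at a few frequencies. The sketch assumes λ is an eigenvalue of the normalized graph Laplacian, so λ lies in (0, 2]; the filter is undefined at λ = 0. The value s = 0.2 is the setting reported later in this summary.

```python
def spectral_filter(lam, s):
    """h_s(lambda) = 1 / ((1 + s) * lambda - s * lambda**2)."""
    return 1.0 / ((1.0 + s) * lam - s * lam ** 2)

s = 0.2
frequencies = (0.5, 1.0, 1.5, 2.0)
responses = [spectral_filter(lam, s) for lam in frequencies]
# The response shrinks as lambda grows: low frequencies are amplified
# relative to high ones, while high frequencies are attenuated rather
# than removed entirely.
```

The denominator λ(1 + s - sλ) is increasing on (0, (1+s)/(2s)), which covers the whole interval (0, 2] for any s in (0, 1), so the response is monotonically decreasing there.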
Solved via fixed-point iteration:
H⁽ᵗ⁺¹⁾ = ((1-s)I + sÂ)ÂH⁽ᵗ⁾ + X
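The link between this iteration and the filter hₛ(λ) can be made explicit with a short derivation. It assumes the standard identity Â = I − L, where L is the normalized graph Laplacian, and it holds formally wherever the inverse exists; the fixed point is written H*.

```latex
\begin{align}
P &= \big((1-s)I + s\hat{A}\big)\hat{A} = (1-s)(I - L) + s(I - L)^2 = I - (1+s)L + sL^2, \\
I - P &= (1+s)L - sL^2, \\
H^\ast &= P H^\ast + X \;\Longrightarrow\; H^\ast = (I - P)^{-1} X = \big((1+s)L - sL^2\big)^{-1} X = h_s(L)\, X .
\end{align}
```

Applying the matrix function to each eigenvalue λ of L gives exactly hₛ(λ) = 1/((1+s)λ − sλ²).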
Core layer-wise update formula:
H⁽ˡ⁺¹⁾ = KAN⁽ˡ⁾(((1-s)Â + sÂ²)H⁽ˡ⁾ + X)
Decomposable into two operations:
Feature Propagation:
G⁽ˡ⁾ = PH⁽ˡ⁾ + X
where P = (1-s)Â + sÂ² is the propagation matrix balancing one-hop and two-hop neighbor information
Feature Embedding:
H⁽ˡ⁺¹⁾ = KAN⁽ˡ⁾(G⁽ˡ⁾)
Each graph edge is associated with a learnable univariate function.
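The feature embedding step can be sketched as a layer in which every input-to-output connection carries its own univariate function, with the outputs summed over inputs. This reads "edge" in the KAN sense of a connection between an input feature and an output feature, which is an interpretation rather than something this summary states. For brevity the learnable part uses Gaussian radial basis functions as a stand-in for the B-splines used in the paper; all names and sizes below are hypothetical.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def rbf_basis(x, centers, width):
    """Stand-in basis of Gaussian bumps (the paper's layers use B-splines)."""
    return np.exp(-((x[..., None] - centers) / width) ** 2)

def kan_layer(G, w_b, coef, centers, width=0.5):
    """out[j, o] = sum_i phi_{i,o}(G[j, i]), one univariate phi per (i, o) pair.

    G: (J, F_in) propagated features; w_b: (F_in, F_out); coef: (F_in, F_out, M).
    """
    base = silu(G) @ w_b                              # sum_i w_b[i, o] * SiLU(G[j, i])
    basis = rbf_basis(G, centers, width)              # (J, F_in, M)
    learned = np.einsum("jim,iom->jo", basis, coef)   # sum over inputs i and basis index m
    return base + learned

rng = np.random.default_rng(0)
J, F_in, F_out, M = 4, 3, 2, 5                        # hypothetical toy sizes
G = rng.normal(size=(J, F_in))
w_b = rng.normal(size=(F_in, F_out))
coef = rng.normal(size=(F_in, F_out, M))
centers = np.linspace(-2.0, 2.0, M)
H_next = kan_layer(G, w_b, coef, centers)

# Reference computation: evaluate each phi_{i,o} separately, then sum over i.
H_ref = np.zeros((J, F_out))
for i in range(F_in):
    for o in range(F_out):
        phi_io = w_b[i, o] * silu(G[:, i]) + rbf_basis(G[:, i], centers, 0.5) @ coef[i, o]
        H_ref[:, o] += phi_io
```

The vectorized layer and the explicit double loop produce the same result; the loop makes the "one function per connection" structure visible, while the `einsum` form is what one would use in practice.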
- Initial PoseKAN Layer: Maps 2D input to latent space
- 4 Residual PoseKAN Blocks: Each block contains
  - 5 PoseKAN layers for hierarchical feature learning
  - Layer normalization for training stability
  - Additional PoseKAN layer + GELU nonlinearity
  - Residual connections preventing gradient vanishing
- Global Response Normalization (GRN): Calibrates feature magnitude before prediction
- Final PoseKAN Layer: Projects back to 3D pose space
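Of the components listed above, GRN is the least self-explanatory, so here is a small sketch. It assumes the paper follows the ConvNeXt V2 formulation of GRN and that the global aggregation runs over the joint axis; both are assumptions, as are the function name and the toy sizes (17 joints, 8 channels).

```python
import numpy as np

def global_response_norm(X, gamma, beta, eps=1e-6):
    """GRN in the ConvNeXt V2 style, aggregating over joints.

    X: (J, F) joint features; gamma, beta: (F,) learnable scale and shift.
    """
    gx = np.linalg.norm(X, axis=0, keepdims=True)        # (1, F): L2 norm of each channel over joints
    nx = gx / (gx.mean(axis=-1, keepdims=True) + eps)    # each channel's norm relative to the mean norm
    return gamma * (X * nx) + beta + X                   # recalibrated features plus identity skip

rng = np.random.default_rng(0)
X = rng.normal(size=(17, 8))                             # hypothetical (joints, channels)
out_identity = global_response_norm(X, np.zeros(8), np.zeros(8))
out_scaled = global_response_norm(X, np.ones(8), np.zeros(8))
```

With gamma and beta at zero the layer is the identity, so it can be inserted without disturbing training at initialization; channels with above-average response are then boosted as gamma is learned, which is the "selectivity and contrast" effect.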
Hybrid loss function (inspired by elastic net):
L = (1/N)[(1-α)Σᵢ||yᵢ - ŷᵢ||₂² + αΣᵢ||yᵢ - ŷᵢ||₁]
where α ∈ [0,1] controls the balance between the squared-error (MSE) and absolute-error (MAE) terms
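A minimal NumPy sketch of this loss follows. It assumes the 1/N factor scales both terms, which is how elastic-net-style losses are normally written; the function name and the toy inputs are hypothetical, and the default α = 0.03 is the value reported later in this summary.

```python
import numpy as np

def hybrid_loss(y_true, y_pred, alpha=0.03):
    """(1/N) * sum_i [(1 - alpha) * ||y_i - yhat_i||_2^2 + alpha * ||y_i - yhat_i||_1]."""
    diff = y_true - y_pred                     # (N, 3) per-joint residuals
    squared_l2 = np.sum(diff ** 2, axis=1)     # squared Euclidean error per joint
    l1 = np.sum(np.abs(diff), axis=1)          # absolute error per joint
    return np.mean((1.0 - alpha) * squared_l2 + alpha * l1)

y_true = np.zeros((2, 3))
y_pred = np.array([[1.0, 0.0, 0.0],
                   [0.0, 2.0, 0.0]])
loss_mix = hybrid_loss(y_true, y_pred, alpha=0.5)   # equal weighting of both terms
loss_mse = hybrid_loss(y_true, y_pred, alpha=0.0)   # pure squared-error term
loss_mae = hybrid_loss(y_true, y_pred, alpha=1.0)   # pure absolute-error term
```

A small α keeps the loss close to MSE while the L1 term limits the influence of outlier joints.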
- GCN: Uses fixed activation functions (e.g., ReLU) and trainable weight matrices, essentially node-level linear mappings
- PoseKAN: Employs learnable univariate functions on edges, providing data-driven adaptive feature transformations with stronger expressiveness
Through propagation matrix P = (1-s)Â + sÂ²:
- Explicitly combines one-hop and two-hop neighbor information
- Parameter s adjusts balance between local vs global information
- Avoids explicit computation of Â² (using right-to-left multiplication strategy)
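The right-to-left strategy amounts to multiplying Â into the feature matrix twice instead of squaring Â first. The sketch below checks that this matches the explicit form P = (1-s)Â + sÂ², consistent with the fixed-point iteration given earlier; the 4-joint chain graph and the function name are hypothetical.

```python
import numpy as np

def propagate(A_hat, H, X, s):
    """G = ((1 - s) * A_hat + s * A_hat @ A_hat) @ H + X, without forming A_hat @ A_hat."""
    AH = A_hat @ H             # one-hop aggregation
    AAH = A_hat @ AH           # two-hop aggregation reuses the one-hop result
    return (1.0 - s) * AH + s * AAH + X

# Hypothetical 4-joint chain graph with symmetric normalization.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
d = A.sum(axis=1) ** -0.5
A_hat = d[:, None] * A * d[None, :]

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))    # initial features (the residual term)
H = rng.normal(size=(4, 3))    # current layer features
s = 0.2

G_fast = propagate(A_hat, H, X, s)
P = (1 - s) * A_hat + s * A_hat @ A_hat
G_explicit = P @ H + X
```

When Â is stored as a sparse matrix, each product costs on the order of ||Â||₀F operations, which is where the first term of the time complexity comes from; squaring Â would instead create a denser matrix.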
KAN's function-based transformations capture both low and high-frequency components:
- Low-frequency: Smooth, gradual joint position changes (e.g., Walking, Eating)
- High-frequency: Rapid, abrupt motions (e.g., sudden actions in Greeting)
- Time Complexity: O(L||Â||₀F + LGF²)
  - First term: feature propagation (depends on graph edge count)
  - Second term: KAN transformation (G is grid size)
- Space Complexity: O(LJF + 2kGLF²)
  - 2k from recursive computation of k-order splines
Since k and G are typically small, additional overhead is manageable
- Scale: 11 actors (6 male, 5 female), 15 indoor activities
- Acquisition: 50Hz, 4 synchronized cameras
- Annotation: Precise 3D joint coordinates via motion capture
- Split:
  - Training: 5 actors (S1, S5, S6, S7, S8)
  - Testing: 2 actors (S9, S11)
- Preprocessing: Normalization, zero-centered at hip joint
- Scale: 8 actors (4 male, 4 female), 8 activity sequences
- Acquisition: 14 different viewpoints, indoor/outdoor scenes
- Characteristics: More diverse than Human3.6M, including basic to dynamic high-intensity actions
- Protocol #1: MPJPE (Mean Per-Joint Position Error) - average per-joint position error in millimeters
- Protocol #2: PA-MPJPE (Procrustes-Aligned MPJPE) - error after Procrustes alignment
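- PCK (Percentage of Correct Keypoints): Percentage of joints whose predicted position lies within a fixed distance threshold of the ground truth
- AUC (Area Under Curve): Area under the PCK curve computed over a range of thresholds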
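The first three metrics above can be sketched in NumPy as follows. The function names are hypothetical, poses are assumed to be (J, 3) arrays in millimeters, and the 150 mm PCK threshold is the value commonly used on MPI-INF-3DHP; whether the paper uses exactly that threshold is an assumption.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint Euclidean distance; pred and gt are (J, 3) in the same units."""
    return np.linalg.norm(pred - gt, axis=1).mean()

def procrustes_align(pred, gt):
    """Similarity transform (scale, rotation, translation) of pred that best fits gt."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    P, G = pred - mu_p, gt - mu_g
    U, S, Vt = np.linalg.svd(P.T @ G)
    d = np.sign(np.linalg.det(U @ Vt))        # -1 if the best orthogonal fit is a reflection
    D = np.diag([1.0, 1.0, d])
    R = U @ D @ Vt                            # proper rotation acting on row vectors
    scale = (S * np.diag(D)).sum() / (P ** 2).sum()
    return scale * P @ R + mu_g

def pa_mpjpe(pred, gt):
    return mpjpe(procrustes_align(pred, gt), gt)

def pck(pred, gt, threshold=150.0):
    """Percentage of joints within `threshold` of the ground truth."""
    return 100.0 * np.mean(np.linalg.norm(pred - gt, axis=1) <= threshold)

rng = np.random.default_rng(0)
gt = rng.normal(size=(17, 3)) * 100.0                  # hypothetical pose in mm
theta = np.deg2rad(30.0)
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
pred = 1.3 * gt @ Rz + np.array([10.0, -5.0, 2.0])     # gt under a similarity transform

raw_error = mpjpe(pred, gt)          # large: the pose is rotated, scaled, and shifted
aligned_error = pa_mpjpe(pred, gt)   # near zero: alignment removes the transform
```

The toy example shows why the two protocols differ: Protocol #1 penalizes global rotation, scale, and translation errors, while Protocol #2 measures only the residual shape error after those are factored out.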
- GCN Series: SemGCN, High-order GCN, CompGCN, Modulated GCN, Group GCN, MM-GCN, Flex-GCN
- Hybrid Methods: GraphMLP (combining MLP and GCN)
- Others: HOIF-Net, PoseGraphNet, WSGN, etc.
- Hardware: Single NVIDIA RTX A4500 GPU (20GB)
- Framework: PyTorch
- Optimizer: AMSGrad
- Training Epochs: 30
- Learning Rate: Initial 0.001, decayed by 0.99 every 4 epochs
- Batch Size: 64
- Embedding Dimension: F = 240
- Key Hyperparameters: s = 0.2, α = 0.03 (determined via grid search)
- Regularization: Dropout = 0.2 after each PoseKAN layer
- Spline Settings: Order = 3, Grid Size = 5
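The learning-rate schedule listed above can be written out directly. The sketch assumes a step-wise decay applied once every 4 epochs with 0-indexed epochs; the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=1e-3, gamma=0.99, step=4):
    """Step decay: multiply the rate by gamma once every `step` epochs."""
    return base_lr * gamma ** (epoch // step)

schedule = [learning_rate(epoch) for epoch in range(30)]
final_lr = schedule[-1]   # seven decays have been applied by the last epoch
```

In PyTorch this configuration would typically be expressed with `torch.optim.Adam(..., lr=1e-3, amsgrad=True)` and `torch.optim.lr_scheduler.StepLR(optimizer, step_size=4, gamma=0.99)`; whether the authors set it up exactly this way is an assumption. With such a mild decay the rate drops by less than 7% over the 30 epochs.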
Overall Performance:
- PoseKAN: 46.7mm (optimal)
- GraphMLP: 48.0mm (second)
- Modulated GCN: 49.4mm
- Relative Error Reduction:
  - vs GraphMLP: 2.7%
  - vs Modulated GCN: 5.47%
  - vs High-order GCN: 15.99%
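The relative error reductions are computed as (baseline − ours) / baseline. A short check using the MPJPE values listed above for GraphMLP and Modulated GCN (the function name is hypothetical):

```python
def relative_error_reduction(baseline_mm, ours_mm):
    """Percentage by which `ours_mm` lowers the error relative to `baseline_mm`."""
    return 100.0 * (baseline_mm - ours_mm) / baseline_mm

vs_graphmlp = relative_error_reduction(48.0, 46.7)     # about 2.7%
vs_modulated = relative_error_reduction(49.4, 46.7)    # about 5.47%
```

In absolute terms these correspond to gains of 1.3 mm and 2.7 mm respectively, which is worth keeping in mind when reading the percentages.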
Key Action Performance (occlusion challenges):
- Eating: 44.4mm (significantly better than other methods)
- Sitting: 54.6mm
- Smoking: 46.1mm
- Superior to Modulated GCN in 14 out of 15 actions
Overall Performance:
- PoseKAN: 38.3mm (optimal)
- GraphMLP: 38.4mm (relative error reduction 0.26%)
- Modulated GCN: 39.1mm (relative error reduction 2.04%)
- High-order GCN: 43.7mm (relative error reduction 12.35%)
Advantageous Actions:
- Superior to GraphMLP in 11 out of 15 actions
- Superior to Modulated GCN in 13 out of 15 actions
- Particularly outstanding in heavily occluded scenarios (Greeting, Sitting, Smoking)
Training on Human3.6M, testing on MPI-INF-3DHP:
- PCK: 86.0% (highest)
- AUC: 52.9% (second, only behind ICFNet's 54.3%)
- PCK improvement over ICFNet: 0.5%
- MPJPE: 33.51mm
- Relative Error Reduction:
  - vs SemGCN: 19.62%
  - vs High-order GCN: 14.29%
  - vs GraphMLP: 2.01%
- PA-MPJPE: 28.01mm (optimal)
| Configuration | MPJPE | PA-MPJPE |
|---|---|---|
| Without IRC | 34.44mm | 28.79mm |
| With IRC | 33.51mm | 28.01mm |
| Improvement | 1.65% | 1.49% |
Conclusion: IRC stabilizes training by preserving initial features, preventing information loss
- Order 2: MPJPE=47.43mm, PA-MPJPE=38.86mm
- Order 3: MPJPE=46.77mm, PA-MPJPE=38.36mm (optimal)
- Order 4: MPJPE=47.10mm, PA-MPJPE=38.59mm
Conclusion: Order 3 achieves optimal balance; higher orders increase complexity without benefit
- Size 4: MPJPE=47.40mm, PA-MPJPE=38.91mm
- Size 5: MPJPE=46.77mm, PA-MPJPE=38.36mm (optimal)
- Size 6: MPJPE=47.98mm, PA-MPJPE=39.11mm
Conclusion: Grid size 5 provides sufficient function approximation capability
Test range: s ∈ {0.1, 0.2, 0.3, 0.5, 0.7, 0.9}
- Optimal value: s=0.2
- Smaller s emphasizes local information while moderately considering distant nodes
- Both excessively large and small s values degrade performance
- 224: MPJPE=47.38mm
- 240: MPJPE=46.77mm (optimal)
- 256: MPJPE=47.29mm
Conclusion: 240 dimensions provide sufficient expressiveness without overfitting
Qualitative Visualization (Figure 2) demonstrates PoseKAN predictions across various action categories:
- Predicted 3D poses highly align with ground truth
- Superior performance in self-occlusion scenarios (e.g., crossed arms, sitting)
- GraphMLP occasionally produces unnatural joint positions, while PoseKAN maintains skeleton structure consistency
- Precise joint placement and natural limb configuration validate the model's ability to alleviate depth ambiguity
- Clear Advantages of Learnable Functions: Compared to fixed activation functions, learnable functions on edges provide stronger adaptability
- Multi-hop Aggregation is Critical: Significantly improves handling of occlusion and complex poses
- High Parameter Efficiency: PoseKAN has only 5.72M parameters, far fewer than GraphMLP's 9.49M
- Strong Cross-dataset Generalization: Performance on MPI-INF-3DHP demonstrates good generalization
- Sensitivity to High-frequency Details: Shows clear advantages in actions requiring rapid motion details (e.g., Greeting)
- Directly regress 3D joint coordinates from images
- Representatives: Integral Human Pose Regression, Compositional Human Pose Regression
- Limitations: Susceptible to occlusion, lower accuracy
- Stage 1: Detect 2D joint positions
- Stage 2: Lift to 3D space
- Representatives: SimpleBaseline, LCN
- Advantages: Modular design, flexible 2D detector selection, stronger robustness
- This paper belongs to this category
- SemGCN: First application of GCN to 3D pose estimation
- Limitations: One-hop neighbor aggregation, limited local receptive field
- High-order GCN: Extends to multi-hop neighborhoods
- Modulated GCN: Adjacency matrix modulation, learns additional edges
- GroupGCN: Group graph convolution
- MM-GCN: Multi-hop modulated GCN, fuses multi-hop neighborhood information
- GraphMLP: Combines MLP and GCN, leveraging global and local skeleton interactions
- Limitations: Still uses fixed activation functions, suffers from spectral bias
- Theoretical Foundation: Kolmogorov-Arnold representation theorem (any continuous multivariate function can be expressed as finite combinations of univariate functions)
- KAN Networks: Replace fixed activations with learnable univariate functions, improving interpretability and adaptability
- KAGNN: Recently extends KAN to graph learning (node/graph classification, link prediction)
- This Paper's Innovation: First application of KAN to 2D-to-3D lifting for 3D pose estimation
- vs Standard GCN: Learnable functions vs fixed activation, multi-hop aggregation vs one-hop
- vs High-order GCN: Adaptive function transformation vs fixed high-order convolution
- vs GraphMLP: Alleviates spectral bias, stronger expressiveness
- vs KAGNN: Specifically designed for pose estimation, introduces spectral modulation filter
- Method Effectiveness: PoseKAN achieves or surpasses state-of-the-art methods on Human3.6M and MPI-INF-3DHP datasets
- Core Advantages:
  - Learnable functions provide stronger adaptability and expressiveness
  - Multi-hop feature aggregation effectively captures long-range dependencies
  - Alleviates spectral bias, simultaneously learning low and high-frequency components
- Practicality: High parameter efficiency (5.72M), manageable computational overhead, suitable for practical applications
- Generalization Capability: Excellent cross-dataset evaluation performance demonstrates good generalization
- Interpretability Challenges: While more interpretable than GCN, visualizing how each learnable activation adapts across different skeleton parts remains challenging
- Computational Cost: Learnable activations increase per-layer computational overhead; spline basis functions require additional memory
- Memory Consumption: Significant memory requirements for large-scale datasets and deep network training
- Optimization Space: Further improvements needed in computational efficiency, interpretability, and robustness
- Single-person Pose Limitation: Currently handles only single-person poses, not extended to multi-person scenarios
- 2D Detection Dependency: Performance depends on quality of 2D pose detector
- Static Graph Structure: While learning edge weights, topology is predefined
- Hyperparameter Sensitivity: Parameters like s and α require careful tuning
- Multi-person Pose Estimation: Extension to multi-person scenarios handling interpersonal interactions
- Other Graph Learning Tasks: Action recognition, anomaly detection, etc.
- Temporal Modeling: Incorporating temporal information from video sequences
- End-to-end Learning: Joint optimization of 2D detection and 3D lifting
- Adaptive Graph Structure: Dynamically learn graph topology rather than predefined structure
- Lightweight Design: Model compression for mobile devices
- Theoretical Innovation: First extension of KAN to graph learning for 3D pose estimation, solid theoretical foundation
- Technical Innovation: Ingenious spectral modulation filter design, effective multi-hop aggregation mechanism
- Architecture Innovation: Reasonable combination of residual PoseKAN blocks and GRN
- Dataset Diversity: Human3.6M (indoor) + MPI-INF-3DHP (indoor/outdoor)
- Comprehensive Comparison: Comparison with 10+ state-of-the-art methods
- Detailed Ablation: IRC, spline order, grid size, scaling factor, embedding dimension, etc.
- Qualitative Analysis: Visualization case comparisons provided
- Leading Performance: Achieves SOTA or near-SOTA on multiple metrics
- Good Consistency: Stable performance across datasets and protocols
- Magnitude of Gains: Substantial relative error reduction (up to 19.62% vs SemGCN)
- Parameter Efficiency: 5.72M parameters superior to GraphMLP's 9.49M
- Clear Structure: Logical progression from motivation to method to experiments
- Mathematical Rigor: Complete formula derivations, clear symbol definitions
- Rich Figures: Architecture diagrams, comparison tables, ablation charts comprehensive
- Supplementary Materials: Detailed appendix explanations provided
- Computational Overhead: While claimed manageable, spline computation and function learning do increase complexity
- Memory Requirements: O(2kGLF²) memory complexity may become bottleneck in large-scale applications
- Single-person Limitation: Doesn't handle multi-person scenarios, limiting practical application scope
- Hyperparameter Search: s=0.2 and α=0.03 determined via grid search, but search range and process not reported
- Statistical Testing: Lacks significance tests (e.g., t-test)
- Failure Cases: No demonstration of typical failure cases and analysis of failure reasons
- Interpretability: Claims greater interpretability than GCN but lacks specific function visualization or analysis
- Frequency Analysis: Mentions alleviating spectral bias but lacks quantitative spectral analysis evidence
- Error Distribution: No analysis of error distribution patterns across different joints and actions
- Input Consistency: Uses same 2D detector, but doesn't report detector error impact on results
- Implementation Details: Baseline methods may use different training strategies, affecting fair comparison
- Theoretical Contribution: Introduces KAN to graph-based pose estimation, opens new direction
- Method Contribution: Spectral modulation filter and multi-hop aggregation mechanisms transferable to other graph tasks
- Empirical Contribution: Establishes new performance benchmarks on standard datasets
- Performance Improvement: 2-19% relative improvement meaningful for practical applications
- Parameter Efficiency: 5.72M parameters moderate, deployable
- Limitations: Single-person limitation and computational overhead restrict real-time applications
- Code Release: GitHub link provided, facilitating reproduction and application
- Sufficient Details: Hyperparameters, training strategies, network configuration detailed
- Open Source Code: Code is publicly released on GitHub
- Standard Data: Uses public datasets and standard protocols
- Potential Issues: KAN implementation details (spline computation) may have technical barriers
- High-precision Requirement Scenarios: Sports analysis, medical diagnosis requiring high accuracy
- Heavy Occlusion Scenarios: Multi-hop aggregation mechanism shows clear advantages under occlusion
- Complex Action Analysis: High-frequency detail capture capability suits rapid complex actions
- Offline Processing: Scenarios requiring high precision without real-time constraints
- Real-time Applications: Relatively high computational overhead, unsuitable for real-time processing
- Multi-person Scenarios: Current architecture doesn't consider multi-person interactions
- Resource-constrained Devices: Large memory requirements unsuitable for mobile deployment
- Large-scale Deployment: Training and inference costs may limit large-scale application
- Video Sequences: Extendable to temporal modeling
- Other Graph Tasks: Action recognition, human mesh recovery, etc.
- Multi-modal Fusion: Combining RGB, depth, IMU and other multi-source data
- Transfer Learning: Pretrained model transfer to other pose estimation tasks
- Liu et al., 2025 - KAN: Kolmogorov-Arnold networks (ICLR 2025) - Original KAN proposal
- Zhao et al., 2019 - SemGCN - First GCN application to 3D pose estimation
- Zou & Tang, 2021 - Modulated GCN - Adjacency matrix modulation method
- Li et al., 2025 - GraphMLP - One of the strongest baselines
- Bresson et al., 2025 - KAGNNs - KAN application in graph learning
- Ionescu et al., 2013 - Human3.6M dataset - Standard benchmark dataset
- Martinez et al., 2017 - SimpleBaseline - Classical 2D-to-3D lifting method
- Novelty: 9/10
- Technical Quality: 8/10
- Experimental Sufficiency: 8/10
- Writing Quality: 9/10
- Practical Value: 7/10
- Comprehensive Score: 8.2/10
Recommendation: ★★★★☆ (Strongly recommended for reading, especially for researchers interested in graph neural networks and 3D vision)