
Adaptive graph Kolmogorov-Arnold network for 3D human pose estimation

Shahjahan, Hamza
Graph convolutional network (GCN)-based methods have shown strong performance in 3D human pose estimation by leveraging the natural graph structure of the human skeleton. However, their local receptive field limits their ability to capture long-range dependencies essential for handling occlusions and depth ambiguities. They also exhibit spectral bias, which prioritizes low-frequency components while struggling to model high-frequency details. In this paper, we introduce PoseKAN, an adaptive graph Kolmogorov-Arnold network (KAN) framework that extends KANs to graph-based learning for 2D-to-3D pose lifting from a single image. Unlike GCNs that use fixed activation functions, KANs employ learnable functions on graph edges, allowing data-driven, adaptive feature transformations. This enhances the model's adaptability and expressiveness in learning complex pose variations. Our model employs multi-hop feature aggregation, ensuring the body joints can leverage information from both local and distant neighbors, leading to improved spatial awareness. It also incorporates residual PoseKAN blocks for deeper feature refinement, and a global response normalization for improved feature selectivity and contrast. Extensive experiments on benchmark datasets demonstrate the competitive performance of our model against state-of-the-art methods.

Basic Information

  • Paper ID: 2511.08809
  • Title: Adaptive Graph Kolmogorov-Arnold Network for 3D Human Pose Estimation
  • Authors: Abu Taib Mohammed Shahjahan and A. Ben Hamza (Concordia University, Montreal, Canada)
  • Category: cs.CV (Computer Vision)
  • Submission Date: November 11, 2025 (arXiv)
  • Paper Link: https://arxiv.org/abs/2511.08809
  • Code Link: https://github.com/shahjahan0275/PoseKAN

Abstract

This paper proposes PoseKAN—an adaptive graph Kolmogorov-Arnold network framework for 3D human pose estimation. The method addresses three core limitations of traditional graph convolutional networks (GCNs): restricted local receptive fields, spectral bias, and insufficient expressiveness of fixed activation functions. PoseKAN employs learnable function transformations on graph edges to replace fixed activation functions, combined with multi-hop feature aggregation mechanisms, enabling effective modeling of both local and long-range joint dependencies. Experiments on the Human3.6M and MPI-INF-3DHP benchmark datasets demonstrate that the method achieves performance comparable to state-of-the-art approaches.

Research Background and Motivation

1. Core Problem

3D human pose estimation aims to infer 3D coordinates of body joints from 2D images or videos, which is crucial for understanding human motion. However, it remains highly challenging due to inherent depth ambiguity and occlusion in input data.

2. Problem Significance

  • Broad Applications: Human-computer interaction, action recognition, sports analysis, medical rehabilitation, etc.
  • Technical Challenges: Missing depth information in monocular images, self-occlusion, complex pose variations

3. Limitations of Existing Methods

Three Major Limitations of GCN Methods:

  • Restricted Local Receptive Field: GCNs rely primarily on one-hop neighbor aggregation, which makes it difficult to capture long-range dependencies between distant joints
  • Spectral Bias Problem: Because they use MLPs as core components, GCNs tend to learn low-frequency components while struggling to capture high-frequency details (e.g., rapid motion, fine joint interactions)
  • Insufficient Expressiveness: GCNs combine predefined fixed activation functions with trainable weight matrices, and so lack dynamic adaptability and interpretability

4. Research Motivation

Inspired by the Kolmogorov-Arnold representation theorem, KANs replace fixed activation functions with learnable univariate functions, providing stronger function approximation capability and interpretability. This paper extends KANs to graph learning, specifically targeting the 2D-to-3D lifting task for 3D pose estimation.

Core Contributions

  1. Proposes PoseKAN Framework: First extension of Kolmogorov-Arnold networks to graph-structured data for 3D human pose estimation, enhancing model adaptability and generalization through learnable function-based transformations
  2. Designs Multi-hop Feature Propagation Mechanism: Introduces scaling parameter s to balance local and global feature aggregation, with propagation matrix P = (1-s)Â + sÂ² combining one-hop (Â) and two-hop (Â²) neighbors, improving robustness to occlusion and depth ambiguity
  3. Innovative Architecture Design:
    • Residual PoseKAN blocks for deep feature refinement
    • Global Response Normalization (GRN) enhancing feature selectivity and contrast
    • Integration of GELU nonlinearity for enhanced expressiveness
  4. Comprehensive Experimental Validation: Detailed comparative experiments and ablation studies on Human3.6M and MPI-INF-3DHP datasets, demonstrating method effectiveness

Method Details

Task Definition

Given training set D = {(xᵢ, yᵢ)}ᴺᵢ₌₁, where:

  • Input: xᵢ ∈ ℝ² represents 2D joint positions (provided by off-the-shelf 2D pose detectors)
  • Output: yᵢ ∈ ℝ³ represents corresponding ground-truth 3D joint positions
  • Objective: Learn regression model parameters ω for fω: X → Y

Human skeleton represented as graph G = (V, E, X):

  • V = {1,...,J} represents J nodes (joints)
  • E ⊆ V × V represents edge set
  • X ∈ ℝᴶˣᶠ represents node feature matrix
  • A is adjacency matrix, Â = D⁻¹/²AD⁻¹/² is normalized adjacency matrix
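
The normalization Â = D⁻¹/²AD⁻¹/² can be illustrated with a minimal sketch. The 5-joint skeleton below is a hypothetical toy graph chosen for illustration, not the Human3.6M skeleton, and the helper name `normalized_adjacency` is an assumption. Self-loops are not added, matching the definition above.

```python
import math

def normalized_adjacency(num_joints, edges):
    """Return A_hat = D^{-1/2} A D^{-1/2} for an undirected skeleton graph."""
    A = [[0.0] * num_joints for _ in range(num_joints)]
    for i, j in edges:
        A[i][j] = 1.0
        A[j][i] = 1.0
    degree = [sum(row) for row in A]
    # Guard against isolated joints (degree 0) to avoid division by zero.
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in degree]
    return [[inv_sqrt[i] * A[i][j] * inv_sqrt[j] for j in range(num_joints)]
            for i in range(num_joints)]

# Hypothetical toy skeleton: hip(0)-spine(1)-neck(2), with two arms (3, 4) at the neck.
TOY_EDGES = [(0, 1), (1, 2), (2, 3), (2, 4)]
A_hat = normalized_adjacency(5, TOY_EDGES)
```

Each nonzero entry equals 1/sqrt(dᵢdⱼ), so edges attached to high-degree joints such as the neck receive smaller weights.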

Model Architecture

1. Kolmogorov-Arnold Network Foundation

The core of KAN layers is learnable activation functions, defined as:

ϕ(x) = wᵦb(x) + wₛspline(x)

where:

  • b(x) = SiLU(x) = x/(1+e⁻ˣ) is sigmoid linear unit
  • spline(x) = Σᵢ cᵢBᵢ(x) is weighted sum of B-spline basis functions
  • wᵦ, wₛ, cᵢ are learnable parameters
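
A minimal pure-Python sketch of one such activation ϕ(x) follows. It assumes a uniform knot grid on [-1, 1] extended by k knots on each side and evaluates the basis with the Cox-de Boor recursion. The function names, the interval, and the knot layout are illustrative assumptions, not the authors' implementation. Here k = 3 is read as cubic pieces.

```python
import math

def silu(x):
    """b(x) = x / (1 + exp(-x))."""
    return x / (1.0 + math.exp(-x))

def uniform_knots(grid_size, k, lo=-1.0, hi=1.0):
    """Uniform grid on [lo, hi] with k extra knots on each side."""
    h = (hi - lo) / grid_size
    return [lo + (i - k) * h for i in range(grid_size + 2 * k + 1)]

def bspline_basis(x, knots, k):
    """Values of all B-spline basis functions at x (Cox-de Boor recursion)."""
    # Degree 0: indicator function of each half-open knot interval.
    basis = [1.0 if knots[i] <= x < knots[i + 1] else 0.0
             for i in range(len(knots) - 1)]
    for d in range(1, k + 1):
        nxt = []
        for i in range(len(knots) - d - 1):
            left = (x - knots[i]) / (knots[i + d] - knots[i]) * basis[i]
            right = ((knots[i + d + 1] - x)
                     / (knots[i + d + 1] - knots[i + 1]) * basis[i + 1])
            nxt.append(left + right)
        basis = nxt
    return basis  # grid_size + k values

def kan_activation(x, w_b, w_s, coeffs, knots, k):
    """phi(x) = w_b * SiLU(x) + w_s * sum_i c_i * B_i(x)."""
    spline = sum(c * b for c, b in zip(coeffs, bspline_basis(x, knots, k)))
    return w_b * silu(x) + w_s * spline

# Usage with the reported spline settings (order 3, grid size 5).
knots = uniform_knots(grid_size=5, k=3)
coeffs = [0.1 * i for i in range(5 + 3)]  # placeholder coefficients, normally learned
y = kan_activation(0.3, w_b=1.0, w_s=1.0, coeffs=coeffs, knots=knots, k=3)
```

With grid size G and order k there are G + k coefficients cᵢ per activation, which is where the G-dependent terms in the complexity analysis originate.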

2. Spectral Modulation Filter

The proposed innovative spectral modulation filter:

hₛ(λ) = 1/((1+s)λ - sλ²)

where s ∈ (0,1) is a scaling parameter that controls how strongly the filter attenuates different frequency components. The filter exhibits adaptive low-pass characteristics.

Solved via fixed-point iteration: H⁽ᵗ⁺¹⁾ = ((1-s)I + sÂ)ÂH⁽ᵗ⁾ + X
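
The link between the filter and the iteration can be checked with a short derivation. It assumes that Â = I − L, where L is the symmetric normalized Laplacian with eigenvalues λ (so Â has eigenvalues 1 − λ), and that the inverse below exists. Neither assumption is stated explicitly in this summary.

```latex
\begin{aligned}
H^{\ast} &= \bigl((1-s)\hat{A} + s\hat{A}^{2}\bigr)H^{\ast} + X
\;\Longrightarrow\;
H^{\ast} = \bigl(I - (1-s)\hat{A} - s\hat{A}^{2}\bigr)^{-1}X, \\
1 - (1-s)(1-\lambda) - s(1-\lambda)^{2}
&= (1+s)\lambda - s\lambda^{2}
\;\Longrightarrow\;
h_s(\lambda) = \frac{1}{(1+s)\lambda - s\lambda^{2}} .
\end{aligned}
```

Under these assumptions, a fixed point H* of the iteration is the input X filtered by hₛ applied to the Laplacian spectrum.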

3. PoseKAN Layer Update Rule

Core layer-wise update formula:

H⁽ˡ⁺¹⁾ = KAN⁽ˡ⁾(((1-s)Â + sÂ²)H⁽ˡ⁾ + X)

Decomposable into two operations:

Feature Propagation: G⁽ˡ⁾ = PH⁽ˡ⁾ + X

where P = (1-s)Â + sÂ² is the propagation matrix balancing one-hop (Â) and two-hop (Â²) neighbor information

Feature Embedding: H⁽ˡ⁺¹⁾ = KAN⁽ˡ⁾(G⁽ˡ⁾)

Each graph edge is associated with a learnable univariate function
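
The feature propagation step can be sketched in plain Python as follows. This is a minimal illustration that assumes dense list-of-lists matrices and that X has the same shape as H. The helper names `matmul` and `propagate` are hypothetical. The two-hop term is computed as Â(ÂH), so Â² is never formed.

```python
def matmul(A, B):
    """Dense matrix product of two list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def propagate(A_hat, H, X, s=0.2):
    """G = ((1 - s) * A_hat + s * A_hat^2) H + X without forming A_hat^2."""
    AH = matmul(A_hat, H)     # one-hop aggregation
    AAH = matmul(A_hat, AH)   # two-hop aggregation, reusing the one-hop result
    return [[(1.0 - s) * one + s * two + x
             for one, two, x in zip(r_one, r_two, r_x)]
            for r_one, r_two, r_x in zip(AH, AAH, X)]

# Usage on a 2-node toy graph with 2 feature channels.
A_toy = [[0.0, 0.5], [0.5, 0.0]]
H_toy = [[1.0, 2.0], [3.0, 4.0]]
X_toy = [[0.0, 0.0], [0.0, 0.0]]
G_toy = propagate(A_toy, H_toy, X_toy, s=0.5)
```

Multiplying right to left costs two sparse-times-dense products per layer rather than a matrix-matrix product for Â², which is consistent with the propagation term of the reported time complexity.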

4. Overall Architecture

  • Initial PoseKAN Layer: Maps 2D input to latent space
  • 4 Residual PoseKAN Blocks: Each block contains
    • 5 PoseKAN layers for hierarchical feature learning
    • Layer normalization for training stability
    • Additional PoseKAN layer + GELU nonlinearity
    • Residual connections preventing gradient vanishing
  • Global Response Normalization (GRN): Calibrates feature magnitude before prediction
  • Final PoseKAN Layer: Projects back to 3D pose space
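
The GRN step is not spelled out in this summary. The sketch below assumes the ConvNeXt V2 style formulation (per-channel global L2 norm, divisive normalization across channels, then a scaled residual) applied to a J×F joint-feature matrix. Scalar `gamma` and `beta` stand in for what would normally be learnable per-channel parameters. How PoseKAN actually adapts GRN to skeleton graphs is an assumption here.

```python
import math

def global_response_norm(X, gamma=1.0, beta=0.0, eps=1e-6):
    """GRN over a J x F matrix (rows: joints, columns: feature channels)."""
    num_channels = len(X[0])
    # Global aggregation: L2 norm of each channel over all joints.
    g = [math.sqrt(sum(row[c] ** 2 for row in X)) for c in range(num_channels)]
    # Divisive normalization: each channel norm relative to the mean norm.
    mean_g = sum(g) / num_channels
    n = [gc / (mean_g + eps) for gc in g]
    # Calibration with a residual connection back to the input.
    return [[gamma * row[c] * n[c] + beta + row[c] for c in range(num_channels)]
            for row in X]

# Usage: the channel with larger global response is amplified relative to the other.
features = [[3.0, 0.0], [4.0, 0.0]]
calibrated = global_response_norm(features)
```

With `gamma = 0` and `beta = 0` the function reduces to the identity, so the normalization can start as a no-op and be learned during training.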

5. Loss Function

Hybrid loss function (inspired by elastic net):

L = (1/N)[(1-α)Σᵢ||yᵢ - ŷᵢ||₂² + αΣᵢ||yᵢ - ŷᵢ||₁]

where α ∈ [0,1] controls the balance between the squared-error (MSE) and absolute-error (MAE) terms
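
A minimal sketch of this hybrid loss follows. It assumes the 1/N factor scales both terms and that each pose is given as a flat list of coordinates. The function name `hybrid_loss` is hypothetical.

```python
def hybrid_loss(preds, targets, alpha=0.03):
    """(1/N) * [(1 - alpha) * sum ||y - y_hat||_2^2 + alpha * sum ||y - y_hat||_1]."""
    n = len(preds)
    squared = sum((t - p) ** 2
                  for ps, ts in zip(preds, targets) for p, t in zip(ps, ts))
    absolute = sum(abs(t - p)
                   for ps, ts in zip(preds, targets) for p, t in zip(ps, ts))
    return ((1.0 - alpha) * squared + alpha * absolute) / n

# Usage: one sample with coordinate errors (1, 2, 2).
loss = hybrid_loss([[0.0, 0.0, 0.0]], [[1.0, 2.0, 2.0]], alpha=0.5)
```

Setting `alpha = 0` recovers a pure squared-error loss and `alpha = 1` a pure absolute-error loss. The reported α = 0.03 keeps the loss dominated by the squared term.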

Technical Innovation Points

1. Learnable Function Transformation vs Fixed Activation

  • GCN: Uses fixed activation functions (e.g., ReLU) and trainable weight matrices, essentially node-level linear mappings
  • PoseKAN: Employs learnable univariate functions on edges, providing data-driven adaptive feature transformations with stronger expressiveness

2. Multi-hop Dependency Modeling

Through propagation matrix P = (1-s)Â + sÂ²:

  • Explicitly combines one-hop and two-hop neighbor information
  • Parameter s adjusts balance between local vs global information
  • Avoids explicit computation of Â² by multiplying right to left, i.e., computing Â(ÂH) instead of (Â²)H

3. Alleviating Spectral Bias

KAN's function-based transformations capture both low and high-frequency components:

  • Low-frequency: Smooth, gradual joint position changes (e.g., Walking, Eating)
  • High-frequency: Rapid, abrupt motions (e.g., sudden actions in Greeting)

4. Computational Complexity Analysis

  • Time Complexity: O(L||Â||₀F + LGF²)
    • First term: feature propagation (depends on graph edge count)
    • Second term: KAN transformation (G is grid size)
  • Space Complexity: O(LJF + 2kGLF²)
    • 2k from recursive computation of k-order splines

Since k and G are typically small, additional overhead is manageable

Experimental Setup

Datasets

1. Human3.6M

  • Scale: 11 actors (6 male, 5 female), 15 indoor activities
  • Acquisition: 50Hz, 4 synchronized cameras
  • Annotation: Precise 3D joint coordinates via motion capture
  • Split:
    • Training: 5 actors (S1, S5, S6, S7, S8)
    • Testing: 2 actors (S9, S11)
  • Preprocessing: Normalization, zero-centered at hip joint

2. MPI-INF-3DHP

  • Scale: 8 actors (4 male, 4 female), 8 activity sequences
  • Acquisition: 14 different viewpoints, indoor/outdoor scenes
  • Characteristics: More diverse than Human3.6M, including basic to dynamic high-intensity actions

Evaluation Metrics

Human3.6M

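  • Protocol #1: MPJPE (Mean Per-Joint Position Error) - mean Euclidean distance between predicted and ground-truth joint positions, in millimeters
  • Protocol #2: PA-MPJPE (Procrustes-Aligned MPJPE) - MPJPE computed after rigidly aligning the prediction to the ground truth (translation, rotation, and scale) via Procrustes analysis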

MPI-INF-3DHP

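  • PCK (Percentage of Correct Keypoints): Share of predicted joints that fall within a fixed distance threshold of the ground truth (commonly 150mm on this benchmark)
  • AUC (Area Under the Curve): Area under the PCK curve computed over a range of distance thresholds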
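
MPJPE and PCK can be sketched in a few lines. This is a minimal illustration that assumes poses are lists of (x, y, z) tuples in millimeters, already root-aligned. The 150mm default threshold is the value commonly used on MPI-INF-3DHP and is an assumption here, since this summary does not state it.

```python
import math

def mpjpe(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth joints."""
    dists = [math.dist(p, g) for p, g in zip(pred, gt)]
    return sum(dists) / len(dists)

def pck(pred, gt, threshold=150.0):
    """Percentage of joints whose error is within the threshold."""
    hits = sum(1 for p, g in zip(pred, gt) if math.dist(p, g) <= threshold)
    return 100.0 * hits / len(pred)

# Usage: two joints with errors of 5mm and 200mm.
pred_pose = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
gt_pose = [(3.0, 4.0, 0.0), (0.0, 0.0, 200.0)]
error_mm = mpjpe(pred_pose, gt_pose)
pck_150 = pck(pred_pose, gt_pose)
```

PA-MPJPE would apply the same `mpjpe` function after a Procrustes alignment of the prediction, which is omitted here because it requires a singular value decomposition.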

Comparison Methods

  • GCN Series: SemGCN, High-order GCN, CompGCN, Modulated GCN, Group GCN, MM-GCN, Flex-GCN
  • Hybrid Methods: GraphMLP (combining MLP and GCN)
  • Others: HOIF-Net, PoseGraphNet, WSGN, etc.

Implementation Details

  • Hardware: Single NVIDIA RTX A4500 GPU (20GB)
  • Framework: PyTorch
  • Optimizer: AMSGrad
  • Training Epochs: 30
  • Learning Rate: Initial 0.001, decayed by 0.99 every 4 epochs
  • Batch Size: 64
  • Embedding Dimension: F = 240
  • Key Hyperparameters: s = 0.2, α = 0.03 (determined via grid search)
  • Regularization: Dropout = 0.2 after each PoseKAN layer
  • Spline Settings: Order = 3, Grid Size = 5
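
The learning-rate schedule can be sketched as below. It reads "decayed by 0.99 every 4 epochs" as a multiplicative step decay, which is an interpretation rather than a confirmed detail of the training code. The function name is hypothetical.

```python
def learning_rate(epoch, base_lr=1e-3, decay=0.99, step=4):
    """Step decay: multiply the rate by `decay` once every `step` epochs."""
    return base_lr * decay ** (epoch // step)

# Learning rate at each of the 30 training epochs (0-indexed).
schedule = [learning_rate(epoch) for epoch in range(30)]
```

Under this reading the rate changes little over training, falling from 0.001 to roughly 0.00093 by the final epoch.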

Experimental Results

Main Results

Human3.6M - Protocol #1 (MPJPE)

Overall Performance:

  • PoseKAN: 46.7mm (optimal)
  • GraphMLP: 48.0mm (second)
  • Modulated GCN: 49.4mm
  • Relative Error Reduction:
    • vs GraphMLP: 2.7%
    • vs Modulated GCN: 5.47%
    • vs High-order GCN: 15.99%
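
The relative error reductions quoted here follow from (baseline − ours) / baseline. A minimal sketch, using a hypothetical helper name and only the MPJPE values listed above:

```python
def relative_error_reduction(baseline_mm, ours_mm):
    """Percentage by which the error drops relative to the baseline."""
    return 100.0 * (baseline_mm - ours_mm) / baseline_mm

# MPJPE values reported above (Protocol #1).
vs_graphmlp = relative_error_reduction(48.0, 46.7)
vs_modulated_gcn = relative_error_reduction(49.4, 46.7)
```

These evaluate to about 2.71% and 5.47%, in line with the figures listed for GraphMLP and Modulated GCN.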

Key Action Performance (occlusion challenges):

  • Eating: 44.4mm (significantly better than other methods)
  • Sitting: 54.6mm
  • Smoking: 46.1mm
  • Superior to Modulated GCN in 14 out of 15 actions

Human3.6M - Protocol #2 (PA-MPJPE)

Overall Performance:

  • PoseKAN: 38.3mm (optimal)
  • GraphMLP: 38.4mm (relative error reduction 0.26%)
  • Modulated GCN: 39.1mm (relative error reduction 2.04%)
  • High-order GCN: 43.7mm (relative error reduction 12.35%)

Advantageous Actions:

  • Superior to GraphMLP in 11 out of 15 actions
  • Superior to Modulated GCN in 13 out of 15 actions
  • Particularly outstanding in heavily occluded scenarios (Greeting, Sitting, Smoking)

MPI-INF-3DHP (Cross-dataset Generalization)

Training on Human3.6M, testing on MPI-INF-3DHP:

  • PCK: 86.0% (highest)
  • AUC: 52.9% (second, only behind ICFNet's 54.3%)
  • PCK improvement over ICFNet: 0.5%

Using Ground Truth 2D Input

  • MPJPE: 33.51mm
  • Relative Error Reduction:
    • vs SemGCN: 19.62%
    • vs High-order GCN: 14.29%
    • vs GraphMLP: 2.01%
  • PA-MPJPE: 28.01mm (optimal)

Ablation Studies

1. Impact of Initial Residual Connection (IRC)

  • Without IRC: MPJPE=34.44mm, PA-MPJPE=28.79mm
  • With IRC: MPJPE=33.51mm, PA-MPJPE=28.01mm
  • Improvement: 1.65% (MPJPE), 1.49% (PA-MPJPE)

Conclusion: IRC stabilizes training by preserving initial features, preventing information loss

2. Spline Order Impact

  • Order 2: MPJPE=47.43mm, PA-MPJPE=38.86mm
  • Order 3: MPJPE=46.77mm, PA-MPJPE=38.36mm (optimal)
  • Order 4: MPJPE=47.10mm, PA-MPJPE=38.59mm

Conclusion: Order 3 achieves optimal balance; higher orders increase complexity without benefit

3. Grid Size Impact

  • Size 4: MPJPE=47.40mm, PA-MPJPE=38.91mm
  • Size 5: MPJPE=46.77mm, PA-MPJPE=38.36mm (optimal)
  • Size 6: MPJPE=47.98mm, PA-MPJPE=39.11mm

Conclusion: Grid size 5 provides sufficient function approximation capability

4. Scaling Factor s Impact

Test range: s ∈ {0.1, 0.2, 0.3, 0.5, 0.7, 0.9}

  • Optimal value: s=0.2
  • Smaller s emphasizes local information while moderately considering distant nodes
  • Both excessively large and small s values degrade performance

5. Embedding Dimension Impact

  • 224: MPJPE=47.38mm
  • 240: MPJPE=46.77mm (optimal)
  • 256: MPJPE=47.29mm

Conclusion: 240 dimensions provide sufficient expressiveness without overfitting

Case Analysis

Qualitative Visualization (Figure 2) demonstrates PoseKAN predictions across various action categories:

  • Predicted 3D poses highly align with ground truth
  • Superior performance in self-occlusion scenarios (e.g., crossed arms, sitting)
  • GraphMLP occasionally produces unnatural joint positions, while PoseKAN maintains skeleton structure consistency
  • Precise joint placement and natural limb configuration validate the model's ability to alleviate depth ambiguity

Experimental Findings

  1. Clear Advantages of Learnable Functions: Compared to fixed activation functions, learnable functions on edges provide stronger adaptability
  2. Multi-hop Aggregation is Critical: Significantly improves handling of occlusion and complex poses
  3. High Parameter Efficiency: PoseKAN has only 5.72M parameters, far fewer than GraphMLP's 9.49M
  4. Strong Cross-dataset Generalization: Performance on MPI-INF-3DHP demonstrates good generalization
  5. Sensitivity to High-frequency Details: Shows clear advantages in actions requiring rapid motion details (e.g., Greeting)

Related Work

1. 3D Human Pose Estimation Methods Classification

Single-stage Methods

  • Directly regress 3D joint coordinates from images
  • Representatives: Integral Human Pose Regression, Compositional Human Pose Regression
  • Limitations: Susceptible to occlusion, lower accuracy

Two-stage Methods (2D-to-3D Lifting)

  • Stage 1: Detect 2D joint positions
  • Stage 2: Lift to 3D space
  • Representatives: SimpleBaseline, LCN
  • Advantages: Modular design, flexible 2D detector selection, stronger robustness
  • This paper belongs to this category

2. Graph-based 3D Pose Estimation

Standard GCN Methods

  • SemGCN: First application of GCN to 3D pose estimation
  • Limitations: One-hop neighbor aggregation, limited local receptive field

High-order GCN Extensions

  • High-order GCN: Extends to multi-hop neighborhoods
  • Modulated GCN: Adjacency matrix modulation, learns additional edges
  • GroupGCN: Group graph convolution
  • MM-GCN: Multi-hop modulated GCN, fuses multi-hop neighborhood information

Hybrid Architectures

  • GraphMLP: Combines MLP and GCN, leveraging global and local skeleton interactions
  • Limitations: Still uses fixed activation functions, suffers from spectral bias

3. Kolmogorov-Arnold Networks

  • Theoretical Foundation: Kolmogorov-Arnold representation theorem (any continuous multivariate function can be expressed as finite combinations of univariate functions)
  • KAN Networks: Replace fixed activations with learnable univariate functions, improving interpretability and adaptability
  • KAGNN: Recently extends KAN to graph learning (node/graph classification, link prediction)
  • This Paper's Innovation: First application of KAN to 2D-to-3D lifting for 3D pose estimation

4. Relative Advantages of This Paper

  • vs Standard GCN: Learnable functions vs fixed activation, multi-hop aggregation vs one-hop
  • vs High-order GCN: Adaptive function transformation vs fixed high-order convolution
  • vs GraphMLP: Alleviates spectral bias, stronger expressiveness
  • vs KAGNN: Specifically designed for pose estimation, introduces spectral modulation filter

Conclusions and Discussion

Main Conclusions

  1. Method Effectiveness: PoseKAN matches or surpasses state-of-the-art methods on the Human3.6M and MPI-INF-3DHP datasets
  2. Core Advantages:
    • Learnable functions provide stronger adaptability and expressiveness
    • Multi-hop feature aggregation effectively captures long-range dependencies
    • Alleviates spectral bias, simultaneously learning low and high-frequency components
  3. Practicality: High parameter efficiency (5.72M), manageable computational overhead, suitable for practical applications
  4. Generalization Capability: Excellent cross-dataset evaluation performance demonstrates good generalization

Limitations

Author-acknowledged Limitations

  1. Interpretability Challenges: While more interpretable than GCN, visualizing how each learnable activation adapts across different skeleton parts remains challenging
  2. Computational Cost: Learnable activations increase per-layer computational overhead; spline basis functions require additional memory
  3. Memory Consumption: Significant memory requirements for large-scale datasets and deep network training
  4. Optimization Space: Further improvements needed in computational efficiency, interpretability, and robustness

Potential Limitations

  1. Single-person Pose Limitation: Currently handles only single-person poses, not extended to multi-person scenarios
  2. 2D Detection Dependency: Performance depends on quality of 2D pose detector
  3. Static Graph Structure: While learning edge weights, topology is predefined
  4. Hyperparameter Sensitivity: Parameters like s and α require careful tuning

Future Directions

Author-proposed

  1. Multi-person Pose Estimation: Extension to multi-person scenarios handling interpersonal interactions
  2. Other Graph Learning Tasks: Action recognition, anomaly detection, etc.

Potential Extensions

  1. Temporal Modeling: Incorporating temporal information from video sequences
  2. End-to-end Learning: Joint optimization of 2D detection and 3D lifting
  3. Adaptive Graph Structure: Dynamically learn graph topology rather than predefined structure
  4. Lightweight Design: Model compression for mobile devices

In-depth Evaluation

Strengths

1. Method Novelty (★★★★★)

  • Theoretical Innovation: First extension of KAN to graph learning for 3D pose estimation, solid theoretical foundation
  • Technical Innovation: Ingenious spectral modulation filter design, effective multi-hop aggregation mechanism
  • Architecture Innovation: Reasonable combination of residual PoseKAN blocks and GRN

2. Experimental Sufficiency (★★★★☆)

  • Dataset Diversity: Human3.6M (indoor) + MPI-INF-3DHP (indoor/outdoor)
  • Comprehensive Comparison: Comparison with 10+ state-of-the-art methods
  • Detailed Ablation: IRC, spline order, grid size, scaling factor, embedding dimension, etc.
  • Qualitative Analysis: Visualization case comparisons provided

3. Result Convincingness (★★★★☆)

  • Leading Performance: Achieves SOTA or near-SOTA on multiple metrics
  • Good Consistency: Stable performance across datasets and protocols
  • Error Reduction: Substantial relative error reduction (up to 19.62%), though without formal significance tests
  • Parameter Efficiency: 5.72M parameters superior to GraphMLP's 9.49M

4. Writing Clarity (★★★★★)

  • Clear Structure: Logical progression from motivation to method to experiments
  • Mathematical Rigor: Complete formula derivations, clear symbol definitions
  • Rich Figures: Architecture diagrams, comparison tables, ablation charts comprehensive
  • Supplementary Materials: Detailed appendix explanations provided

Weaknesses

1. Method Limitations

  • Computational Overhead: While claimed manageable, spline computation and function learning do increase complexity
  • Memory Requirements: O(2kGLF²) memory complexity may become bottleneck in large-scale applications
  • Single-person Limitation: Doesn't handle multi-person scenarios, limiting practical application scope

2. Experimental Setup

  • Hyperparameter Search: s=0.2 and α=0.03 determined via grid search, but search range and process not reported
  • Statistical Testing: Lacks significance tests (e.g., t-test)
  • Failure Cases: No demonstration of typical failure cases and analysis of failure reasons

3. Analysis Depth

  • Interpretability: Claims greater interpretability than GCN but lacks specific function visualization or analysis
  • Frequency Analysis: Mentions alleviating spectral bias but lacks quantitative spectral analysis evidence
  • Error Distribution: No analysis of error distribution patterns across different joints and actions

4. Comparison Fairness

  • Input Consistency: Uses same 2D detector, but doesn't report detector error impact on results
  • Implementation Details: Baseline methods may use different training strategies, affecting fair comparison

Impact Assessment

1. Contribution to Field (★★★★☆)

  • Theoretical Contribution: Introduces KAN to graph-based pose estimation, opens new direction
  • Method Contribution: Spectral modulation filter and multi-hop aggregation mechanisms transferable to other graph tasks
  • Empirical Contribution: Establishes new performance benchmarks on standard datasets

2. Practical Value (★★★☆☆)

  • Performance Improvement: 2-19% relative improvement meaningful for practical applications
  • Parameter Efficiency: 5.72M parameters moderate, deployable
  • Limitations: Single-person limitation and computational overhead restrict real-time applications
  • Code Release: GitHub link provided, facilitating reproduction and application

3. Reproducibility (★★★★☆)

  • Sufficient Details: Hyperparameters, training strategies, network configuration detailed
  • Open Source Code: A GitHub repository is linked for the code
  • Standard Data: Uses public datasets and standard protocols
  • Potential Issues: KAN implementation details (spline computation) may have technical barriers

Applicable Scenarios

Suitable Applications

  1. High-precision Requirement Scenarios: Sports analysis, medical diagnosis requiring high accuracy
  2. Heavy Occlusion Scenarios: Multi-hop aggregation mechanism shows clear advantages under occlusion
  3. Complex Action Analysis: High-frequency detail capture capability suits rapid complex actions
  4. Offline Processing: Scenarios requiring high precision without real-time constraints

Less Suitable Scenarios

  1. Real-time Applications: Relatively high computational overhead, unsuitable for real-time processing
  2. Multi-person Scenarios: Current architecture doesn't consider multi-person interactions
  3. Resource-constrained Devices: Large memory requirements unsuitable for mobile deployment
  4. Large-scale Deployment: Training and inference costs may limit large-scale application

Extension Potential

  • Video Sequences: Extendable to temporal modeling
  • Other Graph Tasks: Action recognition, human mesh recovery, etc.
  • Multi-modal Fusion: Combining RGB, depth, IMU and other multi-source data
  • Transfer Learning: Pretrained model transfer to other pose estimation tasks

Key References

  1. Liu et al., 2025 - KAN: Kolmogorov-Arnold networks (ICLR 2025) - Original KAN proposal
  2. Zhao et al., 2019 - SemGCN - First GCN application to 3D pose estimation
  3. Zou & Tang, 2021 - Modulated GCN - Adjacency matrix modulation method
  4. Li et al., 2025 - GraphMLP - One of the strongest baselines
  5. Bresson et al., 2025 - KAGNNs - KAN application in graph learning
  6. Ionescu et al., 2013 - Human3.6M dataset - Standard benchmark dataset
  7. Martinez et al., 2017 - SimpleBaseline - Classical 2D-to-3D lifting method

Overall Assessment

  • Novelty: 9/10
  • Technical Quality: 8/10
  • Experimental Sufficiency: 8/10
  • Writing Quality: 9/10
  • Practical Value: 7/10
  • Comprehensive Score: 8.2/10

Recommendation: ★★★★☆ (Strongly recommended for reading, especially for researchers interested in graph neural networks and 3D vision)