Adaptive graph Kolmogorov-Arnold network for 3D human pose estimation

Shahjahan, Hamza

Graph convolutional network (GCN)-based methods have shown strong performance in 3D human pose estimation by leveraging the natural graph structure of the human skeleton. However, their local receptive field limits their ability to capture long-range dependencies essential for handling occlusions and depth ambiguities. They also exhibit spectral bias, which prioritizes low-frequency components while struggling to model high-frequency details. In this paper, we introduce PoseKAN, an adaptive graph Kolmogorov-Arnold network (KAN) framework that extends KANs to graph-based learning for 2D-to-3D pose lifting from a single image. Unlike GCNs that use fixed activation functions, KANs employ learnable functions on graph edges, allowing data-driven, adaptive feature transformations. This enhances the model's adaptability and expressiveness in learning complex pose variations. Our model employs multi-hop feature aggregation, ensuring the body joints can leverage information from both local and distant neighbors, leading to improved spatial awareness. It also incorporates residual PoseKAN blocks for deeper feature refinement, and a global response normalization for improved feature selectivity and contrast. Extensive experiments on benchmark datasets demonstrate the competitive performance of our model against state-of-the-art methods.

Adaptive Graph Kolmogorov-Arnold Network for 3D Human Pose Estimation

Basic Information

Paper ID: 2511.08809
Title: Adaptive Graph Kolmogorov-Arnold Network for 3D Human Pose Estimation
Authors: Abu Taib Mohammed Shahjahan and A. Ben Hamza (Concordia University, Montreal, Canada)
Category: cs.CV (Computer Vision)
Submission Date: November 11, 2025 (arXiv)
Paper Link: https://arxiv.org/abs/2511.08809
Code Link: https://github.com/shahjahan0275/PoseKAN

Abstract

This paper proposes PoseKAN, an adaptive graph Kolmogorov-Arnold network framework for 3D human pose estimation. The method addresses three core limitations of traditional graph convolutional networks (GCNs): restricted local receptive fields, spectral bias, and insufficient expressiveness of fixed activation functions. PoseKAN replaces fixed activation functions with learnable function transformations on graph edges and combines them with multi-hop feature aggregation, enabling effective modeling of both local and long-range joint dependencies. Experiments on the Human3.6M and MPI-INF-3DHP benchmark datasets demonstrate that the method achieves performance comparable to state-of-the-art approaches.

Research Background and Motivation

1. Core Problem

3D human pose estimation aims to infer 3D coordinates of body joints from 2D images or videos, which is crucial for understanding human motion. However, it remains highly challenging due to inherent depth ambiguity and occlusion in input data.

2. Problem Significance

Broad Applications: Human-computer interaction, action recognition, sports analysis, medical rehabilitation, etc.
Technical Challenges: Missing depth information in monocular images, self-occlusion, complex pose variations

3. Limitations of Existing Methods

Three Major Limitations of GCN Methods:

Restricted Local Receptive Field: Primarily relies on one-hop neighbor aggregation, making it difficult to capture long-range dependencies between distant joints
Spectral Bias Problem: Because they use MLPs as core components, GCNs tend to learn low-frequency components while struggling to capture high-frequency details (e.g., rapid motion, fine joint interactions)
Insufficient Expressiveness: Uses predefined fixed activation functions and trainable weight matrices, lacking dynamic adaptability and interpretability

4. Research Motivation

Inspired by the Kolmogorov-Arnold representation theorem, KAN networks replace fixed activation functions with learnable univariate functions, providing stronger function approximation capability and interpretability. This paper extends KAN to graph learning, specifically targeting the 2D-to-3D lifting task for 3D pose estimation.

Core Contributions

Proposes PoseKAN Framework: First extension of Kolmogorov-Arnold networks to graph-structured data for 3D human pose estimation, enhancing model adaptability and generalization through learnable function-based transformations
Designs Multi-hop Feature Propagation Mechanism: Introduces scaling parameter s to balance local and global feature aggregation, with propagation matrix P = (1-s)Â + sÂ² considering both one-hop and two-hop neighbors, improving robustness to occlusion and depth ambiguity
Innovative Architecture Design:
- Residual PoseKAN blocks for deep feature refinement
- Global Response Normalization (GRN) enhancing feature selectivity and contrast
- Integration of GELU nonlinearity for enhanced expressiveness
Comprehensive Experimental Validation: Detailed comparative experiments and ablation studies on Human3.6M and MPI-INF-3DHP datasets, demonstrating method effectiveness

Method Details

Task Definition

Given training set D = {(xᵢ, yᵢ)}ᴺᵢ₌₁, where:

Input: xᵢ ∈ ℝ² represents 2D joint positions (provided by off-the-shelf 2D pose detectors)
Output: yᵢ ∈ ℝ³ represents corresponding ground-truth 3D joint positions
Objective: Learn regression model parameters ω for fω: X → Y

Human skeleton represented as graph G = (V, E, X):

V = {1,...,J} represents J nodes (joints)
E ⊆ V × V represents edge set
X ∈ ℝᴶˣᶠ represents node feature matrix
A is adjacency matrix, Â = D⁻¹/²AD⁻¹/² is normalized adjacency matrix
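
As a rough illustration of the normalization above, the following sketch builds Â = D⁻¹/²AD⁻¹/² for a small undirected graph. The 5-joint chain, the function name `normalized_adjacency`, and the absence of self-loops are assumptions made for illustration only; the skeleton layout and preprocessing used in the paper may differ.

```python
import math

def normalized_adjacency(num_joints, edges):
    """Return D^{-1/2} A D^{-1/2} for an undirected graph as a list of lists."""
    # Build the symmetric binary adjacency matrix A.
    A = [[0.0] * num_joints for _ in range(num_joints)]
    for i, j in edges:
        A[i][j] = 1.0
        A[j][i] = 1.0
    # Degree of each node (row sum of A).
    deg = [sum(row) for row in A]
    # Scale entry (i, j) by 1 / sqrt(d_i * d_j); non-edges stay zero.
    return [
        [A[i][j] / math.sqrt(deg[i] * deg[j]) if A[i][j] else 0.0
         for j in range(num_joints)]
        for i in range(num_joints)
    ]

# Hypothetical toy skeleton: a 5-joint chain 0-1-2-3-4 (not the Human3.6M layout).
toy_edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
A_hat = normalized_adjacency(5, toy_edges)
```

Each entry is divided by the square root of the product of the two endpoint degrees, so the result stays symmetric.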

Model Architecture

1. Kolmogorov-Arnold Network Foundation

The core of KAN layers is learnable activation functions, defined as:

ϕ(x) = wᵦb(x) + wₛspline(x)

where:

b(x) = SiLU(x) = x/(1+e⁻ˣ) is sigmoid linear unit
spline(x) = Σᵢ cᵢBᵢ(x) is weighted sum of B-spline basis functions
wᵦ, wₛ, cᵢ are learnable parameters
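
To make the activation concrete, here is a minimal sketch of ϕ(x) = wᵦ·SiLU(x) + wₛ·Σᵢ cᵢBᵢ(x) for a scalar input, using the Cox-de Boor recursion for the B-spline basis. The uniform grid on [-1, 1], the helper names, and the constant placeholder coefficients are assumptions for illustration; in a trained KAN the weights and coefficients are learned, and the paper's implementation may organize this differently.

```python
import math

def silu(x):
    """SiLU(x) = x / (1 + exp(-x))."""
    return x / (1.0 + math.exp(-x))

def bspline_basis(x, knots, k):
    """Evaluate all order-k B-spline basis functions at x (Cox-de Boor recursion)."""
    # Order-0 bases are indicators of the half-open knot intervals.
    B = [1.0 if knots[i] <= x < knots[i + 1] else 0.0
         for i in range(len(knots) - 1)]
    for d in range(1, k + 1):
        B = [
            (x - knots[i]) / (knots[i + d] - knots[i]) * B[i]
            + (knots[i + d + 1] - x) / (knots[i + d + 1] - knots[i + 1]) * B[i + 1]
            for i in range(len(knots) - d - 1)
        ]
    return B

def kan_activation(x, w_b, w_s, coeffs, knots, k=3):
    """phi(x) = w_b * SiLU(x) + w_s * sum_i c_i * B_i(x)."""
    basis = bspline_basis(x, knots, k)
    spline = sum(c * b for c, b in zip(coeffs, basis))
    return w_b * silu(x) + w_s * spline

# Uniform grid with G = 5 intervals on [-1, 1], extended by k knots on each side.
G, k = 5, 3
h = 2.0 / G
knots = [-1.0 + (i - k) * h for i in range(G + 2 * k + 1)]
coeffs = [1.0] * (G + k)  # placeholder coefficients; learned in practice
y = kan_activation(0.3, w_b=1.0, w_s=1.0, coeffs=coeffs, knots=knots, k=k)
```

With all coefficients set to 1, the spline term reduces to the sum of the basis functions, which is 1 inside the grid, so this placeholder setup gives ϕ(x) = SiLU(x) + 1 there.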

2. Spectral Modulation Filter

The proposed innovative spectral modulation filter:

hₛ(λ) = 1/((1+s)λ - sλ²)

where s ∈ (0,1) is a scaling parameter that controls how strongly the filter attenuates different frequency components. The filter exhibits adaptive low-pass characteristics.

The filtered features are computed by fixed-point iteration: H⁽ᵗ⁺¹⁾ = ((1-s)I + sÂ)ÂH⁽ᵗ⁾ + X
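
The link between the filter and the iteration can be sketched as follows. This derivation assumes that λ denotes an eigenvalue of the normalized Laplacian 𝓛 = I − Â (the Laplacian is not defined in this summary, so this is an assumption) and treats the inverse formally, ignoring the λ = 0 mode where hₛ is undefined.

```latex
\begin{aligned}
H^{\ast} &= \big((1-s)I + s\hat{A}\big)\hat{A}\,H^{\ast} + X
  \;\Longrightarrow\; (I - P)\,H^{\ast} = X,
  \qquad P = (1-s)\hat{A} + s\hat{A}^{2},\\
I - P &= I - (1-s)(I - \mathcal{L}) - s(I - \mathcal{L})^{2}
       = (1+s)\mathcal{L} - s\mathcal{L}^{2},\\
H^{\ast} &= \big((1+s)\mathcal{L} - s\mathcal{L}^{2}\big)^{-1} X
          = h_s(\mathcal{L})\,X .
\end{aligned}
```

Under these assumptions, a fixed point of the iteration is the input X filtered by hₛ, and one step of the iteration has the same form as the propagation rule ((1-s)Â + sÂ²)H + X.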

3. PoseKAN Layer Update Rule

Core layer-wise update formula:

H⁽ˡ⁺¹⁾ = KAN⁽ˡ⁾(((1-s)Â + sÂ²)H⁽ˡ⁾ + X)

Decomposable into two operations:

Feature Propagation: G⁽ˡ⁾ = PH⁽ˡ⁾ + X

where P = (1-s)Â + sÂ² is propagation matrix balancing one-hop and two-hop neighbor information

Feature Embedding: H⁽ˡ⁺¹⁾ = KAN⁽ˡ⁾(G⁽ˡ⁾)

Each graph edge is associated with a learnable univariate function

4. Overall Architecture

Initial PoseKAN Layer: Maps 2D input to latent space
4 Residual PoseKAN Blocks: Each block contains
- 5 PoseKAN layers for hierarchical feature learning
- Layer normalization for training stability
- Additional PoseKAN layer + GELU nonlinearity
- Residual connections preventing gradient vanishing
Global Response Normalization (GRN): Calibrates feature magnitude before prediction
Final PoseKAN Layer: Projects back to 3D pose space
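
For readers unfamiliar with GRN, the sketch below follows the formulation introduced with ConvNeXt V2, adapted to a J × F joint-feature matrix by treating joints as the dimension that is aggregated over. How PoseKAN applies GRN exactly is not stated in this summary, so the axis choice, the function name, and the `eps` value are assumptions.

```python
import math

def global_response_norm(X, gamma, beta, eps=1e-6):
    """GRN over a J x F feature matrix, following the ConvNeXt V2 formulation."""
    J, F = len(X), len(X[0])
    # Global aggregation: L2 norm of each feature channel across all joints.
    g = [math.sqrt(sum(X[j][f] ** 2 for j in range(J))) for f in range(F)]
    # Divisive normalization: each channel norm relative to the mean norm.
    mean_g = sum(g) / F
    n = [gf / (mean_g + eps) for gf in g]
    # Calibration with learnable scale/shift plus a residual connection.
    return [[gamma[f] * X[j][f] * n[f] + beta[f] + X[j][f] for f in range(F)]
            for j in range(J)]

# Toy input with 2 joints and 2 channels; gamma = beta = 0 makes GRN an identity map.
X_toy = [[1.0, 2.0], [1.0, 2.0]]
out_identity = global_response_norm(X_toy, gamma=[0.0, 0.0], beta=[0.0, 0.0])
out_scaled = global_response_norm(X_toy, gamma=[1.0, 1.0], beta=[0.0, 0.0])
```

Channels with a larger-than-average norm are amplified and weaker ones are damped, which is the sense in which GRN increases feature contrast.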

5. Loss Function

Hybrid loss function (inspired by elastic net):

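L = (1/N) Σᵢ [(1-α)||yᵢ - ŷᵢ||₂² + α||yᵢ - ŷᵢ||₁]

where α ∈ [0,1] controls the balance between the squared-error (MSE) term and the absolute-error (MAE) term, and ŷᵢ is the predicted 3D pose for sample i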
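
A minimal sketch of this hybrid loss, assuming the 1/N factor averages both terms over the N training samples and that each pose is given as a flat list of coordinates. The function name and the toy values are illustrative only.

```python
def hybrid_loss(targets, preds, alpha=0.03):
    """(1/N) * sum_i [(1 - alpha) * ||y_i - yhat_i||_2^2 + alpha * ||y_i - yhat_i||_1]."""
    total = 0.0
    for y, y_hat in zip(targets, preds):
        diff = [a - b for a, b in zip(y, y_hat)]
        sq_l2 = sum(d * d for d in diff)   # squared Euclidean norm
        l1 = sum(abs(d) for d in diff)     # L1 norm
        total += (1.0 - alpha) * sq_l2 + alpha * l1
    return total / len(targets)

# Two toy samples, each a single 3D joint (values in mm).
targets = [[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]]
preds = [[1.0, 2.0, 2.0], [10.0, 0.0, 0.0]]
loss = hybrid_loss(targets, preds, alpha=0.5)
```

Setting `alpha=0` recovers a pure squared-error loss and `alpha=1` a pure absolute-error loss. The reported α = 0.03 keeps the loss dominated by the squared term.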

Technical Innovation Points

1. Learnable Function Transformation vs Fixed Activation

GCN: Uses fixed activation functions (e.g., ReLU) and trainable weight matrices, essentially node-level linear mappings
PoseKAN: Employs learnable univariate functions on edges, providing data-driven adaptive feature transformations with stronger expressiveness

2. Multi-hop Dependency Modeling

Through propagation matrix P = (1-s)Â + sÂ²:

Explicitly combines one-hop and two-hop neighbor information
Parameter s adjusts balance between local vs global information
Avoids explicit computation of Â² (using right-to-left multiplication strategy)
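
The right-to-left strategy can be sketched as follows: instead of forming Â² and then multiplying by H, the product is evaluated as Â(ÂH). The dense list-of-lists representation, the helper names, and the toy matrices are assumptions for illustration; a practical implementation would more likely use sparse tensor operations.

```python
def matmul(A, B):
    """Multiply two dense matrices given as lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def propagate(A_hat, H, X, s=0.2):
    """Compute G = ((1 - s) * A_hat + s * A_hat^2) H + X without forming A_hat^2."""
    AH = matmul(A_hat, H)     # one-hop aggregation, J x F
    AAH = matmul(A_hat, AH)   # two-hop aggregation evaluated as A_hat (A_hat H)
    return [
        [(1.0 - s) * AH[i][f] + s * AAH[i][f] + X[i][f] for f in range(len(H[0]))]
        for i in range(len(H))
    ]

# Toy 3-joint example with a hand-written symmetric matrix (not a real skeleton).
A_hat = [[0.0, 0.5, 0.0],
         [0.5, 0.0, 0.5],
         [0.0, 0.5, 0.0]]
H = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
X = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
G = propagate(A_hat, H, X, s=0.2)
```

Both products involve a J × J matrix and a J × F matrix, so the J × J product Â² is never materialized. With a sparse Â, each product costs on the order of ||Â||₀F operations, matching the first term of the time complexity below.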

3. Alleviating Spectral Bias

KAN's function-based transformations capture both low and high-frequency components:

Low-frequency: Smooth, gradual joint position changes (e.g., Walking, Eating)
High-frequency: Rapid, abrupt motions (e.g., sudden actions in Greeting)

4. Computational Complexity Analysis

Time Complexity: O(L||Â||₀F + LGF²)
- First term: feature propagation (depends on graph edge count)
- Second term: KAN transformation (G is grid size)
Space Complexity: O(LJF + 2kGLF²)
- 2k from recursive computation of k-order splines

Since k and G are typically small, additional overhead is manageable

Experimental Setup

Datasets

1. Human3.6M

Scale: 11 actors (6 male, 5 female), 15 indoor activities
Acquisition: 50Hz, 4 synchronized cameras
Annotation: Precise 3D joint coordinates via motion capture
Split:
- Training: 5 actors (S1, S5, S6, S7, S8)
- Testing: 2 actors (S9, S11)
Preprocessing: Normalization, zero-centered at hip joint
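
Zero-centering at the hip amounts to subtracting the root joint from every joint of a pose. A minimal sketch, assuming the hip is stored at index 0 (an assumption; the actual joint ordering depends on the dataset loader):

```python
def center_at_root(pose, root_index=0):
    """Subtract the root joint's coordinates from every joint of a single pose."""
    root = pose[root_index]
    return [[c - r for c, r in zip(joint, root)] for joint in pose]

# Toy 3-joint pose in mm; treating index 0 as the hip is an assumption.
pose = [[100.0, 200.0, 300.0], [110.0, 250.0, 300.0], [90.0, 150.0, 310.0]]
centered = center_at_root(pose)
```

After centering, the root joint sits at the origin and all other joints are expressed relative to it.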

2. MPI-INF-3DHP

Scale: 8 actors (4 male, 4 female), 8 activity sequences
Acquisition: 14 different viewpoints, indoor/outdoor scenes
Characteristics: More diverse than Human3.6M, including basic to dynamic high-intensity actions

Evaluation Metrics

Human3.6M

Protocol #1: MPJPE (Mean Per-Joint Position Error) - mean Euclidean distance between predicted and ground-truth joints, in millimeters
Protocol #2: PA-MPJPE (Procrustes-Aligned MPJPE) - MPJPE computed after rigidly aligning the prediction to the ground truth via Procrustes analysis
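
As a small worked example of Protocol #1, the sketch below computes MPJPE for a single pose. The function name and toy coordinates are illustrative. PA-MPJPE would additionally align the prediction to the ground truth before measuring the error, and that alignment step is not shown here.

```python
import math

def mpjpe(pred, target):
    """Mean Euclidean distance between predicted and ground-truth joints (one pose)."""
    dists = [math.dist(p, t) for p, t in zip(pred, target)]
    return sum(dists) / len(dists)

# Toy 2-joint pose in mm: one joint is off by 5 mm, the other is exact.
target_pose = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
pred_pose = [[3.0, 4.0, 0.0], [0.0, 0.0, 0.0]]
error_mm = mpjpe(pred_pose, target_pose)
```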

MPI-INF-3DHP

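PCK (Percentage of Correct Keypoints): Percentage of predicted joints whose distance to the ground truth falls within a threshold (conventionally 150 mm on MPI-INF-3DHP)
AUC (Area Under Curve): Area under the PCK curve computed over a range of distance thresholds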
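
A minimal sketch of both metrics for a single pose. The 150 mm threshold and the 0 to 150 mm sweep in 5 mm steps follow the common convention for this benchmark and are assumptions here, since this summary does not state the exact values used. The AUC is approximated as the mean PCK over the sampled thresholds.

```python
import math

def pck(pred, target, threshold=150.0):
    """Percentage of joints whose error is at most `threshold` (same units as the poses)."""
    hits = sum(1 for p, t in zip(pred, target) if math.dist(p, t) <= threshold)
    return 100.0 * hits / len(target)

def auc(pred, target, thresholds):
    """Mean PCK over a set of thresholds, approximating the area under the PCK curve."""
    thresholds = list(thresholds)
    return sum(pck(pred, target, th) for th in thresholds) / len(thresholds)

# Toy 4-joint pose in mm with errors of 10, 100, 149 and 200 mm.
target_joints = [[0.0, 0.0, 0.0]] * 4
pred_joints = [[10.0, 0.0, 0.0], [100.0, 0.0, 0.0], [149.0, 0.0, 0.0], [200.0, 0.0, 0.0]]
score = pck(pred_joints, target_joints)
area = auc(pred_joints, target_joints, thresholds=range(0, 151, 5))
```

In this toy case three of the four joints fall within 150 mm. The AUC is lower than the PCK because it also counts the stricter thresholds, which penalizes joints that only barely pass.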

Comparison Methods

GCN Series: SemGCN, High-order GCN, CompGCN, Modulated GCN, Group GCN, MM-GCN, Flex-GCN
Hybrid Methods: GraphMLP (combining MLP and GCN)
Others: HOIF-Net, PoseGraphNet, WSGN, etc.

Implementation Details

Hardware: Single NVIDIA RTX A4500 GPU (20GB)
Framework: PyTorch
Optimizer: AMSGrad
Training Epochs: 30
Learning Rate: Initial 0.001, decayed by 0.99 every 4 epochs
Batch Size: 64
Embedding Dimension: F = 240
Key Hyperparameters: s = 0.2, α = 0.03 (determined via grid search)
Regularization: Dropout = 0.2 after each PoseKAN layer
Spline Settings: Order = 3, Grid Size = 5
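
The learning-rate schedule above can be read as a step decay. The sketch below assumes epochs are counted from 0 and that the factor 0.99 is applied once per completed block of 4 epochs. The paper's code may apply the decay at a different granularity, so the function is illustrative only.

```python
def learning_rate(epoch, base_lr=0.001, gamma=0.99, step=4):
    """Step decay: multiply base_lr by gamma once per `step` completed epochs."""
    return base_lr * gamma ** (epoch // step)

# Learning rate for each of the 30 training epochs (0-indexed).
schedule = [learning_rate(e) for e in range(30)]
```

Under these assumptions the rate is decayed seven times over 30 epochs, ending at 0.001 × 0.99⁷, so the schedule is very gentle. In PyTorch, a schedule of this shape corresponds to `torch.optim.lr_scheduler.StepLR` with `step_size=4` and `gamma=0.99`.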

Experimental Results

Main Results

Human3.6M - Protocol #1 (MPJPE)

Overall Performance:

PoseKAN: 46.7mm (optimal)
GraphMLP: 48.0mm (second)
Modulated GCN: 49.4mm
Relative Error Reduction:
- vs GraphMLP: 2.7%
- vs Modulated GCN: 5.47%
- vs High-order GCN: 15.99%
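
The relative error reductions are computed as (baseline − PoseKAN) / baseline. A short check against the two baselines whose MPJPE values are listed above (the function name is illustrative):

```python
def relative_error_reduction(baseline_mm, ours_mm):
    """Percentage by which `ours_mm` lowers the error relative to `baseline_mm`."""
    return 100.0 * (baseline_mm - ours_mm) / baseline_mm

vs_graphmlp = relative_error_reduction(48.0, 46.7)       # about 2.71
vs_modulated_gcn = relative_error_reduction(49.4, 46.7)  # about 5.47
```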

Key Action Performance (occlusion challenges):

Eating: 44.4mm (significantly better than other methods)
Sitting: 54.6mm
Smoking: 46.1mm
Superior to Modulated GCN in 14 out of 15 actions

Human3.6M - Protocol #2 (PA-MPJPE)

Overall Performance:

PoseKAN: 38.3mm (optimal)
GraphMLP: 38.4mm (relative error reduction 0.26%)
Modulated GCN: 39.1mm (relative error reduction 2.04%)
High-order GCN: 43.7mm (relative error reduction 12.35%)

Advantageous Actions:

Superior to GraphMLP in 11 out of 15 actions
Superior to Modulated GCN in 13 out of 15 actions
Particularly outstanding in heavily occluded scenarios (Greeting, Sitting, Smoking)

MPI-INF-3DHP (Cross-dataset Generalization)

Training on Human3.6M, testing on MPI-INF-3DHP:

PCK: 86.0% (highest)
AUC: 52.9% (second, only behind ICFNet's 54.3%)
PCK improvement over ICFNet: 0.5%

Using Ground Truth 2D Input

MPJPE: 33.51mm
Relative Error Reduction:
- vs SemGCN: 19.62%
- vs High-order GCN: 14.29%
- vs GraphMLP: 2.01%
PA-MPJPE: 28.01mm (optimal)

Ablation Studies

1. Impact of Initial Residual Connection (IRC)

Configuration	MPJPE	PA-MPJPE
Without IRC	34.44mm	28.79mm
With IRC	33.51mm	28.01mm
Improvement	1.65%	1.49%

Conclusion: IRC stabilizes training by preserving initial features, preventing information loss

2. Spline Order Impact

Order 2: MPJPE=47.43mm, PA-MPJPE=38.86mm
Order 3: MPJPE=46.77mm, PA-MPJPE=38.36mm (optimal)
Order 4: MPJPE=47.10mm, PA-MPJPE=38.59mm

Conclusion: Order 3 achieves optimal balance; higher orders increase complexity without benefit

3. Grid Size Impact

Size 4: MPJPE=47.40mm, PA-MPJPE=38.91mm
Size 5: MPJPE=46.77mm, PA-MPJPE=38.36mm (optimal)
Size 6: MPJPE=47.98mm, PA-MPJPE=39.11mm

Conclusion: Grid size 5 provides sufficient function approximation capability

4. Scaling Factor s Impact

Test range: s ∈ {0.1, 0.2, 0.3, 0.5, 0.7, 0.9}

Optimal value: s=0.2
Smaller s emphasizes local information while moderately considering distant nodes
Both excessively large and small s values degrade performance

5. Embedding Dimension Impact

224: MPJPE=47.38mm
240: MPJPE=46.77mm (optimal)
256: MPJPE=47.29mm

Conclusion: 240 dimensions provide sufficient expressiveness without overfitting

Case Analysis

Qualitative Visualization (Figure 2) demonstrates PoseKAN predictions across various action categories:

Predicted 3D poses highly align with ground truth
Superior performance in self-occlusion scenarios (e.g., crossed arms, sitting)
GraphMLP occasionally produces unnatural joint positions, while PoseKAN maintains skeleton structure consistency
Precise joint placement and natural limb configuration validate the model's ability to alleviate depth ambiguity

Experimental Findings

Clear Advantages of Learnable Functions: Compared to fixed activation functions, learnable functions on edges provide stronger adaptability
Multi-hop Aggregation is Critical: Significantly improves handling of occlusion and complex poses
High Parameter Efficiency: PoseKAN has only 5.72M parameters, far fewer than GraphMLP's 9.49M
Strong Cross-dataset Generalization: Performance on MPI-INF-3DHP demonstrates good generalization
Sensitivity to High-frequency Details: Shows clear advantages in actions requiring rapid motion details (e.g., Greeting)

Related Work

1. 3D Human Pose Estimation Methods Classification

Single-stage Methods

Directly regress 3D joint coordinates from images
Representatives: Integral Human Pose Regression, Compositional Human Pose Regression
Limitations: Susceptible to occlusion, lower accuracy

Two-stage Methods (2D-to-3D Lifting)

Stage 1: Detect 2D joint positions
Stage 2: Lift to 3D space
Representatives: SimpleBaseline, LCN
Advantages: Modular design, flexible 2D detector selection, stronger robustness
This paper belongs to this category

2. Graph-based 3D Pose Estimation

Standard GCN Methods

SemGCN: First application of GCN to 3D pose estimation
Limitations: One-hop neighbor aggregation, limited local receptive field

High-order GCN Extensions

High-order GCN: Extends to multi-hop neighborhoods
Modulated GCN: Adjacency matrix modulation, learns additional edges
GroupGCN: Group graph convolution
MM-GCN: Multi-hop modulated GCN, fuses multi-hop neighborhood information

Hybrid Architectures

GraphMLP: Combines MLP and GCN, leveraging global and local skeleton interactions
Limitations: Still uses fixed activation functions, suffers from spectral bias

3. Kolmogorov-Arnold Networks

Theoretical Foundation: Kolmogorov-Arnold representation theorem (any continuous multivariate function on a bounded domain can be written as a finite composition of continuous univariate functions and addition)
KAN Networks: Replace fixed activations with learnable univariate functions, improving interpretability and adaptability
KAGNN: Recently extends KAN to graph learning (node/graph classification, link prediction)
This Paper's Innovation: First application of KAN to 2D-to-3D lifting for 3D pose estimation

4. Relative Advantages of This Paper

vs Standard GCN: Learnable functions vs fixed activation, multi-hop aggregation vs one-hop
vs High-order GCN: Adaptive function transformation vs fixed high-order convolution
vs GraphMLP: Alleviates spectral bias, stronger expressiveness
vs KAGNN: Specifically designed for pose estimation, introduces spectral modulation filter

Conclusions and Discussion

Main Conclusions

Method Effectiveness: PoseKAN matches or surpasses state-of-the-art methods on the Human3.6M and MPI-INF-3DHP datasets
Core Advantages:
- Learnable functions provide stronger adaptability and expressiveness
- Multi-hop feature aggregation effectively captures long-range dependencies
- Alleviates spectral bias, simultaneously learning low and high-frequency components
Practicality: High parameter efficiency (5.72M), manageable computational overhead, suitable for practical applications
Generalization Capability: Excellent cross-dataset evaluation performance demonstrates good generalization

Limitations

Author-acknowledged Limitations

Interpretability Challenges: While more interpretable than GCN, visualizing how each learnable activation adapts across different skeleton parts remains challenging
Computational Cost: Learnable activations increase per-layer computational overhead; spline basis functions require additional memory
Memory Consumption: Significant memory requirements for large-scale datasets and deep network training
Optimization Space: Further improvements needed in computational efficiency, interpretability, and robustness

Potential Limitations

Single-person Pose Limitation: Currently handles only single-person poses, not extended to multi-person scenarios
2D Detection Dependency: Performance depends on quality of 2D pose detector
Static Graph Structure: Although the functions on edges are learned, the skeleton topology itself is predefined
Hyperparameter Sensitivity: Parameters like s and α require careful tuning

Future Directions

Author-proposed

Multi-person Pose Estimation: Extension to multi-person scenarios handling interpersonal interactions
Other Graph Learning Tasks: Action recognition, anomaly detection, etc.

Potential Extensions

Temporal Modeling: Incorporating temporal information from video sequences
End-to-end Learning: Joint optimization of 2D detection and 3D lifting
Adaptive Graph Structure: Dynamically learn graph topology rather than predefined structure
Lightweight Design: Model compression for mobile devices

Strengths

1. Innovation

Theoretical Innovation: First extension of KAN to graph learning for 3D pose estimation, solid theoretical foundation
Technical Innovation: Ingenious spectral modulation filter design, effective multi-hop aggregation mechanism
Architecture Innovation: Reasonable combination of residual PoseKAN blocks and GRN

2. Experimental Sufficiency (★★★★☆)

Dataset Diversity: Human3.6M (indoor) + MPI-INF-3DHP (indoor/outdoor)
Comprehensive Comparison: Comparison with 10+ state-of-the-art methods
Detailed Ablation: IRC, spline order, grid size, scaling factor, embedding dimension, etc.
Qualitative Analysis: Visualization case comparisons provided

3. Result Convincingness (★★★★☆)

Leading Performance: Achieves SOTA or near-SOTA on multiple metrics
Good Consistency: Stable performance across datasets and protocols
Magnitude of Improvement: Substantial relative error reduction against some baselines (up to 19.62%), though no statistical significance tests are reported
Parameter Efficiency: 5.72M parameters superior to GraphMLP's 9.49M

4. Writing Clarity (★★★★★)

Clear Structure: Logical progression from motivation to method to experiments
Mathematical Rigor: Complete formula derivations, clear symbol definitions
Rich Figures: Architecture diagrams, comparison tables, ablation charts comprehensive
Supplementary Materials: Detailed appendix explanations provided

Weaknesses

1. Method Limitations

Computational Overhead: While claimed manageable, spline computation and function learning do increase complexity
Memory Requirements: O(2kGLF²) memory complexity may become bottleneck in large-scale applications
Single-person Limitation: Doesn't handle multi-person scenarios, limiting practical application scope

2. Experimental Setup

Hyperparameter Search: s=0.2 and α=0.03 determined via grid search, but search range and process not reported
Statistical Testing: Lacks significance tests (e.g., t-test)
Failure Cases: No demonstration of typical failure cases and analysis of failure reasons

3. Analysis Depth

Interpretability: Claims greater interpretability than GCN but lacks specific function visualization or analysis
Frequency Analysis: Mentions alleviating spectral bias but lacks quantitative spectral analysis evidence
Error Distribution: No analysis of error distribution patterns across different joints and actions

4. Comparison Fairness

Input Consistency: Uses same 2D detector, but doesn't report detector error impact on results
Implementation Details: Baseline methods may use different training strategies, affecting fair comparison

Impact Assessment

1. Contribution to Field (★★★★☆)

Theoretical Contribution: Introduces KAN to graph-based pose estimation, opens new direction
Method Contribution: Spectral modulation filter and multi-hop aggregation mechanisms transferable to other graph tasks
Empirical Contribution: Establishes new performance benchmarks on standard datasets

2. Practical Value (★★★☆☆)

Performance Improvement: Relative error reductions range from 0.26% to 19.62% depending on baseline and protocol; the larger gains are meaningful for practical applications
Parameter Efficiency: 5.72M parameters moderate, deployable
Limitations: Single-person limitation and computational overhead restrict real-time applications
Code Release: GitHub link provided, facilitating reproduction and application

3. Reproducibility (★★★★☆)

Sufficient Details: Hyperparameters, training strategies, network configuration detailed
Open Source Code: A public code repository is linked (https://github.com/shahjahan0275/PoseKAN)
Standard Data: Uses public datasets and standard protocols
Potential Issues: KAN implementation details (spline computation) may have technical barriers

Applicable Scenarios

Suitable Applications

High-precision Requirement Scenarios: Sports analysis, medical diagnosis requiring high accuracy
Heavy Occlusion Scenarios: Multi-hop aggregation mechanism shows clear advantages under occlusion
Complex Action Analysis: High-frequency detail capture capability suits rapid complex actions
Offline Processing: Scenarios requiring high precision without real-time constraints

Less Suitable Scenarios

Real-time Applications: Relatively high computational overhead, unsuitable for real-time processing
Multi-person Scenarios: Current architecture doesn't consider multi-person interactions
Resource-constrained Devices: Large memory requirements unsuitable for mobile deployment
Large-scale Deployment: Training and inference costs may limit large-scale application

Extension Potential

Video Sequences: Extendable to temporal modeling
Other Graph Tasks: Action recognition, human mesh recovery, etc.
Multi-modal Fusion: Combining RGB, depth, IMU and other multi-source data
Transfer Learning: Pretrained model transfer to other pose estimation tasks

Key References

Liu et al., 2025 - KAN: Kolmogorov-Arnold networks (ICLR 2025) - Original KAN proposal
Zhao et al., 2019 - SemGCN - First GCN application to 3D pose estimation
Zou & Tang, 2021 - Modulated GCN - Adjacency matrix modulation method
Li et al., 2025 - GraphMLP - One of the strongest baselines
Bresson et al., 2025 - KAGNNs - KAN application in graph learning
Ionescu et al., 2013 - Human3.6M dataset - Standard benchmark dataset
Martinez et al., 2017 - SimpleBaseline - Classical 2D-to-3D lifting method

Overall Assessment

Novelty: 9/10
Technical Quality: 8/10
Experimental Sufficiency: 8/10
Writing Quality: 9/10
Practical Value: 7/10
Comprehensive Score: 8.2/10

Recommendation: ★★★★☆ (Strongly recommended for reading, especially for researchers interested in graph neural networks and 3D vision)