On the Use of Hierarchical Vision Foundation Models for Low-Cost Human Mesh Recovery and Pose Estimation

Tarashima, Wang, Tagawa
In this work, we aim to develop simple and efficient models for human mesh recovery (HMR) and its predecessor task, human pose estimation (HPE). State-of-the-art HMR methods, such as HMR2.0 and its successors, rely on large, non-hierarchical vision transformers as encoders, which are inherited from the corresponding HPE models like ViTPose. To establish baselines across varying computational budgets, we first construct three lightweight HMR2.0 variants by adapting the corresponding ViTPose models. In addition, we propose leveraging the early stages of hierarchical vision foundation models (VFMs), including Swin Transformer, GroupMixFormer, and VMamba, as encoders. This design is motivated by the observation that intermediate stages of hierarchical VFMs produce feature maps with resolutions comparable to or higher than those of non-hierarchical counterparts. We conduct a comprehensive evaluation of 27 hierarchical-VFM-based HMR and HPE models, demonstrating that using only the first two or three stages achieves performance on par with full-stage models. Moreover, we show that the resulting truncated models exhibit better trade-offs between accuracy and computational efficiency compared to existing lightweight alternatives.

Basic Information

  • Paper ID: 2510.12660
  • Title: On the Use of Hierarchical Vision Foundation Models for Low-Cost Human Mesh Recovery and Pose Estimation
  • Authors: Shuhei Tarashima (NTT DOCOMO Business & Tokyo Metropolitan University), Yushan Wang (Tokyo Metropolitan University), Norio Tagawa (Tokyo Metropolitan University)
  • Category: cs.CV
  • Publication Date: October 14, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.12660

Abstract

This research aims to develop simple and efficient models for human mesh recovery (HMR) and human pose estimation (HPE). Current state-of-the-art HMR methods (such as HMR2.0 and its successors) rely on large non-hierarchical Vision Transformers as encoders, which are inherited from the corresponding HPE models (such as ViTPose). To establish baselines across different computational budgets, the authors first construct three lightweight HMR2.0 variants by adapting the corresponding ViTPose models. The paper then proposes using the early stages of hierarchical vision foundation models (VFMs) as encoders, including Swin Transformer, GroupMixFormer, and VMamba. This design is based on the observation that intermediate stages of hierarchical VFMs produce feature maps with resolution comparable to or higher than that of non-hierarchical models. The authors conduct a comprehensive evaluation of 27 HMR and HPE models based on hierarchical VFMs, demonstrating that using only the first two or three stages achieves performance comparable to full-stage models, and that the truncated models offer better trade-offs between accuracy and computational efficiency than existing lightweight alternatives.

Research Background and Motivation

Problem Definition

Human mesh recovery (HMR) is an important task in computer vision with broad applications in animation, virtual try-on, sports analysis, and human-computer interaction. The task aims to predict SMPL parameters from a single image to reconstruct a complete 3D human body model.

Limitations of Existing Methods

  1. High computational resource requirements: Current state-of-the-art methods such as HMR2.0 use a large ViT-H encoder, which requires substantial computational resources
  2. Deployment difficulties: Large models are challenging to deploy in real-time on mobile devices or edge computing environments
  3. Poor efficiency-performance trade-off: Existing lightweight methods often sacrifice significant performance for computational efficiency

Research Motivation

  1. Practical deployment needs: Urgent need for deploying HMR and HPE models in resource-constrained environments
  2. Architecture simplification: Improving efficiency while maintaining the simplicity of HMR2.0 architecture
  3. Potential of hierarchical VFMs: Exploring the application potential of hierarchical vision foundation models in this task

Core Contributions

  1. Construction of lightweight baselines: Instantiated three lightweight HMR2.0 variants by inheriting ViTPose-{L,B,S} encoders
  2. Proposed truncation strategy: Systematically explored the feasibility of using the first few stages of hierarchical VFMs as encoders
  3. Comprehensive experimental evaluation: Conducted comprehensive evaluation of 27 HMR and HPE models based on hierarchical VFMs
  4. Optimized performance-efficiency trade-off: Demonstrated that truncated hierarchical VFM models achieve better trade-offs between accuracy and computational efficiency

Methodology Details

Task Definition

  • HPE task: Predict 2D keypoint locations from input images (H×W, typically 256×192)
  • HMR task: Predict SMPL parameters (pose α, shape β, camera θ) from input images

Baseline Architecture

ViTPose Architecture

  • Encoder: ViT generates feature maps at H/16×W/16 resolution
  • Decoder: Deconvolutional layers + prediction layers output keypoint heatmaps

HMR2.0 Architecture

  • Encoder: ViT-based encoder produces feature maps
  • Decoder: Transformer-based decoder predicts SMPL parameters
  • Uses query token mechanism for feature aggregation
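
The query-token aggregation can be illustrated with a minimal single-head cross-attention sketch. This is a simplification under stated assumptions, not HMR2.0's actual implementation: it omits learned projections, multiple heads, and the remaining decoder layers, and the names `cross_attend`, `query`, `keys`, and `values` are illustrative only.

```python
import math


def cross_attend(query, keys, values):
    """Aggregate `values` with softmax weights from query-key dot products.

    query:  list of floats (one query token)
    keys:   list of key vectors, one per image token
    values: list of value vectors, one per image token
    Returns (aggregated vector, attention weights).
    """
    dim = len(query)
    # Scaled dot-product score between the query and every image token.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in keys]
    # Numerically stable softmax over the scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the value vectors.
    out = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
    return out, weights
```

In this sketch a single query vector pools information from all encoder tokens into one feature vector, which a prediction head could then map to SMPL parameters.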

Hierarchical VFM Encoder Design

Design Principles

  1. Maintain architectural simplicity: Avoid complex or highly specialized modules
  2. Architectural consistency: Maintain consistency with HMR2.0 and ViTPose baselines

Resolution Matching Strategy

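Hierarchical VFMs contain four stages. Relative to the H/16×W/16 feature map of a non-hierarchical ViT, the outputs of stages 2, 3, and 4 have 2×, 1×, and 1/2× the resolution along each spatial dimension:

  • Using all four stages (S4): Add 2×2 deconvolutional layers to upsample the stage 4 output to the target resolution
  • Using first three stages (S3): Feed the stage 3 output directly to the decoder, since its resolution already matches
  • Using first two stages (S2): Add stride-2 convolutional layers to downsample the stage 2 output to the target resolution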
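
The resolution arithmetic behind this strategy can be sketched as follows. The sketch assumes the common hierarchical layout of a 4× patch embedding followed by 2× downsampling between stages (as in Swin Transformer) and a target stride of 16 to match ViT. The function names are illustrative, and the returned strings only describe the adapter; they do not build any layers.

```python
def stage_stride(stage, patch_size=4):
    """Downsampling factor after `stage` (1-indexed).

    Assumes a 4x patch embedding and 2x downsampling between stages.
    """
    return patch_size * 2 ** (stage - 1)


def feature_size(height, width, stride):
    """Spatial size of the feature map for a given input size and stride."""
    return height // stride, width // stride


def adapter_for(last_stage, target_stride=16):
    """Describe the adapter needed to reach the ViT-like target stride."""
    stride = stage_stride(last_stage)
    if stride == target_stride:
        return "identity"
    if stride > target_stride:
        # Feature map is too coarse: upsample with a deconvolution.
        return f"deconv x{stride // target_stride}"
    # Feature map is too fine: downsample with a strided convolution.
    return f"conv stride {target_stride // stride}"


if __name__ == "__main__":
    for last_stage in (2, 3, 4):
        size = feature_size(256, 192, stage_stride(last_stage))
        print(f"S{last_stage}: {size} -> {adapter_for(last_stage)}")
```

For a 256×192 input, stage 3 yields a 16×12 map, the same size as a ViT with 16×16 patches, which is why the S3 configuration needs no adapter.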

Supported VFM Architectures

  1. Swin Transformer: Hierarchical Transformer based on shifted window attention
  2. GroupMixFormer (GMF): Efficient Transformer with group-mix attention
  3. VMamba (VM): Vision architecture based on state space models

Technical Innovations

  1. Truncation strategy: First systematic exploration of using only the first few stages of hierarchical VFMs
  2. Minimal modifications: Achieve resolution matching through simple convolutional/deconvolutional layers, maintaining architectural simplicity
  3. Multi-architecture validation: Verify method generality across different architecture types including Transformers and SSMs

Experimental Setup

Datasets

HPE:

  • Training: COCO dataset
  • Evaluation: COCO-val dataset

HMR:

  • Training: Mixed datasets (Human3.6M, MPI-INF-3DHP, COCO, MPII, InstaVariety, AVA, AI Challenger)
  • 2D pose evaluation: LSP-Extended, COCO-val, PoseTrack-val
  • 3D pose evaluation: 3DPW-test, Human3.6M-val

Evaluation Metrics

HPE:

  • Average Precision (AP) and Average Recall (AR)
  • Composite metric: Φ_{P,2D} = (AP + AR) / 2

HMR:

  • 2D: Percentage of Correct Keypoints (PCK) at 0.05 and 0.1 thresholds
  • 3D: MPJPE and PA-MPJPE error metrics
  • Composite metrics: Φ_{M,2D} and Φ_{M,3D}
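
A minimal sketch of how these metrics are typically computed is given below. It covers the HPE composite (the mean of AP and AR), PCK, and MPJPE. The normalization length `ref_size` for PCK is an assumption, since this summary does not state which reference length is used. PA-MPJPE (which adds a Procrustes alignment step) and the HMR composite metrics are omitted because their exact definitions are not given here.

```python
import math


def composite_hpe(ap, ar):
    """HPE composite score: the mean of Average Precision and Average Recall."""
    return 0.5 * (ap + ar)


def pck(pred, gt, ref_size, threshold):
    """Fraction of 2D keypoints within `threshold * ref_size` of the ground truth.

    `ref_size` is an assumed normalization length (for example, a bounding-box size).
    """
    limit = threshold * ref_size
    hits = sum(1 for p, g in zip(pred, gt) if math.dist(p, g) <= limit)
    return hits / len(gt)


def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance over 3D joints."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)
```

Higher is better for AP, AR, and PCK, while lower is better for MPJPE and PA-MPJPE.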

Comparison Methods

  • Existing lightweight methods: METRO series, FastMETRO, TORE, etc.
  • ViT baselines: HMR2.0-{L,B,S}, ViTPose-{H,L,B,S}
  • CNN methods: MEMe, SimCC-HRNet, etc.

Implementation Details

  • Hardware: 8×A100 GPUs for training, single A100 GPU for inference testing
  • Initialization: Hierarchical VFM encoders initialized with ImageNet-1K pretrained weights
  • Training protocol: Follow standard training settings of HMR2.0 and ViTPose
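
One way to initialize a truncated encoder from full pretrained weights is to drop the parameters of the unused stages before loading. The sketch below assumes a checkpoint stored as a name-to-tensor mapping in which stage parameters are named `stages.<index>.…`. That key layout is hypothetical: real checkpoints use library-specific names, so the `prefix` argument would need to be adapted.

```python
def keep_first_stages(state_dict, num_stages, prefix="stages."):
    """Return a copy of `state_dict` without parameters of stages >= num_stages.

    Assumes (hypothetically) that stage parameters are named
    "<prefix><stage_index>.<rest>" with 0-based stage indices.
    Parameters outside the prefix (such as a patch embedding) are kept as-is.
    """
    kept = {}
    for name, tensor in state_dict.items():
        if name.startswith(prefix):
            # Parse the stage index that follows the prefix.
            stage_index = int(name[len(prefix):].split(".", 1)[0])
            if stage_index >= num_stages:
                continue  # Drop weights belonging to a truncated stage.
        kept[name] = tensor
    return kept
```

Any parameters that sit after the last stage in the original model, such as a final normalization layer or classification head, would also need to be handled separately; this sketch leaves them untouched.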

Experimental Results

Main Results

Truncation Effect Verification

Experimental results show that truncated models using the first 2-3 stages achieve comparable or even better performance than full 4-stage models:

HPE Models (COCO dataset):

  • SwinPose-S-S3: AP=74.6 vs S4's 74.5 (+0.1)
  • GMFPose-T-S3: AP=75.7 vs S4's 75.8 (-0.1)
  • VMPose-T-S3: AP=75.3 vs S4's 75.2 (+0.1)

HMR Model Performance:

  • In 3D pose estimation, most S3 models slightly outperform S4 models
  • SwinHMR2.0-S-S3 maintains comparable performance to S4 while reducing parameters by 31.6%

Computational Efficiency Improvements

The truncation strategy significantly reduces computational complexity:

  • Parameter reduction: S3 models reduce parameters by 30-50% compared to S4 on average
  • FLOPs reduction: S2 models reduce computational load by 70-90% compared to S4
  • Inference acceleration: S2 models achieve 2-3× FPS improvement
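
To give a feel for where the parameter savings come from, the sketch below estimates per-stage block weights for the standard Swin-S configuration (depths 2, 2, 18, 2 and channel widths 96, 192, 384, 768). It is a back-of-the-envelope approximation, not the paper's measurement: each block is counted as roughly 12·C² weights (attention projections plus a 4× MLP), and biases, normalization layers, patch merging, relative position biases, and the task decoder are all ignored.

```python
SWIN_S_DEPTHS = (2, 2, 18, 2)
SWIN_S_DIMS = (96, 192, 384, 768)


def block_params(dim, mlp_ratio=4):
    """Approximate weight count of one Transformer block of width `dim`."""
    attention = 4 * dim * dim        # Q, K, V, and output projections
    mlp = 2 * mlp_ratio * dim * dim  # two linear layers with a hidden width of mlp_ratio * dim
    return attention + mlp


def stage_params(depths, dims):
    """Approximate block weights per stage."""
    return [depth * block_params(dim) for depth, dim in zip(depths, dims)]


def truncated_fraction(depths, dims, keep_stages):
    """Share of block weights retained when only the first `keep_stages` stages are kept."""
    per_stage = stage_params(depths, dims)
    return sum(per_stage[:keep_stages]) / sum(per_stage)


if __name__ == "__main__":
    for keep in (2, 3, 4):
        share = truncated_fraction(SWIN_S_DEPTHS, SWIN_S_DIMS, keep)
        print(f"first {keep} stages: {share:.1%} of block weights")
```

Under this approximation, stage 4 holds roughly 30% of the Swin-S block weights even though it has only two blocks, because block size grows with the square of the channel width. Stages 1 and 2 together hold only a few percent.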

Comparison with Existing Methods

Results on the Human3.6M dataset for 3D pose estimation show that the proposed hierarchical VFM models outperform existing lightweight methods at comparable computational budgets:

  • GMFHMR2.0-S-S3: 19.3M parameters, PA-MPJPE=35.4
  • Better efficiency-performance trade-off compared to ViT-based methods

Ablation Studies

Impact of Different Stage Numbers

Systematic evaluation of S2, S3, and S4 configurations:

  • S3 configuration: Optimal choice in most cases, balancing performance and efficiency
  • S2 configuration: Highest efficiency but with noticeable performance degradation in certain tasks
  • S4 configuration: Largest computational overhead with limited performance improvements

Comparison of Different VFM Architectures

  • Swin Transformer: Stable performance across most configurations
  • GroupMixFormer: Maintains good performance in S2 configuration
  • VMamba: Demonstrates good efficiency-performance trade-off

Case Analysis

Qualitative results show that truncated models achieve comparable visual quality to full models, accurately estimating human pose and shape, validating the method's effectiveness.

Related Work

Human Mesh Recovery

  • Early CNN methods: Based on traditional CNN architectures like ResNet, HRNet
  • Transformer methods: METRO, Mesh Graphormer and other hybrid CNN-Transformer architectures
  • Pure Transformer: HMR2.0, SMPLer-X and other fully Transformer-based methods

Human Pose Estimation

  • CNN optimization: MEMe, Lite-HRNet, LitePose and other lightweight CNN methods
  • Architecture search: CNF, ViPNAS and other neural architecture search methods
  • Transformer applications: ViTPose and other ViT-based methods

Vision Foundation Models

  • Non-hierarchical: ViT, DeiT and other models maintaining fixed resolution
  • Hierarchical: Swin Transformer, PVT and other multi-scale feature extraction models

Conclusions and Discussion

Main Conclusions

  1. Truncation strategy is effective: The first 2-3 stages of hierarchical VFMs contain sufficient semantic information for HMR and HPE tasks
  2. Significant efficiency improvements: Truncated models substantially reduce computational overhead while maintaining performance
  3. Good generality: The strategy demonstrates consistent effectiveness across different VFM architectures

Limitations

  1. Architectural constraints: Primarily applicable to hierarchical VFMs, not suitable for non-hierarchical models
  2. Task specificity: Mainly validated on HMR and HPE tasks; applicability to other vision tasks remains to be explored
  3. Pretraining dependency: Effectiveness depends on high-quality pretrained weights

Future Directions

  1. Extension to more VFMs: Explore additional hierarchical vision foundation models
  2. Full-body and multi-person scenarios: Validate effectiveness in more complex HMR tasks
  3. Architecture optimization: Further optimize the architecture design after truncation

In-Depth Evaluation

Strengths

  1. High practical value: Addresses efficiency issues in practical deployment with important application value
  2. Simple methodology: Maintains the simplicity of the original architecture, facilitating implementation and deployment
  3. Comprehensive experiments: Comprehensive evaluation of 27 models provides sufficient experimental evidence
  4. Deep insights: Reveals the richness of intermediate representations in hierarchical VFMs

Weaknesses

  1. Insufficient theoretical analysis: Lacks in-depth theoretical analysis of why the first few stages are sufficient
  2. Limited novelty: Primarily engineering optimization with relatively limited algorithmic innovation
  3. Limited evaluation scope: Mainly evaluated on standard datasets; robustness in real-world application scenarios remains to be verified

Impact

  1. Academic contribution: Provides new perspectives for designing efficient HMR/HPE models
  2. Practical value: Significant implications for mobile and edge computing deployment
  3. Reproducibility: Simple methodology facilitates reproduction and application

Applicable Scenarios

  1. Resource-constrained environments: Mobile devices, edge computing devices
  2. Real-time applications: Interactive applications requiring rapid response
  3. Large-scale deployment: Scenarios requiring simultaneous execution across multiple devices

References

The paper cites 118 relevant references covering important works in HMR, HPE, and vision foundation models, providing comprehensive background support for the research.


Overall Assessment: This is a highly practical engineering optimization paper that significantly improves the efficiency of HMR and HPE models through a simple yet effective truncation strategy. While algorithmic novelty is limited, it addresses important practical deployment challenges and possesses high application value. The experimental design is comprehensive, conclusions are reliable, and it provides valuable references for practical applications in related fields.