On the Use of Hierarchical Vision Foundation Models for Low-Cost Human Mesh Recovery and Pose Estimation

Tarashima, Wang, Tagawa

In this work, we aim to develop simple and efficient models for human mesh recovery (HMR) and its predecessor task, human pose estimation (HPE). State-of-the-art HMR methods, such as HMR2.0 and its successors, rely on large, non-hierarchical vision transformers as encoders, which are inherited from the corresponding HPE models like ViTPose. To establish baselines across varying computational budgets, we first construct three lightweight HMR2.0 variants by adapting the corresponding ViTPose models. In addition, we propose leveraging the early stages of hierarchical vision foundation models (VFMs), including Swin Transformer, GroupMixFormer, and VMamba, as encoders. This design is motivated by the observation that intermediate stages of hierarchical VFMs produce feature maps with resolutions comparable to or higher than those of non-hierarchical counterparts. We conduct a comprehensive evaluation of 27 hierarchical-VFM-based HMR and HPE models, demonstrating that using only the first two or three stages achieves performance on par with full-stage models. Moreover, we show that the resulting truncated models exhibit better trade-offs between accuracy and computational efficiency compared to existing lightweight alternatives.

Basic Information

Paper ID: 2510.12660
Title: On the Use of Hierarchical Vision Foundation Models for Low-Cost Human Mesh Recovery and Pose Estimation
Authors: Shuhei Tarashima (NTT DOCOMO Business & Tokyo Metropolitan University), Yushan Wang (Tokyo Metropolitan University), Norio Tagawa (Tokyo Metropolitan University)
Category: cs.CV
Publication Date: October 14, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.12660

Abstract

This paper develops simple and efficient models for human mesh recovery (HMR) and human pose estimation (HPE). Current state-of-the-art HMR methods, such as HMR2.0 and its successors, rely on large non-hierarchical Vision Transformers as encoders, which are inherited from the corresponding HPE models such as ViTPose. To establish baselines across different computational budgets, the authors first construct three lightweight HMR2.0 variants by adapting the corresponding ViTPose models. The paper then proposes using the early stages of hierarchical vision foundation models (VFMs), including Swin Transformer, GroupMixFormer, and VMamba, as encoders. This design rests on the observation that intermediate stages of hierarchical VFMs produce feature maps with resolution comparable to or higher than that of non-hierarchical models. The authors conduct a comprehensive evaluation of 27 hierarchical-VFM-based HMR and HPE models, showing that using only the first two or three stages performs on par with full-stage models, and that the truncated models offer better trade-offs between accuracy and computational efficiency than existing lightweight alternatives.

Research Background and Motivation

Problem Definition

Human mesh recovery (HMR) is an important task in computer vision with broad applications in animation, virtual try-on, sports analysis, and human-computer interaction. The task aims to predict SMPL parameters from a single image to reconstruct a complete 3D human body model.

Limitations of Existing Methods

High computational resource requirements: Current state-of-the-art methods such as HMR2.0 use a large ViT-H encoder, which requires substantial computational resources
Deployment difficulties: Large models are challenging to deploy in real-time on mobile devices or edge computing environments
Poor efficiency-performance trade-off: Existing lightweight methods often sacrifice significant performance for computational efficiency

Research Motivation

Practical deployment needs: Urgent need for deploying HMR and HPE models in resource-constrained environments
Architecture simplification: Improving efficiency while maintaining the simplicity of HMR2.0 architecture
Potential of hierarchical VFMs: Exploring the application potential of hierarchical vision foundation models in this task

Core Contributions

Construction of lightweight baselines: Instantiated three lightweight HMR2.0 variants by inheriting ViTPose-{L,B,S} encoders
Proposed truncation strategy: Systematically explored the feasibility of using the first few stages of hierarchical VFMs as encoders
Comprehensive experimental evaluation: Evaluated 27 hierarchical-VFM-based HMR and HPE models
Optimized performance-efficiency trade-off: Demonstrated that truncated hierarchical VFM models achieve better trade-offs between accuracy and computational efficiency

Methodology Details

Task Definition

HPE task: Predict 2D keypoint locations from input images (H×W, typically 256×192)
HMR task: Predict SMPL parameters (pose α, shape β, camera θ) from input images

Baseline Architecture

ViTPose Architecture

Encoder: ViT generates feature maps at H/16×W/16 resolution
Decoder: Deconvolutional layers + prediction layers output keypoint heatmaps
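
To illustrate how a keypoint heatmap is turned into a 2D location, the following is a minimal sketch of argmax decoding in plain Python. The function name `decode_heatmap` and the `stride` parameter (the assumed ratio between input and heatmap resolution) are illustrative and not taken from ViTPose; practical implementations usually add sub-pixel refinement on top of this.

```python
from typing import List, Tuple

Heatmap = List[List[float]]  # rows (height) x columns (width)


def decode_heatmap(heatmap: Heatmap, stride: float = 4.0) -> Tuple[float, float, float]:
    """Return (x, y, score) in input-image pixels for one keypoint heatmap.

    `stride` is an assumed input-to-heatmap resolution ratio.
    """
    best_score = float("-inf")
    best_row, best_col = 0, 0
    for row_idx, row in enumerate(heatmap):
        for col_idx, value in enumerate(row):
            if value > best_score:
                best_score, best_row, best_col = value, row_idx, col_idx
    # Map the winning heatmap cell back to image coordinates.
    return best_col * stride, best_row * stride, best_score


# Toy 3x3 heatmap whose peak sits at row 1, column 1.
toy = [
    [0.0, 0.1, 0.0],
    [0.2, 0.9, 0.1],
    [0.0, 0.3, 0.0],
]
x, y, score = decode_heatmap(toy, stride=4.0)
```

One such heatmap is decoded per keypoint, and the peak value can serve as a confidence score.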

HMR2.0 Architecture

Encoder: ViT-based encoder produces feature maps
Decoder: Transformer-based decoder predicts SMPL parameters
Uses query token mechanism for feature aggregation

Hierarchical VFM Encoder Design

Design Principles

Maintain architectural simplicity: Avoid complex or highly specialized modules
Architectural consistency: Maintain consistency with HMR2.0 and ViTPose baselines

Resolution Matching Strategy

Hierarchical VFMs contain four stages. Relative to the H/16×W/16 feature maps of non-hierarchical VFMs, the outputs of stages 2, 3, and 4 have 2×, 1×, and 1/2× the resolution along each side. The encoder output is aligned to the non-hierarchical resolution as follows:

Using all four stages (S4): Add 2×2 deconvolutional layers to upsample the stage 4 output
Using first three stages (S3): Feed the stage 3 output directly to the decoder
Using first two stages (S2): Add stride-2 convolutional layers to downsample the stage 2 output
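
The resolution arithmetic behind these three cases can be sketched as below. This assumes the common stride-4/8/16/32 layout for four-stage hierarchical backbones and a ViT stride of 16; the names `STAGE_STRIDES`, `stage_output_size`, and `adapter_for` are hypothetical helpers for illustration, not part of the paper's code.

```python
from typing import Dict, Tuple

# Assumed cumulative downsampling factor at the output of each stage.
STAGE_STRIDES: Dict[int, int] = {1: 4, 2: 8, 3: 16, 4: 32}
VIT_STRIDE = 16  # resolution the ViTPose / HMR2.0 decoders expect (H/16 x W/16)


def stage_output_size(height: int, width: int, last_stage: int) -> Tuple[int, int]:
    """Feature-map size produced by the last retained stage."""
    stride = STAGE_STRIDES[last_stage]
    return height // stride, width // stride


def adapter_for(last_stage: int) -> str:
    """Describe the layer needed to reach the ViT feature-map resolution."""
    stride = STAGE_STRIDES[last_stage]
    if stride == VIT_STRIDE:
        return "identity"
    if stride > VIT_STRIDE:
        return f"deconv x{stride // VIT_STRIDE} (upsample)"
    return f"conv stride {VIT_STRIDE // stride} (downsample)"


# Plan for a 256x192 input, the typical HPE crop size.
plan = {
    f"S{stage}": (stage_output_size(256, 192, stage), adapter_for(stage))
    for stage in (2, 3, 4)
}
```

For a 256×192 input, all three configurations end up at the same 16×12 grid after the adapter, which is what allows the baseline decoders to be reused unchanged.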

Supported VFM Architectures

Swin Transformer: Hierarchical Transformer based on shifted window attention
GroupMixFormer (GMF): Efficient Transformer with group-mix attention
VMamba (VM): Vision architecture based on state space models

Technical Innovations

Truncation strategy: First systematic exploration of using only the first few stages of hierarchical VFMs
Minimal modifications: Achieve resolution matching through simple convolutional/deconvolutional layers, maintaining architectural simplicity
Multi-architecture validation: Verify method generality across different architecture types including Transformers and SSMs

Experimental Setup

Datasets

HPE:

Training: COCO dataset
Evaluation: COCO-val dataset

HMR:

Training: Mixed datasets (Human3.6M, MPI-INF-3DHP, COCO, MPII, InstaVariety, AVA, AI Challenger)
2D pose evaluation: LSP-Extended, COCO-val, PoseTrack-val
3D pose evaluation: 3DPW-test, Human3.6M-val

Evaluation Metrics

HPE:

Average Precision (AP) and Average Recall (AR)
Composite metric: Φ_{P,2D} = (AP + AR) / 2

HMR:

2D: Percentage of Correct Keypoints (PCK) at 0.05 and 0.1 thresholds
3D: MPJPE and PA-MPJPE error metrics
Composite metrics: Φ_{M,2D} and Φ_{M,3D}
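
As a rough illustration of these metrics, the sketch below implements the HPE composite score, MPJPE, and PCK in plain Python. The function names are hypothetical, and the PCK normalization (`norm_size`, for example a bounding-box side length) is an assumption, since conventions differ between benchmarks. PA-MPJPE additionally applies a Procrustes alignment before measuring error and is omitted here, as are the HMR composite metrics, whose exact definitions are not given in this summary.

```python
import math
from typing import Sequence

Point = Sequence[float]


def composite_hpe_score(ap: float, ar: float) -> float:
    """Mean of AP and AR, matching the HPE composite metric above."""
    return 0.5 * (ap + ar)


def mpjpe(pred: Sequence[Point], gt: Sequence[Point]) -> float:
    """Mean per-joint position error: average Euclidean distance per joint."""
    assert len(pred) == len(gt) and len(gt) > 0
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)


def pck(pred: Sequence[Point], gt: Sequence[Point],
        threshold: float, norm_size: float) -> float:
    """Fraction of keypoints within threshold * norm_size of the ground truth."""
    assert len(pred) == len(gt) and len(gt) > 0
    limit = threshold * norm_size
    hits = sum(1 for p, g in zip(pred, gt) if math.dist(p, g) <= limit)
    return hits / len(gt)


# Toy values for illustration only (not results from the paper).
score = composite_hpe_score(75.0, 80.0)
err_3d = mpjpe([(0, 0, 0), (1, 0, 0)], [(0, 0, 3), (1, 4, 0)])
pred_2d, gt_2d = [(10, 10), (50, 50)], [(10, 13), (50, 58)]
pck_005 = pck(pred_2d, gt_2d, threshold=0.05, norm_size=100)
pck_010 = pck(pred_2d, gt_2d, threshold=0.1, norm_size=100)
```

In the toy data, loosening the PCK threshold from 0.05 to 0.1 admits the second keypoint, which is why both thresholds are reported together.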

Comparison Methods

Existing lightweight methods: METRO series, FastMETRO, TORE, etc.
ViT baselines: HMR2.0-{L,B,S}, ViTPose-{H,L,B,S}
CNN methods: MEMe, SimCC-HRNet, etc.

Implementation Details

Hardware: 8×A100 GPUs for training, single A100 GPU for inference testing
Initialization: Hierarchical VFM encoders initialized with ImageNet-1K pretrained weights
Training protocol: Follow standard training settings of HMR2.0 and ViTPose

Experimental Results

Main Results

Truncation Effect Verification

Experimental results show that truncated models using the first 2-3 stages achieve comparable or even better performance than full 4-stage models:

HPE Models (COCO dataset):

SwinPose-S-S3: AP=74.6 vs S4's 74.5 (+0.1)
GMFPose-T-S3: AP=75.7 vs S4's 75.8 (-0.1)
VMPose-T-S3: AP=75.3 vs S4's 75.2 (+0.1)
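
The S3-versus-S4 gaps quoted above can be checked with a few lines of arithmetic. The AP values are copied from the list; the variable names are illustrative.

```python
# (AP with first three stages, AP with all four stages), from the list above.
AP_S3_VS_S4 = {
    "SwinPose-S": (74.6, 74.5),
    "GMFPose-T": (75.7, 75.8),
    "VMPose-T": (75.3, 75.2),
}

# Round to one decimal to match the precision of the reported APs.
deltas = {name: round(s3 - s4, 1) for name, (s3, s4) in AP_S3_VS_S4.items()}
max_gap = max(abs(delta) for delta in deltas.values())
```

The largest absolute gap is 0.1 AP, which is the basis for calling the truncated and full models comparable.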

HMR Model Performance:

In 3D pose estimation, most S3 models slightly outperform S4 models
SwinHMR2.0-S-S3 maintains comparable performance to S4 while reducing parameters by 31.6%

Computational Efficiency Improvements

The truncation strategy significantly reduces computational complexity:

Parameter reduction: S3 models reduce parameters by 30-50% compared to S4 on average
FLOPs reduction: S2 models reduce computational load by 70-90% compared to S4
Inference acceleration: S2 models achieve 2-3× FPS improvement

Comparison with Existing Methods

Results on the Human3.6M dataset for 3D pose estimation show that the proposed hierarchical VFM models outperform existing lightweight methods at comparable computational budgets:

GMFHMR2.0-S-S3: 19.3M parameters, PA-MPJPE=35.4
Better efficiency-performance trade-off compared to ViT-based methods
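
One way to make "better trade-off" concrete is to keep only the models that are not dominated in both parameter count and error. The sketch below computes such a Pareto front. The `Candidate` type, the `pareto_front` helper, and all toy entries are hypothetical placeholders, not numbers from the paper.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str
    params_m: float  # parameters in millions (lower is better)
    error_mm: float  # e.g. PA-MPJPE in mm (lower is better)


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep candidates that no other candidate beats on both axes."""
    front = []
    for cand in candidates:
        dominated = any(
            other.params_m <= cand.params_m
            and other.error_mm <= cand.error_mm
            and (other.params_m < cand.params_m or other.error_mm < cand.error_mm)
            for other in candidates
        )
        if not dominated:
            front.append(cand)
    return sorted(front, key=lambda c: c.params_m)


# Hypothetical placeholder models, for illustration only.
toy_models = [
    Candidate("toy_a", 10.0, 50.0),
    Candidate("toy_b", 20.0, 40.0),
    Candidate("toy_c", 25.0, 45.0),  # worse than toy_b on both axes
    Candidate("toy_d", 40.0, 38.0),
]
front_names = [c.name for c in pareto_front(toy_models)]
```

The same comparison can be repeated with FLOPs or FPS on the efficiency axis.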

Ablation Studies

Impact of Different Stage Numbers

Systematic evaluation of S2, S3, and S4 configurations:

S3 configuration: Optimal choice in most cases, balancing performance and efficiency
S2 configuration: Highest efficiency but with noticeable performance degradation in certain tasks
S4 configuration: Largest computational overhead with limited performance improvements

Comparison of Different VFM Architectures

Swin Transformer: Stable performance across most configurations
GroupMixFormer: Maintains good performance in S2 configuration
VMamba: Demonstrates good efficiency-performance trade-off

Case Analysis

Qualitative results show that truncated models achieve comparable visual quality to full models, accurately estimating human pose and shape, validating the method's effectiveness.

Related Work

Human Mesh Recovery

Early CNN methods: Based on traditional CNN architectures like ResNet, HRNet
Transformer methods: METRO, Mesh Graphormer and other hybrid CNN-Transformer architectures
Pure Transformer: HMR2.0, SMPLer-X and other fully Transformer-based methods

Human Pose Estimation

CNN optimization: MEMe, Lite-HRNet, LitePose and other lightweight CNN methods
Architecture search: CNF, ViPNAS and other neural architecture search methods
Transformer applications: ViTPose and other ViT-based methods

Vision Foundation Models

Non-hierarchical: ViT, DeiT and other models maintaining fixed resolution
Hierarchical: Swin Transformer, PVT and other multi-scale feature extraction models

Conclusions and Discussion

Main Conclusions

Truncation strategy is effective: The first 2-3 stages of hierarchical VFMs contain sufficient semantic information for HMR and HPE tasks
Significant efficiency improvements: Truncated models substantially reduce computational overhead while maintaining performance
Good generality: The strategy demonstrates consistent effectiveness across different VFM architectures

Limitations

Architectural constraints: Primarily applicable to hierarchical VFMs, not suitable for non-hierarchical models
Task specificity: Mainly validated on HMR and HPE tasks; applicability to other vision tasks remains to be explored
Pretraining dependency: Effectiveness depends on high-quality pretrained weights

Future Directions

Extension to more VFMs: Explore additional hierarchical vision foundation models
Full-body and multi-person scenarios: Validate effectiveness in more complex HMR tasks
Architecture optimization: Further optimize the architecture design after truncation

In-Depth Evaluation

Strengths

High practical value: Addresses the efficiency bottlenecks that hinder real-world deployment of HMR and HPE models
Simple methodology: Maintains the simplicity of the original architecture, facilitating implementation and deployment
Comprehensive experiments: The evaluation of 27 models provides solid experimental evidence
Deep insights: Reveals the richness of intermediate representations in hierarchical VFMs

Weaknesses

Insufficient theoretical analysis: Lacks in-depth theoretical analysis of why the first few stages are sufficient
Limited novelty: Primarily engineering optimization with relatively limited algorithmic innovation
Limited evaluation scope: Mainly evaluated on standard datasets; robustness in real-world application scenarios remains to be verified

Impact

Academic contribution: Provides new perspectives for designing efficient HMR/HPE models
Practical value: Significant implications for mobile and edge computing deployment
Reproducibility: Simple methodology facilitates reproduction and application

Applicable Scenarios

Resource-constrained environments: Mobile devices, edge computing devices
Real-time applications: Interactive applications requiring rapid response
Large-scale deployment: Scenarios requiring simultaneous execution across multiple devices

References

The paper cites 118 relevant references covering important works in HMR, HPE, and vision foundation models, providing comprehensive background support for the research.

Overall Assessment: This is a highly practical engineering optimization paper that significantly improves the efficiency of HMR and HPE models through a simple yet effective truncation strategy. While algorithmic novelty is limited, it addresses important practical deployment challenges and possesses high application value. The experimental design is comprehensive, conclusions are reliable, and it provides valuable references for practical applications in related fields.