Learning Optimal Prompt Ensemble for Multi-source Visual Prompt Transfer

Zhang, Cao, Wu et al.

Prompt tuning has emerged as a lightweight strategy for adapting foundation models to downstream tasks, particularly for resource-constrained systems. As pre-trained prompts become valuable assets, combining multiple source prompts offers a promising way to enhance generalization on new tasks by leveraging complementary knowledge. However, naive aggregation often overlooks the fact that different source prompts contribute differently to the target task. To address this, we propose HGPrompt, a dynamic framework that learns optimal ensemble weights. These weights are optimized by jointly maximizing an information-theoretic metric for transferability and minimizing gradient conflicts via a novel regularization strategy. Specifically, we propose a differentiable prompt transferability metric that captures the discriminability of prompt-induced features on the target task. Meanwhile, HGPrompt matches the gradient variances with respect to different source prompts based on Hessian and Fisher Information, ensuring stable and coherent knowledge transfer while suppressing gradient conflicts among them. Extensive experiments on the large-scale VTAB benchmark demonstrate the state-of-the-art performance of HGPrompt, validating its effectiveness in learning an optimal ensemble for multi-source prompt transfer.


Basic Information

Paper ID: 2504.12311
Title: Learning Optimal Prompt Ensemble for Multi-source Visual Prompt Transfer
Authors: Enming Zhang, Liwen Cao, Yanru Wu, Zijie Zhao, Yang Li (Tsinghua University Shenzhen International Graduate School, Southeast University)
Classification: cs.CL (Computational Linguistics)
Publication Date/Venue: arXiv Preprint (Latest version October 15, 2025)
Paper Link: https://arxiv.org/abs/2504.12311v5

Abstract

This paper proposes HGPrompt, a framework for multi-source visual prompt transfer. The method learns ensemble weights by jointly maximizing an information-theoretic transferability metric and minimizing a gradient-conflict regularizer. Specifically, a differentiable prompt transferability metric captures the discriminability of prompt-induced features on the target task, and a regularizer motivated by Hessian and Fisher information matches gradient variances across source prompts to keep knowledge transfer stable and suppress gradient conflicts. Experiments on the large-scale VTAB benchmark validate the effectiveness of HGPrompt.

Research Background and Motivation

Problem Definition

With the development of visual foundation models, prompt tuning has become a lightweight strategy for adapting to downstream tasks. The core challenge faced by existing methods is: how to effectively aggregate multiple source prompts to enhance generalization capability on new tasks.

Research Motivation

Resource Efficiency Requirements: Full model fine-tuning becomes impractical on large-scale pre-trained models, while prompt tuning achieves competitive performance by updating only 0.4% of parameters
Prompt Asset Value: Pre-trained prompts have become valuable knowledge assets, and combinations of multi-source prompts can leverage complementary knowledge
Limitations of Existing Methods: Simple concatenation or averaging aggregation ignores the varying contributions of different source prompts to target tasks, potentially leading to representation collapse

Core Challenges

Traditional methods evaluate each prompt's transferability in isolation, ignoring inter-prompt dependencies
Existing heuristic methods (e.g., parameter similarity computation) lack theoretical grounding
Gradient interference introduced by multi-prompt aggregation leads to optimization instability

Core Contributions

Proposes HGPrompt Framework: The first theoretically reliable framework for dynamically learning optimal prompt weights by evaluating the transferability of aggregated prompt-induced features
Information-Theoretic Transferability Metric: A differentiable prompt transferability metric based on H-score, providing explicit and interpretable contribution quantification
Gradient Alignment Regularization: An innovative gradient variance matching objective that addresses gradient conflicts among multi-source prompts
SOTA Performance: Achieves state-of-the-art performance on the VTAB benchmark with average accuracy of 60.3%

Method Details

Task Definition

Given κ source tasks S = {Sᵢ}, i = 1,...,κ, and their corresponding optimized prompts {Pᵢ}, i = 1,...,κ, the goal is to construct a target prompt P_T for a new task T through an optimal combination of source prompts. Let M ≤ κ be the number of selected source prompts, with weights α = (α₁,...,α_M) satisfying ∑ᵢαᵢ = 1 and αᵢ ≥ 0.
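
The constraints on α can be illustrated with a minimal sketch. It assumes two things that the summary does not state: that the weights are parametrized through a softmax over unconstrained scores, and that the target prompt is a plain weighted sum of the source prompts (Definition 2 later weights prompt-induced features instead). The function names and array sizes are hypothetical.

```python
import numpy as np

def simplex_weights(logits):
    """Map unconstrained scores to weights with alpha_i >= 0 and sum(alpha) = 1."""
    z = np.asarray(logits, dtype=float)
    e = np.exp(z - z.max())              # subtract the max for numerical stability
    return e / e.sum()

def combine_prompts(prompts, alpha):
    """Weighted sum of source prompts; prompts has shape (M, m, d)."""
    return np.tensordot(alpha, prompts, axes=1)   # result has shape (m, d)

rng = np.random.default_rng(0)
M, m, d = 3, 4, 8                        # hypothetical: 3 sources, 4 tokens, width 8
source_prompts = rng.normal(size=(M, m, d))
alpha = simplex_weights([0.5, 2.0, -1.0])
target_prompt = combine_prompts(source_prompts, alpha)
print(alpha.round(3), target_prompt.shape)
```

A softmax keeps every weight strictly positive; a projection onto the simplex would be an alternative if exact zeros (dropping a source prompt) are wanted.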

Model Architecture

1. Visual Prompt Tuning Foundation

For pre-trained Transformers, m learnable prompt tokens P = [p₁,...,pₘ] ∈ ℝᵐˣᵈ are introduced. Given patch embeddings E(X) ∈ ℝⁿˣᵈ of input image X, the combined input sequence is [P;E(X)] ∈ ℝ⁽ᵐ⁺ⁿ⁾ˣᵈ.

The prediction probability is:

Pr_θ(Y|X;P) = exp(f_Y([P;E(X)];θ)) / ∑ᶜᵢ₌₁exp(fᵢ([P;E(X)];θ))
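
As a rough illustration of the two formulas above, the sketch below builds [P;E(X)] and turns class logits into probabilities with a softmax. The backbone is replaced by a hypothetical stand-in (mean pooling plus a linear head), not the ViT used in the paper; n = 196 assumes a 224×224 input with 16×16 patches, and d and C are toy values.

```python
import numpy as np

def prepend_prompt(prompt, patch_embeddings):
    """Build [P; E(X)] by stacking m prompt tokens ahead of n patch tokens."""
    return np.concatenate([prompt, patch_embeddings], axis=0)   # (m + n, d)

def class_probabilities(logits):
    """Softmax over the C class logits f_1..f_C."""
    e = np.exp(logits - logits.max())    # shift by the max for stability
    return e / e.sum()

def toy_backbone(tokens, head):
    """Hypothetical stand-in for f(.; theta): mean-pool tokens, then a linear head."""
    return tokens.mean(axis=0) @ head    # (C,)

rng = np.random.default_rng(0)
m, n, d, C = 50, 196, 16, 10             # 50 prompt tokens as in the setup below
P = rng.normal(size=(m, d))
E_X = rng.normal(size=(n, d))
head = rng.normal(size=(d, C))

tokens = prepend_prompt(P, E_X)
probs = class_probabilities(toy_backbone(tokens, head))
print(tokens.shape, probs.sum())
```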

2. H-score Transferability Metric

Definition 1: Given input data x, labels y, and feature extractor f(x), the one-sided H-score is defined as:

H(f) = tr(cov(f(X))⁻¹cov(E_{P_{X|Y}}[f(X)|Y]))

This metric has an intuitive interpretation: a high H-score indicates greater inter-class discriminability, measured by cov(E[f(X)|Y]), and low feature redundancy, measured by tr(cov(f(X))).
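
Definition 1 can be computed directly from a feature matrix and its labels. The following is a minimal sketch, not the authors' implementation: it uses empirical covariances and a pseudo-inverse in place of the inverse (an assumption made here for numerical safety), and the data are synthetic.

```python
import numpy as np

def h_score(features, labels):
    """One-sided H-score: tr(cov(f(X))^-1 cov(E[f(X)|Y]))."""
    f = np.asarray(features, dtype=float)
    y = np.asarray(labels)
    f = f - f.mean(axis=0)                   # centre the features
    cov_f = f.T @ f / len(f)                 # cov(f(X)), shape (d, d)
    g = np.zeros_like(f)
    for c in np.unique(y):                   # replace each row by its class mean
        g[y == c] = f[y == c].mean(axis=0)
    cov_g = g.T @ g / len(f)                 # cov(E[f(X)|Y])
    return float(np.trace(np.linalg.pinv(cov_f) @ cov_g))

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 200)
separated = rng.normal(size=(400, 2)) + 4.0 * y[:, None]   # classes far apart
shuffled = rng.permutation(separated)        # row shuffle breaks the feature/label link
print(h_score(separated, y) > h_score(shuffled, y))
```

Features whose class means are well separated relative to their overall spread score higher than the same features with the label correspondence destroyed, which matches the interpretation given above.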

Definition 2: Optimal feature weights are determined by maximizing the H-score of weighted feature sum:

α* = argmax_α H(∑ⱼαⱼ·f_{Pⱼ}) s.t. ∑ⱼαⱼ = 1

Theorem 1: H-score is a convex quadratic form in weights α, guaranteeing reliable solution of the optimization problem.
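
To make Definition 2 concrete, the sketch below searches the weight of a two-source mixture on a grid and keeps the value with the highest H-score. Grid search is a stand-in chosen for readability; the summary does not say which solver the paper uses. The two feature sets are synthetic and all function names are hypothetical.

```python
import numpy as np

def h_score(f, y):
    """Compact one-sided H-score (pseudo-inverse used for numerical safety)."""
    f = f - f.mean(axis=0)
    g = np.zeros_like(f)
    for c in np.unique(y):
        g[y == c] = f[y == c].mean(axis=0)
    return float(np.trace(np.linalg.pinv(f.T @ f / len(f)) @ (g.T @ g / len(f))))

def best_weight_two_sources(f1, f2, y, steps=21):
    """Grid search over alpha = (a, 1 - a) for the H-score of a*f1 + (1 - a)*f2."""
    grid = np.linspace(0.0, 1.0, steps)
    scores = [h_score(a * f1 + (1 - a) * f2, y) for a in grid]
    best = int(np.argmax(scores))
    return float(grid[best]), scores[best]

rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 100)
informative = np.eye(3)[y] + 0.3 * rng.normal(size=(300, 3))   # class signal plus noise
uninformative = rng.normal(size=(300, 3))                      # pure noise
a, score = best_weight_two_sources(informative, uninformative, y)
print(a, round(score, 3))
```

In this toy setting most of the weight goes to the informative source, which is the behaviour the ensemble weights are meant to capture. For M > 2 sources a grid becomes impractical and a gradient-based or closed-form solver would be needed.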

3. Gradient Alignment Regularization

To address gradient interference in multi-prompt aggregation, a gradient variance matching objective is proposed:

Computing gradients for each source prompt Pᵢ:

gᵢ = ∇_{Pᵢ} L(f_θ([x₀;Pᵢ;E(X)]), y)

Gradient variance:

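vᵢ = Var(Gᵢ) = 1/(N-1) ∑ⱼ(gᵢⱼ - ḡᵢ)²

where gᵢⱼ is the gradient of the j-th of N samples with respect to Pᵢ and ḡᵢ is the mean of these gradients.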

Regularization term:

L_{align}(α) = 1/M ∑ᵢ||vᵢ - v̄(α)||²₂

Total objective function:

L(α) = -H(α) + λL_{align}(α)
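
The regularizer and the total objective can be sketched as follows. Two points are assumptions not given in the summary: the reference variance v̄(α) is taken to be the α-weighted mean of the per-source variances, and the H-score enters as a precomputed number rather than as a function of α. The gradients are random placeholders.

```python
import numpy as np

def gradient_variance(per_sample_grads):
    """Element-wise unbiased variance over N per-sample gradients: (N, D) -> (D,)."""
    return np.var(per_sample_grads, axis=0, ddof=1)

def alignment_loss(variances, alpha):
    """(1/M) * sum_i ||v_i - v_bar(alpha)||^2, with v_bar assumed to be sum_i alpha_i v_i."""
    v = np.asarray(variances, dtype=float)   # (M, D)
    v_bar = np.asarray(alpha) @ v            # (D,)
    return float(np.mean(np.sum((v - v_bar) ** 2, axis=1)))

def total_objective(h_value, variances, alpha, lam):
    """L(alpha) = -H(alpha) + lambda * L_align(alpha); h_value is a placeholder for H(alpha)."""
    return -h_value + lam * alignment_loss(variances, alpha)

rng = np.random.default_rng(0)
N, D = 64, 5                                 # hypothetical sample count and gradient size
grads = [s * rng.normal(size=(N, D)) for s in (1.0, 1.1, 3.0)]   # third source is an outlier
v = np.stack([gradient_variance(g) for g in grads])
uniform = np.full(3, 1 / 3)
skewed = np.array([0.5, 0.5, 0.0])
print(alignment_loss(v, uniform), alignment_loss(v, skewed))
print(total_objective(0.8, v, uniform, lam=0.1))
```

The loss is zero when all sources share the same gradient variance and grows as one source's gradient statistics drift away from the weighted reference.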

Technical Innovations

Ensemble Evaluation vs. Isolated Evaluation: Unlike traditional methods that independently evaluate each prompt, this work evaluates the overall transferability of aggregated prompts
Theoretical Foundation: The information-theoretic H-score provides rigorous mathematical basis, replacing heuristic methods
Gradient Conflict Resolution: By leveraging theoretical insights from Hessian and Fisher information, gradient variance matching is designed to reduce optimization inconsistency

Experimental Setup

Datasets

The VTAB-1k benchmark with 13 datasets covering three task categories:

Natural: Images from standard cameras (e.g., CIFAR100, Flowers102, Pets)
Specialized: Data from specialized equipment (e.g., EuroSAT satellite imagery)
Structured: Tasks requiring spatial reasoning (e.g., CLEVR counting)

Evaluation Metrics

Classification accuracy is used as the primary evaluation metric, reporting average results from three independent runs.

Baseline Methods

Includes 11 baseline methods:

Classification Head Retraining: PARTIAL-k, MLP-k
Parameter Subset Updates: Adapter, SIDETUNE, BIAS
Prompt Transfer: Average, Single-Best, VPT, SPoT, ATTEMPT, PANDA

Implementation Details

Backbone Network: ViT-B/16 (ImageNet-21k pre-trained)
Number of Prompt Tokens: 50
Source Task Training: 10 epochs
Computing Device: NVIDIA A800-80GB GPU
Sample Size: 2000 samples per source task for transferability and gradient alignment loss computation

Experimental Results

Main Results

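HGPrompt achieves SOTA performance on 13 visual tasks (selected datasets shown; the Average column is not the mean of the columns shown and presumably spans all 13 tasks):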

Method	CIFAR100	DTD	Flowers102	Pets	SVHN	EuroSAT	Average
PANDA	74.1	61.3	96.5	86.2	71.2	90.8	58.7
HGPrompt	75.9	64.2	98.1	87.4	71.0	92.6	60.3

Average accuracy of 60.3%, surpassing all baseline methods
Outstanding performance on fine-grained recognition tasks (Flowers102, Pets)
Establishes new benchmarks on geometric reasoning tasks (sNORB-Azimuth, dSprite-Orientation)

Ablation Study

Component contribution analysis (Average is the mean of the four datasets shown):

H(α)	L_align	CIFAR	DTD	Pets	EuroSAT	Average
×	×	60.4	57.8	82.7	89.1	72.5
✓	×	74.6	62.3	85.9	91.2	78.5
×	✓	74.1	61.9	85.5	90.8	78.1
✓	✓	75.9	64.2	87.4	92.6	80.0

Results demonstrate that the two components have complementary effects, achieving optimal performance when used jointly.

Weight Analysis

Spearman rank correlation between the learned weights and zero-shot transfer accuracy is used to validate weight quality (higher is better):

Method	CIFAR	C-dist	d-Loc	DML	SVHN	Average
SPoT	0.552	0.175	-0.168	0.112	-0.147	0.105
PANDA	0.916	0.441	0.552	0.713	0.224	0.569
HGPrompt	0.944	0.664	0.853	0.727	0.853	0.808

Weights learned by HGPrompt show the highest correlation with zero-shot transfer accuracy, more accurately reflecting semantic affinity between tasks.
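
For readers unfamiliar with the measure, the sketch below shows how a Spearman coefficient of this kind is obtained: both lists are converted to ranks and the Pearson correlation of the ranks is taken. The weights and accuracies are hypothetical values, not numbers from the paper, and the rank helper assumes there are no ties.

```python
import numpy as np

def ranks(values):
    """Rank positions 1..n (assumes no ties, which keeps the sketch short)."""
    order = np.argsort(values)
    r = np.empty(len(values))
    r[order] = np.arange(1, len(values) + 1)
    return r

def spearman(a, b):
    """Spearman correlation = Pearson correlation of the two rank vectors."""
    return float(np.corrcoef(ranks(a), ranks(b))[0, 1])

weights = [0.40, 0.25, 0.20, 0.10, 0.05]     # hypothetical learned alpha per source
accuracy = [71.0, 64.5, 66.0, 52.3, 48.9]    # hypothetical zero-shot accuracy per source
print(round(spearman(weights, accuracy), 3))  # 0.9
```

Only the second and third sources swap places between the two rankings, so the coefficient is 1 - 6·2/(5·24) = 0.9.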

Scalability Analysis

As the number of source prompts increases from 3 to 11, HGPrompt demonstrates stronger performance advantages over PANDA and SPoT, validating the method's effectiveness on large-scale prompt collections.

Feature Visualization

t-SNE visualization shows that features generated by HGPrompt exhibit better class discriminability, with same-class objects forming tight clusters with clear boundaries.

Related Work

Parameter-Efficient Transfer Learning

NLP Domain: Adapter, BitFit, LoRA, and other methods fine-tune 1-5% of parameters
Vision Domain: VPT introduces learnable tokens; VP performs pixel-level perturbations

Transferability Estimation

Information-Theoretic Methods: H-score, LEEP, LogME evaluate feature discriminability
Optimal Transport: OTCE measures domain-task discrepancy

Multi-source Prompt Tuning

Single-Task Transfer: SPoT uses metrics to predict the best source task; Su et al. emphasize the role of neural activation
Multi-Task Setting: ATTEMPT uses attention mechanisms to aggregate knowledge; PANDA addresses forgetting through knowledge distillation

Conclusions and Discussion

Main Conclusions

HGPrompt achieves optimal prompt ensemble through joint optimization of H-score and gradient alignment
Information-theoretic metrics more effectively quantify prompt transferability than heuristic methods
Gradient variance matching successfully addresses interference from multi-source prompts

Limitations

Architecture Specificity: Current work focuses on Transformer architectures, with limited applicability to other architectures
Modality Constraints: Primarily targets vision tasks; multi-modal learning requires new prompt design methods
Computational Overhead: Requires computing features and gradients for multiple source prompts

Future Directions

Extend to architecture-agnostic universal prompt interfaces
Explore prompt design in multi-modal learning
Investigate more efficient transferability assessment methods

In-Depth Evaluation

Strengths

Theoretical Innovation: Information-theoretic transferability metric provides rigorous mathematical foundation
Advanced Technique: Gradient alignment regularization cleverly addresses multi-source interference
Comprehensive Experiments: Thorough evaluation on large-scale benchmarks validates method effectiveness
Strong Interpretability: Weight learning process has clear theoretical explanation

Weaknesses

Limited Theoretical Depth: Although a convexity proof is provided, the analysis of convergence and optimality is insufficient
Hyperparameter Sensitivity: λ parameter selection significantly impacts performance, lacking adaptive mechanisms
Computational Complexity: Computational complexity and scalability analysis are not detailed

Impact

Academic Contribution: Provides new theoretical framework and practical methods for multi-source prompt transfer
Practical Value: Important applications in resource-constrained scenarios
Reproducibility: Authors commit to releasing source code, facilitating method dissemination

Applicable Scenarios

Resource-Constrained Environments: Mobile devices, edge computing scenarios
Rapid Adaptation Requirements: Applications requiring quick adaptation to new tasks
Multi-Task Learning: Scenarios requiring leveraging knowledge from multiple related tasks

References

The paper cites abundant related work, including:

Parameter-Efficient Learning: Houlsby et al. (2019), Hu et al. (2021)
Transferability Assessment: Bao et al. (2019), You et al. (2021)
Multi-Task Learning: Yu et al. (2020), Rame et al. (2022)
Vision Transformers: Dosovitskiy et al. (2020), Jia et al. (2022)

Overall, the paper addresses multi-source visual prompt transfer by pairing an H-score-based transferability objective with gradient variance matching to learn ensemble weights, and it opens further research directions for parameter-efficient transfer learning.