Learning Optimal Prompt Ensemble for Multi-source Visual Prompt Transfer

Zhang, Cao, Wu et al.
Prompt tuning has emerged as a lightweight strategy for adapting foundation models to downstream tasks, particularly for resource-constrained systems. As pre-trained prompts become valuable assets, combining multiple source prompts offers a promising approach to enhance generalization for new tasks by leveraging complementary knowledge. However, naive aggregation often overlooks that different source prompts have different contribution potential to the target task. To address this, we propose HGPrompt, a dynamic framework that learns optimal ensemble weights. These weights are optimized by jointly maximizing an information-theoretic metric for transferability and minimizing gradient conflicts via a novel regularization strategy. Specifically, we propose a differentiable prompt transferability metric to capture the discriminability of prompt-induced features on the target task. Meanwhile, HGPrompt matches the gradient variances with respect to different source prompts based on Hessian and Fisher Information, ensuring stable and coherent knowledge transfer while suppressing gradient conflicts among them. Extensive experiments on the large-scale VTAB benchmark demonstrate the state-of-the-art performance of HGPrompt, validating its effectiveness in learning an optimal ensemble for multi-source prompt transfer.

Basic Information

  • Paper ID: 2504.12311
  • Title: Learning Optimal Prompt Ensemble for Multi-source Visual Prompt Transfer
  • Authors: Enming Zhang, Liwen Cao, Yanru Wu, Zijie Zhao, Yang Li (Tsinghua University Shenzhen International Graduate School, Southeast University)
  • Classification: cs.CL (Computational Linguistics)
  • Publication Date/Venue: arXiv Preprint (Latest version October 15, 2025)
  • Paper Link: https://arxiv.org/abs/2504.12311v5

Abstract

This paper proposes the HGPrompt framework for multi-source visual prompt transfer tasks. The method learns optimal ensemble weights through joint optimization of information-theoretic transferability metrics and gradient conflict minimization regularization. Specifically, a differentiable prompt transferability metric is proposed to capture the discriminability of prompt-induced features on target tasks, while matching gradient variance across different source prompts based on Hessian and Fisher information to ensure stable and consistent knowledge transfer while suppressing gradient conflicts. Experiments on the large-scale VTAB benchmark validate the effectiveness of HGPrompt.

Research Background and Motivation

Problem Definition

With the rise of visual foundation models, prompt tuning has become a lightweight strategy for adapting them to downstream tasks. The core challenge is how to aggregate multiple source prompts effectively so that generalization to new tasks improves.

Research Motivation

  1. Resource Efficiency Requirements: Full model fine-tuning becomes impractical on large-scale pre-trained models, while prompt tuning achieves competitive performance by updating only 0.4% of parameters
  2. Prompt Asset Value: Pre-trained prompts have become valuable knowledge assets, and combinations of multi-source prompts can leverage complementary knowledge
  3. Limitations of Existing Methods: Simple concatenation or averaging aggregation ignores the varying contributions of different source prompts to target tasks, potentially leading to representation collapse

Core Challenges

  • Traditional methods evaluate each prompt's transferability in isolation, ignoring inter-prompt dependencies
  • Existing heuristic methods (e.g., parameter similarity computation) lack theoretical grounding
  • Gradient interference introduced by multi-prompt aggregation leads to optimization instability

Core Contributions

  1. Proposes HGPrompt Framework: The first theoretically reliable framework for dynamically learning optimal prompt weights by evaluating the transferability of aggregated prompt-induced features
  2. Information-Theoretic Transferability Metric: A differentiable prompt transferability metric based on H-score, providing explicit and interpretable contribution quantification
  3. Gradient Alignment Regularization: An innovative gradient variance matching objective that addresses gradient conflicts among multi-source prompts
  4. SOTA Performance: Achieves state-of-the-art performance on the VTAB benchmark with average accuracy of 60.3%

Method Details

Task Definition

Given κ source tasks S = {S₁,...,S_κ} and their corresponding optimized prompts {P₁,...,P_κ}, the goal is to construct a target prompt P_T for a new task T through an optimal combination of source prompts. Let M ≤ κ be the number of selected source prompts, with weights α = (α₁,...,α_M) satisfying ∑ᵢαᵢ = 1 and αᵢ ≥ 0.

Model Architecture

1. Visual Prompt Tuning Foundation

For pre-trained Transformers, m learnable prompt tokens P = [p₁,...,pₘ] ∈ ℝᵐˣᵈ are introduced. Given patch embeddings E(X) ∈ ℝⁿˣᵈ of input image X, the combined input sequence is [P;E(X)] ∈ ℝ⁽ᵐ⁺ⁿ⁾ˣᵈ.

The prediction probability is:

Pr_θ(Y|X;P) = exp(f_Y([P;E(X)];θ)) / ∑ᶜᵢ₌₁exp(fᵢ([P;E(X)];θ))
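
A minimal sketch of the input construction and the softmax above, assuming NumPy. The sizes m = 50, n = 196 and d = 768 correspond to 50 prompt tokens on a ViT-B/16 at 224×224 resolution. The mean-pool plus linear head is a hypothetical stand-in for the frozen backbone f(·;θ), used only so that the example runs; it is not the paper's model.

```python
import numpy as np

def prepend_prompt(prompt, patch_embeddings):
    """Stack m prompt tokens in front of n patch embeddings -> shape (m + n, d)."""
    return np.concatenate([prompt, patch_embeddings], axis=0)

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
m, n, d, num_classes = 50, 196, 768, 10

P = rng.normal(size=(m, d))        # prompt tokens (random placeholders)
E_X = rng.normal(size=(n, d))      # patch embeddings of one image (random placeholders)
sequence = prepend_prompt(P, E_X)  # [P; E(X)]

# Hypothetical stand-in for f(.; theta): mean-pool the sequence, then a linear head.
W = 0.01 * rng.normal(size=(d, num_classes))
logits = sequence.mean(axis=0) @ W
probs = softmax(logits)            # Pr(Y | X; P) for each of the C classes
```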

2. H-score Transferability Metric

Definition 1: Given input data x, labels y, and feature extractor f(x), the one-sided H-score is defined as:

H(f) = tr(cov(f(X))⁻¹cov(E_{P_{X|Y}}[f(X)|Y]))

This metric has an intuitive interpretation: a high H-score indicates strong inter-class discriminability, measured by cov(E[f(X)|Y]), together with low feature redundancy, measured by tr(cov(f(X))).
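
The H-score in Definition 1 can be estimated from a finite sample roughly as follows. This is a sketch assuming NumPy; the use of a pseudo-inverse (to tolerate a singular feature covariance) and the synthetic data are choices of this example, not details from the paper.

```python
import numpy as np

def h_score(features, labels):
    """Empirical one-sided H-score: tr(cov(f)^-1 cov(E[f | Y]))."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    centered = features - features.mean(axis=0, keepdims=True)
    cov_f = centered.T @ centered / len(features)
    # Replace every sample by its class-conditional mean, then take the covariance.
    cond_mean = np.zeros_like(centered)
    for c in np.unique(labels):
        mask = labels == c
        cond_mean[mask] = centered[mask].mean(axis=0)
    cov_g = cond_mean.T @ cond_mean / len(features)
    # Pseudo-inverse is an assumption of this sketch, for numerical robustness.
    return float(np.trace(np.linalg.pinv(cov_f) @ cov_g))

rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 100)
# Features whose mean shifts with the label are easy to discriminate.
separable = rng.normal(size=(200, 4)) + 3.0 * labels[:, None]
# Shuffling the rows breaks the link between features and labels.
shuffled = rng.permutation(separable)

h_separable = h_score(separable, labels)
h_shuffled = h_score(shuffled, labels)
```

On data like this, the label-dependent features should score well above the shuffled ones, which is the behaviour the metric is meant to capture.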

Definition 2: Optimal feature weights are determined by maximizing the H-score of weighted feature sum:

α* = argmax_α H(∑ⱼαⱼ·f_{Pⱼ}) s.t. ∑ⱼαⱼ = 1

Theorem 1: H-score is a convex quadratic form in weights α, guaranteeing reliable solution of the optimization problem.
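
To illustrate Definition 2, the sketch below searches the probability simplex for weights that maximize the H-score of the weighted feature sum. It deliberately uses a plain random search (Dirichlet samples plus the uniform and single-source weightings), not the gradient-based optimization described in the paper. The function name `search_weights` and the synthetic features are hypothetical.

```python
import numpy as np

def h_score(features, labels):
    """Empirical one-sided H-score: tr(cov(f)^-1 cov(E[f | Y]))."""
    centered = features - features.mean(axis=0, keepdims=True)
    cov_f = centered.T @ centered / len(features)
    cond_mean = np.zeros_like(centered)
    for c in np.unique(labels):
        mask = labels == c
        cond_mean[mask] = centered[mask].mean(axis=0)
    cov_g = cond_mean.T @ cond_mean / len(features)
    return float(np.trace(np.linalg.pinv(cov_f) @ cov_g))

def search_weights(source_features, labels, num_candidates=500, seed=0):
    """Random search over simplex weights; source_features has shape (M, N, d)."""
    rng = np.random.default_rng(seed)
    num_sources = source_features.shape[0]
    candidates = np.vstack([
        np.full((1, num_sources), 1.0 / num_sources),  # uniform averaging
        np.eye(num_sources),                           # each single source alone
        rng.dirichlet(np.ones(num_sources), size=num_candidates),
    ])
    # tensordot contracts the weight vector with the source axis -> (N, d).
    scores = [h_score(np.tensordot(a, source_features, axes=1), labels)
              for a in candidates]
    best = int(np.argmax(scores))
    return candidates[best], scores[best]

rng = np.random.default_rng(1)
labels = np.repeat([0, 1, 2], 60)
informative = rng.normal(size=(180, 5)) + 1.0 * labels[:, None]
noise_a = rng.normal(size=(180, 5))
noise_b = rng.normal(size=(180, 5))
feats = np.stack([informative, noise_a, noise_b])  # three synthetic "source prompts"

alpha, score = search_weights(feats, labels)
uniform_score = h_score(feats.mean(axis=0), labels)
```

Because the uniform weighting is among the candidates, the returned score can never fall below plain averaging, which mirrors the motivation for learning weights rather than averaging source prompts.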

3. Gradient Alignment Regularization

To address gradient interference in multi-prompt aggregation, a gradient variance matching objective is proposed:

Computing gradients for each source prompt Pᵢ:

gᵢ = ∇_{Pᵢ} L(f_θ([x₀;Pᵢ;E(X)]), y)

Gradient variance:

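vᵢ = Var(Gᵢ) = 1/(N-1) ∑ⱼ (gᵢʲ - ḡᵢ)²

where gᵢʲ is the gradient of the j-th of N samples with respect to Pᵢ, and ḡᵢ is the mean of these gradients.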

Regularization term:

L_{align}(α) = 1/M ∑ᵢ||vᵢ - v̄(α)||²₂

Total objective function:

L(α) = -H(α) + λL_{align}(α)
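
The regularizer and total objective can be sketched as follows, assuming NumPy and per-sample gradients that have already been computed and flattened. The text above does not define v̄(α); this sketch assumes it is the α-weighted mean of the vᵢ, which is an assumption and may differ from the paper. The synthetic gradients are placeholders.

```python
import numpy as np

def gradient_variance(per_sample_grads):
    """Per-coordinate variance with the 1/(N-1) factor; (N, D) -> (D,)."""
    return np.var(per_sample_grads, axis=0, ddof=1)

def alignment_loss(variances, alpha):
    """L_align = (1/M) * sum_i ||v_i - v_bar(alpha)||^2 for variances of shape (M, D).

    ASSUMPTION: v_bar(alpha) = sum_i alpha_i * v_i.
    """
    v_bar = alpha @ variances
    return float(np.mean(np.sum((variances - v_bar) ** 2, axis=1)))

def total_objective(h_value, variances, alpha, lam):
    """L(alpha) = -H(alpha) + lambda * L_align(alpha); h_value is a precomputed H-score."""
    return -h_value + lam * alignment_loss(variances, alpha)

rng = np.random.default_rng(0)
# Three synthetic sources, 2000 samples each; the third has a larger gradient scale.
grads = [s * rng.normal(size=(2000, 8)) for s in (1.0, 1.0, 3.0)]
variances = np.stack([gradient_variance(g) for g in grads])  # shape (3, 8)

uniform = np.full(3, 1.0 / 3.0)
identical = np.stack([variances[0]] * 3)  # all sources share the same variance

loss_mismatched = alignment_loss(variances, uniform)
loss_identical = alignment_loss(identical, uniform)
objective = total_objective(0.5, identical, uniform, lam=0.1)
```

When all sources have identical gradient variances the penalty vanishes and the objective reduces to -H(α); a source with a very different gradient scale is penalized.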

Technical Innovations

  1. Ensemble Evaluation vs. Isolated Evaluation: Unlike traditional methods that independently evaluate each prompt, this work evaluates the overall transferability of aggregated prompts
  2. Theoretical Foundation: The information-theoretic H-score provides rigorous mathematical basis, replacing heuristic methods
  3. Gradient Conflict Resolution: By leveraging theoretical insights from Hessian and Fisher information, gradient variance matching is designed to reduce optimization inconsistency

Experimental Setup

Datasets

The VTAB-1k benchmark with 13 datasets covering three task categories:

  • Natural: Images from standard cameras (e.g., CIFAR100, Flowers102, Pets)
  • Specialized: Data from specialized equipment (e.g., EuroSAT satellite imagery)
  • Structured: Tasks requiring spatial reasoning (e.g., CLEVR counting)

Evaluation Metrics

Classification accuracy is used as the primary evaluation metric, reporting average results from three independent runs.

Baseline Methods

Includes 11 baseline methods:

  1. Classification Head Retraining: PARTIAL-k, MLP-k
  2. Parameter Subset Updates: Adapter, SIDETUNE, BIAS
  3. Prompt Transfer: Average, Single-Best, VPT, SPoT, ATTEMPT, PANDA

Implementation Details

  • Backbone Network: ViT-B/16 (ImageNet-21k pre-trained)
  • Number of Prompt Tokens: 50
  • Source Task Training: 10 epochs
  • Computing Device: NVIDIA A800-80GB GPU
  • Sample Size: 2000 samples per source task for transferability and gradient alignment loss computation

Experimental Results

Main Results

HGPrompt achieves SOTA performance on 13 visual tasks:

Method     CIFAR100  DTD   Flowers102  Pets  SVHN  EuroSAT  Average
PANDA      74.1      61.3  96.5        86.2  71.2  90.8     58.7
HGPrompt   75.9      64.2  98.1        87.4  71.0  92.6     60.3

  • Average accuracy of 60.3%, surpassing all baseline methods
  • Outstanding performance on fine-grained recognition tasks (Flowers102, Pets)
  • Establishes new benchmarks on geometric reasoning tasks (sNORB-Azimuth, dSprite-Orientation)

Ablation Study

Component contribution analysis:

Components enabled (H(α), L_align)   CIFAR  DTD   Pets  EuroSAT  Average
neither                              60.4   57.8  82.7  89.1     72.5
one only                             74.6   62.3  85.9  91.2     78.5
one only                             74.1   61.9  85.5  90.8     78.1
both                                 75.9   64.2  87.4  92.6     80.0

Results demonstrate that the two components have complementary effects, achieving optimal performance when used jointly.

Weight Analysis

Spearman rank correlation coefficient validates weight quality:

Method     CIFAR  C-dist  d-Loc   DML    SVHN    Average
SPoT       0.552  0.175   -0.168  0.112  -0.147  0.105
PANDA      0.916  0.441   0.552   0.713  0.224   0.569
HGPrompt   0.944  0.664   0.853   0.727  0.853   0.808

Weights learned by HGPrompt show the highest correlation with zero-shot transfer accuracy, more accurately reflecting semantic affinity between tasks.
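
For reference, the Spearman rank correlation used here can be computed as in the sketch below (pure Python, valid when there are no tied values). The weights and accuracies are hypothetical numbers chosen for illustration, not values from the paper.

```python
def ranks(values):
    """Ranks from 1..n, assuming no ties (average-rank handling is omitted)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(x, y):
    """Spearman's rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1)); requires no ties."""
    n = len(x)
    rank_x, rank_y = ranks(x), ranks(y)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))

# Hypothetical learned weights and zero-shot transfer accuracies for five sources.
weights = [0.40, 0.25, 0.20, 0.10, 0.05]
zero_shot_acc = [71.0, 64.0, 66.0, 50.0, 48.0]

rho = spearman(weights, zero_shot_acc)  # 0.9: only two adjacent ranks are swapped
```

A value near 1 means that sources receiving larger weights are also the ones that transfer best on their own, which is the property the table above measures.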

Scalability Analysis

As the number of source prompts increases from 3 to 11, HGPrompt demonstrates stronger performance advantages over PANDA and SPoT, validating the method's effectiveness on large-scale prompt collections.

Feature Visualization

t-SNE visualization shows that features generated by HGPrompt exhibit better class discriminability, with same-class objects forming tight clusters with clear boundaries.

Related Work

Parameter-Efficient Transfer Learning

  • NLP Domain: Adapter, BitFit, LoRA and other methods fine-tune 1-5% of parameters
  • Vision Domain: VPT introduces learnable tokens, VP performs pixel-level perturbations

Transferability Estimation

  • Information-Theoretic Methods: H-score, LEEP, LogME evaluate feature discriminability
  • Optimal Transport: OTCE measures domain-task discrepancy

Multi-source Prompt Tuning

  • Single-Task Transfer: SPoT uses metrics to predict best source task, Su et al. emphasize the role of neural activation
  • Multi-Task Setting: ATTEMPT uses attention mechanisms to aggregate knowledge, PANDA addresses forgetting through knowledge distillation

Conclusions and Discussion

Main Conclusions

  1. HGPrompt achieves optimal prompt ensemble through joint optimization of H-score and gradient alignment
  2. Information-theoretic metrics more effectively quantify prompt transferability than heuristic methods
  3. Gradient variance matching successfully addresses interference from multi-source prompts

Limitations

  1. Architecture Specificity: Current work focuses on Transformer architectures, with limited applicability to other architectures
  2. Modality Constraints: Primarily targets vision tasks; multi-modal learning requires new prompt design methods
  3. Computational Overhead: Requires computing features and gradients for multiple source prompts

Future Directions

  1. Extend to architecture-agnostic universal prompt interfaces
  2. Explore prompt design in multi-modal learning
  3. Investigate more efficient transferability assessment methods

In-Depth Evaluation

Strengths

  1. Theoretical Innovation: Information-theoretic transferability metric provides rigorous mathematical foundation
  2. Advanced Technique: Gradient alignment regularization cleverly addresses multi-source interference
  3. Comprehensive Experiments: Thorough evaluation on large-scale benchmarks validates method effectiveness
  4. Strong Interpretability: Weight learning process has clear theoretical explanation

Weaknesses

  1. Limited Theoretical Depth: Although a convexity proof is provided, the analysis of convergence and optimality is insufficient
  2. Hyperparameter Sensitivity: λ parameter selection significantly impacts performance, lacking adaptive mechanisms
  3. Computational Complexity: Computational complexity and scalability analysis are not detailed

Impact

  1. Academic Contribution: Provides new theoretical framework and practical methods for multi-source prompt transfer
  2. Practical Value: Important applications in resource-constrained scenarios
  3. Reproducibility: Authors commit to releasing source code, facilitating method dissemination

Applicable Scenarios

  1. Resource-Constrained Environments: Mobile devices, edge computing scenarios
  2. Rapid Adaptation Requirements: Applications requiring quick adaptation to new tasks
  3. Multi-Task Learning: Scenarios requiring leveraging knowledge from multiple related tasks

References

The paper cites abundant related work, including:

  • Parameter-Efficient Learning: Houlsby et al. (2019), Hu et al. (2021)
  • Transferability Assessment: Bao et al. (2019), You et al. (2021)
  • Multi-Task Learning: Yu et al. (2020), Rame et al. (2022)
  • Vision Transformers: Dosovitskiy et al. (2020), Jia et al. (2022)

This paper contributes to multi-source visual prompt transfer by pairing an information-theoretic transferability metric with gradient alignment regularization, and it suggests further research directions for parameter-efficient transfer learning.