Training-Free Personalization via Retrieval and Reasoning on Fingerprints

Das, Talon, Wang et al.
Vision Language Models (VLMs) have led to major improvements in multimodal reasoning, yet they still struggle to understand user-specific concepts. Existing personalization methods address this limitation but heavily rely on training procedures that can be either costly or unpleasant to individual users. We depart from existing work, and for the first time explore the training-free setting in the context of personalization. We propose a novel method, Retrieval and Reasoning for Personalization (R2P), leveraging internal knowledge of VLMs. First, we leverage VLMs to extract the concept fingerprint, i.e., key attributes uniquely defining the concept within its semantic class. When a query arrives, the most similar fingerprints are retrieved and scored via chain-of-thought reasoning. To reduce the risk of hallucinations, the scores are validated through cross-modal verification at the attribute level: in case of a discrepancy between the scores, R2P refines the concept association via pairwise multimodal matching, where the retrieved fingerprints and their images are directly compared with the query. We validate R2P on two publicly available benchmarks and a newly introduced dataset, Personal Concepts with Visual Ambiguity (PerVA), for concept identification, highlighting challenges in visual ambiguity. R2P consistently outperforms state-of-the-art approaches on various downstream tasks across all benchmarks. Code will be available upon acceptance.

Basic Information

  • Paper ID: 2503.18623
  • Title: Training-Free Personalization via Retrieval and Reasoning on Fingerprints
  • Authors: Deepayan Das, Davide Talon, Yiming Wang, Massimiliano Mancini, Elisa Ricci
  • Classification: cs.CV (Computer Vision)
  • Publication Date/Venue: arXiv 2025 (submitted to CVPR 2025)
  • Paper Link: https://arxiv.org/abs/2503.18623

Abstract

This paper proposes R2P (Retrieval and Reasoning for Personalization), a method that explores, for the first time, the training-free setting in vision-language model (VLM) personalization. R2P represents each user-specific concept by a concept fingerprint, retrieves the most similar fingerprints at query time, and scores them through chain-of-thought reasoning. To mitigate hallucination risks, R2P verifies the scores with an attribute-level cross-modal check and, when the scores disagree, refines the concept association through pairwise multimodal matching.

Research Background and Motivation

Problem Definition

Although existing vision-language models have achieved significant advances in multimodal reasoning, they still struggle with understanding user-specific concepts. For instance, VLMs find it difficult to comprehend personal concepts in questions such as "Where are my keys?" or "What is Fluffy doing?"

Research Significance

Personalization is a critical step toward making VLMs practical. Users need models capable of recognizing and reasoning about their personal items, pets, friends, and other specific concepts.

Limitations of Existing Methods

  1. Training Dependency: Existing personalization methods such as MyVLM and Yo'LLaVA heavily rely on training processes, requiring multiple reference samples and extensive negative samples for contrastive learning
  2. High Cost: Adding new concepts requires expensive fine-tuning for each instance
  3. Difficult Data Collection: Requires collecting large amounts of training data, which is both expensive and inconvenient for users

Research Motivation

The authors pose a critical question: Since VLMs have been exposed to nearly all semantic concepts through web-scale training data, can we leverage VLMs' internal knowledge to achieve training-free personalization?

Core Contributions

  1. First Exploration of Training-Free Personalization: Proposes and implements the training-free setting in VLM personalization for the first time
  2. Proposes R2P Framework: Designs a retrieval-and-reasoning method that uses textual attributes as concept fingerprints to uniquely identify personal concepts
  3. Introduces PerVA Dataset: Constructs a new benchmark dataset specifically designed for testing personalization methods under visual ambiguity scenarios
  4. Achieves SOTA Performance: Consistently outperforms existing methods across all benchmarks, demonstrating the effectiveness of the training-free approach

Method Details

Task Definition

Given user-provided reference images $I_i \in V$, concept names $c_i \in T$, and categories $g_i \in T$, construct a user-specific multimodal database $D$. At test time, given a query image $Q \in V$ and text prompt $P_q \in T$, the VLM should provide answers related to personal concepts.

Model Architecture

R2P consists of two main stages:

Stage One: Personal Database Creation

  1. Concept Fingerprint Extraction:
    $\{A_i, d_i\} = \Phi_{\mathrm{VLM}}(P^V_D, P^T_D)$

    where $A_i$ is the list of fingerprint attributes and $d_i$ is a brief description
  2. Multimodal Encoding:
    • Visual embedding: $f^V_i = E_V(I_i)$
    • Text embedding: $f^T_i = E_T(d_i)$
  3. Database Construction:
    $D = \{I_i, c_i, g_i, d_i, A_i, f^V_i, f^T_i\}_{i=1}^{N}$
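
The database construction in Stage One can be sketched as a small data structure plus a build loop. This is a minimal illustration, not the authors' implementation: the `ConceptEntry` class, the `build_database` function, and the three callables (`extract_fingerprint`, `encode_image`, `encode_text`) are hypothetical names, and the toy stand-ins at the bottom replace a real VLM and real encoders so the sketch can run on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


@dataclass
class ConceptEntry:
    """One record of the personal database D (field names are illustrative)."""
    image: str             # I_i: handle to the reference image (a path here)
    name: str              # c_i: concept name
    category: str          # g_i: semantic category
    description: str       # d_i: brief description produced by the VLM
    attributes: List[str]  # A_i: fingerprint attributes produced by the VLM
    visual_emb: Vector     # f^V_i = E_V(I_i)
    text_emb: Vector       # f^T_i = E_T(d_i)


def build_database(
    references: Sequence[Tuple[str, str, str]],
    extract_fingerprint: Callable[[str, str], Tuple[List[str], str]],
    encode_image: Callable[[str], Vector],
    encode_text: Callable[[str], Vector],
) -> List[ConceptEntry]:
    """Build D from (image, name, category) triples; all callables are assumed."""
    database = []
    for image, name, category in references:
        attributes, description = extract_fingerprint(image, category)
        database.append(ConceptEntry(
            image=image, name=name, category=category,
            description=description, attributes=attributes,
            visual_emb=encode_image(image),
            text_emb=encode_text(description),
        ))
    return database


# Toy stand-ins so the sketch runs without a real VLM or encoder.
def toy_fingerprint(image: str, category: str) -> Tuple[List[str], str]:
    return (["round neck", "CVPR logo"], f"a {category} with a CVPR logo")


def toy_image_encoder(image: str) -> Vector:
    return [float(len(image)), 1.0]


def toy_text_encoder(text: str) -> Vector:
    return [float(len(text)), 0.0]


db = build_database([("tshirt.jpg", "my-tshirt", "t-shirt")],
                    toy_fingerprint, toy_image_encoder, toy_text_encoder)
print(db[0].name, db[0].attributes)
```

Passing the VLM and encoders in as callables keeps the record layout independent of any particular model, which is one plausible way to mirror the training-free setting described here.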
    

Stage Two: Retrieval-Reasoning-Based Concept Inference

  1. Multimodal Concept Retrieval:
    $s_{q,i} = \frac{1}{2}\left(s^{V,V}_{q,i} + s^{V,T}_{q,i}\right)$

    Select the top-K candidate concepts $C_K$
  2. Attribute-Focused Chain-of-Thought Reasoning:
    $\{A_{q,i}, \forall i \in C_K\}, \tilde{c} = \Phi_{\mathrm{VLM}}(P^V_R, P^T_R)$

  3. Cross-Modal Attribute Verification:
    $s^{V,A}_{q,i} = \frac{1}{|A_{q,i}|} \sum_{a_j \in A_{q,i}} \langle f^V_q, f^T_{a_j} \rangle$

  4. Pairwise Reasoning (when verification fails):
    $p_i = \frac{\lambda^{\mathrm{Yes}}_i}{\lambda^{\mathrm{Yes}}_i + \lambda^{\mathrm{No}}_i}$
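
The three scoring formulas of Stage Two can be written out in a few lines of plain Python. This sketch rests on assumptions the summary does not state: the similarities s^{V,V} and s^{V,T} and the inner product in the verification score are all taken to be cosine similarity, and the Yes/No quantities are treated as non-negative scores. All function names are hypothetical, and a real system would replace the brute-force ranking with an index such as FAISS.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity, used here as an assumed stand-in for the inner product."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def retrieval_score(f_q: Vector, f_v_i: Vector, f_t_i: Vector) -> float:
    """s_{q,i}: mean of the visual-visual and visual-text similarities."""
    return 0.5 * (cosine(f_q, f_v_i) + cosine(f_q, f_t_i))


def top_k(f_q: Vector, entries: Sequence[Tuple[str, Vector, Vector]], k: int = 3) -> List[str]:
    """Names of the k entries with the highest retrieval score."""
    ranked = sorted(entries, key=lambda e: retrieval_score(f_q, e[1], e[2]), reverse=True)
    return [name for name, _, _ in ranked[:k]]


def attribute_score(f_q: Vector, attribute_embs: Sequence[Vector]) -> float:
    """s^{V,A}_{q,i}: mean similarity between the query image and each attribute."""
    return sum(cosine(f_q, f_a) for f_a in attribute_embs) / len(attribute_embs)


def pairwise_probability(lam_yes: float, lam_no: float) -> float:
    """p_i: share of the Yes score in the combined Yes/No score."""
    return lam_yes / (lam_yes + lam_no)


# Toy 2-D embeddings: (name, visual embedding, text embedding).
query = [1.0, 0.0]
entries = [
    ("a", [1.0, 0.0], [0.0, 1.0]),
    ("b", [0.0, 1.0], [0.0, 1.0]),
    ("c", [1.0, 0.0], [1.0, 0.0]),
]
candidates = top_k(query, entries, k=2)
print(candidates)
print(attribute_score(query, [[1.0, 0.0], [0.0, 1.0]]))
print(pairwise_probability(3.0, 1.0))
```

Averaging the two similarities means a candidate must match the query both visually and through its description to rank first, as entry "c" does in the toy data.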
    

Technical Innovations

  1. Concept Fingerprint Mechanism: First proposes using fine-grained attributes extracted by VLMs as unique identifiers for concepts
  2. Multi-Layer Verification Strategy: Designs a progressive verification mechanism: CoT reasoning → attribute verification → pairwise reasoning
  3. Cross-Modal Consistency Checking: Reduces hallucinations by comparing text reasoning results with vision-text alignment scores
  4. Training-Free Paradigm: Completely relies on the internal knowledge of pre-trained VLMs without any fine-tuning
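
The progressive verification flow (CoT reasoning, then attribute verification, then pairwise reasoning) can be sketched as a short decision function. The summary does not specify how a discrepancy between scores is detected, so this sketch assumes one simple rule: the CoT choice is accepted only if it is also the candidate with the highest attribute-verification score. The names `resolve_concept` and `pairwise_match` are hypothetical.

```python
from typing import Callable, Dict


def resolve_concept(
    cot_choice: str,
    attribute_scores: Dict[str, float],
    pairwise_match: Callable[[str], float],
) -> str:
    """Return the final concept for a query.

    cot_choice: concept picked by chain-of-thought reasoning.
    attribute_scores: candidate -> cross-modal attribute score s^{V,A}.
    pairwise_match: candidate -> match probability p_i (assumed to query the VLM).
    """
    verified = max(attribute_scores, key=attribute_scores.get)
    if verified == cot_choice:
        # The two signals agree, so the costlier pairwise step is skipped.
        return cot_choice
    # Assumed discrepancy rule: fall back to pairwise matching on every candidate.
    probabilities = {c: pairwise_match(c) for c in attribute_scores}
    return max(probabilities, key=probabilities.get)


calls = []


def toy_pairwise(candidate: str) -> float:
    calls.append(candidate)
    return {"cvpr-shirt": 0.2, "iccv-shirt": 0.9}[candidate]


agree = resolve_concept("cvpr-shirt", {"cvpr-shirt": 0.8, "iccv-shirt": 0.3}, toy_pairwise)
calls_after_agree = len(calls)
disagree = resolve_concept("cvpr-shirt", {"cvpr-shirt": 0.3, "iccv-shirt": 0.8}, toy_pairwise)
print(agree, disagree)
```

Running the expensive pairwise comparison only on disagreement is what keeps the extra inference cost confined to ambiguous queries.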

Experimental Setup

Datasets

  1. MyVLM: 29 personal concepts
  2. Yo'LLaVA: 40 concepts including objects, people, and buildings
  3. PerVA (newly proposed): 329 concepts across 21 categories, 67,482 images, specifically designed for testing visual ambiguity scenarios

Evaluation Metrics

  1. Recognition Task: Recall (Pos. Acc.), Specificity (Neg. Acc.), Weighted Average (Wtd)
  2. Caption Generation: Hard Recall - proportion of concept names appearing in generated captions
  3. Personalized VQA: Answer accuracy
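
The recognition and captioning metrics above can be computed as follows. This is a sketch under two assumptions not confirmed by the summary: the weighted average is taken as the unweighted mean of recall and specificity, and a caption counts as a hit when it contains the concept name as a case-insensitive substring. Function names are illustrative.

```python
from typing import Sequence


def recall(preds: Sequence[bool], labels: Sequence[bool]) -> float:
    """Positive accuracy: share of positive samples predicted positive."""
    positives = [p for p, y in zip(preds, labels) if y]
    return sum(positives) / len(positives)


def specificity(preds: Sequence[bool], labels: Sequence[bool]) -> float:
    """Negative accuracy: share of negative samples predicted negative."""
    negatives = [p for p, y in zip(preds, labels) if not y]
    return sum(not p for p in negatives) / len(negatives)


def weighted_accuracy(preds: Sequence[bool], labels: Sequence[bool]) -> float:
    """Assumed definition: mean of recall and specificity."""
    return 0.5 * (recall(preds, labels) + specificity(preds, labels))


def hard_recall(captions: Sequence[str], concept_names: Sequence[str]) -> float:
    """Share of captions that mention their concept name (substring match assumed)."""
    hits = sum(name.lower() in cap.lower() for cap, name in zip(captions, concept_names))
    return hits / len(captions)


labels = [True, True, False, False, False]
preds = [True, False, False, False, True]
print(recall(preds, labels), specificity(preds, labels), weighted_accuracy(preds, labels))
print(hard_recall(["Fluffy is napping", "A cat on a couch"], ["Fluffy", "Fluffy"]))
```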

Comparison Methods

  • MyVLM, Yo'LLaVA (training-required methods)
  • RAP (retrieval-augmented method)
  • GPT-4V + Vprompt
  • LLaVA, LLaVA + prompt
  • MiniCPM-o + prompt

Implementation Details

  • Base VLM: MiniCPM-o 2.6
  • Retrieval System: FAISS
  • Encoder: CLIP ViT-L/14-336
  • K value: K=3

Experimental Results

Main Results

MyVLM Dataset:

  • Weighted Accuracy: 97.4% (best)
  • Caption Recall: 91.4%

Yo'LLaVA Dataset:

  • Weighted Accuracy: 94.4% (+2.2% vs RAP)
  • Caption Recall: 87.1% (+5.5% over second-best method)
  • VQA Accuracy: 96.5% (+3.3% vs RAP)

PerVA Dataset:

  • Weighted Accuracy: 91.8% (+2.8% vs RAP)
  • Caption Recall: 72.5%
  • Significant advantage over training methods: +29.6% vs MyVLM, +19.8% vs Yo'LLaVA

Ablation Study

Main Component Analysis (PerVA Dataset):

  • Complete R2P: 91.8% Wtd, 72.5% Recall
  • Without Fingerprint Attributes: 86.5% Wtd, 62.2% Recall
  • CoT Reasoning Only: 84.7% Wtd, 62.8% Recall
  • Manually Defined Attributes: 92.5% Wtd, 72.8% Recall
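
To make the size of each ablation effect explicit, the gaps to the complete model can be computed directly from the four rows above. The snippet only does arithmetic on the reported numbers; a positive value means the complete R2P scores higher, in percentage points.

```python
# Numbers copied from the ablation list above (PerVA, in percent): (Wtd, Recall).
ablation = {
    "Complete R2P": (91.8, 72.5),
    "Without Fingerprint Attributes": (86.5, 62.2),
    "CoT Reasoning Only": (84.7, 62.8),
    "Manually Defined Attributes": (92.5, 72.8),
}

full_wtd, full_recall = ablation["Complete R2P"]

# Gap between the complete model and each variant, rounded to one decimal.
deltas = {
    name: (round(full_wtd - wtd, 1), round(full_recall - rec, 1))
    for name, (wtd, rec) in ablation.items()
    if name != "Complete R2P"
}

for name, (d_wtd, d_rec) in deltas.items():
    print(f"{name}: {d_wtd:+.1f} Wtd, {d_rec:+.1f} Recall")
```

Removing the fingerprint attributes costs 5.3 points of weighted accuracy and 10.3 points of recall, while manually defined attributes gain only 0.7 and 0.3 points over the VLM-generated ones.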

Verification Strategy Comparison (caption recall, PerVA):

  • Attribute Verification (this work): 72.5%
  • Pairwise Reasoning: 72.3%
  • No Estimation: 71.2%
  • Abstention Strategy: 70.7%

Case Analysis

The paper demonstrates R2P's effectiveness in handling visually similar concepts, such as distinguishing different T-shirts (CVPR vs ICCV logo) and identifying specific plush toys. The model accurately identifies target concepts through key attributes (e.g., "CVPR logo", "round neck").

Experimental Findings

  1. Importance of Fingerprint Attributes: VLM-generated attributes perform nearly as well as manually defined attributes (91.8% vs. 92.5% Wtd on PerVA)
  2. Advantages of Multimodal Retrieval: Retrieval strategies combining visual and text embeddings outperform single-modality methods
  3. Effectiveness of Verification Mechanism: Cross-modal attribute verification scores highest among the compared verification strategies (72.5%, versus 70.7% to 72.3% for the alternatives)

Related Work

VLM Personalization

Early methods such as MyVLM and Yo'LLaVA employ inversion strategies, assigning unique latent representations to each object. Recent work reduces personalization time through large-scale fine-tuning and multi-image inputs.

Attribute-Based Reasoning

Recognizing objects based on attributes is a long-standing problem in computer vision with important applications in zero-shot learning. This work is similar to research on discovering useful or machine-generated attributes but focuses on describing personal objects.

Conclusions and Discussion

Main Conclusions

  1. Demonstrates, for the first time, the feasibility of training-free VLM personalization
  2. R2P effectively addresses personal concept recognition through concept fingerprints and a retrieval-reasoning paradigm
  3. Achieves state-of-the-art performance across multiple benchmarks

Limitations

  1. Computational Overhead: Although training-free, the multi-step verification process during inference still incurs computational costs
  2. Scene Limitations: Performance may be constrained in cluttered scenes containing multiple similar concepts
  3. Single-Image Limitation: Currently supports personalization with only a single reference image

Future Directions

  1. Reduce computational overhead and improve inference efficiency
  2. Improve performance in cluttered scenes
  3. Extend to multi-reference image settings
  4. Explore additional application scenarios

In-Depth Evaluation

Strengths

  1. Strong Novelty: First to explore the training-free setting in VLM personalization, opening new research directions
  2. Complete Method: Designs a comprehensive retrieval-reasoning-verification pipeline with mature technical solutions
  3. Comprehensive Experiments: Conducts thorough evaluation across multiple datasets, including newly constructed challenging benchmarks
  4. Excellent Performance: Achieves SOTA performance across all benchmarks
  5. High Practical Value: Training-free nature makes the method easier to deploy and use

Weaknesses

  1. Computational Complexity: Multi-step reasoning may present efficiency concerns in practical applications
  2. VLM Quality Dependency: Method effectiveness heavily depends on the capabilities of the underlying VLM
  3. Fingerprint Extraction Quality: Stability of VLM-generated fingerprint attributes may be insufficient
  4. Scalability Issues: Complexity of retrieval and reasoning increases with growing concept numbers

Impact

  1. Academic Contribution: Provides a new research paradigm for VLM personalization
  2. Practical Value: Lowers deployment barriers for personalized VLMs
  3. Reproducibility: Reports implementation details and commits to releasing the code upon acceptance
  4. Inspirational Significance: Demonstrates the potential of leveraging pre-trained model internal knowledge

Applicable Scenarios

  1. Personal Assistant Systems: Users can quickly add personal concepts without training
  2. Smart Homes: Recognizing user personal items and environments
  3. Educational Applications: Personalized learning content recognition
  4. E-commerce Recommendations: Product recognition based on user personal preferences


Overall Assessment: The paper proposes a training-free method for VLM personalization, backed by a complete retrieval-reasoning-verification pipeline and an evaluation on three benchmarks. Its main contribution is showing that a VLM's internal knowledge can be leveraged for personalization without fine-tuning, which opens a new research direction in this field.