Training-Free Personalization via Retrieval and Reasoning on Fingerprints

Das, Talon, Wang et al.

Vision Language Models (VLMs) have led to major improvements in multimodal reasoning, yet they still struggle to understand user-specific concepts. Existing personalization methods address this limitation but rely heavily on training procedures that can be either costly or unpleasant to individual users. We depart from existing work, and for the first time explore the training-free setting in the context of personalization. We propose a novel method, Retrieval and Reasoning for Personalization (R2P), leveraging internal knowledge of VLMs. First, we leverage VLMs to extract the concept fingerprint, i.e., key attributes uniquely defining the concept within its semantic class. When a query arrives, the most similar fingerprints are retrieved and scored via chain-of-thought reasoning. To reduce the risk of hallucinations, the scores are validated through cross-modal verification at the attribute level: in case of a discrepancy between the scores, R2P refines the concept association via pairwise multimodal matching, where the retrieved fingerprints and their images are directly compared with the query. We validate R2P on two publicly available benchmarks and a newly introduced dataset, Personal Concepts with Visual Ambiguity (PerVA), for concept identification, highlighting challenges in visual ambiguity. R2P consistently outperforms state-of-the-art approaches on various downstream tasks across all benchmarks. Code will be available upon acceptance.

Basic Information

Paper ID: 2503.18623
Title: Training-Free Personalization via Retrieval and Reasoning on Fingerprints
Authors: Deepayan Das, Davide Talon, Yiming Wang, Massimiliano Mancini, Elisa Ricci
Classification: cs.CV (Computer Vision)
Publication Date/Venue: arXiv 2025 (submitted to CVPR 2025)
Paper Link: https://arxiv.org/abs/2503.18623

Abstract

This paper proposes R2P (Retrieval and Reasoning for Personalization), a method that explores, for the first time, the training-free setting in vision-language model (VLM) personalization. For each user-specific concept, R2P extracts a concept fingerprint, i.e., the key attributes that uniquely define the concept within its semantic class. At query time it retrieves the most similar fingerprints and scores them through chain-of-thought reasoning. To reduce the risk of hallucinations, R2P validates these scores with attribute-level cross-modal verification and, when the scores disagree, refines the concept association through pairwise multimodal matching.

Research Background and Motivation

Problem Definition

Although existing vision-language models have achieved significant advances in multimodal reasoning, they still struggle with understanding user-specific concepts. For instance, VLMs find it difficult to comprehend personal concepts in questions such as "Where are my keys?" or "What is Fluffy doing?"

Research Significance

Personalization is a critical step toward making VLMs practical. Users need models capable of recognizing and reasoning about their personal items, pets, friends, and other specific concepts.

Limitations of Existing Methods

Training Dependency: Existing personalization methods such as MyVLM and Yo'LLaVA heavily rely on training processes, requiring multiple reference samples and extensive negative samples for contrastive learning
High Cost: Adding new concepts requires expensive fine-tuning for each instance
Difficult Data Collection: Requires collecting large amounts of training data, which is both expensive and inconvenient for users

Research Motivation

The authors pose a critical question: Since VLMs have been exposed to nearly all semantic concepts through web-scale training data, can we leverage VLMs' internal knowledge to achieve training-free personalization?

Core Contributions

First Exploration of Training-Free Personalization: Proposes and implements the training-free setting in VLM personalization for the first time
Proposes R2P Framework: Designs a retrieval-and-reasoning method that uses textual attributes as concept fingerprints to uniquely identify personal concepts
Introduces PerVA Dataset: Constructs a new benchmark dataset specifically designed for testing personalization methods under visual ambiguity scenarios
Achieves SOTA Performance: Consistently outperforms existing methods across all benchmarks, demonstrating the effectiveness of the training-free approach

Method Details

Task Definition

Given user-provided reference images $I_i \in V$, concept names $c_i \in T$, and categories $g_i \in T$, construct a user-specific multimodal database $D$. At test time, given a query image $Q \in V$ and text prompt $P_q \in T$, the VLM should provide answers related to personal concepts.

Model Architecture

R2P consists of two main stages:

Stage One: Personal Database Creation

Concept Fingerprint Extraction:

$$\{A_i, d_i\} = \Phi_{VLM}(P^V_D, P^T_D)$$

where $A_i$ is the list of fingerprint attributes and $d_i$ is a brief description.

Multimodal Encoding:
- Visual embedding: $f^V_i = E_V(I_i)$
- Text embedding: $f^T_i = E_T(d_i)$

Database Construction:

$$D = \{I_i, c_i, g_i, d_i, A_i, f^V_i, f^T_i\}_{i=1}^{N}$$
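
The database construction step can be sketched as follows. This is a minimal illustration, not the authors' implementation: `ConceptEntry`, `build_database`, and the three callables are hypothetical names, and the assumption that the fingerprint extractor receives the reference image and its category is made here only to mirror the prompts $P^V_D$ and $P^T_D$.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

Vector = List[float]


@dataclass
class ConceptEntry:
    """One record of the personal database D (field names are illustrative)."""
    image_path: str        # I_i: reference image
    name: str              # c_i: concept name
    category: str          # g_i: semantic category
    description: str       # d_i: brief description from the VLM
    attributes: List[str]  # A_i: fingerprint attributes from the VLM
    visual_emb: Vector     # f^V_i = E_V(I_i)
    text_emb: Vector       # f^T_i = E_T(d_i)


def build_database(
    references: Iterable[Tuple[str, str, str]],
    extract_fingerprint: Callable[[str, str], Tuple[List[str], str]],
    encode_image: Callable[[str], Vector],
    encode_text: Callable[[str], Vector],
) -> List[ConceptEntry]:
    """Build D from (image_path, name, category) triples.

    extract_fingerprint stands in for the VLM call and returns (A_i, d_i);
    encode_image and encode_text stand in for the encoders E_V and E_T.
    """
    database = []
    for image_path, name, category in references:
        attributes, description = extract_fingerprint(image_path, category)
        database.append(ConceptEntry(
            image_path=image_path,
            name=name,
            category=category,
            description=description,
            attributes=attributes,
            visual_emb=encode_image(image_path),
            text_emb=encode_text(description),
        ))
    return database
```

Passing the VLM and the encoders in as callables keeps the sketch independent of any particular model or embedding library.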

Stage Two: Retrieval-Reasoning-Based Concept Inference

Multimodal Concept Retrieval:

$$s_{q,i} = \frac{1}{2}\left(s^{V,V}_{q,i} + s^{V,T}_{q,i}\right)$$

The top-K candidate concepts $C_K$ are then selected.
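
A small sketch of this retrieval score, under two assumptions that the summary does not state explicitly: $s^{V,V}_{q,i}$ is taken to be the cosine similarity between the query's visual embedding and the stored visual embedding, and $s^{V,T}_{q,i}$ the cosine similarity between the query's visual embedding and the stored text embedding. The function names are hypothetical, and a brute-force scan stands in for an index such as FAISS.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_top_k(
    query_visual_emb: Vector,
    entries: Sequence[Tuple[str, Vector, Vector]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Rank entries (name, visual_emb, text_emb) by s = (s_VV + s_VT) / 2."""
    scored = []
    for name, visual_emb, text_emb in entries:
        s_vv = cosine(query_visual_emb, visual_emb)  # assumed meaning of s^{V,V}
        s_vt = cosine(query_visual_emb, text_emb)    # assumed meaning of s^{V,T}
        scored.append((name, 0.5 * (s_vv + s_vt)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

The default `k=3` follows the K value listed under Implementation Details.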

Attribute-Focused Chain-of-Thought Reasoning:

$$\{A_{q,i}, \forall i \in C_K\}, \tilde{c} = \Phi_{VLM}(P^V_R, P^T_R)$$

Cross-Modal Attribute Verification:

$$s^{V,A}_{q,i} = \frac{1}{|A_{q,i}|} \sum_{a_j \in A_{q,i}} \langle f^V_q, f^T_{a_j} \rangle$$
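
The verification score is a mean of inner products, which the following sketch computes directly. It assumes the embeddings are already L2-normalized, so that the inner product behaves like a cosine similarity; the function name and the choice to return 0.0 for an empty attribute list are illustrative, not taken from the paper.

```python
from typing import Sequence

Vector = Sequence[float]


def attribute_verification_score(
    query_visual_emb: Vector,
    attribute_text_embs: Sequence[Vector],
) -> float:
    """Mean inner product between f^V_q and the text embedding of each attribute.

    attribute_text_embs holds one embedding f^T_{a_j} per matched attribute a_j
    in A_{q,i}. Returning 0.0 for an empty list is an assumption of this sketch.
    """
    if not attribute_text_embs:
        return 0.0
    total = 0.0
    for attr_emb in attribute_text_embs:
        total += sum(q * a for q, a in zip(query_visual_emb, attr_emb))
    return total / len(attribute_text_embs)
```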

Pairwise Reasoning (when verification fails):

$$p_i = \frac{\lambda^{Yes}_i}{\lambda^{Yes}_i + \lambda^{No}_i}$$
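
As a rough illustration of the pairwise score, the sketch below normalizes a "Yes" weight against a "No" weight for each candidate and picks the highest result. The summary does not define $\lambda$; the code assumes it is a non-negative score (for example, a token probability) that the VLM assigns to each answer, and both function names are hypothetical.

```python
from typing import Dict, Tuple


def yes_probability(lam_yes: float, lam_no: float) -> float:
    """p_i = lam_yes / (lam_yes + lam_no), for non-negative weights."""
    if lam_yes < 0 or lam_no < 0:
        raise ValueError("weights must be non-negative")
    total = lam_yes + lam_no
    if total == 0:
        raise ValueError("at least one weight must be positive")
    return lam_yes / total


def best_pairwise_match(weights: Dict[str, Tuple[float, float]]) -> str:
    """Return the candidate whose (lam_yes, lam_no) pair gives the highest p_i."""
    return max(weights, key=lambda name: yes_probability(*weights[name]))
```

For example, weights of 0.6 for "Yes" and 0.2 for "No" give $p_i = 0.75$.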

Technical Innovations

Concept Fingerprint Mechanism: First proposes using fine-grained attributes extracted by VLMs as unique identifiers for concepts
Multi-Layer Verification Strategy: Designs a progressive verification mechanism: CoT reasoning → attribute verification → pairwise reasoning
Cross-Modal Consistency Checking: Reduces hallucinations by comparing text reasoning results with vision-text alignment scores
Training-Free Paradigm: Completely relies on the internal knowledge of pre-trained VLMs without any fine-tuning
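
The progressive verification strategy can be sketched as a short control flow. The discrepancy test used here is an assumption: the chain-of-thought prediction is accepted only when it is also the candidate with the highest attribute verification score, and otherwise the decision falls back to pairwise reasoning. The function and argument names are hypothetical.

```python
from typing import Callable, Dict, Tuple


def infer_concept(
    cot_prediction: str,
    verification_scores: Dict[str, float],
    pairwise_score: Callable[[str], float],
) -> Tuple[str, str]:
    """Return (predicted concept, stage that produced the decision).

    cot_prediction: concept chosen by chain-of-thought reasoning.
    verification_scores: attribute verification score for each retrieved candidate.
    pairwise_score: returns the pairwise matching score p_i for a candidate.
    """
    # Assumed discrepancy test: does cross-modal verification agree with CoT?
    verified = max(verification_scores, key=verification_scores.get)
    if verified == cot_prediction:
        return cot_prediction, "verified"
    # Disagreement: compare the query with each candidate directly.
    pairwise = {name: pairwise_score(name) for name in verification_scores}
    return max(pairwise, key=pairwise.get), "pairwise"
```

Under this reading, the more expensive pairwise comparison runs only for queries where the two signals disagree.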

Experimental Setup

Datasets

MyVLM: 29 personal concepts
Yo'LLaVA: 40 concepts including objects, people, and buildings
PerVA (newly proposed): 329 concepts across 21 categories, 67,482 images, specifically designed for testing visual ambiguity scenarios

Evaluation Metrics

Recognition Task: Recall (Pos. Acc.), Specificity (Neg. Acc.), Weighted Average (Wtd)
Caption Generation: Hard Recall - proportion of concept names appearing in generated captions
Personalized VQA: Answer accuracy
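
The recognition and captioning metrics can be computed as in the sketch below. Two details are assumptions rather than statements from the summary: the weighted average is taken as the unweighted mean of recall and specificity, and a caption counts as a hit when it contains the concept name as a case-insensitive substring.

```python
from typing import Dict, Sequence


def recognition_metrics(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Recall (positive accuracy), specificity (negative accuracy), and their mean."""
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # Assumption: "weighted" is the equal-weight average of the two accuracies.
    weighted = 0.5 * (recall + specificity)
    return {"recall": recall, "specificity": specificity, "weighted": weighted}


def hard_recall(captions: Sequence[str], concept_names: Sequence[str]) -> float:
    """Fraction of captions that mention their target concept name."""
    if len(captions) != len(concept_names):
        raise ValueError("one concept name is required per caption")
    hits = sum(
        name.lower() in caption.lower()
        for caption, name in zip(captions, concept_names)
    )
    return hits / len(captions)
```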

Comparison Methods

MyVLM, Yo'LLaVA (training-required methods)
RAP (retrieval-augmented method)
GPT-4V + Vprompt
LLaVA, LLaVA + prompt
MiniCPM-o + prompt

Implementation Details

Base VLM: MiniCPM-o 2.6
Retrieval System: FAISS
Encoder: CLIP ViT-L/14-336
K value: K=3

Experimental Results

Main Results

MyVLM Dataset:

Weighted Accuracy: 97.4% (best)
Caption Recall: 91.4%

Yo'LLaVA Dataset:

Weighted Accuracy: 94.4% (+2.2% vs RAP)
Caption Recall: 87.1% (+5.5% over second-best method)
VQA Accuracy: 96.5% (+3.3% vs RAP)

PerVA Dataset:

Weighted Accuracy: 91.8% (+2.8% vs RAP)
Caption Recall: 72.5%
Significant advantage over training methods: +29.6% vs MyVLM, +19.8% vs Yo'LLaVA

Ablation Study

Main Component Analysis (PerVA Dataset):

Complete R2P: 91.8% Wtd, 72.5% Recall
Without Fingerprint Attributes: 86.5% Wtd, 62.2% Recall
CoT Reasoning Only: 84.7% Wtd, 62.8% Recall
Manually Defined Attributes: 92.5% Wtd, 72.8% Recall
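
To make the component contributions easier to read, the snippet below computes each variant's difference from the complete model, in percentage points, using only the four rows listed above. The helper name is illustrative.

```python
from typing import Dict, Tuple

# (Wtd, Recall) on PerVA, copied from the ablation rows above.
ABLATION: Dict[str, Tuple[float, float]] = {
    "Complete R2P": (91.8, 72.5),
    "Without Fingerprint Attributes": (86.5, 62.2),
    "CoT Reasoning Only": (84.7, 62.8),
    "Manually Defined Attributes": (92.5, 72.8),
}


def deltas_vs_complete(
    results: Dict[str, Tuple[float, float]],
    reference: str = "Complete R2P",
) -> Dict[str, Tuple[float, float]]:
    """Difference of each variant from the reference row, in percentage points."""
    ref_wtd, ref_recall = results[reference]
    return {
        name: (round(wtd - ref_wtd, 1), round(recall - ref_recall, 1))
        for name, (wtd, recall) in results.items()
        if name != reference
    }


for name, (d_wtd, d_recall) in deltas_vs_complete(ABLATION).items():
    print(f"{name}: {d_wtd:+.1f} Wtd, {d_recall:+.1f} Recall")
```

Removing fingerprint attributes costs 5.3 points of weighted accuracy and 10.3 points of recall, while manually defined attributes add only 0.7 and 0.3 points over the VLM-generated ones.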

Verification Strategy Comparison:

Attribute Verification (this work): 72.5%
Pairwise Reasoning: 72.3%
No Estimation: 71.2%
Abstention Strategy: 70.7%

Case Analysis

The paper demonstrates R2P's effectiveness in handling visually similar concepts, such as distinguishing different T-shirts (CVPR vs ICCV logo) and identifying specific plush toys. The model accurately identifies target concepts through key attributes (e.g., "CVPR logo", "round neck").

Experimental Findings

Importance of Fingerprint Attributes: VLM-generated attributes perform nearly as well as manually defined attributes
Advantages of Multimodal Retrieval: Retrieval strategies combining visual and text embeddings outperform single-modality methods
Effectiveness of Verification Mechanism: Cross-modal attribute verification effectively reduces hallucinations and improves accuracy

Related Work

VLM Personalization

Early methods such as MyVLM and Yo'LLaVA employ inversion strategies, assigning unique latent representations to each object. Recent work reduces personalization time through large-scale fine-tuning and multi-image inputs.

Attribute-Based Reasoning

Recognizing objects based on attributes is a long-standing problem in computer vision with important applications in zero-shot learning. This work is similar to research on discovering useful or machine-generated attributes but focuses on describing personal objects.

Conclusions and Discussion

Main Conclusions

First demonstrates the feasibility of training-free VLM personalization
R2P effectively addresses personal concept recognition through concept fingerprints and retrieval-reasoning paradigm
Achieves state-of-the-art performance across multiple benchmarks

Limitations

Computational Overhead: Although training-free, the multi-step verification process during inference still incurs computational costs
Scene Limitations: Performance may be constrained in cluttered scenes containing multiple similar concepts
Single-Image Limitation: Currently supports personalization with only a single reference image

Future Directions

Reduce computational overhead and improve inference efficiency
Improve performance in cluttered scenes
Extend to multi-reference image settings
Explore additional application scenarios

In-Depth Evaluation

Strengths

Strong Novelty: First to explore training-free setting in VLM personalization, opening new research directions
Complete Method: Designs a comprehensive retrieval-reasoning-verification pipeline with mature technical solutions
Comprehensive Experiments: Conducts thorough evaluation across multiple datasets, including newly constructed challenging benchmarks
Excellent Performance: Achieves SOTA performance across all benchmarks
High Practical Value: Training-free nature makes the method easier to deploy and use

Weaknesses

Computational Complexity: Multi-step reasoning may present efficiency concerns in practical applications
VLM Quality Dependency: Method effectiveness heavily depends on the capabilities of the underlying VLM
Fingerprint Extraction Quality: Stability of VLM-generated fingerprint attributes may be insufficient
Scalability Issues: Complexity of retrieval and reasoning increases with growing concept numbers

Impact

Academic Contribution: Provides a new research paradigm for VLM personalization
Practical Value: Lowers deployment barriers for personalized VLMs
Reproducibility: Reports the implementation settings in detail; the authors state that code will be made available upon acceptance
Inspirational Significance: Demonstrates the potential of leveraging pre-trained model internal knowledge

Applicable Scenarios

Personal Assistant Systems: Users can quickly add personal concepts without training
Smart Homes: Recognizing user personal items and environments
Educational Applications: Personalized learning content recognition
E-commerce Recommendations: Product recognition based on user personal preferences

References

The paper cites important works in related fields, including personalization methods such as MyVLM, Yo'LLaVA, and RAP, as well as foundational models like CLIP and LLaVA, providing solid theoretical foundations for the research.

Overall Assessment: This is a high-quality research paper that proposes an innovative training-free method in VLM personalization with a complete technical solution and comprehensive experimental evaluation, possessing significant academic and practical value. The paper's main contribution lies in demonstrating the feasibility of leveraging VLM internal knowledge for personalization, opening new research directions in this field.