Training-Free Personalization via Retrieval and Reasoning on Fingerprints
Das, Talon, Wang et al.
Vision Language Models (VLMs) have led to major improvements in multimodal reasoning, yet they still struggle to understand user-specific concepts. Existing personalization methods address this limitation but rely heavily on training procedures that can be either costly or unpleasant for individual users. We depart from existing work and, for the first time, explore the training-free setting in the context of personalization. We propose a novel method, Retrieval and Reasoning for Personalization (R2P), leveraging internal knowledge of VLMs. First, we leverage VLMs to extract the concept fingerprint, i.e., key attributes uniquely defining the concept within its semantic class. When a query arrives, the most similar fingerprints are retrieved and scored via chain-of-thought reasoning. To reduce the risk of hallucinations, the scores are validated through cross-modal verification at the attribute level: in case of a discrepancy between the scores, R2P refines the concept association via pairwise multimodal matching, where the retrieved fingerprints and their images are directly compared with the query. We validate R2P on two publicly available benchmarks and a newly introduced dataset, Personal Concepts with Visual Ambiguity (PerVA), for concept identification highlighting challenges in visual ambiguity. R2P consistently outperforms state-of-the-art approaches on various downstream tasks across all benchmarks. Code will be available upon acceptance.
This paper proposes R2P (Retrieval and Reasoning for Personalization), a method that explores, for the first time, the training-free setting in vision-language model (VLM) personalization. R2P represents each user-specific concept by a fingerprint, i.e., the key attributes that uniquely define the concept within its semantic class. At query time, it retrieves the most similar fingerprints and scores them through chain-of-thought reasoning. To mitigate hallucination, R2P verifies the scores across modalities at the attribute level and, when the scores disagree, refines the concept association via pairwise multimodal matching.
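The retrieve, score, verify, and fall-back flow described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the Jaccard similarity stands in for whatever retrieval similarity R2P uses, the callables `text_score`, `image_score`, and `pairwise_match` are hypothetical placeholders for VLM calls, and treating "discrepancy" as the two modalities preferring different candidates is an assumption.

```python
from typing import Callable, Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Toy similarity between two attribute sets (stand-in for the real retriever)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve_top_k(query_attrs: Set[str],
                   fingerprints: Dict[str, Set[str]],
                   k: int = 2) -> List[str]:
    """Return the names of the k fingerprints most similar to the query."""
    ranked = sorted(fingerprints,
                    key=lambda name: jaccard(query_attrs, fingerprints[name]),
                    reverse=True)
    return ranked[:k]


def identify(query_attrs: Set[str],
             fingerprints: Dict[str, Set[str]],
             text_score: Callable[[str], float],
             image_score: Callable[[str], float],
             pairwise_match: Callable[[List[str]], str],
             k: int = 2) -> str:
    """Pick a concept; fall back to pairwise matching if the modalities disagree."""
    candidates = retrieve_top_k(query_attrs, fingerprints, k)
    best_by_text = max(candidates, key=text_score)    # hypothetical text-side score
    best_by_image = max(candidates, key=image_score)  # hypothetical image-side score
    if best_by_text == best_by_image:
        return best_by_text
    # Assumed reading of "discrepancy": the two scores prefer different candidates.
    return pairwise_match(candidates)


# Made-up fingerprints for illustration only.
fingerprints = {
    "fluffy": {"white fur", "red collar", "pointed ears"},
    "rex": {"brown fur", "red collar", "floppy ears"},
    "keys": {"metal", "blue keychain"},
}
query = {"white fur", "red collar"}


def overlap(name: str) -> float:
    return jaccard(query, fingerprints[name])


agreed = identify(query, fingerprints, overlap, overlap, lambda c: c[0])
```

In a real system the two scoring callables would wrap VLM prompts over the fingerprint text and the reference image; here they are plain functions so the control flow can be run and inspected on its own.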
Although existing vision-language models have achieved significant advances in multimodal reasoning, they still struggle with understanding user-specific concepts. For instance, VLMs find it difficult to comprehend personal concepts in questions such as "Where are my keys?" or "What is Fluffy doing?"
Personalization is a critical step toward making VLMs practical. Users need models capable of recognizing and reasoning about their personal items, pets, friends, and other specific concepts.
Training Dependency: Existing personalization methods such as MyVLM and Yo'LLaVA rely heavily on training, requiring multiple reference samples per concept and many negative samples for contrastive learning
High Cost: Adding a new concept requires expensive fine-tuning for each instance
Difficult Data Collection: Collecting large amounts of training data is both expensive and inconvenient for users
The authors pose a critical question: Since VLMs have been exposed to nearly all semantic concepts through web-scale training data, can we leverage VLMs' internal knowledge to achieve training-free personalization?
First Exploration of Training-Free Personalization: Proposes and implements the training-free setting in VLM personalization for the first time
Proposes R2P Framework: Designs a novel retrieve-and-reason method that uses textual attributes as concept fingerprints to uniquely identify personal concepts
Introduces PerVA Dataset: Constructs a new benchmark dataset specifically designed for testing personalization methods under visual ambiguity scenarios
Achieves SOTA Performance: Consistently outperforms existing methods across all benchmarks, demonstrating the effectiveness of the training-free approach
Given user-provided reference images $I_i \in V$, concept names $c_i \in T$, and categories $g_i \in T$, construct a user-specific multimodal database $D$. At test time, given a query image $Q \in V$ and a text prompt $P_q \in T$, the VLM should provide answers related to personal concepts.
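One way to picture the database described above is as a collection of per-concept records. The sketch below is an assumption about layout, not the paper's data structure: the class and field names (`ConceptEntry`, `ConceptDatabase`, `image_path`, `fingerprint`) are hypothetical, and a file path stands in for the reference image.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ConceptEntry:
    """One user concept; all field names are illustrative."""
    name: str            # concept name c_i, e.g. "Fluffy"
    category: str        # category g_i, e.g. "dog"
    image_path: str      # stand-in for the reference image I_i
    fingerprint: List[str] = field(default_factory=list)  # textual attributes


class ConceptDatabase:
    """Toy in-memory version of the database D, keyed by concept name."""

    def __init__(self) -> None:
        self._entries: Dict[str, ConceptEntry] = {}

    def add(self, entry: ConceptEntry) -> None:
        # Adding a concept is a plain insert: no training step is involved.
        self._entries[entry.name] = entry

    def get(self, name: str) -> ConceptEntry:
        return self._entries[name]

    def by_category(self, category: str) -> List[ConceptEntry]:
        return [e for e in self._entries.values() if e.category == category]

    def __len__(self) -> int:
        return len(self._entries)


# Made-up entries for illustration only.
db = ConceptDatabase()
db.add(ConceptEntry("Fluffy", "dog", "fluffy.jpg", ["white fur", "red collar"]))
db.add(ConceptEntry("my keys", "keys", "keys.jpg", ["blue keychain"]))
```

Keeping the category alongside each entry makes it cheap to restrict comparison to concepts of the same semantic class, which is where the fingerprints are meant to discriminate.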
The paper demonstrates R2P's effectiveness in handling visually similar concepts, such as distinguishing different T-shirts (CVPR vs ICCV logo) and identifying specific plush toys. The model accurately identifies target concepts through key attributes (e.g., "CVPR logo", "round neck").
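The idea that a fingerprint keeps the attributes that separate a concept from others in its class can be illustrated with plain set arithmetic. This is only a toy analogy: in R2P the attributes are extracted by a VLM, whereas the function below (a hypothetical `distinctive_attributes`) simply drops attributes shared with any other concept, and the attribute lists, including everything about the ICCV shirt, are made up.

```python
from typing import Dict, Set


def distinctive_attributes(attrs_by_concept: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """For each concept, keep only attributes that no other concept has."""
    result: Dict[str, Set[str]] = {}
    for name, attrs in attrs_by_concept.items():
        others: Set[str] = set()
        for other_name, other_attrs in attrs_by_concept.items():
            if other_name != name:
                others |= other_attrs
        result[name] = attrs - others
    return result


# Two concepts from the same semantic class (T-shirts); values are illustrative.
shirts = {
    "cvpr_shirt": {"CVPR logo", "round neck", "black"},
    "iccv_shirt": {"ICCV logo", "v-neck", "black"},
}
unique = distinctive_attributes(shirts)
```

Here the shared colour is discarded, while the logo and neckline remain as the attributes that tell the two shirts apart.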
Early methods such as MyVLM and Yo'LLaVA employ inversion strategies, assigning unique latent representations to each object. Recent work reduces personalization time through large-scale fine-tuning and multi-image inputs.
Recognizing objects based on attributes is a long-standing problem in computer vision with important applications in zero-shot learning. This work is similar to research on discovering useful or machine-generated attributes but focuses on describing personal objects.
The paper cites important works in related fields, including personalization methods such as MyVLM, Yo'LLaVA, and RAP, as well as foundational models like CLIP and LLaVA, providing solid theoretical foundations for the research.
Overall Assessment: This is a high-quality research paper that proposes an innovative training-free method for VLM personalization, with a complete technical solution and a comprehensive experimental evaluation. Its main contribution lies in demonstrating that the internal knowledge of VLMs can be leveraged for personalization, which opens new research directions in this field and gives the work both academic and practical value.