Training-Free Personalization via Retrieval and Reasoning on Fingerprints
Das, Talon, Wang et al.
Vision Language Models (VLMs) have led to major improvements in multimodal reasoning, yet they still struggle to understand user-specific concepts. Existing personalization methods address this limitation but rely heavily on training procedures that can be either costly or unpleasant for individual users. We depart from existing work and, for the first time, explore the training-free setting in the context of personalization. We propose a novel method, Retrieval and Reasoning for Personalization (R2P), which leverages the internal knowledge of VLMs. First, we use VLMs to extract the concept fingerprint, i.e., the key attributes that uniquely define the concept within its semantic class. When a query arrives, the most similar fingerprints are retrieved and scored via chain-of-thought reasoning. To reduce the risk of hallucinations, the scores are validated through cross-modal verification at the attribute level: in case of a discrepancy between the scores, R2P refines the concept association via pairwise multimodal matching, where the retrieved fingerprints and their images are directly compared with the query. We validate R2P on two publicly available benchmarks and on a newly introduced concept identification dataset, Personal Concepts with Visual Ambiguity (PerVA), which highlights the challenges of visual ambiguity. R2P consistently outperforms state-of-the-art approaches on various downstream tasks across all benchmarks. Code will be available upon acceptance.
academic
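This paper proposes a new method, R2P (Retrieval and Reasoning for Personalization), which is the first to explore the training-free setting for personalizing vision language models (VLMs). The method extracts a concept fingerprint that uniquely defines each user-specific concept; at query time, it retrieves the most similar fingerprints and scores them via chain-of-thought reasoning. To reduce the risk of hallucinations, R2P introduces an attribute-level cross-modal verification mechanism and, when necessary, refines the concept association through pairwise multimodal matching.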
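The control flow described above (retrieve similar fingerprints, score them, verify the score across modalities, and fall back to pairwise matching on a discrepancy) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Fingerprint`, `jaccard`, `retrieve`, and `identify`, the Jaccard-overlap retrieval, the tolerance `tol`, and the representation of a query as a set of attribute strings are all assumptions made for illustration. The three callables stand in for VLM calls whose prompts and outputs the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List


@dataclass(frozen=True)
class Fingerprint:
    """Hypothetical container: a concept name plus its distinguishing attributes."""
    name: str
    attributes: FrozenSet[str]


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Set-overlap similarity; a stand-in for whatever retrieval metric is really used."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(query: FrozenSet[str], database: List[Fingerprint], k: int) -> List[Fingerprint]:
    """Return the k fingerprints most similar to the query attributes."""
    ranked = sorted(database, key=lambda fp: jaccard(query, fp.attributes), reverse=True)
    return ranked[:k]


def identify(
    query: FrozenSet[str],
    database: List[Fingerprint],
    text_score: Callable[[FrozenSet[str], Fingerprint], float],
    visual_score: Callable[[FrozenSet[str], Fingerprint], float],
    pairwise_match: Callable[[FrozenSet[str], List[Fingerprint]], Fingerprint],
    k: int = 2,
    tol: float = 0.2,  # assumed threshold for "discrepancy between the scores"
) -> str:
    """Retrieve, score, verify across modalities, and refine only on disagreement."""
    candidates = retrieve(query, database, k)
    # Scoring step: stands in for chain-of-thought scoring of each candidate.
    best = max(candidates, key=lambda fp: text_score(query, fp))
    # Verification step: compare the score against a second, visual judgment.
    if abs(text_score(query, best) - visual_score(query, best)) <= tol:
        return best.name
    # Discrepancy: defer to a direct pairwise comparison of all candidates.
    return pairwise_match(query, candidates).name
```

In a real system the three callables would wrap VLM prompts over images and text; here they can be replaced by simple stubs, which keeps the sketch runnable and makes the branch taken on a score discrepancy easy to inspect.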