
You May Speak Freely: Improving the Fine-Grained Visual Recognition Capabilities of Multimodal Large Language Models with Answer Extraction

Lawrence, Saha, Wei et al.

Despite the renewed interest in zero-shot visual classification due to the rise of Multimodal Large Language Models (MLLMs), the problem of evaluating free-form responses of auto-regressive models remains a persistent challenge. Most existing works focus on language-only tasks or don't consider Multiple Choice Questions (MCQs) beyond 5-way options, both of which are critical capabilities to solve tasks in Fine-Grained Visual Classification (FGVC) where choice counts are in the hundreds to thousands and the choices are highly related. Furthermore, in this highly multi-way MCQ setting it is not clear how to extend LLM choice extraction to retrieval-based problems, where computing probabilities over the choice set is computationally costly. In this work we investigate nlg2choice, a simple two-stage method which first asks the MLLM an open-ended question for the task with minimal constraints, then uses text-only constrained decoding to predict the most likely choice. In retrieval settings, we compute the probability of the constrained response taking that choice with an early stopping method to significantly improve throughput. Our results show improvement over a suite of seven fine-grained visual datasets when evaluating in terms of classification and retrieval, and show that this performance holds over the various ways that users of LLMs can implement tasks in natural language.


Basic Information

Paper ID: 2510.14885
Title: You May Speak Freely: Improving the Fine-Grained Visual Recognition Capabilities of Multimodal Large Language Models with Answer Extraction
Authors: Logan Lawrence¹, Oindrila Saha¹, Megan Wei², Chen Sun², Subhransu Maji¹, Grant Van Horn¹
Affiliations: ¹University of Massachusetts, Amherst; ²Brown University
Categories: cs.CV (Computer Vision and Pattern Recognition), cs.CL (Computation and Language)
Published: October 16, 2025
Paper link: https://arxiv.org/abs/2510.14885

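Abstract

The rise of Multimodal Large Language Models (MLLMs) has renewed interest in zero-shot visual classification, yet evaluating the free-form responses of auto-regressive models remains a persistent challenge. Most existing work focuses on language-only tasks or does not consider multiple-choice questions (MCQs) with more than five options, although both are critical for Fine-Grained Visual Classification (FGVC), where the choices number in the hundreds to thousands and are highly related. In this highly multi-way MCQ setting it is also unclear how to extend LLM choice extraction to retrieval-based problems, because computing probabilities over the whole choice set is computationally costly. The paper investigates nlg2choice, a simple two-stage method that first asks the MLLM an open-ended question with minimal constraints, then uses text-only constrained decoding to predict the most likely choice. In retrieval settings, the probability that the constrained response takes a given choice is computed with an early stopping method, which significantly improves throughput.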

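Research Background and Motivation

Core Problems

Challenges of fine-grained visual classification: Conventional multiple-choice approaches perform poorly when faced with hundreds to thousands of highly similar options. In bird species identification, for example, LLaVA-1.5 is near-perfect on coarse-grained classification (such as "bird" vs. "not bird") but reaches only 1-2% accuracy on fine-grained species labels.
Limitations of evaluation methods: Existing methods either force a constrained output format (which may hinder reasoning) or allow free-form explanations (from which answers are hard to extract), and they lack an effective answer extraction mechanism.
Computational efficiency: In retrieval scenarios, computing probabilities over hundreds to thousands of choices is prohibitively expensive.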

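Research Motivation

MLLMs perform far worse on fine-grained visual recognition tasks than on coarse-grained ones.
Existing constrained decoding and first-token prediction methods break down in fine-grained settings.
There has been no systematic study of robustness to variation in user prompts.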

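Core Contributions

The nlg2choice method: A simple and effective two-stage answer extraction method that significantly improves classification and retrieval performance on seven fine-grained visual datasets.
Robustness validation: By generating semantically equivalent prompt variants, the authors show that the method is robust to variation in user input and that the performance gains are statistically significant.
Early-stopping optimization: An early stopping method for retrieval settings increases throughput by 15x (an improvement of up to 1362% on some datasets).
Systematic analysis: Constrained decoding is shown to be a reliable answer extractor that needs no additional training. The main bottleneck is that the free-form response itself lacks extractable content, not the answer extraction step.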

Stage 1: Open-Ended Question

Input prompt: "What is the species of bird in this image?"
Model output: "This bird is an Ivory Gull."

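Stage 2: Constrained Decoding Extraction

Prompt: "What is the most likely species of bird indicated in this response?
Response: [nlg]
Answer from the following: [choice_list]"

Constrained decoding ensures that the output must come from the predefined list of classes.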
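As a rough illustration of the constrained decoding step, the sketch below restricts greedy decoding to token paths that spell out one of the listed choices. It is a toy under stated assumptions, not the paper's implementation: the tokenization of each class name, the `END` marker, and the `toy_score` function (which stands in for the model's next-token log-probability) are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence

Trie = Dict[str, "Trie"]
END = "<end>"  # hypothetical marker for "the choice is complete"


def build_trie(choices: Sequence[Sequence[str]]) -> Trie:
    """Build a prefix tree over the tokenized choices."""
    root: Trie = {}
    for tokens in choices:
        node = root
        for tok in list(tokens) + [END]:
            node = node.setdefault(tok, {})
    return root


def constrained_greedy_decode(
    trie: Trie, score: Callable[[List[str], str], float]
) -> List[str]:
    """Greedy decoding in which only tokens that stay inside the trie are allowed."""
    prefix: List[str] = []
    node = trie
    while True:
        # Candidates are limited to the children of the current trie node.
        best = max(node, key=lambda tok: score(prefix, tok))
        if best == END:
            return prefix
        prefix.append(best)
        node = node[best]


def toy_score(response: str) -> Callable[[List[str], str], float]:
    """Assumed stand-in for model log-probabilities: reward tokens found in the response."""
    def score(prefix: List[str], token: str) -> float:
        if token == END:
            return 0.0
        return 1.0 if token.strip() in response else -1.0
    return score


# Hypothetical tokenizations of three class names.
choices = [["Ivory", " Gull"], ["Herring", " Gull"], ["Baltimore", " Oriole"]]
answer = constrained_greedy_decode(
    build_trie(choices), toy_score("This bird is an Ivory Gull.")
)
print("".join(answer))  # prints "Ivory Gull"
```

With a real model, `score` would be replaced by the log-probability of each candidate token given the Stage 2 prompt and the tokens decoded so far; the trie masking logic stays the same.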

Simulating User Variation

To test robustness, o3-high is used to generate 15 semantically equivalent prompt variants:

Base template: "What is the species of bird in this image?"
Concise template: "What is the species of bird in this image? Answer only with species name."
Constrained template: "What is the species of bird in this image? Answer only from the following list..."
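The three template styles can be assembled from a single question string, as in the minimal sketch below. The function name `build_prompts` is hypothetical, and the comma-separated formatting of the choice list is an assumption, since the source elides how the list is written out.

```python
from typing import Dict, Sequence

BASE_QUESTION = "What is the species of bird in this image?"


def build_prompts(
    choices: Sequence[str], question: str = BASE_QUESTION
) -> Dict[str, str]:
    """Return the base, concise, and constrained versions of one question."""
    return {
        "base": question,
        "concise": f"{question} Answer only with species name.",
        # Assumption: choices are joined with ", "; the source does not show the format.
        "constrained": f"{question} Answer only from the following list: "
        + ", ".join(choices),
    }


prompts = build_prompts(["Ivory Gull", "Baltimore Oriole"])
for style, text in prompts.items():
    print(f"{style}: {text}")
```

Each semantically equivalent rewording of the question would be passed through the same function, so every variant is evaluated under all three styles.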

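Retrieval Optimization: Early Stopping

In retrieval scenarios, efficiency is improved by truncating the probability computation:

Take the class name "Baltimore Oriole", tokenized as "B", "altimore", " Ori", "ole". Once the prefix ending in "altimore" is unique across all class names, the probabilities of the remaining tokens are no longer computed: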

p_full("Baltimore Oriole") = p("B") × p("altimore"|"B") × p(" Ori"|"Baltimore") × p("ole"|"Baltimore Ori")
p_trunc("Baltimore Oriole") = p("B") × p("altimore"|"B")
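The truncation point can be found by counting how many classes share each token prefix. The sketch below is one way to do that under stated assumptions: the tokenizations of "Bald Eagle" and "Ivory Gull" and the conditional probabilities are made up for illustration, and the helper names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Sequence


def unique_prefix_lengths(choices: Dict[str, Sequence[str]]) -> Dict[str, int]:
    """For each class, the fewest leading tokens that no other class shares."""
    counts: Counter = Counter()
    for tokens in choices.values():
        for k in range(1, len(tokens) + 1):
            counts[tuple(tokens[:k])] += 1
    lengths: Dict[str, int] = {}
    for name, tokens in choices.items():
        lengths[name] = len(tokens)  # fall back to the full name if no prefix is unique
        for k in range(1, len(tokens) + 1):
            if counts[tuple(tokens[:k])] == 1:
                lengths[name] = k
                break
    return lengths


def truncated_prob(cond_probs: Sequence[float], k: int) -> float:
    """Product of the first k conditional token probabilities (p_trunc)."""
    return math.prod(cond_probs[:k])


# Hypothetical tokenizations; only "Baltimore Oriole" follows the example above.
choices = {
    "Baltimore Oriole": ["B", "altimore", " Ori", "ole"],
    "Bald Eagle": ["B", "ald", " Eagle"],
    "Ivory Gull": ["I", "vory", " Gull"],
}
lengths = unique_prefix_lengths(choices)
print(lengths["Baltimore Oriole"])  # prints 2

# Made-up conditional probabilities for the four tokens of "Baltimore Oriole".
cond_probs = [0.5, 0.8, 0.9, 0.95]
print(truncated_prob(cond_probs, lengths["Baltimore Oriole"]))
```

Because "B" is shared with "Bald Eagle" but "B" followed by "altimore" is not, only two of the four token probabilities need to be evaluated for this class.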