You May Speak Freely: Improving the Fine-Grained Visual Recognition Capabilities of Multimodal Large Language Models with Answer Extraction
Lawrence, Saha, Wei et al.
Despite the renewed interest in zero-shot visual classification due to the rise of Multimodal Large Language Models (MLLMs), the problem of evaluating free-form responses of auto-regressive models remains a persistent challenge. Most existing works focus on language-only tasks or don't consider Multiple Choice Questions (MCQs) beyond 5-way options, both of which are critical capabilities to solve tasks in Fine-Grained Visual Classification (FGVC) where choice counts are in the hundreds to thousands and the choices are highly related. Furthermore, in this highly multi-way MCQ setting it is not clear how to extend LLM choice extraction to retrieval-based problems, where computing probabilities over the choice set is computationally costly. In this work we investigate nlg2choice, a simple two-stage method which first asks the MLLM an open-ended question for the task with minimal constraints, then uses text-only constrained decoding to predict the most likely choice. In retrieval settings, we compute the probability of the constrained response taking that choice with an early stopping method to significantly improve throughput. Our results show improvement over a suite of seven fine-grained visual datasets when evaluating in terms of classification and retrieval, and show that this performance holds over the various ways that users of LLMs can implement tasks in natural language.
Despite renewed interest in zero-shot visual classification driven by the rise of multimodal large language models (MLLMs), evaluating free-form responses from autoregressive models remains a persistent challenge. Existing work mostly focuses on language-only tasks or does not consider multiple-choice questions (MCQs) with more than five options. Both capabilities are critical for fine-grained visual classification (FGVC), where the number of options reaches hundreds to thousands and the options are closely related. Furthermore, in this highly multi-way MCQ setting, it remains unclear how to extend LLM choice extraction to retrieval-based problems, because computing probabilities over the whole choice set is computationally expensive. This paper investigates nlg2choice, a simple two-stage approach that first asks the MLLM an open-ended question with minimal constraints, then uses text-only constrained decoding to predict the most likely choice. In the retrieval setting, the probability that the constrained response takes a given choice is computed with an early stopping method, which significantly improves throughput.
Challenges in Fine-Grained Visual Classification: Traditional multiple-choice methods perform poorly when faced with hundreds to thousands of highly similar options. In bird species identification, for example, LLaVA-1.5 achieves near-perfect accuracy on coarse-grained classification (e.g., "bird" vs. "non-bird") but only 1-2% accuracy on fine-grained species labels.
Limitations of Evaluation Methods: Existing approaches either enforce constrained output formats (which may hinder reasoning) or allow free-form responses (from which answers are difficult to extract), and effective answer extraction mechanisms are lacking.
Computational Efficiency Issues: In retrieval scenarios, computing probabilities for hundreds to thousands of choices incurs prohibitive computational costs.
Proposes nlg2choice Method: A simple yet effective two-stage answer extraction approach that significantly improves classification and retrieval performance across seven fine-grained visual datasets.
Validates Robustness: Through generation of semantically equivalent prompt variants, demonstrates the method's robustness to user input variations with statistically significant performance improvements.
Proposes Early Stopping Optimization: Introduces an early stopping approach in retrieval settings, achieving 15-fold throughput improvement (up to 1362% improvement on certain datasets).
Systematic Analysis: Demonstrates that constrained decoding is a reliable answer extractor without requiring additional training, with the primary bottleneck being the lack of extractable content in free-form responses rather than answer extraction capability.
Given an image and a fine-grained visual classification task, the objective is to accurately identify image content from a large set of highly similar categories (hundreds to thousands), such as bird species, flower varieties, car models, etc.
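The two-stage idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the free-form response is a hard-coded string standing in for the MLLM's stage-one output, `make_toy_scorer` is a hypothetical stand-in for the log-probabilities a text-only LLM would assign, and greedy decoding over a token trie is only one common way to realize constrained decoding. Apart from "Baltimore Oriole", the tokenizations are invented for illustration.

```python
END = "<end>"  # sentinel marking the end of a complete choice


def build_trie(choices):
    """Build a token trie; each complete path ends in END -> choice name."""
    root = {}
    for name, tokens in choices.items():
        node = root
        for tok in tokens:
            node = node.setdefault(tok, {})
        node[END] = name
    return root


def constrained_decode(trie, score_next):
    """Greedily follow the best-scoring token that keeps us inside the trie,
    so the decoded output is guaranteed to be one of the choices."""
    node, prefix = trie, []
    while True:
        best = max(node, key=lambda tok: score_next(prefix, tok))
        if best == END:
            return node[END]
        prefix.append(best)
        node = node[best]


def make_toy_scorer(response):
    """Hypothetical scorer: favors continuations that literally appear in the
    free-form response. A real system would use LLM log-probabilities."""
    def score_next(prefix, tok):
        text = "".join(prefix) + ("" if tok == END else tok)
        return 0.0 if text in response else -5.0
    return score_next


# Assumed tokenizations of three class names (illustrative only).
choices = {
    "Baltimore Oriole": ["B", "altimore", " Ori", "ole"],
    "Bobolink": ["B", "ob", "olink"],
    "Orchard Oriole": ["Orch", "ard", " Ori", "ole"],
}
trie = build_trie(choices)

# Stage 1 (stand-in): the MLLM answers the open-ended question freely.
response = ("The bird has an orange belly and a black head, "
            "so it is most likely a Baltimore Oriole.")

# Stage 2: text-only constrained decoding extracts one choice.
prediction = constrained_decode(trie, make_toy_scorer(response))
print(prediction)
```

Because every decoding step is restricted to tokens present in the trie, the extractor cannot return a label outside the choice set, whatever the free-form response says.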
In retrieval scenarios, efficiency is improved through truncated probability computation.
For example, the category name "Baltimore Oriole" is tokenized as "B", "altimore", " Ori", "ole". Once the prefix ending in "altimore" is unique across all categories, probability computation for the remaining tokens is halted.
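As a rough sketch of this early stopping rule, the snippet below finds, for each category, the shortest token prefix that no other category shares. Only that many tokens would need to be scored. The function name `early_stop_lengths` and every tokenization except the one for "Baltimore Oriole" are assumptions made for illustration, not details taken from the paper.

```python
from collections import Counter


def early_stop_lengths(tokenized_names):
    """Return, for each token sequence, the length of its shortest prefix
    that is not shared with any other sequence (or its full length if no
    such prefix exists)."""
    prefix_counts = Counter()
    for tokens in tokenized_names:
        for k in range(1, len(tokens) + 1):
            prefix_counts[tuple(tokens[:k])] += 1

    lengths = []
    for tokens in tokenized_names:
        stop = len(tokens)  # fall back to scoring every token
        for k in range(1, len(tokens) + 1):
            if prefix_counts[tuple(tokens[:k])] == 1:
                stop = k  # prefix is unique: later tokens can be skipped
                break
        lengths.append(stop)
    return lengths


# Assumed tokenizations (illustrative only).
names = {
    "Baltimore Oriole": ["B", "altimore", " Ori", "ole"],
    "Bobolink": ["B", "ob", "olink"],
    "Orchard Oriole": ["Orch", "ard", " Ori", "ole"],
}
stops = dict(zip(names, early_stop_lengths(list(names.values()))))
print(stops)
```

In this toy label set, "Baltimore Oriole" needs only its first two tokens, because "B" is shared with "Bobolink" while "B" followed by "altimore" is unique. The saving grows with the number of multi-token class names whose prefixes diverge early.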
This work provides a practical solution for fine-grained visual classification, particularly valuable in real-world applications requiring classification among large numbers of similar categories. The method's simplicity and lack of requirement for additional training make it easy to adopt and deploy.
The paper cites 47 relevant references, covering important works in multimodal large language models, constrained decoding, answer extraction, and related key areas, providing a solid theoretical foundation for the research.