You May Speak Freely: Improving the Fine-Grained Visual Recognition Capabilities of Multimodal Large Language Models with Answer Extraction

Lawrence, Saha, Wei et al.

Despite the renewed interest in zero-shot visual classification due to the rise of Multimodal Large Language Models (MLLMs), the problem of evaluating free-form responses of auto-regressive models remains a persistent challenge. Most existing works focus on language-only tasks or don't consider Multiple Choice Questions (MCQs) beyond 5-way options, both of which are critical capabilities to solve tasks in Fine-Grained Visual Classification (FGVC) where choice counts are in the hundreds to thousands and the choices are highly related. Furthermore, in this highly multi-way MCQ setting it is not clear how to extend LLM choice extraction to retrieval-based problems, where computing probabilities over the choice set is computationally costly. In this work we investigate nlg2choice, a simple two-stage method which first asks the MLLM an open-ended question for the task with minimal constraints, then uses text-only constrained decoding to predict the most likely choice. In retrieval settings, we compute the probability of the constrained response taking that choice with an early stopping method to significantly improve throughput. Our results show improvement over a suite of seven fine-grained visual datasets when evaluating in terms of classification and retrieval, and show that this performance holds over the various ways that users of LLMs can implement tasks in natural language.

Basic Information

Paper ID: 2510.14885
Title: You May Speak Freely: Improving the Fine-Grained Visual Recognition Capabilities of Multimodal Large Language Models with Answer Extraction
Authors: Logan Lawrence¹, Oindrila Saha¹, Megan Wei², Chen Sun², Subhransu Maji¹, Grant Van Horn¹
Affiliations: ¹University of Massachusetts, Amherst; ²Brown University
Categories: cs.CV (Computer Vision), cs.CL (Computation and Language)
Publication Date: October 16, 2025
Paper Link: https://arxiv.org/abs/2510.14885

Abstract

Despite renewed interest in zero-shot visual classification driven by the rise of multimodal large language models (MLLMs), evaluating the free-form responses of autoregressive models remains a persistent challenge. Existing work primarily focuses on language-only tasks or does not consider multiple-choice questions (MCQs) with more than five options, both of which are critical for fine-grained visual classification (FGVC), where the number of choices reaches hundreds to thousands and the choices are highly related. Furthermore, in such highly multi-way MCQ settings, it remains unclear how to extend LLM choice extraction to retrieval-based problems, as computing probabilities over the choice set is computationally expensive. This paper investigates nlg2choice, a simple two-stage approach that first poses an open-ended question to the MLLM with minimal constraints, then uses text-only constrained decoding to predict the most likely choice. In the retrieval setting, the probability of the constrained response taking a given choice is computed with an early stopping method, significantly improving throughput.

Research Background and Motivation

Core Problems

Challenges in Fine-Grained Visual Classification: Traditional multiple-choice methods perform poorly when faced with hundreds to thousands of highly similar options. In bird species identification, for example, LLaVA-1.5 achieves near-perfect accuracy on coarse-grained classification (e.g., "bird" vs. "non-bird") but only 1-2% accuracy on fine-grained species labels.
Limitations of Evaluation Methods: Existing approaches either enforce a constrained output format (which may hinder reasoning) or allow free-form responses (from which answers are difficult to extract), and they lack an effective answer extraction mechanism.
Computational Efficiency Issues: In retrieval scenarios, computing probabilities for hundreds to thousands of choices incurs prohibitive computational costs.

Research Motivation

MLLMs demonstrate significantly lower performance on fine-grained visual recognition tasks compared to coarse-grained tasks
Existing constrained decoding methods and first-token prediction approaches fail in fine-grained settings
Lack of systematic investigation into robustness against variations in user prompts

Core Contributions

Proposes nlg2choice Method: A simple yet effective two-stage answer extraction approach that significantly improves classification and retrieval performance across seven fine-grained visual datasets.
Validates Robustness: Through generation of semantically equivalent prompt variants, demonstrates the method's robustness to user input variations with statistically significant performance improvements.
Proposes Early Stopping Optimization: Introduces an early stopping approach in retrieval settings, achieving a roughly 15-fold throughput improvement (for example, +1362% on CUB200).
Systematic Analysis: Demonstrates that constrained decoding is a reliable answer extractor without requiring additional training, with the primary bottleneck being the lack of extractable content in free-form responses rather than answer extraction capability.

Methodology Details

Task Definition

Given an image and a fine-grained visual classification task, the objective is to accurately identify image content from a large set of highly similar categories (hundreds to thousands), such as bird species, flower varieties, car models, etc.

nlg2choice Architecture

Stage One: Free-Form Generation

Input Prompt: "What is the species of bird in this image?"
Model Output: "This bird is an Ivory Gull."

Stage Two: Constrained Decoding Extraction

Prompt: "What is the most likely species of bird indicated in this response?
Response: [nlg]
Answer from the following: [choice_list]"

Constrained decoding ensures the output must come from a predefined category list.
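As a minimal sketch of the extraction step, the following Python builds a prefix trie over tokenized choices and, at each step, greedily picks the best-scoring token among those that keep the output inside the choice list. The whitespace tokenizer (`toy_tokenize`), the end marker `END`, and the word-overlap scorer (`make_toy_scorer`) are hypothetical stand-ins for a real tokenizer and the model's next-token scores; the paper's actual implementation may differ.

```python
from typing import Callable, Dict, List, Sequence

END = "<end>"  # hypothetical marker for "a complete choice ends here"

def build_trie(choices: Sequence[Sequence[str]]) -> Dict:
    """Build a nested-dict prefix trie over tokenized choices."""
    root: Dict = {}
    for tokens in choices:
        node = root
        for tok in tokens:
            node = node.setdefault(tok, {})
        node[END] = {}
    return root

def constrained_greedy_decode(
    trie: Dict,
    score_fn: Callable[[List[str], str], float],
) -> List[str]:
    """Follow the best-scoring allowed token until a choice is complete."""
    node, prefix = trie, []
    while True:
        # Only tokens that extend some valid choice are candidates.
        best = max(node, key=lambda tok: score_fn(prefix, tok))
        if best == END:
            return prefix
        prefix.append(best)
        node = node[best]

def toy_tokenize(name: str) -> List[str]:
    """Assumption: whitespace tokens instead of real subword tokens."""
    return name.split()

def make_toy_scorer(response: str) -> Callable[[List[str], str], float]:
    """Assumption: score a token by whether it appears in the free-form response."""
    words = set(response.lower().replace(".", "").split())
    def score(prefix: List[str], tok: str) -> float:
        if tok == END:
            return 0.5
        return 1.0 if tok.lower() in words else 0.0
    return score

choices = ["Ivory Gull", "Herring Gull", "Ivory-billed Woodpecker"]
trie = build_trie([toy_tokenize(c) for c in choices])
scorer = make_toy_scorer("This bird is an Ivory Gull.")
prediction = " ".join(constrained_greedy_decode(trie, scorer))
print(prediction)
```

Greedy decoding is used here only for brevity. It does not always return the choice with the highest total sequence probability, so a real system might instead score every path through the trie.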

User Variation Simulation

To test robustness, o3-high is used to generate 15 semantically equivalent prompt variants:

Base template: "What is the species of bird in this image?"
Concise template: "What is the species of bird in this image? Answer only with species name."
Constrained template: "What is the species of bird in this image? Answer only from the following list..."

Retrieval Optimization: Early Stopping Method

In retrieval scenarios, efficiency is improved through truncated probability computation:

Consider the category name "Baltimore Oriole", tokenized as "B", "altimore", " Ori", "ole". If no other category shares the prefix "B", "altimore", the prefix already identifies the category, so the probabilities of the remaining tokens need not be computed:

p_full("Baltimore Oriole") = p("B") × p("altimore"|"B") × p(" Ori"|"Baltimore") × p("ole"|"Baltimore Ori")
p_trunc("Baltimore Oriole") = p("B") × p("altimore"|"B")
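The truncation rule can be sketched in Python as follows, assuming the choices are already tokenized. Only the "Baltimore Oriole" tokenization comes from the text above; the other tokenizations and all token probabilities are made up for illustration, and the helper names (`unique_prefix_lengths`, `truncated_prob`) are hypothetical.

```python
from collections import Counter
from math import prod
from typing import List, Sequence

def unique_prefix_lengths(choices: Sequence[Sequence[str]]) -> List[int]:
    """For each tokenized choice, return how many leading tokens are needed
    before its prefix is shared with no other choice."""
    counts: Counter = Counter()
    for tokens in choices:
        for i in range(1, len(tokens) + 1):
            counts[tuple(tokens[:i])] += 1
    lengths = []
    for tokens in choices:
        needed = len(tokens)  # fall back to the full sequence
        for i in range(1, len(tokens) + 1):
            if counts[tuple(tokens[:i])] == 1:
                needed = i
                break
        lengths.append(needed)
    return lengths

def truncated_prob(token_probs: Sequence[float], length: int) -> float:
    """Multiply only the first `length` conditional token probabilities."""
    return prod(token_probs[:length])

tokenized = [
    ["B", "altimore", " Ori", "ole"],  # tokenization from the example above
    ["B", "lue", " Jay"],              # hypothetical tokenization
    ["Ivory", " Gull"],                # hypothetical tokenization
]
lengths = unique_prefix_lengths(tokenized)

# Made-up conditional probabilities for the four "Baltimore Oriole" tokens.
oriole_probs = [0.5, 0.8, 0.9, 0.95]
p_full = truncated_prob(oriole_probs, len(oriole_probs))
p_trunc = truncated_prob(oriole_probs, lengths[0])

tokens_full = sum(len(t) for t in tokenized)
tokens_trunc = sum(lengths)
print(lengths, p_full, p_trunc, tokens_full, tokens_trunc)
```

In this toy setting the truncated scheme evaluates 5 token probabilities instead of 9. Note that p_trunc is an upper bound on p_full, since the dropped factors are each at most 1.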

Experimental Setup

Datasets

Seven fine-grained visual classification datasets are tested:

CUB200: 200 bird species
Flowers102: 102 flower species
Stanford Cars: 196 car models
FGVC Aircrafts: 100 aircraft variants
Food101: 101 food categories
NABirds: 555 bird species
iNaturalist-Birds: 1486 bird species

Evaluation Metrics

Classification Task: Accuracy (averaged across 15 semantically equivalent prompts)
Retrieval Task: Mean Average Precision (mAP)
Robustness: Statistical significance testing
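For reference, mean average precision can be computed as in the sketch below, which uses the standard definition: for each class, images are ranked by their score for that class, precision is averaged over the ranks of the relevant images, and the result is averaged over classes. The summary does not spell out the paper's exact retrieval protocol, so the data layout (`scores` per class, `labels` per image) and the toy numbers are assumptions.

```python
from typing import Dict, List, Sequence

def average_precision(ranked_relevance: Sequence[bool]) -> float:
    """Average of precision@k over the ranks k at which a relevant item appears."""
    hits, total = 0, 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def mean_average_precision(
    scores: Dict[str, List[float]],
    labels: List[str],
) -> float:
    """scores[c][i] is the score of image i for class c; labels[i] is its true class."""
    aps = []
    for cls, cls_scores in scores.items():
        order = sorted(range(len(labels)), key=lambda i: cls_scores[i], reverse=True)
        aps.append(average_precision([labels[i] == cls for i in order]))
    return sum(aps) / len(aps)

# Toy data: four images, two classes, made-up scores.
labels = ["gull", "jay", "gull", "jay"]
scores = {
    "gull": [0.9, 0.8, 0.3, 0.1],
    "jay": [0.2, 0.7, 0.1, 0.6],
}
map_value = mean_average_precision(scores, labels)
print(round(map_value, 4))
```

In the retrieval setting described here, the per-class score of an image would be the probability that the constrained response selects that class.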

Compared Methods

choice: Direct constrained decoding
nlg2choice: Two-stage method (with constrained instructions)
nlg2choice_open: Two-stage method (with open-ended prompts)

Tested Models

Qwen-2.5VL-7B
Llama-3.2-Vision-11B
Intern3VL-8B

Experimental Results

Main Results

Classification Performance Improvement

Across all models and datasets, nlg2choice significantly outperforms direct constrained decoding:

Model	Average Accuracy Improvement
Qwen-2.5VL	+17.46%
Llama-3.2V	+8.49%
Intern3VL	+6.87%

Best Performance: Qwen-2.5VL achieves average accuracy of 56.91% with open-ended prompts, reaching 78.03% on the Flowers dataset.

Retrieval Performance

nlg2choice also demonstrates superior performance in retrieval tasks:

Qwen-2.5VL average mAP improvement: +8.16
Improvements observed on all datasets except Stanford Cars
Most significant improvement on Flowers dataset (+25.23 mAP)

Computational Efficiency

Early stopping method significantly improves throughput:

CUB200: +1362%
Flowers: +2042%
Average improvement of approximately 10-fold or greater
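To relate the percentage gains to fold improvements, the short sketch below converts one into the other, assuming "+X%" denotes a relative increase over the baseline throughput (so the speedup factor is 1 + X/100). Under that assumption, +1362% corresponds to about 14.6x and +2042% to about 21.4x.

```python
def percent_gain_to_fold(percent_gain: float) -> float:
    """Convert a relative gain such as +1362% into a speedup factor."""
    return 1.0 + percent_gain / 100.0

for dataset, gain in [("CUB200", 1362.0), ("Flowers", 2042.0)]:
    print(f"{dataset}: +{gain:.0f}% -> {percent_gain_to_fold(gain):.2f}x")
```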

Ablation Studies

Impact of Prompt Constraint

Experiments reveal that constrained instructions reduce performance:

Open-ended prompts > Concise instructions > Explicit choice enumeration
Qwen-2.5VL with open-ended prompts outperforms constrained prompts by +62.44% (CUB200)

Chain-of-Thought (CoT) Effects

Forcing CoT reasoning does not consistently improve performance:

"Let's think step by step": Average decrease of -9.75%
"First,": Average decrease of -9.48%
Slight improvement only on Intern3VL's CUB200 (+1.01%)

Misclassification Quality Analysis

nlg2choice produces more reasonable errors:

Genus-level matching accuracy improvement: Qwen-2.5VL +16.75%, Llama-3.2V +23.85%
Errors more frequently occur between species of the same genus rather than completely unrelated categories

Answer Extraction Capability Verification

Through manual annotation verification:

34.64% of free-form responses contain out-of-pattern answers
70.75% of failure cases contain valid species names
Constrained decoding achieves high accuracy on extractable samples: Qwen-2.5VL 97.93%, Intern3VL 93.26%

Related Work

Forcing MLLMs to Generate Valid Choices

Early approaches: Regular expression parsing, but poor performance on fine-grained tasks
Probability ranking: Based on first-token probability of choice IDs (A/B/C/D), widely adopted but computationally expensive
Constrained decoding: Ensures output falls within choice set, but recent evaluations show performance degradation

MLLMs as Answer Extractors

Mismatch between text output and token probability metrics
Large models such as GPT-4 used for answer extraction
Specialized extraction methods like xFinder, SLOT, xVerify require additional training

Conclusions and Discussion

Main Conclusions

Answer Extraction Significantly Improves Visual Recognition: Improvements observed across all tested architectures and datasets
Method is Robust to User Variations: Performance improvements are statistically significant and independent of specific prompt formats
Constrained Decoding is a Reliable Extractor: Functions effectively without requiring additional training

Limitations

Model Scale Constraints: Testing focuses on medium-scale open-source models (7B-11B)
Computational Resource Requirements: Although it avoids specialized training, the method still requires substantial computational resources to process the free-form textual responses
Multi-label Scalability: Applicability to multi-label problems remains to be verified

Future Directions

Extension to larger-scale proprietary models
Exploration of multi-label fine-grained classification
Further optimization of computational efficiency

In-Depth Evaluation

Strengths

Simple and Effective Method: Two-stage design is intuitive, requiring no additional training data or architectural modifications
Comprehensive Experiments: Tests multiple models, datasets, and evaluation dimensions, including robustness verification
High Practical Value: Early stopping optimization addresses computational efficiency concerns in practical deployment
In-Depth Analysis: Manual annotation verification validates answer extraction effectiveness and identifies true bottlenecks

Weaknesses

Insufficient Theoretical Analysis: Lacks theoretical explanation for why the two-stage method is more effective
Limited Model Coverage: Does not test top-tier proprietary models such as GPT-4V
Narrow Task Scope: Primarily focuses on single-label classification, with insufficient coverage of multi-label and other visual tasks

Impact

This work provides a practical solution for fine-grained visual classification, particularly valuable in real-world applications that require classification among large numbers of similar categories. Because the method is simple and needs no additional training, it is easy to adopt and deploy.

Applicable Scenarios

Biological species identification systems
Product fine-grained classification platforms
Medical image fine-grained diagnosis
Any visual task requiring precise classification from numerous similar options

References

The paper cites 47 relevant references, covering important works in multimodal large language models, constrained decoding, answer extraction, and related key areas, providing a solid theoretical foundation for the research.