You May Speak Freely: Improving the Fine-Grained Visual Recognition Capabilities of Multimodal Large Language Models with Answer Extraction

Lawrence, Saha, Wei et al.
Despite the renewed interest in zero-shot visual classification due to the rise of Multimodal Large Language Models (MLLMs), the problem of evaluating free-form responses of auto-regressive models remains a persistent challenge. Most existing works focus on language-only tasks or don't consider Multiple Choice Questions (MCQs) beyond 5-way options, both of which are critical capabilities to solve tasks in Fine-Grained Visual Classification (FGVC) where choice counts are in the hundreds to thousands and the choices are highly related. Furthermore, in this highly multi-way MCQ setting it is not clear how to extend LLM choice extraction to retrieval-based problems, where computing probabilities over the choice set is computationally costly. In this work we investigate nlg2choice, a simple two-stage method which first asks the MLLM an open-ended question for the task with minimal constraints, then uses text-only constrained decoding to predict the most likely choice. In retrieval settings, we compute the probability of the constrained response taking that choice with an early stopping method to significantly improve throughput. Our results show improvement over a suite of seven fine-grained visual datasets when evaluating in terms of classification and retrieval, and show that this performance holds over the various ways that users of LLMs can implement tasks in natural language.

Basic Information

  • Paper ID: 2510.14885
  • Title: You May Speak Freely: Improving the Fine-Grained Visual Recognition Capabilities of Multimodal Large Language Models with Answer Extraction
  • Authors: Logan Lawrence¹, Oindrila Saha¹, Megan Wei², Chen Sun², Subhransu Maji¹, Grant Van Horn¹
  • Affiliations: ¹University of Massachusetts, Amherst; ²Brown University
  • Categories: cs.CV (Computer Vision), cs.CL (Computation and Language)
  • Publication Date: October 16, 2025
  • Paper Link: https://arxiv.org/abs/2510.14885

Abstract

Despite renewed interest in zero-shot visual classification driven by multimodal large language models (MLLMs), evaluating the free-form responses of autoregressive models remains a persistent challenge. Existing work mostly focuses on language-only tasks or does not consider multiple-choice questions (MCQs) with more than five options. Both capabilities are critical for fine-grained visual classification (FGVC), where the number of choices runs from hundreds to thousands and the choices are highly related. Furthermore, in this highly multi-way MCQ setting, it remains unclear how to extend LLM choice extraction to retrieval-based problems, because computing probabilities over the full choice set is computationally expensive. This paper investigates nlg2choice, a simple two-stage approach that first asks the MLLM an open-ended question with minimal constraints, then uses text-only constrained decoding to predict the most likely choice. In the retrieval setting, the probability of the constrained response taking a given choice is computed with an early stopping method, which significantly improves throughput. Results show improvements on a suite of seven fine-grained visual datasets for both classification and retrieval, and the gains hold across the various ways users can phrase the task in natural language.

Research Background and Motivation

Core Problems

  1. Challenges in Fine-Grained Visual Classification: Traditional multiple-choice methods perform poorly when faced with hundreds to thousands of highly similar options. In bird identification, for example, LLaVA-1.5 achieves near-perfect accuracy on coarse-grained classification (e.g., "bird" vs. "non-bird") but only 1-2% accuracy on fine-grained species labels.
  2. Limitations of Evaluation Methods: Existing approaches either enforce a constrained output format (which may hinder reasoning) or allow free-form responses (from which answers are hard to extract). Neither provides an effective answer extraction mechanism.
  3. Computational Efficiency Issues: In retrieval scenarios, computing probabilities for hundreds to thousands of choices incurs prohibitive computational costs.

Research Motivation

  • MLLMs demonstrate significantly lower performance on fine-grained visual recognition tasks compared to coarse-grained tasks
  • Existing constrained decoding methods and first-token prediction approaches fail in fine-grained settings
  • Lack of systematic investigation into robustness against variations in user prompts

Core Contributions

  1. Proposes nlg2choice Method: A simple yet effective two-stage answer extraction approach that significantly improves classification and retrieval performance across seven fine-grained visual datasets.
  2. Validates Robustness: Through generation of semantically equivalent prompt variants, demonstrates the method's robustness to user input variations with statistically significant performance improvements.
  3. Proposes Early Stopping Optimization: Introduces an early stopping approach for retrieval settings that improves throughput by roughly an order of magnitude (+1362% on CUB200 and +2042% on Flowers102).
  4. Systematic Analysis: Demonstrates that constrained decoding is a reliable answer extractor without requiring additional training, with the primary bottleneck being the lack of extractable content in free-form responses rather than answer extraction capability.

Methodology Details

Task Definition

Given an image and a fine-grained visual classification task, the objective is to accurately identify image content from a large set of highly similar categories (hundreds to thousands), such as bird species, flower varieties, car models, etc.

nlg2choice Architecture

Stage One: Free-Form Generation

Input Prompt: "What is the species of bird in this image?"
Model Output: "This bird is an Ivory Gull."

Stage Two: Constrained Decoding Extraction

Prompt: "What is the most likely species of bird indicated in this response?
Response: [nlg]
Answer from the following: [choice_list]"

Constrained decoding ensures the output must come from a predefined category list.
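
The two stages can be illustrated with a minimal sketch. It is not the paper's implementation: the whitespace tokenizer, the `END` marker, and the word-overlap scorer are hypothetical stand-ins for a real tokenizer and for the text-only LLM's next-token scores. It shows only the control flow, where at every step the decoder may pick only tokens that keep the output inside the choice list.

```python
from typing import Callable, Dict, List, Sequence

END = "<end>"  # hypothetical marker meaning "the choice is complete"

# Stage-two prompt as quoted in the summary above.
STAGE_TWO_TEMPLATE = (
    "What is the most likely species of bird indicated in this response?\n"
    "Response: {nlg}\n"
    "Answer from the following: {choice_list}"
)

# (prompt, tokens decoded so far, allowed next tokens) -> score per allowed token
ScoreFn = Callable[[str, List[str], List[str]], Dict[str, float]]


def build_stage_two_prompt(nlg: str, choices: Sequence[str]) -> str:
    return STAGE_TWO_TEMPLATE.format(nlg=nlg, choice_list=", ".join(choices))


def toy_tokenize(name: str) -> List[str]:
    # Assumption: one token per word; a real model would use its own tokenizer.
    words = name.split()
    return [w if i == 0 else " " + w for i, w in enumerate(words)]


def build_trie(choices: Sequence[str]) -> dict:
    # Prefix tree over the tokenized choices; END maps to the full choice name.
    root: dict = {}
    for name in choices:
        node = root
        for tok in toy_tokenize(name):
            node = node.setdefault(tok, {})
        node[END] = name
    return root


def constrained_greedy_decode(prompt: str, trie: dict, score_fn: ScoreFn) -> str:
    # Greedy decoding restricted to tokens that continue some valid choice.
    node, prefix = trie, []
    while True:
        allowed = list(node)
        scores = score_fn(prompt, prefix, allowed)
        best = max(allowed, key=lambda t: scores[t])
        if best == END:
            return node[END]
        prefix.append(best)
        node = node[best]


def make_toy_scorer(nlg: str) -> ScoreFn:
    # Stand-in for LLM logits: favor tokens whose text occurs in the response.
    text = nlg.lower()

    def score(prompt: str, prefix: List[str], allowed: List[str]) -> Dict[str, float]:
        return {t: 0.5 if t == END else float(t.strip().lower() in text) for t in allowed}

    return score


choices = ["Ivory Gull", "Herring Gull", "Baltimore Oriole"]
nlg = "This bird is an Ivory Gull."
prompt = build_stage_two_prompt(nlg, choices)
prediction = constrained_greedy_decode(prompt, build_trie(choices), make_toy_scorer(nlg))
print(prediction)  # Ivory Gull
```

In practice the scorer would be replaced by the language model's log-probabilities over its vocabulary, masked to the allowed tokens at each step.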

User Variation Simulation

To test robustness, o3-high is used to generate 15 semantically equivalent prompt variants. The prompts follow three templates:

  • Base template: "What is the species of bird in this image?"
  • Concise template: "What is the species of bird in this image? Answer only with species name."
  • Constrained template: "What is the species of bird in this image? Answer only from the following list..."

Retrieval Optimization: Early Stopping Method

In retrieval scenarios, efficiency is improved through truncated probability computation:

Take the category name "Baltimore Oriole", which tokenizes as "B", "altimore", " Ori", "ole". Once the prefix "B", "altimore" is unique across all category names, the probabilities of the remaining tokens need not be computed:

p_full("Baltimore Oriole") = p("B") × p("altimore"|"B") × p(" Ori"|"Baltimore") × p("ole"|"Baltimore Ori")
p_trunc("Baltimore Oriole") = p("B") × p("altimore"|"B")
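
The truncation point for each category can be precomputed from the tokenized names alone. The sketch below is one way to do that, not the paper's code: the function names are hypothetical, and apart from "Baltimore Oriole" the tokenizations are invented for illustration. It finds, for every category, the shortest token prefix shared with no other category, then sums log-probabilities only up to that point.

```python
from collections import Counter
from typing import Dict, List


def truncation_lengths(tokenized: Dict[str, List[str]]) -> Dict[str, int]:
    # Count how many category names share each token prefix.
    prefix_counts: Counter = Counter()
    for tokens in tokenized.values():
        for k in range(1, len(tokens) + 1):
            prefix_counts[tuple(tokens[:k])] += 1

    lengths: Dict[str, int] = {}
    for name, tokens in tokenized.items():
        # Fall back to the full length if no prefix is unique.
        lengths[name] = len(tokens)
        for k in range(1, len(tokens) + 1):
            if prefix_counts[tuple(tokens[:k])] == 1:
                lengths[name] = k
                break
    return lengths


def truncated_log_prob(token_log_probs: List[float], k: int) -> float:
    # log p_trunc = sum of the first k conditional log-probabilities.
    return sum(token_log_probs[:k])


# Only the "Baltimore Oriole" split comes from the text; the others are made up.
tokenized = {
    "Baltimore Oriole": ["B", "altimore", " Ori", "ole"],
    "Bullock's Oriole": ["B", "ull", "ock", "'s", " Ori", "ole"],
    "Ivory Gull": ["I", "vory", " Gull"],
}
lengths = truncation_lengths(tokenized)
print(lengths)
```

With these toy names, "Baltimore Oriole" needs two of its four tokens and "Ivory Gull" needs only one, so fewer forward steps are required per category.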

Experimental Setup

Datasets

Seven fine-grained visual classification datasets are tested:

  • CUB200: 200 bird species
  • Flowers102: 102 flower species
  • Stanford Cars: 196 car models
  • FGVC Aircrafts: 100 aircraft variants
  • Food101: 101 food categories
  • NABirds: 555 bird species
  • iNaturalist-Birds: 1486 bird species

Evaluation Metrics

  • Classification Task: Accuracy (averaged across 15 semantically equivalent prompts)
  • Retrieval Task: Mean Average Precision (mAP)
  • Robustness: Statistical significance testing
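
For reference, mean Average Precision can be computed as sketched below. This summary does not spell out the paper's exact retrieval protocol, so the sketch assumes the common setup in which each category acts as a query, all images are ranked by the model's score for that category, and AP is averaged over categories. Function names and the toy scores are illustrative.

```python
from typing import List, Sequence


def average_precision(scores: Sequence[float], relevant: Sequence[bool]) -> float:
    # Rank items by descending score and average precision at each relevant hit.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if relevant[i]:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(
    score_matrix: Sequence[Sequence[float]],  # score_matrix[image][class]
    labels: Sequence[str],                    # ground-truth class per image
    classes: Sequence[str],
) -> float:
    aps: List[float] = []
    for c, cls in enumerate(classes):
        relevant = [lab == cls for lab in labels]
        if any(relevant):  # skip classes with no positive image
            aps.append(average_precision([row[c] for row in score_matrix], relevant))
    return sum(aps) / len(aps)


classes = ["a", "b"]
labels = ["a", "b", "a"]
score_matrix = [[0.9, 0.1], [0.8, 0.7], [0.1, 0.2]]
m_ap = mean_average_precision(score_matrix, labels, classes)
print(round(m_ap, 4))  # 0.9167
```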

Baseline Methods

  • choice: Direct constrained decoding
  • nlg2choice: Two-stage method (with constrained instructions)
  • nlg2choice_open: Two-stage method (with open-ended prompts)

Tested Models

  • Qwen-2.5VL-7B
  • Llama-3.2-Vision-11B
  • Intern3VL-8B

Experimental Results

Main Results

Classification Performance Improvement

Across all models and datasets, nlg2choice significantly outperforms direct constrained decoding:

Model          Average Accuracy Improvement
Qwen-2.5VL     +17.46%
Llama-3.2V     +8.49%
Intern3VL      +6.87%

Best Performance: Qwen-2.5VL achieves average accuracy of 56.91% with open-ended prompts, reaching 78.03% on the Flowers dataset.

Retrieval Performance

nlg2choice also demonstrates superior performance in retrieval tasks:

  • Qwen-2.5VL average mAP improvement: +8.16
  • Improvements observed on all datasets except Stanford Cars
  • Most significant improvement on Flowers dataset (+25.23 mAP)

Computational Efficiency

The early stopping method significantly improves throughput:

  • CUB200: +1362%
  • Flowers: +2042%
  • Average improvement of approximately 10-fold or greater
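
To relate the percentage figures to "fold" figures, the small sketch below converts one to the other. It assumes each percentage is a relative gain over baseline throughput, so a gain of p% corresponds to a speedup factor of 1 + p/100; the helper name is hypothetical.

```python
def percent_gain_to_speedup(percent_gain: float) -> float:
    # A +p% throughput gain means new = old * (1 + p / 100).
    return 1.0 + percent_gain / 100.0


for dataset, gain in [("CUB200", 1362.0), ("Flowers", 2042.0)]:
    print(f"{dataset}: +{gain:.0f}% is about {percent_gain_to_speedup(gain):.2f}x")
```

Under that assumption, +1362% is about 14.6x and +2042% is about 21.4x.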

Ablation Studies

Impact of Prompt Constraint

Experiments reveal that constrained instructions reduce performance:

  • Open-ended prompts > Concise instructions > Explicit choice enumeration
  • Qwen-2.5VL with open-ended prompts outperforms constrained prompts by +62.44% (CUB200)

Chain-of-Thought (CoT) Effects

Forcing CoT reasoning does not consistently improve performance:

  • "Let's think step by step": Average change of -9.75%
  • "First,": Average change of -9.48%
  • Slight improvement only on Intern3VL's CUB200 (+1.01%)

Misclassification Quality Analysis

nlg2choice produces more reasonable errors:

  • Genus-level matching accuracy improvement: Qwen-2.5VL +16.75%, Llama-3.2V +23.85%
  • Errors more frequently occur between species of the same genus rather than completely unrelated categories
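
A genus-level check of this kind can be sketched as follows. The summary does not give the exact metric definition, so this assumes one plausible reading: among misclassified images, the fraction whose predicted species shares a genus with the true species. The function name and the tiny species-to-genus table are illustrative only.

```python
from typing import Dict, Sequence


def genus_match_rate_on_errors(
    preds: Sequence[str], labels: Sequence[str], genus_of: Dict[str, str]
) -> float:
    # Keep only the misclassified pairs, then compare their genera.
    errors = [(p, y) for p, y in zip(preds, labels) if p != y]
    if not errors:
        return 0.0
    same_genus = sum(genus_of[p] == genus_of[y] for p, y in errors)
    return same_genus / len(errors)


# Illustrative lookup from common name to genus.
genus_of = {
    "Herring Gull": "Larus",
    "Ring-billed Gull": "Larus",
    "Ivory Gull": "Pagophila",
    "Baltimore Oriole": "Icterus",
}
labels = ["Herring Gull", "Herring Gull", "Ivory Gull", "Baltimore Oriole"]
preds = ["Ring-billed Gull", "Ivory Gull", "Ivory Gull", "Herring Gull"]
rate = genus_match_rate_on_errors(preds, labels, genus_of)
print(round(rate, 3))  # 0.333
```

Of the three errors in this toy example, only the Ring-billed Gull prediction for a Herring Gull stays within the true genus.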

Answer Extraction Capability Verification

Through manual annotation verification:

  • 34.64% of free-form responses contain out-of-pattern answers
  • 70.75% of failure cases contain valid species names
  • Constrained decoding achieves high accuracy on extractable samples: Qwen-2.5VL 97.93%, Intern3VL 93.26%

Related Work

Forcing MLLMs to Generate Valid Choices

  • Early approaches: Regular expression parsing, but poor performance on fine-grained tasks
  • Probability ranking: Based on first-token probability of choice IDs (A/B/C/D), widely adopted but computationally expensive
  • Constrained decoding: Ensures output falls within choice set, but recent evaluations show performance degradation

MLLMs as Answer Extractors

  • Mismatch between text output and token probability metrics
  • Large models such as GPT-4 used for answer extraction
  • Specialized extraction methods like xFinder, SLOT, xVerify require additional training

Conclusions and Discussion

Main Conclusions

  1. Answer Extraction Significantly Improves Visual Recognition: Improvements observed across all tested architectures and datasets
  2. Method is Robust to User Variations: Performance improvements are statistically significant and independent of specific prompt formats
  3. Constrained Decoding is a Reliable Extractor: Functions effectively without requiring additional training

Limitations

  1. Model Scale Constraints: Testing focuses on medium-scale models (7B-11B) and uses only open-source models
  2. Computational Resource Requirements: Despite avoiding specialized training, still requires substantial computational resources for processing textual descriptions
  3. Multi-label Scalability: Applicability to multi-label problems remains to be verified

Future Directions

  • Extension to larger-scale proprietary models
  • Exploration of multi-label fine-grained classification
  • Further optimization of computational efficiency

In-Depth Evaluation

Strengths

  1. Simple and Effective Method: Two-stage design is intuitive, requiring no additional training data or architectural modifications
  2. Comprehensive Experiments: Tests multiple models, datasets, and evaluation dimensions, including robustness verification
  3. High Practical Value: Early stopping optimization addresses computational efficiency concerns in practical deployment
  4. In-Depth Analysis: Manual annotation verification validates answer extraction effectiveness and identifies true bottlenecks

Weaknesses

  1. Insufficient Theoretical Analysis: Lacks theoretical explanation for why the two-stage method is more effective
  2. Limited Model Coverage: Does not test top-tier proprietary models such as GPT-4V
  3. Narrow Task Scope: Primarily focuses on single-label classification, with insufficient coverage of multi-label and other visual tasks

Impact

This work provides a practical solution for fine-grained visual classification, particularly valuable in real-world applications requiring classification among large numbers of similar categories. The method's simplicity and lack of requirement for additional training make it easy to adopt and deploy.

Applicable Scenarios

  • Biological species identification systems
  • Product fine-grained classification platforms
  • Medical image fine-grained diagnosis
  • Any visual task requiring precise classification from numerous similar options

References

The paper cites 47 relevant references, covering important works in multimodal large language models, constrained decoding, answer extraction, and related key areas, providing a solid theoretical foundation for the research.