2025-11-16T08:22:11.899344

Generation Space Size: Understanding and Calibrating Open-Endedness of LLM Generations

Yu, Jabbar, Hawkins et al.
Different open-ended generation tasks require different degrees of output diversity. However, current LLMs are often miscalibrated. They collapse to overly homogeneous outputs for creative tasks and hallucinate diverse but incorrect responses for factual tasks. We argue that these two failure modes are unified by, and can both be addressed by, the notion of effective generation space size (GSS) -- the set of semantically distinct outputs a model considers for a prompt. We present GSSBench, a task suite of prompt pairs with ground-truth GSS relationships to assess different metrics and understand where models diverge from desired behavior. We find that hallucination detection metrics, particularly EigenScore, consistently outperform standard diversity and uncertainty quantification metrics, while using only model internals, providing interpretable insights into a model's internal task representations. We demonstrate three applications of GSS: (1) detecting prompt ambiguity and predicting clarification questions for better grounding, (2) interpreting overthinking and underthinking in reasoning models, and (3) steering models to expand their generation space to yield high-quality and diverse outputs.

Basic Information

  • Paper ID: 2510.12699
  • Title: Generation Space Size: Understanding and Calibrating Open-Endedness of LLM Generations
  • Authors: Sunny Yu, Ahmad Jabbar, Robert D. Hawkins, Dan Jurafsky, Myra Cheng (Stanford University)
  • Classification: cs.CL, cs.AI
  • Publication Status: Under Review
  • Paper Link: https://arxiv.org/abs/2510.12699

Abstract

Different open-ended generation tasks require varying degrees of output diversity. However, current large language models (LLMs) are often poorly calibrated: they produce overly homogeneous outputs in creative tasks and diverse but incorrect (hallucinated) responses in factual tasks. This paper proposes that both failure modes can be unified and addressed through the concept of "Generation Space Size" (GSS), the set of semantically distinct outputs that a model considers for a given prompt. The authors introduce GSSBench, a task suite of prompt pairs with ground-truth GSS relationships, designed to assess different metrics and understand where models deviate from desired behavior. The study finds that hallucination detection metrics (particularly EigenScore) consistently outperform standard diversity and uncertainty quantification metrics while using only model internals, providing interpretable insights into a model's internal task representations. It also demonstrates three applications of GSS: detecting prompt ambiguity and predicting clarification questions, interpreting overthinking and underthinking in reasoning models, and steering models to expand their generation space.

Research Background and Motivation

Core Problem

Current LLMs exhibit two primary generation failure modes:

  1. Output Homogeneity in Creative Tasks: In tasks requiring diversity (e.g., brainstorming, creative writing), models produce overly similar outputs
  2. Hallucination in Factual Tasks: In tasks requiring accuracy (e.g., question answering), models generate diverse but incorrect answers

Research Motivation

Traditional approaches address these two problems separately: either maximizing diversity signals or constraining diversity to improve factual accuracy. This paper proposes a unified perspective, arguing that both problems stem from miscalibration of Generation Space Size (GSS).

Limitations of Existing Approaches

  • Lack of a unified theoretical framework to understand different types of generation failures
  • Most existing diversity metrics are post-hoc and cannot directly access model internal representations
  • Absence of systematic evaluation frameworks to quantify model GSS calibration capability

Core Contributions

  1. Theoretical Contribution: Proposes Generation Space Size (GSS) as a unified framework, viewing output homogeneity and hallucination issues as two aspects of GSS calibration errors
  2. Evaluation Framework: Constructs GSSBench, an evaluation suite containing 9,300 prompt pairs for measuring GSS and its calibration errors
  3. Methodological Findings: Demonstrates that hallucination detection metrics like EigenScore outperform traditional diversity and uncertainty quantification metrics in GSS estimation
  4. Practical Applications: Showcases the value of GSS in three important applications: prompt ambiguity detection, reasoning model analysis, and diversity optimization

Methodology Details

Task Definition

For each prompt p, there exists a true generation space G_t(p): the semantic distribution of all possible correct outputs. Model m also has a generation space G_m(p): the output space the model "considers" for a given prompt. The GSS calibration error ε_m(p) is defined implicitly by:

|G_m(p)| = |G_t(p)| + ε_m(p)

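where ε_m(p) is the signed gap between the model's GSS and the true GSS. A positive ε_m(p) corresponds to an overly large generation space (as in hallucination on factual tasks), and a negative ε_m(p) to an overly small one (as in homogeneous outputs on creative tasks).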

GSSBench Evaluation Framework

Dataset Construction

Six datasets are constructed based on set-theoretic operations, totaling 9,300 prompt pairs:

  1. Complement: Base prompt vs. complement prompt (e.g., "Write a poem about the moon" vs. "Write anything that is not about the moon")
  2. FactualQA: Specific questions vs. general questions (e.g., "Rivers in Brazil" vs. "Rivers")
  3. Random Choice: Multiple-choice questions with varying numbers of options
  4. Subset: Subset relationships created by adding constraints
  5. Union: Expanded generation space through "or" connections
  6. Intersection: Reduced generation space through "and" connections
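The set-theoretic constructions above can be sketched as small pair builders. This is only an illustration: the prompt templates, the `PromptPair` type, and the function names are hypothetical and are not taken from GSSBench itself. Each builder returns a pair whose ordering of generation space sizes is known by construction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptPair:
    """Two prompts with a known ordering of their generation space sizes."""
    larger: str    # prompt expected to have the larger generation space
    smaller: str   # prompt expected to have the smaller generation space
    relation: str  # which construction produced the pair


def union_pair(topic_a: str, topic_b: str) -> PromptPair:
    # "A or B" admits every output that "A" admits, plus more.
    return PromptPair(
        larger=f"Write a poem about {topic_a} or {topic_b}",
        smaller=f"Write a poem about {topic_a}",
        relation="union",
    )


def intersection_pair(topic_a: str, topic_b: str) -> PromptPair:
    # "A and B" must satisfy both constraints, so the space shrinks.
    return PromptPair(
        larger=f"Write a poem about {topic_a}",
        smaller=f"Write a poem about {topic_a} and {topic_b}",
        relation="intersection",
    )


def subset_pair(base: str, constraint: str) -> PromptPair:
    # Adding a constraint restricts the base prompt to a subset of its outputs.
    return PromptPair(larger=base, smaller=f"{base} {constraint}", relation="subset")


pairs = [
    union_pair("the moon", "the sea"),
    intersection_pair("the moon", "the sea"),
    subset_pair("Write a poem about the moon", "in exactly four lines"),
]
for pair in pairs:
    print(f"[{pair.relation}] larger: {pair.larger!r} | smaller: {pair.smaller!r}")
```

Because the ordering follows from how the pair was built, no human judgment of diversity is needed to label it.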

Evaluation Metrics

Pairwise accuracy is used to evaluate metric f's ability to predict GSS ordering:

  • For prompt pairs (x, y) where |G_t(x)| > |G_t(y)|
  • Score is 1 if f(x) > f(y), otherwise 0
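The pairwise accuracy rule above can be written in a few lines. In this minimal sketch, `pairwise_accuracy` and `toy_metric` are illustrative names; the toy metric (word count of the prompt) is a deliberately weak stand-in and not a metric evaluated in the paper.

```python
from typing import Callable, Iterable, Tuple


def pairwise_accuracy(
    metric: Callable[[str], float],
    pairs: Iterable[Tuple[str, str]],
) -> float:
    """Fraction of pairs (x, y), ordered so |G_t(x)| > |G_t(y)|, with metric(x) > metric(y)."""
    scores = [1 if metric(x) > metric(y) else 0 for x, y in pairs]
    if not scores:
        raise ValueError("need at least one prompt pair")
    return sum(scores) / len(scores)


def toy_metric(prompt: str) -> float:
    # Assumption for illustration only: use the prompt's word count as the score.
    return float(len(prompt.split()))


toy_pairs = [
    ("Rivers", "Rivers in Brazil"),                   # general vs. specific
    ("Name a fruit or a vegetable", "Name a fruit"),  # union vs. base
]
print(pairwise_accuracy(toy_metric, toy_pairs))  # 0.5
```

The toy metric gets the union pair right but the FactualQA-style pair wrong, which gives 0.5; random guessing would also score about 0.5 on this measure.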

Candidate Metrics Analysis

Multiple metrics are evaluated as proxies for GSS:

  • Traditional Metrics: Perplexity, energy, length-normalized entropy, lexical similarity
  • Hallucination Detection Metrics: EigenScore and variants, semantic entropy
  • EigenScore Variants:
    • E_original: Original version
    • E_average: Averaged across layers and tokens
    • E_output: Using external sentence embedding models

Experimental Setup

Model Selection

Five instruction-tuned models are tested:

  • Llama-8B-Instruct
  • Mistral-7B-v0.3
  • Qwen3 series (0.6B, 4B, 8B)

Hyperparameter Settings

  • Temperature: 1.0
  • Number of samples: 10
  • Top-k: 10
  • Optimal parameters determined through ablation studies
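To make the sampling settings concrete, the sketch below applies temperature scaling and top-k truncation to a single toy next-token distribution. It is a simplified stand-in: real generation repeats this step for every decoded token, and the logits, token names, and function names here are made up for illustration.

```python
import math
import random
from typing import Dict, List, Optional


def sample_top_k(
    logits: Dict[str, float],
    temperature: float = 1.0,
    top_k: int = 10,
    rng: Optional[random.Random] = None,
) -> str:
    """Draw one token from the temperature-scaled softmax over the top_k logits."""
    rng = rng or random.Random()
    # Keep only the top_k highest-scoring tokens.
    kept = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Temperature-scaled softmax weights (max-subtracted for numerical stability).
    scaled = [score / temperature for _, score in kept]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    tokens = [tok for tok, _ in kept]
    return rng.choices(tokens, weights=weights, k=1)[0]


def draw_samples(logits: Dict[str, float], n_samples: int = 10, seed: int = 0) -> List[str]:
    """Draw n_samples tokens with temperature 1.0 and top-k 10, as in the setup above."""
    rng = random.Random(seed)
    return [sample_top_k(logits, temperature=1.0, top_k=10, rng=rng) for _ in range(n_samples)]


toy_logits = {"moon": 2.0, "sea": 1.5, "stars": 1.0, "tax": -3.0}
print(draw_samples(toy_logits))
```

Lower temperatures concentrate the weights on the highest-scoring tokens, which reduces the diversity that sample-based metrics can observe.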

Experimental Results

Main Findings

EigenScore Variants Perform Best

  • E_output and E_average achieve highest accuracy across all models
  • E_output reaches 71.7% accuracy on Llama-8B-Instruct
  • E_average reaches 72.4% accuracy on the same model
  • Both variants significantly outperform traditional metrics such as perplexity (60.0%) and lexical similarity (66.5%)

Model Calibration Analysis

  • Llama-8B-Instruct shows best calibration across most metrics
  • Qwen3-0.6B performs best on E_output and semantic entropy
  • Scale Effects: Larger models are not necessarily better calibrated; Qwen3-0.6B outperforms Qwen3-8B across all metrics

Distribution Analysis

EigenScore variants exhibit clear bimodal distributions, effectively distinguishing prompts with different GSS, while other metrics show more overlapping distributions.

Ablation Studies

Parameter Sensitivity Analysis

  • Top-k: Variations have minimal impact on performance
  • Number of Samples: Stable improvement from 0 to 20 samples, with diminishing returns beyond 20
  • Temperature: EigenScore performs best at temperature 1.0 (in contrast to the temperature of 0.5 used in hallucination detection)

EigenScore Implementation Details

  • Cross-layer averaging outperforms using single layers
  • Averaging across all tokens outperforms using only the final token

Practical Applications

1. Prompt Ambiguity Detection and Clarification Question Prediction

Experiment 1: Ambiguity Detection on RIFTS Dataset

On the RIFTS dataset with 1,740 prompts:

  • Only E_output and E_average correctly distinguish ambiguous from non-ambiguous prompts
  • E_output significantly differentiates both categories across all test models

Experiment 2: Clarification Question Prediction

  • E_output and E_average are the only metrics that significantly predict whether models will ask clarification questions across all models
  • Provides interpretable insights into when models seek clarification

2. Reasoning Model Analysis

Solution Path Count Measurement

On 1,000 logical reasoning problems:

  • Constructs single-path vs. multi-path prompt pairs
  • E_output achieves highest accuracy across all reasoning models (73% on Qwen3-4B and 8B)

Reasoning Token Length Prediction

  • GSS shows moderate to strong positive correlation with reasoning token length
  • On deductive reasoning tasks, E_original shows strongest correlation with reasoning length
  • Provides new perspective on understanding "overthinking" and "underthinking" in reasoning models

3. Diversity Optimization: Leave-One-Out EigenScore (LOOE)

LOOE Metric Design

Proposes a new response-level diversity metric:

LOOE_i = E_global - E_i

where E_global is the EigenScore computed over the full set of sampled responses and E_i is the EigenScore recalculated after removing response i.
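The leave-one-out computation can be sketched as follows. The `eigenscore` helper assumes the log-determinant formulation (1/K) · logdet(Σ + αI) over feature-centered response embeddings; that helper, the regularizer `alpha`, and the toy embeddings are assumptions for illustration, not the authors' implementation.

```python
import math
from typing import List, Sequence


def eigenscore(embeddings: Sequence[Sequence[float]], alpha: float = 1e-3) -> float:
    """(1/K) * logdet(Gram + alpha * I), computed with a Cholesky factorization."""
    k = len(embeddings)
    centered = [[v - sum(z) / len(z) for v in z] for z in embeddings]
    gram = [
        [
            sum(x * y for x, y in zip(centered[i], centered[j]))
            + (alpha if i == j else 0.0)
            for j in range(k)
        ]
        for i in range(k)
    ]
    low = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1):
            s = gram[i][j] - sum(low[i][t] * low[j][t] for t in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return 2.0 * sum(math.log(low[i][i]) for i in range(k)) / k


def looe(embeddings: Sequence[Sequence[float]], alpha: float = 1e-3) -> List[float]:
    """LOOE_i = E_global - E_i, where E_i drops response i before rescoring."""
    if len(embeddings) < 2:
        raise ValueError("need at least two responses")
    e_global = eigenscore(embeddings, alpha)
    scores = []
    for i in range(len(embeddings)):
        rest = [z for j, z in enumerate(embeddings) if j != i]
        scores.append(e_global - eigenscore(rest, alpha))
    return scores


# Toy embeddings (hypothetical): two duplicate responses and one distinct response.
responses = [[1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
scores = looe(responses)
print(scores.index(max(scores)))  # 2: the distinct response has the largest LOOE
```

Under this definition, removing the one distinct response collapses the remaining set to duplicates and lowers the EigenScore the most, so that response receives the largest LOOE; each duplicate receives an identical, lower score.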

DivPO Experimental Results

  • LOOE performs comparably to other diversity metrics on both diversity and reward
  • Compared to traditional metrics, LOOE offers three unique advantages:
    1. It uses model-internal information
    2. It is semantically aware
    3. It evaluates diversity at the response level

Related Work

Uncertainty Quantification and Model Calibration

Traditional calibration primarily focuses on aligning UQ metrics with correctness on factual tasks. This paper extends calibration to broader open-ended tasks.

Diversity Metrics

Existing diversity metrics (e.g., unique n-gram, self-BLEU) are primarily post-hoc and cannot access model internal representations. EigenScore provides semantically-aware diversity measurement based on model internals.

Hallucination Detection

Methods like semantic entropy and Kernel Language Entropy are primarily used for hallucination detection. This paper demonstrates broader value of these metrics in GSS estimation.

Conclusions and Discussion

Main Conclusions

  1. Unified Framework: GSS provides a unified perspective for understanding different types of LLM generation failures
  2. Metric Findings: EigenScore serves as the best GSS proxy metric, surpassing traditional diversity and uncertainty metrics
  3. Broad Applicability: The GSS concept has value across multiple domains including ambiguity detection, reasoning analysis, and diversity optimization

Limitations

  1. Content Agnosticism: GSS is insensitive to the quality of generated content
  2. Evaluation Assumptions: Assumes model GSS approximates true GSS, though this assumption may not always hold
  3. Computational Complexity: Some metrics (e.g., EigenScore) have high computational costs

Future Directions

  1. GSS-Aware Training: Develop training methods that dynamically adjust GSS
  2. Better Proxy Metrics: Seek more accurate and efficient GSS estimation methods
  3. Content-Sensitive Extensions: Combine GSS with content quality assessment

In-Depth Evaluation

Strengths

  1. Theoretical Innovation: Proposes GSS as a unifying concept to understand seemingly disparate generation problems, with significant theoretical value
  2. Systematic Evaluation: GSSBench provides comprehensive evaluation framework, filling a gap in the field
  3. Strong Practicality: Three application cases demonstrate practical value of the GSS concept
  4. Rigorous Methodology: Ground-truth relationships constructed through set-theoretic operations avoid subjective judgment
  5. Important Findings: Discovery of EigenScore as GSS proxy provides new tools for the field

Weaknesses

  1. Scale Limitations: Primarily tested on smaller models; performance on larger models may differ
  2. Task Coverage: While covering multiple task types, coverage may not be comprehensive
  3. Theoretical Analysis: Lacks in-depth theoretical explanation for why EigenScore performs best
  4. Computational Efficiency: Computational costs of certain metrics may limit practical applications

Impact

  1. Academic Contribution: Provides new theoretical framework and tools for LLM generation quality assessment
  2. Practical Value: Offers guidance for improving LLM performance across different task types
  3. Reproducibility: Provides detailed experimental settings and dataset construction methods

Applicable Scenarios

  1. Model Evaluation: Assessing LLM calibration across different task types
  2. Model Training: Guiding development of GSS-aware training methods
  3. Applied Systems: Optimizing diversity control in dialogue systems, content generation, and other applications

References

This paper cites important works in related fields, including:

  • Uncertainty Quantification: Kuhn et al. (2023), Farquhar et al. (2024)
  • Diversity Metrics: Kirk et al. (2024), Li et al. (2024)
  • Hallucination Detection: Chen et al. (2024), Nikitin et al. (2024)
  • Model Calibration: Huang et al. (2024), Vashurin et al. (2025)

Overall Assessment: This is a high-quality research paper that proposes an innovative theoretical framework for understanding different generation problems in LLMs. Both the GSSBench evaluation framework and the discovery of EigenScore as a GSS proxy metric have significant academic and practical value. Despite some limitations, its contributions are substantial enough to provide valuable tools and insights for the field.