
Generation Space Size: Understanding and Calibrating Open-Endedness of LLM Generations

Yu, Jabbar, Hawkins et al.

Different open-ended generation tasks require different degrees of output diversity. However, current LLMs are often miscalibrated. They collapse to overly homogeneous outputs for creative tasks and hallucinate diverse but incorrect responses for factual tasks. We argue that these two failure modes are unified by, and can both be addressed by, the notion of effective generation space size (GSS) -- the set of semantically distinct outputs a model considers for a prompt. We present GSSBench, a task suite of prompt pairs with ground-truth GSS relationships to assess different metrics and understand where models diverge from desired behavior. We find that hallucination detection metrics, particularly EigenScore, consistently outperform standard diversity and uncertainty quantification metrics, while using only model internals, providing interpretable insights into a model's internal task representations. We demonstrate three applications of GSS: (1) detecting prompt ambiguity and predicting clarification questions for better grounding, (2) interpreting overthinking and underthinking in reasoning models, and (3) steering models to expand their generation space to yield high-quality and diverse outputs.


Basic Information

Paper ID: 2510.12699
Title: Generation Space Size: Understanding and Calibrating Open-Endedness of LLM Generations
Authors: Sunny Yu, Ahmad Jabbar, Robert D. Hawkins, Dan Jurafsky, Myra Cheng (Stanford University)
Classification: cs.CL, cs.AI
Publication Status: Under Review
Paper Link: https://arxiv.org/abs/2510.12699

Abstract

Different open-ended generation tasks require varying degrees of output diversity. However, current large language models (LLMs) are often poorly calibrated: they produce overly homogeneous outputs in creative tasks and generate diverse but incorrect (hallucinated) responses in factual tasks. This paper proposes that both failure modes can be unified and addressed through the concept of "Generation Space Size" (GSS), the set of semantically distinct outputs that a model considers for a given prompt. The authors introduce GSSBench, an evaluation suite of prompt pairs with ground-truth GSS relationships, designed to assess different metrics and to understand where models deviate from the desired behavior. The study finds that hallucination detection metrics (particularly EigenScore) consistently outperform standard diversity and uncertainty quantification metrics while using only model-internal information, providing interpretable insights into a model's internal task representations.

Research Background and Motivation

Core Problem

Current LLMs exhibit two primary generation failure modes:

Output Homogeneity in Creative Tasks: In tasks requiring diversity (e.g., brainstorming, creative writing), models produce overly similar outputs
Hallucination in Factual Tasks: In tasks requiring accuracy (e.g., question answering), models generate diverse but incorrect answers

Research Motivation

Traditional approaches address these two problems separately: either by maximizing output diversity, or by constraining diversity to improve factual accuracy. This paper proposes a unified perspective, arguing that both problems stem from miscalibration of Generation Space Size (GSS).

Limitations of Existing Approaches

Lack of a unified theoretical framework to understand different types of generation failures
Most existing diversity metrics are post-hoc and cannot directly access model internal representations
Absence of systematic evaluation frameworks to quantify model GSS calibration capability

Core Contributions

Theoretical Contribution: Proposes Generation Space Size (GSS) as a unified framework, viewing output homogeneity and hallucination issues as two aspects of GSS calibration errors
Evaluation Framework: Constructs GSSBench, an evaluation suite containing 9,300 prompt pairs for measuring GSS and its calibration errors
Methodological Findings: Demonstrates that hallucination detection metrics like EigenScore outperform traditional diversity and uncertainty quantification metrics in GSS estimation
Practical Applications: Showcases the value of GSS in three important applications: prompt ambiguity detection, reasoning model analysis, and diversity optimization

Methodology Details

Task Definition

For each prompt p, there exists a true generation space G_t(p): the semantic distribution of all possible correct outputs. A model m likewise has its own generation space G_m(p): the output space the model "considers" for that prompt. The sizes of the two spaces are related by:

|G_m(p)| = |G_t(p)| + ε_m(p)

where ε_m(p) is the GSS calibration error: the gap between the model's GSS and the true GSS.
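
As an illustration of how the sign of the error term can be read against the two failure modes, the following worked example treats the sizes as simple counts of semantically distinct outputs. The counting interpretation and all numbers are assumptions made only for this example; they are not taken from the paper.

```latex
% Factual prompt (hypothetical): one correct answer, but the model
% spreads over four semantically distinct answers.
|G_t(p)| = 1,\quad |G_m(p)| = 4
  \;\Rightarrow\; \epsilon_m(p) = |G_m(p)| - |G_t(p)| = 3 > 0

% Creative prompt (hypothetical): fifty acceptable outputs, but the
% model collapses onto two of them.
|G_t(p)| = 50,\quad |G_m(p)| = 2
  \;\Rightarrow\; \epsilon_m(p) = |G_m(p)| - |G_t(p)| = -48 < 0
```

On this reading, a positive error corresponds to the hallucination-style failure (too large a space for a factual prompt) and a negative error to the homogeneity failure (too small a space for a creative prompt).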

GSSBench Evaluation Framework

Dataset Construction

Six datasets are constructed based on set-theoretic operations, totaling 9,300 prompt pairs:

Complement: Base prompt vs. complement prompt (e.g., "Write a poem about the moon" vs. "Write anything that is not about the moon")
FactualQA: Specific questions vs. general questions (e.g., "Rivers in Brazil" vs. "Rivers")
Random Choice: Multiple-choice questions with varying numbers of options
Subset: Subset relationships created by adding constraints
Union: Expanded generation space through "or" connections
Intersection: Reduced generation space through "and" connections
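
The set-theoretic constructions above can be sketched as prompt templates. This is a minimal illustration rather than the authors' construction code: the `PromptPair` type, the function names, and the template wording (other than the moon example quoted above) are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptPair:
    """Two prompts with a known ordering: GSS(larger) > GSS(smaller)."""
    larger: str   # prompt whose true generation space is bigger
    smaller: str  # prompt whose true generation space is smaller
    kind: str     # set-theoretic operation that produced the pair


def complement_pair(task: str, topic: str) -> PromptPair:
    # "Anything not about X" admits far more outputs than "something about X".
    return PromptPair(
        larger=f"Write anything that is not about {topic}",
        smaller=f"{task} about {topic}",
        kind="complement",
    )


def union_pair(task: str, a: str, b: str) -> PromptPair:
    # "A or B" covers at least everything that "A" alone covers.
    return PromptPair(
        larger=f"{task} about {a} or {b}",
        smaller=f"{task} about {a}",
        kind="union",
    )


def intersection_pair(task: str, a: str, b: str) -> PromptPair:
    # "A and B" must satisfy both constraints, which shrinks the space.
    return PromptPair(
        larger=f"{task} about {a}",
        smaller=f"{task} about {a} and {b}",
        kind="intersection",
    )


pair = complement_pair("Write a poem", "the moon")
print(pair.larger)
print(pair.smaller)
```

Because the ordering follows from the construction itself, no human labelling of "which prompt is more open-ended" is needed for these pairs.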

Evaluation Metrics

Pairwise accuracy is used to evaluate metric f's ability to predict GSS ordering:

For each prompt pair (x, y) where |G_t(x)| > |G_t(y)|, the score is 1 if f(x) > f(y), otherwise 0
Pairwise accuracy is the mean score over all pairs
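
The scoring rule above can be sketched in a few lines. This is a minimal sketch, assuming each pair is stored with the larger-GSS prompt first; the function name and the toy metric (prompt length in characters) are hypothetical and stand in for a real GSS proxy.

```python
from typing import Callable, Iterable, Tuple


def pairwise_accuracy(
    metric: Callable[[str], float],
    pairs: Iterable[Tuple[str, str]],
) -> float:
    """Fraction of pairs (x, y), with |G_t(x)| > |G_t(y)|, where metric(x) > metric(y)."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one prompt pair")
    hits = sum(1 for x, y in pairs if metric(x) > metric(y))
    return hits / len(pairs)


# Each tuple is (larger-GSS prompt, smaller-GSS prompt).
toy_pairs = [
    ("Rivers", "Rivers in Brazil"),
    ("Name a colour or a fruit", "Name a colour"),
]

# Toy stand-in metric: character count. It gets the second pair right
# and the first pair wrong.
print(pairwise_accuracy(len, toy_pairs))  # 0.5
```

A metric that carries no information about GSS would be expected to land near 0.5 under this scheme, which is the baseline against which the reported accuracies can be read.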

Candidate Metrics Analysis

Multiple metrics are evaluated as proxies for GSS:

Traditional Metrics: Perplexity, energy, length-normalized entropy, lexical similarity
Hallucination Detection Metrics: EigenScore and variants, semantic entropy
EigenScore Variants:
- E_original: Original version
- E_average: Averaged across layers and tokens
- E_output: Using external sentence embedding models
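
For orientation, the following is a minimal sketch of an EigenScore-style computation over K sampled responses, assuming the commonly cited form (1/K) * logdet(Sigma + alpha * I), where Sigma is the K x K covariance between response embeddings. The centering step, the default `alpha`, and the function name are assumptions; the variants listed above differ mainly in where the embeddings come from (a hidden state, an average over layers and tokens, or an external sentence embedding model), which this sketch leaves to the caller.

```python
import numpy as np


def eigenscore(embeddings: np.ndarray, alpha: float = 1e-3) -> float:
    """(1/K) * logdet(Sigma + alpha * I) for K response embeddings.

    embeddings: array of shape (K, d), one row per sampled response.
    """
    z = np.asarray(embeddings, dtype=np.float64)
    if z.ndim != 2 or z.shape[0] < 2:
        raise ValueError("expected a (K, d) array with K >= 2")
    k = z.shape[0]
    # Centre each embedding across its feature dimension, then form the
    # K x K covariance (Gram) matrix between responses.
    z = z - z.mean(axis=1, keepdims=True)
    sigma = z @ z.T
    # alpha * I keeps the matrix full rank so the log-determinant is finite.
    _, logdet = np.linalg.slogdet(sigma + alpha * np.eye(k))
    return float(logdet / k)


rng = np.random.default_rng(0)
identical = np.tile(rng.normal(size=(1, 16)), (5, 1))  # five copies of one response
diverse = rng.normal(size=(5, 16))                     # five unrelated responses
print(eigenscore(identical) < eigenscore(diverse))
```

Semantically similar responses give a near rank-one covariance and a low score, while spread-out responses give a higher score, which is why the quantity can serve as a proxy for the size of the generation space.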

Experimental Setup

Model Selection

Five instruction-tuned models are tested:

Llama-8B-Instruct
Mistral-7B-v0.3
Qwen3 series (0.6B, 4B, 8B)

Hyperparameter Settings

Temperature: 1.0
Number of samples: 10
Top-k: 10
Optimal parameters determined through ablation studies
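
To make the decoding settings concrete, here is a minimal sketch of what temperature and top-k do to a single next-token draw. It is illustrative only: in practice a generation library applies these settings, and the function name and toy logits are hypothetical.

```python
import numpy as np


def sample_top_k(logits, k=10, temperature=1.0, rng=None):
    """Sample one token id from the k highest-logit tokens."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64)
    k = min(k, logits.size)
    # Indices of the k largest logits (order within the top k is irrelevant).
    top = np.argpartition(logits, -k)[-k:]
    # Temperature rescales logits before the softmax; 1.0 leaves them unchanged.
    scaled = logits[top] / temperature
    probs = np.exp(scaled - scaled.max())  # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(top, p=probs))


rng = np.random.default_rng(0)
toy_logits = np.arange(50, dtype=float)  # token 49 is the most likely
# Ten draws per prompt, mirroring the sample count listed above.
samples = [sample_top_k(toy_logits, k=10, temperature=1.0, rng=rng) for _ in range(10)]
print(samples)
```

With top-k of 10, every draw is restricted to the ten highest-scoring tokens, so diversity across the ten samples comes only from how probability mass is spread within that set.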

Experimental Results

Main Findings

EigenScore Variants Perform Best

E_output and E_average achieve highest accuracy across all models
E_output reaches 71.7% accuracy on Llama-8B-Instruct
E_average reaches 72.4% accuracy on the same model
Both variants significantly outperform traditional metrics such as perplexity (60.0%) and lexical similarity (66.5%)

Model Calibration Analysis

Llama-8B-Instruct shows best calibration across most metrics
Qwen3-0.6B performs best on E_output and semantic entropy
Scale Effects: Larger models are not necessarily better calibrated; Qwen3-0.6B outperforms Qwen3-8B across all metrics

Distribution Analysis

EigenScore variants exhibit clear bimodal distributions, effectively distinguishing prompts with different GSS, while other metrics show more overlapping distributions.

Ablation Studies

Parameter Sensitivity Analysis

Top-k: Variations have minimal impact on performance
Number of Samples: Stable improvement from 0 to 20 samples, with diminishing returns beyond 20
Temperature: EigenScore performs best at temperature 1.0, unlike the 0.5 setting used for hallucination detection

EigenScore Implementation Details

Cross-layer averaging outperforms using single layers
Averaging across all tokens outperforms using only the final token

Practical Applications

1. Prompt Ambiguity Detection and Clarification Question Prediction

Experiment 1: Ambiguity Detection on RIFTS Dataset

On the RIFTS dataset with 1,740 prompts:

Only E_output and E_average correctly distinguish ambiguous from non-ambiguous prompts
E_output significantly differentiates both categories across all test models

Experiment 2: Clarification Question Prediction

E_output and E_average are the only metrics that significantly predict whether models will ask clarification questions across all models
Provides interpretable insights into when models seek clarification

2. Reasoning Model Analysis

Solution Path Count Measurement

On 1,000 logical reasoning problems:

Constructs single-path vs. multi-path prompt pairs
E_output achieves the highest accuracy across all reasoning models (73% on both Qwen3-4B and Qwen3-8B)

Reasoning Token Length Prediction

GSS shows moderate to strong positive correlation with reasoning token length
On deductive reasoning tasks, E_original shows strongest correlation with reasoning length
Provides new perspective on understanding "overthinking" and "underthinking" in reasoning models
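
The relationship between a GSS proxy and reasoning length can be checked with an ordinary correlation coefficient. The summary does not say which coefficient was used, so Pearson's r is an assumption here, and the scores and token counts below are made-up toy values for illustration only.

```python
import numpy as np


def pearson_r(x, y) -> float:
    """Pearson correlation between two equal-length sequences."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    if x.shape != y.shape or x.size < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return float(np.corrcoef(x, y)[0, 1])


# Hypothetical per-prompt values, not results from the paper.
gss_scores = [0.2, 0.5, 0.9, 1.4]        # GSS proxy per prompt
reasoning_tokens = [120, 300, 410, 900]  # reasoning tokens per prompt
print(pearson_r(gss_scores, reasoning_tokens) > 0)
```

A positive coefficient would indicate that prompts with a larger estimated generation space tend to elicit longer reasoning traces, which is the pattern described above.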

3. Diversity Optimization: Leave-One-Out EigenScore (LOOE)

LOOE Metric Design

Proposes a new response-level diversity metric:

LOOE_i = E_global - E_i

where E_global is the EigenScore computed over all sampled responses and E_i is the EigenScore recalculated after removing response i.
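
The leave-one-out computation can be sketched as follows. This is a minimal sketch, not the authors' implementation: the `eigenscore` helper assumes the form (1/K) * logdet(Sigma + alpha * I) over centered response embeddings, and the function names, `alpha`, and the toy embeddings are assumptions.

```python
import numpy as np


def eigenscore(embeddings, alpha=1e-3):
    """(1/K) * logdet(Sigma + alpha * I) for a (K, d) array of embeddings."""
    z = np.asarray(embeddings, dtype=np.float64)
    z = z - z.mean(axis=1, keepdims=True)  # centre across features
    k = z.shape[0]
    _, logdet = np.linalg.slogdet(z @ z.T + alpha * np.eye(k))
    return float(logdet / k)


def leave_one_out_eigenscores(embeddings, alpha=1e-3):
    """LOOE_i = E_global - E_i for every response i."""
    z = np.asarray(embeddings, dtype=np.float64)
    if z.ndim != 2 or z.shape[0] < 3:
        raise ValueError("expected a (K, d) array with K >= 3")
    e_global = eigenscore(z, alpha)
    return np.array([
        e_global - eigenscore(np.delete(z, i, axis=0), alpha)
        for i in range(z.shape[0])
    ])


# Toy embeddings: three identical responses and one that differs.
base = np.zeros(8)
base[0] = 1.0
other = np.zeros(8)
other[1] = 1.0
toy = np.stack([base, base, base, other])

scores = leave_one_out_eigenscores(toy)
print(int(np.argmax(scores)))  # index of the response that adds the most diversity
```

Removing the one distinct response collapses the remaining set, so its E_i is low and its LOOE is the highest; removing a redundant response has the opposite effect. This is what makes the score usable as a per-response diversity signal.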

DivPO Experimental Results

LOOE performs comparably to other diversity metrics on both diversity and reward
Compared to traditional metrics, LOOE offers three unique advantages:
1. Uses model-internal information
2. Semantically aware
3. Response-level evaluation

Related Work

Uncertainty Quantification and Model Calibration

Traditional calibration primarily focuses on aligning UQ metrics with correctness on factual tasks. This paper extends calibration to a broader range of open-ended tasks.

Diversity Metrics

Existing diversity metrics (e.g., unique n-gram, self-BLEU) are primarily post-hoc and cannot access model internal representations. EigenScore provides semantically-aware diversity measurement based on model internals.

Hallucination Detection

Methods like semantic entropy and Kernel Language Entropy are primarily used for hallucination detection. This paper demonstrates broader value of these metrics in GSS estimation.

Conclusions and Discussion

Main Conclusions

Unified Framework: GSS provides a unified perspective for understanding different types of LLM generation failures
Metric Findings: EigenScore serves as the best GSS proxy metric, surpassing traditional diversity and uncertainty metrics
Broad Applicability: The GSS concept has value across multiple domains including ambiguity detection, reasoning analysis, and diversity optimization

Limitations

Content Agnosticism: GSS is insensitive to the quality of generated content
Evaluation Assumptions: Assumes model GSS approximates true GSS, though this assumption may not always hold
Computational Complexity: Some metrics (e.g., EigenScore) have high computational costs

Future Directions

GSS-Aware Training: Develop training methods that dynamically adjust GSS
Better Proxy Metrics: Seek more accurate and efficient GSS estimation methods
Content-Sensitive Extensions: Combine GSS with content quality assessment

In-Depth Evaluation

Strengths

Theoretical Innovation: Proposes GSS as a unifying concept to understand seemingly disparate generation problems, with significant theoretical value
Systematic Evaluation: GSSBench provides comprehensive evaluation framework, filling a gap in the field
Strong Practicality: Three application cases demonstrate practical value of the GSS concept
Rigorous Methodology: Ground-truth relationships constructed through set-theoretic operations avoid subjective judgment
Important Findings: Discovery of EigenScore as GSS proxy provides new tools for the field

Weaknesses

Scale Limitations: Primarily tested on smaller models; performance on larger models may differ
Task Coverage: While covering multiple task types, coverage may not be comprehensive
Theoretical Analysis: Lacks in-depth theoretical explanation for why EigenScore performs best
Computational Efficiency: Computational costs of certain metrics may limit practical applications

Impact

Academic Contribution: Provides new theoretical framework and tools for LLM generation quality assessment
Practical Value: Offers guidance for improving LLM performance across different task types
Reproducibility: Provides detailed experimental settings and dataset construction methods

Applicable Scenarios

Model Evaluation: Assessing LLM calibration across different task types
Model Training: Guiding development of GSS-aware training methods
Applied Systems: Optimizing diversity control in dialogue systems, content generation, and other applications

References

This paper cites important works in related fields, including:

Uncertainty Quantification: Kuhn et al. (2023), Farquhar et al. (2024)
Diversity Metrics: Kirk et al. (2024), Li et al. (2024)
Hallucination Detection: Chen et al. (2024), Nikitin et al. (2024)
Model Calibration: Huang et al. (2024), Vashurin et al. (2025)

Overall Assessment: This is a high-quality research paper that proposes an innovative framework for understanding distinct generation failures in LLMs. Both the GSSBench evaluation suite and the finding that EigenScore works as a GSS proxy have academic and practical value. Despite the limitations noted above, the paper contributes useful tools and insights to the field.