Generation Space Size: Understanding and Calibrating Open-Endedness of LLM Generations
Yu, Jabbar, Hawkins et al.
Different open-ended generation tasks require different degrees of output diversity. However, current LLMs are often miscalibrated. They collapse to overly homogeneous outputs for creative tasks and hallucinate diverse but incorrect responses for factual tasks. We argue that these two failure modes are unified by, and can both be addressed by, the notion of effective generation space size (GSS) -- the set of semantically distinct outputs a model considers for a prompt. We present GSSBench, a task suite of prompt pairs with ground-truth GSS relationships to assess different metrics and understand where models diverge from desired behavior. We find that hallucination detection metrics, particularly EigenScore, consistently outperform standard diversity and uncertainty quantification metrics, while using only model internals, providing interpretable insights into a model's internal task representations. We demonstrate three applications of GSS: (1) detecting prompt ambiguity and predicting clarification questions for better grounding, (2) interpreting overthinking and underthinking in reasoning models, and (3) steering models to expand their generation space to yield high-quality and diverse outputs.
Different open-ended generation tasks require varying degrees of output diversity. However, current large language models (LLMs) are often poorly calibrated: they produce overly homogeneous outputs on creative tasks and diverse but incorrect (hallucinated) responses on factual tasks. The paper argues that both failure modes can be unified and addressed through the concept of effective generation space size (GSS), the set of semantically distinct outputs that a model considers for a given prompt. The authors introduce GSSBench, a task suite of prompt pairs with ground-truth GSS relationships, designed to assess different metrics and to show where models deviate from desired behavior. The study finds that hallucination detection metrics, particularly EigenScore, consistently outperform standard diversity and uncertainty quantification metrics while using only model internals, and that they provide interpretable insights into a model's internal task representations.
Traditional approaches address these two problems separately: they either maximize output diversity or constrain it to improve factual accuracy. This paper proposes a unified perspective, arguing that both problems stem from miscalibration of Generation Space Size (GSS).
Theoretical Contribution: Proposes Generation Space Size (GSS) as a unified framework, viewing output homogeneity and hallucination issues as two aspects of GSS calibration errors
Evaluation Framework: Constructs GSSBench, an evaluation suite containing 9,300 prompt pairs for measuring GSS and its calibration errors
Methodological Findings: Demonstrates that hallucination detection metrics like EigenScore outperform traditional diversity and uncertainty quantification metrics in GSS estimation
Practical Applications: Showcases the value of GSS in three important applications: prompt ambiguity detection, reasoning model analysis, and diversity optimization
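Because GSSBench consists of prompt pairs with a known GSS ordering, one natural way to score a candidate metric is pairwise accuracy: the fraction of pairs in which the metric assigns the higher value to the prompt with the larger ground-truth GSS. The sketch below illustrates that idea only. The pair layout, the function name, the example prompts, and the toy scores are all assumptions made for illustration; they are not taken from the GSSBench release.

```python
from typing import Callable, Sequence, Tuple

# Assumed layout for illustration: (prompt_with_smaller_gss, prompt_with_larger_gss).
# This is NOT the GSSBench file format.
PromptPair = Tuple[str, str]


def pairwise_accuracy(pairs: Sequence[PromptPair],
                      metric: Callable[[str], float]) -> float:
    """Fraction of pairs where the metric scores the larger-GSS prompt strictly higher."""
    if not pairs:
        raise ValueError("need at least one prompt pair")
    correct = sum(1 for small, large in pairs if metric(large) > metric(small))
    return correct / len(pairs)


# Hypothetical metric values standing in for a real GSS estimate.
toy_scores = {
    "Name the capital of France.": 0.10,
    "Name a city in France.": 0.60,
    "Write a haiku about the sea at dawn.": 0.95,
    "Write a poem about the sea.": 0.90,
}
toy_pairs = [
    ("Name the capital of France.", "Name a city in France."),
    ("Write a haiku about the sea at dawn.", "Write a poem about the sea."),
]

# The toy metric orders the first pair correctly and the second incorrectly.
accuracy = pairwise_accuracy(toy_pairs, toy_scores.__getitem__)
print(accuracy)
```

A metric that tracks GSS well should approach an accuracy of 1.0 on such pairs, while a metric with no signal should stay near 0.5.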
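For each prompt p, there exists a true generation space G_t(p): the semantic distribution of all possible correct outputs. Model m also has a generation space G_m(p): the output space the model "considers" for a given prompt. With |·| denoting the size of a generation space, the GSS calibration error ε_m(p) is defined through:
|G_m(p)| = |G_t(p)| + ε_m(p)
where ε_m(p) is the signed error between the model's GSS and the expected GSS. It is positive when the model considers more outputs than are correct (as in hallucination on factual prompts) and negative when it considers fewer (as in homogeneous outputs on creative prompts).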
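Rearranging the definition gives ε_m(p) = |G_m(p)| - |G_t(p)|, so the sign of the error indicates the direction of miscalibration. The short sketch below works through two cases. The sizes are made-up counts of semantically distinct outputs, and the function name is hypothetical; in practice neither size is directly observable and both must be estimated with a proxy metric.

```python
def gss_calibration_error(model_gss: float, true_gss: float) -> float:
    """Signed error eps_m(p) = |G_m(p)| - |G_t(p)|, rearranged from the definition."""
    return model_gss - true_gss


# Factual prompt (hypothetical sizes): one correct answer, but the model
# spreads over four distinct answers, so the error is positive.
factual_error = gss_calibration_error(model_gss=4, true_gss=1)

# Creative prompt (hypothetical sizes): many valid outputs, but the model
# collapses to a handful, so the error is negative.
creative_error = gss_calibration_error(model_gss=5, true_gss=50)

print(factual_error, creative_error)
```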
EigenScore variants exhibit clear bimodal distributions, effectively distinguishing prompts with different GSS, while other metrics show more overlapping distributions.
Existing diversity metrics (e.g., unique n-grams, Self-BLEU) are primarily post-hoc: they are computed on generated text and cannot access the model's internal representations. EigenScore provides a semantically aware diversity measurement based on model internals.
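To make the idea of an embedding-based diversity score concrete, the sketch below computes a log-determinant of a regularized Gram matrix over the embeddings of K sampled generations, in the spirit of EigenScore. It is an illustration under stated assumptions, not the reference implementation: the per-embedding centering, the regularizer alpha = 1e-3, and the helper names are assumptions, and the toy vectors stand in for hidden states that would normally be read from the model. It uses only the standard library so that it runs anywhere; a numerical library would be the usual choice at realistic sizes.

```python
import math
from typing import List, Sequence


def log_det_spd(matrix: Sequence[Sequence[float]]) -> float:
    """Log-determinant of a symmetric positive-definite matrix via Cholesky."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    log_det = 0.0
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
                log_det += 2.0 * math.log(lower[i][j])
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return log_det


def eigenscore_sketch(embeddings: Sequence[Sequence[float]], alpha: float = 1e-3) -> float:
    """(1/K) * logdet(Gram + alpha * I) over K sample embeddings (assumed form)."""
    k = len(embeddings)
    # Center each embedding across its feature dimension (an assumption here).
    centered: List[List[float]] = []
    for row in embeddings:
        mean = sum(row) / len(row)
        centered.append([x - mean for x in row])
    # K x K Gram matrix of the centered embeddings, regularized on the diagonal.
    gram = [[sum(a * b for a, b in zip(centered[i], centered[j]))
             + (alpha if i == j else 0.0)
             for j in range(k)] for i in range(k)]
    return log_det_spd(gram) / k


# Toy embeddings: three distinct directions versus three identical samples.
diverse = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
homogeneous = [[1.0, 0.0, 0.0, 0.0]] * 3

diverse_score = eigenscore_sketch(diverse)
homogeneous_score = eigenscore_sketch(homogeneous)
print(diverse_score > homogeneous_score)
```

Identical samples make the Gram matrix nearly rank one, so most of its eigenvalues sit at the regularizer and the score drops sharply. Samples that spread over several directions keep the eigenvalues large and the score higher.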
Methods such as semantic entropy and Kernel Language Entropy are primarily used for hallucination detection. This paper demonstrates the broader value of these metrics for GSS estimation.
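As a rough illustration of how an entropy over meanings can act as a GSS signal, the sketch below computes a discrete entropy over semantic clusters of sampled generations. It assumes the clustering has already been done, for example by grouping answers that mean the same thing; the cluster labels and the function name are hypothetical, and the clustering step itself is outside the scope of this sketch.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def discrete_semantic_entropy(cluster_ids: Sequence[Hashable]) -> float:
    """Shannon entropy (in nats) of the empirical distribution over semantic clusters."""
    if not cluster_ids:
        raise ValueError("need at least one sampled generation")
    total = len(cluster_ids)
    counts = Counter(cluster_ids)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Hypothetical cluster labels for eight sampled generations per prompt.
narrow_prompt = ["paris"] * 8                       # every sample means the same thing
open_prompt = ["a", "b", "c", "d", "a", "b", "c", "d"]  # four meanings, equally likely

narrow_entropy = discrete_semantic_entropy(narrow_prompt)
open_entropy = discrete_semantic_entropy(open_prompt)
print(narrow_entropy, open_entropy)
```

A prompt whose samples all fall into one cluster has zero entropy, while a prompt whose samples spread evenly over four clusters reaches log 4, the maximum for four clusters.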
This paper cites important works in related fields, including:
Uncertainty Quantification: Kuhn et al. (2023), Farquhar et al. (2024)
Diversity Metrics: Kirk et al. (2024), Li et al. (2024)
Hallucination Detection: Chen et al. (2024), Nikitin et al. (2024)
Model Calibration: Huang et al. (2024), Vashurin et al. (2025)
Overall Assessment: This is a high-quality research paper that proposes an innovative theoretical framework for understanding two seemingly distinct generation problems in LLMs. Both the GSSBench evaluation framework and the finding that EigenScore serves as a GSS proxy metric have significant academic and practical value. Despite some limitations, the contributions are substantial and provide useful tools and insights for the field.