Obtaining high-quality labels for large datasets is expensive, requiring massive annotations from human experts. While AI models offer a cost-effective alternative by predicting labels, their label quality is compromised by the unavoidable labeling errors. Existing methods mitigate this issue through selective labeling, where AI labels a subset and humans label the remainder. However, these methods lack theoretical guarantees on the quality of AI-assigned labels, often resulting in unacceptably high labeling error within the AI-labeled subset. To address this, we introduce Conformal Labeling, a novel method to identify instances where AI predictions can be provably trusted. This is achieved by controlling the false discovery rate (FDR), the proportion of incorrect labels within the selected subset. In particular, we construct a conformal $p$-value for each test instance by comparing AI models' predicted confidence to those of calibration instances mislabeled by AI models. Then, we select test instances whose $p$-values are below a data-dependent threshold, certifying AI models' predictions as trustworthy. We provide theoretical guarantees that Conformal Labeling controls the FDR below the nominal level, ensuring that a predefined fraction of AI-assigned labels is correct on average. Extensive experiments demonstrate that our method achieves tight FDR control with high power across various tasks, including image and text labeling, and LLM QA.
- Paper ID: 2510.14581
- Title: Selective Labeling with False Discovery Rate Control
- Authors: Huipeng Huang, Wenbo Liao, Huajun Xi, Hao Zeng, Mengchen Zhao, Hongxin Wei
- Classification: cs.LG cs.AI
- Publication Date: October 16, 2025 (arXiv preprint)
- Paper Link: https://arxiv.org/abs/2510.14581v1
Obtaining high-quality labels for large-scale datasets is expensive, requiring substantial expert annotation effort. While AI models provide a cost-effective alternative through predicted labels, their label quality is compromised by inevitable annotation errors. Existing methods address this through selective labeling, where AI annotates part of the data and experts annotate the remainder. However, these methods lack theoretical guarantees on the quality of AI-assigned labels, often resulting in unacceptably high error rates in AI-annotated subsets. To address this issue, this paper introduces Conformal Labeling, a novel method for identifying instances with provably trustworthy AI predictions. This is achieved by controlling the False Discovery Rate (FDR)—the proportion of mislabeled instances in the selected subset. Specifically, a conformal p-value is constructed for each test instance by comparing the AI model's prediction confidence with the confidence of calibration instances misclassified by the AI model. Test instances with p-values below a data-dependent threshold are then selected, certifying that the AI model's predictions are trustworthy. The paper provides theoretical guarantees proving that Conformal Labeling controls FDR below the nominal level, ensuring that on average a predefined proportion of AI-assigned labels are correct.
- Core Problem: The cost of obtaining high-quality annotations for large-scale datasets. As modern datasets grow in scale, expert annotation becomes prohibitively expensive, while AI models, though offering cost-effective alternatives, suffer from inevitable annotation errors.
- Problem Significance:
- High-quality annotated data is critical to machine learning pipelines
- Even state-of-the-art LLMs exhibit high error rates on text annotation tasks
- Inherent annotation errors in AI models severely impact label quality, hindering deployment of AI annotation in production environments
- Limitations of Existing Methods:
- Heuristic methods lack theoretical guarantees, relying on AI models to annotate high-confidence instances
- PAC labeling, while providing theoretical guarantees, only controls overall annotation error; error rates in AI-annotated subsets can reach 100%
- Existing selective labeling methods cannot guarantee the quality of AI-assigned labels
- Research Motivation: There is a need for a method that can rigorously guarantee the quality of AI-assigned labels, not merely control overall annotation errors.
- Proposes Conformal Labeling Method: A novel approach for identifying instances with provably trustworthy AI predictions, providing strict quality guarantees for AI-assigned labels through rigorous FDR control, independent of AI model performance.
- Theoretical Guarantees: Proves that Conformal Labeling controls the FDR, so the expected proportion of mislabeled instances among AI-assigned labels stays below the user-specified level.
- Comprehensive Experimental Validation: Extensive experiments on image annotation, text annotation, and LLM question-answering tasks demonstrate that Conformal Labeling significantly reduces annotation costs while maintaining strict FDR control.
Consider a multi-class classification task with feature space $\mathcal{X}$ and label space $\mathcal{Y}=\{1,\dots,K\}$. The test dataset $\mathcal{D}_{\text{test}}=\{X_j\}_{j=1}^{m}$ contains $m$ instances sampled independently and identically from the data distribution $P_X$. A pre-trained AI model $f:\mathcal{X}\to\mathbb{R}^{|\mathcal{Y}|}$ generates labels, with predicted label $\hat{Y}=\arg\max_{y\in\mathcal{Y}} f_y(X)$.
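The objective is to select the largest possible subset $R\subseteq\{1,\dots,m\}$ while controlling the false discovery rate:
$$\mathrm{FDR}=\mathbb{E}\left[\frac{|R\cap\mathcal{H}_0|}{\max(|R|,1)}\right]$$
where $\mathcal{H}_0=\{j\in\{1,\dots,m\}:Y_j\neq\hat{Y}_j\}$ is the set of indices for incorrect predictions.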
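As a concrete reading of this definition, the quantity inside the expectation (the false discovery proportion of one selection) can be computed as in the sketch below. The function name `false_discovery_proportion` and the toy labels are illustrative assumptions, not taken from the paper; the FDR is the average of this quantity over repeated draws of the data.

```python
import numpy as np

def false_discovery_proportion(selected, y_true, y_pred):
    """Fraction of selected instances whose AI label is wrong (0 if nothing is selected)."""
    selected = np.asarray(selected, dtype=int)
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # An instance belongs to H_0 when the AI prediction disagrees with the true label.
    wrong = y_true[selected] != y_pred[selected]
    return wrong.sum() / max(len(selected), 1)

# Toy data (hypothetical): predictions at indices 2 and 4 are wrong.
y_true = np.array([0, 1, 2, 1, 0])
y_pred = np.array([0, 1, 1, 1, 2])

# Selecting indices 0..3 includes one wrong label out of four.
fdp = false_discovery_proportion([0, 1, 2, 3], y_true, y_pred)
fdp_empty = false_discovery_proportion([], y_true, y_pred)
```

The `max(len(selected), 1)` term mirrors the $\max(|R|,1)$ denominator, so an empty selection contributes a proportion of zero rather than a division by zero.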
Conformal Labeling comprises three main steps:
Define an uncertainty score $S:\mathcal{X}\to\mathbb{R}$, where higher values indicate greater model uncertainty:
$$S(X)=1-\max_{y\in\mathcal{Y}} f_y(X)$$
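A minimal sketch of this score, assuming the model outputs are logits that are first passed through a softmax so that $f_y(X)$ is a probability. The helper names `softmax` and `msp_uncertainty` and the example logits are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def msp_uncertainty(probs):
    """Uncertainty score S(X) = 1 - max_y f_y(X); higher means less confident."""
    probs = np.asarray(probs, dtype=float)
    return 1.0 - probs.max(axis=1)

# Two hypothetical instances: one confident, one maximally unsure over 3 classes.
logits = np.array([[4.0, 0.0, 0.0],
                   [1.0, 1.0, 1.0]])
probs = softmax(logits)
scores = msp_uncertainty(probs)   # second score is 1 - 1/3
preds = probs.argmax(axis=1)      # predicted labels (argmax of f)
```

The confident instance receives a score near zero, while the uniform prediction receives the largest score possible for three classes.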
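Reformulate the problem as multiple hypothesis testing, where the null hypothesis states that the AI label is wrong:
$$H_j^0: Y_{n+j}\neq\hat{Y}_{n+j}\quad\text{vs.}\quad H_j^1: Y_{n+j}=\hat{Y}_{n+j}$$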
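Let $\mathcal{D}_{\text{cal}}^{0}=\{(X_i,Y_i)\}_{i=1}^{n_0}$ be the subset of calibration instances that the AI model misclassifies. The conformal p-value for test instance $X_{n+j}$ is computed as:
$$\hat{p}_j=\frac{\sum_{i=1}^{n_0}\mathbb{1}\{S_i<S_{n+j}\}+\left(1+\sum_{i=1}^{n_0}\mathbb{1}\{S_i=S_{n+j}\}\right)\cdot U_j}{n_0+1}$$
where $S_i=S(X_i)$ and $U_j\sim\mathrm{Uniform}[0,1]$ breaks ties at random.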
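The p-value construction can be sketched in NumPy as follows. This is an illustrative reading of the formula rather than the authors' code: the function name `conformal_p_values`, its arguments, and the example scores are assumptions.

```python
import numpy as np

def conformal_p_values(cal_scores_mislabeled, test_scores, rng=None):
    """Randomized conformal p-values against the mislabeled calibration scores.

    cal_scores_mislabeled: uncertainty scores S_i of the n0 calibration
        instances that the AI model got wrong.
    test_scores: uncertainty scores S_{n+j} of the test instances.
    """
    rng = np.random.default_rng() if rng is None else rng
    cal = np.sort(np.asarray(cal_scores_mislabeled, dtype=float))
    test = np.asarray(test_scores, dtype=float)
    n0 = cal.size
    # Count calibration scores strictly below, and exactly equal to, each test score.
    below = np.searchsorted(cal, test, side="left")
    equal = np.searchsorted(cal, test, side="right") - below
    u = rng.uniform(size=test.shape)  # U_j, the random tie-breaker
    return (below + (1 + equal) * u) / (n0 + 1)

# Hypothetical scores: mislabeled calibration instances tend to be uncertain.
cal_wrong = [0.6, 0.7, 0.8, 0.9]
test = [0.05, 0.75, 0.95]
p_hat = conformal_p_values(cal_wrong, test, rng=np.random.default_rng(0))
```

A test instance that is more confident than every mislabeled calibration instance gets a p-value below $1/(n_0+1)$, which is the evidence used to reject the "mislabeled" null.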
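Adopt a threshold rule inspired by the Benjamini-Hochberg (BH) procedure:
$$j^{*}=\max\left\{j:\hat{p}_{(j)}\leq\frac{\alpha\, j\,(n+1)}{m\,(n_0+1)}\right\}$$
where $\hat{p}_{(j)}$ denotes the $j$-th smallest p-value. The selection set is $R=\{j:\hat{p}_j\leq\hat{p}_{(j^{*})}\}$.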
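The thresholding step can be sketched as below. The function name `conformal_labeling_select` and the example numbers are hypothetical; the sketch assumes that `n` is the total number of calibration instances and `n0` the number of those that the AI model mislabels, and it returns an empty selection when no p-value passes its threshold.

```python
import numpy as np

def conformal_labeling_select(p_values, alpha, n, n0):
    """BH-style selection with threshold alpha * j * (n + 1) / (m * (n0 + 1))."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    sorted_p = np.sort(p)
    ranks = np.arange(1, m + 1)
    thresholds = alpha * ranks * (n + 1) / (m * (n0 + 1))
    passing = np.nonzero(sorted_p <= thresholds)[0]
    if passing.size == 0:
        return np.array([], dtype=int)  # nothing can be certified
    cutoff = sorted_p[passing.max()]    # p-value at rank j*
    return np.nonzero(p <= cutoff)[0]   # R = {j : p_j <= p_(j*)}

# Hypothetical p-values for m = 5 test instances.
p_values = [0.001, 0.60, 0.004, 0.90, 0.02]
selected = conformal_labeling_select(p_values, alpha=0.1, n=100, n0=20)
nothing = conformal_labeling_select([0.90, 0.95], alpha=0.1, n=100, n0=20)
```

Compared with plain BH, the extra factor $(n+1)/(n_0+1)$ raises the thresholds when few calibration instances are mislabeled, so a more accurate model can certify more test instances at the same $\alpha$.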
- Multiple Hypothesis Testing Framework: Reformulates selective labeling as a multiple hypothesis testing problem, which makes rigorous statistical guarantees possible.
- Conformal p-value Construction: Constructs p-values through rank-based comparison with uncertainty scores of known misclassified instances, ensuring p-values of mislabeled instances stochastically dominate the uniform distribution.
- Data-Dependent Threshold: Carefully sets thresholds using calibration data to control label quality at the desired FDR level.
Image Classification:
- ImageNet (Deng et al., 2009)
- ImageNet-V2 (Recht et al., 2019)
Text Annotation:
- Stance on Global Warming (Luo et al., 2021): Determines whether titles acknowledge global warming as a serious problem
- Misinformation (Gabriel et al., 2022): Binary annotation identifying whether text contains misinformation
LLM Question-Answering:
- MedMCQA (Pal et al., 2022)
- MMLU (Hendrycks et al., 2021)
- MMLU-Pro (Wang et al., 2024)
- FDR: Expected proportion of mislabeled instances in the selected set
- Power: Proportion of correctly predicted test instances that are selected for AI annotation
- AI Annotation Ratio: Number of instances annotated by AI divided by total size of calibration and test datasets
- Naive Method: Uses AI model to annotate test instances with uncertainty score Sn+j≤0.1
- Full AI Annotation: Applies AI predictions to entire test dataset
- BH Variants: BH, Storey-BH, Quantile-BH procedures
- Each experiment repeated 1000 times with averaged results reported
- 10% of data randomly selected as calibration dataset
- Maximum softmax probability (MSP) used as uncertainty score function
- Target FDR level set to α = 0.1
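To make the evaluation metrics and the naive baseline concrete, the sketch below implements a fixed-threshold selection together with power and AI annotation ratio. The function names and toy data are assumptions; power is read here as the fraction of correctly predicted test instances that end up selected.

```python
import numpy as np

def naive_select(test_scores, threshold=0.1):
    """Naive baseline: accept every AI label whose uncertainty score is <= threshold."""
    return np.nonzero(np.asarray(test_scores, dtype=float) <= threshold)[0]

def power(selected, y_true, y_pred):
    """Share of correctly predicted test instances that were selected."""
    correct = np.asarray(y_true) == np.asarray(y_pred)
    return correct[selected].sum() / max(correct.sum(), 1)

def ai_annotation_ratio(selected, n_cal, m_test):
    """AI-annotated instances divided by the combined calibration and test size."""
    return len(selected) / (n_cal + m_test)

# Hypothetical test set of 5 instances, with a calibration set of 5 more.
test_scores = [0.02, 0.08, 0.30, 0.05, 0.50]
y_true = np.array([1, 0, 2, 1, 0])
y_pred = np.array([1, 0, 2, 0, 1])

chosen = naive_select(test_scores)                      # indices 0, 1, 3
naive_power = power(chosen, y_true, y_pred)             # 2 of 3 correct predictions kept
naive_ratio = ai_annotation_ratio(chosen, n_cal=5, m_test=5)
```

In this toy run the naive rule selects one wrong label (index 3) out of three, which illustrates why a fixed threshold offers no control over the error rate inside the selected set.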
Conformal Labeling successfully controls FDR at or below target levels across all annotation tasks and model architectures:
Performance on ImageNet:
- ResNet-34: FDR=9.97%, Power=80.01%, AI Annotation Ratio=58.67%
- In contrast, naive full AI annotation methods exceed 25% error rate
Performance on MMLU:
- Qwen3-32B: FDR=10.00%, Power=82.96%, AI Annotation Ratio=65.22%
Tightness of FDR Control: Most experiments achieve an FDR close to the 10% target (for example, 9.97% and 10.00% in the results above), and the largest reported deviation is an FDR of 9.56%, demonstrating tight FDR control.
Impact of Model Accuracy: Higher prediction accuracy (achieved through stronger models or simpler datasets) improves power and AI annotation ratio.
Impact of Calibration Set Size:
- FDR remains controlled with low standard deviation even at 5% calibration ratio
- Increasing calibration ratio reduces variance in FDR and power
- Improvements from 10% to 20% are negligible
Comparison of Selection Procedures: Conformal Labeling's selection procedure provides the tightest FDR control, consistently achieving FDR closest to the desired level.
- Choice of Uncertainty Score is Critical: Both MSP and DOCTOR-α scores effectively distinguish correct from incorrect predictions, while energy scores perform poorly.
- Method is Robust to Calibration Set Size: While larger calibration sets reduce variance, even smaller calibration sets achieve effective control.
- Relationship with Model Performance: Although the method guarantees FDR control independent of model performance, better models do achieve higher power.
- Heuristic approaches: Collaborative annotation frameworks, domain-specific methods
- PAC labeling: Controls overall annotation error but AI subset error rates can be high
- Selective prediction: Models can abstain when uncertain
- Conformal novelty detection: Identifies out-of-distribution instances
- Conformal selection: Selects data points meeting specific quality criteria
- Extensions to regression, multivariate data selection, online data selection
Theorem 3.1: Under the assumption that calibration and test samples are independently and identically distributed, let $\alpha\in(0,1)$ be the target FDR level and $p=\mathbb{E}[H_j^0]$ be the probability that a test sample is mispredicted. Then the FDR of the selection set $R$ satisfies:
$$\mathrm{FDR}\leq\left[1-(1-p)^{n+1}\right]\alpha\leq\alpha$$
This theorem ensures that Conformal Labeling strictly controls FDR below the desired level.
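A quick numerical check shows how the factor $1-(1-p)^{n+1}$ behaves. The helper `fdr_upper_bound` and the chosen values of $p$ and $n$ are illustrative assumptions, not figures from the paper.

```python
def fdr_upper_bound(alpha, p, n):
    """Evaluate [1 - (1 - p)^(n + 1)] * alpha from Theorem 3.1."""
    return (1.0 - (1.0 - p) ** (n + 1)) * alpha

# With a 20% misprediction rate, a tiny calibration set gives a bound below alpha...
bound_small = fdr_upper_bound(alpha=0.1, p=0.2, n=9)     # (1 - 0.8**10) * 0.1
# ...while a calibration set of 999 instances makes the bound essentially alpha.
bound_large = fdr_upper_bound(alpha=0.1, p=0.2, n=999)
```

Because $(1-p)^{n+1}$ vanishes quickly as $n$ grows, the bound is nearly equal to $\alpha$ for any realistic calibration size, so the guarantee gives up very little of the error budget.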
- Conformal Labeling successfully addresses the lack of quality guarantees for AI-assigned labels in existing selective labeling methods
- Provides rigorous theoretical guarantees through FDR control, ensuring the expected error proportion of AI-assigned labels remains below user-specified levels
- Achieves tight FDR control and high statistical power across diverse tasks
- Calibration Data Requirements: Requires a small amount of annotated calibration data, which is practically feasible but still incurs a cost
- Uncertainty Score Dependency: Method's power heavily depends on the quality of uncertainty scores
- IID Assumption: Requires calibration and test data to come from the same distribution
- Sensitivity in Regression: Highly sensitive to the choice of tolerance parameter ε in regression settings
- Explore better uncertainty score functions to improve statistical power
- Investigate methods for relaxing the IID assumption
- Develop adaptive methods for selecting tolerance parameters
- Extend to more complex annotation scenarios
- Theoretical Innovation: First to provide rigorous quality guarantees for AI-assigned labels in selective labeling, filling an important theoretical gap
- Method Generality: Applicable to both classification and regression tasks, validated across image, text, and LLM question-answering domains
- Comprehensive Experiments: Large-scale experimental validation including multiple datasets, models, and detailed ablation studies
- Practical Value: Simple and easy to implement, robust to calibration set size
- Limited Novelty: Primarily applies existing conformal inference and multiple hypothesis testing techniques to new scenarios
- Assumption Limitations: IID assumption may not hold in practical applications
- Insufficient Power Analysis: While providing theoretical guarantees for FDR control, theoretical analysis of statistical power is limited
- Computational Complexity: Computational efficiency on large-scale datasets not discussed
- Academic Value: Provides important theoretical foundation for selective labeling research, likely to inspire follow-up studies
- Practical Significance: Provides reliable quality control methods in the context of increasingly important AI-assisted annotation
- Reproducibility: Detailed algorithm descriptions and implementation details facilitate reproduction
- Large-Scale Data Annotation: Scenarios requiring balance between cost and quality
- High-Quality Requirements: Applications with strict label quality requirements and need for theoretical guarantees
- AI-Assisted Annotation: Scenarios aiming to maximize AI annotation ratio while controlling error rates
- Multi-Domain Applications: Image classification, text analysis, question-answering systems, and other domains
This paper cites extensive related work, primarily including:
- Conformal inference foundational theory (Vovk et al., 1999, 2005)
- Multiple hypothesis testing methods (Benjamini & Hochberg, 1995)
- Selective labeling related work (Candès et al., 2025)
- Uncertainty quantification methods (Hendrycks & Gimpel, 2016)
Overall Assessment: This is an important theoretical contribution to the selective labeling field. While technical innovation is relatively limited, it successfully applies mature statistical methods to practical problems with rigorous theoretical guarantees. Experimental validation is comprehensive with high practical value, providing a reliable quality control framework for AI-assisted annotation.