Selective Labeling with False Discovery Rate Control

Huang, Liao, Xi et al.
Obtaining high-quality labels for large datasets is expensive, requiring massive annotations from human experts. While AI models offer a cost-effective alternative by predicting labels, their label quality is compromised by the unavoidable labeling errors. Existing methods mitigate this issue through selective labeling, where AI labels a subset and humans label the remainder. However, these methods lack theoretical guarantees on the quality of AI-assigned labels, often resulting in unacceptably high labeling error within the AI-labeled subset. To address this, we introduce Conformal Labeling, a novel method to identify instances where AI predictions can be provably trusted. This is achieved by controlling the false discovery rate (FDR), the proportion of incorrect labels within the selected subset. In particular, we construct a conformal $p$-value for each test instance by comparing AI models' predicted confidence to those of calibration instances mislabeled by AI models. Then, we select test instances whose $p$-values are below a data-dependent threshold, certifying AI models' predictions as trustworthy. We provide theoretical guarantees that Conformal Labeling controls the FDR below the nominal level, ensuring that a predefined fraction of AI-assigned labels is correct on average. Extensive experiments demonstrate that our method achieves tight FDR control with high power across various tasks, including image and text labeling, and LLM QA.

Basic Information

  • Paper ID: 2510.14581
  • Title: Selective Labeling with False Discovery Rate Control
  • Authors: Huipeng Huang, Wenbo Liao, Huajun Xi, Hao Zeng, Mengchen Zhao, Hongxin Wei
  • Classification: cs.LG cs.AI
  • Publication Date: October 16, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.14581v1

Abstract

Obtaining high-quality labels for large-scale datasets is expensive, requiring substantial expert annotation effort. While AI models provide a cost-effective alternative through predicted labels, their label quality is compromised by inevitable annotation errors. Existing methods address this through selective labeling, where AI annotates part of the data and experts annotate the remainder. However, these methods lack theoretical guarantees on the quality of AI-assigned labels, often resulting in unacceptably high error rates in AI-annotated subsets. To address this issue, this paper introduces Conformal Labeling, a novel method for identifying instances with provably trustworthy AI predictions. This is achieved by controlling the False Discovery Rate (FDR)—the proportion of mislabeled instances in the selected subset. Specifically, a conformal p-value is constructed for each test instance by comparing the AI model's prediction confidence with the confidence of calibration instances misclassified by the AI model. Test instances with p-values below a data-dependent threshold are then selected, certifying that the AI model's predictions are trustworthy. The paper provides theoretical guarantees proving that Conformal Labeling controls FDR below the nominal level, ensuring that on average a predefined proportion of AI-assigned labels are correct.

Research Background and Motivation

  1. Core Problem: The cost of obtaining high-quality annotations for large-scale datasets. As modern datasets grow in scale, expert annotation becomes prohibitively expensive, while AI models, though offering cost-effective alternatives, suffer from inevitable annotation errors.
  2. Problem Significance:
    • High-quality annotated data is critical to machine learning pipelines
    • Even state-of-the-art LLMs exhibit high error rates on text annotation tasks
    • Inherent annotation errors in AI models severely impact label quality, hindering deployment of AI annotation in production environments
  3. Limitations of Existing Methods:
    • Heuristic methods lack theoretical guarantees, relying on AI models to annotate high-confidence instances
    • PAC labeling, while providing theoretical guarantees, only controls overall annotation error; error rates in AI-annotated subsets can reach 100%
    • Existing selective labeling methods cannot guarantee the quality of AI-assigned labels
  4. Research Motivation: There is a need for a method that can rigorously guarantee the quality of AI-assigned labels, not merely control overall annotation errors.

Core Contributions

  1. Proposes Conformal Labeling Method: A novel approach for identifying instances with provably trustworthy AI predictions, providing strict quality guarantees for AI-assigned labels through rigorous FDR control, independent of AI model performance.
  2. Theoretical Guarantees: Theoretically proves that Conformal Labeling provides strict quality guarantees for AI-assigned labels, achieving effective FDR control and ensuring the expected proportion of mislabeled instances remains below user-specified levels.
  3. Comprehensive Experimental Validation: Extensive experiments on image annotation, text annotation, and LLM question-answering tasks demonstrate that Conformal Labeling significantly reduces annotation costs while maintaining strict FDR control.

Methodology Details

Task Definition

Consider a multi-class classification task with feature space $X$ and label space $Y = \{1, \ldots, K\}$. The test dataset $D_{test} = \{X_j\}_{j=1}^m$ contains $m$ instances sampled independently and identically from the data distribution $P_X$. A pre-trained AI model $f: X \rightarrow \mathbb{R}^{|Y|}$ generates labels, with predicted label $\hat{Y} = \arg\max_{y \in Y} f_y(X)$.

The objective is to select the largest subset $R \subseteq \{1, \ldots, m\}$ of test instances while keeping the false discovery rate at or below a target level $\alpha$: $\mathrm{FDR} = E\left[\frac{|R \cap H_0|}{\max(|R|, 1)}\right]$

where $H_0 = \{j \in \{1, \ldots, m\}: Y_j \neq \hat{Y}_j\}$ is the set of indices of incorrect predictions.
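
As a small worked illustration of the quantity inside the expectation (the numbers are invented for this example and do not come from the paper): suppose $m = 10$, the selected set is $R = \{1, 2, 3, 4, 5\}$, and the AI model is wrong on $H_0 = \{5, 9\}$. Only index 5 is both selected and wrong, so the realized false discovery proportion is:

```latex
\[
\frac{|R \cap H_0|}{\max(|R|, 1)}
  = \frac{|\{5\}|}{\max(5, 1)}
  = \frac{1}{5}
  = 0.2
\]
```

The FDR is the average of this proportion over repeated draws of the data. The $\max(|R|, 1)$ in the denominator makes the proportion 0 when nothing is selected.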

Model Architecture

Conformal Labeling comprises three main steps:

1. Uncertainty Quantification

Define an uncertainty score $S: X \rightarrow \mathbb{R}$, where higher values indicate greater model uncertainty: $S(X) = 1 - \max_{y \in Y} f_y(X)$
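
The score can be sketched in a few lines of Python. This is only an illustration: it assumes the model outputs $f_y(X)$ are softmax probabilities, and the helper names `softmax`, `uncertainty_score`, and `predicted_label` are hypothetical, not taken from the paper.

```python
import math

def softmax(logits):
    """Convert raw model outputs into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def uncertainty_score(probs):
    """S(X) = 1 - max_y f_y(X): higher means the model is less confident."""
    return 1.0 - max(probs)

def predicted_label(probs):
    """Y_hat = argmax_y f_y(X), returned as a label in {1, ..., K}."""
    return max(range(len(probs)), key=lambda k: probs[k]) + 1

# Example with made-up logits for a 3-class problem.
probs = softmax([2.0, 0.5, 0.1])
print(predicted_label(probs), uncertainty_score(probs))
```

A prediction made with probability 0.7 gets a score of about 0.3, while a near-certain prediction gets a score close to 0.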

2. Constructing Conformal p-values

Reformulate the problem as multiple hypothesis testing: $H_j^0: Y_{n+j} \neq \hat{Y}_{n+j}$ vs. $H_j^1: Y_{n+j} = \hat{Y}_{n+j}$

Let $D_{cal}^0 = \{(X_i, Y_i)\}_{i=1}^{n_0}$ be the subset of calibration instances misclassified by the AI model. The conformal p-value for test instance $X_{n+j}$ is computed as:

$$\hat{p}_j = \frac{\sum_{i=1}^{n_0} \mathbf{1}\{S_i < S_{n+j}\} + \left(1 + \sum_{i=1}^{n_0} \mathbf{1}\{S_i = S_{n+j}\}\right) \cdot U_j}{n_0 + 1}$$

where $S_i = S(X_i)$ is the uncertainty score of instance $i$ and $U_j \sim \text{Uniform}[0,1]$ handles ties.
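
A minimal Python sketch of this p-value, assuming the uncertainty scores of the misclassified calibration instances have already been collected into a list. The function name `conformal_p_value` and its arguments are hypothetical.

```python
import random

def conformal_p_value(test_score, null_cal_scores, rng=random):
    """Randomized conformal p-value for one test instance.

    test_score:      uncertainty score S_{n+j} of the test instance.
    null_cal_scores: scores S_i of the n_0 calibration instances that the
                     AI model misclassified.
    rng:             any object with a .random() method (assumption: the
                     standard `random` module or a random.Random instance).
    """
    n0 = len(null_cal_scores)
    below = sum(1 for s in null_cal_scores if s < test_score)
    ties = sum(1 for s in null_cal_scores if s == test_score)
    u = rng.random()  # U_j drawn uniformly from [0, 1) to break ties
    return (below + (1 + ties) * u) / (n0 + 1)

# Made-up scores: the model was wrong on four calibration instances.
null_scores = [0.4, 0.5, 0.6, 0.7]
confident = conformal_p_value(0.05, null_scores)  # more confident than all errors
uncertain = conformal_p_value(0.90, null_scores)  # less confident than all errors
print(confident, uncertain)
```

A test instance that is more confident than every known error receives a small p-value (here at most 1/5), which is evidence against the null hypothesis that its prediction is wrong.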

3. Threshold Setting

Adopt a threshold rule inspired by the Benjamini-Hochberg (BH) procedure: $j^* = \max\left\{j: \hat{p}_{(j)} \leq \frac{\alpha j(n+1)}{m(n_0+1)}\right\}$, where $\hat{p}_{(j)}$ is the $j$-th smallest p-value and $n$ and $n_0$ are the sizes of the calibration set and of its misclassified subset.

The selection set is $R = \{j: \hat{p}_j \leq \hat{p}_{(j^*)}\}$.
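
The thresholding step can be sketched as follows. This is an illustrative reading of the rule above, not the authors' code: it assumes $n$ is the calibration set size and $n_0$ the number of misclassified calibration instances, it returns an empty selection when no rank passes, and the name `conformal_labeling_select` is hypothetical.

```python
def conformal_labeling_select(p_values, alpha, n, n0):
    """Return 0-based indices of test instances whose AI labels are kept.

    p_values: conformal p-values of the m test instances.
    alpha:    target FDR level.
    n, n0:    calibration set size and number of misclassified
              calibration instances (assumed meaning of n and n_0).
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda j: p_values[j])
    j_star = 0
    for rank, j in enumerate(order, start=1):
        threshold = alpha * rank * (n + 1) / (m * (n0 + 1))
        if p_values[j] <= threshold:
            j_star = rank  # remember the largest rank that passes
    if j_star == 0:
        return []  # assumption: select nothing when no rank passes
    cutoff = p_values[order[j_star - 1]]
    return [j for j in range(m) if p_values[j] <= cutoff]

# Made-up example: 4 test instances, 99 calibration instances, 24 of them wrong.
print(conformal_labeling_select([0.01, 0.5, 0.02, 0.9], alpha=0.1, n=99, n0=24))
```

In this example the per-rank thresholds are 0.1, 0.2, 0.3, and 0.4, so the two smallest p-values pass and the instances at indices 0 and 2 are selected.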

Technical Innovations

  1. Multiple Hypothesis Testing Framework: Reformulates selective labeling as a multiple hypothesis testing problem, enabling provision of rigorous statistical guarantees.
  2. Conformal p-value Construction: Constructs p-values through rank-based comparison with uncertainty scores of known misclassified instances, ensuring p-values of mislabeled instances stochastically dominate the uniform distribution.
  3. Data-Dependent Threshold: Carefully sets thresholds using calibration data to control label quality at the desired FDR level.

Experimental Setup

Datasets

Image Classification:

  • ImageNet (Deng et al., 2009)
  • ImageNet-V2 (Recht et al., 2019)

Text Annotation:

  • Stance on Global Warming (Luo et al., 2021): Determines whether titles acknowledge global warming as a serious problem
  • Misinformation (Gabriel et al., 2022): Binary annotation identifying whether text contains misinformation

LLM Question-Answering:

  • MedMCQA (Pal et al., 2022)
  • MMLU (Hendrycks et al., 2021)
  • MMLU-Pro (Wang et al., 2024)

Evaluation Metrics

  1. FDR: Expected proportion of mislabeled instances in the selected set
  2. Power: Proportion of correctly predicted test instances that are selected
  3. AI Annotation Ratio: Number of instances annotated by AI divided by total size of calibration and test datasets
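
A sketch of how these three metrics could be computed for a single run, assuming ground-truth test labels are available for evaluation. The function name `selection_metrics` is hypothetical, and power is interpreted here as the share of correctly predicted test instances that end up selected. The FDR itself is the average of the per-run proportion over repeated runs.

```python
def selection_metrics(selected, y_true, y_pred, n_cal):
    """Return (false discovery proportion, power, AI annotation ratio).

    selected: 0-based indices of test instances labeled by the AI model.
    y_true:   ground-truth test labels (used for evaluation only).
    y_pred:   AI-predicted test labels.
    n_cal:    number of calibration instances.
    """
    selected = set(selected)
    m = len(y_true)
    wrong = {j for j in range(m) if y_true[j] != y_pred[j]}
    num_correct = m - len(wrong)
    fdp = len(selected & wrong) / max(len(selected), 1)
    power = len(selected - wrong) / max(num_correct, 1)
    ai_ratio = len(selected) / (n_cal + m)
    return fdp, power, ai_ratio

# Made-up example: 5 test instances, the model is wrong on indices 2 and 4.
print(selection_metrics([0, 1, 2], [1, 2, 3, 1, 2], [1, 2, 1, 1, 3], n_cal=5))
```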

Baseline Methods

  1. Naive Method: Uses AI model to annotate test instances with uncertainty score $S_{n+j} \leq 0.1$
  2. Full AI Annotation: Applies AI predictions to entire test dataset
  3. BH Variants: BH, Storey-BH, Quantile-BH procedures
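
For orientation, the naive baseline and the plain BH procedure can be sketched as below. The function names are hypothetical, and the Storey-BH and Quantile-BH variants are not shown because their exact form is not given here.

```python
def naive_select(test_scores, cutoff=0.1):
    """Keep AI labels whose uncertainty score S_{n+j} is at most `cutoff`."""
    return [j for j, s in enumerate(test_scores) if s <= cutoff]

def bh_select(p_values, alpha):
    """Plain Benjamini-Hochberg: largest k with p_(k) <= alpha * k / m."""
    m = len(p_values)
    ranked = sorted(p_values)
    k_star = max(
        (k for k in range(1, m + 1) if ranked[k - 1] <= alpha * k / m),
        default=0,
    )
    if k_star == 0:
        return []
    cutoff = ranked[k_star - 1]
    return [j for j, p in enumerate(p_values) if p <= cutoff]

# Made-up scores and p-values for four test instances.
print(naive_select([0.05, 0.3, 0.1, 0.11]))
print(bh_select([0.01, 0.5, 0.02, 0.9], alpha=0.1))
```

The naive rule uses a fixed score cutoff regardless of how often the model is wrong, which is why it carries no FDR guarantee.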

Implementation Details

  • Each experiment repeated 1000 times with averaged results reported
  • 10% of data randomly selected as calibration dataset
  • Uncertainty score based on the maximum softmax probability (MSP)
  • Target FDR level set to α = 0.1
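
The protocol above (random calibration split, repeated runs, averaged metrics) might be organized as in this sketch. The helper names and the `run_once` callback are hypothetical placeholders for whatever pipeline computes a metric on one split.

```python
import random

def split_calibration(num_instances, cal_fraction=0.1, seed=None):
    """Randomly split instance indices into calibration and test parts."""
    rng = random.Random(seed)
    indices = list(range(num_instances))
    rng.shuffle(indices)
    n_cal = int(round(cal_fraction * num_instances))
    return indices[:n_cal], indices[n_cal:]

def average_over_repeats(run_once, num_repeats=1000):
    """Average the metric returned by run_once(seed) over repeated splits."""
    values = [run_once(seed) for seed in range(num_repeats)]
    return sum(values) / len(values)

# Example: 10% of 100 instances go to calibration, the rest to test.
cal_idx, test_idx = split_calibration(100, cal_fraction=0.1, seed=0)
print(len(cal_idx), len(test_idx))
```

Averaging the per-run false discovery proportion over many random splits is what turns it into an estimate of the FDR.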

Experimental Results

Main Results

Conformal Labeling successfully controls FDR at or below target levels across all annotation tasks and model architectures:

Performance on ImageNet:

  • ResNet-34: FDR=9.97%, Power=80.01%, AI Annotation Ratio=58.67%
  • In contrast, applying AI predictions to the entire test set (full AI annotation) yields an error rate above 25%

Performance on MMLU:

  • Qwen3-32B: FDR=10.00%, Power=82.96%, AI Annotation Ratio=65.22%

Tightness of FDR Control: Realized FDR stays close to the 10% target, at roughly 9.9% in most experiments; the value furthest from the target is 9.56%, demonstrating tight FDR control.

Ablation Studies

Impact of Model Accuracy: Higher prediction accuracy (achieved through stronger models or simpler datasets) improves power and AI annotation ratio.

Impact of Calibration Set Size:

  • FDR remains controlled with low standard deviation even at 5% calibration ratio
  • Increasing calibration ratio reduces variance in FDR and power
  • Increasing the calibration ratio from 10% to 20% yields negligible improvement

Comparison of Selection Procedures: Conformal Labeling's selection procedure provides the tightest FDR control, consistently achieving FDR closest to the desired level.

Experimental Findings

  1. Choice of Uncertainty Score is Critical: Both MSP and DOCTOR-α scores effectively distinguish correct from incorrect predictions, while energy scores perform poorly.
  2. Method is Robust to Calibration Set Size: While larger calibration sets reduce variance, even smaller calibration sets achieve effective control.
  3. Relationship with Model Performance: Although the method guarantees FDR control independent of model performance, better models do achieve higher power.

Related Work

Selective Labeling Methods

  • Heuristic approaches: Collaborative annotation frameworks, domain-specific methods
  • PAC labeling: Controls overall annotation error but AI subset error rates can be high
  • Selective prediction: Models can abstain when uncertain

Conformal p-value Selection

  • Conformal novelty detection: Identifies out-of-distribution instances
  • Conformal selection: Selects data points meeting specific quality criteria
  • Extensions to regression, multivariate data selection, online data selection

Theoretical Analysis

Theorem 3.1: Assume the calibration and test samples are independent and identically distributed. Let $\alpha \in (0,1)$ be the target FDR level and let $p = E[H_j^0]$ be the probability that a test sample is mispredicted. Then the FDR of the selection set $R$ satisfies:

$$\mathrm{FDR} \leq [1-(1-p)^{n+1}]\,\alpha \leq \alpha$$

This theorem ensures that Conformal Labeling keeps the FDR at or below the target level $\alpha$.
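
To get a feel for how close the bound is to $\alpha$, it can be evaluated numerically. The sketch below simply plugs values into the expression from Theorem 3.1; the function name `fdr_upper_bound` and the example values are illustrative assumptions.

```python
def fdr_upper_bound(p, n, alpha):
    """Evaluate [1 - (1 - p)^(n + 1)] * alpha from Theorem 3.1.

    p:     probability that a test sample is mispredicted.
    n:     the quantity n appearing in the exponent of the bound.
    alpha: target FDR level.
    """
    if not (0.0 <= p <= 1.0 and 0.0 < alpha < 1.0 and n >= 0):
        raise ValueError("expected 0 <= p <= 1, 0 < alpha < 1, n >= 0")
    return (1.0 - (1.0 - p) ** (n + 1)) * alpha

# Made-up values: 25% error rate, alpha = 0.1, growing n.
for n in (10, 100, 1000):
    print(n, fdr_upper_bound(0.25, n, 0.1))
```

Because $(1-p)^{n+1}$ shrinks geometrically in $n$, the first bound approaches $\alpha$ quickly whenever the model makes a non-negligible share of errors.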

Conclusions and Discussion

Main Conclusions

  1. Conformal Labeling successfully addresses the lack of quality guarantees for AI-assigned labels in existing selective labeling methods
  2. Provides rigorous theoretical guarantees through FDR control, ensuring the expected error proportion of AI-assigned labels remains below user-specified levels
  3. Achieves tight FDR control and high statistical power across diverse tasks

Limitations

  1. Calibration Data Requirements: Requires a small amount of annotated calibration data, which is practically feasible but still incurs cost
  2. Uncertainty Score Dependency: Method's power heavily depends on the quality of uncertainty scores
  3. IID Assumption: Requires calibration and test data to come from the same distribution
  4. Sensitivity in Regression: Highly sensitive to the choice of tolerance parameter ε in regression settings

Future Directions

  1. Explore better uncertainty score functions to improve statistical power
  2. Investigate methods for relaxing the IID assumption
  3. Develop adaptive methods for selecting tolerance parameters
  4. Extend to more complex annotation scenarios

In-Depth Evaluation

Strengths

  1. Theoretical Innovation: First to provide rigorous quality guarantees for AI-assigned labels in selective labeling, filling an important theoretical gap
  2. Method Generality: Applicable to both classification and regression tasks, validated across image, text, and LLM question-answering domains
  3. Comprehensive Experiments: Large-scale experimental validation including multiple datasets, models, and detailed ablation studies
  4. Practical Value: Simple and easy to implement, robust to calibration set size

Weaknesses

  1. Limited Novelty: Primarily applies existing conformal inference and multiple hypothesis testing techniques to new scenarios
  2. Assumption Limitations: IID assumption may not hold in practical applications
  3. Insufficient Power Analysis: While providing theoretical guarantees for FDR control, theoretical analysis of statistical power is limited
  4. Computational Complexity: Computational efficiency on large-scale datasets not discussed

Impact

  1. Academic Value: Provides important theoretical foundation for selective labeling research, likely to inspire follow-up studies
  2. Practical Significance: Provides reliable quality control methods in the context of increasingly important AI-assisted annotation
  3. Reproducibility: Detailed algorithm descriptions and implementation details facilitate reproduction

Applicable Scenarios

  1. Large-Scale Data Annotation: Scenarios requiring balance between cost and quality
  2. High-Quality Requirements: Applications with strict label quality requirements and need for theoretical guarantees
  3. AI-Assisted Annotation: Scenarios aiming to maximize AI annotation ratio while controlling error rates
  4. Multi-Domain Applications: Image classification, text analysis, question-answering systems, and other domains

References

This paper cites extensive related work, primarily including:

  • Conformal inference foundational theory (Vovk et al., 1999, 2005)
  • Multiple hypothesis testing methods (Benjamini & Hochberg, 1995)
  • Selective labeling related work (Candès et al., 2025)
  • Uncertainty quantification methods (Hendrycks & Gimpel, 2016)

Overall Assessment: This is an important theoretical contribution to the selective labeling field. While technical innovation is relatively limited, it successfully applies mature statistical methods to practical problems with rigorous theoretical guarantees. Experimental validation is comprehensive with high practical value, providing a reliable quality control framework for AI-assisted annotation.