Budget-constrained Active Learning to Effectively De-censor Survival Data

Parsaee, Jiang, Friggstad et al.

Standard supervised learners attempt to learn a model from a labeled dataset. Given a small set of labeled instances, and a pool of unlabeled instances, a budgeted learner can use its given budget to pay to acquire the labels of some unlabeled instances, which it can then use to produce a model. Here, we explore budgeted learning in the context of survival datasets, which include (right) censored instances, where we know only a lower bound on an instance's time-to-event. Here, that learner can pay to (partially) label a censored instance -- e.g., to acquire the actual time for an instance [perhaps go from (3 yr, censored) to (7.2 yr, uncensored)], or other variants [e.g., learn about one more year, so go from (3 yr, censored) to either (4 yr, censored) or perhaps (3.2 yr, uncensored)]. This serves as a model of real-world data collection, where follow-up with censored patients does not always lead to uncensoring, and how much information is given to the learner model during data collection is a function of the budget and the nature of the data itself. We provide both experimental and theoretical results for how to apply state-of-the-art budgeted learning algorithms to survival data and the respective limitations that exist in doing so. Our approach provides bounds and time complexity asymptotically equivalent to the standard active learning method BatchBALD. Moreover, empirical analysis on several survival tasks shows that our model performs better than other potential approaches on several benchmarks.

Basic Information

Paper ID: 2510.12144
Title: Budget-constrained Active Learning to Effectively De-censor Survival Data
Authors: Ali Parsaee, Bei Jiang, Zachary Friggstad, Russell Greiner (University of Alberta)
Classification: cs.LG cs.AI
Publication Date: October 15, 2025
Paper Link: https://arxiv.org/abs/2510.12144

Abstract

This paper studies budget-constrained active learning on survival datasets. Survival data contain right-censored instances, for which only a lower bound on the event time is known. A learner can spend part of its budget to (partially) de-censor an instance, for example moving from "(3 years, censored)" to the actual time "(7.2 years, uncensored)", or, when only one more year is explored, to either "(4 years, censored)" or "(3.2 years, uncensored)". This models real-world data collection, where follow-up on censored patients does not always result in de-censoring. The information the learner gains during data collection therefore depends on both the budget and the characteristics of the data.

Research Background and Motivation

Problem Definition

Core Problem: How to effectively select censored instances for de-censoring under budget constraints to maximize survival prediction model performance
Practical Significance:
- High costs of patient follow-up in medical research
- Additional testing costs in industrial reliability testing
- Computational costs in algorithm runtime prediction

Limitations of Existing Methods

Traditional Active Learning: Primarily targets classification and regression tasks, neglecting the special characteristics of censored data
Active Learning in Survival Analysis: Limited research with insufficient budget constraint considerations
BatchBALD Limitations:
- Assumes oracle provides complete label information
- Does not account for different costs of individual instances
- Inapplicable to partial de-censoring scenarios

Research Motivation

Real-world data collection is costly, particularly in medical research, industrial testing, and similar domains. Traditional methods overlook budget constraints and the special nature of censored data, necessitating specialized approaches to handle such complex scenarios.

Core Contributions

Formal Definition: First formal definition of the learning problem for de-censoring instances under budget constraints
Algorithm Innovation: Proposes BBsurv algorithm, adapting BatchBALD to handle survival data and varying instance costs
Theoretical Guarantees: Proves that the algorithm attains the (1-1/e) approximation bound in polynomial time, asymptotically matching BatchBALD in both bound and time complexity
Comprehensive Evaluation: Conducts extensive experiments on three real survival datasets, demonstrating method robustness
Benchmark Establishment: Provides eight comparison algorithms, establishing evaluation benchmarks for this task

Methodology Details

Task Definition

Input:

Probe depth $k \in \mathbb{R}^{+}$ (years explored per probe)
Budget $B \in \mathbb{R}^{+}$
Training dataset $D = \{(x_i, t_i, \delta_i, c_i)\}_{i=1}^{L}$, where:
- $x_i$: covariates
- $t_i$: time
- $\delta_i$: censoring indicator (1 for uncensored, 0 for censored)
- $c_i$: probe cost

Output: Select instance set $F$ such that $\sum_{j \in F} c_j \le B$, maximizing model performance
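The probing step described in the abstract can be illustrated with a minimal Python sketch. The `Instance` class, the `probe` function, and the `true_event_time` argument are hypothetical names introduced here for illustration; in practice the true event time is known only to the oracle (for example, the follow-up process), not to the learner.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    time: float   # t_i: event time if uncensored, otherwise the censoring time
    event: bool   # delta_i: True if uncensored
    cost: float   # c_i: cost of one probe


def probe(inst: Instance, true_event_time: float, k: float) -> Instance:
    """Follow a censored instance for up to k more time units (assumed oracle)."""
    if inst.event:
        return inst  # already uncensored, nothing more to learn
    horizon = inst.time + k
    if true_event_time <= horizon:
        # The event falls inside the probed window: the instance is de-censored.
        return Instance(true_event_time, True, inst.cost)
    # Otherwise the instance stays censored, but at a later time.
    return Instance(horizon, False, inst.cost)


def within_budget(selected, budget: float) -> bool:
    """Check the constraint: the summed probe costs must not exceed the budget."""
    return sum(inst.cost for inst in selected) <= budget


patient = Instance(time=3.0, event=False, cost=1.0)
a = probe(patient, true_event_time=3.2, k=1.0)  # (3.2, uncensored)
b = probe(patient, true_event_time=7.2, k=1.0)  # (4.0, censored)
```

The two calls reproduce the example from the abstract: one more year of follow-up turns (3 yr, censored) into either (3.2 yr, uncensored) or (4 yr, censored).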

Model Architecture

1. Bayesian Survival Model

Uses Bayesian Multi-Task Logistic Regression (MTLR) model:

Discretizes continuous time into $n$ time intervals $\{b_i\}_{i=1}^{n}$
Outputs multinomial distribution $\{p(y = b_i \mid x, \omega, D)\}_{i=1}^{n}$
Generates individual survival distribution (ISD)
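As a rough illustration of how a multinomial over time intervals yields a survival curve, the sketch below converts per-interval event probabilities into survival probabilities at the end of each interval. The function name `survival_curve` and the toy probabilities are assumptions for illustration, not the paper's code.

```python
import numpy as np


def survival_curve(pmf):
    """Survival probability at the end of each interval, from per-interval event mass."""
    pmf = np.asarray(pmf, dtype=float)
    # S(end of interval i) = 1 - P(event occurs in intervals 1..i)
    return 1.0 - np.cumsum(pmf)


pmf = np.array([0.1, 0.2, 0.3, 0.4])  # toy multinomial over 4 intervals
surv = survival_curve(pmf)            # approximately [0.9, 0.7, 0.4, 0.0]
```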

2. BBsurv Algorithm Core

Probability Adjustment Mechanism:

$$p_{\text{cens}}(y = b_i \mid \omega) = \frac{p(y = b_i \mid \omega)}{\sum_{r=i}^{n} p(y = b_r \mid \omega)}$$
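One way to read this adjustment is as a renormalization: an instance censored in a given interval cannot have its event in any earlier interval, so that mass is set to zero and the remaining mass is rescaled to sum to one. The sketch below follows that reading; the function name `censor_adjust`, the 0-indexed `censor_bin` argument, and the assumption that the denominator runs from the censoring interval onward are all illustrative choices, not confirmed details of the paper.

```python
import numpy as np


def censor_adjust(pmf, censor_bin):
    """Zero out intervals before censor_bin (0-indexed) and renormalize the rest."""
    pmf = np.asarray(pmf, dtype=float)
    tail = pmf[censor_bin:]
    total = tail.sum()
    if total <= 0:
        raise ValueError("no probability mass at or after the censoring interval")
    adjusted = np.zeros_like(pmf)
    adjusted[censor_bin:] = tail / total
    return adjusted


adjusted = censor_adjust([0.1, 0.2, 0.3, 0.4], censor_bin=2)
```

Here the mass 0.3 and 0.4 in the last two intervals is rescaled by their sum 0.7, and the first two intervals receive zero probability.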

Knowable Interval Processing:

Identifies "knowable" intervals within probe depth k
Merges intervals beyond probe range into single "unknowable" class $b_{uk}$
Generates final probability distribution $p_{\text{final}}$
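The merging step can be sketched as follows, under the assumption that the probe depth has already been translated into a number of intervals the probe can resolve. The names `merge_unknowable` and `n_knowable` are hypothetical, and how the paper maps the probe depth k onto intervals is not specified here.

```python
import numpy as np


def merge_unknowable(adjusted_pmf, censor_bin, n_knowable):
    """Keep the knowable intervals and pool everything later into one class."""
    adjusted_pmf = np.asarray(adjusted_pmf, dtype=float)
    stop = min(censor_bin + n_knowable, len(adjusted_pmf))
    knowable = adjusted_pmf[censor_bin:stop]   # intervals a probe can resolve
    unknowable = adjusted_pmf[stop:].sum()     # mass beyond the probe range
    return np.append(knowable, unknowable)


adjusted = np.array([0.0, 0.0, 0.2, 0.3, 0.5])
p_final = merge_unknowable(adjusted, censor_bin=2, n_knowable=2)  # [0.2, 0.3, 0.5]
p_short = merge_unknowable(adjusted, censor_bin=2, n_knowable=1)  # [0.2, 0.8]
```

A shallower probe leaves more mass in the pooled class, which is one way to see why a small probe depth changes which instances are informative to probe.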

3. Acquisition Function

Based on BatchBALD's mutual information computation:

$$I(y_{1:b}; \omega \mid x_{1:b}, D) = H(y_{1:b} \mid x_{1:b}, D) - \mathbb{E}_{p(\omega \mid D, x_{1:b})}\left[H(y_{1:b} \mid x_{1:b}, \omega, D)\right]$$
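A minimal sketch of how this joint mutual information can be estimated from posterior samples is given below. It assumes class probabilities are available for K sampled parameter settings and that labels are conditionally independent given the parameters; the function names and the toy inputs are illustrative. The exact enumeration of joint outcomes grows as C^b, so this form is only practical for small batches.

```python
import numpy as np


def entropy(p, axis=-1):
    """Shannon entropy in nats, treating 0 * log(0) as 0."""
    p = np.asarray(p, dtype=float)
    logp = np.log(np.where(p > 0, p, 1.0))
    return -(p * logp).sum(axis=axis)


def batch_mutual_information(probs):
    """probs has shape (K, b, C): K posterior samples, b candidates, C classes."""
    K, b, C = probs.shape
    # Joint p(y_1..y_b | omega_k) as a product over candidates, flattened to (K, C**b).
    joint = probs[:, 0, :]
    for j in range(1, b):
        joint = (joint[:, :, None] * probs[:, j, None, :]).reshape(K, -1)
    joint_entropy = entropy(joint.mean(axis=0))                 # H(y_{1:b} | x_{1:b}, D)
    cond_entropy = entropy(probs, axis=-1).sum(axis=1).mean()   # E_omega[H(y_{1:b} | ., omega)]
    return joint_entropy - cond_entropy


# Two posterior samples that confidently disagree on one candidate.
disagree = np.array([[[1.0, 0.0]], [[0.0, 1.0]]])   # shape (K=2, b=1, C=2)
mi_single = batch_mutual_information(disagree)
# Adding an identical copy of that candidate adds no new information.
pair = np.repeat(disagree, 2, axis=1)               # shape (2, 2, 2)
mi_pair = batch_mutual_information(pair)            # both equal log 2
```

The second call illustrates why the joint formulation matters: a duplicate candidate scores log 2 jointly rather than 2 log 2, so redundant picks are not rewarded.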

Technical Innovations

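Probe Depth Modeling: Models partial de-censoring through a probe depth k, the amount of additional follow-up time a single probe reveals
Probability Redistribution: Accounts for the zero-probability intervals before the censoring time by renormalizing over the remaining intervals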
Budget Optimization: Reduces problem to weighted maximum coverage, solved via greedy algorithm
Unified Framework: Simultaneously handles uniform and non-uniform cost settings
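To make the budgeted greedy idea concrete, here is a small sketch that repeatedly picks the affordable item with the best marginal gain per unit cost. The toy coverage objective stands in for the mutual-information objective, and all names (`greedy_budgeted`, `coverage_gain`, the toy sets and costs) are hypothetical. This simple ratio rule is illustrative only and is not claimed to carry the paper's approximation guarantee.

```python
def greedy_budgeted(candidates, costs, budget, gain):
    """Greedy selection by marginal gain per unit cost under a total budget.

    gain(selected, item) returns the marginal value of adding item to selected.
    """
    selected, spent = [], 0.0
    remaining = set(candidates)
    while remaining:
        # Sorting makes tie-breaking deterministic.
        affordable = [c for c in sorted(remaining) if spent + costs[c] <= budget]
        if not affordable:
            break
        best = max(affordable, key=lambda c: gain(selected, c) / costs[c])
        if gain(selected, best) <= 0:
            break  # nothing affordable adds value
        selected.append(best)
        spent += costs[best]
        remaining.remove(best)
    return selected, spent


# Toy coverage objective: each candidate covers a set of elements.
covers = {"a": {1, 2, 3}, "b": {3, 4}, "c": {5}, "d": {1, 2, 3, 4, 5}}
costs = {"a": 2.0, "b": 1.0, "c": 1.5, "d": 5.0}


def coverage_gain(selected, item):
    covered = set().union(*(covers[s] for s in selected))
    return len(covers[item] - covered)


chosen, spent = greedy_budgeted(covers, costs, budget=4.0, gain=coverage_gain)
# chosen == ["b", "a"], spent == 3.0
```

With uniform costs the ratio rule reduces to picking the largest marginal gain at each step, which is how the same loop covers both cost settings.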

Experimental Setup

Datasets

MIMIC-IV: 38,520 patients, 93 features, 67% censoring rate
NACD: 2,402 patients, 53 features, 36% censoring rate
SUPPORT: 9,105 patients, 42 features, 32% censoring rate

Evaluation Metrics

Primary Metric: MAE-PO (Mean Absolute Error with Pseudo Observations)
Auxiliary Metrics: C-index, Integrated Brier Score, MAE on uncensored data
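For readers unfamiliar with the C-index, the sketch below implements the common Harrell formulation for right-censored data. It is a generic textbook version, not necessarily the exact variant used in the paper, and it assumes each instance has a scalar risk score where a higher score means an earlier expected event (how such a score is derived from the model is left open).

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored data.

    A pair (i, j) is comparable when i has an observed event and t_i < t_j.
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored instance cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # ties in risk count as half
    return concordant / comparable if comparable else float("nan")


times = [2.0, 4.0, 6.0, 8.0]
events = [True, False, True, True]
perfect = concordance_index(times, events, [4, 3, 2, 1])   # 1.0
reversed_ = concordance_index(times, events, [1, 2, 3, 4])  # 0.0
```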

Comparison Methods

BatchBALD: Original BatchBALD algorithm
C-BALD: Censoring-aware BALD variant
IDEAL: Inverse Distance-weighted Active Learning
Entropy Sampling: Entropy-based sampling
Variance Sampling: Variance-based sampling
Closest to Half (CtH): Sampling near 0.5 probability
Mean Closest to Middle (MCtM): Mean-middle point sampling
Clusters to form Batches (CfB): Clustering-based batch formation
Random: Random sampling

Implementation Details

10 time intervals (quantile-based partitioning)
Bayesian MTLR model with Spike-and-Slab prior
5000 training iterations
Artificial censoring ensures non-informative censoring assumption
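Quantile-based partitioning can be sketched as below. Whether the paper computes quantiles over event times only or over all observed times is not stated here, so the input to `quantile_bin_edges` is an assumption, as are the function names.

```python
import numpy as np


def quantile_bin_edges(times, n_bins=10):
    """Interior cut points so that each interval holds roughly equal mass."""
    qs = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]  # drop the 0 and 1 quantiles
    return np.quantile(times, qs)


def to_bin(t, edges):
    """0-indexed interval containing time t."""
    return int(np.searchsorted(edges, t, side="right"))


times = np.arange(1, 101, dtype=float)      # toy times 1..100
edges = quantile_bin_edges(times, n_bins=10)
bins = [to_bin(t, edges) for t in times]
counts = np.bincount(bins)                  # 10 instances in each of 10 intervals
```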

Experimental Results

Main Results

Table 1 shows MAE-PO results at budget=10:

BBsurv significantly outperforms other methods in most settings
Performance converges between BBsurv and BatchBALD as probe depth increases
Most notable improvements on MIMIC dataset compared to BatchBALD

Key Findings:

Probe Depth Impact: BBsurv advantage maximized at k=5, approaches BatchBALD at k=100
Dataset Differences: Significant improvements on MIMIC and NACD, smaller differences on SUPPORT
Statistical Significance: Achieves p<0.05 significance in most cases

Budget Sensitivity Analysis

Figure 2 shows cross-budget performance:

Uniform Cost Setting: BBsurv consistently optimal across budget levels
Non-uniform Cost Setting: BBsurv advantage more pronounced, especially at high budgets
Cost Handling Advantage: Submodularity of mutual information enables BBsurv to better handle budget constraints

Ablation Studies

Probe Depth Impact:

k=5: BBsurv significantly outperforms baselines
k=10: Moderate improvements
k=100: Performance approaches BatchBALD

Cost Setting Comparison:

Uniform costs: Most methods perform similarly
Non-uniform costs: BBsurv and BatchBALD significantly outperform other methods

Experimental Findings

Diversity in Selection: PCA visualization shows BBsurv selects more diverse instances
Unexpected CfB Performance: Clustering method performs well in certain settings
Cost Sensitivity: Information-theoretic methods show greater advantage in non-uniform cost settings

Related Work

Active Learning Field

Batch Active Learning: BatchBALD as SOTA method, but neglects budget and censored data
Uncertainty Sampling: Selects instances with highest model uncertainty
Diversity Methods: Focuses on sample diversity for improved generalization

Active Learning in Survival Analysis

Vinzamuri et al.: Based on Cox proportional hazards model, but without budget constraints
Hüttel et al.: C-BALD method for censored regression
Dedja et al.: Incremental label updates with random probe depth determination

Budget Learning

Lizotte et al.: Budget learning for naive Bayes classifiers
Maximum Coverage Problem: NP-hard combinatorial optimization problem
Greedy Algorithm: Polynomial-time algorithm with (1-1/e) approximation ratio
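For reference, the classical form of this guarantee in the unit-cost (cardinality-constrained) setting can be written as follows, where $f$ is a monotone submodular set function with $f(\emptyset) = 0$, $b$ is the number of items that may be selected, and $S_{\text{greedy}}$ and $S^{*}$ are notation introduced here for the greedy and optimal selections:

```latex
f(S_{\text{greedy}}) \;\ge\; \left(1 - \frac{1}{e}\right) f(S^{*}),
\qquad S^{*} = \arg\max_{|S| \le b} f(S)
```

Since $1 - 1/e \approx 0.632$, the greedy selection is guaranteed to recover at least about 63% of the optimal objective value in this setting.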

Conclusions and Discussion

Main Conclusions

Method Effectiveness: BBsurv outperforms existing methods in most settings
Theoretical Guarantees: Time complexity asymptotically equivalent to BatchBALD, with the same (1-1/e) approximation bound
Practical Value: Applicable to medical research, industrial testing, and similar real-world scenarios
Robustness: Stable performance across different datasets, budgets, and probe depths

Limitations

Non-informative Censoring Assumption: May not hold in practical applications
Fixed Probe Depth: Does not consider dynamic probe depth adjustment
Discretization Approximation: Time discretization may lose information
Computational Complexity: Greedy algorithm may be slow on large-scale data

Future Directions

Semi-supervised Extension: Combining unlabeled data to improve performance
Informative Censoring: Relaxing non-informative censoring assumption
Dynamic Probing: Adjusting probe depth based on instance characteristics
Improved Approximation: Exploring more efficient maximum coverage approximation schemes

In-depth Evaluation

Strengths

Problem Novelty: First systematic study of de-censoring survival data under budget constraints
Method Rigor:
- Complete theoretical analysis with complexity and approximation guarantees
- Clever algorithm design effectively handling partial information acquisition
Experimental Sufficiency:
- Three real datasets with multiple evaluation metrics
- Comprehensive baseline comparisons and ablation studies
- Statistical significance verification
High Practical Value: Addresses real needs in medical, industrial, and related domains

Weaknesses

Assumption Limitations: Non-informative censoring assumption may not hold in practice
Method Constraints:
- Discretization may lose continuous time information
- Fixed probe depth lacks flexibility
Experimental Scope:
- Relatively limited dataset scale
- Lacks comparison with more recent state-of-the-art survival analysis methods
Theoretical Analysis: Missing convergence and generalization error analysis

Impact

Academic Contribution:
- Opens new research direction, expected to inspire follow-up work
- Theoretical framework extensible to other incomplete information learning problems
Practical Value:
- Direct application to clinical trial design
- Applicable to industrial quality control and reliability testing
Method Generality: Framework adaptable to other active learning algorithms

Applicable Scenarios

Medical Research: Patient follow-up, clinical trial design
Industrial Applications: Product lifetime testing, failure prediction
Algorithm Analysis: Runtime prediction, performance evaluation
Financial Domain: Credit risk assessment, default prediction

References

The paper cites 41 related references, primarily including:

Original BatchBALD paper (Kirsch et al., 2019)
Classical survival analysis textbooks (Kleinbaum & Klein, 2012)
Maximum coverage problem research (Khuller et al., 1999)
Bayesian survival models (Qi et al., 2023)
Related active learning work (Vinzamuri et al., 2014; Hüttel et al., 2024)

Overall Assessment: This is a high-quality machine learning paper that addresses active learning for survival data under budget constraints. The method is well designed, the theoretical analysis is rigorous, and the experimental validation is comprehensive. Although it relies on assumptions such as non-informative censoring, it offers an effective approach to an important practical problem and has both academic and practical value.