Budget-constrained Active Learning to Effectively De-censor Survival Data
Parsaee, Jiang, Friggstad et al.
Standard supervised learners attempt to learn a model from a labeled dataset. Given a small set of labeled instances and a pool of unlabeled instances, a budgeted learner can use its given budget to pay to acquire the labels of some unlabeled instances, which it can then use to produce a model. Here, we explore budgeted learning in the context of survival datasets, which include (right) censored instances, where we know only a lower bound on an instance's time-to-event. In this setting, the learner can pay to (partially) label a censored instance -- e.g., to acquire the actual time for an instance [perhaps go from (3 yr, censored) to (7.2 yr, uncensored)], or other variants [e.g., learn about one more year, so go from (3 yr, censored) to either (4 yr, censored) or perhaps (3.2 yr, uncensored)]. This serves as a model of real-world data collection, where follow-up with censored patients does not always lead to uncensoring, and where the amount of information given to the learner during data collection is a function of the budget and the nature of the data itself. We provide both experimental and theoretical results on how to apply state-of-the-art budgeted learning algorithms to survival data, and on the limitations of doing so. Our approach provides bounds and time complexity asymptotically equivalent to those of the standard active learning method BatchBALD. Moreover, empirical analysis on several survival tasks shows that our model performs better than other potential approaches on several benchmarks.
This paper explores budget-constrained active learning on survival datasets. Survival data contain right-censored instances, for which only a lower bound on the time-to-event is known. The learner can spend part of its budget to (partially) de-censor an instance: for example, to acquire the actual event time, going from (3 years, censored) to (7.2 years, uncensored), or to learn about one more year, going from (3 years, censored) to either (4 years, censored) or (3.2 years, uncensored). This models real-world data collection, where follow-up with censored patients does not always result in de-censoring, and where the information the learner gains is a function of both the budget and the characteristics of the data.
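The partial de-censoring step described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the names `Observation` and `follow_up` are hypothetical, and the sketch assumes a simulation setting in which the true event time is available to an oracle, so that a follow-up of a given length either reveals the event or only pushes the censoring time forward.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One survival record (hypothetical container for illustration)."""
    time: float   # observed time in years
    event: bool   # True if uncensored, False if right-censored


def follow_up(obs: Observation, true_event_time: float,
              extra_years: float) -> Observation:
    """Follow a censored instance for `extra_years` more years.

    `true_event_time` is known only to the simulator (an assumption of
    this sketch); the learner sees just the returned observation.
    """
    if obs.event:
        return obs  # already uncensored, nothing more to learn
    if true_event_time < obs.time:
        raise ValueError("true event time cannot precede the censoring time")
    horizon = obs.time + extra_years
    if true_event_time <= horizon:
        # The event falls inside the follow-up window: fully de-censored.
        return Observation(time=true_event_time, event=True)
    # The event is still unobserved: only the lower bound improves.
    return Observation(time=horizon, event=False)


censored = Observation(time=3.0, event=False)
print(follow_up(censored, true_event_time=7.2, extra_years=1.0))
print(follow_up(censored, true_event_time=3.2, extra_years=1.0))
print(follow_up(censored, true_event_time=7.2, extra_years=float("inf")))
```

The three calls reproduce the cases in the text: one more year of follow-up yields (4 years, censored) when the event occurs later, (3.2 years, uncensored) when it occurs within the window, and an unbounded follow-up recovers the actual time (7.2 years, uncensored).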
Real-world data collection is costly, particularly in medical research, industrial testing, and similar domains. Traditional methods account for neither the budget constraint nor the partial information carried by censored instances, so specialized approaches are needed for this setting.
Maximum coverage problem research (Khuller et al., 1999)
Bayesian survival models (Qi et al., 2023)
Related active learning work (Vinzamuri et al., 2014; Hüttel et al., 2024)
Overall Assessment: This is a high-quality machine learning paper that addresses active learning for survival data under budget constraints. The method design is clever, the theoretical analysis is rigorous, and the experimental validation is comprehensive. Although the approach rests on some limiting assumptions, it offers an effective solution to an important practical problem and has both academic and practical value.