Bayesian Active Learning By Distribution Disagreement

Werner, Schmidt-Thieme
Active Learning (AL) for regression has been systematically under-researched due to the increased difficulty of measuring uncertainty in regression models. Since normalizing flows offer a full predictive distribution instead of a point forecast, they facilitate direct usage of known heuristics for AL like Entropy or Least-Confident sampling. However, we show that most of these heuristics do not work well for normalizing flows in pool-based AL and we need more sophisticated algorithms to distinguish between aleatoric and epistemic uncertainty. In this work we propose BALSA, an adaptation of the BALD algorithm, tailored for regression with normalizing flows. With this work we extend current research on uncertainty quantification with normalizing flows \cite{berry2023normalizing, berry2023escaping} to real world data and pool-based AL with multiple acquisition functions and query sizes. We report SOTA results for BALSA across 4 different datasets and 2 different architectures.
Basic Information

  • Paper ID: 2501.01248
  • Title: Bayesian Active Learning By Distribution Disagreement
  • Authors: Thorben Werner, Lars Schmidt-Thieme (University of Hildesheim)
  • Category: cs.LG (Machine Learning)
  • Publication Date: January 2, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2501.01248

Abstract

Active learning for regression tasks remains understudied due to the difficulty of quantifying uncertainty in regression models. While normalizing flows provide complete predictive distributions rather than point estimates, enabling direct application of known heuristics such as entropy or least confidence sampling, this paper demonstrates that these heuristics perform poorly on normalizing flows in pool-based active learning, necessitating more sophisticated algorithms to distinguish between aleatoric and epistemic uncertainty. The paper proposes BALSA, an improved variant of the BALD algorithm specifically designed for regression tasks using normalizing flows. This work extends research on uncertainty quantification in normalizing flows to real-world data and pool-based active learning with multiple acquisition functions and query sizes. BALSA achieves state-of-the-art results across 4 different datasets and 2 different architectures.

Research Background and Motivation

Problem Definition

  1. Core Problem: Active learning for regression tasks is severely understudied, primarily because uncertainty quantification in regression models is more challenging than in classification tasks
  2. Significance: Active learning can reduce the amount of labeled data required to train strong models, yet existing research primarily focuses on classification problems
  3. Limitations of Existing Methods:
    • Traditional regression models (except Gaussian processes) cannot directly provide uncertainty quantification
    • Existing uncertainty heuristics (e.g., standard deviation, least confidence, Shannon entropy) perform poorly on normalizing flows
    • Inability to effectively distinguish between aleatoric uncertainty (data noise) and epistemic uncertainty (model underfitting)
  4. Research Motivation: Emerging models such as normalizing flows and Gaussian neural networks provide complete predictive distributions, offering new opportunities for active learning in regression tasks

Core Contributions

  1. Proposes BALSA Algorithm: An improved version of the BALD algorithm designed for models with predictive distributions, including two variants (BALSA_KL and BALSA_EMD)
  2. Establishes Comprehensive Benchmark: Creates a comprehensive benchmark for active learning with predictive distributions, including 3 heuristic baselines and 3 BALD adaptation variants
  3. Technical Innovation: Two novel BALD extension algorithms that directly leverage predictive distributions rather than relying on aggregation methods
  4. Experimental Validation: Extensive comparisons across 4 real-world datasets and 2 model architectures, demonstrating method effectiveness

Methodology Details

Task Definition

  • Input: Training dataset $D_{train} := \{(x_i, y_i)\}_{i=1}^N$, where $x \in \mathcal{X}$, $y \in \mathcal{Y}$
  • Objective: Select the most valuable samples for annotation through active learning strategy, minimizing annotation cost
  • Constraint: Pool-based active learning setting with fixed annotation budget B
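
The pool-based setting above can be sketched as a simple loop. This is a minimal illustration, not the authors' implementation: the function name `pool_based_al`, the `score` and `retrain` callbacks, and the reading of the budget as the total number of labeled samples (initial set included) are all assumptions made for the example.

```python
from typing import Callable, List, Sequence


def pool_based_al(
    pool: Sequence[int],
    initial: Sequence[int],
    budget: int,
    query_size: int,
    score: Callable[[int], float],
    retrain: Callable[[List[int]], None],
) -> List[int]:
    """Grow the labeled set until it holds `budget` indices (assumed semantics)."""
    labeled = list(initial)
    seen = set(labeled)
    unlabeled = [i for i in pool if i not in seen]
    retrain(labeled)  # fit on the initial labeled set
    while len(labeled) < budget and unlabeled:
        n = min(query_size, budget - len(labeled))
        # Rank the pool by acquisition score, highest first, and take the top n.
        chosen = sorted(unlabeled, key=score, reverse=True)[:n]
        labeled.extend(chosen)
        chosen_set = set(chosen)
        unlabeled = [i for i in unlabeled if i not in chosen_set]
        retrain(labeled)  # refit after each query round
    return labeled


# Toy usage: the score is just the index value and retraining is a no-op.
selected = pool_based_al(
    pool=range(10), initial=[0, 1], budget=6, query_size=3,
    score=float, retrain=lambda labeled: None,
)
print(selected)  # prints [0, 1, 9, 8, 7, 6]
```

In a real run, `score` would close over the current model so that the ranking changes after every call to `retrain`.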

Model Architecture

1. Base Models

The paper employs two regression models with predictive distributions:

  • Gaussian Neural Networks (GNN): Uses MLP encoder to produce μ and σ parameters, constructing Gaussian predictive distributions
  • Normalizing Flows (NF): Uses invertible transformations to parameterize free-form predictive distributions, capable of modeling more complex target distributions

2. BALSA Algorithm Core Concept

BALSA builds upon the core idea of the BALD algorithm but improves it for predictive distributions:

Original BALD Formula: $BALD(x) = \sum_{i=1}^k (H[\bar{y}(x)] - H[\hat{y}_{\theta_i}(x)])$

BALSA Improvement Strategy: $BALD(x) = \sum_{i=1}^k \phi(\hat{y}_{\theta_i}(x), \bar{y}(x))$

where φ is a distance or divergence function that directly compares predictive distributions, and the sum runs over k sampled parameter sets θ_i.

Technical Innovation Points

1. Average Distribution Computation

Grid Sampling Method:

  • Normalize target values to [0, 1]
  • Sample distribution across 200 grid points
  • Compute likelihood vectors and average: $\bar{p}|x = \frac{1}{k}\sum_{j=1}^k \hat{p}^{\dashv}_{\theta_j}|x$
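
The grid sampling steps can be illustrated with a short sketch. It assumes, purely for the example, that each of the k parameter samples yields a Gaussian predictive distribution described by a hypothetical `(mu, sigma)` pair; with a normalizing flow the density evaluation would come from the flow instead. Plain Python is used to keep the example self-contained.

```python
import math
from typing import List, Sequence

GRID_POINTS = 200  # grid resolution stated in the text


def gaussian_pdf(y: float, mu: float, sigma: float) -> float:
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def make_grid(n: int = GRID_POINTS) -> List[float]:
    """n evenly spaced points covering the normalized target range [0, 1]."""
    return [i / (n - 1) for i in range(n)]


def likelihood_vector(mu: float, sigma: float, grid: Sequence[float]) -> List[float]:
    """Density of one predictive distribution evaluated at every grid point."""
    return [gaussian_pdf(y, mu, sigma) for y in grid]


def average_distribution(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Point-wise mean of the k likelihood vectors."""
    k = len(vectors)
    return [sum(column) / k for column in zip(*vectors)]


# Hypothetical predictions of k = 3 parameter samples for a single input x.
members = [(0.4, 0.1), (0.5, 0.1), (0.6, 0.1)]
grid = make_grid()
vectors = [likelihood_vector(mu, sigma, grid) for mu, sigma in members]
p_bar = average_distribution(vectors)
```

The averaged vector `p_bar` is what the grid variant compares each individual likelihood vector against.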

Pairwise Comparison Method:

  • Avoid computing average distribution
  • Use k-1 pairs of parameter samples: $\sum_{i=1}^{k-1} \phi(\hat{p}_{\theta_i}|x, \hat{p}_{\theta_{i+1}}|x)$

2. Distance Metric Functions

BALSA_KL (Kullback-Leibler Divergence):

  • Grid version: $BALSA_{KL}^{Grid}(x) = \sum_{i=1}^k KL(\hat{p}^{\dashv}_{\theta_i}|x, \bar{p}|x)$
  • Pairwise version: $BALSA_{KL}^{Pair}(x) = \sum_{i=1}^{k-1} KL(\hat{p}_{\theta_i}|x, \hat{p}_{\theta_{i+1}}|x)$
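
Both KL variants can be sketched on likelihood vectors that share one grid. This is an illustrative reading of the formulas above, not the paper's code: the helper names, the `EPS` guard against `log(0)`, and the optional `renormalize` flag are assumptions. By default the KL-style sum is applied to the vectors as given.

```python
import math
from typing import List, Sequence

EPS = 1e-12  # assumption: guards against log(0) on empty grid cells


def kl(p: Sequence[float], q: Sequence[float], renormalize: bool = False) -> float:
    """KL-style sum over grid cells: sum_i p_i * log(p_i / q_i)."""
    if renormalize:
        sp, sq = sum(p), sum(q)
        p = [v / sp for v in p]
        q = [v / sq for v in q]
    return sum(pi * math.log((pi + EPS) / (qi + EPS)) for pi, qi in zip(p, q))


def average(vectors: Sequence[Sequence[float]]) -> List[float]:
    k = len(vectors)
    return [sum(column) / k for column in zip(*vectors)]


def balsa_kl_grid(vectors: Sequence[Sequence[float]], renormalize: bool = False) -> float:
    """Sum of KL terms between each member and the averaged distribution."""
    p_bar = average(vectors)
    return sum(kl(p, p_bar, renormalize) for p in vectors)


def balsa_kl_pair(vectors: Sequence[Sequence[float]], renormalize: bool = False) -> float:
    """Sum of KL terms over the k-1 consecutive pairs; no average is needed."""
    return sum(
        kl(vectors[i], vectors[i + 1], renormalize) for i in range(len(vectors) - 1)
    )


# Toy three-cell "grid": members that agree score near zero, members that disagree score higher.
a = [0.7, 0.2, 0.1]
b = [0.1, 0.2, 0.7]
agree_score = balsa_kl_grid([a, a, a])
disagree_score = balsa_kl_grid([a, b])
```

Since KL divergence is asymmetric, the pairwise score depends on the order in which the parameter samples are enumerated.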

BALSA_EMD (Earth Mover's Distance): $BALSA_{EMD}(x) = \sum_{i=1}^{k-1} EMD(y'_{\theta_i}, y'_{\theta_{i+1}})$

where $y'_\theta \sim \hat{p}_\theta|x$
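
For one-dimensional targets, the EMD between two equally sized sample sets reduces to the mean absolute difference of the sorted samples, which allows a compact sketch of the EMD variant. The Gaussian sampling below is only a stand-in for drawing from each member's predictive distribution, and the function names and sample count are assumptions.

```python
import random
from typing import List, Sequence


def emd_1d(a: Sequence[float], b: Sequence[float]) -> float:
    """Earth Mover's Distance between two equally sized 1-D sample sets."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equally sized sample sets")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def balsa_emd(samples: Sequence[Sequence[float]]) -> float:
    """Sum of EMD terms over the k-1 consecutive pairs of sample sets."""
    return sum(emd_1d(samples[i], samples[i + 1]) for i in range(len(samples) - 1))


# Hypothetical (mu, sigma) predictions of k = 3 parameter samples for one input x.
rng = random.Random(0)
members = [(0.4, 0.05), (0.5, 0.05), (0.6, 0.05)]
draws: List[List[float]] = [
    [rng.gauss(mu, sigma) for _ in range(256)] for mu, sigma in members
]
score = balsa_emd(draws)
```

Working on samples rather than densities means this variant needs neither a grid nor an explicit average distribution.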

Experimental Setup

Datasets

Four regression datasets are used, covering different scales and complexities:

Dataset          Features  Training Samples  Initial Labeled Set  Budget
Parkinsons       61        3,760             200                  800
Superconductors  81        13,608            200                  800
Sarcos           21        28,470            200                  1,200
Diamonds         26        34,522            200                  1,200

Evaluation Metrics

  • Primary Metric: Negative Log-Likelihood (NLL)
  • Auxiliary Metrics: Mean Absolute Error (MAE), CRPS Score
  • Statistical Method: Wilcoxon signed-rank test, results aggregated using CD diagrams
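
For a Gaussian predictive distribution, all three metrics have simple closed forms, sketched below. The CRPS expression is the standard closed form for a normal distribution; for a normalizing flow the CRPS would instead have to be estimated numerically, which this sketch does not cover. Function names are illustrative.

```python
import math
from typing import Sequence


def gaussian_nll(y: float, mu: float, sigma: float) -> float:
    """Negative log-likelihood of y under N(mu, sigma^2)."""
    return 0.5 * math.log(2.0 * math.pi * sigma**2) + (y - mu) ** 2 / (2.0 * sigma**2)


def gaussian_crps(y: float, mu: float, sigma: float) -> float:
    """Closed-form CRPS of y under N(mu, sigma^2); lower is better."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))


def mae(targets: Sequence[float], predictions: Sequence[float]) -> float:
    """Mean absolute error between targets and point predictions."""
    return sum(abs(t - p) for t, p in zip(targets, predictions)) / len(targets)


# An overconfident prediction (small sigma) far from the target is punished
# much harder by NLL than a wider one centred at the same place.
confident = gaussian_nll(1.0, 0.0, 0.1)
cautious = gaussian_nll(1.0, 0.0, 1.0)
```

NLL and CRPS both score the full predictive distribution, whereas MAE only looks at a point prediction such as the mean.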

Comparison Methods

  • Clustering Methods: Coreset, CoreGCN, TypiClust
  • Heuristic Methods: Standard Deviation (Std), Least Confidence (LC), Shannon Entropy (Entropy)
  • BALD Variants: BALD_σ, BALD_LC, BALD_H
  • Proposed Methods: BALSA_KL Grid/Pair, BALSA_EMD

Implementation Details

  • Model Architecture: MLP encoder + distribution decoder
  • Normalizing Flows: Autoregressive neural spline flows with rational quadratic spline transformations
  • Optimizer: NAdam
  • Dropout Rate: 0.008-0.05 (optimized for each dataset)
  • Experimental Repetitions: 30 runs per experiment

Experimental Results

Main Results

The Critical Difference diagram based on the NLL metric shows:

  1. BALSA_KL Pair: Best average rank
  2. BALSA_KL Grid: Close second
  3. BALD_H: Third ranking
  4. Coreset: Best performance among geometric methods

Key Findings:

  • Traditional heuristic methods (entropy, standard deviation, least confidence) perform poorly on normalizing flows
  • BALSA methods show clear advantages on normalizing flow architectures
  • Coreset and CoreGCN perform better on GNN architectures

Ablation Studies

1. Dual Mode Experiment

Tests the effect of using different dropout rates during training and evaluation:

  • Inconsistent results: BALSA_EMD dual shows performance degradation, while BALSA_KL Grid dual shows slight improvement
  • Hypothesis: Dropout rate switching may affect model prediction quality

2. Renormalization Experiment

Tests normalized version of BALSA_KL Grid:

  • Normalized version shows slightly lower performance than non-normalized version
  • Simpler non-normalized formula is selected

3. Query Size Experiment

Performance with query sizes τ ∈ {50, 200}:

  • Uncertainty sampling methods maintain performance with larger query sizes
  • Clustering algorithms (Coreset, TypiClust) show faster performance degradation
  • Contradicts common understanding from classification tasks

Case Analysis

Active learning trajectory on Diamonds dataset demonstrates:

  • BALSA methods converge faster
  • Traditional heuristic methods approach random sampling performance
  • Consistent performance across NLL and MAE metrics

Regression Active Learning

  • Geometric Methods: Coreset, CoreGCN, TypiClust and others based on data geometric properties
  • Uncertainty Methods: Mostly bound to specific model architectures with poor generalizability
  • BALD Algorithm: One of few model-agnostic methods

Berry and Meger's work (2023):

  • Proposes normalizing flow ensembles and MC dropout approximations
  • Validation only on synthetic data
  • This paper extends to real data and multiple acquisition functions

Distinctions and Improvements

  1. Uses Shannon entropy rather than the simple $-\sum \log \hat{y}_\theta(x)$
  2. Extends to real-world datasets
  3. Compares with multiple active learning algorithms

Conclusions and Discussion

Main Conclusions

  1. Method Effectiveness: BALSA ranks best on normalizing flows, particularly the BALSA_KL Pair variant
  2. Heuristic Failure: Traditional uncertainty heuristics perform poorly on normalizing flows
  3. Architecture Dependency: Different algorithms show significant performance variations across model architectures
  4. Query Size Impact: Uncertainty methods are more stable with larger query sizes

Limitations

  1. Insufficient Theoretical Analysis: Lacks convergence analysis for BALSA algorithm
  2. Computational Overhead: MC dropout and distribution distance computation increase computational cost
  3. Hyperparameter Sensitivity: Dropout rate selection significantly impacts performance
  4. Dataset Limitations: Validated on only 4 datasets; generalizability remains to be verified

Future Directions

  1. Extend to other parameter sampling methods (Langevin Dynamics, SVGD)
  2. Theoretical analysis of BALSA convergence properties
  3. Investigate additional distribution distance metrics
  4. Validate on larger-scale datasets

In-Depth Evaluation

Strengths

  1. Problem Importance: Addresses the overlooked yet important problem of regression active learning
  2. Method Novelty: First to directly apply distribution distances to active learning, avoiding information loss from aggregation methods
  3. Experimental Comprehensiveness: Multi-dataset, multi-architecture, multi-metric comprehensive evaluation
  4. Practical Value: Provides reproducible code and detailed experimental settings

Weaknesses

  1. Weak Theoretical Foundation: Lacks theoretical analysis explaining why BALSA is more effective
  2. Computational Efficiency: MC dropout and EMD computation may impact practical applications
  3. Hyperparameter Tuning: Lack of principled guidance for dropout rate selection
  4. Evaluation Limitations: Rankings are based primarily on NLL; consistency across other regression metrics remains to be verified

Impact

  1. Academic Contribution: Provides new research direction for regression active learning
  2. Practical Value: Particularly suitable for regression applications requiring uncertainty quantification
  3. Reproducibility: Complete code and experimental configuration provided, facilitating subsequent research

Applicable Scenarios

  1. Scientific Computing: Physics/chemistry modeling requiring uncertainty quantification
  2. Risk Assessment: Finance, healthcare and other uncertainty-sensitive domains
  3. Engineering Optimization: Design optimization problems requiring exploration-exploitation balance
  4. Time Series: Prediction tasks with complex distributions

References

This paper primarily references the following key works:

  1. Berry & Meger (2023): Uncertainty modeling with normalizing flow ensembles
  2. Gal et al. (2017): Original BALD algorithm proposal
  3. Sener & Savarese (2017): Coreset active learning method
  4. Durkan et al. (2019): Technical foundation of neural spline flows

Overall Assessment: This is a high-quality study addressing the important yet overlooked problem of active learning for regression. BALSA fills a gap in applying normalizing flows to active learning, with a sound experimental design and convincing results. While there remains room for improvement in theoretical analysis and computational efficiency, the paper makes an important contribution to the field.