Bayesian Active Learning By Distribution Disagreement

Werner, Schmidt-Thieme

Active Learning (AL) for regression has been systematically under-researched due to the increased difficulty of measuring uncertainty in regression models. Since normalizing flows offer a full predictive distribution instead of a point forecast, they facilitate direct usage of known heuristics for AL like Entropy or Least-Confident sampling. However, we show that most of these heuristics do not work well for normalizing flows in pool-based AL and we need more sophisticated algorithms to distinguish between aleatoric and epistemic uncertainty. In this work we propose BALSA, an adaptation of the BALD algorithm, tailored for regression with normalizing flows. With this work we extend current research on uncertainty quantification with normalizing flows \cite{berry2023normalizing, berry2023escaping} to real world data and pool-based AL with multiple acquisition functions and query sizes. We report SOTA results for BALSA across 4 different datasets and 2 different architectures.

Basic Information

Paper ID: 2501.01248
Title: Bayesian Active Learning By Distribution Disagreement
Authors: Thorben Werner, Lars Schmidt-Thieme (University of Hildesheim)
Category: cs.LG (Machine Learning)
Publication Date: January 2, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2501.01248

Abstract

Active learning for regression tasks remains understudied due to the difficulty of quantifying uncertainty in regression models. While normalizing flows provide complete predictive distributions rather than point estimates, enabling direct application of known heuristics such as entropy or least confidence sampling, this paper demonstrates that these heuristics perform poorly on normalizing flows in pool-based active learning, necessitating more sophisticated algorithms to distinguish between aleatoric and epistemic uncertainty. The paper proposes BALSA, an improved variant of the BALD algorithm specifically designed for regression tasks using normalizing flows. This work extends research on uncertainty quantification in normalizing flows to real-world data and pool-based active learning with multiple acquisition functions and query sizes. BALSA achieves state-of-the-art results across 4 different datasets and 2 different architectures.

Research Background and Motivation

Problem Definition

Core Problem: Active learning for regression tasks is severely understudied, primarily because uncertainty quantification in regression models is more challenging than in classification tasks
Significance: Active learning can reduce the amount of labeled data required to train strong models, yet existing research primarily focuses on classification problems
Limitations of Existing Methods:
- Traditional regression models (except Gaussian processes) cannot directly provide uncertainty quantification
- Existing uncertainty heuristics (e.g., standard deviation, least confidence, Shannon entropy) perform poorly on normalizing flows
- Inability to effectively distinguish between aleatoric uncertainty (data noise) and epistemic uncertainty (model underfitting)
Research Motivation: Emerging models such as normalizing flows and Gaussian neural networks provide complete predictive distributions, offering new opportunities for active learning in regression tasks

Core Contributions

Proposes BALSA Algorithm: An improved version of the BALD algorithm designed for models with predictive distributions, including two variants (BALSA_KL and BALSA_EMD)
Establishes Comprehensive Benchmark: Creates a comprehensive benchmark for active learning with predictive distributions, including 3 heuristic baselines and 3 BALD adaptation variants
Technical Innovation: Two novel BALD extension algorithms that directly leverage predictive distributions rather than relying on aggregation methods
Experimental Validation: Extensive comparisons across 4 real-world datasets and 2 model architectures, demonstrating method effectiveness

Methodology Details

Task Definition

Input: Training dataset $D_{train} := \{(x_i, y_i)\}_{i=1}^N$ , where $x \in \mathcal{X}, y \in \mathcal{Y}$
Objective: Select the most valuable samples for annotation through active learning strategy, minimizing annotation cost
Constraint: Pool-based active learning setting with fixed annotation budget B
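The pool-based setting above can be sketched as a simple loop. This is a minimal illustration, not the paper's implementation: the function names, the toy acquisition function, and the reading of the budget B as the total labeled-set size (initial set included) are all assumptions made for the example.

```python
import random


def active_learning_loop(pool, initial_size, budget, query_size, acquire, seed=0):
    """Pool-based loop: start from a random labeled set, then repeatedly move
    the `query_size` highest-scoring pool points into the labeled set.

    Assumption: `budget` is the total number of labeled points allowed.
    """
    rng = random.Random(seed)
    unlabeled = list(pool)
    rng.shuffle(unlabeled)
    labeled = unlabeled[:initial_size]
    unlabeled = unlabeled[initial_size:]

    while len(labeled) < budget and unlabeled:
        # In a real run the model would be retrained on `labeled` here.
        scores = acquire(labeled, unlabeled)
        n = min(query_size, budget - len(labeled), len(unlabeled))
        ranked = sorted(range(len(unlabeled)), key=lambda i: scores[i], reverse=True)
        chosen = set(ranked[:n])
        labeled += [unlabeled[i] for i in ranked[:n]]
        unlabeled = [x for i, x in enumerate(unlabeled) if i not in chosen]
    return labeled, unlabeled


def distance_to_labeled(labeled, unlabeled):
    # Toy stand-in for an acquisition function (NOT BALSA): prefer pool
    # points that are far from everything already labeled.
    return [min(abs(x - l) for l in labeled) for x in unlabeled]


pool = [i / 99 for i in range(100)]
labeled, unlabeled = active_learning_loop(
    pool, initial_size=10, budget=40, query_size=5, acquire=distance_to_labeled
)
print(len(labeled), len(unlabeled))
```

Any of the acquisition functions discussed later could be plugged in through the hypothetical `acquire` argument; the query size plays the role of τ in the query size experiment.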

Model Architecture

1. Base Models

The paper employs two regression models with predictive distributions:

Gaussian Neural Networks (GNN): Uses MLP encoder to produce μ and σ parameters, constructing Gaussian predictive distributions
Normalizing Flows (NF): Uses invertible transformations to parameterize free-form predictive distributions, capable of modeling more complex target distributions

2. BALSA Algorithm Core Concept

BALSA builds upon the core idea of the BALD algorithm but improves it for predictive distributions:

Original BALD Formula: $BALD(x) = \sum_{i=1}^k (H[\bar{y}(x)] - H[\hat{y}_{\theta_i}(x)])$

BALSA Improvement Strategy: $BALSA(x) = \sum_{i=1}^k \phi(\hat{y}_{\theta_i}(x), \bar{y}(x))$

where φ is a distance function (not necessarily a true metric; KL divergence, for example, is asymmetric) that directly measures the discrepancy between predictive distributions.
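To make the relation between the two formulas concrete, the following sketch compares them on small discrete distributions. It assumes that each prediction is a normalized probability vector and that the average is the equal-weight mixture of the k predictions; under those assumptions the entropy-based form and the φ = KL form give the same number, a standard identity. The regression setting in the paper works with densities rather than probability vectors, so this is an illustration only.

```python
import math


def entropy(p):
    """Shannon entropy of a discrete distribution."""
    return -sum(v * math.log(v) for v in p if v > 0)


def kl(p, q):
    """KL(p || q) for discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mean_distribution(dists):
    """Equal-weight mixture of the k sampled predictions."""
    k = len(dists)
    return [sum(col) / k for col in zip(*dists)]


def bald_entropy(dists):
    """Entropy form: sum_i (H[mean] - H[p_i])."""
    h_bar = entropy(mean_distribution(dists))
    return sum(h_bar - entropy(p) for p in dists)


def disagreement(dists, phi):
    """Distance form: sum_i phi(p_i, mean)."""
    p_bar = mean_distribution(dists)
    return sum(phi(p, p_bar) for p in dists)


# Hypothetical predictions from k = 3 parameter samples.
dists = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]]
agree = [[0.2, 0.5, 0.3]] * 3

print(bald_entropy(dists), disagreement(dists, kl))
print(disagreement(agree, kl))
```

When all parameter samples agree, the score is (numerically) zero; the more the sampled distributions differ, the larger it becomes. Swapping `kl` for another function φ changes what kind of disagreement is rewarded.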

Technical Innovation Points

1. Average Distribution Computation

Grid Sampling Method:

Normalize target values to [0, 1]
Sample distribution across 200 grid points
Compute likelihood vectors and average: $\bar{p}|x = \frac{1}{k}\sum_{j=1}^k \hat{p}^{\dashv}_{\theta_j}|x$
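The grid steps above can be sketched as follows. The example assumes Gaussian predictive distributions purely for illustration (a normalizing flow would supply its own density), and the `(mu, sigma)` values and function names are hypothetical.

```python
import math


def gaussian_pdf(y, mu, sigma):
    """Density of N(mu, sigma^2) at y."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def make_grid(n_points=200):
    """Evenly spaced grid over the normalized target range [0, 1]."""
    return [i / (n_points - 1) for i in range(n_points)]


def likelihood_vector(mu, sigma, grid):
    """Evaluate one sampled predictive density at every grid point."""
    return [gaussian_pdf(y, mu, sigma) for y in grid]


def average_likelihood(vectors):
    """Element-wise mean of the k likelihood vectors."""
    k = len(vectors)
    return [sum(col) / k for col in zip(*vectors)]


grid = make_grid()
# Hypothetical (mu, sigma) pairs, one per parameter sample (e.g. dropout mask).
params = [(0.40, 0.05), (0.45, 0.08), (0.55, 0.05)]
vectors = [likelihood_vector(mu, s, grid) for mu, s in params]
p_bar = average_likelihood(vectors)
print(len(grid), len(p_bar))
```

Working on a fixed grid turns every predictive distribution into a vector of the same length, which is what makes an element-wise average well defined even for free-form densities.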

Pairwise Comparison Method:

Avoid computing average distribution
Use k-1 pairs of parameter samples: $\sum_{i=1}^{k-1} \phi(\hat{p}_{\theta_i}|x, \hat{p}_{\theta_{i+1}}|x)$

2. Distance Metric Functions

BALSA_KL (Kullback-Leibler Divergence):

Grid version: $BALSA_{KL}^{Grid}(x) = \sum_{i=1}^k KL(\hat{p}^{\dashv}_{\theta_i}|x, \bar{p}|x)$
Pairwise version: $BALSA_{KL}^{Pair}(x) = \sum_{i=1}^{k-1} KL(\hat{p}_{\theta_i}|x, \hat{p}_{\theta_{i+1}}|x)$
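A rough sketch of both KL variants follows. For the grid version it treats the likelihood vectors as given and sums the discrete KL-style terms against their average. For the pairwise version it uses the closed-form KL divergence between two Gaussians, which is an assumption that only fits Gaussian predictive heads; this summary does not state how the pairwise KL is computed for normalizing flows. All function names and the small `eps` guard are illustrative.

```python
import math


def kl_grid(p, q, eps=1e-12):
    """Discrete KL-style sum between two likelihood vectors on a shared grid."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def balsa_kl_grid(vectors):
    """Sum of KL terms from each sampled vector to the average vector."""
    k = len(vectors)
    p_bar = [sum(col) / k for col in zip(*vectors)]
    return sum(kl_grid(p, p_bar) for p in vectors)


def kl_gaussian(mu1, s1, mu2, s2):
    """Closed-form KL(N(mu1, s1^2) || N(mu2, s2^2))."""
    return math.log(s2 / s1) + (s1 * s1 + (mu1 - mu2) ** 2) / (2.0 * s2 * s2) - 0.5


def balsa_kl_pair(params):
    """Sum of KL terms over the k-1 consecutive pairs of parameter samples."""
    return sum(
        kl_gaussian(*params[i], *params[i + 1]) for i in range(len(params) - 1)
    )


# Hypothetical likelihood vectors on a tiny 3-point grid.
agree = [[0.1, 0.8, 0.1]] * 3
disagree = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
print(balsa_kl_grid(agree), balsa_kl_grid(disagree))

# Hypothetical (mu, sigma) pairs from three parameter samples.
print(balsa_kl_pair([(0.40, 0.05), (0.45, 0.08), (0.55, 0.05)]))
```

The pairwise form needs no average distribution at all, at the cost of comparing only neighbouring parameter samples rather than each sample against a common reference.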

BALSA_EMD (Earth Mover's Distance): $BALSA_{EMD}(x) = \sum_{i=1}^{k-1} EMD(y'_{\theta_i}, y'_{\theta_{i+1}})$

where $y'_\theta \sim \hat{p}_\theta|x$
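For one-dimensional targets, the Earth Mover's Distance between two equally sized sample sets reduces to the mean absolute difference between their sorted values. The sketch below relies on that fact; the equal-sample-size requirement and the function names are assumptions of this example, not details taken from the paper.

```python
import random


def emd_1d(samples_a, samples_b):
    """1-D Earth Mover's Distance between two equally sized sample sets."""
    if len(samples_a) != len(samples_b):
        raise ValueError("this sketch assumes equal sample sizes")
    a, b = sorted(samples_a), sorted(samples_b)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def balsa_emd(sample_sets):
    """Sum of EMD terms over the k-1 consecutive pairs of sample sets."""
    return sum(
        emd_1d(sample_sets[i], sample_sets[i + 1])
        for i in range(len(sample_sets) - 1)
    )


rng = random.Random(0)
# Hypothetical draws y' from three sampled predictive distributions.
sample_sets = [
    [rng.gauss(mu, 0.05) for _ in range(100)] for mu in (0.40, 0.45, 0.55)
]
print(balsa_emd(sample_sets))
```

Because it only needs samples, this variant works for any model that can be sampled from, without evaluating densities on a grid.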

Experimental Setup

Datasets

Four regression datasets are used, covering different scales and complexities:

Dataset	Features	Training Samples	Initial Labeled Set	Budget
Parkinsons	61	3,760	200	800
Superconductors	81	13,608	200	800
Sarcos	21	28,470	200	1,200
Diamonds	26	34,522	200	1,200

Evaluation Metrics

Primary Metric: Negative Log-Likelihood (NLL)
Auxiliary Metrics: Mean Absolute Error (MAE), CRPS Score
Statistical Method: Wilcoxon signed-rank test, results aggregated using CD diagrams
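As a reference for the three metrics, here is a minimal sketch. The NLL is written for a Gaussian predictive distribution and the CRPS uses the standard sample-based estimator E|X - y| - 0.5 E|X - X'|; whether the paper computes CRPS this way is not stated here, so treat the estimator choice and the function names as assumptions.

```python
import math


def gaussian_nll(y, mu, sigma):
    """Negative log-likelihood of y under N(mu, sigma^2)."""
    return 0.5 * math.log(2.0 * math.pi * sigma * sigma) + (y - mu) ** 2 / (
        2.0 * sigma * sigma
    )


def mae(y_true, y_pred):
    """Mean absolute error between targets and point predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def crps_from_samples(samples, y):
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|."""
    n = len(samples)
    term1 = sum(abs(s - y) for s in samples) / n
    term2 = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return term1 - 0.5 * term2


print(gaussian_nll(0.5, 0.4, 0.1))
print(mae([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
print(crps_from_samples([0.3, 0.4, 0.5, 0.6], 0.45))
```

NLL and CRPS both score the whole predictive distribution, whereas MAE only scores a point prediction; with a single sample the CRPS estimate collapses to the absolute error.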

Comparison Methods

Clustering Methods: Coreset, CoreGCN, TypiClust
Heuristic Methods: Standard Deviation (Std), Least Confidence (LC), Shannon Entropy (Entropy)
BALD Variants: BALD_σ, BALD_LC, BALD_H
Proposed Methods: BALSA_KL Grid/Pair, BALSA_EMD

Implementation Details

Model Architecture: MLP encoder + distribution decoder
Normalizing Flows: Autoregressive neural spline flows with rational quadratic spline transformations
Optimizer: NAdam
Dropout Rate: 0.008-0.05 (optimized for each dataset)
Experimental Repetitions: 30 runs per experiment

Experimental Results

Main Results

Critical Difference diagram based on NLL metric shows:

BALSA_KL Pairs: Best average ranking
BALSA_KL Grid: Second-best average ranking, close behind BALSA_KL Pairs
BALD_H: Third ranking
Coreset: Best performance among geometric methods

Key Findings:

Traditional heuristic methods (entropy, standard deviation, least confidence) perform poorly on normalizing flows
BALSA methods show clear advantages on normalizing flow architectures
Coreset and CoreGCN perform better on GNN architectures

Ablation Studies

1. Dual Mode Experiment

Tests the effect of using different dropout rates during training and evaluation:

Inconsistent results: BALSA_EMD dual shows performance degradation, BALSA_KL Grid dual shows slight improvement
Hypothesis: Dropout rate switching may affect model prediction quality

2. Renormalization Experiment

Tests normalized version of BALSA_KL Grid:

Normalized version shows slightly lower performance than non-normalized version
Simpler non-normalized formula is selected

3. Query Size Experiment

Performance at query sizes τ ∈ {50, 200}:

Uncertainty sampling methods maintain performance with larger query sizes
Clustering algorithms (Coreset, TypiClust) show faster performance degradation
Contradicts common understanding from classification tasks

Case Analysis

Active learning trajectory on Diamonds dataset demonstrates:

BALSA methods converge faster
Traditional heuristic methods approach random sampling performance
Consistent performance across NLL and MAE metrics

Regression Active Learning

Geometric Methods: Coreset, CoreGCN, TypiClust, and other methods based on the geometric properties of the data
Uncertainty Methods: Mostly bound to specific model architectures with poor generalizability
BALD Algorithm: One of few model-agnostic methods

Berry and Meger's work [1, 2]:

Proposes normalizing flow ensembles and MC dropout approximations
Validation only on synthetic data
This paper extends to real data and multiple acquisition functions

Distinctions and Improvements

Uses Shannon entropy rather than the simple $-\sum \log \hat{y}_\theta(x)$
Extends to real-world datasets
Compares with multiple active learning algorithms

Conclusions and Discussion

Main Conclusions

Method Effectiveness: BALSA performs excellently on normalizing flows, particularly the BALSA_KL Pairs variant
Heuristic Failure: Traditional uncertainty heuristics perform poorly on normalizing flows
Architecture Dependency: Different algorithms show significant performance variations across model architectures
Query Size Impact: Uncertainty methods are more stable with larger query sizes

Limitations

Insufficient Theoretical Analysis: Lacks convergence analysis for BALSA algorithm
Computational Overhead: MC dropout and distribution distance computation increase computational cost
Hyperparameter Sensitivity: Dropout rate selection significantly impacts performance
Dataset Limitations: Validated on only 4 datasets; generalizability remains to be verified

Future Directions

Extend to other parameter sampling methods (Langevin Dynamics, SVGD)
Theoretical analysis of BALSA convergence properties
Investigate additional distribution distance metrics
Validate on larger-scale datasets

In-Depth Evaluation

Strengths

Problem Importance: Addresses the overlooked yet important problem of regression active learning
Method Novelty: First to directly apply distribution distances to active learning, avoiding information loss from aggregation methods
Experimental Comprehensiveness: Multi-dataset, multi-architecture, multi-metric comprehensive evaluation
Practical Value: Provides reproducible code and detailed experimental settings

Weaknesses

Weak Theoretical Foundation: Lacks theoretical analysis explaining why BALSA is more effective
Computational Efficiency: MC dropout and EMD computation may impact practical applications
Hyperparameter Tuning: Lack of principled guidance for dropout rate selection
Evaluation Limitations: Primarily based on NLL; consistency across other regression metrics remains to be verified

Impact

Academic Contribution: Provides new research direction for regression active learning
Practical Value: Particularly suitable for regression applications requiring uncertainty quantification
Reproducibility: Complete code and experimental configuration provided, facilitating subsequent research

Applicable Scenarios

Scientific Computing: Physics/chemistry modeling requiring uncertainty quantification
Risk Assessment: Finance, healthcare and other uncertainty-sensitive domains
Engineering Optimization: Design optimization problems requiring exploration-exploitation balance
Time Series: Prediction tasks with complex distributions

References

This paper primarily references the following key works:

Berry & Meger (2023): Uncertainty modeling with normalizing flow ensembles
Gal et al. (2017): Original BALD algorithm proposal
Sener & Savarese (2017): Coreset active learning method
Durkan et al. (2019): Technical foundation of neural spline flows

Overall Assessment: This is high-quality research that addresses the important yet overlooked problem of active learning for regression. The BALSA algorithm fills a gap in applying normalizing flows to active learning, and the experimental design is thorough with convincing results. While there is still room for improvement in theoretical analysis and computational efficiency, the paper makes an important contribution to the field.