Active Learning (AL) for regression has been systematically under-researched due to the increased difficulty of measuring uncertainty in regression models. Since normalizing flows offer a full predictive distribution instead of a point forecast, they facilitate direct usage of known heuristics for AL like Entropy or Least-Confident sampling. However, we show that most of these heuristics do not work well for normalizing flows in pool-based AL and we need more sophisticated algorithms to distinguish between aleatoric and epistemic uncertainty. In this work we propose BALSA, an adaptation of the BALD algorithm, tailored for regression with normalizing flows. With this work we extend current research on uncertainty quantification with normalizing flows \cite{berry2023normalizing, berry2023escaping} to real world data and pool-based AL with multiple acquisition functions and query sizes. We report SOTA results for BALSA across 4 different datasets and 2 different architectures.
- Paper ID: 2501.01248
- Title: Bayesian Active Learning By Distribution Disagreement
- Authors: Thorben Werner, Lars Schmidt-Thieme (University of Hildesheim)
- Category: cs.LG (Machine Learning)
- Publication Date: January 2, 2025 (arXiv preprint)
- Paper Link: https://arxiv.org/abs/2501.01248
Active learning for regression tasks remains understudied due to the difficulty of quantifying uncertainty in regression models. While normalizing flows provide complete predictive distributions rather than point estimates, enabling direct application of known heuristics such as entropy or least confidence sampling, this paper demonstrates that these heuristics perform poorly on normalizing flows in pool-based active learning, necessitating more sophisticated algorithms to distinguish between aleatoric and epistemic uncertainty. The paper proposes BALSA, an improved variant of the BALD algorithm specifically designed for regression tasks using normalizing flows. This work extends research on uncertainty quantification in normalizing flows to real-world data and pool-based active learning with multiple acquisition functions and query sizes. BALSA achieves state-of-the-art results across 4 different datasets and 2 different architectures.
- Core Problem: Active learning for regression tasks is severely understudied, primarily because uncertainty quantification in regression models is more challenging than in classification tasks
- Significance: Active learning can reduce the amount of labeled data required to train strong models, yet existing research primarily focuses on classification problems
- Limitations of Existing Methods:
  - Traditional regression models (except Gaussian processes) cannot directly provide uncertainty quantification
  - Existing uncertainty heuristics (e.g., standard deviation, least confidence, Shannon entropy) perform poorly on normalizing flows
  - Inability to effectively distinguish between aleatoric uncertainty (data noise) and epistemic uncertainty (model underfitting)
- Research Motivation: Emerging models such as normalizing flows and Gaussian neural networks provide complete predictive distributions, offering new opportunities for active learning in regression tasks
- Proposes BALSA Algorithm: An improved version of the BALD algorithm designed for models with predictive distributions, including two variants (BALSA_KL and BALSA_EMD)
- Establishes Comprehensive Benchmark: Creates a comprehensive benchmark for active learning with predictive distributions, including 3 heuristic baselines and 3 BALD adaptation variants
- Technical Innovation: Two novel BALD extension algorithms that directly leverage predictive distributions rather than relying on aggregation methods
- Experimental Validation: Extensive comparisons across 4 real-world datasets and 2 model architectures, demonstrating method effectiveness
- Input: Training dataset $D_{\text{train}} := \{(x_i, y_i)\}_{i=1}^{N}$, where $x \in \mathcal{X}$, $y \in \mathcal{Y}$
- Objective: Select the most valuable samples for annotation through active learning strategy, minimizing annotation cost
- Constraint: Pool-based active learning setting with fixed annotation budget B
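The pool-based setting described above can be sketched as a simple loop. This is a minimal illustration, not the paper's implementation: the function name `pool_based_active_learning`, the index-based pool representation, and the static `score_fn` are all assumptions made for the example, and a real loop would retrain the model and recompute scores in every round.

```python
import random

def pool_based_active_learning(pool, initial_labeled, budget, query_size, score_fn):
    """Generic pool-based loop; returns the list of labeled indices.

    pool:            indices of all available samples (hypothetical representation)
    initial_labeled: indices that start out labeled
    budget:          total number of additional labels (B)
    query_size:      number of samples labeled per round (tau)
    score_fn:        acquisition function on an index, higher = more informative
    """
    labeled = list(initial_labeled)
    seen = set(labeled)
    unlabeled = [i for i in pool if i not in seen]
    spent = 0
    while spent < budget and unlabeled:
        # A real loop would retrain the model on `labeled` at this point.
        n = min(query_size, budget - spent, len(unlabeled))
        # Rank the remaining pool by acquisition score and take the top n.
        chosen = sorted(unlabeled, key=score_fn, reverse=True)[:n]
        labeled.extend(chosen)
        chosen_set = set(chosen)
        unlabeled = [i for i in unlabeled if i not in chosen_set]
        spent += n
    return labeled

# Toy usage with random placeholder scores (not real acquisition values).
rng = random.Random(0)
toy_scores = {i: rng.random() for i in range(100)}
labeled = pool_based_active_learning(
    pool=list(range(100)),
    initial_labeled=list(range(10)),
    budget=30,
    query_size=8,
    score_fn=toy_scores.get,
)
```

The loop stops as soon as the budget is exhausted, so the final round may label fewer than `query_size` samples.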
The paper employs two regression models with predictive distributions:
- Gaussian Neural Networks (GNN): Use an MLP encoder to produce the parameters μ and σ of a Gaussian predictive distribution
- Normalizing Flows (NF): Uses invertible transformations to parameterize free-form predictive distributions, capable of modeling more complex target distributions
BALSA builds upon the core idea of the BALD algorithm but improves it for predictive distributions:
Original BALD Formula:
$$\text{BALD}(x) = \sum_{i=1}^{k} \left( H[\bar{y}(x)] - H[\hat{y}_{\theta_i}(x)] \right)$$
BALSA Improvement Strategy:
$$\text{BALD}(x) = \sum_{i=1}^{k} \phi\left(\hat{y}_{\theta_i}(x), \bar{y}(x)\right)$$
where $\phi$ is a distance function that directly measures the distance between predictive distributions.
Grid Sampling Method:
- Normalize target values to $[0, 1]$
- Sample each predictive distribution at 200 grid points
- Compute likelihood vectors and average: $\bar{p} \mid x = \frac{1}{k} \sum_{j=1}^{k} \hat{p}_{\theta_j} \mid x$
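The grid steps above can be illustrated with a short sketch. It assumes Gaussian densities as stand-ins for the k parameter samples (a normalizing flow would supply its own likelihood function), and the helper names `make_grid`, `likelihood_vector`, and `average_distribution` are hypothetical. Whether and how the likelihood vectors are rescaled afterwards is left open here.

```python
import math

def gaussian_pdf(y, mu, sigma):
    """Stand-in density (assumption); a flow would provide its own likelihood."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def make_grid(n_points=200):
    """Evenly spaced grid over the normalized target range [0, 1]."""
    return [j / (n_points - 1) for j in range(n_points)]

def likelihood_vector(pdf, grid):
    """Evaluate one predictive density at every grid point."""
    return [pdf(y) for y in grid]

def average_distribution(vectors):
    """Element-wise mean of k likelihood vectors (the averaged distribution)."""
    k = len(vectors)
    return [sum(column) / k for column in zip(*vectors)]

# k = 3 hypothetical parameter samples, each given as (mu, sigma).
grid = make_grid()
members = [(0.45, 0.10), (0.50, 0.12), (0.55, 0.10)]
vectors = [
    likelihood_vector(lambda y, m=m, s=s: gaussian_pdf(y, m, s), grid)
    for m, s in members
]
p_bar = average_distribution(vectors)
```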
Pairwise Comparison Method:
- Avoid computing average distribution
- Use $k-1$ pairs of parameter samples: $\sum_{i=1}^{k-1} \phi(\hat{p}_{\theta_i} \mid x, \hat{p}_{\theta_{i+1}} \mid x)$
BALSA_KL (Kullback-Leibler Divergence):
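- Grid version: $\text{BALSA}_{\text{KL}}^{\text{Grid}}(x) = \sum_{i=1}^{k} \text{KL}(\hat{p}_{\theta_i} \mid x, \bar{p} \mid x)$
- Pairwise version: $\text{BALSA}_{\text{KL}}^{\text{Pair}}(x) = \sum_{i=1}^{k-1} \text{KL}(\hat{p}_{\theta_i} \mid x, \hat{p}_{\theta_{i+1}} \mid x)$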
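As a rough sketch of the two KL variants, the code below operates on likelihood vectors that have already been evaluated on a shared grid. The function names and the `eps` smoothing term are assumptions for illustration, and the discrete sum is only an approximation of the KL divergence between the underlying densities.

```python
import math

def kl(p, q, eps=1e-12):
    """Discrete KL-style sum over grid points; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def balsa_kl_grid(vectors):
    """Sum of KL terms between each member and the averaged distribution."""
    k = len(vectors)
    p_bar = [sum(column) / k for column in zip(*vectors)]
    return sum(kl(p, p_bar) for p in vectors)

def balsa_kl_pair(vectors):
    """Sum of KL terms over the k-1 consecutive pairs; no average is needed."""
    return sum(kl(p, q) for p, q in zip(vectors, vectors[1:]))

# Toy likelihood vectors on a 3-point grid (placeholder values).
agree = [[0.2, 0.6, 0.2]] * 3
disagree = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
```

When all parameter samples predict the same distribution, both scores are zero. They grow as the sampled distributions disagree, which is the behavior an epistemic-uncertainty score should have.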
BALSA_EMD (Earth Mover's Distance):
$$\text{BALSA}_{\text{EMD}}(x) = \sum_{i=1}^{k-1} \text{EMD}(y'_{\theta_i}, y'_{\theta_{i+1}})$$
where $y'_{\theta} \sim \hat{p}_{\theta} \mid x$
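The EMD variant can be sketched for one-dimensional targets, where the earth mover's distance between two equal-size sample sets reduces to the mean absolute difference of the sorted samples. The Gaussian sampling below is only a stand-in for drawing from each predictive distribution, and all names are hypothetical.

```python
import random

def emd_1d(a, b):
    """1-D earth mover's distance between two equal-size sample sets."""
    if len(a) != len(b):
        raise ValueError("sample sets must have the same size")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

def balsa_emd(sample_sets):
    """Sum of EMD terms over the k-1 consecutive pairs of sample sets."""
    return sum(emd_1d(a, b) for a, b in zip(sample_sets, sample_sets[1:]))

# Draw samples from k = 3 stand-in predictive distributions (assumed Gaussian).
rng = random.Random(0)
means = [0.40, 0.50, 0.60]
sample_sets = [[rng.gauss(mu, 0.05) for _ in range(500)] for mu in means]
score = balsa_emd(sample_sets)
```

Because this variant works on samples rather than likelihood values, it does not require evaluating densities on a grid.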
Four regression datasets are used, covering different scales and complexities:
| Dataset | Features | Training Samples | Initial Labeled Set | Budget |
|---|---|---|---|---|
| Parkinsons | 61 | 3,760 | 200 | 800 |
| Superconductors | 81 | 13,608 | 200 | 800 |
| Sarcos | 21 | 28,470 | 200 | 1,200 |
| Diamonds | 26 | 34,522 | 200 | 1,200 |
- Primary Metric: Negative Log-Likelihood (NLL)
- Auxiliary Metrics: Mean Absolute Error (MAE), CRPS Score
- Statistical Method: Wilcoxon signed-rank test, results aggregated using CD diagrams
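For the Gaussian case, the three metrics listed above have simple closed forms, sketched below. This is an illustration under the assumption of a Gaussian predictive distribution with mean `mu` and standard deviation `sigma`. For normalizing flows, NLL comes from the flow's own log-likelihood and CRPS would typically need a numerical or sample-based estimate; how the paper computes them is not specified here.

```python
import math

def gaussian_nll(y, mu, sigma):
    """Negative log-likelihood of y under N(mu, sigma^2)."""
    return 0.5 * math.log(2.0 * math.pi * sigma ** 2) + (y - mu) ** 2 / (2.0 * sigma ** 2)

def gaussian_crps(y, mu, sigma):
    """Closed-form CRPS of a Gaussian forecast (lower is better)."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))

def mean_absolute_error(targets, predictions):
    """MAE between targets and point predictions (e.g. predictive means)."""
    return sum(abs(t - p) for t, p in zip(targets, predictions)) / len(targets)
```

NLL and CRPS both score the full predictive distribution, whereas MAE only scores a point prediction. That difference is why the metrics can disagree for models with well-placed means but poorly calibrated spread.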
- Clustering Methods: Coreset, CoreGCN, TypiClust
- Heuristic Methods: Standard Deviation (Std), Least Confidence (LC), Shannon Entropy (Entropy)
- BALD Variants: BALD_σ, BALD_LC, BALD_H
- Proposed Methods: BALSA_KL Grid/Pair, BALSA_EMD
- Model Architecture: MLP encoder + distribution decoder
- Normalizing Flows: Autoregressive neural spline flows with rational quadratic spline transformations
- Optimizer: NAdam
- Dropout Rate: 0.008-0.05 (optimized for each dataset)
- Experimental Repetitions: 30 runs per experiment
Critical Difference diagram based on NLL metric shows:
- BALSA_KL Pairs: Best average ranking, optimal performance
- BALSA_KL Grid: Close second, second-best ranking
- BALD_H: Third ranking
- Coreset: Best performance among geometric methods
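A critical difference diagram orders methods by their average rank across datasets. The sketch below shows how such average ranks can be computed from per-dataset scores where lower is better, as with NLL. The method names and numbers are placeholders, not results from the paper, and ties are not handled.

```python
def average_ranks(scores):
    """Average rank per method; scores maps name -> per-dataset values (lower is better)."""
    methods = list(scores)
    n_datasets = len(next(iter(scores.values())))
    totals = {m: 0.0 for m in methods}
    for d in range(n_datasets):
        # Rank 1 goes to the method with the lowest value on dataset d.
        ordered = sorted(methods, key=lambda m: scores[m][d])
        for rank, m in enumerate(ordered, start=1):
            totals[m] += rank
    return {m: totals[m] / n_datasets for m in methods}

# Placeholder values for three hypothetical methods on three datasets.
toy = {
    "method_a": [0.9, 1.2, 0.7],
    "method_b": [1.0, 1.1, 0.8],
    "method_c": [1.5, 1.6, 1.4],
}
ranks = average_ranks(toy)
```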
Key Findings:
- Traditional heuristic methods (entropy, standard deviation, least confidence) perform poorly on normalizing flows
- BALSA methods show clear advantages on normalizing flow architectures
- Coreset and CoreGCN perform better on GNN architectures
Tests the effect of using different dropout rates during training and evaluation:
- Inconsistent results: BALSA_EMD dual shows performance degradation, BALSA_KL Grid dual shows slight improvement
- Hypothesis: Dropout rate switching may affect model prediction quality
Tests normalized version of BALSA_KL Grid:
- Normalized version shows slightly lower performance than non-normalized version
- Simpler non-normalized formula is selected
Performance for query sizes τ ∈ {50, 200}:
- Uncertainty sampling methods maintain performance with larger query sizes
- Clustering algorithms (Coreset, TypiClust) show faster performance degradation
- Contradicts common understanding from classification tasks
Active learning trajectory on Diamonds dataset demonstrates:
- BALSA methods converge faster
- Traditional heuristic methods approach random sampling performance
- Consistent performance across NLL and MAE metrics
- Geometric Methods: Coreset, CoreGCN, TypiClust and others based on data geometric properties
- Uncertainty Methods: Mostly bound to specific model architectures with poor generalizability
- BALD Algorithm: One of few model-agnostic methods
Berry and Meger's work [1, 2]:
- Proposes normalizing flow ensembles and MC dropout approximations
- Validation only on synthetic data
- This paper extends to real data and multiple acquisition functions
- Uses Shannon entropy rather than the simpler $-\sum \log \hat{y}_{\theta}(x)$
- Extends to real-world datasets
- Compares with multiple active learning algorithms
- Method Effectiveness: BALSA performs excellently on normalizing flows, particularly the BALSA_KL Pairs variant
- Heuristic Failure: Traditional uncertainty heuristics perform poorly on normalizing flows
- Architecture Dependency: Different algorithms show significant performance variations across model architectures
- Query Size Impact: Uncertainty methods are more stable with larger query sizes
- Insufficient Theoretical Analysis: Lacks convergence analysis for BALSA algorithm
- Computational Overhead: MC dropout and distribution distance computation increase computational cost
- Hyperparameter Sensitivity: Dropout rate selection significantly impacts performance
- Dataset Limitations: Validation on only 4 datasets, generalizability remains to be verified
- Extend to other parameter sampling methods (Langevin Dynamics, SVGD)
- Theoretical analysis of BALSA convergence properties
- Investigate additional distribution distance metrics
- Validate on larger-scale datasets
- Problem Importance: Addresses the overlooked yet important problem of regression active learning
- Method Novelty: First to directly apply distribution distances to active learning, avoiding information loss from aggregation methods
- Experimental Comprehensiveness: Multi-dataset, multi-architecture, multi-metric comprehensive evaluation
- Practical Value: Provides reproducible code and detailed experimental settings
- Weak Theoretical Foundation: Lacks theoretical analysis explaining why BALSA is more effective
- Computational Efficiency: MC dropout and EMD computation may impact practical applications
- Hyperparameter Tuning: Lack of principled guidance for dropout rate selection
- Evaluation Limitations: Primarily based on NLL, consistency across other regression metrics remains to be verified
- Academic Contribution: Provides new research direction for regression active learning
- Practical Value: Particularly suitable for regression applications requiring uncertainty quantification
- Reproducibility: Complete code and experimental configuration provided, facilitating subsequent research
- Scientific Computing: Physics/chemistry modeling requiring uncertainty quantification
- Risk Assessment: Finance, healthcare and other uncertainty-sensitive domains
- Engineering Optimization: Design optimization problems requiring exploration-exploitation balance
- Time Series: Prediction tasks with complex distributions
This paper primarily references the following key works:
- Berry & Meger (2023): Uncertainty modeling with normalizing flow ensembles
- Gal et al. (2017): Original BALD algorithm proposal
- Sener & Savarese (2017): Coreset active learning method
- Durkan et al. (2019): Technical foundation of neural spline flows
Overall Assessment: This is high-quality research addressing the important yet overlooked problem of active learning for regression. BALSA fills a gap in applying normalizing flows to active learning, and the experimental design is sound with convincing results. While there remains room for improvement in theoretical analysis and computational efficiency, the paper makes an important contribution to the field.