Uncertainty-aware machine learners, such as Bayesian neural networks, output a quantification of uncertainty instead of a point prediction. In this work, we provide uncertainty-aware learners with a principled framework to characterize, and identify ways to eliminate, errors that arise from reducible (epistemic) uncertainty. We introduce a principled definition of epistemic error, and provide a decompositional epistemic error bound which operates in the very general setting of imperfect multitask learning under distribution shift. In this setting, the training (source) data may arise from multiple tasks, the test (target) data may differ systematically from the source data tasks, and/or the learner may not arrive at an accurate characterization of the source data. Our bound separately attributes epistemic errors to each of multiple aspects of the learning procedure and environment. As corollaries of the general result, we provide epistemic error bounds specialized to the settings of Bayesian transfer learning and distribution shift within $ε$-neighborhoods. We additionally leverage the terms in our bound to provide a novel definition of negative transfer.
- Paper ID: 2505.23496
- Title: Epistemic Errors of Imperfect Multitask Learners When Distributions Shift
- Authors: Sabina J. Sloman, Michele Caprio, Samuel Kaski
- Classification: cs.LG stat.ML
- Publication Date: October 13, 2025 (arXiv preprint)
- Paper Link: https://arxiv.org/abs/2505.23496
This paper provides a principled framework for uncertainty-aware machine learning models (such as Bayesian neural networks) to characterize and reduce errors caused by reducible (epistemic) uncertainty. The paper introduces a principled definition of epistemic error and provides decomposable epistemic error bounds in the very general setting of imperfect multitask learning under distribution shift. In this setting, training (source) data may originate from multiple tasks, test (target) data may exhibit systematic differences from source tasks, and/or the learner may fail to accurately characterize source data. The bound attributes epistemic error to multiple aspects of both the learning process and the environment.
The core problem this research addresses is: How can we provide a theoretical framework for uncertainty-aware learners to characterize and reduce epistemic error? Specifically:
- Limitations of Traditional Learning Theory: Existing statistical learning theory primarily focuses on generalization error, but for learners that quantify output uncertainty, prediction error is an irrelevant, incomplete, or uninformative performance measure.
- Confusion of Uncertainty Types: Traditional approaches conflate reducible epistemic uncertainty with irreducible aleatoric uncertainty, failing to effectively guide model improvement.
- Lack of Theoretical Support for Complex Learning Scenarios: Complex real-world scenarios involving multitask learning, distribution shift, and imperfect learning lack theoretical guidance.
- Practical Application Value: Accurate uncertainty quantification is critical in high-risk domains such as healthcare
- Theoretical Advancement: Fills gaps in uncertainty-aware learning theory
- Practical Guidance: Provides theoretical basis for model selection and optimization
- Traditional frameworks such as PAC learning theory cannot distinguish epistemic error from aleatoric error
- Lack of unified theoretical framework for multitask learning and distribution shift scenarios
- Existing bounds typically assume perfect learning or absence of distribution shift
- Introduction of Epistemic Error Bounds: Proposes epistemic error bounds as a new theoretical tool specifically designed for uncertainty-aware learners
- Decomposable Epistemic Error Bounds: Provides bounds that decompose epistemic error into three components in the general setting of imperfect multitask learning with distribution shift
- Corollaries for Special Cases: Provides specialized epistemic error bounds for Bayesian transfer learning and distribution shift within ε-neighborhoods
- New Definition of Negative Transfer: Provides new theoretical characterization of negative transfer phenomena based on terms in the bounds
Epistemic error is defined as the degree to which the learner's understanding of the data-generating process (DGP) is incorrect, formalized as:
$$e := d_{TV}(\hat{P}, Q_t)$$
where $\hat{P}$ is the learner's predictive distribution, $Q_t$ is the target task distribution, and $d_{TV}$ is the total variation distance.
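As a concrete illustration of this definition, the sketch below computes the total variation distance for distributions on a finite outcome space, where it reduces to half the L1 distance between probability vectors. The function name `tv_distance` and the example distributions are hypothetical and chosen only for illustration; the paper's setting is more general.

```python
import numpy as np

def tv_distance(p, q):
    """Total variation distance between two discrete distributions
    defined on the same finite support (half the L1 distance)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return 0.5 * np.abs(p - q).sum()

# Hypothetical predictive distribution and target task on three outcomes.
p_hat = [0.5, 0.3, 0.2]
q_t = [0.4, 0.4, 0.2]

# Epistemic error e = d_TV(P_hat, Q_t) for this toy example.
epistemic_error = tv_distance(p_hat, q_t)
print(epistemic_error)
```

An error of 0 would mean the learner's predictive distribution matches the target task exactly; 1 is the maximum possible value.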
- Task Distribution: Tasks themselves are sampled from a second-order task distribution $\mathcal{Q} \in \Delta(\Delta_{\mathcal{X}})$
- Source Tasks: Training data comes from $n$ source tasks, each task $Q \sim \mathcal{Q}_S$
- Target Task: Test task $Q_t \sim \mathcal{Q}_T$
- Distribution Shift: Occurs when $\mathcal{Q}_S \neq \mathcal{Q}_T$
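- Centroid of Task Distribution (Definition 1):
$$\bar{Q}(x) := \int_{\Delta_{\mathcal{X}}} Q(x)\, q(Q)\, dQ = \mathbb{E}_{Q \sim \mathcal{Q}}[Q(x)]$$
- Variability of Task Distribution (Definition 2):
$$V[\mathcal{Q}] := \sup_{x \in \mathcal{X}} \int_{\Delta_{\mathcal{X}}} \left[Q(x) - \bar{Q}(x)\right]^2 q(Q)\, dQ$$
- Approximation Bias (Definition 7):
$$B := d_{TV}(P^*, \bar{Q}_S)$$
where $P^* = \arg\min_{P \in \pi} d_{TV}(P, \bar{Q}_S)$
- Convergence Shortfall (Definition 8):
$$C := d_{TV}(\hat{P}, P^*)$$
- Degree of Distribution Shift (Definition 9):
$$D := d_{TV}(\bar{Q}_S, \bar{Q}_T)$$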
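The definitions above can be made concrete in a toy discrete setting. The sketch below assumes that each second-order task distribution is a finite, weighted set of tasks on a three-point outcome space, and that the model class π is a small finite set of candidate distributions. All names and numbers are hypothetical; the integrals in Definitions 1 and 2 become weighted sums under these assumptions.

```python
import numpy as np

def tv_distance(p, q):
    """Total variation distance on a finite support."""
    return 0.5 * np.abs(np.asarray(p, float) - np.asarray(q, float)).sum()

def centroid(tasks, weights):
    """Centroid: weighted average of the task distributions (Definition 1)."""
    return np.average(tasks, axis=0, weights=weights)

def variability(tasks, weights):
    """Variability: largest per-outcome variance around the centroid (Definition 2)."""
    q_bar = centroid(tasks, weights)
    per_outcome_var = np.average((tasks - q_bar) ** 2, axis=0, weights=weights)
    return per_outcome_var.max()

# Hypothetical source and target tasks (rows are distributions over 3 outcomes).
source_tasks = np.array([[0.6, 0.3, 0.1], [0.4, 0.5, 0.1]])
target_tasks = np.array([[0.3, 0.3, 0.4], [0.1, 0.5, 0.4]])
w = np.array([0.5, 0.5])  # equal weight on each task (assumption)

q_bar_s = centroid(source_tasks, w)
q_bar_t = centroid(target_tasks, w)

# Hypothetical finite model class pi and the learner's actual predictor P_hat.
model_class = np.array([[0.5, 0.45, 0.05], [0.4, 0.4, 0.2], [0.6, 0.2, 0.2]])
p_star = min(model_class, key=lambda p: tv_distance(p, q_bar_s))
p_hat = model_class[1]  # a learner that has not converged to p_star

B = tv_distance(p_star, q_bar_s)   # approximation bias
C = tv_distance(p_hat, p_star)     # convergence shortfall
D = tv_distance(q_bar_s, q_bar_t)  # degree of distribution shift
V_T = variability(target_tasks, w) # target task variability
print(B, C, D, V_T)
```

In this example the shift term D dominates, which would suggest that collecting more source data (reducing C) or enlarging the model class (reducing B) would help less than addressing the mismatch between source and target.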
Given model class $\pi$, predictor $\hat{P} \in \pi$, source task distribution $\mathcal{Q}_S$, and second-order bounded target task distribution $\mathcal{Q}_T$:
$$\Pr(e \geq \alpha + B + C + D) \leq \frac{V[\mathcal{Q}_T]}{\alpha^2}$$
This bound decomposes epistemic error into:
- B: Model Limitations (approximation bias)
- C: Data Scarcity (convergence shortfall)
- D: Distribution Shift
- $V[\mathcal{Q}_T]$: Target task variability
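To see how the terms interact, the sketch below evaluates the bound numerically. It assumes the right-hand side has the Chebyshev form V[Q_T]/α² (variability divided by α squared), and it clips the result at 1 because a probability bound above 1 is vacuous. The function names and input values are hypothetical.

```python
import math

def epistemic_error_bound(alpha, B, C, D, V_T):
    """Return (threshold, probability bound), assuming the bound reads
    Pr(e >= alpha + B + C + D) <= V_T / alpha**2."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    threshold = alpha + B + C + D
    prob = min(1.0, V_T / alpha**2)  # clip: bounds above 1 carry no information
    return threshold, prob

def alpha_for_confidence(delta, V_T):
    """Smallest alpha for which the bound guarantees failure probability <= delta."""
    if not 0 < delta <= 1:
        raise ValueError("delta must be in (0, 1]")
    return math.sqrt(V_T / delta)

# Hypothetical values for the three error terms and the target variability.
threshold, prob = epistemic_error_bound(alpha=0.2, B=0.05, C=0.15, D=0.3, V_T=0.01)
alpha_needed = alpha_for_confidence(delta=0.25, V_T=0.01)
print(threshold, prob, alpha_needed)
```

Read this way, B, C, and D shift the error threshold, while the target variability controls how much slack α is needed to reach a given confidence level.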
Uses the triangle inequality to construct a path in metric space:
$$d_{TV}(\hat{P}, Q_t) \leq d_{TV}(\hat{P}, P^*) + d_{TV}(P^*, \bar{Q}_S) + d_{TV}(\bar{Q}_S, \bar{Q}_T) + d_{TV}(\bar{Q}_T, Q_t)$$
Combined with Chebyshev's inequality to control the impact of task variability.
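The two steps can be chained as follows. This is a sketch of the argument's structure only: the final inequality is where Chebyshev's inequality enters, and the precise way it relates the distance between a target task and the target centroid to the variability term depends on the paper's technical assumptions (such as second-order boundedness), which are not reproduced here.

```latex
\begin{align*}
e = d_{TV}(\hat{P}, Q_t)
  &\leq \underbrace{d_{TV}(\hat{P}, P^*)}_{C}
      + \underbrace{d_{TV}(P^*, \bar{Q}_S)}_{B}
      + \underbrace{d_{TV}(\bar{Q}_S, \bar{Q}_T)}_{D}
      + d_{TV}(\bar{Q}_T, Q_t) \\
\Rightarrow\quad
\{ e \geq \alpha + B + C + D \}
  &\subseteq \{ d_{TV}(\bar{Q}_T, Q_t) \geq \alpha \} \\
\Rightarrow\quad
\Pr(e \geq \alpha + B + C + D)
  &\leq \Pr\big( d_{TV}(\bar{Q}_T, Q_t) \geq \alpha \big)
   \leq \frac{V[\mathcal{Q}_T]}{\alpha^2}
\end{align*}
```

Only the last term of the triangle inequality is random (through the draw of the target task), which is why B, C, and D appear as deterministic offsets to the threshold.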
- Unified Framework: First to handle multitask learning, imperfect learning, and distribution shift within a single framework
- Decomposable Analysis: Decomposes complex epistemic error into interpretable components
- Practical Guidance: Each component corresponds to concrete improvement strategies
- Theoretical Rigor: Based on rigorous metric space analysis and probability theory
For Bayesian learners, the convergence shortfall term can be expressed as posterior convergence:
$$C^\Theta := d_{TV}(P_1^\Theta, P_*^\Theta)$$
This directly connects posterior convergence to epistemic error.
Under ε-neighborhood constraints:
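$$\Pr(e \geq \alpha + B + C + D) \leq \frac{\beta \left( V[\mathcal{Q}_S] + \mathrm{vol}(\mathcal{Q}_T) \right)}{\alpha^2}$$
where $\beta = (1 - b_T)/b_S$ and $\mathrm{vol}(\mathcal{Q}_T) = (\mathrm{diam}(\mathcal{Q}_S) + \varepsilon)^2$.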
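The dependence of this bound on the neighborhood size can be sketched numerically. The code below assumes the right-hand side is β(V[Q_S] + vol(Q_T))/α² with vol(Q_T) = (diam(Q_S) + ε)², and treats β, the source variability, and the source diameter as given inputs, since this summary does not define how they are obtained. All names and values are hypothetical.

```python
def neighborhood_bound(alpha, beta, V_S, diam_S, eps):
    """Probability bound under an epsilon-neighborhood assumption, taking
    the right-hand side to be beta * (V_S + (diam_S + eps)**2) / alpha**2."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    vol_T = (diam_S + eps) ** 2  # proxy for how far target tasks can spread
    return min(1.0, beta * (V_S + vol_T) / alpha**2)  # clip vacuous values

# Hypothetical inputs: the bound loosens as the neighborhood grows.
bounds = [
    neighborhood_bound(alpha=0.5, beta=1.0, V_S=0.01, diam_S=0.1, eps=eps)
    for eps in (0.0, 0.1, 0.2)
]
print(bounds)
```

The bound is monotonically non-decreasing in ε, so allowing the target tasks to lie further from the source tasks can only weaken the guarantee.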
- Model: Bayesian linear regression
- Data Generation: $x \sim \mathcal{N}(\beta_1^S \xi_1 + \beta_2^S \xi_2, \sigma^S)$
- Prior: Normal-Inverse-Gamma model
- Distance Approximation: Uses Pinsker's inequality, which upper-bounds total variation distance by $\sqrt{\mathrm{KL}/2}$, as a tractable surrogate for the total variation distance
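Pinsker's inequality states that d_TV(P, Q) ≤ sqrt(KL(P‖Q)/2). The sketch below compares this surrogate with the exact total variation distance in a case where both are available in closed form: two univariate Gaussians with equal variance. This Gaussian example is an assumption chosen for illustration and is not the predictive distribution used in the paper's experiments.

```python
import math

def kl_gaussian(mu1, sigma1, mu2, sigma2):
    """KL( N(mu1, sigma1^2) || N(mu2, sigma2^2) ) for univariate Gaussians."""
    return (math.log(sigma2 / sigma1)
            + (sigma1**2 + (mu1 - mu2) ** 2) / (2 * sigma2**2)
            - 0.5)

def pinsker_bound(kl):
    """Pinsker upper bound on total variation distance, clipped at 1."""
    return min(1.0, math.sqrt(kl / 2.0))

def tv_equal_variance_gaussians(mu1, mu2, sigma):
    """Exact TV distance between N(mu1, sigma^2) and N(mu2, sigma^2)."""
    return math.erf(abs(mu1 - mu2) / (2.0 * math.sqrt(2.0) * sigma))

# Hypothetical pair of unit-variance Gaussians one standard deviation apart.
kl = kl_gaussian(0.0, 1.0, 1.0, 1.0)
surrogate = pinsker_bound(kl)
exact = tv_equal_variance_gaussians(0.0, 1.0, 1.0)
print(kl, surrogate, exact)
```

Because the surrogate is an upper bound, error values computed this way may overstate the true total variation distance.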
- Posterior Convergence Effect (Figure 1a): Epistemic error decreases as posterior probability of source data-generating parameters increases
- Neighborhood Size Effect (Figure 1b): Epistemic error increases with ε-neighborhood size
- Negative Transfer Phenomenon (Figure 3): Bound tightness is highly correlated with negative transfer phenomena
- Theoretical predictions align closely with experimental observations
- Bounds become looser in negative transfer cases, consistent with theoretical analysis
- Relative importance of components varies across scenarios
- Multitask Domain Generalization: Baxter (2000), Maurer et al., but without considering distribution shift
- Domain Adaptation Theory: Redko et al. (2019), but assumes learner knows distribution shift
- Credal Learning Theory: Caprio et al. (2024), but limited to specific learners
- Bayesian Deep Learning: Papamarkou et al. (2024)
- Conformal Prediction: Angelopoulos and Bates (2023)
- Credal Learning: Caprio et al. (2024)
- More General Setting: Simultaneously handles multitask learning, imperfect learning, and distribution shift
- Learner-Agnostic: Does not depend on specific learning algorithms
- Decomposable Analysis: Provides actionable improvement guidance
- Provides the first decomposable epistemic error bound for uncertainty-aware learners
- Works in very general settings, covering diverse practical scenarios
- Provides theoretical guidance framework for model selection and optimization
- Computational Complexity: Total variation distance is typically difficult to compute exactly
- Assumption Constraints: Requires technical assumptions such as second-order bounded distributions
- Conformal Prediction: Framework cannot fully characterize conformal prediction settings
- Experimental Validation: Validation only on low-dimensional synthetic data
- Extension to time-dependent tasks and data
- Complete characterization of conformal prediction settings
- Experimental validation on high-dimensional and real data
- Development of more computationally tractable bound variants
- Strong Theoretical Innovation: First systematic theoretical framework for uncertainty-aware learning
- High Practical Value: Decomposable analysis directly guides practical improvements
- Mathematical Rigor: Complete proofs with solid theoretical foundations
- Clear Presentation: Well-structured with clear concept definitions
- Computational Feasibility: Practical computation of theoretical results poses challenges
- Experimental Limitations: Limited experimental scale and complexity
- Strict Assumptions: Some technical assumptions may be difficult to satisfy in practice
- Incomplete Coverage: Incomplete support for certain uncertainty quantification methods (e.g., conformal prediction)
- Theoretical Contribution: Establishes foundation for uncertainty-aware learning theory
- Practical Guidance: Provides basis for model selection in high-risk applications
- Research Inspiration: Opens new research directions
- Medical Diagnosis: Clinical predictions requiring accurate uncertainty quantification
- Financial Risk: Risk modeling in multi-market environments
- Autonomous Driving: Safety decision-making under environmental changes
- Scientific Discovery: Cross-domain knowledge transfer
This paper cites important works from statistical learning theory, Bayesian inference, and uncertainty quantification, including:
- Shalev-Shwartz & Ben-David (2014): Foundations of statistical learning theory
- Papamarkou et al. (2024): Bayesian deep learning
- Angelopoulos & Bates (2023): Conformal prediction
- Redko et al. (2019): Domain adaptation theory
This paper makes a significant theoretical contribution to uncertainty-aware machine learning, providing a solid theoretical foundation and a practical analytical framework for the field. While there is room for improvement in computational feasibility and experimental validation, its theoretical innovation and practical value make it a notable work.