Predictive posteriors under hidden confounding

Meixide, Insua
Predicting outcomes in external domains is challenging due to hidden confounders that potentially influence both predictors and outcomes. Well-established methods frequently rely on stringent assumptions, explicit knowledge about the distribution shift across domains, or bias-inducing regularization schemes to enhance generalization. While recent developments in point prediction under hidden confounding attempt to mitigate these shortcomings, they generally do not provide principled uncertainty quantification. We introduce a Bayesian framework that yields well-calibrated predictive distributions across external domains, supports valid model inference, and achieves posterior contraction rates that improve as the number of observed datasets increases. Simulations and a medical application highlight the remarkable empirical coverage of our approach, nearly unchanged when transitioning from low- to moderate-dimensional settings.
Basic Information

  • Paper ID: 2507.05170
  • Title: Predictive posteriors under hidden confounding
  • Authors: Carlos García Meixide, David Ríos Insua
  • Classification: stat.ME
  • Publication Date: 11 Oct 2025 (arXiv:2507.05170v2 [stat.ME])
  • Paper Link: https://arxiv.org/abs/2507.05170v2

Abstract

Predicting outcomes in external domains is challenging because hidden confounders may affect both predictors and outcomes. Existing methods typically rely on stringent assumptions, explicit knowledge of the distribution shift across domains, or bias-inducing regularization schemes to improve generalization. Recent point-prediction methods under hidden confounding attempt to mitigate these shortcomings, but they generally do not provide principled uncertainty quantification. The paper introduces a Bayesian framework that yields well-calibrated predictive distributions in external domains, supports valid model inference, and achieves posterior contraction rates that improve as the number of observed datasets increases. Simulations and a medical application show empirical coverage that remains nearly unchanged when moving from low- to moderate-dimensional settings.

Research Background and Motivation

Problem Definition

The core problem addressed by this research is how to make reliable probabilistic predictions, with calibrated uncertainty quantification, in external domains that exhibit distributional shift and hidden confounding.

Problem Significance

  1. Ubiquity of distributional shift: Machine learning applications frequently encounter inconsistencies between training and test domain distributions, challenging standard i.i.d. assumptions
  2. Impact of hidden confounding: Unobserved confounding variables simultaneously affect predictor variables X and outcome variables Y, causing traditional methods to fail
  3. Demand for uncertainty quantification: Existing methods primarily focus on point predictions, lacking principled uncertainty quantification mechanisms

Limitations of Existing Methods

  1. Distributionally robust optimization: Employs minimax optimization but requires introducing bias to enhance robustness
  2. Causal invariance methods: Such as anchor regression, rely on stringent invariance assumptions that are easily violated under hidden confounding
  3. Conformal prediction: While capable of providing prediction intervals, has limited capacity to handle distributional shifts
  4. Existing causal methods: Primarily provide point estimates, lacking uncertainty quantification

Research Motivation

Building upon prior work on Generative Invariance (GI), the authors aim to construct a unified Bayesian framework that addresses two long-standing problems at once: causal discovery and calibrated prediction.

Core Contributions

  1. First Bayesian framework: Proposes a complete Bayesian framework for probabilistic prediction under hidden confounding, enabling simultaneous causal discovery and prediction
  2. Theoretical guarantees: Establishes posterior consistency, contraction rates, and a Bernstein-von Mises theorem, characterizing the asymptotic behavior of the method
  3. Hypothesis testing capability: Provides the first computationally tractable hypothesis test for determining whether a variable is a parent node of the target response in linear structural equation models
  4. Calibrated predictions: Achieves well-calibrated predictions in shifted domains, with empirical coverage close to the nominal level
  5. Identifiability spectrum: Is the first to explicitly describe weak identifiability as the finite-sample manifestation of an asymptotic phenomenon

Methodology Details

Task Definition

Given heterogeneous data sources from E training environments and one target test environment, the task is to:

  • Input: (X,Y) pairs from training environments, X from test environment
  • Output: Calibrated predictive distribution for Y in test environment and credible intervals for causal parameters
  • Constraint: Hidden confounders affect both X and Y

Model Architecture

Structural Equation Model

The foundational model is:

X ← ∑_z 1{Z = z}X_z
Y ← α* + γ*^⊤ X + ε_Y

where Z is an environment indicator, and ε_Y may be correlated with X_z (hidden confounding).
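To make the role of ε_Y concrete, here is a minimal simulation sketch of a one-dimensional version of this model. It is not the authors' code. The function name `simulate_environment`, the noise scales, and the way a hidden confounder H enters both X and ε_Y are illustrative assumptions, chosen so that Var(X) = 1 and Cov(X, ε_Y) = -0.25. Under those assumptions the within-environment OLS slope targets γ* + Cov(X, ε_Y)/Var(X) = 0.75 rather than the structural value γ* = 1.

```python
import numpy as np

def simulate_environment(n, mu, alpha=0.0, gamma=1.0, seed=0):
    """Draw (X, Y) from one environment of a 1-D linear SEM with hidden confounding.

    All distributional choices below are assumptions made for illustration.
    """
    rng = np.random.default_rng(seed)
    h = rng.normal(0.0, 0.5, size=n)                     # hidden confounder, Var(H) = 0.25
    x = mu + h + rng.normal(0.0, np.sqrt(0.75), size=n)  # Var(X) = 0.25 + 0.75 = 1
    eps_y = -1.0 * h + rng.normal(0.0, 0.5, size=n)      # Cov(X, eps_Y) = -0.25
    y = alpha + gamma * x + eps_y                        # structural equation for Y
    return x, y

x, y = simulate_environment(n=50_000, mu=2.0)
ols_slope = np.polyfit(x, y, deg=1)[0]  # polyfit returns [slope, intercept] for deg=1
print(f"OLS slope: {ols_slope:.3f} (structural gamma = 1.0)")
```

The gap between the printed slope and 1.0 is the confounding bias that a plain regression cannot remove from a single environment.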

Hierarchical Bayesian Model

For each environment e, the likelihood is:

X_ei ~ N_p(μ_e, Σ_e)
Y_ei | X_ei, w, ϑ_e ~ N(α + γ^⊤ X_ei + K^⊤(X_ei - μ_e), σ_Y^2)

Key parameters:

  • w = (β, K): β = (α, γ) contains regression coefficients, K absorbs hidden confounding effects
  • ϑ_e = (μ_e, Σ_e, σ_Y^2): Environment-specific nuisance parameters
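The likelihood above can be written as a short function. This is a sketch under the stated Gaussian model only: the function name `env_log_likelihood` and its argument layout are assumptions, and the code evaluates the joint log-density of one environment's data rather than reproducing the authors' Stan implementation.

```python
import numpy as np

def env_log_likelihood(X, y, alpha, gamma, K, mu, Sigma, sigma_y):
    """Joint Gaussian log-likelihood of (X, y) for a single environment.

    X: (n, p) covariates, y: (n,) responses, gamma and K: (p,) vectors,
    mu: (p,) environment mean, Sigma: (p, p) covariance, sigma_y: scalar.
    """
    n, p = X.shape
    centered = X - mu
    # log N_p(X_i | mu, Sigma), evaluated row by row
    _, logdet = np.linalg.slogdet(Sigma)
    solved = np.linalg.solve(Sigma, centered.T).T
    quad = np.einsum("ij,ij->i", centered, solved)
    ll_x = -0.5 * (p * np.log(2.0 * np.pi) + logdet + quad)
    # log N(y_i | alpha + gamma'X_i + K'(X_i - mu), sigma_y^2)
    mean_y = alpha + X @ gamma + centered @ K
    resid = y - mean_y
    ll_y = -0.5 * (np.log(2.0 * np.pi * sigma_y**2) + (resid / sigma_y) ** 2)
    return float(np.sum(ll_x + ll_y))
```

Summing this quantity over environments, with w = (β, K) shared and ϑ_e varying, gives the full-data log-likelihood implied by the two lines above.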

Prior Specification

The model uses ridge-type Gaussian priors on the regression and confounding coefficients:

μ_1, ..., μ_E ~ N_p(μ̂, Σ_μ)
α ~ N(0, τ^2 σ_Y^2)
(γ, K) | τ^2, σ_Y^2 ~ N_2p(0, τ^2 σ_Y^2 I_2p)
σ_Y ~ π(σ_Y) ∝ 1/σ_Y
τ^2 ~ Beta-prime(a_τ, b_τ)
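As a sanity check on the prior scale, the proper part of this specification can be sampled directly. The sketch below is illustrative: `sample_prior` is a hypothetical helper, σ_Y is held fixed because its prior ∝ 1/σ_Y is improper and cannot be sampled, and the Beta-prime draw uses the standard fact that B/(1 - B) is Beta-prime(a, b) when B ~ Beta(a, b).

```python
import numpy as np

def sample_prior(p, sigma_y=1.0, a_tau=0.5, b_tau=0.5, n_draws=1, seed=0):
    """Draw (tau^2, alpha, gamma, K) from the ridge-type prior, given a fixed sigma_y."""
    rng = np.random.default_rng(seed)
    b = rng.beta(a_tau, b_tau, size=n_draws)
    b = np.clip(b, 1e-12, 1.0 - 1e-12)   # guard against division by zero at the boundary
    tau2 = b / (1.0 - b)                 # tau^2 ~ Beta-prime(a_tau, b_tau)
    scale = np.sqrt(tau2) * sigma_y      # prior sd of each coefficient: tau * sigma_Y
    alpha = rng.normal(size=n_draws) * scale
    gamma_K = rng.normal(size=(n_draws, 2 * p)) * scale[:, None]
    return {"tau2": tau2, "alpha": alpha,
            "gamma": gamma_K[:, :p], "K": gamma_K[:, p:]}

draws = sample_prior(p=3, n_draws=200_000)
tau_median = float(np.median(np.sqrt(draws["tau2"])))
print(f"median of tau: {tau_median:.3f}")
```

With a_τ = b_τ = 1/2 the implied τ is standard half-Cauchy, whose median is 1, so the printed value should sit close to 1.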

Technical Innovations

1. Confounding Correction Mechanism

Explicitly models the impact of hidden confounding through the term K^⊤(X_ei - μ_e), where:

  • K captures the covariance structure between hidden confounders and observed variables
  • This term has expectation zero within each environment, so it does not affect intercept estimation
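One consequence of the zero-mean property is worth spelling out. Averaging the Y-equation within environment e gives E[Y | e] = α + γ^⊤ μ_e, so the environment means carry information about (α, γ) that is free of K, whereas a regression within one environment targets γ + K. The sketch below checks this numerically. It is not the authors' estimator: the helper names, the identity covariance for X, and all parameter values are assumptions made for illustration.

```python
import numpy as np

def simulate_env(n, mu, alpha, gamma, K, sigma_y, rng):
    """Simulate one environment from the stated likelihood (identity covariance assumed)."""
    X = mu + rng.normal(size=(n, len(mu)))
    mean_y = alpha + X @ gamma + (X - mu) @ K
    y = mean_y + sigma_y * rng.normal(size=n)
    return X, y

def fit_from_environment_means(X_by_env, y_by_env):
    """Least-squares fit of E[Y | e] = alpha + gamma' mu_e using sample means per environment."""
    x_means = np.array([X.mean(axis=0) for X in X_by_env])
    y_means = np.array([y.mean() for y in y_by_env])
    design = np.column_stack([np.ones(len(x_means)), x_means])
    coef, *_ = np.linalg.lstsq(design, y_means, rcond=None)
    return coef[0], coef[1:]

rng = np.random.default_rng(1)
alpha, gamma, K = 0.5, np.array([1.0, -2.0]), np.array([-0.25, 0.4])
mus = [np.array([0.0, 0.0]), np.array([3.0, 0.0]), np.array([0.0, 3.0])]  # E = p + 1 = 3
data = [simulate_env(20_000, mu, alpha, gamma, K, 0.5, rng) for mu in mus]

alpha_hat, gamma_hat = fit_from_environment_means([d[0] for d in data],
                                                  [d[1] for d in data])
# OLS inside the first environment alone recovers gamma + K, not gamma.
X0, y0 = data[0]
design0 = np.column_stack([np.ones(len(y0)), X0])
within_slope = np.linalg.lstsq(design0, y0, rcond=None)[0][1:]
print("from environment means:", alpha_hat, gamma_hat)
print("within one environment:", within_slope)
```

The fit from means needs at least p + 1 environments with affinely independent means, which is one way to read the identifiability requirement discussed later in this summary.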

2. Environmental Heterogeneity Modeling

Treats environment means μ_e as random quantities sampled from a common prior distribution rather than fixed parameters, achieving beneficial shrinkage effects.

3. Identifiability Handling

When identifiability conditions are nearly violated, the Bayesian approach avoids numerical instability of frequentist methods through controlled shrinkage.

4. Causal Discovery Criterion

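Proposes a decision rule based on the posterior distribution: variable j is deemed a causal parent of Y when min{|{i: γ_ji < 0}|, |{i: γ_ji > 0}|} < αm. Here γ_ji is read as the i-th of m posterior draws of the coefficient γ_j, and α is a significance level (not the intercept α of the structural equation). In words, j is flagged when the posterior mass of γ_j lies almost entirely on one side of zero.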
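The rule is simple to apply to MCMC output. The following sketch assumes the posterior draws of γ_j are available as a one-dimensional array; the function name `is_causal_parent` and the default level 0.05 are illustrative choices, not taken from the paper.

```python
import numpy as np

def is_causal_parent(gamma_draws_j, level=0.05):
    """Flag variable j when fewer than level * m draws fall on the minority side of zero."""
    draws = np.asarray(gamma_draws_j, dtype=float)
    m = draws.size
    n_neg = int(np.count_nonzero(draws < 0))
    n_pos = int(np.count_nonzero(draws > 0))
    return min(n_neg, n_pos) < level * m

# Draws concentrated away from zero are flagged; draws straddling zero are not.
print(is_causal_parent(np.linspace(0.5, 1.5, 1000)))
print(is_causal_parent(np.linspace(-1.0, 1.0, 1000)))
```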

Experimental Setup

Datasets

Simulation Experiments

  1. Single-source example: One-dimensional setting, n₁=500, hidden confounder H~N(0,0.5²)
  2. Multi-source example: Multi-dimensional setting, E=p+1 environments with systematic variation in environment means

Real Data

BMI Analysis: Multi-province Spanish data

  • Predictor variables: Lifestyle factors (alcohol consumption, smoking habits, sleep quality, etc.)
  • Outcome variable: BMI
  • Hidden confounding: Gender, cholesterol, and blood glucose levels
  • Environment indicator: Province

Evaluation Metrics

  1. Empirical coverage rate: Proportion of prediction intervals containing true values
  2. Causal discovery accuracy: Ability to correctly identify causal variables
  3. Prediction calibration: Match between predictive distribution and true distribution
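For reference, the empirical coverage rate can be computed from posterior predictive draws as sketched below. The use of equal-tailed intervals and the array layout (draws in rows, test points in columns) are assumptions; the summary does not state how the paper constructs its intervals.

```python
import numpy as np

def empirical_coverage(pred_draws, y_true, level=0.95):
    """Share of test outcomes inside the equal-tailed predictive interval.

    pred_draws: (n_draws, n_test) posterior predictive samples.
    y_true:     (n_test,) observed outcomes in the test environment.
    """
    tail = (1.0 - level) / 2.0
    lower = np.quantile(pred_draws, tail, axis=0)
    upper = np.quantile(pred_draws, 1.0 - tail, axis=0)
    inside = (y_true >= lower) & (y_true <= upper)
    return float(np.mean(inside))
```

A well-calibrated predictive distribution gives a value close to `level`; values well below it indicate intervals that are too narrow or misplaced.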

Comparison Methods

  1. OLS: Ordinary least squares
  2. IV: Instrumental variables
  3. Standard Bayesian linear regression

Implementation Details

  • MCMC sampling: Implemented using RStan, 4 chains × 1000 iterations
  • Hyperparameters: a_τ = b_τ = 1/2, which corresponds to a standard half-Cauchy prior on τ
  • Parallel computing: 8 cores, 3 simulations per core

Experimental Results

Main Results

Simulation Performance

Average empirical coverage rates in multi-dimensional settings, reported as OLS / proposed method (rows: sample size n; columns: dimension p):

n     | 2D      | 5D      | 10D
200   | .88/.96 | .85/.95 | .87/.90
500   | .91/.95 | .88/.93 | .83/.94
1000  | .89/.95 | .88/.95 | .85/.94
2000  | .90/.95 | .83/.94 | .80/.95

Key Findings:

  • The proposed method outperforms OLS in all scenarios
  • Coverage rates remain relatively stable as dimensionality increases
  • OLS performance deteriorates noticeably with increasing dimensionality
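These findings can be checked with a little arithmetic on the reported numbers. The snippet below copies the twelve coverage pairs from the table and computes each method's mean absolute gap from the nominal level, assuming a nominal level of 0.95 (the level quoted for the single-source example).

```python
# (OLS, proposed) coverage pairs copied from the table, keyed by (n, p).
coverage = {
    (200, 2): (0.88, 0.96), (200, 5): (0.85, 0.95), (200, 10): (0.87, 0.90),
    (500, 2): (0.91, 0.95), (500, 5): (0.88, 0.93), (500, 10): (0.83, 0.94),
    (1000, 2): (0.89, 0.95), (1000, 5): (0.88, 0.95), (1000, 10): (0.85, 0.94),
    (2000, 2): (0.90, 0.95), (2000, 5): (0.83, 0.94), (2000, 10): (0.80, 0.95),
}
nominal = 0.95  # assumed nominal level

ols_gap = sum(abs(ols - nominal) for ols, _ in coverage.values()) / len(coverage)
proposed_gap = sum(abs(prop - nominal) for _, prop in coverage.values()) / len(coverage)
closer_everywhere = all(abs(prop - nominal) < abs(ols - nominal)
                        for ols, prop in coverage.values())

print(f"mean |coverage - 0.95|: OLS {ols_gap:.3f}, proposed {proposed_gap:.3f}")
print("proposed closer to nominal in every cell:", closer_everywhere)
```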

Single-source Example Results

  • Parameter estimation: Posterior distributions of β and K correctly centered at true values 1 and -0.25
  • Predictive performance: Empirical coverage rate 0.96, approaching theoretical level 0.95
  • Comparative effect: OLS and IV predictions completely miss the target

Medical Application Results

  • Empirical coverage rate: 0.95 (ideal level)
  • Causal discovery: Physical activity is the only variable identified as causal
  • Comparative analysis: OLS incorrectly flags several variables that are correlated with BMI but not causal (e.g., former-smoker status)

Theoretical Verification

Figure 2 demonstrates the weak identifiability phenomenon: as μ→0, the posterior shrinks toward the prior mean, avoiding matrix singularity issues encountered by frequentist methods.
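A toy conjugate analogue shows the mechanism. It is not the paper's hierarchical model: the setup below regresses environment-level mean responses on environment means with a N(0, I) prior on the slopes and unit noise variance, and every name and number in it is an assumption. As the means are scaled toward zero, the least-squares solution is dominated by noise, while the ridge-type posterior mean collapses smoothly to the prior mean of zero.

```python
import numpy as np

def ridge_posterior_mean(design, y, prior_precision=1.0):
    """Posterior mean of the slopes under a N(0, I / prior_precision) prior, unit noise variance."""
    p = design.shape[1]
    gram = design.T @ design + prior_precision * np.eye(p)
    return np.linalg.solve(gram, design.T @ y)

rng = np.random.default_rng(2)
gamma_true = np.array([1.0, -2.0])
base_means = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])  # hypothetical mean directions
noise = 0.1 * rng.normal(size=3)                                # noise in the mean responses

results = {}
for scale in (10.0, 1e-3):
    means = scale * base_means                 # environment means shrink with scale
    y_means = means @ gamma_true + noise
    ls = np.linalg.lstsq(means, y_means, rcond=None)[0]
    ridge = ridge_posterior_mean(means, y_means)
    results[scale] = (ls, ridge)
    print(f"scale={scale:g}: |LS|={np.linalg.norm(ls):.3g}, |ridge|={np.linalg.norm(ridge):.3g}")
```

With well-separated means both estimates land near the true slopes; with nearly identical means only the regularized one stays bounded, at the cost of being pulled to the prior.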

Main Research Directions

  1. Distributionally robust optimization: Minimax methods by Sinha et al. (2020)
  2. Causal invariance: Invariant prediction methods by Peters et al. (2016)
  3. Anchor regression: Heterogeneous data causal methods by Rothenhäusler et al. (2021)
  4. Conformal prediction: Robust prediction intervals by Tibshirani et al. (2019)

Advantages of This Work

  1. Unified framework: Simultaneously addresses causal discovery and prediction calibration
  2. Theoretical guarantees: Provides complete asymptotic theory
  3. Practicality: Requires no hyperparameter tuning or specific distributional shift knowledge
  4. Robustness: Maintains validity under hidden confounding

Conclusions and Discussion

Main Conclusions

  1. Successfully constructs a Bayesian predictive framework under hidden confounding
  2. Achieves calibrated probabilistic predictions and effective causal discovery
  3. Provides complete theoretical foundation and empirical validation
  4. Maintains stable performance across low to moderate-dimensional settings

Limitations

  1. Gaussian assumption: Current framework assumes covariates follow Gaussian distribution
  2. Linear models: Limited to linear structural equation models
  3. Computational complexity: MCMC sampling may be slow in high-dimensional settings
  4. Environment quantity: Requires sufficient training environments to ensure identifiability

Future Directions

  1. Nonparametric extensions: Integrate martingale posterior framework to eliminate likelihood-prior specification requirements
  2. Adversarial learning: Application to adversarial machine learning scenarios
  3. Relaxed assumptions: Allow confounding distribution to vary across environments
  4. PAC guarantees: Establish marginal PAC-learning theoretical guarantees

In-Depth Evaluation

Strengths

  1. Theoretical completeness: Provides comprehensive theoretical analysis from posterior consistency to Bernstein-von Mises theorem
  2. Methodological innovation: First to achieve hypothesis testing for causal discovery under hidden confounding
  3. Practical value: Unified solution to two long-standing problems, causal discovery and calibrated prediction
  4. Experimental sufficiency: Comprehensive validation from simulations to real applications
  5. Writing clarity: Rigorous mathematical derivations with clear conceptual explanations

Weaknesses

  1. Assumption limitations: Gaussian assumptions and linear models restrict applicability
  2. Computational efficiency: MCMC methods may be slow on large-scale data
  3. Prior sensitivity: Although the authors claim robustness to the prior, under weak identifiability the prior still influences the posterior
  4. Environment requirements: Requires multiple training environments, potentially limiting practical applications

Impact

  1. Academic contribution: Provides new theoretical framework for causal inference and prediction calibration
  2. Practical value: Broad application prospects in fields with hidden confounding such as medicine and economics
  3. Methodological significance: Demonstrates advantages of Bayesian methods in handling identifiability issues

Applicable Scenarios

  1. Medical research: Epidemiological studies with unobserved confounding factors
  2. Economics: Causal inference in policy evaluation
  3. Machine learning: Domain adaptation and distributional shift problems
  4. Social sciences: Causal analysis in observational studies

References

  1. Rothenhäusler, D., et al. (2021). Anchor regression: Heterogeneous data meet causality. Journal of the Royal Statistical Society Series B, 83(2), 215-246.
  2. Peters, J., Bühlmann, P., & Meinshausen, N. (2016). Causal inference by using invariant prediction: Identification and confidence intervals. Journal of the Royal Statistical Society Series B, 78(5), 947-1012.
  3. Tibshirani, R. J., et al. (2019). Conformal prediction under covariate shift. Advances in Neural Information Processing Systems, 32.
  4. Meixide, C. G., & Insua, D. R. (2025). Unsupervised domain adaptation under hidden confounding. arXiv preprint.