
Predictive posteriors under hidden confounding

Meixide, Insua

Predicting outcomes in external domains is challenging due to hidden confounders that potentially influence both predictors and outcomes. Well-established methods frequently rely on stringent assumptions, explicit knowledge about the distribution shift across domains, or bias-inducing regularization schemes to enhance generalization. While recent developments in point prediction under hidden confounding attempt to mitigate these shortcomings, they generally do not provide principled uncertainty quantification. We introduce a Bayesian framework that yields well-calibrated predictive distributions across external domains, supports valid model inference, and achieves posterior contraction rates that improve as the number of observed datasets increases. Simulations and a medical application highlight the remarkable empirical coverage of our approach, nearly unchanged when transitioning from low- to moderate-dimensional settings.


Basic Information

Paper ID: 2507.05170
Title: Predictive posteriors under hidden confounding
Authors: Carlos García Meixide, David Ríos Insua
Classification: stat.ME
Publication Date: 11 Oct 2025 (arXiv:2507.05170v2 [stat.ME])
Paper Link: https://arxiv.org/abs/2507.05170v2

Abstract

Predicting outcomes in external domains is challenging because hidden confounders may simultaneously affect predictor and outcome variables. Existing methods typically rely on stringent assumptions, explicit knowledge of cross-domain distributional shifts, or bias-inducing regularization schemes to enhance generalization. While recent point prediction methods under hidden confounding attempt to mitigate these shortcomings, they typically fail to provide principled uncertainty quantification. This paper introduces a Bayesian framework that produces well-calibrated predictive distributions in external domains, supports valid model inference, and achieves posterior contraction rates that improve as the number of observed datasets increases. Simulations and a medical application highlight the method's remarkable empirical coverage, which remains nearly unchanged when moving from low-dimensional to moderate-dimensional settings.

Research Background and Motivation

Problem Definition

The core problem addressed by this research is: how can one make reliable probabilistic predictions, with calibrated uncertainty quantification, in external domains subject to distributional shift when hidden confounders are present?

Problem Significance

Ubiquity of distributional shift: Machine learning applications frequently encounter inconsistencies between training and test domain distributions, challenging standard i.i.d. assumptions
Impact of hidden confounding: Unobserved confounding variables simultaneously affect predictor variables X and outcome variables Y, causing traditional methods to fail
Demand for uncertainty quantification: Existing methods primarily focus on point predictions, lacking principled uncertainty quantification mechanisms

Limitations of Existing Methods

Distributionally robust optimization: Employs minimax optimization but requires introducing bias to enhance robustness
Causal invariance methods: Approaches such as anchor regression rely on stringent invariance assumptions that are easily violated under hidden confounding
Conformal prediction: Provides prediction intervals but has limited capacity to handle distributional shifts
Existing causal methods: Primarily provide point estimates, lacking uncertainty quantification

Research Motivation

Building upon prior work on Generative Invariance (GI), the authors aim to construct a unified Bayesian framework that simultaneously addresses two long-standing problems: causal discovery and calibrated prediction.

Core Contributions

First Bayesian framework: Proposes a complete Bayesian framework for probabilistic prediction under hidden confounding, enabling simultaneous causal discovery and prediction
Theoretical guarantees: Establishes posterior consistency, contraction rates, and a Bernstein-von Mises theorem, which together characterize the asymptotic behavior of the method
Hypothesis testing capability: Provides the first computationally tractable hypothesis test for determining whether a variable is a parent node of the target response in linear structural equation models
Calibrated predictions: Achieves well-calibrated predictions in distributional shift domains with coverage rates approaching theoretical levels
Identifiability spectrum: The first explicit treatment of weak identifiability as an empirical manifestation of an asymptotic phenomenon

Methodology Details

Task Definition

Given heterogeneous data sources from E training environments and one target test environment, the task is to:

Input: (X,Y) pairs from training environments, X from test environment
Output: Calibrated predictive distribution for Y in test environment and credible intervals for causal parameters
Constraint: Hidden confounders affect both X and Y

Model Architecture

Structural Equation Model

The foundational model is:

X ← ∑_z 1{Z = z}X_z
Y ← α* + γ*^⊤ X + ε_Y

where Z is an environment indicator, and ε_Y may be correlated with X_z (hidden confounding).
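The effect of the correlation between ε_Y and X_z can be seen in a small simulation. The sketch below is illustrative only: it assumes a single scalar predictor, a standard normal hidden confounder h, and made-up values for alpha, gamma, and conf_strength, none of which come from the paper. It shows that the ordinary least squares slope within one environment drifts away from the structural coefficient γ*.

```python
import random

def simulate_environment(n, mean_x, alpha=0.0, gamma=1.0, conf_strength=0.5, seed=0):
    """Draw n (x, y) pairs from a 1-D linear SEM with a hidden confounder h.

    All numeric defaults are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        h = rng.gauss(0.0, 1.0)                          # hidden confounder, never observed
        x = mean_x + h + rng.gauss(0.0, 1.0)             # predictor in this environment
        eps_y = conf_strength * h + rng.gauss(0.0, 1.0)  # outcome noise, correlated with x via h
        xs.append(x)
        ys.append(alpha + gamma * x + eps_y)
    return xs, ys

def ols_slope(xs, ys):
    """Least squares slope of y on x (with intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

xs, ys = simulate_environment(n=20000, mean_x=2.0)
slope = ols_slope(xs, ys)
print(f"structural gamma = 1.0, OLS slope = {slope:.3f}")
```

Under these assumptions the population OLS slope is γ + Cov(X, ε_Y)/Var(X) = 1 + 0.5/2 = 1.25, so the estimate settles near 1.25 rather than 1.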

Hierarchical Bayesian Model

For each environment e, the likelihood is:

X_ei ~ N_p(μ_e, Σ_e)
Y_ei | X_ei, w, ϑ_e ~ N(α + γ^⊤ X_ei + K^⊤(X_ei - μ_e), σ_Y^2)

Key parameters:

w = (β, K): β = (α, γ) contains regression coefficients, K absorbs hidden confounding effects
ϑ_e = (μ_e, Σ_e, σ_Y^2): Environment-specific nuisance parameters
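A minimal sketch of the outcome part of this likelihood, written with plain Python lists. The function names (conditional_mean, outcome_log_likelihood) and the example numbers are hypothetical; the formula follows the Gaussian likelihood for Y_ei stated above, and the marginal density of X_ei is omitted.

```python
import math

def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def conditional_mean(x, alpha, gamma, K, mu_e):
    """Mean of Y given X = x in environment e: alpha + gamma'x + K'(x - mu_e)."""
    centered = [xi - mi for xi, mi in zip(x, mu_e)]
    return alpha + dot(gamma, x) + dot(K, centered)

def outcome_log_likelihood(xs, ys, alpha, gamma, K, mu_e, sigma_y):
    """Sum over one environment of log N(y | conditional mean, sigma_y^2)."""
    total = 0.0
    for x, y in zip(xs, ys):
        resid = y - conditional_mean(x, alpha, gamma, K, mu_e)
        total += -0.5 * math.log(2 * math.pi * sigma_y ** 2) - 0.5 * (resid / sigma_y) ** 2
    return total

# Illustrative values only: at x = mu_e the K term vanishes.
mu_e = [1.0, 2.0]
mean_at_center = conditional_mean(mu_e, alpha=0.5, gamma=[1.0, -1.0], K=[3.0, 4.0], mu_e=mu_e)
print(mean_at_center)  # alpha + gamma'mu_e
```

Evaluating the mean at x = μ_e makes the role of K visible: the correction only acts on deviations of X_ei from its environment mean.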

Prior Specification

Employs ridge-type Gaussian priors:

μ_1, ..., μ_E ~ N_p(μ̂, Σ_μ)
α | τ^2, σ_Y^2 ~ N(0, τ^2 σ_Y^2)
(γ, K) | τ^2, σ_Y^2 ~ N_2p(0, τ^2 σ_Y^2 I_2p)
σ_Y ~ π(σ_Y) ∝ 1/σ_Y
τ^2 ~ Beta-prime(a_τ, b_τ)
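To get a feel for the global scale prior, one can draw τ² from a Beta-prime distribution using the standard representation as a ratio of independent Gamma variates. The sketch below is an illustration, not the paper's sampler; the helper name sample_beta_prime, the seed, and the sample size are assumptions. With a_τ = b_τ = 1/2, the implied τ = sqrt(τ²) follows a standard half-Cauchy distribution, whose median is 1.

```python
import math
import random
import statistics

def sample_beta_prime(a, b, rng):
    """Draw from Beta-prime(a, b) as Gamma(a, 1) / Gamma(b, 1) with independent variates."""
    return rng.gammavariate(a, 1.0) / rng.gammavariate(b, 1.0)

rng = random.Random(1)  # fixed seed so the sketch is reproducible
tau_draws = [math.sqrt(sample_beta_prime(0.5, 0.5, rng)) for _ in range(20000)]
median_tau = statistics.median(tau_draws)
print(f"median of tau draws: {median_tau:.3f} (half-Cauchy median is 1)")
```

The heavy right tail of these draws is what lets the prior leave large coefficients nearly unshrunk while still pulling weakly identified ones toward zero.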

Technical Innovations

1. Confounding Correction Mechanism

Explicitly models the impact of hidden confounding through the term K^⊤(X_ei - μ_e), where:

K captures the covariance structure between hidden confounders and observed variables
Because E[X_ei - μ_e] = 0, this term has expectation zero in each environment and so does not affect intercept estimation
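One way to see why the zero-mean property matters is to average the likelihood for Y_ei over X_ei within each environment. The derivation below is a sketch that follows from the Gaussian model as summarized here; it is not a restatement of the paper's identification theorem.

```latex
\mathbb{E}[Y_{ei}]
  = \alpha + \gamma^\top \mu_e + K^\top \mathbb{E}[X_{ei} - \mu_e]
  = \alpha + \gamma^\top \mu_e,
  \qquad e = 1, \dots, E.

\begin{pmatrix} \mathbb{E}[Y_{1i}] \\ \vdots \\ \mathbb{E}[Y_{Ei}] \end{pmatrix}
  =
\begin{pmatrix} 1 & \mu_1^\top \\ \vdots & \vdots \\ 1 & \mu_E^\top \end{pmatrix}
\begin{pmatrix} \alpha \\ \gamma \end{pmatrix}.
```

The stacked system has p + 1 unknowns, so it pins down (α, γ) only when the E × (p + 1) matrix has full column rank, which requires at least p + 1 environments whose means vary enough. When the environment means nearly coincide, the matrix is close to singular and β is only weakly identified.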

Simulated Data

Single-source example: One-dimensional setting with n₁ = 500 observations and hidden confounder H ~ N(0, 0.5²)
Multi-source example: Multi-dimensional setting with E = p + 1 environments and systematic variation in environment means

Real Data

BMI Analysis: Multi-province Spanish data

Predictor variables: Lifestyle factors (alcohol consumption, smoking habits, sleep quality, etc.)
Outcome variable: BMI
Hidden confounding: Gender, cholesterol, and blood glucose levels
Environment indicator: Province

Evaluation Metrics

Empirical coverage rate: Proportion of prediction intervals containing true values
Causal discovery accuracy: Ability to correctly identify causal variables
Prediction calibration: Match between predictive distribution and true distribution
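The first metric is simple to compute once prediction intervals are available. A minimal sketch, with a hypothetical helper name and toy numbers chosen only to show the calculation:

```python
def empirical_coverage(intervals, truths):
    """Fraction of (lower, upper) intervals that contain the matching true value."""
    if len(intervals) != len(truths):
        raise ValueError("intervals and truths must have the same length")
    hits = sum(1 for (lower, upper), y in zip(intervals, truths) if lower <= y <= upper)
    return hits / len(truths)

# Toy data: the second interval misses its target, the other three cover.
toy_intervals = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0), (3.0, 4.0)]
toy_truths = [0.5, 2.5, 2.2, 3.9]
toy_coverage = empirical_coverage(toy_intervals, toy_truths)
print(toy_coverage)  # 0.75
```

For a well-calibrated 95% predictive interval, this fraction should be close to 0.95 on held-out data from the test environment.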

Comparison Methods

OLS: Ordinary least squares
IV: Instrumental variables
Standard Bayesian linear regression

Implementation Details

MCMC sampling: Implemented using RStan, 4 chains × 1000 iterations
Hyperparameters: a_τ = b_τ = 1/2 (equivalent to a standard half-Cauchy prior on τ)
Parallel computing: 8 cores, 3 simulations per core

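Multi-source Example Results

Empirical coverage by sample size n (rows) and dimension p (columns); each cell lists two methods, read here as OLS / proposed method.

n	p = 2	p = 5	p = 10
200	.88/.96	.85/.95	.87/.90
500	.91/.95	.88/.93	.83/.94
1000	.89/.95	.88/.95	.85/.94
2000	.90/.95	.83/.94	.80/.95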

Key Findings:

The proposed method outperforms OLS in all scenarios
Coverage rates remain relatively stable as dimensionality increases
OLS performance deteriorates noticeably with increasing dimensionality
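These findings can be checked against the reported coverage table with a few lines of arithmetic. The sketch below rests on two assumptions that the table itself does not state: that the first number in each cell is OLS and the second is the proposed method, and that the nominal level is 0.95.

```python
# Coverage pairs keyed by (n, p); values copied from the table.
# Assumed cell order: (OLS, proposed method).
cells = {
    (200, 2): (0.88, 0.96), (200, 5): (0.85, 0.95), (200, 10): (0.87, 0.90),
    (500, 2): (0.91, 0.95), (500, 5): (0.88, 0.93), (500, 10): (0.83, 0.94),
    (1000, 2): (0.89, 0.95), (1000, 5): (0.88, 0.95), (1000, 10): (0.85, 0.94),
    (2000, 2): (0.90, 0.95), (2000, 5): (0.83, 0.94), (2000, 10): (0.80, 0.95),
}
NOMINAL = 0.95  # assumed nominal coverage level

def mean_abs_gap(index):
    """Average absolute distance from the nominal level for one entry of each cell."""
    gaps = [abs(pair[index] - NOMINAL) for pair in cells.values()]
    return sum(gaps) / len(gaps)

first_gap = mean_abs_gap(0)   # first entry of each cell
second_gap = mean_abs_gap(1)  # second entry of each cell
print(f"mean gap, first entry: {first_gap:.3f}; second entry: {second_gap:.3f}")
```

Under that reading, the first entries sit about 0.086 below the nominal level on average, while the second entries are within about 0.009 of it.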

Single-source Example Results

Parameter estimation: Posterior distributions of β and K are centered at the true values 1 and -0.25, respectively
Predictive performance: Empirical coverage rate 0.96, approaching theoretical level 0.95
Comparative effect: OLS and IV predictions completely miss the target

Medical Application Results

Empirical coverage rate: 0.95 (ideal level)
Causal discovery: Physical activity is identified as the sole causal variable
Comparative analysis: OLS incorrectly identifies multiple correlated but non-causal variables (e.g., former-smoker status)

Theoretical Verification

Figure 2 demonstrates the weak identifiability phenomenon: as μ→0, the posterior shrinks toward the prior mean, avoiding matrix singularity issues encountered by frequentist methods.
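The shrinkage behavior can be illustrated with a toy conjugate calculation. The sketch makes simplifying assumptions that are not taken from the paper: a single environment, a scalar coefficient β with no intercept, a sample mean ȳ treated as N(β x̄, s²/n), and a N(0, t²) prior on β. The function name and all numbers are hypothetical.

```python
def posterior_for_beta(xbar, ybar, noise_var, n, prior_var):
    """Conjugate update for ybar ~ N(beta * xbar, noise_var / n) with beta ~ N(0, prior_var).

    Returns the posterior mean and variance of beta.
    """
    precision = xbar ** 2 * n / noise_var + 1.0 / prior_var
    mean = (xbar * ybar * n / noise_var) / precision
    return mean, 1.0 / precision

true_beta = 1.0
for xbar in [1.0, 0.1, 0.01, 0.0]:
    ybar = true_beta * xbar  # noise-free sample mean, for illustration only
    post_mean, post_var = posterior_for_beta(xbar, ybar, noise_var=1.0, n=500, prior_var=1.0)
    # The moment estimator ybar / xbar is undefined when xbar is zero.
    moment = ybar / xbar if xbar != 0.0 else float("nan")
    print(f"xbar={xbar:<5} moment={moment:<5} post_mean={post_mean:.3f} post_var={post_var:.3f}")
```

As x̄ shrinks, the data carry less information about β: the moment estimator ȳ/x̄ becomes unstable and is undefined at x̄ = 0, while the posterior mean moves smoothly toward the prior mean 0 and the posterior variance returns to the prior variance.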

Main Research Directions

Distributionally robust optimization: Minimax methods by Sinha et al. (2020)
Causal invariance: Invariant prediction methods by Peters et al. (2016)
Anchor regression: Heterogeneous data causal methods by Rothenhäusler et al. (2021)
Conformal prediction: Robust prediction intervals by Tibshirani et al. (2019)

Advantages of This Work

Unified framework: Simultaneously addresses causal discovery and prediction calibration
Theoretical guarantees: Provides complete asymptotic theory
Practicality: Requires no hyperparameter tuning or specific distributional shift knowledge
Robustness: Maintains validity under hidden confounding

Conclusions and Discussion

Main Conclusions

Successfully constructs a Bayesian predictive framework under hidden confounding
Achieves calibrated probabilistic predictions and effective causal discovery
Provides complete theoretical foundation and empirical validation
Maintains stable performance across low to moderate-dimensional settings

Limitations

Gaussian assumption: The current framework assumes covariates follow a Gaussian distribution
Linear models: Limited to linear structural equation models
Computational complexity: MCMC sampling may be slow in high-dimensional settings
Environment quantity: Requires sufficient training environments to ensure identifiability

Future Directions

Nonparametric extensions: Integrate the martingale posterior framework to remove the need to specify a likelihood and prior
Adversarial learning: Application to adversarial machine learning scenarios
Relaxed assumptions: Allow confounding distribution to vary across environments
PAC guarantees: Establish marginal PAC-learning theoretical guarantees

In-Depth Evaluation

Strengths

Theoretical completeness: Provides comprehensive theoretical analysis from posterior consistency to Bernstein-von Mises theorem
Methodological innovation: Provides the first computationally tractable hypothesis test for causal discovery under hidden confounding
Practical value: Unified solution to two long-standing challenging problems
Experimental sufficiency: Comprehensive validation from simulations to real applications
Writing clarity: Rigorous mathematical derivations with clear conceptual explanations

Weaknesses

Assumption limitations: Gaussian assumptions and linear models restrict applicability
Computational efficiency: MCMC methods may be slow on large-scale data
Prior sensitivity: Although the method is presented as robust to prior choice, the prior still influences the posterior under weak identifiability
Environment requirements: Requires multiple training environments, potentially limiting practical applications

Impact

Academic contribution: Provides new theoretical framework for causal inference and prediction calibration
Practical value: Broad application prospects in fields with hidden confounding such as medicine and economics
Methodological significance: Demonstrates advantages of Bayesian methods in handling identifiability issues

Applicable Scenarios

Medical research: Epidemiological studies with unobserved confounding factors
Economics: Causal inference in policy evaluation
Machine learning: Domain adaptation and distributional shift problems
Social sciences: Causal analysis in observational studies

References

Rothenhäusler, D., Meinshausen, N., Bühlmann, P., & Peters, J. (2021). Anchor regression: Heterogeneous data meet causality. Journal of the Royal Statistical Society Series B, 83(2), 215-246.
Peters, J., Bühlmann, P., & Meinshausen, N. (2016). Causal inference by using invariant prediction: Identification and confidence intervals. Journal of the Royal Statistical Society Series B, 78(5), 947-1012.
Tibshirani, R. J., Foygel Barber, R., Candès, E. J., & Ramdas, A. (2019). Conformal prediction under covariate shift. Advances in Neural Information Processing Systems, 32.
Meixide, C. G., & Insua, D. R. (2025). Unsupervised domain adaptation under hidden confounding. arXiv preprint.