Constructing Confidence Intervals for Average Treatment Effects from Multiple Datasets

Wang, Schröder, Frauen et al.
Constructing confidence intervals (CIs) for the average treatment effect (ATE) from patient records is crucial to assess the effectiveness and safety of drugs. However, patient records typically come from different hospitals, thus raising the question of how multiple observational datasets can be effectively combined for this purpose. In our paper, we propose a new method that estimates the ATE from multiple observational datasets and provides valid CIs. Our method makes few assumptions about the observational datasets and is thus widely applicable in medical practice. The key idea of our method is that we leverage prediction-powered inference and thereby essentially 'shrink' the CIs so that we offer more precise uncertainty quantification as compared to naïve approaches. We further prove the unbiasedness of our method and the validity of our CIs. We confirm our theoretical results through various numerical experiments. Finally, we provide an extension of our method for constructing CIs from combinations of experimental and observational datasets.

Basic Information

  • Paper ID: 2412.11511
  • Title: Constructing Confidence Intervals for Average Treatment Effects from Multiple Datasets
  • Authors: Yuxin Wang, Maresa Schröder, Dennis Frauen, Jonas Schweisthal, Konstantin Hess & Stefan Feuerriegel (LMU Munich, MCML)
  • Classification: cs.LG, stat.ML
  • Conference: ICLR 2025
  • Paper Link: https://arxiv.org/abs/2412.11511

Abstract

This paper proposes a novel method for constructing confidence intervals for the average treatment effect (ATE) from multiple observational datasets. The method makes few assumptions about the observational datasets and is therefore broadly applicable in medical practice. The core idea leverages prediction-powered inference (PPI) to "shrink" the confidence intervals, providing more precise uncertainty quantification than naive approaches. The paper proves the unbiasedness of the method and the validity of the confidence intervals, and numerical experiments confirm the theoretical results. Additionally, the method is extended to combinations of experimental and observational datasets.

Research Background and Motivation

Core Problem

In the medical field, constructing confidence intervals for ATE from patient records is crucial for assessing drug efficacy and safety. However, patient records typically originate from different hospitals, making effective integration of multiple observational datasets a key challenge.

Problem Significance

  1. Medical Decision-Making Needs: Reliable confidence intervals are critical for clinical decision-making, ensuring evidence-based treatment selection
  2. Data Fragmentation: Electronic health records are typically distributed across different healthcare institutions and countries, requiring integrated utilization
  3. COVID-19 Case Study: During the pandemic, rapid assessment of drug effects from multi-center data was needed, such as studies on nirmatrelvir/ritonavir

Limitations of Existing Methods

  1. Point Estimation Limitations: Most existing multi-dataset methods focus on point estimation, lacking uncertainty quantification
  2. Naive Approach Problems:
    • Directly concatenating the datasets yields biased estimates because of unobserved confounding in the larger dataset
    • Using only the small dataset discards the information in the large dataset, resulting in unnecessarily wide confidence intervals
  3. Assumption Constraints: Existing methods impose strong assumptions on relationships between datasets

Core Contributions

  1. Novel Methodology: Proposes a prediction-powered inference-based method for constructing multi-dataset ATE confidence intervals
  2. Theoretical Guarantees: Proves unbiasedness of the estimator and validity of the confidence intervals
  3. Broad Applicability: Extends to scenarios combining RCTs and observational datasets
  4. Experimental Validation: Verifies method effectiveness through synthetic and medical data

Methodology Details

Task Definition

Given a small observational dataset D₁ of size n (satisfying unconfoundedness) and a large observational dataset D₂ of size N (allowing unobserved confounding), the goal is to estimate the target population's ATE τ = E[Y¹(1) - Y¹(0)] and construct valid confidence intervals.

Core Assumptions

D₁ Assumptions:

  • Consistency: A¹ = a ⇒ Y¹ = Y¹(a)
  • Overlap: 0 < π¹(x) < 1
  • Unconfoundedness: Y¹(0), Y¹(1) ⊥⊥ A¹ | X¹

D₂ Assumptions (more relaxed):

  • Consistency and overlap, but allowing unobserved confounding
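
To make the difference between the two sets of assumptions concrete, here is a toy simulation. It is not taken from the paper: the data-generating process, the names simulate and difference_in_means, and all constants are assumptions chosen for illustration. In the D₁-style sample, treatment is assigned independently of the unobserved variable u, so a simple comparison of means recovers the ATE. In the D₂-style sample, u also drives treatment, and the same comparison is shifted away from the true value.

```python
import random
from statistics import fmean

TRUE_ATE = 2.0  # chosen for this illustration only


def simulate(n, confounded, seed):
    """Draw (x, a, y) rows; the variable u is never returned (it is unobserved)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0, 1)  # observed covariate
        u = rng.gauss(0, 1)  # unobserved variable that affects the outcome
        if confounded:
            # u also drives treatment, so unconfoundedness given x fails,
            # while overlap (0 < propensity < 1) still holds
            propensity = 0.8 if u > 0 else 0.2
        else:
            propensity = 0.5
        a = 1 if rng.random() < propensity else 0
        y = x + TRUE_ATE * a + u + rng.gauss(0, 1)
        rows.append((x, a, y))
    return rows


def difference_in_means(rows):
    """Naive ATE estimate: mean outcome of treated minus mean outcome of controls."""
    treated = [y for _, a, y in rows if a == 1]
    control = [y for _, a, y in rows if a == 0]
    return fmean(treated) - fmean(control)


d1 = simulate(2_000, confounded=False, seed=1)   # small, unconfounded
d2 = simulate(50_000, confounded=True, seed=2)   # large, confounded by u

estimate_d1 = difference_in_means(d1)  # close to TRUE_ATE, but noisy
estimate_d2 = difference_in_means(d2)  # precise, but shifted upwards by u
```

The large sample gives a tight but biased answer and the small sample an unbiased but noisy one, which is the tension the method is designed to resolve.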

Model Architecture

Four-Step Methodology Framework

Step A: Measure of Fit. Estimate the conditional average treatment effect (CATE) on D₂ using sample splitting, then average the predictions over the samples of D₂:

τ̂₂(x) = E[Y²(1) - Y²(0) | X² = x]
τ̂₂ = (1/N)∑ᵢτ̂₂(xᵢ)
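
A minimal sketch of Step A, under several assumptions: it uses a simple T-learner (one least-squares line per treatment arm) instead of the DR-learner named under Implementation Details, it handles a single covariate, and the helper names fit_line and fit_cate_t_learner as well as the toy data are hypothetical. The split into a training half and a held-out half is one plausible reading of "sample splitting", not necessarily the paper's exact scheme.

```python
import random
from statistics import fmean


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x with one covariate."""
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def fit_cate_t_learner(rows):
    """rows: list of (x, a, y). Returns a function x -> estimated CATE."""
    treated = [(x, y) for x, a, y in rows if a == 1]
    control = [(x, y) for x, a, y in rows if a == 0]
    i1, s1 = fit_line(*zip(*treated))
    i0, s0 = fit_line(*zip(*control))
    return lambda x: (i1 + s1 * x) - (i0 + s0 * x)


# Toy stand-in for D2 (no confounding here, to keep the check simple):
# the true CATE is 2 + x.
rng = random.Random(0)
d2 = []
for _ in range(2000):
    x = rng.gauss(0, 1)
    a = 1 if rng.random() < 0.5 else 0
    y = 1.0 + 0.5 * x + a * (2.0 + x) + rng.gauss(0, 0.1)
    d2.append((x, a, y))

# Sample splitting: fit on one half, average predictions on the other half.
half = len(d2) // 2
train, held_out = d2[:half], d2[half:]
tau2_hat = fit_cate_t_learner(train)
tau2_bar = fmean(tau2_hat(x) for x, _, _ in held_out)
```

Any CATE learner could replace fit_cate_t_learner here, since the later steps only need the function tau2_hat and its average.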

Step B: Influence Function Estimation. Compute the non-centered influence function scores of the AIPW estimator on D₁, using the estimated propensity score π̂¹ and the estimated outcome regressions μ̂₁ and μ̂₀:

Ỹη̂(xᵢ) = (aᵢ¹/π̂¹(xᵢ) - (1-aᵢ¹)/(1-π̂¹(xᵢ)))yᵢ¹ - (aᵢ¹-π̂¹(xᵢ))/(π̂¹(xᵢ)(1-π̂¹(xᵢ)))[(1-π̂¹(xᵢ))μ̂₁(xᵢ) + π̂¹(xᵢ)μ̂₀(xᵢ)]
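
The score above can be transcribed directly into code. The sketch below (function and argument names are hypothetical) also implements the more familiar form of the AIPW score, μ̂₁ - μ̂₀ + a(y - μ̂₁)/π̂ - (1-a)(y - μ̂₀)/(1-π̂), so that the two can be compared numerically; algebraically they are the same expression.

```python
def aipw_pseudo_outcome(a, y, pi, mu1, mu0):
    """Non-centered AIPW score, written term by term as in the formula above."""
    ipw_weight = a / pi - (1 - a) / (1 - pi)
    augmentation = (a - pi) / (pi * (1 - pi)) * ((1 - pi) * mu1 + pi * mu0)
    return ipw_weight * y - augmentation


def aipw_standard_form(a, y, pi, mu1, mu0):
    """Textbook AIPW score: outcome-model difference plus weighted residuals."""
    return mu1 - mu0 + a * (y - mu1) / pi - (1 - a) * (y - mu0) / (1 - pi)


# Treated unit with y = 3, propensity 0.4, outcome predictions 2.5 and 1.0.
score_treated = aipw_pseudo_outcome(a=1, y=3.0, pi=0.4, mu1=2.5, mu0=1.0)
# Control unit with y = 0.5, propensity 0.3, outcome predictions 2.0 and 1.0.
score_control = aipw_pseudo_outcome(a=0, y=0.5, pi=0.3, mu1=2.0, mu0=1.0)
```

Averaging these scores over D₁ gives the usual AIPW estimate of the ATE; Step C subtracts the D₂-based CATE prediction from each score before averaging.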

Step C: Rectifier. Define a rectifier that quantifies the difference between the ATE on D₁ and the CATE predictions learned from D₂, evaluated on the n samples of D₁:

Δ̂τ = (1/n)∑ᵢ[Ỹη̂(xᵢ) - τ̂₂(xᵢ)]
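
As a small numerical illustration of the rectifier, the sketch below averages the per-sample differences between AIPW scores and CATE predictions on D₁. The function name rectifier and the four toy values are hypothetical, and using the sample variance of the differences as the variance estimate is an assumption in the spirit of prediction-powered inference rather than a detail confirmed by this summary.

```python
from statistics import fmean, variance


def rectifier(pseudo_outcomes, cate_predictions_d1):
    """Mean and sample variance of (AIPW score - CATE prediction) over D1."""
    diffs = [p - t for p, t in zip(pseudo_outcomes, cate_predictions_d1)]
    return fmean(diffs), variance(diffs)


# Toy values for four D1 samples.
pseudo = [2.0, 3.0, 1.0, 2.5]       # AIPW scores from Step B
predictions = [1.5, 2.0, 1.5, 2.0]  # tau2_hat evaluated at the same covariates
delta_hat, var_delta = rectifier(pseudo, predictions)
```

When the CATE predictions track the AIPW scores closely, the differences have small variance, which is what allows the final interval to be narrow.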

Step D: Confidence Interval Construction. The prediction-powered ATE estimator is:

τ̂ᴾᴾ = Δ̂τ + τ̂₂

Confidence interval:

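Cᴾᴾα = (τ̂ᴾᴾ ± z_{1-α/2} · √(σ̂²Δ/n + σ̂²τ₂/N))

where z_{1-α/2} is the (1-α/2) quantile of the standard normal distribution, σ̂²Δ is the estimated variance of the rectifier terms on D₁, and σ̂²τ₂ is the estimated variance of the CATE predictions on D₂.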
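
Putting Steps C and D together, a minimal sketch of the interval computation might look as follows. The function name prediction_powered_ci and the toy inputs are hypothetical, and the two variance terms are estimated by plain sample variances, which is an assumption of this sketch.

```python
from math import sqrt
from statistics import NormalDist, fmean, variance


def prediction_powered_ci(pseudo_d1, cate_d1, cate_d2, alpha=0.05):
    """Return (point estimate, (lower, upper)) for the prediction-powered ATE.

    pseudo_d1: AIPW scores on D1 (length n)
    cate_d1:   CATE predictions at the D1 covariates (length n)
    cate_d2:   CATE predictions at the D2 covariates (length N)
    """
    n, N = len(pseudo_d1), len(cate_d2)
    diffs = [p - t for p, t in zip(pseudo_d1, cate_d1)]
    tau_pp = fmean(diffs) + fmean(cate_d2)          # rectifier + measure of fit
    z = NormalDist().inv_cdf(1 - alpha / 2)         # standard normal quantile
    half = z * sqrt(variance(diffs) / n + variance(cate_d2) / N)
    return tau_pp, (tau_pp - half, tau_pp + half)


tau_pp, (lower, upper) = prediction_powered_ci(
    pseudo_d1=[2.0, 3.0, 1.0, 2.5],
    cate_d1=[1.5, 2.0, 1.5, 2.0],
    cate_d2=[1.0, 2.0, 3.0, 2.0],
)
```

With only four samples per dataset the interval is of course very wide; the toy numbers are there to make the arithmetic traceable, not to suggest realistic sample sizes.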

Technical Innovations

  1. PPI Framework Adaptation: First application of the PPI framework to ATE estimation in causal inference
  2. Rectifier Design: A rectifier that corrects for discrepancies between the datasets, including potential unobserved confounding in D₂
  3. Theoretical Guarantees: Provides asymptotic validity proofs ensuring statistical validity of confidence intervals
  4. Flexibility: Supports arbitrary CATE estimators without restricting specific methods

Theoretical Analysis

Theorem 4.2 (Confidence Interval Validity): Under appropriate conditions,

lim inf_{n,N→∞} P(τ ∈ Cᴾᴾα) ≥ 1-α

Key Lemma 4.1: Asymptotic normality of the rectifier

√n(Δ̂τ - τ + E[τ₂]) → N(0, σ²Δ)
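
The following short derivation sketches why the rectifier makes the combined estimator target τ. It is an informal argument under simplifying assumptions that are not spelled out in this summary: the CATE model τ̂₂ is treated as a fixed function (as sample splitting permits), the nuisance functions are taken at their true values so that the AIPW score has mean τ, the covariates in D₁ and D₂ share the same distribution, and the two datasets are independent.

```latex
\begin{align*}
\mathbb{E}\big[\hat{\Delta}_\tau\big]
  &= \mathbb{E}\big[\tilde{Y}_{\eta}(X)\big] - \mathbb{E}\big[\hat{\tau}_2(X)\big]
   = \tau - \mathbb{E}\big[\hat{\tau}_2(X)\big], \\
\mathbb{E}\Big[\tfrac{1}{N}\sum_{i=1}^{N}\hat{\tau}_2(x_i)\Big]
  &= \mathbb{E}\big[\hat{\tau}_2(X)\big], \\
\mathbb{E}\big[\hat{\tau}^{PP}\big]
  &= \big(\tau - \mathbb{E}[\hat{\tau}_2(X)]\big) + \mathbb{E}[\hat{\tau}_2(X)] = \tau, \\
\operatorname{Var}\big(\hat{\tau}^{PP}\big)
  &= \frac{\sigma^2_\Delta}{n} + \frac{\sigma^2_{\tau_2}}{N}.
\end{align*}
```

Whatever bias the D₂-based predictions carry cancels between the two terms, and independence of the datasets lets the two variances add, which matches the form of the standard error in Step D.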

Experimental Setup

Datasets

Synthetic Data:

  • Data generation mechanism based on Gaussian processes
  • Three confounding scenarios: mild, moderate, severe
  • Controllable covariate dimensionality and sample sizes

Medical Data:

  1. MIMIC-III: Effect of mechanical ventilation on red blood cell count in ICU patients
  2. Brazilian COVID-19: Effect of comorbidities on mortality in COVID-19 patients

Evaluation Metrics

  • Confidence Interval Width: Measures precision of uncertainty quantification
  • Coverage Rate: Validates statistical validity of confidence intervals
  • RMSE: Evaluates point estimation accuracy
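
The three metrics can be computed from repeated runs as in the sketch below. The function name evaluate_intervals, the tuple layout (estimate, lower, upper), and the three toy runs are hypothetical and serve only to show the arithmetic.

```python
from math import sqrt
from statistics import fmean


def evaluate_intervals(results, true_ate):
    """results: list of (estimate, lower, upper), one tuple per run."""
    coverage = fmean(1.0 if lo <= true_ate <= hi else 0.0 for _, lo, hi in results)
    width = fmean(hi - lo for _, lo, hi in results)
    rmse = sqrt(fmean((est - true_ate) ** 2 for est, _, _ in results))
    return {"coverage": coverage, "width": width, "rmse": rmse}


# Three toy runs against a true ATE of 2.0; the last interval misses it.
runs = [(1.9, 1.5, 2.3), (2.2, 1.8, 2.6), (2.6, 2.2, 3.0)]
metrics = evaluate_intervals(runs, true_ate=2.0)
```

Coverage requires a known true ATE, so it can be computed directly on synthetic data but only approximated on real data through a reference estimate.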

Comparison Methods

  1. τ̂ᴬᴵᴾᵂ(D₁ only): Naive baseline using only small dataset
  2. τ̂ᴬᴵᴾᵂ(D₂ only): Using only large dataset (biased estimator)
  3. A-TMLE: van der Laan et al.'s method (RCT + observational data)

Implementation Details

  • DR-learner for CATE estimation
  • Linear/logistic regression for nuisance function estimation
  • Cross-fitting to prevent overfitting
  • Results averaged over 5 random seeds
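
Cross-fitting can be sketched generically: each sample receives a nuisance prediction from a model that was fitted without that sample. In the sketch below the helper cross_fit and the deliberately trivial nuisance model fit_constant_propensity (the treated share in the training folds) are hypothetical stand-ins; the paper's setup uses linear and logistic regression instead.

```python
import random


def cross_fit(rows, fit, n_folds=2, seed=0):
    """Return one out-of-fold prediction per row.

    fit(train_rows) must return a function that maps a row to a prediction.
    """
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    folds = [indices[k::n_folds] for k in range(n_folds)]
    predictions = [None] * len(rows)
    for held_out in folds:
        held_out_set = set(held_out)
        train = [rows[i] for i in indices if i not in held_out_set]
        model = fit(train)              # fitted without the held-out rows
        for i in held_out:
            predictions[i] = model(rows[i])
    return predictions


def fit_constant_propensity(train_rows):
    """Placeholder nuisance model: share of treated units in the training rows."""
    share_treated = sum(a for _, a, _ in train_rows) / len(train_rows)
    return lambda row: share_treated


rows = [(0.1, 1, 2.0), (0.4, 0, 1.0), (0.3, 1, 2.5),
        (0.9, 0, 0.5), (0.5, 1, 3.0), (0.7, 0, 1.5)]  # (x, a, y)
pi_hat = cross_fit(rows, fit_constant_propensity, n_folds=3)
```

The same cross_fit helper could wrap the outcome regressions, so that every AIPW score in Step B is built from out-of-fold nuisance estimates.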

Experimental Results

Main Results

Synthetic Data Performance:

  1. Validity: Confidence intervals consistently cover true ATE
  2. Precision Improvement: CI width reduction of 49.99%-55.37% compared to naive methods
  3. Stability: Maintains excellent performance across different confounding strengths

Medical Data Validation:

  • MIMIC-III: CIs approximately 3.5 times narrower
  • COVID-19 data: Excellent performance across different splitting strategies
  • Lowest RMSE and narrowest valid confidence intervals among the compared methods

Sensitivity Analysis

Dataset Size Impact:

  • Advantages more pronounced when N≫n
  • Improvement magnitude gradually decreases as the size n of D₁ increases (as expected)
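
The role of the two sample sizes follows from the standard error in Step D, √(σ̂²Δ/n + σ̂²τ₂/N): the second term vanishes as N grows, leaving a floor set by the rectifier variance and n. The sketch below illustrates this with hypothetical variance values (both set to 1.0); they are not results from the paper.

```python
from math import sqrt
from statistics import NormalDist


def half_width(var_delta, var_tau2, n, N, alpha=0.05):
    """Half-width of the prediction-powered interval for given variances and sizes."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * sqrt(var_delta / n + var_tau2 / N)


VAR_DELTA, VAR_TAU2 = 1.0, 1.0  # hypothetical values for illustration
n = 500

# Full interval width for increasing sizes of the large dataset.
widths_by_N = {N: 2 * half_width(VAR_DELTA, VAR_TAU2, n, N)
               for N in (500, 5_000, 50_000)}

# Width that would remain if N were infinite (only the rectifier term is left).
floor = 2 * NormalDist().inv_cdf(0.975) * sqrt(VAR_DELTA / n)
```

In this toy setting the width at N = n is √2 times the floor, while at N = 100·n it is within about half a percent of it. Whether the floor itself beats the D₁-only interval depends on the rectifier variance being smaller than the variance of the raw AIPW scores, that is, on how informative the D₂-based predictions are.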

High-Dimensional Settings:

  • Maintains advantages in 5D, 50D, and 500D covariate spaces
  • Demonstrates method robustness in high-dimensional settings

Different Model Architectures:

  • Supports multiple base models including neural networks and XGBoost
  • Demonstrates method generality

RCT + Observational Data Extension

IPW-Based Method:

  • Leverages known propensity scores to simplify estimation
  • More stable than A-TMLE, avoiding numerical issues from matrix inversion
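
When D₁ is an RCT, the propensity score is known by design, so no propensity model has to be estimated. A minimal sketch of an inverse-probability-weighted score under that assumption is shown below; the function name ipw_pseudo_outcome, the 50/50 randomization, and the toy outcome model are hypothetical, and the paper's exact estimator may differ in detail.

```python
import random
from statistics import fmean


def ipw_pseudo_outcome(a, y, known_propensity):
    """IPW score with a known propensity; its mean estimates the ATE under randomization."""
    return (a / known_propensity - (1 - a) / (1 - known_propensity)) * y


# Toy RCT: treatment assigned by a fair coin, true ATE equal to 2.0.
rng = random.Random(0)
PROPENSITY = 0.5
TRUE_ATE = 2.0
scores = []
for _ in range(20_000):
    a = 1 if rng.random() < PROPENSITY else 0
    y = 1.0 + TRUE_ATE * a + rng.gauss(0, 1)
    scores.append(ipw_pseudo_outcome(a, y, PROPENSITY))

ipw_estimate = fmean(scores)
```

These scores would take the place of the AIPW scores in the rectifier, with the rest of the construction unchanged.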

Performance Comparison:

  • Consistently covers true ATE
  • Significantly narrower CI width than baseline methods
  • Maintains validity even under strong confounding

Related Work

ATE Confidence Interval Construction

  • Traditional methods based on asymptotic normality or finite-sample assumptions
  • Existing work primarily addresses single-dataset scenarios

Multi-Dataset ATE Estimation

  1. RCT + Observational Data: Kallus et al., Hatt et al., Demirel et al.
  2. Multiple Observational Data: Yang & Ding, Guo et al.
  3. Limitations: Most focus on point estimation, lacking uncertainty quantification

Prediction-Powered Inference

  • PPI framework proposed by Angelopoulos et al.
  • Primarily applied to traditional statistics (means, medians, etc.)
  • First application to causal inference in this work

Conclusions and Discussion

Main Conclusions

  1. Successfully extends PPI framework to multi-dataset causal inference
  2. Provides theoretically guaranteed valid confidence intervals
  3. Significantly improves precision compared to naive methods
  4. Validates practical utility on medical data

Limitations

  1. Assumption Dependence: D₁'s unconfoundedness assumption may be violated in practice
  2. Distribution Assumptions: Assumes identical marginal covariate distributions
  3. Sample Splitting: Requires sufficiently large D₂ for effective splitting

Future Directions

  1. Extension to CATE: Extend method to heterogeneous treatment effects
  2. Survival Analysis: Application to causal survival analysis
  3. Large Language Model Integration: Incorporate pre-trained models for text representation
  4. Robustness Analysis: Develop methods robust to assumption violations

In-Depth Evaluation

Strengths

  1. Theoretical Rigor: Provides complete asymptotic theoretical analysis and validity proofs
  2. Practical Value: Addresses real needs in medical practice
  3. Method Generality: Supports multiple CATE estimators with strong flexibility
  4. Comprehensive Experiments: Covers synthetic and real data with extensive sensitivity analyses

Limitations

  1. Assumption Constraints: Unconfoundedness assumption is strong in practical applications
  2. Computational Complexity: Cross-fitting and sample splitting increase computational cost
  3. Limited Extensibility: Primarily addresses binary treatments; extension to continuous treatments unclear

Impact

  1. Academic Contribution: First application of PPI to causal inference, opening new research directions
  2. Practical Value: Provides more reliable statistical tools for medical decision-making
  3. Reproducibility: Provides open-source code facilitating verification and application

Applicable Scenarios

  1. Multi-Center Medical Research: Integrating patient data from different hospitals
  2. Drug Safety Assessment: Combining RCTs with real-world data
  3. Health Policy Development: Evidence-based decision-making using multi-source data
  4. Regulatory Approval: Providing statistical evidence for drug approval

References

  1. Angelopoulos et al. (2023). Prediction-powered inference. Science.
  2. van der Laan et al. (2024). Adaptive-TMLE for average treatment effect. arXiv.
  3. Kallus et al. (2018). Removing hidden confounding by experimental grounding. NeurIPS.
  4. Yang & Ding (2020). Combining multiple observational data sources. JASA.

Overall Assessment: This is a high-quality causal inference paper that successfully applies the prediction-powered inference framework to multi-dataset ATE estimation. The paper has solid theoretical foundations, well-designed experiments, and significant practical value in medical applications. While subject to certain assumption constraints, its overall contributions are substantial, providing new methodological tools for the causal inference field.