Inference on effect size after multiple hypothesis testing

Dzemski, Okui, Wang

Significant treatment effects are often emphasized when interpreting and summarizing empirical findings in studies that estimate multiple, possibly many, treatment effects. Under this kind of selective reporting, conventional treatment effect estimates may be biased and their corresponding confidence intervals may undercover the true effect sizes. We propose new estimators and confidence intervals that provide valid inferences on the effect sizes of the significant effects after multiple hypothesis testing. Our methods are based on the principle of selective conditional inference and complement a wide range of tests, including step-up tests and bootstrap-based step-down tests. Our approach is scalable, allowing us to study an application with over 370 estimated effects. We justify our procedure for asymptotically normal treatment effect estimators. We provide two empirical examples that demonstrate bias correction and confidence interval adjustments for significant effects. The magnitude and direction of the bias correction depend on the correlation structure of the estimated effects and whether the interpretation of the significant effects depends on the (in)significance of other effects.

Inference on Effect Size After Multiple Hypothesis Testing

Basic Information

Paper ID: 2503.22369
Title: Inference on effect size after multiple hypothesis testing
Authors: Andreas Dzemski (University of Gothenburg), Ryo Okui (University of Tokyo), Wenjie Wang (Nanyang Technological University)
Classification: econ.EM math.ST stat.TH
Publication Date: October 14, 2025
Paper Link: https://arxiv.org/abs/2503.22369

Abstract

In studies estimating multiple treatment effects, statistically significant treatment effects are often emphasized when interpreting and summarizing empirical findings. Under such selective reporting, conventional treatment effect estimates may be biased, and their corresponding confidence intervals may fail to provide adequate coverage of true effect sizes. This paper proposes new estimators and confidence intervals that provide valid inference on effect sizes of significant effects after multiple hypothesis testing. The method is based on the principle of selective conditional inference and applies to a broad range of testing procedures, including step-up tests and bootstrap-based step-down tests. The approach is scalable and can be applied to studies with over 370 estimated effects. The authors establish the validity of the procedure for asymptotically normal treatment effect estimators and provide two empirical examples demonstrating bias correction and confidence interval adjustment for significant effects.

Research Background and Motivation

Importance of the Problem

In empirical research across economics, medicine, psychology, and other fields, researchers frequently need to estimate multiple treatment effects. These effects may arise from different outcome variables, intervention types, or population subgroups. Through multiple hypothesis testing procedures, researchers classify these effects as statistically significant or insignificant, then focus on the practical importance of significant effects.

Limitations of Existing Methods

When researchers restrict attention to significant effects, the estimated magnitudes of these effects are subject to selection bias, which invalidates traditional statistical inference methods. Specifically:

Selection Bias: Significant effects tend to be positively selected ("winner's curse"), with magnitudes overestimated
Insufficient Confidence Interval Coverage: Traditional confidence intervals fail to provide valid statistical coverage
Lack of Bias Correction: Existing methods lack unbiased estimation for post-selection effect sizes
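The selection effect described in this list can be illustrated with a small simulation. The sketch below is not taken from the paper: it assumes a single estimate X ~ N(θ, 1) that is only reported when |X| exceeds a fixed two-sided critical value, and the function name and parameter values are hypothetical.

```python
import random
import statistics

def winners_curse_bias(theta=0.5, crit=1.96, n_draws=200_000, seed=0):
    """Average of X - theta among draws with |X| >= crit, where X ~ N(theta, 1).

    Also returns the share of draws that pass the significance filter.
    All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    selected = []
    for _ in range(n_draws):
        x = rng.gauss(theta, 1.0)
        if abs(x) >= crit:  # keep only "significant" estimates
            selected.append(x)
    bias = statistics.fmean(selected) - theta
    return bias, len(selected) / n_draws

bias, share = winners_curse_bias()
print(f"share reported: {share:.3f}, bias among reported estimates: {bias:.2f}")
```

Under these assumptions the reported estimates overstate θ on average, even though the unfiltered estimator is unbiased.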

Research Motivation

The paper argues that avoiding selective summarization and interpretation does not solve the problem but merely shifts the burden of synthesizing results to readers, who still face selective inference issues. Therefore, specialized statistical methods are needed to handle inference after multiple hypothesis testing.

Core Contributions

Proposes a new method based on conditional selective inference: Provides valid point estimates and confidence intervals for effect sizes of significant effects after multiple hypothesis testing
Develops efficient computational algorithms: Proposes an algorithm with O(m³log m) time complexity, enabling the method to scale to applications with hundreds of effects
Establishes asymptotic theory: Proves the asymptotic validity of the procedure under asymptotically normal treatment effect estimators
Provides broad applicability: The method applies to various multiple testing procedures, including step-down and step-up tests
Demonstrates practical value: Two empirical applications validate the method's effectiveness and utility

Methodology Details

Task Definition

Given m treatment effect parameters θ = (θ₁, ..., θₘ)' and their estimators θ̂, let Ŝ denote the set of effects found significant by a multiple hypothesis test. The goal is valid inference (point estimates and confidence intervals) on the true effect sizes of the effects in Ŝ.

Core Method Framework

1. Basic Setup

Assume θ̂ ~ N(θ, V), where V is a known covariance matrix
t-statistics: X = diag(v)⁻¹/² θ̂, where v contains the diagonal elements of V
Significant effects determined through step-down or step-up procedures: effect h is significant when |Xₕ| ≥ x̄ₕ
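The setup above can be sketched in a few lines. This is a minimal illustration, assuming two-sided normal critical values and the Holm step-down rule as the testing procedure; the function names and the numbers in the example are hypothetical and not taken from the paper.

```python
from statistics import NormalDist

def t_statistics(theta_hat, v):
    """Studentize each estimate by the square root of its variance."""
    return [est / var ** 0.5 for est, var in zip(theta_hat, v)]

def holm_significant(x, alpha=0.10):
    """Holm step-down test on t-statistics; returns the set of significant indices."""
    m = len(x)
    order = sorted(range(m), key=lambda h: -abs(x[h]))  # largest |t| first
    significant = set()
    for step, h in enumerate(order):
        # Two-sided Bonferroni cutoff for the m - step hypotheses still in play.
        cutoff = NormalDist().inv_cdf(1 - alpha / (2 * (m - step)))
        if abs(x[h]) < cutoff:
            break  # step-down: stop at the first non-rejection
        significant.add(h)
    return significant

x = t_statistics([0.30, 0.22, 0.05, -0.02], [0.01, 0.01, 0.01, 0.01])
print(holm_significant(x))
```

In this example the second effect has a t-statistic of about 2.2, which is below the first-step cutoff (about 2.24) but above the second-step cutoff (about 2.13). Its threshold x̄ₕ therefore depends on the first effect having been rejected, which is why the thresholds in a stepwise procedure are data dependent.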

2. Conditional Inference Method

For significant effect s ∈ S, decompose X as:

X = Ω•,sXs + Z⁽ˢ⁾

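where Ω is the correlation matrix of X, Ω•,s is its s-th column, and Z⁽ˢ⁾ = X - Ω•,sXs is independent of Xs under joint normality (Xs has unit variance, so Z⁽ˢ⁾ is the residual from projecting X on Xs).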
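A short numerical check of this decomposition, as a sketch under stated assumptions: the covariance matrix V below is hypothetical, Ω is computed as the correlation matrix implied by V, and the helper names are illustrative rather than the paper's.

```python
import random

def correlation_matrix(V):
    """Omega = diag(v)^(-1/2) V diag(v)^(-1/2)."""
    m = len(V)
    sd = [V[h][h] ** 0.5 for h in range(m)]
    return [[V[i][j] / (sd[i] * sd[j]) for j in range(m)] for i in range(m)]

def residual_vector(x, omega, s):
    """Z^(s) = X - Omega[:, s] * X_s; its s-th entry is zero by construction."""
    return [x[h] - omega[h][s] * x[s] for h in range(len(x))]

V = [[4.0, 1.0], [1.0, 1.0]]  # hypothetical covariance, implied correlation 0.5
omega = correlation_matrix(V)
rho = omega[1][0]

rng = random.Random(1)
xs_draws, z_draws = [], []
for _ in range(50_000):
    e1, e2 = rng.gauss(0, 1), rng.gauss(0, 1)
    x = [e1, rho * e1 + (1 - rho ** 2) ** 0.5 * e2]  # X ~ N(0, Omega)
    z = residual_vector(x, omega, s=0)
    xs_draws.append(x[0])
    z_draws.append(z[1])

mean_x = sum(xs_draws) / len(xs_draws)
mean_z = sum(z_draws) / len(z_draws)
sample_cov = sum((a - mean_x) * (b - mean_z)
                 for a, b in zip(xs_draws, z_draws)) / len(xs_draws)
print(f"sample covariance of X_s and Z: {sample_cov:.4f}")
```

The sample covariance between Xs and the other entry of Z⁽ˢ⁾ should be close to zero, which under joint normality is equivalent to independence.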

The key innovation lies in the conditional distribution function:

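Fs(xs | z, θs, S) = ∫_A 1{ξ + V⁻¹/²s,s θs ≤ xs} dΦ(ξ) / ∫_A dΦ(ξ),

where A = {ξ ∈ ℝ : ξ + V⁻¹/²s,s θs ∈ Xs(z, S)}, Xs(z, S) is the conditional support of Xs given Z⁽ˢ⁾ = z and the selection event, and Φ is the standard normal distribution function. In other words, Fs is the distribution function of a normal variable with mean V⁻¹/²s,s θs and unit variance, truncated to Xs(z, S).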

3. Estimators and Confidence Intervals

Conditional Median Unbiased Estimator: θ̃ᵘᵇₛ = θ̃ₛ⁽⁰·⁵⁾, where θ̃ₛ⁽ᵖ⁾ satisfies Fs(Xs | Z⁽ˢ⁾, θ̃ₛ⁽ᵖ⁾, S) = p
Conditional Confidence Interval (nominal level 1 - α): [θ̃ₛ⁽¹⁻α/²⁾, θ̃ₛ⁽α/²⁾]
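The estimator and interval above amount to inverting a truncated normal distribution function in its mean. The following is a minimal sketch, not the authors' implementation: it works on the t-statistic scale with μ = V⁻¹/²s,s θs, takes the conditional support as a given list of intervals, and solves Fs = p by bisection. The bracket of ±10 around the observed statistic, the function names, and the example support are assumptions made for illustration.

```python
import math

def std_normal_cdf(z):
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def interval_prob(a, b, mu):
    """P(a <= X <= b) for X ~ N(mu, 1), evaluated in the tail that avoids cancellation."""
    lo, hi = a - mu, b - mu
    if lo > 0:
        return std_normal_cdf(-lo) - std_normal_cdf(-hi)
    return std_normal_cdf(hi) - std_normal_cdf(lo)

def conditional_cdf(x, mu, support):
    """P(X <= x | X in support) for X ~ N(mu, 1); support is a list of (a, b) intervals."""
    num = sum(interval_prob(a, min(b, x), mu) for a, b in support if a < x)
    den = sum(interval_prob(a, b, mu) for a, b in support)
    return num / den

def solve_for_mu(x, p, support, half_width=10.0, tol=1e-9):
    """Find mu with conditional_cdf(x, mu, support) = p (the CDF is decreasing in mu)."""
    lo, hi = x - half_width, x + half_width  # assumed bracket, may be too narrow
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if conditional_cdf(x, mid, support) > p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def median_estimate_and_ci(x, support, alpha=0.10):
    estimate = solve_for_mu(x, 0.5, support)
    lower = solve_for_mu(x, 1 - alpha / 2, support)
    upper = solve_for_mu(x, alpha / 2, support)
    return estimate, (lower, upper)

# Hypothetical example: selection requires |X_s| >= 2.2414, and X_s = 2.5 was observed.
support = [(-math.inf, -2.2414), (2.2414, math.inf)]
estimate, (lower, upper) = median_estimate_and_ci(2.5, support)
print(estimate, lower, upper)
```

Multiplying the results by the standard error V¹/²s,s converts them back to the scale of θs. Because the observed statistic sits close to the threshold in this example, the estimate is pulled well below 2.5 and the interval is shifted toward zero relative to the conventional one.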

Technical Innovations

1. Efficient Algorithm Design

Traditional methods require direct computation of the complex selection event X(S). This paper avoids such computation through the following innovation:

Algorithm 2: Computing Conditional Support

(A) Find intervals I by computing all intersections of the linear functions x_{z,h}(x_s)
(B) For each interval I:
    i. Find the sorting permutation σ*_I
    ii. Compute the interval boundaries ℓ(I) and u(I)
(C) Return ∪_I (I ∩ [ℓ(I), u(I)])
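For intuition about what the conditional support looks like, it can be approximated by brute force. The sketch below is deliberately not Algorithm 2: it scans a grid of candidate values for Xs, rebuilds the full t-statistic vector as Ω•,s xs + z (each coordinate is linear in xs), reruns a Holm test, and keeps the grid points at which the selected set equals S. The grid bounds, step size, and example numbers are assumptions.

```python
from statistics import NormalDist

def holm_significant(x, alpha=0.10):
    """Holm step-down test with two-sided normal critical values."""
    m = len(x)
    order = sorted(range(m), key=lambda h: -abs(x[h]))
    significant = set()
    for step, h in enumerate(order):
        cutoff = NormalDist().inv_cdf(1 - alpha / (2 * (m - step)))
        if abs(x[h]) < cutoff:
            break
        significant.add(h)
    return significant

def support_on_grid(z, omega_col, target, alpha=0.10, lo=-8.0, hi=8.0, step=0.01):
    """Grid approximation of {x_s : Holm applied to omega_col * x_s + z selects `target`}."""
    intervals, start, prev = [], None, None
    n_steps = int(round((hi - lo) / step))
    for i in range(n_steps + 1):
        xs = lo + i * step
        x = [omega_col[h] * xs + z[h] for h in range(len(z))]
        inside = holm_significant(x, alpha) == target
        if inside and start is None:
            start = xs  # a new interval opens here
        if not inside and start is not None:
            intervals.append((start, prev))  # the interval closed at the previous point
            start = None
        prev = xs
    if start is not None:
        intervals.append((start, prev))
    return intervals

# Hypothetical example: two effects with correlation 0.5, observed X = (2.5, 0.4), S = {0}.
omega_col = [1.0, 0.5]
z = [0.0, 0.4 - 0.5 * 2.5]
intervals = support_on_grid(z, omega_col, target={0})
print(intervals)
```

In this example the support is bounded above as well as below: for large xs the correlated second effect would also become significant, so the event Ŝ = {0} no longer holds. The grid scan costs one full test per grid point and is only accurate up to the step size, which is the kind of cost the interval-based algorithm in the paper is designed to avoid.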

2. Unified Treatment of Multiple Testing Procedures

The method supports various testing procedures:

Step-down procedures: Bonferroni, Holm, Romano-Wolf, etc.
Step-up procedures: Benjamini-Hochberg, Benjamini-Yekutieli, etc.
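To make the step-up logic concrete, here is a minimal sketch of the Benjamini-Hochberg rule on two-sided normal p-values. The function names and the example p-values are hypothetical, and the code only illustrates the testing step, not the paper's inference procedure.

```python
from statistics import NormalDist

def two_sided_p_values(x):
    """Two-sided normal p-values for a list of t-statistics."""
    return [2 * (1 - NormalDist().cdf(abs(t))) for t in x]

def benjamini_hochberg(p_values, q=0.10):
    """Step-up rule: reject the k smallest p-values, k = max{rank : p_(rank) <= rank * q / m}."""
    m = len(p_values)
    order = sorted(range(m), key=lambda h: p_values[h])
    k_max = 0
    for rank, h in enumerate(order, start=1):
        if p_values[h] <= rank * q / m:
            k_max = rank  # keep scanning: a later rank can still pass
    return set(order[:k_max])

rejected = benjamini_hochberg([0.04, 0.045, 0.06, 0.50])
print(rejected)
```

Here the smallest p-value (0.04) fails its own threshold (0.025), yet the first three hypotheses are rejected because the third-smallest p-value passes its threshold (0.075). A step-down procedure would have stopped at the first failure.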

3. Flexible Definition of Selection Events

Two main selection events are provided:

Ŝ = S: Fully conditional on the observed significance pattern
Ŝ ⊇ S: Conditional only on the specific effect being found significant

Experimental Setup

Monte Carlo Simulation

Data Configuration

Number of effects: m = 5
True parameters: θ = (0.05, 0.03, 0.01, 0, 0)'
Sample sizes: n ∈ {100, 300, 500, 700, 900}
Correlation: ρ = 0.5
Testing procedure: Holm step-down, FWER = 10%

Two Designs

Normal Design: Yᵢ ~ multivariate normal distribution
Chi-square Design: Yᵢₖ = (U²ᵢₖ-1)/√2 + θₖ, where Uᵢ ~ multivariate normal
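The two designs can be sketched as data-generating functions. Several details below are assumptions not stated in this summary: unit variances, an equicorrelated structure with ρ = 0.5 between all pairs, a mean of θ in the normal design, and sample means as the effect estimators.

```python
import random

THETA = (0.05, 0.03, 0.01, 0.0, 0.0)

def draw_equicorrelated_normal(rng, m, rho):
    """One draw of m standard normals with pairwise correlation rho (one-factor construction)."""
    common = rng.gauss(0, 1)
    return [rho ** 0.5 * common + (1 - rho) ** 0.5 * rng.gauss(0, 1) for _ in range(m)]

def simulate_sample(n, design="normal", rho=0.5, theta=THETA, seed=0):
    """Return n rows of outcomes Y_i under the normal or chi-square design."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        u = draw_equicorrelated_normal(rng, len(theta), rho)
        if design == "normal":
            rows.append([u_k + t_k for u_k, t_k in zip(u, theta)])
        else:
            # (U^2 - 1) / sqrt(2) has mean 0 and variance 1 but is right-skewed.
            rows.append([(u_k ** 2 - 1) / 2 ** 0.5 + t_k for u_k, t_k in zip(u, theta)])
    return rows

def sample_means(rows):
    n = len(rows)
    return [sum(row[k] for row in rows) / n for k in range(len(rows[0]))]

rows = simulate_sample(20_000, design="chi-square")
print([round(mean, 3) for mean in sample_means(rows)])
```

The chi-square design keeps the first two moments of the normal design but makes the outcomes skewed, so it probes how well the normal approximation behind the method holds up at moderate sample sizes.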

Empirical Applications

Application 1: Charitable Giving Study

Data Source: Karlan and List (2007) matching gift experiment
Number of Effects: Treatment effects for 4 outcome variables
Testing Procedures: Bonferroni, Holm, Romano-Wolf (RW2005)

Application 2: Mutual Fund Performance

Data Source: CRSP Mutual Fund Database, January 2000 - April 2024
Number of Effects: Alpha estimates for 371 funds
Model: Fama-French five-factor model
Testing Procedures: Holm (FWER control) and Benjamini-Yekutieli (FDR control)

Experimental Results

Monte Carlo Simulation Results

Coverage Performance

Conditional Confidence Intervals: Approach nominal 90% coverage across all designs and sample sizes
Traditional Confidence Intervals: Severely under-cover, particularly when selection frequency is low
Bonferroni Intervals: Achieve nominal coverage at large samples but are overly conservative

Interval Length Comparison

Conditional intervals are wider than traditional intervals but significantly shorter than Bonferroni intervals, demonstrating efficiency gains.

Bias Correction Effects

The conditional median unbiased estimator substantially reduces the conditional bias of the traditional estimator. In the normal design with n=100, for example, the conditional bias falls from 0.084 to -0.015.

Empirical Application Results

Charitable Giving Application

Key Findings:

Response Rate and Donation Amount Including Match are significant across all three procedures
The direction and magnitude of bias correction depend on the correlation structure
For "Donation Amount Including Match," upward correction occurs under Holm and Bonferroni tests, related to the insignificance of the highly correlated "Donation Amount Excluding Match"

Mutual Fund Application

Key Results:

Five funds with significantly positive alpha identified among 371 funds
Conditional median unbiased estimates slightly smaller than unconditional estimates
Conditional confidence intervals 12-36% narrower than unconditional intervals
Lower bounds of joint conditional confidence intervals exceed 0.135 for 4 of 5 funds, indicating economically meaningful outperformance

Selective Inference Literature

The paper is part of the rapidly developing selective inference literature, with related research including:

Conditional Inference Methods: Lee et al. (2016), Fithian et al. (2017)
Unconditional Inference Methods: Benjamini and Yekutieli (2005), Berk et al. (2013)

Distinctions from Existing Methods

vs. Unconditional Methods:
- Conditional methods control statistical error given observed significance
- Unconditional methods average statistical error across different contexts
- Conditional methods provide point estimates with bias correction
vs. Simultaneous Inference:
- Conditional inference may produce tighter confidence intervals
- Power advantages of unconditional methods are inconsistent

Theoretical Results

Main Theorems

Theorem 1 (Median Unbiasedness)

P(θ̃ᵘᵇₛ ≥ θₛ | Ŝ = S) = P(θ̃ᵘᵇₛ ≤ θₛ | Ŝ = S) = 0.5

Theorem 2 (Confidence Set Validity)

P(θₛ ∈ CCIα(θₛ | S) | Ŝ = S) = 1 - α

Theorems 5-6 (Asymptotic Properties)

Under Assumption 1, establish asymptotic median unbiasedness of estimators and asymptotic validity of confidence intervals.

Convergence Results

Theorem 4 provides sufficient conditions for conditional confidence intervals to converge to unconditional confidence intervals, with the two methods tending to agree when effects are "highly significant."
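The intuition can be seen in a stylized one-dimensional case. The following is an illustrative calculation under simplifying assumptions (a fixed threshold c > 0 and the selection event |Xs| ≥ c), not the statement or proof of Theorem 4. Write μ = V⁻¹/²s,s θs for the mean of Xs.

```latex
% Conditional CDF of X_s ~ N(\mu, 1) given |X_s| \ge c, evaluated at x_s \ge c
F_s(x_s \mid \mu)
  = \frac{\Phi(-c-\mu) + \Phi(x_s-\mu) - \Phi(c-\mu)}
         {\Phi(-c-\mu) + 1 - \Phi(c-\mu)} .

% Difference to the unconditional CDF, with A = \Phi(-c-\mu) and B = \Phi(c-\mu)
F_s(x_s \mid \mu) - \Phi(x_s-\mu)
  = \frac{(A - B)\,\bigl(1 - \Phi(x_s-\mu)\bigr)}{1 + A - B}
  \;\longrightarrow\; 0
  \quad \text{as } \mu \to \infty \text{ with } c \text{ fixed},
% because A \to 0 and B \to 0.
```

When the effect is far above the threshold, truncation removes almost no probability mass, so inverting the conditional distribution function gives nearly the same estimate and interval as inverting the unconditional one.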

Conclusions and Discussion

Main Conclusions

Method Validity: The proposed conditional inference method performs well in finite samples and captures selection bias even under non-Gaussian settings
Computational Feasibility: The polynomial time complexity of the algorithm enables the method to handle hundreds of effects
Practical Value: Two empirical applications show that the direction and magnitude of bias correction are difficult to anticipate, highlighting the relevance of formal statistical methods

Limitations

Pre-specification Assumption: The method assumes that the full set of tested hypotheses is known, so it cannot handle cases where insignificant results are hidden
Computational Complexity: While polynomial time, the method may face computational challenges for very large m
Model Assumptions: Requires asymptotic normality and consistently estimable covariance matrices

Future Directions

Alternative Conditional Inference Procedures: Explore data carving and randomized response methods
Power Properties Research: Investigate power characteristics of the procedure
Nonparametric Extensions: Relax normality assumptions

In-Depth Evaluation

Strengths

Theoretical Contribution: Provides a rigorous theoretical framework for inference after multiple hypothesis testing
Methodological Innovation: Efficient algorithms make the method practically operational
Broad Applicability: Supports multiple testing procedures and selection events
Empirical Validation: Thoroughly validates method effectiveness through simulations and real applications
Clear Writing: Well-structured paper with detailed technical exposition

Weaknesses

Computational Complexity: While polynomial time, O(m³log m) may be a bottleneck for ultra-large-scale problems
Assumption Limitations: Normality assumption and known covariance structure may not hold in practical applications
Selection Event Definition: Needs more guidance on choosing among different selection event definitions

Impact

Academic Value: Provides important contribution to selective inference literature, particularly in multiple testing context
Practical Value: Method directly applicable to empirical research in economics, medicine, and other fields
Reproducibility: Detailed algorithm description and complete theoretical results ensure good reproducibility

Applicable Scenarios

The method is particularly suitable for:

Multiple Treatment Effects Studies: Randomized controlled trials requiring simultaneous estimation of multiple intervention effects
Subgroup Analysis: Evaluating treatment effects across multiple population subgroups
Multiple Outcome Variables: Assessing single intervention impacts on multiple outcome variables
Financial Applications: Portfolio performance evaluation, risk factor analysis, etc.

References

The paper cites key literature in selective inference, including Lee et al. (2016) on polyhedral methods, Fithian et al. (2017) on conditional selective inference principles, and Romano and Wolf (2005) on multiple testing procedures. These citations reflect the paper's depth and breadth in the field.