We address the classical problem of constructing confidence intervals (CIs) for the mean of a distribution, given \(N\) i.i.d. samples, such that the CI contains the true mean with probability at least \(1 - δ\), where \(δ\in (0,1)\). We characterize three distinct learning regimes based on the minimum achievable limiting width of any CI as the sample size \(N_δ \to \infty\) and \(δ\to 0\). In the first regime, where \(N_δ\) grows slower than \(\log(1/δ)\), the limiting width of any CI equals the width of the distribution's support, precluding meaningful inference. In the second regime, where \(N_δ\) scales as \(\log(1/δ)\), we precisely characterize the minimum limiting width, which depends on the scaling constant. In the third regime, where \(N_δ\) grows faster than \(\log(1/δ)\), complete learning is achievable, and the limiting width of the CI collapses to zero, converging to the true mean. We demonstrate that CIs derived from concentration inequalities based on Kullback--Leibler (KL) divergences achieve asymptotically optimal performance, attaining the minimum limiting width in both sufficient and complete learning regimes for distributions in two families: single-parameter exponential and bounded support. Additionally, these results extend to one-sided CIs, with the width notion adjusted appropriately. Finally, we generalize our findings to settings with random per-sample costs, motivated by practical applications such as stochastic simulators and cloud service selection. Instead of a fixed sample size, we consider a cost budget \(C_δ\), identifying analogous learning regimes and characterizing the optimal CI construction policy.
- Paper ID: 2501.19126
- Title: Asymptotic optimality theory of confidence intervals of the mean
- Authors: Vikas Deep (NUS, Singapore), Achal Bassamboo (Kellogg, Northwestern University), Sandeep Juneja (Ashoka University, India)
- Classification: math.ST stat.TH
- Publication Date: January 2025 (arXiv preprint)
- Paper Link: https://arxiv.org/abs/2501.19126
This paper investigates the classical problem of constructing confidence intervals (CI) for the mean of a distribution based on N independent and identically distributed samples, requiring that the CI contains the true mean with probability at least 1-δ. Based on the minimum asymptotic width achievable by any CI as N_δ→∞ and δ→0, the authors characterize three distinct learning regimes: (1) No-learning regime: when N_δ grows slower than log(1/δ), the limiting width of the CI equals the width of the distribution's support; (2) Sufficient-learning regime: when N_δ grows proportionally to log(1/δ), the minimum limiting width can be precisely characterized depending on scaling constants; (3) Complete-learning regime: when N_δ grows faster than log(1/δ), the limiting width of the CI converges to zero. The authors prove that CIs constructed using concentration inequalities based on KL divergence achieve asymptotic optimality in both the sufficient-learning and complete-learning regimes.
Confidence interval construction is a fundamental problem in statistics with important applications in A/B testing, experimental design, data analysis, and simulation. Although multiple methods for constructing confidence intervals exist, there is a lack of theoretical characterization of optimal CIs with minimum width.
- Lack of optimality theory: While the literature provides various CI construction methods, no results characterize optimal CIs with minimum width
- Loose non-asymptotic bounds: Existing non-asymptotic lower bounds (e.g., Shekhar and Ramdas 2023) are loose in the asymptotic setting
- Strong assumptions: Existing bounds rely on strong assumptions that CI width is deterministically bounded by specific functions
This paper aims to fill this theoretical gap by introducing a stability assumption to characterize fundamental limits of CI width in an asymptotic framework and prove the optimality of KL divergence-based methods.
- Characterization of three learning regimes: Based on the relative scaling of sample size N_δ relative to precision 1-δ, the authors characterize three distinct regimes: no-learning, sufficient-learning, and complete-learning
- Sharp lower bounds: Derive sharp lower bounds for the limiting width of CIs in the sufficient-learning regime and prove that KL divergence-based CI construction methods achieve these bounds
- Asymptotic optimality proof: Prove that CI construction methods based on KL divergence concentration bounds are optimal in the studied asymptotic framework
- Extended results: Extend results to more general settings including random sampling costs, one-sided CIs, and non-parametric distributions
Given N independent and identically distributed samples X₁,...,X_N from distribution ν with mean μ, construct a confidence interval μ̂_L^π(N,δ), μ̂_R^π(N,δ) such that P_ν(μ ∈ μ̂_L^π(N,δ), μ̂_R^π(N,δ)) ≥ 1-δ.
Definition 1 (Stability): For a given distribution ν, a strategy π is called stable if as N_δ→∞ and δ→0:
- lim_{δ→0} μ̂_L^π(N_δ,δ) →^p μ_L^π(ν)
- lim_{δ→0} μ̂_R^π(N_δ,δ) →^p μ_R^π(ν)
where μ_L^π(ν) ≤ μ and μ_R^π(ν) ≥ μ are constants.
Based on the value k = lim_{δ→0} N_δ/log(1/δ):
No-learning regime (k→0):
- CI limiting width = distribution support width
- μ_L^π(μ) = μ̲, μ_R^π(μ) = μ̄
Sufficient-learning regime (k ∈ (0,∞)):
- Lower bound: μ_R^π(μ) - μ_L^π(μ) ≥ μ_R*(μ,k) - μ_L*(μ,k)
- where μ_L*(μ,k) < μ and μ_R*(μ,k) > μ uniquely satisfy:
d(μ, μ_R*(μ,k)) = d(μ, μ_L*(μ,k)) = 1/k
Complete-learning regime (k→∞):
For distributions in single-parameter exponential family S, define:
d(μ, μ̃) = KL(p_{θ(μ)}, p_{θ(μ̃)}) = b(θ(μ̃)) - b(θ(μ)) - b'(θ(μ))(θ(μ̃) - θ(μ))
This function possesses key properties such as strict quasi-convexity and continuity.
Based on the concentration inequality:
P_ν(nd(μ̂_n, μ) ≥ β(δ)) ≤ δ
where β(δ) = log(2/δ), construct CI:
- μ_R^{π₁}(n,δ) = max{q > μ̂_n : nd(μ̂_n, q) ≤ β(δ)}
- μ_L^{π₁}(n,δ) = min{q < μ̂_n : nd(μ̂_n, q) ≤ β(δ)}
- Introduction of stability concept: This is a key innovation for analyzing asymptotic behavior of CI width, making the limiting width a deterministic constant
- Clever application of data processing inequality: Combined with stability assumption, enables simultaneous consideration of hypothesis elimination on both sides
- Tightness proof: Proves that the proposed lower bounds are tight, i.e., there exist methods that achieve the bounds
- Bernoulli distribution: means 0.6 and 0.9
- Gaussian distribution: N(0,1) with known variance
- Pareto distribution: scale parameter x_m=1, shape parameter α=3
- Average CI width: average confidence interval width over 1000 independent datasets
- Coverage probability: frequency with which the confidence interval contains the true mean
- Hoeffding-based CI: based on Hoeffding inequality
- Empirical Bernstein (EB) CI: based on empirical Bernstein inequality
- Betting-based hedged CI: based on betting method
- Shekhar-Ramdas lower bound: existing theoretical lower bound
- δ = 0.01 (Bernoulli experiments), δ = 0.05 (Pareto experiments)
- Sample sizes: N ∈ {2000, 3000}
- Discretization parameter: m ∈ {1000, 3000, 5000} (betting method)
For the Gaussian case, the asymptotic lower bound in this paper is 2σ√(2/k), while the Shekhar-Ramdas bound is σ√(2/k), representing an improvement factor of 2.
| N | π₁ | Betting(m=1000) | Betting(m=3000) | Betting(m=5000) | Hoeffding | EB |
|---|
| Mean=0.6 | | | | | | |
| 2000 | 0.0712 | 0.0603 | 0.0596 | 0.0595 | 0.0728 | 0.0898 |
| 3000 | 0.0582 | 0.0592 | 0.0585 | 0.0584 | 0.0594 | 0.0712 |
| Mean=0.9 | | | | | | |
| 2000 | 0.0436 | 0.0378 | 0.0371 | 0.0369 | 0.0728 | 0.0606 |
| 3000 | 0.0356 | 0.0370 | 0.0363 | 0.0361 | 0.0594 | 0.0473 |
| Sample Size | Average CI Width |
|---|
| 500 | 0.492 |
| 1000 | 0.355 |
| 2000 | 0.255 |
| 3000 | 0.199 |
- Asymptotic advantage: The π₁ method performs excellently in large sample scenarios, particularly comparable to the betting method when N=3000
- Computational efficiency: The π₁ method is more computationally efficient than the betting method
- Theoretical verification: Experimental results verify the theoretically predicted improvement factor
- Duality between hypothesis testing and CI: Classical theory constructs CIs by inverting hypothesis tests
- UMP tests: Uniformly most powerful tests exist in parametric settings but are typically restricted to specific families (e.g., unbiased tests in exponential families)
- Hoeffding and Bernstein inequalities: applicable to bounded support distributions
- Chernoff bounds: applicable when MGF upper bounds are known
- Heavy-tailed distribution methods: using Markov and Chebyshev inequalities
- Waudby-Smith and Ramdas (2024): transform CI construction into betting problems
- Shekhar and Ramdas (2023): first provide explicit lower bounds with distribution-dependent complexity terms, but relatively loose
- Complete theoretical characterization: First complete characterization of fundamental limits of CI width, identifying three distinct learning regimes
- Optimal method: Prove that KL divergence-based CI construction methods are optimal in the asymptotic sense
- Broad applicability: Results apply to parametric and non-parametric distribution families, as well as random cost settings
- Asymptotic nature: Results are primarily asymptotic with limited guidance for finite samples
- Stability assumption: Although mild, this remains an additional assumption
- Distribution family restrictions: Main results focus on exponential families and bounded support distributions
- Non-asymptotic results: Develop more refined non-asymptotic theory
- Other statistics: Extend to variance and quantile estimation
- Multidimensional generalization: Consider confidence regions for multidimensional parameters
- Significant theoretical contribution: First complete theory of CI width optimality, filling an important theoretical gap
- Significant technical innovation: Introduction of stability concept and clever application of data processing inequality have methodological value
- Tight results: Not only provide lower bounds but also prove their achievability
- Broad applicability: Extensions to random costs, one-sided CIs, and other practically relevant settings
- Limited experiments: Numerical experiments are relatively simple; could include more complex real-world datasets
- Computational complexity: For non-parametric cases, computation of KL_inf may be complex
- Finite sample performance: Theory is asymptotic; finite sample performance guarantees are not sufficiently strong
- Theoretical impact: Provides new analytical framework for CI theory, expected to be widely cited
- Practical value: Provides theoretical guidance for selecting CI methods in practical applications
- Methodological contribution: Stability analysis methods may be applicable to other statistical inference problems
- Large-sample statistical inference: Particularly suitable for applications with large sample sizes
- Online experiments: A/B testing and other scenarios requiring reliable confidence intervals
- Simulation studies: Random cost settings particularly suitable for simulation applications
- Machine learning: Confidence interval construction in model performance evaluation
The paper cites important literature in statistics and machine learning, including:
- Hoeffding (1994): Classical work on probability inequalities
- Waudby-Smith & Ramdas (2024): Recent advances in betting methods
- Shekhar & Ramdas (2023): Related lower bound work
- Kaufmann & Koolen (2021): Anytime-valid concentration inequalities
This paper makes important contributions to confidence interval theory by introducing a new analytical framework that completely characterizes fundamental limits of CI width and proves the optimality of KL divergence methods. While primarily theoretical, it provides valuable guidance for practical applications.