
MLE convergence speed to information projection of exponential family: Criterion for model dimension and sample size -- complete proof version--

Sheena
For a parametric model of distributions, the closest distribution in the model to the true distribution located outside the model is considered. Measuring the closeness between two distributions with the Kullback-Leibler (K-L) divergence, the closest distribution is called the "information projection." The estimation risk of the maximum likelihood estimator (MLE) is defined as the expectation of K-L divergence between the information projection and the predictive distribution with plugged-in MLE. Here, the asymptotic expansion of the risk is derived up to $n^{-2}$-order, and the sufficient condition on the risk for the Bayes error rate between the true distribution and the information projection to be lower than a specified value is investigated. Combining these results, the "$p-n$ criterion" is proposed, which determines whether the MLE is sufficiently close to the information projection for the given model and sample. In particular, the criterion for an exponential family model is relatively simple and can be used for a complex model with no explicit form of normalizing constant. This criterion can constitute a solution to the sample size or model acceptance problem. Use of the $p-n$ criteria is demonstrated for two practical datasets. The relationship between the results and information criteria is also studied.

MLE Convergence Speed to Information Projection of Exponential Family: Criterion for Model Dimension and Sample Size -- Complete Proof Version --

Basic Information

  • Paper ID: 2105.08947
  • Title: MLE convergence speed to information projection of exponential family: Criterion for model dimension and sample size -- complete proof version--
  • Author: Yo Sheena (Faculty of Data Science, Shiga University; Visiting Professor, Institute of Statistical Mathematics)
  • Classification: math.ST stat.TH
  • Publication Date: May 2021 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2105.08947

Abstract

This paper investigates the problem of finding the distribution within a parametric model that is closest to the true distribution when the true distribution lies outside the model. Using Kullback-Leibler (K-L) divergence to measure distributional distance, the closest distribution is termed the "information projection." The estimation risk of the maximum likelihood estimator (MLE) is defined as the expected K-L divergence between the information projection and the predictive distribution with the MLE plugged in. The paper derives the asymptotic expansion of the risk up to order $n^{-2}$ and investigates sufficient conditions for the Bayes error rate between the true distribution and the information projection to be below a specified value. Combining these results, a "$p-n$ criterion" is proposed to determine whether the MLE is sufficiently close to the information projection under a given model and sample size. Notably, the criterion for exponential family models is relatively simple and applicable to complex models without explicit forms of normalizing constants. This criterion can serve as a solution to sample size determination or model acceptance problems.

Research Background and Motivation

Core Problem

Given a dataset, one must assume an unknown probability distribution as the generator of independent and identically distributed (i.i.d.) samples. If a parametric distribution model is adopted to "explain" the data, the primary task is to find the "best" distribution within the model. Since the true distribution typically lies outside the model, "best" means the distribution "closest" to the true distribution.

Problem Significance

Successful distributional approximation has broad applications:

  1. Regression or discriminant analysis based on conditional distributions
  2. Multiple imputation using conditional or unconditional distributions
  3. Anomaly detection based on probability contour regions
  4. Embodying C.R. Rao's famous equation: "uncertain knowledge" + "knowledge of the degree of uncertainty" = "available knowledge"

Limitations of Existing Methods

Three important problems exist in the distributional approximation process:

  1. Systematic methods for constructing distributional models
  2. Methods for assessing how close an estimator is to the best distribution
  3. Methods for assessing how close the best distribution is to the true distribution

Existing research primarily focuses on the closeness between the predictive distribution and the true distribution, rather than closeness to the best distribution.

Research Motivation

This paper focuses on the second problem, establishing a criterion for determining whether the MLE is sufficiently close to the best distribution. By separating the second and third problems, the paper fixes the model and derives the asymptotic expansion of the risk with respect to the sample size $n$.

Core Contributions

  1. Theoretical Contribution: Derives the asymptotic expansion of MLE estimation risk up to order $n^{-2}$ for general distributional models with complete mathematical proofs
  2. Exponential Family Specialization: Provides simplified risk expressions and a practical $p-n$ criterion for exponential family models
  3. Practical Criterion: Proposes the $p-n$ criterion for determining whether sample size is sufficient or model dimension is appropriate
  4. Algorithmic Framework: Provides computational algorithms for complex exponential family models without requiring explicit normalizing constants
  5. Empirical Validation: Verifies the effectiveness of the $p-n$ criterion on two real datasets
  6. Theoretical Connections: Establishes relationships with information criteria (AIC/TIC)

Methodology Details

Task Definition

Consider a parametric distributional model $M = \{g(x; \theta) \mid \theta \in \Theta\}$, where $g(x; \theta)$ is a probability density function with respect to a reference measure $d\mu$. The true distribution has density $g(x)$. The objectives are:

  • Find the information projection $g(x; \theta^*)$ in the model
  • Assess the distance between the predictive distribution $g(x; \hat{\theta})$ corresponding to the MLE $\hat{\theta}$ and the information projection
  • Establish a criterion for determining whether the MLE is sufficiently close to the information projection

Core Framework

Information Projection Definition

The information projection $g(x; \theta^*)$ is defined by $\theta^* = \arg\min_{\theta \in \Theta} D[g(x) | g(x; \theta)]$, where $D[g_1 | g_2] = \int g_1(x) \log(g_1(x)/g_2(x))\, d\mu$ is the K-L divergence.
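As a small illustration of this definition, the following sketch projects a made-up discrete "true" distribution onto a binomial model by brute-force minimization of the K-L divergence. The distribution `true_g`, the grid resolution, and all function names are hypothetical choices for this example, not taken from the paper. For this model the minimizer should agree with moment matching, $q^* = E_g[X]/m$, which the last line computes for comparison.

```python
import math

def kl_divergence(p, q):
    """D[p | q] = sum_x p(x) log(p(x)/q(x)) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def binomial_pmf(m, q):
    """Probability mass function of Binomial(m, q) on {0, ..., m}."""
    return [math.comb(m, k) * q**k * (1 - q) ** (m - k) for k in range(m + 1)]

# Hypothetical true distribution on {0, 1, 2}; it is not binomial.
true_g = [0.5, 0.2, 0.3]
m = 2

# Brute-force search for the information projection over a grid of q values.
grid = [i / 10000 for i in range(1, 10000)]
q_star = min(grid, key=lambda q: kl_divergence(true_g, binomial_pmf(m, q)))

# Moment matching: the projection onto this exponential family matches E[X].
q_moment = sum(k * pk for k, pk in enumerate(true_g)) / m

print(q_star, q_moment)
```

The grid search is only for transparency; in an exponential family the projection is characterized by matching the expectations of the sufficient statistics, so no search is needed in practice.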

Estimation Risk Definition

The estimation risk is defined as: $R[g(x; \theta^*) | g(x; \hat{\theta})] = E[D[g(x; \theta^*) | g(x; \hat{\theta})]]$
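To make the risk definition concrete, here is a minimal Monte Carlo sketch under assumptions chosen for this example only: the model is the Gaussian location family $N(\theta, 1)$, while the data come from $N(1, 2^2)$, so the model is misspecified. In this setting the information projection has $\theta^*$ equal to the true mean, the MLE is the sample mean, and the K-L divergence between two unit-variance Gaussians is half the squared mean difference, so the risk equals $\sigma^2/(2n)$ exactly. The function names and the values of `n`, `true_mean`, and `true_sd` are hypothetical.

```python
import random

def kl_gauss_unit_var(mu_a, mu_b):
    """D[N(mu_a, 1) | N(mu_b, 1)] = (mu_a - mu_b)^2 / 2."""
    return 0.5 * (mu_a - mu_b) ** 2

def monte_carlo_risk(n, true_mean, true_sd, reps=5000, seed=0):
    """Average K-L divergence from the projection N(true_mean, 1)
    to the plug-in predictive N(sample_mean, 1) over repeated samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        sample_mean = sum(rng.gauss(true_mean, true_sd) for _ in range(n)) / n
        total += kl_gauss_unit_var(true_mean, sample_mean)
    return total / reps

n, true_mean, true_sd = 50, 1.0, 2.0
mc_risk = monte_carlo_risk(n, true_mean, true_sd)
exact_risk = true_sd**2 / (2 * n)  # Var(sample mean) / 2

print(mc_risk, exact_risk)
```

The exact value depends on the true variance rather than the model variance, which is the kind of misspecification effect the risk expansion is meant to capture.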

Theoretical Results

Asymptotic Expansion for General Models

Theorem 1: The estimation risk of the MLE with respect to K-L divergence is: $R[g(x; \theta^*) | g(x; \hat{\theta})] = (2n)^{-1}\text{tr}(\tilde{G}^{-1}G\tilde{G}^{-1}G^*) + n^{-2}[\text{complex second-order terms}] + O(n^{-3})$

where:

  • $G^*_{ij}(\theta^*)$: Fisher information matrix
  • $\tilde{G}_{ij}(\theta^*)$: Negative expectation of Hessian matrix
  • $G_{ij}(\theta^*)$: Variance-covariance matrix under the true distribution

Simplified Results for Exponential Families

Corollary 1: For exponential family models $g(x; \theta) = \exp(\sum_{i=1}^p \theta_i \xi_i(x) - \Psi(\theta))$: $R[g(x; \theta^*) | g(x; \hat{\theta})] = \frac{1}{2n}\text{tr}(\tilde{G}^{-1}G) + \frac{1}{24n^2}[\text{function of third and fourth cumulants}] + O(n^{-3})$

Key property: $G^* = \tilde{G} = \ddot{\Psi}(\theta^*)$ (second derivative matrix)
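The step from the Theorem 1 leading term to the Corollary 1 leading term can be written out by substituting the key property. The last line is an added observation rather than a statement from this summary: if the true distribution happened to lie in the model, so that $G = \tilde{G}$, the leading term would reduce to $p/(2n)$.

```latex
\begin{aligned}
\frac{1}{2n}\,\mathrm{tr}\!\left(\tilde{G}^{-1} G \tilde{G}^{-1} G^{*}\right)
  &= \frac{1}{2n}\,\mathrm{tr}\!\left(\tilde{G}^{-1} G \tilde{G}^{-1} \tilde{G}\right)
     && \text{since } G^{*} = \tilde{G} \\
  &= \frac{1}{2n}\,\mathrm{tr}\!\left(\tilde{G}^{-1} G\right) \\
  &= \frac{p}{2n}
     && \text{if additionally } G = \tilde{G} \ (\text{true distribution in the model})
\end{aligned}
```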

The $p-n$ Criterion

Criterion for General Models

$C \geq \frac{1}{2n}\text{tr}(\hat{\tilde{G}}^{-1}\hat{G}\hat{\tilde{G}}^{-1}\hat{G}^*)$

Criterion for Exponential Families

$C \geq \frac{1}{2n}\text{tr}(\hat{\Sigma}(\ddot{\Psi}(\hat{\theta}))^{-1}) + \frac{1}{24n^2}[\text{estimated second-order terms}]$

where $\hat{\Sigma}$ is the sample covariance matrix of the $\xi_i$ terms.
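The first-order part of this criterion can be sketched for the simplest case, a one-parameter Poisson model written as $g(x; \theta) = \exp(\theta x - e^{\theta})$ with respect to the reference measure with weights $1/x!$. This model is an assumption made for the example, not one used in the paper. Here $\xi(x) = x$, $\Psi(\theta) = e^{\theta}$, the MLE satisfies $e^{\hat{\theta}} = \bar{x}$, and so $\ddot{\Psi}(\hat{\theta}) = \bar{x}$. The sample variance uses the $n-1$ divisor, which is also an assumption; the second-order term is omitted, and the data and function names are hypothetical.

```python
def first_order_risk_poisson(counts):
    """First-order term (1/(2n)) * Sigma_hat / Psi''(theta_hat) for a Poisson model (p = 1)."""
    n = len(counts)
    mean = sum(counts) / n
    # Sample variance of the sufficient statistic xi(x) = x (n - 1 divisor assumed).
    sigma_hat = sum((x - mean) ** 2 for x in counts) / (n - 1)
    # Psi''(theta_hat) = exp(theta_hat) = sample mean at the MLE.
    psi_ddot = mean
    return sigma_hat / (2 * n * psi_ddot)

def meets_criterion(estimated_risk, alpha):
    """Check C_alpha >= estimated risk, with C_alpha = 8 * alpha^2."""
    return 8.0 * alpha**2 >= estimated_risk

counts = [2, 3, 1, 4, 0, 2, 3, 5, 2, 2]  # hypothetical count data
risk = first_order_risk_poisson(counts)
print(risk, meets_criterion(risk, alpha=0.05))
```

If the counts were exactly Poisson, the ratio of sample variance to sample mean would be close to 1 and the term would be close to $1/(2n)$; overdispersed data inflate it.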

Threshold Setting

The threshold $C$ is set through the relationship between Bayes error rate and K-L divergence:

  • If $D[g_1 | g_2] \leq \delta$, then the error rate satisfies $\text{Er}[g_1 | g_2] \geq 1/2 - \sqrt{\delta/8}$
  • For error rate threshold $1/2 - \alpha$, approximately $C_\alpha = 8\alpha^2$
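The two bullets above are linked by setting $\sqrt{\delta/8} = \alpha$, which gives $\delta = 8\alpha^2$. The short sketch below encodes that conversion in both directions; the function names are hypothetical. As a usage example it converts $\alpha = 0.05$ into the threshold 0.02 quoted in the results, and converts the reported red wine risk of 2.93e-02 back into an $\alpha$ of about 0.06.

```python
import math

def threshold_from_alpha(alpha):
    """C_alpha = 8 * alpha^2, the K-L threshold for error rate 1/2 - alpha."""
    return 8.0 * alpha**2

def alpha_from_risk(risk):
    """Invert C = 8 * alpha^2 to get the alpha implied by a risk value."""
    return math.sqrt(risk / 8.0)

def bayes_error_lower_bound(risk):
    """Lower bound 1/2 - sqrt(risk / 8) on the Bayes error rate."""
    return 0.5 - alpha_from_risk(risk)

print(threshold_from_alpha(0.05))        # threshold for alpha = 0.05
print(alpha_from_risk(2.93e-02))         # alpha implied by the 47-dimensional model's risk
print(bayes_error_lower_bound(2.93e-02)) # corresponding lower bound on the error rate
```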

Experimental Setup

Datasets

  1. Red Wine Quality Dataset:
    • Source: UCI Machine Learning Repository
    • Sample size: 1599 (red wine data)
    • Variables: 11 chemical substances (continuous) + quality indicator (integer 3-8)
    • Model: 47-dimensional exponential family model (after correlation filtering)
  2. Abalone Dataset:
    • Source: UCI Machine Learning Repository
    • Sample size: 4177
    • Variables: Sex (3 categories) + rings (integer 1-29)
    • Model: 62-dimensional multinomial distribution (63 categories)

Experimental Design

  • Red wine data: Randomly split in half, one half for model construction, one half for parameter estimation
  • Abalone data: Direct application of the $p-n$ criterion formula for multinomial distribution
  • MCMC methods used to handle normalizing constants for complex exponential family models
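One generic way MCMC can avoid the normalizing constant is sketched below. It is an illustration under stated assumptions and not the paper's actual algorithm, which this summary does not describe. The idea relies on the standard exponential family identity that $\ddot{\Psi}(\theta)$ equals the covariance matrix of the sufficient statistics under $g(x; \theta)$, so samples drawn by random-walk Metropolis from the unnormalized density suffice. The toy model with $\xi(x) = (x, x^2)$, the step size, and all function names are hypothetical; with $\theta = (0, -0.5)$ the target is the standard normal, which gives a known answer to compare against.

```python
import math
import random

def log_unnormalized(x, theta):
    """log of exp(theta_1 * x + theta_2 * x^2); Psi(theta) is never evaluated."""
    return theta[0] * x + theta[1] * x * x

def metropolis(theta, n_steps=20000, step=2.0, burn_in=2000, seed=0):
    """Random-walk Metropolis sampler targeting the unnormalized density."""
    rng = random.Random(seed)
    x, samples = 0.0, []
    for t in range(n_steps + burn_in):
        proposal = x + rng.gauss(0.0, step)
        log_ratio = log_unnormalized(proposal, theta) - log_unnormalized(x, theta)
        # 1 - random() lies in (0, 1], so the logarithm is always defined.
        if math.log(1.0 - rng.random()) < log_ratio:
            x = proposal
        if t >= burn_in:
            samples.append(x)
    return samples

def covariance_of_statistics(samples):
    """Sample covariance of xi(x) = (x, x^2), an estimate of Psi''(theta)."""
    stats = [(x, x * x) for x in samples]
    m = len(stats)
    means = [sum(s[i] for s in stats) / m for i in range(2)]
    return [[sum((s[i] - means[i]) * (s[j] - means[j]) for s in stats) / (m - 1)
             for j in range(2)] for i in range(2)]

theta = (0.0, -0.5)  # corresponds to N(0, 1), where Var(x) = 1
psi_ddot = covariance_of_statistics(metropolis(theta))
print(psi_ddot)
```

For models whose normalizing constant is intractable, the same estimate of $\ddot{\Psi}(\hat{\theta})$ could be plugged into the first-order term of the criterion, at the cost of Monte Carlo error that should be checked separately.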

Experimental Results

Red Wine Dataset Results

  • 47-dimensional model ($n=799$):
    • First-order term: 2.95e-02
    • Second-order term: -1.30e-04
    • Total estimated risk: 2.93e-02
    • Corresponding $\alpha \approx 0.06$, Bayes error rate > 0.44
  • 37-dimensional simplified model:
    • Total estimated risk: 1.62e-02 < 0.02 (threshold for $\alpha=0.05$)
    • Satisfies $p-n$ criterion requirements
  • Classification Performance: Generative classifier accuracy 58%, decision tree 63%, but generative model shows less overfitting

Abalone Dataset Results

  • $p=62$, $n=4177$, $\hat{M}=36128.33$
  • First-order risk: 0.0074, second-order risk: 1.73e-04
  • Total risk: 0.0076 < 0.02 (for $\alpha=0.05$)
  • Satisfies $p-n$ criterion
  • However, $\alpha=0.01$ would require $n \geq 38847$, so the actual sample is insufficient for that stricter threshold
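The reported figures can be roughly reproduced with a few lines of arithmetic. The sketch assumes that the first-order term for the saturated multinomial model is $p/(2n)$, which is consistent with the reported 0.0074, and that the second-order term scales as $c/n^2$ with $c$ held fixed at the value implied by the reported 1.73e-04. Both are assumptions made for this example, and the function name is hypothetical. The required sample size is then the positive root of $C n^2 - (p/2) n - c = 0$.

```python
import math

def required_n(p, c2, threshold):
    """Smallest real n with p/(2n) + c2/n^2 <= threshold.

    Multiplying through by n^2 gives threshold*n^2 - (p/2)*n - c2 >= 0,
    whose positive root is returned.
    """
    half_p = p / 2.0
    discriminant = half_p**2 + 4.0 * threshold * c2
    return (half_p + math.sqrt(discriminant)) / (2.0 * threshold)

p, n = 62, 4177
first_order = p / (2 * n)       # assumed form of the first-order term
c2 = 1.73e-04 * n**2            # second-order coefficient implied by the reported value
c_alpha = 8 * 0.01**2           # threshold for alpha = 0.01

n_first_only = required_n(p, 0.0, c_alpha)
n_with_second = required_n(p, c2, c_alpha)
print(first_order, n_first_only, n_with_second)
```

Under these assumptions the first-order term alone already requires a sample size near 38750, and including the second-order term moves the requirement to roughly the reported 38847; small differences are expected because the inputs are rounded.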

Key Findings

  1. Second-order terms contribute minimally to total risk; first-order approximation is usually sufficient
  2. The $p-n$ criterion effectively guides model selection and sample size determination
  3. Complex models can be implemented via MCMC without requiring explicit normalizing constants

Related Work

Exponential Family Theory

  • Portnoy, Stone, Barron & Sheu studied convergence of exponential family sequences
  • Wainwright & Jordan studied basis function selection in graphical models
  • Efron & Tibshirani studied mixed exponential family construction

Information Geometry

  • Amari & Nagaoka's information geometry theory provides geometric foundations for this work
  • Csiszár's information projection concept
  • $\alpha$-divergence theoretical framework

Model Selection

  • Relationships with AIC/TIC information criteria
  • This paper's method separates estimation risk and approximation risk

Conclusions and Discussion

Main Conclusions

  1. Establishes precise asymptotic theory for MLE estimation risk, particularly simplified forms for exponential families
  2. Proposes the practical $p-n$ criterion for sample size determination and model acceptance problems
  3. Provides algorithmic frameworks for handling complex exponential family models
  4. Establishes theoretical connections with information criteria

Limitations

  1. Theoretical assumptions require appropriate regularity conditions
  2. Second-order term computation is complex; first-order approximation is typically used in practice
  3. Threshold setting is based on approximate relationships, potentially lacking precision
  4. For non-exponential family models, the criterion form is more complex

Future Directions

  1. Extension to more general divergence families
  2. Investigation of finite sample properties
  3. Development of more efficient computational algorithms
  4. Application to modern statistical models such as deep learning

In-Depth Evaluation

Strengths

  1. Theoretical Rigor: Provides complete mathematical proofs with in-depth theoretical analysis
  2. Practical Value: The $p-n$ criterion can be directly applied to real problems
  3. Methodological Innovation: Novel approach of separating estimation risk and approximation risk
  4. Computational Feasibility: Provides MCMC implementation for complex models
  5. Broad Applicability: Applicable to various exponential family models

Weaknesses

  1. Computational Complexity: Large computational burden for second-order terms limits practical application
  2. Assumption Requirements: Requires strong regularity assumptions
  3. Limited Experiments: Validation on only two datasets
  4. Threshold Approximation: The approximate relationship between Bayes error rate and K-L divergence may lack precision

Impact

  1. Theoretical Contribution: Provides new analytical tools for statistical learning theory
  2. Practical Guidance: Provides quantitative criteria for model selection
  3. Methodological Framework: Establishes new framework for risk decomposition
  4. Extensibility: Provides theoretical foundation for subsequent research

Applicable Scenarios

  1. Sample size planning for exponential family models
  2. Model selection for complex statistical models
  3. Model complexity control in machine learning
  4. Guidance for prior selection in Bayesian statistics

References

This paper cites 28 important references covering information geometry, exponential family theory, asymptotic statistics, and other fields, providing solid theoretical foundations. Key references include Amari's monograph on information geometry, Barron & Sheu's research on exponential family convergence, and classical statistical learning theory literature.