
MLE convergence speed to information projection of exponential family: Criterion for model dimension and sample size -- complete proof version--

Sheena

For a parametric model of distributions, the closest distribution in the model to the true distribution located outside the model is considered. Measuring the closeness between two distributions with the Kullback-Leibler (K-L) divergence, the closest distribution is called the "information projection." The estimation risk of the maximum likelihood estimator (MLE) is defined as the expectation of K-L divergence between the information projection and the predictive distribution with plugged-in MLE. Here, the asymptotic expansion of the risk is derived up to $n^{-2}$-order, and the sufficient condition on the risk for the Bayes error rate between the true distribution and the information projection to be lower than a specified value is investigated. Combining these results, the "$p-n$ criterion" is proposed, which determines whether the MLE is sufficiently close to the information projection for the given model and sample. In particular, the criterion for an exponential family model is relatively simple and can be used for a complex model with no explicit form of normalizing constant. This criterion can constitute a solution to the sample size or model acceptance problem. Use of the $p-n$ criteria is demonstrated for two practical datasets. The relationship between the results and information criteria is also studied.

MLE Convergence Speed to Information Projection of Exponential Family: Criterion for Model Dimension and Sample Size -- Complete Proof Version --

Basic Information

Paper ID: 2105.08947
Title: MLE convergence speed to information projection of exponential family: Criterion for model dimension and sample size -- complete proof version--
Author: Yo Sheena (Faculty of Data Science, Shiga University; Visiting Professor, Institute of Statistical Mathematics)
Classification: math.ST stat.TH
Publication Date: May 2021 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2105.08947

Abstract

This paper investigates the problem of finding the distribution within a parametric model that is closest to the true distribution when the true distribution lies outside the model. Using Kullback-Leibler (K-L) divergence to measure distributional distance, the closest distribution is termed the "information projection." The estimation risk of the maximum likelihood estimator (MLE) is defined as the expected K-L divergence between the information projection and the predictive distribution with the MLE plugged in. The paper derives the asymptotic expansion of risk up to order $n^{-2}$ and investigates sufficient conditions for the Bayes error rate between the true distribution and information projection to be below a specified value. Combining these results, a " $p-n$ criterion" is proposed to determine whether the MLE is sufficiently close to the information projection under a given model and sample size. Notably, the criterion for exponential family models is relatively simple and applicable to complex models without explicit forms of normalizing constants. This criterion can serve as a solution to sample size determination or model acceptance problems.

Research Background and Motivation

Core Problem

Given a dataset, one must assume an unknown probability distribution as the generator of independent and identically distributed (i.i.d.) samples. If a parametric distribution model is adopted to "explain" the data, the primary task is to find the "best" distribution within the model. Since the true distribution typically lies outside the model, "best" means the distribution "closest" to the true distribution.

Problem Significance

Successful distributional approximation has broad applications:

Regression or discriminant analysis based on conditional distributions
Multiple imputation using conditional or unconditional distributions
Anomaly detection based on probability contour regions
Embodying C.R. Rao's famous equation: "uncertain knowledge" + "knowledge of the degree of uncertainty" = "available knowledge"

Limitations of Existing Methods

Three important problems exist in the distributional approximation process:

Systematic methods for constructing distributional models
Methods for assessing how close an estimator is to the best distribution
Methods for assessing how close the best distribution is to the true distribution

Existing research primarily focuses on the closeness between the predictive distribution and the true distribution, rather than closeness to the best distribution.

Research Motivation

This paper focuses on the second problem, establishing a criterion for determining whether the MLE is sufficiently close to the best distribution. By separating the second and third problems, the paper fixes the model and derives the asymptotic expansion of risk with respect to sample size $n$.

Core Contributions

Theoretical Contribution: Derives the asymptotic expansion of MLE estimation risk up to order $n^{-2}$ for general distributional models with complete mathematical proofs
Exponential Family Specialization: Provides simplified risk expressions and practical $p-n$ criterion for exponential family models
Practical Criterion: Proposes the $p-n$ criterion for determining whether sample size is sufficient or model dimension is appropriate
Algorithmic Framework: Provides computational algorithms for complex exponential family models without requiring explicit normalizing constants
Empirical Validation: Verifies the effectiveness of the $p-n$ criterion on two real datasets
Theoretical Connections: Establishes relationships with information criteria (AIC/TIC)

Methodology Details

Task Definition

Consider a parametric distributional model $M = \{g(x; \theta) | \theta \in \Theta\}$, where $g(x; \theta)$ is a probability density function with respect to a reference measure $d\mu$, and let the true distribution have density $g(x)$. The objectives are:

Find the information projection $g(x; \theta^*)$ in the model
Assess the distance between the predictive distribution $g(x; \hat{\theta})$ corresponding to MLE $\hat{\theta}$ and the information projection
Establish a criterion for determining whether the MLE is sufficiently close to the information projection

Core Framework

Information Projection Definition

The information projection $g(x; \theta^*)$ is defined as: $\theta^* = \arg \min_{\theta \in \Theta} D[g(x) | g(x; \theta)]$ where $D[g_1 | g_2] = \int g_1(x) \log(g_1(x)/g_2(x))d\mu$ is the K-L divergence.
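
The definition can be made concrete with a small numerical sketch. Everything in it is a hypothetical toy setup, not taken from the paper: a true distribution `g` on the four points 0..3, a one-parameter exponential family with $\xi(x) = x$ and counting measure as the reference measure, and the helper names `model_probs`, `kl`, and `information_projection`. For an exponential family, minimizing the K-L divergence over $\theta$ is equivalent to minimizing $\Psi(\theta) - \theta \cdot E_g[\xi]$, so plain Newton steps are used here.

```python
import numpy as np

def model_probs(theta, xi):
    """Model probabilities g(x; theta) on a finite support (softmax of theta . xi)."""
    logits = xi @ theta
    w = np.exp(logits - logits.max())
    return w / w.sum()

def kl(g1, g2):
    """K-L divergence D[g1 | g2] for discrete distributions."""
    mask = g1 > 0
    return float(np.sum(g1[mask] * np.log(g1[mask] / g2[mask])))

def information_projection(g, xi, iters=50):
    """Newton iterations for theta* = argmin_theta D[g | g(.; theta)]."""
    theta = np.zeros(xi.shape[1])
    target = g @ xi                                  # E_g[xi(X)]
    for _ in range(iters):
        q = model_probs(theta, xi)
        mean = q @ xi                                # gradient of Psi at theta
        centered = xi - mean
        hess = centered.T @ (centered * q[:, None])  # Hessian of Psi = Cov_theta(xi)
        theta = theta + np.linalg.solve(hess, target - mean)
    return theta

# Hypothetical true distribution; it lies outside the one-parameter model.
g = np.array([0.1, 0.2, 0.3, 0.4])
xi = np.arange(4.0).reshape(-1, 1)
theta_star = information_projection(g, xi)
q_star = model_probs(theta_star, xi)
```

At the solution the model mean of $\xi$ matches its mean under `g`, which is the usual moment-matching characterization of the information projection in an exponential family.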

Estimation Risk Definition

The estimation risk is defined as: $R[g(x; \theta^*) | g(x; \hat{\theta})] = E[D[g(x; \theta^*) | g(x; \hat{\theta})]]$

Theoretical Results

Asymptotic Expansion for General Models

Theorem 1: The estimation risk of MLE with respect to K-L divergence is: $R[g(x; \theta^*) | g(x; \hat{\theta})] = (2n)^{-1}\text{tr}(\tilde{G}^{-1}G\tilde{G}^{-1}G^*) + n^{-2}[\text{complex second-order terms}] + O(n^{-3})$

where:

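$G^*_{ij}(\theta^*)$ : Fisher information matrix of the model, evaluated at $\theta^*$
$\tilde{G}_{ij}(\theta^*)$ : Negative expectation of the Hessian matrix of the log-likelihood, evaluated at $\theta^*$
$G_{ij}(\theta^*)$ : Variance-covariance matrix of the score under the true distribution, evaluated at $\theta^*$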

Simplified Results for Exponential Families

Corollary 1: For exponential family models $g(x; \theta) = \exp(\sum_{i=1}^p \theta_i \xi_i(x) - \Psi(\theta))$ : $R[g(x; \theta^*) | g(x; \hat{\theta})] = \frac{1}{2n}\text{tr}(\tilde{G}^{-1}G) + \frac{1}{24n^2}[\text{function of third and fourth cumulants}] + O(n^{-3})$

Key property: $G^* = \tilde{G} = \ddot{\Psi}(\theta^*)$ (second derivative matrix)
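
A worked special case may help to read the first-order term. It is a hypothetical toy example, not one from the paper: the unit-variance Gaussian location model, written as an exponential family with respect to the standard normal reference measure, with an arbitrary true density $g$ assumed to have mean $m$ and variance $\sigma^2$.

```latex
\begin{align*}
g(x;\theta) &= \exp\!\left(\theta x - \tfrac{1}{2}\theta^2\right), \quad
  \xi(x) = x, \quad \Psi(\theta) = \tfrac{1}{2}\theta^2, \quad \ddot{\Psi}(\theta) = 1, \\
\theta^* &= E_g[X] = m, \qquad \hat{\theta} = \bar{x}, \\
D[g(x;\theta^*) | g(x;\hat{\theta})] &= \tfrac{1}{2}(\hat{\theta} - \theta^*)^2, \\
R[g(x;\theta^*) | g(x;\hat{\theta})] &= \tfrac{1}{2}\operatorname{Var}(\bar{x})
  = \frac{\sigma^2}{2n}
  = \frac{1}{2n}\operatorname{tr}(\tilde{G}^{-1}G),
  \qquad \tilde{G} = 1,\ G = \sigma^2 .
\end{align*}
```

In this case the risk equals the first-order term exactly, so all higher-order terms vanish. When $g$ is itself $N(m, 1)$, the value reduces to $1/(2n)$, that is, $p/(2n)$ with $p = 1$.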

The $p-n$ Criterion

Criterion for General Models

$C \geq \frac{1}{2n}\text{tr}(\hat{\tilde{G}}^{-1}\hat{G}\hat{\tilde{G}}^{-1}\hat{G}^*)$

Criterion for Exponential Families

$C \geq \frac{1}{2n}\text{tr}(\hat{\Sigma}(\ddot{\Psi}(\hat{\theta}))^{-1}) + \frac{1}{24n^2}[\text{estimated second-order terms}]$

where $\hat{\Sigma}$ is the sample covariance matrix of the $\xi_i$ terms.
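
The first-order part of this criterion is simple to evaluate once $\ddot{\Psi}(\hat{\theta})$ is available. The following is a minimal sketch under stated assumptions: the function name `first_order_risk` and the toy data (Laplace samples fitted with a unit-variance Gaussian location model, for which $\ddot{\Psi} = 1$) are illustrative only, and the $n^{-2}$ term is left out.

```python
import numpy as np

def first_order_risk(xi_samples, psi_hessian):
    """Estimate (1/2n) tr(Sigma_hat @ inv(Psi_hessian)).

    xi_samples  : (n, p) array of sufficient statistics xi_i(x_t), one row per observation
    psi_hessian : (p, p) matrix, second derivative of Psi evaluated at the MLE
    """
    n = xi_samples.shape[0]
    sigma_hat = np.atleast_2d(np.cov(xi_samples, rowvar=False))  # sample covariance of xi
    # tr(Sigma_hat H^{-1}) = tr(H^{-1} Sigma_hat); solve() avoids forming the inverse.
    return float(np.trace(np.linalg.solve(psi_hessian, sigma_hat))) / (2 * n)

# Hypothetical misspecified setting: Laplace data, Gaussian location model (xi(x) = x).
rng = np.random.default_rng(0)
x = rng.laplace(loc=0.3, scale=1.0, size=2000)
risk_hat = first_order_risk(x.reshape(-1, 1), np.array([[1.0]]))
```

The estimate would then be compared with the chosen threshold $C$: the criterion is met when `risk_hat <= C`.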

Threshold Setting

The threshold $C$ is set through the relationship between Bayes error rate and K-L divergence:

If $D[g_1 | g_2] \leq \delta$, then the Bayes error rate satisfies $\text{Er}[g_1 | g_2] \geq 1/2 - \sqrt{\delta/8}$
Requiring the error rate to be at least $1/2 - \alpha$ therefore leads to the approximate threshold $C_\alpha = 8\alpha^2$ (for example, $C_{0.05} = 0.02$)
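
The mapping between $\alpha$, the threshold, and the error-rate bound is plain arithmetic and can be sketched as follows. The function names are hypothetical; the formulas are the ones stated in this section.

```python
import math

def threshold(alpha):
    """Threshold C_alpha = 8 * alpha^2 on the estimated risk."""
    return 8.0 * alpha ** 2

def alpha_from_risk(risk):
    """Inverse map: the alpha whose threshold equals the given risk."""
    return math.sqrt(risk / 8.0)

def bayes_error_lower_bound(delta):
    """Lower bound 1/2 - sqrt(delta / 8) on the Bayes error rate when D <= delta."""
    return 0.5 - math.sqrt(delta / 8.0)

c_005 = threshold(0.05)              # threshold for alpha = 0.05
alpha_wine = alpha_from_risk(2.93e-2)  # alpha implied by an estimated risk of 2.93e-02
```

For instance, `threshold(0.05)` evaluates to 0.02 up to floating-point rounding, and an estimated risk of 2.93e-02 corresponds to an $\alpha$ of roughly 0.06.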

Experimental Setup

Datasets

Red Wine Quality Dataset:
- Source: UCI Machine Learning Repository
- Sample size: 1599 (red wine data)
- Variables: 11 chemical substances (continuous) + quality indicator (integer 3-8)
- Model: 47-dimensional exponential family model (after correlation filtering)
Abalone Dataset:
- Source: UCI Machine Learning Repository
- Sample size: 4177
- Variables: Sex (3 categories) + rings (integer 1-29)
- Model: 62-dimensional multinomial distribution (63 categories)

Experimental Design

Red wine data: Randomly split in half, one half for model construction, one half for parameter estimation
Abalone data: Direct application of the $p-n$ criterion formula for multinomial distribution
MCMC methods are used to handle normalizing constants for complex exponential family models
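
One way MCMC can stand in for an explicit normalizing constant is through the identity $\ddot{\Psi}(\theta) = \text{Cov}_\theta[\xi(X)]$: samples drawn from the unnormalized density $\exp(\sum_i \theta_i \xi_i(x))$ give an estimate of $\ddot{\Psi}$ without ever computing $\Psi$. The sketch below is an illustration under stated assumptions, not the paper's algorithm: a random-walk Metropolis sampler on the real line with Lebesgue reference measure, hypothetical names `xi` and `metropolis_xi_cov`, and a Gaussian toy model chosen because its true answer is known.

```python
import numpy as np

def xi(x):
    """Sufficient statistics of the toy model: xi(x) = (x, x^2)."""
    return np.array([x, x * x])

def metropolis_xi_cov(theta, xi, n_steps=100_000, burn_in=5_000, step=2.0, seed=0):
    """Estimate Cov_theta[xi(X)] from random-walk Metropolis draws.

    Only differences theta . (xi(y) - xi(x)) are needed, so the
    normalizing constant exp(Psi(theta)) never appears.
    """
    rng = np.random.default_rng(seed)
    total = n_steps + burn_in
    jumps = rng.normal(scale=step, size=total)     # symmetric proposal increments
    log_u = np.log(rng.uniform(size=total))
    x, stat = 0.0, xi(0.0)
    draws = np.empty((n_steps, theta.size))
    for t in range(total):
        y = x + jumps[t]
        stat_y = xi(y)
        if log_u[t] < theta @ (stat_y - stat):     # Metropolis accept/reject
            x, stat = y, stat_y
        if t >= burn_in:
            draws[t - burn_in] = stat
    return np.cov(draws, rowvar=False)

# theta = (0, -1/2) is the standard normal, whose Cov[(X, X^2)] is diag(1, 2).
psi_hess_hat = metropolis_xi_cov(np.array([0.0, -0.5]), xi)
```

The returned matrix can be plugged in wherever $\ddot{\Psi}(\hat{\theta})$ is required. In a real application the proposal scale, burn-in, and chain length would need tuning and convergence checks.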

Experimental Results

Red Wine Dataset Results

47-dimensional model ($n = 799$):
- First-order term: 2.95e-02
- Second-order term: -1.30e-04
- Total estimated risk: 2.93e-02
- Corresponding $\alpha \approx 0.06$ , Bayes error rate > 0.44
37-dimensional simplified model:
- Total estimated risk: 1.62e-02 < 0.02 (threshold for $\alpha=0.05$ )
- Satisfies $p-n$ criterion requirements
Classification Performance: The generative classifier reaches 58% accuracy versus 63% for a decision tree, but the generative model shows less overfitting

Abalone Dataset Results

$p=62$ , $n=4177$ , $\hat{M}=36128.33$
First-order risk: 0.0074, second-order risk: 1.73e-04
Total risk: 0.0076 < 0.02 (for $\alpha=0.05$ )
Satisfies $p-n$ criterion
However, $\alpha=0.01$ would require $n \geq 38847$, so the actual sample size is insufficient at that level
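
The abalone figures can be checked approximately with a few lines of arithmetic. This sketch rests on two assumptions that are not spelled out above: the first-order term is taken as $p/(2n)$, which reproduces the reported 0.0074, and the second-order term is taken to scale as $c_2/n^2$ with $c_2$ backed out from the reported 1.73e-04 at $n = 4177$. The name `required_n` is hypothetical.

```python
import math

def required_n(p, c2, C):
    """Smallest integer n with p/(2n) + c2/n^2 <= C, assuming c2 >= 0.

    The left side decreases in n, so the answer is the ceiling of the
    positive root of C*n^2 - (p/2)*n - c2 = 0.
    """
    half_p = p / 2.0
    root = (half_p + math.sqrt(half_p ** 2 + 4.0 * C * c2)) / (2.0 * C)
    return math.ceil(root)

p, n_obs = 62, 4177
first_order = p / (2 * n_obs)          # assumed form of the first-order term
c2 = 1.73e-4 * n_obs ** 2              # coefficient backed out from the reported value
total_risk = first_order + c2 / n_obs ** 2
n_needed = required_n(p, c2, 8 * 0.01 ** 2)   # threshold C = 8 * alpha^2 with alpha = 0.01
```

Under these assumptions `total_risk` is about 0.0076 and `n_needed` lands within a unit or two of the reported 38847; the small gap comes from rounding in the reported second-order value.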

Key Findings

Second-order terms contribute minimally to total risk; first-order approximation is usually sufficient
The $p-n$ criterion effectively guides model selection and sample size determination
Complex models can be implemented via MCMC without requiring explicit normalizing constants

Related Work

Exponential Family Theory

Portnoy, Stone, Barron & Sheu studied convergence of exponential family sequences
Wainwright & Jordan studied basis function selection in graphical models
Efron & Tibshirani studied mixed exponential family construction

Information Geometry

Amari & Nagaoka's information geometry theory provides geometric foundations for this work
Csiszár's information projection concept
$\alpha$ -divergence theoretical framework

Model Selection

Relationships with AIC/TIC information criteria
This paper's method separates estimation risk and approximation risk

Conclusions and Discussion

Main Conclusions

Establishes precise asymptotic theory for MLE estimation risk, particularly simplified forms for exponential families
Proposes the practical $p-n$ criterion for sample size determination and model acceptance problems
Provides algorithmic frameworks for handling complex exponential family models
Establishes theoretical connections with information criteria

Limitations

Theoretical assumptions require appropriate regularity conditions
Second-order term computation is complex; first-order approximation is typically used in practice
Threshold setting is based on approximate relationships, potentially lacking precision
For non-exponential family models, the criterion form is more complex

Future Directions

Extension to more general divergence families
Investigation of finite sample properties
Development of more efficient computational algorithms
Application to modern statistical models such as deep learning

In-Depth Evaluation

Strengths

Theoretical Rigor: Provides complete mathematical proofs with in-depth theoretical analysis
Practical Value: The $p-n$ criterion can be directly applied to real problems
Methodological Innovation: Novel approach of separating estimation risk and approximation risk
Computational Feasibility: Provides MCMC implementation for complex models
Broad Applicability: Applicable to various exponential family models

Weaknesses

Computational Complexity: Large computational burden for second-order terms limits practical application
Assumption Requirements: Requires strong regularity assumptions
Limited Experiments: Validation on only two datasets
Threshold Approximation: The approximate relationship between Bayes error rate and K-L divergence may lack precision

Impact

Theoretical Contribution: Provides new analytical tools for statistical learning theory
Practical Guidance: Provides quantitative criteria for model selection
Methodological Framework: Establishes new framework for risk decomposition
Extensibility: Provides theoretical foundation for subsequent research

Applicable Scenarios

Sample size planning for exponential family models
Model selection for complex statistical models
Model complexity control in machine learning
Guidance for prior selection in Bayesian statistics

References

This paper cites 28 important references covering information geometry, exponential family theory, asymptotic statistics, and other fields, providing solid theoretical foundations. Key references include Amari's monograph on information geometry, Barron & Sheu's research on exponential family convergence, and classical statistical learning theory literature.