MLE convergence speed to information projection of exponential family: Criterion for model dimension and sample size -- complete proof version--
Sheena
For a parametric model of distributions, the closest distribution in the model to the true distribution located outside the model is considered. Measuring the closeness between two distributions with the Kullback-Leibler (K-L) divergence, the closest distribution is called the "information projection." The estimation risk of the maximum likelihood estimator (MLE) is defined as the expectation of K-L divergence between the information projection and the predictive distribution with plugged-in MLE. Here, the asymptotic expansion of the risk is derived up to $n^{-2}$-order, and the sufficient condition on the risk for the Bayes error rate between the true distribution and the information projection to be lower than a specified value is investigated. Combining these results, the "$p-n$ criterion" is proposed, which determines whether the MLE is sufficiently close to the information projection for the given model and sample. In particular, the criterion for an exponential family model is relatively simple and can be used for a complex model with no explicit form of normalizing constant. This criterion can constitute a solution to the sample size or model acceptance problem. Use of the $p-n$ criteria is demonstrated for two practical datasets. The relationship between the results and information criteria is also studied.
MLE Convergence Speed to Information Projection of Exponential Family: Criterion for Model Dimension and Sample Size -- Complete Proof Version --
Author: Yo Sheena (Faculty of Data Science, Shiga University; Visiting Professor, Institute of Statistical Mathematics)
This paper investigates the problem of finding the distribution within a parametric model that is closest to the true distribution when the true distribution lies outside the model. Using Kullback-Leibler (K-L) divergence to measure closeness between distributions, the closest distribution is termed the "information projection." The estimation risk of the maximum likelihood estimator (MLE) is defined as the expected K-L divergence between the information projection and the predictive distribution with the MLE plugged in. The paper derives the asymptotic expansion of the risk up to order $n^{-2}$ and investigates a sufficient condition on the risk for the Bayes error rate between the true distribution and the information projection to be below a specified value. Combining these results, a "$p-n$ criterion" is proposed to determine whether the MLE is sufficiently close to the information projection for a given model and sample size. Notably, the criterion for exponential family models is relatively simple and applies to complex models with no explicit form of the normalizing constant. This criterion can serve as a solution to the sample size determination or model acceptance problem.
Given a dataset, one must assume an unknown probability distribution as the generator of independent and identically distributed (i.i.d.) samples. If a parametric distribution model is adopted to "explain" the data, the primary task is to find the "best" distribution within the model. Since the true distribution typically lies outside the model, "best" means the distribution "closest" to the true distribution.
Three important problems exist in the distributional approximation process:
1. Systematic methods for constructing distributional models
2. Methods for assessing how close an estimator is to the best distribution
3. Methods for assessing how close the best distribution is to the true distribution
Existing research primarily focuses on the closeness between the predictive distribution and the true distribution, rather than closeness to the best distribution.
This paper focuses on the second problem, establishing a criterion for determining whether the MLE is sufficiently close to the best distribution. By separating the second and third problems, the paper fixes the model and derives the asymptotic expansion of the risk with respect to the sample size $n$.
Theoretical Contribution: Derives the asymptotic expansion of the MLE estimation risk up to order $n^{-2}$ for general distributional models, with complete mathematical proofs
Exponential Family Specialization: Provides simplified risk expressions and a practical $p-n$ criterion for exponential family models
Practical Criterion: Proposes the $p-n$ criterion for determining whether the sample size is sufficient or the model dimension is appropriate
Algorithmic Framework: Provides computational algorithms for complex exponential family models without requiring explicit normalizing constants
Empirical Validation: Verifies the effectiveness of the p−n criterion on two real datasets
Theoretical Connections: Establishes relationships with information criteria (AIC/TIC)
Consider a parametric distributional model $M=\{g(x;\theta)\mid\theta\in\Theta\}$, where $g(x;\theta)$ is a probability density function with respect to a reference measure $d\mu$. The true distribution has density $g(x)$. The objectives are:
Find the information projection $g(x;\theta^*)$ in the model
Assess the distance between the information projection and the predictive distribution $g(x;\hat\theta)$ corresponding to the MLE $\hat\theta$
Establish a criterion for determining whether the MLE is sufficiently close to the information projection
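The first objective can be illustrated on a toy problem. The sketch below is not taken from the paper: it assumes the divergence is measured from the true distribution to the model (the direction the MLE targets), uses a Binomial($m$, $\pi$) model on $\{0,\dots,m\}$ as a one-parameter exponential family, and picks an arbitrary bimodal true pmf. The names `binom_pmf`, `kl_to_model`, `information_projection`, and `g_true` are hypothetical. For an exponential family, the information projection matches the expected sufficient statistic, so here $\pi^* = E_g[X]/m$; the grid search only confirms this numerically.

```python
import math

def binom_pmf(x, m, pi):
    """Probability of x successes under Binomial(m, pi)."""
    return math.comb(m, x) * pi**x * (1 - pi) ** (m - x)

def kl_to_model(g, m, pi):
    """K-L divergence D(g || Binomial(m, pi)) for a pmf g on {0, ..., m}."""
    return sum(gx * math.log(gx / binom_pmf(x, m, pi))
               for x, gx in enumerate(g) if gx > 0)

def information_projection(g):
    """Moment matching for the binomial model: pi* = E_g[X] / m."""
    m = len(g) - 1
    mean = sum(x * gx for x, gx in enumerate(g))
    return mean / m

# Toy true distribution (assumption): bimodal, so it lies outside the model.
g_true = [0.4, 0.1, 0.1, 0.4]
pi_star = information_projection(g_true)

# Brute-force check: minimize the K-L divergence over a grid of pi values.
grid = [i / 1000 for i in range(1, 1000)]
pi_grid = min(grid, key=lambda p: kl_to_model(g_true, 3, p))

print(pi_star, pi_grid)
```

Both routes give the same parameter, but the divergence at that point remains strictly positive: the information projection is the best the model can do, not the true distribution itself.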
Theorem 1: The estimation risk of the MLE with respect to K-L divergence is:
$$R[g(x;\theta^*)\mid g(x;\hat\theta)] = (2n)^{-1}\,\mathrm{tr}\bigl(\tilde G^{-1} G \tilde G^{-1} G^*\bigr) + n^{-2}[\text{complex second-order terms}] + O(n^{-3})$$
where:
$G^*_{ij}(\theta^*)$: Fisher information matrix
$\tilde G_{ij}(\theta^*)$: Negative expectation of the Hessian matrix
$G_{ij}(\theta^*)$: Variance-covariance matrix under the true distribution
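A quick sanity check on the leading term is the well-specified case. This is a standard fact rather than a statement quoted from the paper, and it assumes that $G$ is the covariance of the score and that $p$ denotes the dimension of $\theta$. If the true density belongs to the model, the information matrix equality makes the three matrices coincide, and the first-order risk reduces to the familiar $p/(2n)$:

```latex
g(x) = g(x;\theta^*)
\;\Longrightarrow\; \tilde G = G = G^*
\;\Longrightarrow\;
(2n)^{-1}\,\mathrm{tr}\bigl(\tilde G^{-1} G \tilde G^{-1} G^*\bigr)
= (2n)^{-1}\,\mathrm{tr}(I_p)
= \frac{p}{2n}.
```

Under misspecification the trace generally differs from $p$, so it acts as an effective dimension in the leading term.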
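Corollary 1: For exponential family models $g(x;\theta)=\exp\bigl(\sum_{i=1}^{p}\theta_i\xi_i(x)-\Psi(\theta)\bigr)$:
$$R[g(x;\theta^*)\mid g(x;\hat\theta)] = \frac{1}{2n}\,\mathrm{tr}\bigl(\tilde G^{-1} G\bigr) + \frac{1}{24n^2}[\text{function of third and fourth cumulants}] + O(n^{-3})$$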
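The first-order term of Corollary 1 can be evaluated by hand for a small model. The sketch below is an illustration under stated assumptions, not the paper's algorithm: it reads $G$ as the variance of the sufficient statistic under the true distribution and $\tilde G$ as its variance under the information projection (the Hessian of $\Psi$ at $\theta^*$), and it uses a Binomial($m$, $\pi$) model with $p = 1$, so the trace is a ratio of two scalars. The function name `leading_risk_term` and both pmfs are hypothetical. The second-order term is not computed.

```python
def leading_risk_term(g, n):
    """First-order term tr(G~^{-1} G) / (2n) for a Binomial(m, pi) model (p = 1).

    g is the true pmf on {0, ..., m}; n is the sample size.
    """
    m = len(g) - 1
    mean = sum(x * gx for x, gx in enumerate(g))
    # G: variance of the sufficient statistic X under the true distribution.
    var_true = sum((x - mean) ** 2 * gx for x, gx in enumerate(g))
    # G~: variance of X under the information projection Binomial(m, pi*),
    # where pi* = E_g[X] / m by moment matching.
    pi_star = mean / m
    var_model = m * pi_star * (1 - pi_star)
    return (var_true / var_model) / (2 * n)

g_true = [0.4, 0.1, 0.1, 0.4]            # misspecified: bimodal pmf
g_binom = [0.125, 0.375, 0.375, 0.125]   # well specified: Binomial(3, 0.5)

risk_misspecified = leading_risk_term(g_true, 100)
risk_well_specified = leading_risk_term(g_binom, 100)
print(risk_misspecified, risk_well_specified)
```

In the well-specified case the ratio equals 1 and the term is $p/(2n)$. For the bimodal pmf the true variance exceeds the model variance, so the same sample size yields a larger leading risk term.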
This paper cites 28 important references covering information geometry, exponential family theory, asymptotic statistics, and other fields, providing solid theoretical foundations. Key references include Amari's monograph on information geometry, Barron & Sheu's research on exponential family convergence, and classical statistical learning theory literature.