MLE convergence speed to information projection of exponential family: Criterion for model dimension and sample size -- complete proof version--

Sheena

For a parametric model of distributions, the closest distribution in the model to the true distribution located outside the model is considered. Measuring the closeness between two distributions with the Kullback-Leibler (K-L) divergence, the closest distribution is called the "information projection." The estimation risk of the maximum likelihood estimator (MLE) is defined as the expectation of K-L divergence between the information projection and the predictive distribution with plugged-in MLE. Here, the asymptotic expansion of the risk is derived up to $n^{-2}$-order, and the sufficient condition on the risk for the Bayes error rate between the true distribution and the information projection to be lower than a specified value is investigated. Combining these results, the "$p-n$ criterion" is proposed, which determines whether the MLE is sufficiently close to the information projection for the given model and sample. In particular, the criterion for an exponential family model is relatively simple and can be used for a complex model with no explicit form of normalizing constant. This criterion can constitute a solution to the sample size or model acceptance problem. Use of the $p-n$ criteria is demonstrated for two practical datasets. The relationship between the results and information criteria is also studied.

Basic information

Paper ID: 2105.08947
Title: MLE convergence speed to information projection of exponential family: Criterion for model dimension and sample size -- complete proof version--
Author: Yo Sheena (Faculty of Data Science, Shiga University; Visiting Professor, The Institute of Statistical Mathematics)
Categories: math.ST stat.TH
Published: May 2021 (arXiv preprint)
Paper link: https://arxiv.org/abs/2105.08947

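Abstract

The paper studies, for a parametric model of distributions, the distribution in the model that is closest to a true distribution lying outside the model. With closeness measured by the Kullback-Leibler (K-L) divergence, this closest distribution is called the "information projection." The estimation risk of the maximum likelihood estimator (MLE) is defined as the expected K-L divergence between the information projection and the predictive distribution with the MLE plugged in. The paper derives the asymptotic expansion of this risk up to order $n^{-2}$ and investigates a sufficient condition on the risk for the Bayes error rate between the true distribution and the information projection to fall below a specified value. Combining these results yields the "$p-n$ criterion," which determines whether the MLE is sufficiently close to the information projection for a given model and sample. In particular, the criterion for an exponential family model is relatively simple and can be used for complex models whose normalizing constant has no explicit form. The criterion can serve as a solution to the sample size or model acceptance problem.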

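Research background and motivation

Core problem

Given a dataset, one has to posit an unknown probability distribution as the generator of the independent and identically distributed (i.i.d.) sample. If a parametric distribution model is adopted to "explain" the data, the first task is to find the "best" distribution in the model. Because the true distribution usually lies outside the model, "best" means the distribution "closest" to the true one.

Why the problem matters

A successful distribution approximation has wide applications:

Regression or discriminant analysis based on conditional distributions
Multiple imputation using conditional or unconditional distributions
Outlier detection based on probability contour regions
It embodies C.R. Rao's famous equation: "uncertain knowledge" + "knowledge of the amount of uncertainty" = "usable knowledge"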

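Limitations of existing methods

Three important questions arise in distribution approximation:

How to construct a distribution model systematically
How to assess how close the estimator is to the best distribution
How to assess how close the best distribution is to the true distribution

Existing research mainly addresses the closeness of the predictive distribution to the true distribution, not its closeness to the best distribution.

Research motivation

The paper concentrates on the second question and establishes a criterion for judging whether the MLE is sufficiently close to the best distribution. It separates the second question from the third by fixing the model and deriving the asymptotic expansion of the risk with respect to the sample size $n$.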

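Core contributions

Theory: derives the asymptotic expansion of the MLE estimation risk up to order $n^{-2}$ for a general distribution model, with a complete mathematical proof
Exponential family specialization: gives a simplified risk expression and a practical $p-n$ criterion for exponential family models
Practical criterion: proposes the $p-n$ criterion, which can be used to decide whether the sample size is sufficient or the model dimension is appropriate
Algorithmic framework: provides a computational algorithm for complex exponential family models that needs no explicit normalizing constant
Empirical validation: demonstrates the $p-n$ criterion on two real datasets
Theoretical connection: establishes the relationship with information criteria (AIC/TIC)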

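Method details

Task definition

Consider a parametric distribution model $M = \{g(x; \theta) | \theta \in \Theta\}$, where $g(x; \theta)$ is a probability density function with respect to a reference measure $d\mu$. The true distribution has density $g(x)$. The goals are to:

Find the information projection $g(x; \theta^*)$ in the model
Assess the distance between the predictive distribution $g(x; \hat{\theta})$ given by the MLE $\hat{\theta}$ and the information projection
Establish a criterion for judging whether the MLE is sufficiently close to the information projection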

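Core framework

Definition of the information projection

The information projection $g(x; \theta^*)$ is defined by $\theta^* = \arg \min_{\theta \in \Theta} D[g(x) | g(x; \theta)]$, where $D[g_1 | g_2] = \int g_1(x) \log(g_1(x)/g_2(x))d\mu$ is the K-L divergence.

Definition of the estimation risk

The estimation risk is defined as $R[g(x; \theta^*) | g(x; \hat{\theta})] = E[D[g(x; \theta^*) | g(x; \hat{\theta})]]$, where the expectation is taken over samples drawn from the true distribution.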

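Theoretical results

Asymptotic expansion for a general model

Theorem 1: The estimation risk of the MLE with respect to the K-L divergence is $R[g(x; \theta^*) | g(x; \hat{\theta})] = (2n)^{-1}\text{tr}(\tilde{G}^{-1}G\tilde{G}^{-1}G^*) + n^{-2}[\text{complicated second-order term}] + O(n^{-3})$

where:

$G^*_{ij}(\theta^*)$: the Fisher information matrix
$\tilde{G}_{ij}(\theta^*)$: the negative expectation of the Hessian matrix
$G_{ij}(\theta^*)$: the variance-covariance matrix under the true distribution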
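
As a sanity check on the leading term, consider the well-specified case, which is not the paper's main setting. If the true density coincides with the information projection, $g(x) = g(x; \theta^*)$, the standard information matrix equality gives $G = \tilde{G} = G^*$. Writing $p$ for the dimension of $\theta$, the first-order term then collapses to the familiar value:

$$
\frac{1}{2n}\text{tr}(\tilde{G}^{-1}G\tilde{G}^{-1}G^*) = \frac{1}{2n}\text{tr}(I_p) = \frac{p}{2n}
$$

Under misspecification the trace generally differs from $p$, which is why the three matrices have to be kept separate in Theorem 1.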

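Simplified result for the exponential family

Corollary 1: For the exponential family model $g(x; \theta) = \exp(\sum_{i=1}^p \theta_i \xi_i(x) - \Psi(\theta))$, the risk is $R[g(x; \theta^*) | g(x; \hat{\theta})] = \frac{1}{2n}\text{tr}(\tilde{G}^{-1}G) + \frac{1}{24n^2}[\text{function of third- and fourth-order cumulants}] + O(n^{-3})$

Key property: $G^* = \tilde{G} = \ddot{\Psi}(\theta^*)$ (the matrix of second derivatives of $\Psi$), so the first-order term of Theorem 1 reduces to $\frac{1}{2n}\text{tr}(\tilde{G}^{-1}G)$.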
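
The first-order term can be checked by simulation in a one-parameter case. The sketch below is an illustration chosen here, not an example from the paper: a Poisson model ($\xi(x) = x$, $\Psi(\theta) = e^{\theta}$) is fitted to negative binomial data, so the true distribution lies outside the model. For an exponential family the information projection matches the mean of $\xi$, hence $\lambda^* = e^{\theta^*}$ equals the true mean and $\ddot{\Psi}(\theta^*) = \lambda^*$. Taking $G$ to be the variance of $\xi$ under the true distribution, the first-order term is $\text{Var}_g(X)/(2n\lambda^*)$. All function names and parameter values are assumptions.

```python
import numpy as np


def kl_poisson(lam_star, lam_hat):
    """K-L divergence D[Poisson(lam_star) | Poisson(lam_hat)]."""
    return lam_star * np.log(lam_star / lam_hat) - lam_star + lam_hat


def monte_carlo_risk(n, reps, r=5, p=0.5, seed=0):
    """Monte Carlo estimate of the estimation risk of the Poisson MLE.

    The data come from a negative binomial distribution (hypothetical choice),
    which is not a member of the Poisson model.
    """
    rng = np.random.default_rng(seed)
    samples = rng.negative_binomial(r, p, size=(reps, n))
    lam_hat = samples.mean(axis=1)   # Poisson MLE for each replication
    lam_star = r * (1 - p) / p       # information projection: true mean
    return kl_poisson(lam_star, lam_hat).mean()


def first_order_risk(n, r=5, p=0.5):
    """First-order term tr(G_tilde^{-1} G) / (2n) for this one-parameter case."""
    mean = r * (1 - p) / p           # G_tilde = Psi''(theta*) = lam_star
    var = r * (1 - p) / p**2         # G = variance of xi(x) = x under the truth
    return var / mean / (2 * n)


if __name__ == "__main__":
    n = 200
    print("Monte Carlo risk :", monte_carlo_risk(n, reps=20000))
    print("first-order term :", first_order_risk(n))
```

With the default parameters the true mean is 5 and the true variance is 10, so the first-order value is $1/n$. A well-specified Poisson fit would give $1/(2n)$ instead, which shows how overdispersion inflates the risk.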

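$p-n$ criterion

Criterion for a general model

$C \geq \frac{1}{2n}\text{tr}(\hat{\tilde{G}}^{-1}\hat{G}\hat{\tilde{G}}^{-1}\hat{G}^*)$

Criterion for the exponential family

$C \geq \frac{1}{2n}\text{tr}(\hat{\Sigma}(\ddot{\Psi}(\hat{\theta}))^{-1}) + \frac{1}{24n^2}[\text{estimated second-order term}]$

where $\hat{\Sigma}$ is the sample covariance matrix of the $\xi_i$ terms. In both cases the MLE is judged sufficiently close to the information projection when the estimated risk on the right-hand side does not exceed the threshold $C$.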
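
A minimal sketch of the first-order part of the exponential family criterion follows. It assumes the caller already has the Hessian $\ddot{\Psi}(\hat{\theta})$ as a matrix (from a closed form or a separate numerical estimate), uses the unbiased sample covariance (`ddof=1`, an assumption since the divisor is not specified here), and omits the second-order term entirely. The function name `first_order_pn_term` is hypothetical.

```python
import numpy as np


def first_order_pn_term(xi, psi_hessian):
    """Return tr(Sigma_hat @ inv(psi_hessian)) / (2n).

    xi          : array of shape (n,) or (n, p), the statistics xi_i(x) per sample
    psi_hessian : array of shape (p, p), second derivatives of Psi at theta_hat
    """
    xi = np.asarray(xi, dtype=float)
    if xi.ndim == 1:
        xi = xi[:, None]                       # treat a vector as p = 1
    n = xi.shape[0]
    hessian = np.atleast_2d(np.asarray(psi_hessian, dtype=float))
    sigma_hat = np.atleast_2d(np.cov(xi, rowvar=False))  # sample covariance
    # tr(Sigma H^{-1}) equals tr(H^{-1} Sigma); solve avoids an explicit inverse.
    return np.trace(np.linalg.solve(hessian, sigma_hat)) / (2 * n)


if __name__ == "__main__":
    # Poisson model: xi(x) = x and Psi''(theta_hat) equals the sample mean.
    x = np.array([2, 4, 4, 6, 4])
    term = first_order_pn_term(x, [[x.mean()]])
    print("first-order term:", term)
```

The returned value would then be compared against the threshold $C$; a model whose Hessian is only available through simulation would need that estimate supplied as `psi_hessian`.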

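Setting the threshold

The threshold $C$ is set through the relationship between the Bayes error rate and the K-L divergence:

If $D[g_1 | g_2] \leq \delta$, then the error rate satisfies $\text{Er}[g_1 | g_2] \geq 1/2 - \sqrt{\delta/8}$
For an error-rate threshold of $1/2 - \alpha$, this gives approximately $C_\alpha = 8\alpha^2$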
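
The two relations above are simple enough to wrap in helper functions. The sketch below only restates $C_\alpha = 8\alpha^2$ and its inverse; the function names are hypothetical.

```python
import math


def threshold_from_alpha(alpha):
    """Threshold C_alpha = 8 * alpha^2 for an error-rate threshold 1/2 - alpha."""
    return 8.0 * alpha**2


def alpha_from_risk(risk):
    """Invert C = 8 * alpha^2: the alpha implied by an estimated risk."""
    return math.sqrt(risk / 8.0)


def bayes_error_lower_bound(delta):
    """Lower bound 1/2 - sqrt(delta / 8) on the error rate when D <= delta."""
    return 0.5 - math.sqrt(delta / 8.0)


if __name__ == "__main__":
    print("C for alpha = 0.05      :", threshold_from_alpha(0.05))
    print("alpha for risk = 0.0293 :", alpha_from_risk(0.0293))
    print("error bound at C = 0.02 :", bayes_error_lower_bound(0.02))
```

For example, $\alpha = 0.05$ corresponds to $C = 0.02$, and an estimated risk of 2.93e-02 corresponds to $\alpha \approx 0.06$.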

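Experimental setup

Datasets

Red wine quality dataset:
- Source: UCI Machine Learning Repository
- Sample size: 1599 (red wine data)
- Variables: 11 chemical measurements (continuous) + a quality score (integer, 3-8)
- Model: 47-dimensional exponential family model (after correlation-based screening)

Abalone dataset:
- Source: UCI Machine Learning Repository
- Sample size: 4177
- Variables: sex (3 categories) + number of rings (integer, 1-29)
- Model: 62-dimensional multinomial distribution (63 categories)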

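Experimental design

Red wine data: randomly split into two halves, one for model construction and one for parameter estimation
Abalone data: the $p-n$ criterion formula for the multinomial distribution is applied directly
MCMC is used to handle the normalizing constant of the complex exponential family model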
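
A rough first-order check for the abalone setting can be sketched as follows. It assumes, which is not stated in this summary, that the true distribution is supported on the same 63 categories, so that the multinomial model contains it and the first-order term of the risk reduces to $p/(2n)$. The second-order term of the multinomial formula is not reproduced, and the function name is hypothetical.

```python
def first_order_term_well_specified(p, n):
    """First-order risk term p / (2n), valid when the model contains the truth."""
    if p <= 0 or n <= 0:
        raise ValueError("p and n must be positive")
    return p / (2 * n)


# Abalone setting: 62 free parameters, 4177 observations.
p, n = 62, 4177
term = first_order_term_well_specified(p, n)
threshold = 8 * 0.05**2  # C_alpha = 8 * alpha^2 with alpha = 0.05

print(f"first-order term = {term:.5f}, threshold = {threshold:.3f}")
print("below threshold:", term <= threshold)
```

Under these assumptions the first-order term is about 0.0074, below the 0.02 threshold for $\alpha = 0.05$. This says nothing about the size of the second-order correction.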

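Experimental results

Results on the red wine dataset

47-dimensional model ($n = 799$):
- First-order term: 2.95e-02
- Second-order term: -1.30e-04
- Total estimated risk: 2.93e-02
- This corresponds to $\alpha \approx 0.06$, i.e. a Bayes error rate > 0.44
37-dimensional reduced model:
- Total estimated risk: 1.62e-02 < 0.02 (the threshold for $\alpha=0.05$)
- The $p-n$ criterion is satisfied
Classification performance: the generative classifier reaches 58% accuracy versus 63% for a decision tree, but the generative model overfits less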