
MLE convergence speed to information projection of exponential family: Criterion for model dimension and sample size -- complete proof version--

Sheena

For a parametric model of distributions, the closest distribution in the model to the true distribution located outside the model is considered. Measuring the closeness between two distributions with the Kullback-Leibler (K-L) divergence, the closest distribution is called the "information projection." The estimation risk of the maximum likelihood estimator (MLE) is defined as the expectation of K-L divergence between the information projection and the predictive distribution with plugged-in MLE. Here, the asymptotic expansion of the risk is derived up to $n^{-2}$-order, and the sufficient condition on the risk for the Bayes error rate between the true distribution and the information projection to be lower than a specified value is investigated. Combining these results, the "$p-n$ criterion" is proposed, which determines whether the MLE is sufficiently close to the information projection for the given model and sample. In particular, the criterion for an exponential family model is relatively simple and can be used for a complex model with no explicit form of normalizing constant. This criterion can constitute a solution to the sample size or model acceptance problem. Use of the $p-n$ criteria is demonstrated for two practical datasets. The relationship between the results and information criteria is also studied.


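MLE Convergence Speed to Information Projection of Exponential Family: Criterion for Model Dimension and Sample Size -- Complete Proof Version--

Basic Information

Paper ID: 2105.08947
Title: MLE convergence speed to information projection of exponential family: Criterion for model dimension and sample size -- complete proof version--
Author: Yo Sheena (Faculty of Data Science, Shiga University; Visiting Professor, The Institute of Statistical Mathematics)
Categories: math.ST, stat.TH
Published: May 2021 (arXiv preprint)
Paper link: https://arxiv.org/abs/2105.08947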

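Summary

This paper studies, for a parametric distribution model whose true distribution lies outside the model, the distribution within the model that is closest to the true distribution. Closeness between distributions is measured with the Kullback-Leibler (K-L) divergence, and the closest distribution is called the "information projection." The estimation risk of the maximum likelihood estimator (MLE) is defined as the expectation of the K-L divergence between the information projection and the predictive distribution with the MLE plugged in. The paper derives the asymptotic expansion of this risk up to order $n^{-2}$ and investigates a sufficient condition on the risk for the Bayes error rate between the true distribution and the information projection to fall below a specified value. Combining these results, it proposes the "$p-n$ criterion," which judges whether the MLE is sufficiently close to the information projection for a given model and sample. In particular, the criterion for exponential family models is relatively simple and applies to complex models whose normalizing constant has no explicit form. The criterion can serve as a solution to the sample size problem or the model acceptance problem.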

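Background and Motivation

Core Problem

For a given dataset, one has to posit an unknown probability distribution as the generator of the independent and identically distributed (i.i.d.) sample. When a parametric distribution model is used to "explain" the data, the first task is to find the "best" distribution in the model. Because the true distribution usually lies outside the model, "best" means the distribution "closest" to the true distribution.

Importance of the Problem

Successful distribution approximation has a wide range of applications:

Regression or discriminant analysis based on conditional distributions
Multiple imputation using conditional or unconditional distributions
Outlier detection based on probability contour regions
It embodies C. R. Rao's famous equation: "uncertain knowledge" + "knowledge about the extent of uncertainty" = "usable knowledge"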

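Limitations of Existing Approaches

The distribution approximation process involves three key questions:

How to construct a distribution model systematically
How to evaluate the closeness between the estimator and the best distribution
How to evaluate the closeness between the best distribution and the true distribution

Existing work focuses mainly on the closeness between the predictive distribution and the true distribution, not on the closeness to the best distribution.

Research Motivation

This paper focuses on the second question and establishes a criterion for judging whether the MLE is sufficiently close to the best distribution. By separating the second question from the third, the model is held fixed and the asymptotic expansion of the risk with respect to the sample size $n$ is derived.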

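Key Contributions

Theoretical contribution: derives the asymptotic expansion of the MLE estimation risk up to order $n^{-2}$ under a general distribution model, with a complete mathematical proof
Specialization to the exponential family: provides a simplified risk expression and a practical $p-n$ criterion for exponential family models
Practical criterion: proposes the $p-n$ criterion, which can be used to judge whether the sample size is sufficient or the model dimension is appropriate
Algorithmic framework: provides a computational algorithm for complex exponential family models that does not require an explicit form of the normalizing constant
Empirical validation: verifies the effectiveness of the $p-n$ criterion on two real datasets
Theoretical connection: establishes the relationship with information criteria (AIC/TIC)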

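Method Details

Task Definition

Consider a parametric distribution model $M = \{g(x; \theta) | \theta \in \Theta\}$, where $g(x; \theta)$ is a probability density function with respect to a reference measure $d\mu$. Let $g(x)$ denote the density function of the true distribution. The goals are to:

Find the information projection $g(x; \theta^*)$ within the model
Evaluate the distance between the information projection and the predictive distribution $g(x; \hat{\theta})$ corresponding to the MLE $\hat{\theta}$
Establish a criterion for judging whether the MLE is sufficiently close to the information projection

Core Framework

Definition of the Information Projection

The information projection $g(x; \theta^*)$ is defined by $\theta^* = \arg \min_{\theta \in \Theta} D[g(x) | g(x; \theta)]$, where $D[g_1 | g_2] = \int g_1(x) \log(g_1(x)/g_2(x))d\mu$ is the K-L divergence.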
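The definition above can be made concrete on a finite support. The following is a minimal sketch, not taken from the paper: it assumes a hypothetical one-parameter exponential family $g(x; \theta) \propto \exp(\theta x)$ on $\{0, \dots, 4\}$ and an arbitrary illustrative true distribution `g_true`. For an exponential family, minimizing $D[g(x) | g(x; \theta)]$ amounts to matching the model mean of the sufficient statistic to its mean under the true distribution, which the Newton iteration below solves.

```python
import numpy as np

def model_pmf(theta, xi):
    """Exponential-family pmf exp(theta * xi - Psi(theta)) on a finite support."""
    logits = theta * xi
    logits = logits - logits.max()           # stabilise the exponentials
    w = np.exp(logits)
    return w / w.sum()

def kl(g1, g2):
    """K-L divergence D[g1 | g2] for two pmfs on the same finite support."""
    mask = g1 > 0
    return float(np.sum(g1[mask] * np.log(g1[mask] / g2[mask])))

def information_projection(g_true, xi, tol=1e-12, max_iter=100):
    """Newton iteration for theta* = argmin_theta D[g_true | g(.; theta)]."""
    target = float(np.sum(g_true * xi))       # mean of xi under the true distribution
    theta = 0.0
    for _ in range(max_iter):
        p = model_pmf(theta, xi)
        mean = float(np.sum(p * xi))                  # first derivative of Psi
        var = float(np.sum(p * xi ** 2)) - mean ** 2  # second derivative of Psi
        step = (mean - target) / var
        theta -= step
        if abs(step) < tol:
            break
    return theta

# Illustrative (made-up) true distribution that is not of the form exp(theta * x).
xi = np.arange(5, dtype=float)
g_true = np.array([0.10, 0.35, 0.05, 0.30, 0.20])

theta_star = information_projection(g_true, xi)
g_star = model_pmf(theta_star, xi)
print(theta_star, kl(g_true, g_star))
```

The projection `g_star` reproduces the true mean of `xi` but generally not the true distribution itself; the remaining divergence `kl(g_true, g_star)` is the model's approximation error, which the paper treats separately from the estimation risk.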

Definition of the Estimation Risk

The estimation risk is defined as $R[g(x; \theta^*) | g(x; \hat{\theta})] = E[D[g(x; \theta^*) | g(x; \hat{\theta})]]$
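Because the risk is an expectation over samples, it can be approximated by simulation when the true distribution is known. The sketch below is an illustration under assumptions that are not in the paper: the model is the exponential distribution with unknown rate, and the true distribution is a Gamma distribution with shape 2 (hence outside the model). In this setup the information projection is the exponential distribution with the same mean as the true distribution, and the MLE of the rate is the reciprocal of the sample mean. For comparison, the leading term $\frac{1}{2n}\text{tr}(\tilde{G}^{-1}G)$ stated for exponential families in this summary works out to $1/(2n \cdot \text{shape})$ here.

```python
import numpy as np

def kl_exponential(rate_star, rate_hat):
    """D[g(.; rate_star) | g(.; rate_hat)] for two exponential densities."""
    return np.log(rate_star / rate_hat) + rate_hat / rate_star - 1.0

def monte_carlo_risk(n, shape=2.0, scale=1.5, reps=10_000, seed=0):
    """Monte Carlo estimate of E[D[g(.; theta*) | g(.; theta_hat)]]."""
    rng = np.random.default_rng(seed)
    samples = rng.gamma(shape, scale, size=(reps, n))  # draws from the true distribution
    rate_hat = 1.0 / samples.mean(axis=1)              # MLE of the exponential rate
    rate_star = 1.0 / (shape * scale)                  # information projection: same mean
    return float(np.mean(kl_exponential(rate_star, rate_hat)))

n = 200
shape = 2.0
risk_mc = monte_carlo_risk(n, shape=shape)
# Leading term: Var[x] / (mean^2 * 2n) = 1 / (2 * n * shape) for a Gamma truth.
leading_term = 1.0 / (2.0 * n * shape)
print(risk_mc, leading_term)
```

Note that the leading term here is half of $p/(2n) = 1/(2n)$, the value it would take if the true distribution belonged to the model. This illustrates why the risk depends on the true distribution and not only on the model dimension.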

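Theoretical Results

Asymptotic Expansion for General Models

Theorem 1: The estimation risk of the MLE with respect to the K-L divergence is $R[g(x; \theta^*) | g(x; \hat{\theta})] = (2n)^{-1}\text{tr}(\tilde{G}^{-1}G\tilde{G}^{-1}G^*) + n^{-2}[\text{complicated second-order term}] + O(n^{-3})$

where:

$G^*_{ij}(\theta^*)$: the Fisher information matrix
$\tilde{G}_{ij}(\theta^*)$: the negative expectation of the Hessian matrix
$G_{ij}(\theta^*)$: the variance-covariance matrix under the true distribution

Simplified Result for the Exponential Family

Corollary 1: For the exponential family model $g(x; \theta) = \exp(\sum_{i=1}^p \theta_i \xi_i(x) - \Psi(\theta))$: $R[g(x; \theta^*) | g(x; \hat{\theta})] = \frac{1}{2n}\text{tr}(\tilde{G}^{-1}G) + \frac{1}{24n^2}[\text{function of third- and fourth-order cumulants}] + O(n^{-3})$

Key property: $G^* = \tilde{G} = \ddot{\Psi}(\theta^*)$ (the matrix of second derivatives)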
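The step from the leading term of Theorem 1 to that of Corollary 1 is a direct substitution of the key property. The derivation below spells it out. The last line is an added remark rather than a statement from this summary: it assumes that $G$ is the covariance matrix of the statistics $\xi_i$ under the true distribution, so that $G = \ddot{\Psi}(\theta^*)$ whenever the true distribution belongs to the model.

$$
\begin{aligned}
\frac{1}{2n}\text{tr}(\tilde{G}^{-1}G\tilde{G}^{-1}G^*)
&= \frac{1}{2n}\text{tr}(\ddot{\Psi}(\theta^*)^{-1}\,G\,\ddot{\Psi}(\theta^*)^{-1}\,\ddot{\Psi}(\theta^*)) \\
&= \frac{1}{2n}\text{tr}(\ddot{\Psi}(\theta^*)^{-1}G)
 = \frac{1}{2n}\text{tr}(\tilde{G}^{-1}G), \\
G = \ddot{\Psi}(\theta^*) \;&\Longrightarrow\; \frac{1}{2n}\text{tr}(\tilde{G}^{-1}G) = \frac{1}{2n}\text{tr}(I_p) = \frac{p}{2n}.
\end{aligned}
$$

Under that assumption, the leading term for a correctly specified model depends only on the ratio of the model dimension $p$ to the sample size $n$, which is consistent with the name "$p-n$ criterion."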

$p-n$ Criterion

Criterion for General Models

$C \geq \frac{1}{2n}\text{tr}(\hat{\tilde{G}}^{-1}\hat{G}\hat{\tilde{G}}^{-1}\hat{G}^*)$

Criterion for the Exponential Family

$C \geq \frac{1}{2n}\text{tr}(\hat{\Sigma}(\ddot{\Psi}(\hat{\theta}))^{-1}) + \frac{1}{24n^2}[\text{estimated second-order term}]$

where $\hat{\Sigma}$ is the sample covariance matrix of the $\xi_i$ terms.
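The first-order part of the exponential family criterion above needs only two ingredients: the sample covariance of the sufficient statistics and the Hessian of $\Psi$ at the MLE. The sketch below computes that term alone; the second-order term is omitted because its explicit form is not given in this summary. The function name `first_order_term`, the $1/n$ normalization of the sample covariance, and the toy model (a Gaussian mean model with identity covariance, for which $\ddot{\Psi}$ is the identity matrix) are all assumptions made for illustration.

```python
import numpy as np

def first_order_term(xi_samples, psi_hessian):
    """(1 / 2n) * tr(Sigma_hat @ inv(psi_hessian)) for an (n, p) array of xi values."""
    xi_samples = np.asarray(xi_samples, dtype=float)
    n = xi_samples.shape[0]
    # Sample covariance of the sufficient statistics (1/n normalisation assumed).
    sigma_hat = np.atleast_2d(np.cov(xi_samples, rowvar=False, bias=True))
    psi_hessian = np.atleast_2d(np.asarray(psi_hessian, dtype=float))
    # tr(Sigma_hat inv(H)) equals tr(inv(H) Sigma_hat); solve avoids an explicit inverse.
    return float(np.trace(np.linalg.solve(psi_hessian, sigma_hat))) / (2.0 * n)

# Toy data: 500 observations of a 3-dimensional statistic with unequal variances.
rng = np.random.default_rng(0)
xi = rng.normal(size=(500, 3)) * np.array([1.0, 2.0, 0.5])

# Hypothetical model with Hessian of Psi equal to the identity matrix.
term = first_order_term(xi, np.eye(3))
print(term)
```

When `psi_hessian` coincides with the sample covariance, the function returns exactly $p/(2n)$, which is a convenient sanity check for any implementation.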

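Threshold Setting

The threshold $C$ is set through the relationship between the Bayes error rate and the K-L divergence:

If $D[g_1 | g_2] \leq \delta$, then the error rate satisfies $\text{Er}[g_1 | g_2] \geq 1/2 - \sqrt{\delta/8}$
For an error-rate threshold of $1/2 - \alpha$, approximately $C_\alpha = 8\alpha^2$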
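The two relations above are inverses of each other: setting $\delta = 8\alpha^2$ in the bound gives $\sqrt{\delta/8} = \alpha$, so the error rate is at least $1/2 - \alpha$. A minimal sketch of the conversion follows; the function names are illustrative and not from the paper.

```python
import math

def c_alpha(alpha):
    """Threshold C_alpha = 8 * alpha^2 for a target error rate of 1/2 - alpha."""
    return 8.0 * alpha ** 2

def error_rate_lower_bound(delta):
    """Lower bound 1/2 - sqrt(delta / 8) on the error rate when D[g1 | g2] <= delta."""
    return 0.5 - math.sqrt(delta / 8.0)

for alpha in (0.01, 0.05, 0.10):
    c = c_alpha(alpha)
    print(f"alpha={alpha:.2f}  C_alpha={c:.4f}  error rate >= {error_rate_lower_bound(c):.2f}")
```

For example, $\alpha = 0.05$ gives $C_\alpha = 0.02$: if the estimated risk is at most 0.02, the two distributions cannot be told apart with an error rate below roughly 0.45.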

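Experimental Setup

Datasets

Red wine quality dataset:
- Source: UCI Machine Learning Repository
- Sample size: 1599 (red wine data)
- Variables: 11 chemical substances (continuous variables) + a quality score (integer, 3-8)
- Model: 47-dimensional exponential family model (after correlation-based filtering)
Abalone dataset:
- Source: UCI Machine Learning Repository
- Sample size: 4177
- Variables: sex (3 categories) + number of rings (integer, 1-29)
- Model: 62-dimensional multinomial distribution (63 categories)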
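For a multinomial model that places no restriction on the cell probabilities, the first-order term of the exponential family criterion has a closed form. The sketch below rests on an assumption that is not spelled out in this summary: with the last cell as reference, both the sample covariance of the cell indicators (with $1/n$ normalization) and $\ddot{\Psi}(\hat{\theta})$ equal $\text{diag}(\hat{q}) - \hat{q}\hat{q}^\top$, so the trace equals $p$ and the term is $p/(2n)$. It requires every cell to be non-empty, and it does not include the second-order term of the paper's multinomial formula.

```python
import numpy as np

def multinomial_first_order_term(counts):
    """First-order term (1 / 2n) * tr(Sigma_hat inv(Hessian)) for a saturated multinomial."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    q = (counts / n)[:-1]                  # free cell probabilities; last cell is the reference
    cov = np.diag(q) - np.outer(q, q)      # covariance of the indicator statistics at the MLE
    # Sigma_hat and the Hessian of Psi coincide here, so the trace equals p = len(q).
    trace = np.trace(np.linalg.solve(cov, cov))
    return float(trace) / (2.0 * n)

# Small made-up table with 4 cells (p = 3) and n = 20 observations.
toy = multinomial_first_order_term([5, 3, 2, 10])

# Dimensions quoted for the abalone model: p = 62, n = 4177.
abalone_first_order = 62 / (2 * 4177)
print(toy, abalone_first_order)
```

With the quoted dimensions the first-order term is about 7.4e-03, below the $\alpha = 0.05$ threshold of 0.02. This is only the leading term computed from $p$ and $n$, not a figure reported in the source.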

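Experimental Design

Red wine data: randomly split into two halves, one for model building and the other for parameter estimation
Abalone data: the $p-n$ criterion formula for the multinomial distribution is applied directly
MCMC is used to handle the normalizing-constant problem of complex exponential family models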
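The reason MCMC helps is that the derivatives of $\Psi$ are moments of the sufficient statistics under the model, and those moments can be estimated from samples drawn with only the unnormalized density $\exp(\sum_i \theta_i \xi_i(x))$. The sketch below illustrates this idea. This summary does not say which sampler the paper uses, so a plain random-walk Metropolis sampler is swapped in here. The toy model with $\xi(x) = (x, x^2)$ is a Gaussian in disguise, chosen so that the answers are known in closed form.

```python
import numpy as np

def log_unnormalized(x, theta):
    """theta . xi(x) with xi(x) = (x, x^2); Psi(theta) is never evaluated."""
    return theta[0] * x + theta[1] * x * x

def metropolis_xi_samples(theta, n_samples=20_000, burn_in=2_000, step=1.0, seed=0):
    """Random-walk Metropolis draws of xi(x) from the density proportional to exp(theta . xi(x))."""
    rng = np.random.default_rng(seed)
    x = 0.0
    logp = log_unnormalized(x, theta)
    out = np.empty((n_samples, 2))
    for t in range(burn_in + n_samples):
        proposal = x + step * rng.normal()
        logp_new = log_unnormalized(proposal, theta)
        # Accept with probability min(1, exp(logp_new - logp)).
        if np.log(rng.uniform()) < logp_new - logp:
            x, logp = proposal, logp_new
        if t >= burn_in:
            out[t - burn_in] = (x, x * x)
    return out

theta = np.array([2.0, -1.0])   # corresponds to a normal distribution with mean 1, variance 0.5
xi = metropolis_xi_samples(theta)
psi_grad_mc = xi.mean(axis=0)                       # approximates the gradient of Psi, E[xi]
psi_hess_mc = np.cov(xi, rowvar=False, bias=True)   # approximates the Hessian of Psi, Cov[xi]
print(psi_grad_mc, psi_hess_mc)
```

For this toy model the exact values are $E[\xi] = (1, 1.5)$ and $\text{Var}[x] = 0.5$, so the Monte Carlo estimates can be checked directly. The estimated Hessian is the quantity that enters the first-order term of the criterion in place of $\ddot{\Psi}(\hat{\theta})$.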

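Experimental Results

Results for the Red Wine Dataset

47-dimensional model ($n = 799$):
- First-order term: 2.95e-02
- Second-order term: -1.30e-04
- Total estimated risk: 2.93e-02
- Corresponding $\alpha \approx 0.06$, Bayes error rate > 0.44
37-dimensional reduced model:
- Total estimated risk: 1.62e-02 < 0.02 (the threshold for $\alpha = 0.05$)
- Satisfies the $p-n$ criterion
Classification performance: the generative-model classifier reaches 58% accuracy and the decision tree 63%, but the generative model overfits less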
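The reported risks can be checked against the threshold rule $C_\alpha = 8\alpha^2$ quoted earlier in this summary by inverting it, $\alpha = \sqrt{\text{risk}/8}$. The short sketch below uses only the two total risks listed above; the dictionary keys and variable names are illustrative.

```python
import math

C_ALPHA_005 = 8 * 0.05 ** 2   # threshold for alpha = 0.05, i.e. 0.02

# Total estimated risks as reported for the red wine models.
reported_risk = {
    "47-dimensional model": 2.93e-02,
    "37-dimensional reduced model": 1.62e-02,
}

results = {}
for name, risk in reported_risk.items():
    alpha = math.sqrt(risk / 8.0)           # alpha implied by the risk
    results[name] = {
        "alpha": alpha,
        "error_rate_bound": 0.5 - alpha,    # lower bound on the Bayes error rate
        "passes": risk <= C_ALPHA_005,      # p-n criterion at alpha = 0.05
    }
    print(name, results[name])
```

The 47-dimensional model implies $\alpha \approx 0.06$ and fails the criterion at $\alpha = 0.05$, while the 37-dimensional model implies $\alpha = 0.045$ and passes, in line with the figures above.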