Criterion for the resemblance between the mother and the model distribution
Sheena
If the probability distribution model aims to approximate the hidden mother distribution, it is imperative to establish a useful criterion for the resemblance between the mother and the model distributions.
This study proposes a criterion that measures the Hellinger distance between discretized (quantized) samples from both distributions. The criterion has two advantages. First, unlike information criteria such as AIC, it does not require the probability density function of the model distribution, which cannot be explicitly obtained for a complicated model such as a deep learning machine. Second, it can draw a positive conclusion (i.e., both distributions are sufficiently close) under a given threshold, whereas a statistical hypothesis test, such as the Kolmogorov-Smirnov test, cannot genuinely lead to a positive conclusion when the hypothesis is accepted.
In this study, we establish a reasonable threshold for the criterion deduced from the Bayes error rate and also present the asymptotic bias of the estimator of the criterion. From these results, a reasonable and easy-to-use criterion is established that can be directly calculated from the two sets of samples from both distributions.
This paper investigates the measurement of similarity between probabilistic distribution models and true data distributions (mother distributions). It proposes a criterion based on discretized sample Hellinger distance that does not require explicit probability density functions of the model distribution, making it applicable to complex models such as deep learning. Unlike traditional hypothesis testing (e.g., Kolmogorov-Smirnov test), this criterion can yield positive conclusions that "two distributions are sufficiently close" under a given threshold. The research establishes reasonable thresholds derived from Bayes error rates and provides asymptotic bias analysis of the criterion estimators.
When a probabilistic distribution model aims to approximate an unknown true data distribution (mother distribution), establishing an effective similarity measurement criterion is a fundamental problem. This is particularly important in the evaluation of generative models (such as deep generative models and Bayesian models).
Model Evaluation Needs: In machine learning and statistical modeling, it is necessary to determine whether the generated model sufficiently approximates the true data distribution
Practical Significance: Addresses practical questions such as whether training is sufficient, whether parametric models are appropriate, and whether sample sizes are adequate
Theoretical Value: Provides interpretable quantitative standards for distribution similarity
Proposes a two-sample criterion based on discretized Hellinger distance: samples from the two distributions are discretized (quantized), and the Hellinger distance is computed between the resulting multinomial distributions
Establishes theoretical connection with Bayes error rate (Theorem 1): Proves the relationship between f-divergence and Bayes error rate, making divergence values practically interpretable
Derives reasonable threshold standards: Based on Bayes error rate, derives the Hellinger distance threshold δ* = 8ϵ², where ϵ corresponds to the degree of error rate deviation from random guessing
Proposes moving region discretization method: Compared to fixed region methods, achieves superior asymptotic efficiency at the n⁻² order (Theorems 2 and 3)
Provides asymptotic bias analysis of estimators (Theorem 4): Proves that the upper bound of the estimator E_D[m̂⁽¹⁾ : m̂⁽²⁾] is E_D[m⁽¹⁾ : m⁽²⁾] + √(8p'/n₂) + o(n₁⁻¹) + o(n₂^(-1/2))
Establishes general relationship between f-divergence and Bayes error rate (Theorem 1), providing intuitive interpretation of divergence values in terms of classification error
Proves asymptotic superiority of moving region method in single-sample problems (Theorems 2, 3)
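The link between a threshold of the form 8ϵ² and the Bayes error rate can be illustrated with a standard inequality. The derivation below is only a sketch and is not necessarily the route taken in Theorem 1. It assumes equal prior probabilities for the two distributions and the normalization D_H[p:q] = 2∫(√p − √q)² dx for the Hellinger distance; neither assumption is stated in this summary.

```latex
% Assumptions (not taken from the summary): equal prior probabilities, and the
% normalization D_H[p:q] = 2 \int (\sqrt{p} - \sqrt{q})^2 \, dx.
\begin{align*}
\mathrm{Er} &= \tfrac12\Bigl(1 - \tfrac12\int |p - q|\,dx\Bigr)
  && \text{(Bayes error rate, equal priors)} \\
\int |p - q|\,dx
  &= \int \bigl|\sqrt{p} - \sqrt{q}\bigr|\,\bigl(\sqrt{p} + \sqrt{q}\bigr)\,dx \\
  &\le \Bigl(\int (\sqrt{p} - \sqrt{q})^2\,dx\Bigr)^{1/2}
       \Bigl(\int (\sqrt{p} + \sqrt{q})^2\,dx\Bigr)^{1/2}
  && \text{(Cauchy--Schwarz)} \\
  &\le \sqrt{D_H/2}\,\cdot 2 = \sqrt{2 D_H}
  && \text{(since } \textstyle\int (\sqrt{p} + \sqrt{q})^2\,dx \le 4\text{)} \\
D_H \le 8\epsilon^2
  &\;\Longrightarrow\; \int |p - q|\,dx \le 4\epsilon
  \;\Longrightarrow\; \mathrm{Er} \ge \tfrac12 - \epsilon
\end{align*}
```

Under these assumptions, ϵ = 0.05 gives a threshold of 0.02, and a distance below it means that no classifier can tell the two distributions apart with an error rate below 0.45.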
Method Innovation:
Uses moving region method rather than fixed regions, improving estimation efficiency
Selects the Hellinger distance (the α = 0 member of the α-divergence family) to avoid problems when an estimated cell probability is zero: the divergence stays finite for -1 < α < 1
Uses model samples X⁽²⁾ to construct regions (since typically n₂ >> n₁)
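The moving region idea described above can be sketched for one-dimensional samples as follows. This is a minimal illustration under stated assumptions, not the paper's implementation. The function name `moving_region_hellinger`, the parameter `n_cells`, the use of empirical quantiles of the model sample as cell boundaries, and the normalization D = 2 Σ (√m⁽¹⁾ − √m⁽²⁾)² are all assumptions introduced here.

```python
import numpy as np


def moving_region_hellinger(x1, x2, n_cells=10):
    """Discretized Hellinger distance between two 1-D samples.

    Cell boundaries are empirical quantiles of x2 (the model sample), so the
    regions move with the data instead of being fixed in advance.
    Hypothetical sketch: names and normalization are illustrative assumptions.
    """
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)

    # Interior boundaries: the 1/k, 2/k, ..., (k-1)/k quantiles of x2.
    probs = np.arange(1, n_cells) / n_cells
    edges = np.quantile(x2, probs)

    # searchsorted maps every point to a cell index in 0..n_cells-1.
    counts1 = np.bincount(np.searchsorted(edges, x1), minlength=n_cells)
    counts2 = np.bincount(np.searchsorted(edges, x2), minlength=n_cells)

    # Empirical multinomial cell probabilities for each sample.
    m1 = counts1 / counts1.sum()
    m2 = counts2 / counts2.sum()

    # Assumed normalization: D = 2 * sum (sqrt(m1) - sqrt(m2))^2, in [0, 4].
    return 2.0 * np.sum((np.sqrt(m1) - np.sqrt(m2)) ** 2)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x2 = rng.normal(size=20000)                # large model sample (n2 >> n1)
    x1_same = rng.normal(size=2000)            # same distribution as the model
    x1_shift = rng.normal(loc=1.0, size=2000)  # mean shifted by one
    print("same distribution:   ", moving_region_hellinger(x1_same, x2))
    print("shifted distribution:", moving_region_hellinger(x1_shift, x2))
```

Because the square roots stay finite when a cell count is zero, the estimate remains well defined even for empty cells, which matches the motivation for choosing the Hellinger distance. Extending the sketch to multivariate data would require a different partitioning scheme that is not described in this summary.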
Bias Analysis:
Theorem 4 provides asymptotic bias upper bound for estimators
Effect of n₂ is O(n₂^(-1/2)), effect of n₁ is O(n₁⁻¹)
Explains why relatively large n₂ is needed
Practical Criterion:
Provides complete criterion with bias correction (Formula 40)
Threshold 8ϵ² has clear statistical meaning (corresponding to Bayes error rate)
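The decision step of such a criterion can be sketched as below. Formula 40 is not reproduced in this summary, so the sketch does not implement it: the `bias_correction` argument is a hypothetical placeholder for whatever correction term the paper prescribes, and subtracting it from the estimate is an assumption. The function names are also illustrative.

```python
def hellinger_threshold(eps):
    """Threshold delta* = 8 * eps**2, as quoted in the summary.

    eps is the allowed deviation of the Bayes error rate from random
    guessing (1/2), so it must lie in (0, 0.5].
    """
    if not 0.0 < eps <= 0.5:
        raise ValueError("eps must satisfy 0 < eps <= 0.5")
    return 8.0 * eps ** 2


def is_sufficiently_close(d_hat, eps, bias_correction=0.0):
    """Return True when the corrected estimate is within the threshold.

    d_hat:           estimated discretized Hellinger distance
    bias_correction: hypothetical placeholder for the paper's bias term
                     (Formula 40 is not given here); subtracting it is
                     an assumption of this sketch
    """
    return (d_hat - bias_correction) <= hellinger_threshold(eps)


if __name__ == "__main__":
    eps = 0.05
    print("threshold:", hellinger_threshold(eps))
    print("close enough:", is_sufficiently_close(0.01, eps))
```

Unlike a hypothesis test, the rule returns a positive statement ("sufficiently close") whenever the corrected estimate falls below the threshold, which is the behavior the criterion is designed to provide.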
Theorem 1, which relates f-divergence to the Bayes error rate, is general in scope and theoretically substantial
Asymptotic analysis (Theorems 2-4) with complete mathematical derivations and detailed proofs
Theoretical results provide solid foundation for practice
Method Innovation:
Core Innovation: Introduces the Bayes error rate into divergence threshold setting, giving abstract divergence values an intuitive interpretation in terms of classification accuracy
The superiority of the moving region method over fixed regions is supported theoretically
Amari (2016): Information Geometry and Its Applications - Information geometric foundations of f-divergence theory
Csiszár (1975): Foundational work on f-divergence
Gretton et al. (2007): Application of kernel methods in two-sample testing
Richardson and Weiss (2018): Closest method to this paper, employing fixed regions
Sheena (2018): Author's prior work proving superiority of moving region method in scalar case
Theis et al. (2015): Comparative study of generative model evaluation methods
Borji (2018): Comprehensive survey of GAN evaluation metrics
Overall Assessment: This is an excellent paper with rigorous theory and practical methods. The core innovation lies in introducing the Bayes error rate into divergence threshold setting, which gives abstract statistics an intuitive interpretation in terms of classification. The method is particularly suitable for evaluating complex models without explicit density functions, filling an important gap in the field. The main limitations are incomplete high-dimensional theory and limited experimental coverage, but these do not diminish its academic value and practicality. When applying the method, readers are advised to pay attention to sample size requirements (particularly n₂) and dimensionality limitations, and to employ dimensionality reduction strategies when necessary.