Criterion for the resemblance between the mother and the model distribution

Sheena

If the probability distribution model aims to approximate the hidden mother distribution, it is imperative to establish a useful criterion for the resemblance between the mother and the model distributions. This study proposes a criterion that measures the Hellinger distance between discretized (quantized) samples from both distributions. First, unlike information criteria such as AIC, this criterion does not require the probability density function of the model distribution, which cannot be explicitly obtained for a complicated model such as a deep learning machine. Second, it can draw a positive conclusion (i.e., both distributions are sufficiently close) under a given threshold, whereas a statistical hypothesis test, such as the Kolmogorov-Smirnov test, cannot genuinely lead to a positive conclusion when the hypothesis is accepted. In this study, we establish a reasonable threshold for the criterion deduced from the Bayes error rate and also present the asymptotic bias of the estimator of the criterion. From these results, a reasonable and easy-to-use criterion is established that can be directly calculated from the two sets of samples from both distributions.

Basic Information

Paper ID: 2212.03397
Title: Criterion for the resemblance between the mother and the model distribution
Author: Yo Sheena (Faculty of Data Science, Shiga University, Japan; Visiting Professor of the Institute of Statistical Mathematics, Japan)
Classification: math.ST stat.TH
Publication Date: November 13, 2025 (arXiv v3)
Paper Link: https://arxiv.org/abs/2212.03397

Abstract

This paper investigates the measurement of similarity between probabilistic distribution models and true data distributions (mother distributions). It proposes a criterion based on discretized sample Hellinger distance that does not require explicit probability density functions of the model distribution, making it applicable to complex models such as deep learning. Unlike traditional hypothesis testing (e.g., Kolmogorov-Smirnov test), this criterion can yield positive conclusions that "two distributions are sufficiently close" under a given threshold. The research establishes reasonable thresholds derived from Bayes error rates and provides asymptotic bias analysis of the criterion estimators.

Research Background and Motivation

1. Core Problem

When a probabilistic distribution model aims to approximate an unknown true data distribution (mother distribution), establishing an effective similarity measurement criterion is a fundamental problem. This is particularly important in the evaluation of generative models (such as deep generative models and Bayesian models).

2. Importance of the Problem

Model Evaluation Needs: In machine learning and statistical modeling, it is necessary to determine whether the generated model sufficiently approximates the true data distribution
Practical Significance: Addresses practical questions such as whether training is sufficient, whether parametric models are appropriate, and whether sample sizes are adequate
Theoretical Value: Provides interpretable quantitative standards for distribution similarity

3. Limitations of Existing Methods

Kullback-Leibler Divergence and Information Criteria (e.g., AIC):

Require explicit probability density functions g_m(x) of the model distribution
Difficult to obtain explicit forms for complex models (e.g., deep neural networks, Bayesian models)
While usable for model comparison, the numerical values themselves have no absolute statistical meaning and cannot be used to evaluate a single model on its own

Statistical Hypothesis Testing (e.g., K-S Test):

When rejecting the null hypothesis, only the conclusion "two distributions differ" can be drawn, though they may actually be quite similar
With large samples, easily rejects hypotheses due to detecting minor differences
When accepting the hypothesis, cannot draw positive conclusions that "two distributions are sufficiently close"
p-values do not directly reflect the degree of distribution proximity

4. Research Motivation

Propose a criterion that:

Can be computed directly from samples without requiring explicit density functions
Yields positive conclusions about "sufficient closeness"
Has interpretable thresholds

Core Contributions

Proposes a two-sample criterion based on discretized Hellinger distance: Samples from the two distributions are discretized (quantized), and the Hellinger distance is computed between the resulting multinomial distributions
Establishes theoretical connection with Bayes error rate (Theorem 1): Proves the relationship between f-divergence and Bayes error rate, making divergence values practically interpretable
Derives reasonable threshold standards: Based on Bayes error rate, derives the Hellinger distance threshold δ* = 8ϵ², where ϵ is the allowed shortfall of the Bayes error rate below the random-guessing rate of 1/2
Proposes moving region discretization method: Compared to fixed region methods, achieves superior asymptotic efficiency at the n⁻² order (Theorems 2 and 3)
Provides asymptotic bias analysis of estimators (Theorem 4): Proves an upper bound on the expected value of the estimator: E[D[m̂⁽¹⁾ : m̂⁽²⁾]] ≤ E[D[m⁽¹⁾ : m⁽²⁾]] + √(8p'/n₂) + o(n₁⁻¹) + o(n₂^(-1/2))

Establishes practical model fitting criterion:

D[m̂⁽¹⁾ : m̂⁽²⁾] + p'/(2n₁) + √(8p'/n₂) < 8ϵ²

Detailed Methodology

Task Definition

Given two sample sets:

Mother distribution observations: X⁽¹⁾ = {X₁⁽¹⁾, ..., Xₙ₁⁽¹⁾}
Model-generated samples: X⁽²⁾ = {X₁⁽²⁾, ..., Xₙ₂⁽²⁾}

Objective: Establish a criterion to determine whether the mother distribution and model distribution are sufficiently close.

Method Architecture

1. Relationship between f-divergence and Bayes Error Rate

For two probability density functions g₁(x) and g₂(x), f-divergence is defined as:

D_f[g₁(x) | g₂(x)] = ∫ g₁(x)f(g₂(x)/g₁(x))dµ(x)

Bayes error rate is:

Er[g₁(x)|g₂(x)] = (1/2)∫ min(g₁(x), g₂(x))dµ

Theorem 1 establishes the key connection: If D_f[g₁(x) | g₂(x)] < δ, then Er[g₁(x) | g₂(x)] ≥ α(δ), where α(δ) is a function of δ.

For Hellinger distance (f(x) = 2(1-√x)²), approximately:

α(δ) ≈ (1 - √(δ/2))/2
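
The bound can be checked numerically on a small discrete example. The sketch below is illustrative only: the two probability vectors are made up, the integral is replaced by a sum over three cells, and the function names are hypothetical rather than taken from the paper.

```python
import math

def f_hellinger(x):
    # Generator of the Hellinger distance used in this summary: f(x) = 2(1 - sqrt(x))^2
    return 2.0 * (1.0 - math.sqrt(x)) ** 2

def f_divergence(g1, g2, f):
    # Discrete analogue of the integral of g1 * f(g2 / g1); assumes every g1[i] > 0
    return sum(a * f(b / a) for a, b in zip(g1, g2))

def bayes_error(g1, g2):
    # Discrete analogue of (1/2) * integral of min(g1, g2)
    return 0.5 * sum(min(a, b) for a, b in zip(g1, g2))

# Made-up distributions on three cells (an assumption for illustration)
g1 = [0.2, 0.3, 0.5]
g2 = [0.25, 0.25, 0.5]

d = f_divergence(g1, g2, f_hellinger)
er = bayes_error(g1, g2)
lower = (1.0 - math.sqrt(d / 2.0)) / 2.0  # approximate alpha evaluated at the divergence

print(f"divergence = {d:.6f}, Bayes error = {er:.4f}, lower bound = {lower:.4f}")
```

For this pair the Bayes error rate (0.475) sits above the approximate lower bound, as the theorem requires.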

Requiring the Bayes error rate to be at least 1/2 - ϵ (close to random guessing), i.e., setting α(δ) = 1/2 - ϵ so that √(δ/2) = 2ϵ, we obtain:

δ* = 8ϵ²
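
As a small sketch of how ϵ maps to the threshold, the following helper functions (hypothetical names, not from the paper) evaluate δ* = 8ϵ² and confirm that the approximate α(δ) at that threshold returns 1/2 - ϵ.

```python
import math

def alpha_approx(delta: float) -> float:
    """Approximate lower bound on the Bayes error rate for Hellinger distance below delta."""
    return (1.0 - math.sqrt(delta / 2.0)) / 2.0

def hellinger_threshold(eps: float) -> float:
    """Threshold delta* = 8 * eps^2 for a Bayes error rate of at least 1/2 - eps."""
    return 8.0 * eps ** 2

for eps in (0.01, 0.05, 0.1):
    delta_star = hellinger_threshold(eps)
    print(f"eps = {eps}: delta* = {delta_star:.4f}, alpha(delta*) = {alpha_approx(delta_star):.2f}")
```

The three values of ϵ used here are the ones that appear in the experiments (thresholds 0.0008, 0.02 and 0.08).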

2. Discretization Methods

Fixed Region Method: Pre-specify region partitions I_i independently of samples.

Moving Region Method (recommended in this paper): Dynamically determine regions based on quantiles of sample X⁽²⁾.

For scalar case (k=1):

Select quantile points λᵢ = i/(p+1), i = 1,...,p
Use order statistics of X⁽²⁾ to determine interval endpoints: ξ̂ᵢ = X₍ñᵢ₎⁽²⁾, where ñᵢ = ⌊n₂λᵢ⌋
Define moving intervals I_i = (ξ̂ᵢ, ξ̂ᵢ₊₁)
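
The scalar steps above can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name is hypothetical, the outermost intervals are taken to extend to -∞ and +∞, and a point equal to an endpoint is assigned to the interval on its left (ties have probability zero for continuous distributions).

```python
import numpy as np

def moving_interval_counts(x1, x2, p):
    """Count how many x1 values fall in each of the p + 1 intervals built from x2."""
    x2_sorted = np.sort(np.asarray(x2, dtype=float))
    n2 = x2_sorted.size
    if n2 < p + 1:
        raise ValueError("need at least p + 1 model samples")
    # Ranks floor(n2 * i / (p + 1)) for i = 1..p, computed in integer arithmetic
    ranks = (n2 * np.arange(1, p + 1)) // (p + 1)
    # Order statistics are 1-based, array indices are 0-based
    xi = x2_sorted[ranks - 1]
    # Interval index = number of endpoints strictly below the value (0..p)
    cells = np.searchsorted(xi, np.asarray(x1, dtype=float), side="left")
    counts = np.bincount(cells, minlength=p + 1)
    return xi, counts

# Toy data (made up): model samples 1..100, five mother observations
xi, counts = moving_interval_counts([10, 30, 60, 90, 90], np.arange(1, 101), p=3)
print(xi)      # endpoints taken from the order statistics of x2
print(counts)  # mother-sample counts per interval
```

Dividing `counts` by n₁ gives the cell frequencies of the mother sample on the moving intervals.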

For vector case (k≥2):

Employ recursive partitioning method
At step i, partition along the i-th coordinate using order statistics
Partition depth is l (≤ k)
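
One possible reading of the recursive partitioning is sketched below. It is an assumption-laden illustration rather than the paper's algorithm: the function name is hypothetical, the same p is used at every step, cells are built from the model sample only, and every cell is split at the order statistics of the model points it contains.

```python
import numpy as np

def recursive_cells(x1, x2, p, depth):
    """Label rows of x1 and x2 with cell indices in range((p + 1) ** depth)."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    lab1 = np.zeros(len(x1), dtype=int)
    lab2 = np.zeros(len(x2), dtype=int)
    for coord in range(depth):
        new1, new2 = lab1.copy(), lab2.copy()
        for cell in np.unique(lab2):
            in1, in2 = lab1 == cell, lab2 == cell
            # Order statistics of the model points inside the current cell
            v2 = np.sort(x2[in2, coord])
            if v2.size < p + 1:
                raise ValueError("too few model points in a cell for this p")
            ranks = (v2.size * np.arange(1, p + 1)) // (p + 1)
            cuts = v2[ranks - 1]
            # Child index = parent index * (p + 1) + position along this coordinate
            new1[in1] = cell * (p + 1) + np.searchsorted(cuts, x1[in1, coord], side="left")
            new2[in2] = cell * (p + 1) + np.searchsorted(cuts, x2[in2, coord], side="left")
        lab1, lab2 = new1, new2
    return lab1, lab2

# Synthetic data (an assumption): both samples standard normal in 3 dimensions
rng = np.random.default_rng(0)
x2 = rng.normal(size=(6400, 3))
x1 = rng.normal(size=(500, 3))
lab1, lab2 = recursive_cells(x1, x2, p=3, depth=3)
m1_hat = np.bincount(lab1, minlength=64) / len(x1)
m2_hat = np.bincount(lab2, minlength=64) / len(x2)
```

Because the cells are cut at equal-mass order statistics of the model sample, every cell holds the same share of model points in this example, so only the mother-sample frequencies vary.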

3. Multinomial Distribution Construction

Based on moving regions A_j(l), construct two multinomial distributions:

m⁽¹⁾ = {m_j(l)⁽¹⁾}, m_j(l)⁽¹⁾ = P(X ∈ A_j(l)|mother distribution)
m⁽²⁾ = {m_j(l)⁽²⁾}, m_j(l)⁽²⁾ = P(X ∈ A_j(l)|model distribution)

Estimators are:

m̂⁽¹⁾ = {m̂_j(l)⁽¹⁾}, m̂_j(l)⁽¹⁾ = #{X⁽¹⁾ | X⁽¹⁾ ∈ A_j(l)}/n₁
m̂⁽²⁾ = {m̂_j(l)⁽²⁾}, m̂_j(l)⁽²⁾ = 1/(p'_j(l-1) + 1)

4. Hellinger Distance Calculation

Hellinger distance is defined as:

D[m⁽¹⁾ : m⁽²⁾] = 2∑_j(l) (√m_j(l)⁽¹⁾ - √m_j(l)⁽²⁾)²

Estimator is:

D[m̂⁽¹⁾ : m̂⁽²⁾] = 2∑_j(l) (√m̂_j(l)⁽¹⁾ - √m̂_j(l)⁽²⁾)²
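
The estimator is a direct plug-in of cell frequencies into the definition. A minimal NumPy sketch, with a hypothetical function name and made-up counts:

```python
import numpy as np

def hellinger_distance(m1, m2):
    """D[m1 : m2] = 2 * sum((sqrt(m1_j) - sqrt(m2_j))^2) over all cells j."""
    m1 = np.asarray(m1, dtype=float)
    m2 = np.asarray(m2, dtype=float)
    return 2.0 * float(np.sum((np.sqrt(m1) - np.sqrt(m2)) ** 2))

# Made-up example: 1000 mother observations over 4 equal-mass cells
m1_hat = np.array([260, 240, 255, 245]) / 1000
m2_hat = np.full(4, 0.25)
print(hellinger_distance(m1_hat, m2_hat))
```

With this scaling the distance ranges from 0 (identical cell probabilities) to 4 (disjoint supports), and a cell with zero observed frequency causes no numerical problem because only square roots are taken.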

Technical Innovations

Theoretical Innovation:
- Establishes general relationship between f-divergence and Bayes error rate (Theorem 1), providing intuitive interpretation of divergence values in terms of classification error
- Proves asymptotic superiority of moving region method in single-sample problems (Theorems 2, 3)
Method Innovation:
- Uses moving region method rather than fixed regions, improving estimation efficiency
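- Selects Hellinger distance to avoid zero estimation problems (within the α-divergence family, the divergence stays finite for -1 < α < 1 when an estimated cell probability is zero; Hellinger distance is the α = 0 member under Amari's convention, and this α is the divergence index, not the mean shift used in the experiments)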
- Uses model samples X⁽²⁾ to construct regions (since typically n₂ >> n₁)
Bias Analysis:
- Theorem 4 provides asymptotic bias upper bound for estimators
- Effect of n₂ is O(n₂^(-1/2)), effect of n₁ is O(n₁⁻¹)
- Explains why relatively large n₂ is needed
Practical Criterion:
- Provides complete criterion with bias correction (Formula 40)
- Threshold 8ϵ² has clear statistical meaning (corresponding to Bayes error rate)

Experimental Setup

Datasets

Case 1: Multivariate Normal Distribution

Mother Distribution: X⁽¹⁾ᵢ ~ N(α, I_k + βV), where the (i, j) entry of V is 0.95^{|i-j|}
Model Distribution: X⁽²⁾ᵢ ~ N(0, I_k) (standard normal)
Parameter Settings:
- Dimension k = 3, partition depth l = 3
- Number of partitions per variable p = p_{j(1)} = p_{j(2)} = 3
- Total number of regions p' = (3+1)³ - 1 = 63
- Similarity parameters (α, β) = (0,0), (0.01,0.01), (0.1,0.1), (1,1)
- Sample sizes n₁ ∈ {10³, 10⁴, 10⁵, 10⁶, 10⁷}, n₂ = 10⁷

High-Dimensional Case:

k = 10, p = p_{j(1)} = ... = p_{j(9)} = 3
Since full-depth partitioning requires p' = (3+1)¹⁰ - 1 > 10⁶, use l = 2
Examine all pairwise two-dimensional marginal distributions
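
The cell counts quoted above follow from simple arithmetic. The snippet below is a sketch with a hypothetical helper name; it reproduces them and applies the same formula to the l = 2 setting.

```python
from math import comb

def p_prime(p: int, depth: int) -> int:
    """p' = (p + 1)^depth - 1, as in the parameter settings above."""
    return (p + 1) ** depth - 1

print(p_prime(3, 3))    # k = 3, l = 3  -> 63
print(p_prime(3, 10))   # k = 10, l = 10 -> 1048575, which exceeds 10^6
print(p_prime(3, 2))    # l = 2 -> 15 for each two-dimensional marginal
print(comb(10, 2))      # number of coordinate pairs when k = 10 -> 45
```

The exponential growth in l is what makes full-depth partitioning impractical at k = 10 and motivates examining the 45 two-dimensional marginals instead.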

Case 2: Bayesian Model

Dataset: UCI Power Plant dataset (9568 samples)
Model: Normal regression model y = β₁ + ∑ᵢ₌₂⁵ βᵢxᵢ + ϵ
Prior Distribution:
- β₁ ~ Cauchy(0, 10)
- βᵢ ~ Cauchy(0, 2.5), i = 2,...,5
- σ ~ t(5, 5, 1)
MCMC Samples: 4000 posterior samples of β
Predicted Value Samples: n₂ = 4000 × 9568 ≈ 3.827×10⁷
True Value Samples: n₁ = 9568
Number of Regions: p' = 10

Evaluation Metrics

Hellinger Distance: D[m̂⁽¹⁾ : m̂⁽²⁾]
Complete Criterion Value (left side of Formula 40): D[m̂⁽¹⁾ : m̂⁽²⁾] + p'/(2n₁) + √(8p'/n₂)
Threshold: 8ϵ² (0.02 when ϵ = 0.05, 0.0008 when ϵ = 0.01)
Comparison Method: p-value from Kolmogorov-Smirnov test

Implementation Details

Bias correction terms: p'/(2n₁) + √(8p'/n₂)
Moving region method uses equal-mass partitioning (λᵢ = i/(p+1))
For high-dimensional cases, employ dimensionality reduction strategy (two-dimensional marginal distributions)
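
Putting the bias correction and the threshold together, the decision rule can be sketched as below. The function names are hypothetical and the input distance in the usage line is a made-up value, not a result from the paper.

```python
import math

def bias_correction(p_prime: int, n1: int, n2: int) -> float:
    """Correction terms p'/(2 n1) + sqrt(8 p'/n2) added to the estimated distance."""
    return p_prime / (2.0 * n1) + math.sqrt(8.0 * p_prime / n2)

def is_sufficiently_close(d_hat: float, p_prime: int, n1: int, n2: int, eps: float):
    """Return the criterion value and whether it falls below the threshold 8 * eps^2."""
    value = d_hat + bias_correction(p_prime, n1, n2)
    return value, value < 8.0 * eps ** 2

# Hypothetical estimated distance with the Case 1 settings p' = 63, n2 = 10^7
value, close = is_sufficiently_close(d_hat=0.003, p_prime=63, n1=10**4, n2=10**7, eps=0.05)
print(f"criterion value = {value:.4f}, sufficiently close: {close}")
```

Returning the criterion value alongside the decision lets a user compare it against several choices of ϵ without recomputing the distance.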

Experimental Results

Main Results

Case 1: Three-Dimensional Normal Distribution (k=3, l=3, p'=63, n₂=10⁷)

(α, β)	n₁=10⁷	n₁=10⁶	n₁=10⁵	n₁=10⁴
(0, 0)	0.00711	0.00717	0.00773	0.0136
(0.01, 0.01)	0.00735	0.00741	0.00797	0.0137
(0.1, 0.1)	0.0277	0.0277	0.0290	0.0349
(1, 1)	0.699	0.698	0.707	0.707

Key Findings:

(α, β) = (0, 0) and (0.01, 0.01): Criterion value < 0.02 (threshold for ϵ=0.05), conclusion is sufficiently close
(α, β) = (0.1, 0.1): Criterion value approximately 0.028-0.035 > 0.02, but < 0.08 (threshold for ϵ=0.1), close under relaxed standard
(α, β) = (1, 1): Criterion value approximately 0.7 >> 0.02, clearly not close
Sample Size Effect: As n₁ increases from 10⁴ to 10⁷, criterion value decreases from 0.0136 to 0.00711 (α=β=0 case)
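
A quick arithmetic check helps read the (0, 0) row. The sketch below (hypothetical function name) evaluates only the bias correction terms for the Case 1 settings, without any estimated distance.

```python
import math

P_PRIME, N2 = 63, 10**7  # Case 1 settings from the experimental setup

def bias_terms(p_prime: int, n1: int, n2: int) -> float:
    # p'/(2 n1) + sqrt(8 p'/n2), the correction added to the estimated distance
    return p_prime / (2 * n1) + math.sqrt(8 * p_prime / n2)

for n1 in (10**7, 10**6, 10**5, 10**4):
    print(f"n1 = {n1:>8}: bias terms = {bias_terms(P_PRIME, n1, N2):.5f}")
```

The correction terms alone range from roughly 0.0071 at n₁ = 10⁷ to roughly 0.0102 at n₁ = 10⁴, so when the two distributions coincide the reported criterion values consist almost entirely of the correction, and the √(8p'/n₂) term sets a floor that only a larger n₂ can lower.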

High-Dimensional Case (k=10, l=2, Two-Dimensional Marginal Distributions)

For (α, β) = (0.1, 0.1):

n₁=10³, n₂=10⁷: All 45 variable pairs have criterion values between 0.023-0.038, all > 0.02, cannot conclude closeness
n₁=10⁴, n₂=10⁷: All pairs have criterion values between 0.015-0.019, all < 0.02, conclusion is sufficiently close

These results illustrate the sample size requirement: in this setting n₁ = 10³ is too small to conclude closeness, whereas n₁ = 10⁴ suffices.

Case Analysis

Bayesian Regression Model

Experimental Results:

Hellinger Distance: D[m̂⁽¹⁾ : m̂⁽²⁾] ≈ 0.0113
Bias Correction Term: p'/(2n₁) + √(8p'/n₂) ≈ 0.0020
Complete Criterion Value: ≈ 0.0133
Corresponding ϵ: Solving 8ϵ² = 0.0133 yields ϵ ≈ 0.04
Corresponding Bayes Error Rate: 0.5 - 0.04 = 0.46
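
The figures above can be reproduced from the reported inputs. This is only an arithmetic sketch: the Hellinger distance 0.0113 is taken as given from the results, and the variable names are arbitrary.

```python
import math

p_prime = 10
n1 = 9568
n2 = 4000 * 9568          # posterior draws times observations
d_hat = 0.0113            # reported Hellinger distance estimate

bias = p_prime / (2 * n1) + math.sqrt(8 * p_prime / n2)
value = d_hat + bias
eps = math.sqrt(value / 8)  # solve 8 * eps^2 = value

print(f"bias = {bias:.4f}, criterion = {value:.4f}, eps = {eps:.2f}, Bayes error = {0.5 - eps:.2f}")
```

Solving for ϵ in this way reports the tightest threshold the data would pass, which is often more informative than a yes or no answer at a preset ϵ.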

K-S Test Comparison:

p-value = 7.587×10⁻⁸, so the null hypothesis is rejected even at an extremely small significance level
However, this paper's criterion indicates that under Bayes error rate standard of 0.46, distributions are sufficiently close

Histogram Analysis (Figure 2):

Distributions of ŷ and y have similar shapes
Supports the "sufficiently close" conclusion

This case demonstrates:

K-S test yields "rejection" conclusion, but actual distributions are already quite close
This paper's criterion provides positive "sufficiently close" conclusion, better matching practical needs
The threshold is interpretable: a Bayes error rate of 0.46 is close to the 0.5 of random guessing

Experimental Findings

Method Effectiveness: The criterion correctly distinguishes distribution pairs with different similarities
Sample Size Requirements:
- Effect of n₂ is O(n₂^(-1/2)), requiring relatively large values (10⁷ in experiments)
- Effect of n₁ is O(n₁⁻¹), typically 10⁴ is sufficient
- Consistent with theoretical analysis (Theorem 4)
Dimensionality Effect:
- High-dimensional full-depth partitioning requires exponential sample sizes
- Two-dimensional marginal distribution strategy is practical compromise
Comparison with Hypothesis Testing:
- K-S test is overly sensitive with large samples
- This paper's criterion provides interpretable "sufficiently close" judgment
Threshold Reasonableness:
- ϵ = 0.05 (corresponding threshold 0.02) is reasonable standard choice
- Can be adjusted based on application needs (e.g., ϵ = 0.1 corresponds to 0.08)

Related Work

1. Two-Sample Comparison Methods

Richardson and Weiss (2018):

Closest method to this paper
Employs fixed region method
Uses binomial distribution collection rather than multinomial distribution
Finally uses z-test for evaluation

Johnson and Dasu (1998):

Divides high-dimensional data into categorical and continuous variables
Uses multiple testing to judge similarity

2. K-S Test Extensions

Press and Teukolsky (1988): Two-dimensional K-S test

Hagen et al. (2020): High-dimensional K-S distance

Loudin and Miettinen (2003):

Compresses high-dimensional distributions to one dimension
Uses one-dimensional K-S test

3. Kernel Methods

Gretton et al. (2007):

Applies reproducing kernel Hilbert space theory
Measures distribution similarity through function similarity
Ultimately employs traditional hypothesis testing

4. Generative Model Evaluation

Theis et al. (2015):

Evaluates probabilistic image generation models
Points out that different evaluation methods may lead to completely different conclusions

Borji (2018):

Comprehensive survey of evaluation metrics for generative adversarial networks
Some methods applicable to two-sample problems

Advantages of This Paper

No Explicit Density Required: Applicable to complex models (deep learning, Bayesian models)
Positive Conclusions: Can judge "sufficiently close" rather than only "different"
Interpretable Threshold: Based on Bayes error rate with statistical meaning
Theoretical Guarantees: Provides asymptotic bias analysis and efficiency comparison
Practicality: Computed directly from samples, easy to implement

Conclusions and Discussion

Main Conclusions

Theoretical Contributions:
- Establishes general relationship between f-divergence and Bayes error rate (Theorem 1)
- Proves asymptotic superiority of moving region method (Theorems 2, 3)
- Provides bias upper bound for two-sample problem estimators (Theorem 4)
Method Contributions:
- Proposes practical criterion based on discretized Hellinger distance
- Threshold δ* = 8ϵ² has clear statistical interpretation
- Complete criterion includes bias correction, directly applicable
Experimental Validation:
- Multivariate normal distribution experiments validate method effectiveness and sample size requirements
- Bayesian model case demonstrates practical application value
- Comparison with K-S test shows advantages of "positive conclusions"

Limitations

Sample Size Requirements:
- n₂ needs to be relatively large (O(n₂^(-1/2)) effect)
- While model samples are typically easy to obtain, computational cost remains
Curse of Dimensionality:
- Full-depth partitioning infeasible in high dimensions
- Requires dimensionality reduction strategies (e.g., two-dimensional marginal distributions)
- May lose high-dimensional dependency structure information
Incomplete High-Dimensional Theory:
- Theorem 3's O(n⁻²) superiority only fully proven for scalar case (k=1)
- High-dimensional case (k≥2) O(n⁻²) superiority not rigorously proven
Threshold Selection:
- Choice of ϵ (0.05 or 0.01) remains somewhat subjective
- While based on Bayes error rate, different applications may require different standards
Distribution Assumptions:
- Method designed for continuous distributions
- Requires adjustment for mixed (discrete + continuous) distributions

Future Directions

High-Dimensional Theory: Complete asymptotic theory for moving region method when k≥2
Adaptive Region Partitioning:
- Adaptively select partition number p and depth l based on data characteristics
- Non-uniform partitioning strategies
Multi-Sample Extension: Generalize to simultaneous comparison of multiple distributions
Computational Optimization:
- Efficient implementation for large-scale data
- Parallel computing strategies
Other Divergences:
- Study properties of other f-divergences (e.g., χ² divergence)
- Compare applicable scenarios for different divergences

In-Depth Evaluation

Strengths

Theoretical Rigor:
- Theorem 1 establishing f-divergence and Bayes error rate relationship has universality and depth
- Asymptotic analysis (Theorems 2-4) with complete mathematical derivations and detailed proofs
- Theoretical results provide solid foundation for practice
Method Innovation:
- Core Innovation: Introduces Bayes error rate into divergence threshold setting, giving abstract divergence values an intuitive interpretation in terms of classification performance
- Moving region method superiority over fixed regions has theoretical support
- Hellinger distance choice avoiding technical issues (zero estimates) reflects practical consideration
Practical Value:
- Criterion (40) has simple form, easy to compute and apply
- No explicit density function needed, applicable to black-box models (deep learning)
- Provides "positive conclusions" satisfying practical needs
Experimental Sufficiency:
- Multivariate normal experiments systematically examine different similarities and sample sizes
- Bayesian model case demonstrates practical application scenarios
- K-S test comparison is convincing
Writing Clarity:
- Clear structure, coherent logic
- Mathematical notation clearly defined
- Figures and tables (e.g., Figure 1, Tables 1-6) effectively support arguments

Weaknesses

Incomplete High-Dimensional Theory:
- Theorem 3 only gives O(n⁻¹) results, O(n⁻²) terms not explicit
- Moving region method superiority for k≥2 not rigorously proven
- Limits theoretical completeness
Limited Experimental Design:
- Case 1 only considers normal distributions, limited distribution types
- Lacks systematic comparison with other two-sample methods (e.g., MMD)
- High-dimensional experiments only to k=10, higher dimensions unexplored
Method Applicability Limitations:
- Handling of discrete or mixed distributions not discussed
- Choice of region number p' and depth l lacks systematic guidance
- Sample size requirements (particularly n₂) may still be high in some scenarios
Threshold Subjectivity:
- Choice of ϵ (0.05, 0.01) has a Bayes error rate interpretation but still requires a user decision
- Reasonable thresholds may vary greatly across application domains
- Lacks guidance for threshold selection in specific applications
Missing Computational Complexity Analysis:
- Algorithm time and space complexity not discussed
- Scalability to large-scale data not clarified
Theorem 1 Approximation:
- Computing α(δ) involves complex optimization (Equations 9-10)
- Actual use involves Taylor expansion approximation (around Figure 1)
- Approximation error quantification insufficient

Impact

Contribution to Field:
- Provides new theoretical perspective for distribution similarity evaluation (Bayes error rate connection)
- Advances application of discretization methods in statistical inference
- Provides practical tool for generative model evaluation
Practical Value:
- High Practicality: Applicable to deep generative models (GANs, VAEs), Bayesian models and other scenarios without explicit densities
- Can be used for model selection, training monitoring, data quality assessment
- Relatively simple code implementation
Reproducibility:
- Detailed method description, clear algorithm steps
- Explicit experimental settings (sample sizes, parameters, etc.)
- Complete theoretical derivations (proofs in appendix)
- Recommendation: Providing open-source code would greatly enhance reproducibility
Potential Application Domains:
- Machine Learning: Generative model evaluation, domain adaptation
- Statistics: Goodness-of-fit testing, model diagnostics
- Data Science: Data quality monitoring, A/B testing
- Scientific Computing: Simulation validation, uncertainty quantification

Applicable Scenarios

Most Suitable Scenarios:

Complex Generative Model Evaluation: Deep neural network generative models (GANs, VAEs, diffusion models)
Bayesian Posterior Evaluation: Comparing MCMC samples with true distributions
Large Samples Available: Model can generate large sample quantities (n₂ >> n₁)
Positive Conclusions Needed: Determining "whether sufficiently good" rather than "whether different"
Continuous Distributions: Method designed for continuous random vectors

Less Suitable Scenarios:

Small Samples: When both n₁ and n₂ are small, bias correction terms may be substantial
Very High Dimensions: Dimension k >> 10 requires special handling (dimensionality reduction)
Discrete Distributions: Requires method adjustment
Exact p-values Needed: This method provides threshold judgment rather than p-values
Real-Time Online Evaluation: Computational cost may be high

Comparison with Other Methods:

vs. K-S Test: This method provides positive conclusions and interpretable thresholds
vs. AIC/BIC: This method requires no explicit density functions
vs. MMD (Maximum Mean Discrepancy): This method has clear statistical interpretation (Bayes error rate)
vs. FID (Fréchet Inception Distance): This method does not depend on specific feature extractors

References

Key references cited in this paper include:

Amari (2016): Information Geometry and Its Applications - Information geometric foundations of f-divergence theory
Csiszár (1975): Foundational work on f-divergence
Gretton et al. (2007): Application of kernel methods in two-sample testing
Richardson and Weiss (2018): Closest method to this paper, employing fixed regions
Sheena (2018): Author's prior work proving superiority of moving region method in scalar case
Theis et al. (2015): Comparative study of generative model evaluation methods
Borji (2018): Comprehensive survey of GAN evaluation metrics

Overall Assessment: This is an excellent paper with rigorous theory and practical methods. The core innovation lies in introducing the Bayes error rate into divergence threshold setting, which gives an abstract statistic an intuitive interpretation in terms of classification. The method is particularly suitable for evaluating complex models without explicit density functions, filling an important gap in the field. The main limitations are the incomplete high-dimensional theory and limited experimental coverage, but these do not diminish its academic value and practicality. Readers are advised to pay attention to sample size requirements (particularly n₂) and dimensionality limitations when applying the method, employing dimensionality reduction strategies when necessary.