2025-11-28T22:22:19.391257

Criterion for the resemblance between the mother and the model distribution

Sheena
If a probability distribution model aims to approximate the hidden mother distribution, it is imperative to establish a useful criterion for the resemblance between the mother and the model distributions. This study proposes a criterion that measures the Hellinger distance between discretized (quantized) samples from both distributions. The criterion has two advantages. First, unlike information criteria such as AIC, it does not require the probability density function of the model distribution, which cannot be explicitly obtained for a complicated model such as a deep learning machine. Second, it can draw a positive conclusion (i.e., both distributions are sufficiently close) under a given threshold, whereas a statistical hypothesis test, such as the Kolmogorov-Smirnov test, cannot genuinely lead to a positive conclusion when the hypothesis is accepted. In this study, we establish a reasonable threshold for the criterion deduced from the Bayes error rate and also present the asymptotic bias of the estimator of the criterion. From these results, a reasonable and easy-to-use criterion is established that can be directly calculated from the two sets of samples from both distributions.

Basic Information

  • Paper ID: 2212.03397
  • Title: Criterion for the resemblance between the mother and the model distribution
  • Author: Yo Sheena (Faculty of Data Science, Shiga University, Japan; Visiting Professor of the Institute of Statistical Mathematics, Japan)
  • Classification: math.ST stat.TH
  • Publication Date: November 13, 2025 (arXiv v3)
  • Paper Link: https://arxiv.org/abs/2212.03397

Abstract

This paper investigates the measurement of similarity between probabilistic distribution models and true data distributions (mother distributions). It proposes a criterion based on discretized sample Hellinger distance that does not require explicit probability density functions of the model distribution, making it applicable to complex models such as deep learning. Unlike traditional hypothesis testing (e.g., Kolmogorov-Smirnov test), this criterion can yield positive conclusions that "two distributions are sufficiently close" under a given threshold. The research establishes reasonable thresholds derived from Bayes error rates and provides asymptotic bias analysis of the criterion estimators.

Research Background and Motivation

1. Core Problem

When a probabilistic distribution model aims to approximate an unknown true data distribution (mother distribution), establishing an effective similarity measurement criterion is a fundamental problem. This is particularly important in the evaluation of generative models (such as deep generative models and Bayesian models).

2. Importance of the Problem

  • Model Evaluation Needs: In machine learning and statistical modeling, it is necessary to determine whether the generated model sufficiently approximates the true data distribution
  • Practical Significance: Addresses practical questions such as whether training is sufficient, whether parametric models are appropriate, and whether sample sizes are adequate
  • Theoretical Value: Provides interpretable quantitative standards for distribution similarity

3. Limitations of Existing Methods

Kullback-Leibler Divergence and Information Criteria (e.g., AIC):

  • Require explicit probability density functions g_m(x) of the model distribution
  • Difficult to obtain explicit forms for complex models (e.g., deep neural networks, Bayesian models)
  • While usable for model comparison, the numerical values themselves lack statistical significance and cannot be used for model evaluation

Statistical Hypothesis Testing (e.g., K-S Test):

  • When rejecting the null hypothesis, only the conclusion "two distributions differ" can be drawn, though they may actually be quite similar
  • With large samples, easily rejects hypotheses due to detecting minor differences
  • When accepting the hypothesis, cannot draw positive conclusions that "two distributions are sufficiently close"
  • p-values do not directly reflect the degree of distribution proximity

4. Research Motivation

Propose a criterion that:

  • Can be computed directly from samples without requiring explicit density functions
  • Yields positive conclusions about "sufficient closeness"
  • Has interpretable thresholds

Core Contributions

  1. Proposes a two-sample criterion based on discretized Hellinger distance: Samples from the two distributions are discretized (quantized), and the Hellinger distance is computed between the resulting multinomial distributions
  2. Establishes theoretical connection with Bayes error rate (Theorem 1): Proves the relationship between f-divergence and Bayes error rate, making divergence values practically interpretable
  3. Derives reasonable threshold standards: Based on Bayes error rate, derives the Hellinger distance threshold δ* = 8ϵ², where ϵ corresponds to the degree of error rate deviation from random guessing
  4. Proposes moving region discretization method: Compared to fixed region methods, achieves superior asymptotic efficiency at the n⁻² order (Theorems 2 and 3)
  5. Provides asymptotic bias analysis of estimators (Theorem 4): Proves that E[D[m̂⁽¹⁾ : m̂⁽²⁾]] is bounded above by E[D[m⁽¹⁾ : m⁽²⁾]] + √(8p'/n₂) + o(n₁⁻¹) + o(n₂⁻¹/²)
  6. Establishes practical model fitting criterion:
    D[m̂⁽¹⁾ : m̂⁽²⁾] + p'/(2n₁) + √(8p'/n₂) < 8ϵ²
    

Detailed Methodology

Task Definition

Given two sample sets:

  • Mother distribution observations: X⁽¹⁾ = {X₁⁽¹⁾, ..., Xₙ₁⁽¹⁾}
  • Model-generated samples: X⁽²⁾ = {X₁⁽²⁾, ..., Xₙ₂⁽²⁾}

Objective: Establish a criterion to determine whether the mother distribution and model distribution are sufficiently close.

Method Architecture

1. Relationship between f-divergence and Bayes Error Rate

For two probability density functions g₁(x) and g₂(x), f-divergence is defined as:

D_f[g₁(x) | g₂(x)] = ∫ g₁(x)f(g₂(x)/g₁(x))dµ(x)

Bayes error rate is:

Er[g₁(x)|g₂(x)] = (1/2)∫ min(g₁(x), g₂(x))dµ

Theorem 1 establishes the key connection: If D_f[g₁(x) | g₂(x)] < δ, then Er[g₁(x) | g₂(x)] ≥ α(δ), where α(δ) is a function of δ.

For Hellinger distance (f(x) = 2(1-√x)²), approximately:

α(δ) ≈ (1 - √(δ/2))/2

Requiring the Bayes error rate to be at least 1/2 - ϵ (close to random guessing), that is, solving α(δ) = 1/2 - ϵ for δ, we obtain the threshold:

δ* = 8ϵ²
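
The link between ϵ, the threshold, and the approximate Bayes error bound can be checked numerically. The following is a minimal sketch that uses only the approximation α(δ) ≈ (1 - √(δ/2))/2 quoted above; the function names are illustrative and do not come from the paper.

```python
import math

def bayes_error_lower_bound(delta: float) -> float:
    """Approximate alpha(delta) = (1 - sqrt(delta / 2)) / 2 for the Hellinger distance."""
    return (1.0 - math.sqrt(delta / 2.0)) / 2.0

def hellinger_threshold(eps: float) -> float:
    """Threshold delta* = 8 * eps**2, chosen so that alpha(delta*) = 1/2 - eps."""
    return 8.0 * eps ** 2

for eps in (0.01, 0.05, 0.1):
    delta_star = hellinger_threshold(eps)
    # The last column should reproduce 1/2 - eps up to rounding error.
    print(eps, delta_star, bayes_error_lower_bound(delta_star))
```

Substituting δ* = 8ϵ² gives √(δ*/2) = 2ϵ, so the bound equals (1 - 2ϵ)/2 = 1/2 - ϵ, which is what the loop prints.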

2. Discretization Methods

Fixed Region Method: Pre-specify region partitions I_i independently of samples.

Moving Region Method (recommended in this paper): Dynamically determine regions based on quantiles of sample X⁽²⁾.

For scalar case (k=1):

  • Select quantile points λᵢ = i/(p+1), i = 1,...,p
  • Use order statistics of X⁽²⁾ to determine interval endpoints: ξ̂ᵢ = X₍ñᵢ₎⁽²⁾, where ñᵢ = ⌊n₂λᵢ⌋
  • Define moving intervals I_i = (ξ̂ᵢ, ξ̂ᵢ₊₁)
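
The scalar steps above can be sketched as follows. This is an illustrative reading of the description, not the author's code: the function name is hypothetical, the sketch assumes n₂ ≥ p + 1 so that every rank is at least 1, and it returns only the p cut points (treating the outermost intervals as unbounded is an assumption).

```python
def moving_cut_points(x2, p):
    """Return the p cut points xi_hat_i = X2_(n_i), with n_i = floor(n2 * i / (p + 1))."""
    n2 = len(x2)
    if n2 < p + 1:
        raise ValueError("need at least p + 1 model samples")
    ordered = sorted(x2)
    cuts = []
    for i in range(1, p + 1):
        rank = (n2 * i) // (p + 1)      # floor(n2 * lambda_i), computed in integers
        cuts.append(ordered[rank - 1])  # order statistics are 1-indexed
    return cuts

cuts = moving_cut_points(list(range(1, 101)), p=3)
print(cuts)  # [25, 50, 75]
```

Integer arithmetic is used for the rank so that the floor is not affected by floating-point rounding of n₂λᵢ.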

For vector case (k≥2):

  • Employ recursive partitioning method
  • At step i, partition along the i-th coordinate using order statistics
  • Partition depth is l (≤ k)

3. Multinomial Distribution Construction

Based on moving regions A_j(l), construct two multinomial distributions:

m⁽¹⁾ = {m_j(l)⁽¹⁾}, m_j(l)⁽¹⁾ = P(X ∈ A_j(l)|mother distribution)
m⁽²⁾ = {m_j(l)⁽²⁾}, m_j(l)⁽²⁾ = P(X ∈ A_j(l)|model distribution)

Estimators are:

m̂⁽¹⁾ = {m̂_j(l)⁽¹⁾}, m̂_j(l)⁽¹⁾ = #{X⁽¹⁾ | X⁽¹⁾ ∈ A_j(l)}/n₁
m̂⁽²⁾ = {m̂_j(l)⁽²⁾}, m̂_j(l)⁽²⁾ = 1/(p'_j(l-1) + 1)

4. Hellinger Distance Calculation

Hellinger distance is defined as:

D[m⁽¹⁾ : m⁽²⁾] = 2∑_j(l) (√m_j(l)⁽¹⁾ - √m_j(l)⁽²⁾)²

Estimator is:

D[m̂⁽¹⁾ : m̂⁽²⁾] = 2∑_j(l) (√m̂_j(l)⁽¹⁾ - √m̂_j(l)⁽²⁾)²
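
For the scalar case, the estimator can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the paper's implementation: the helper names are hypothetical, each of the p + 1 cells is assigned model mass 1/(p + 1) (the equal-mass reading of m̂⁽²⁾ for k = 1), and a mother observation that ties with a cut point is placed in the lower cell.

```python
import bisect
import math
import random

def hellinger(m1, m2):
    """D[m1 : m2] = 2 * sum_j (sqrt(m1_j) - sqrt(m2_j))**2."""
    return 2.0 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(m1, m2))

def estimate_distance(x1, x2, p):
    """Scalar-case estimate of D[m_hat1 : m_hat2] with p cut points taken from x2."""
    n1, n2 = len(x1), len(x2)
    ordered = sorted(x2)
    # Cut points are order statistics of the model sample (requires n2 >= p + 1).
    cuts = [ordered[(n2 * i) // (p + 1) - 1] for i in range(1, p + 1)]
    counts = [0] * (p + 1)
    for x in x1:
        counts[bisect.bisect_left(cuts, x)] += 1   # index of the cell containing x
    m1_hat = [c / n1 for c in counts]
    m2_hat = [1.0 / (p + 1)] * (p + 1)             # equal mass per cell (assumption)
    return hellinger(m1_hat, m2_hat)

rng = random.Random(0)
x2 = [rng.gauss(0.0, 1.0) for _ in range(20000)]        # model samples
same = [rng.gauss(0.0, 1.0) for _ in range(2000)]       # mother equals model
shifted = [rng.gauss(1.0, 1.0) for _ in range(2000)]    # mother shifted by 1
print(estimate_distance(same, x2, p=9), estimate_distance(shifted, x2, p=9))
```

With this scaling the distance ranges from 0 (identical cell probabilities) to 4 (disjoint support), so the shifted sample should give a visibly larger value than the matching one.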

Technical Innovations

  1. Theoretical Innovation:
    • Establishes general relationship between f-divergence and Bayes error rate (Theorem 1), providing intuitive interpretation of divergence values in terms of classification error
    • Proves asymptotic superiority of moving region method in single-sample problems (Theorems 2, 3)
  2. Method Innovation:
    • Uses moving region method rather than fixed regions, improving estimation efficiency
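    • Selects Hellinger distance to avoid zero-estimate problems: within the α-divergence family, divergences with index -1 < α < 1 (which include the Hellinger distance) do not diverge when an estimated cell probability is zero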
    • Uses model samples X⁽²⁾ to construct regions (since typically n₂ >> n₁)
  3. Bias Analysis:
    • Theorem 4 provides asymptotic bias upper bound for estimators
    • Effect of n₂ is O(n₂⁻¹/²), effect of n₁ is O(n₁⁻¹)
    • Explains why relatively large n₂ is needed
  4. Practical Criterion:
    • Provides complete criterion with bias correction (Formula 40)
    • Threshold 8ϵ² has clear statistical meaning (corresponding to Bayes error rate)
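
Because the n₂ term enters the criterion as √(8p'/n₂), it is straightforward to ask how many model samples are needed to keep that term under a chosen budget. The sketch below simply inverts the expression; the function name and the budget value are illustrative assumptions, not quantities from the paper.

```python
import math

def min_model_samples(p_prime: int, budget: float) -> int:
    """Approximate smallest n2 for which sqrt(8 * p' / n2) does not exceed budget."""
    return math.ceil(8 * p_prime / budget ** 2)

# Example: p' = 63 and half of the eps = 0.05 threshold (0.02) reserved for the n2 term.
print(min_model_samples(63, 0.01))
# Halving the budget roughly quadruples the required n2.
print(min_model_samples(63, 0.005))
```

The quadratic growth in 1/budget is the practical meaning of the O(n₂⁻¹/²) rate: the n₂ term shrinks slowly, so n₂ has to be large.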

Experimental Setup

Datasets

Case 1: Multivariate Normal Distribution

  • Mother Distribution: X⁽¹⁾ᵢ ~ N(α, I_k + βV), where V_{ij} = 0.95^{|i-j|}
  • Model Distribution: X⁽²⁾ᵢ ~ N(0, I_k) (standard normal)
  • Parameter Settings:
    • Dimension k = 3, partition depth l = 3
    • Number of partitions per variable p = p_{j(1)} = p_{j(2)} = 3
    • Total number of regions p' = (3+1)³ - 1 = 63
    • Similarity parameters (α, β) = (0,0), (0.01,0.01), (0.1,0.1), (1,1)
    • Sample sizes n₁ ∈ {10³, 10⁴, 10⁵, 10⁶, 10⁷}, n₂ = 10⁷

High-Dimensional Case:

  • k = 10, p = p_{j(1)} = ... = p_{j(9)} = 3
  • Since full-depth partitioning requires p' = (3+1)¹⁰ - 1 > 10⁶, use l = 2
  • Examine all pairwise two-dimensional marginal distributions
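
The arithmetic behind these settings is easy to reproduce. The sketch below applies the formula p' = (p + 1)^l - 1 used above to the three configurations and counts the two-dimensional marginals; the function name is illustrative, and the l = 2 value is obtained by applying the same formula rather than quoted from the paper.

```python
from itertools import combinations

def region_parameter(p: int, depth: int) -> int:
    """p' = (p + 1)**depth - 1 when every split uses p cut points."""
    return (p + 1) ** depth - 1

print(region_parameter(3, 3))    # 63, the k = 3, l = 3 setting
print(region_parameter(3, 10))   # 1048575, full depth for k = 10
print(region_parameter(3, 2))    # 15, depth l = 2

# All unordered pairs of the 10 coordinates, one two-dimensional marginal each.
pairs = list(combinations(range(10), 2))
print(len(pairs))                # 45
```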

Case 2: Bayesian Model

  • Dataset: UCI Power Plant dataset (9568 samples)
  • Model: Normal regression model y = β₁ + ∑ᵢ₌₂⁵ βᵢxᵢ + ϵ
  • Prior Distribution:
    • β₁ ~ Cauchy(0, 10)
    • βᵢ ~ Cauchy(0, 2.5), i = 2,...,5
    • σ ~ t(5, 5, 1)
  • MCMC Samples: 4000 posterior samples of β
  • Predicted Value Samples: n₂ = 4000 × 9568 ≈ 3.827×10⁷
  • True Value Samples: n₁ = 9568
  • Number of Regions: p' = 10

Evaluation Metrics

  1. Hellinger Distance: D[m̂⁽¹⁾ : m̂⁽²⁾]
  2. Complete Criterion Value (left side of Formula 40): D[m̂⁽¹⁾ : m̂⁽²⁾] + p'/(2n₁) + √(8p'/n₂)
  3. Threshold: 8ϵ² (0.02 when ϵ = 0.05, 0.0008 when ϵ = 0.01)
  4. Comparison Method: p-value from Kolmogorov-Smirnov test

Implementation Details

  • Bias correction terms: p'/(2n₁) + √(8p'/n₂)
  • Moving region method uses equal-mass partitioning (λᵢ = i/(p+1))
  • For high-dimensional cases, employ dimensionality reduction strategy (two-dimensional marginal distributions)
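
Putting the metrics together, the decision rule can be written as a small function. This is a sketch of the criterion as stated in this summary, with hypothetical function names; the inputs in the usage lines (an estimated distance of 0.005 with p' = 63, n₁ = 10⁵, n₂ = 10⁷) are made-up illustrative values, not results from the paper.

```python
import math

def bias_correction(p_prime: int, n1: int, n2: int) -> float:
    """p' / (2 * n1) + sqrt(8 * p' / n2)."""
    return p_prime / (2.0 * n1) + math.sqrt(8.0 * p_prime / n2)

def criterion_value(d_hat: float, p_prime: int, n1: int, n2: int) -> float:
    """Left side of the criterion: estimated distance plus bias correction."""
    return d_hat + bias_correction(p_prime, n1, n2)

def is_sufficiently_close(d_hat, p_prime, n1, n2, eps=0.05) -> bool:
    """True when the criterion value is below the threshold 8 * eps**2."""
    return criterion_value(d_hat, p_prime, n1, n2) < 8.0 * eps ** 2

# Illustrative (made-up) inputs.
print(criterion_value(0.005, 63, 10**5, 10**7))
print(is_sufficiently_close(0.005, 63, 10**5, 10**7, eps=0.05))
print(is_sufficiently_close(0.005, 63, 10**5, 10**7, eps=0.01))
```

The same estimate can pass at ϵ = 0.05 and fail at ϵ = 0.01, since the threshold drops from 0.02 to 0.0008.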

Experimental Results

Main Results

Case 1: Three-Dimensional Normal Distribution (k=3, l=3, p'=63, n₂=10⁷)

(α, β)         n₁=10⁷    n₁=10⁶    n₁=10⁵    n₁=10⁴
(0, 0)         0.00711   0.00717   0.00773   0.0136
(0.01, 0.01)   0.00735   0.00741   0.00797   0.0137
(0.1, 0.1)     0.0277    0.0277    0.0290    0.0349
(1, 1)         0.699     0.698     0.707     0.707

Key Findings:

  1. (α, β) = (0, 0) and (0.01, 0.01): Criterion value < 0.02 (threshold for ϵ=0.05), conclusion is sufficiently close
  2. (α, β) = (0.1, 0.1): Criterion value approximately 0.028-0.035 > 0.02, but < 0.08 (threshold for ϵ=0.1), close under relaxed standard
  3. (α, β) = (1, 1): Criterion value approximately 0.7 >> 0.02, clearly not close
  4. Sample Size Effect: As n₁ increases from 10⁴ to 10⁷, criterion value decreases from 0.0136 to 0.00711 (α=β=0 case)
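
Part of the sample size effect comes from the bias correction terms alone, which can be computed without any data. The sketch below evaluates p'/(2n₁) + √(8p'/n₂) for the Case 1 settings (p' = 63, n₂ = 10⁷); the function name is illustrative. Assuming the reported values are full criterion values, whatever remains after subtracting this correction is the estimated Hellinger distance itself.

```python
import math

def correction_terms(p_prime: int, n1: int, n2: int) -> float:
    """Bias correction p' / (2 * n1) + sqrt(8 * p' / n2)."""
    return p_prime / (2 * n1) + math.sqrt(8 * p_prime / n2)

p_prime, n2 = 63, 10**7
for n1 in (10**4, 10**5, 10**6, 10**7):
    # The sqrt term is fixed by n2; only the first term shrinks as n1 grows.
    print(f"n1={n1:.0e}  correction={correction_terms(p_prime, n1, n2):.5f}")
```

With n₂ fixed at 10⁷, the correction cannot fall below √(8·63/10⁷) ≈ 0.0071, which is why the (0, 0) values level off near 0.0071 as n₁ grows.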

High-Dimensional Case (k=10, l=2, Two-Dimensional Marginal Distributions)

For (α, β) = (0.1, 0.1):

  • n₁=10³, n₂=10⁷: All 45 variable pairs have criterion values between 0.023-0.038, all > 0.02, cannot conclude closeness
  • n₁=10⁴, n₂=10⁷: All pairs have criterion values between 0.015-0.019, all < 0.02, conclusion is sufficiently close

This is consistent with the sample size requirements: in this setting, n₁ of order 10⁴ was needed before closeness could be concluded at the ϵ = 0.05 threshold.

Case Analysis

Bayesian Regression Model

Experimental Results:

  • Hellinger Distance: D[m̂⁽¹⁾ : m̂⁽²⁾] ≈ 0.0113
  • Bias Correction Term: p'/(2n₁) + √(8p'/n₂) ≈ 0.0020
  • Complete Criterion Value: ≈ 0.0133
  • Corresponding ϵ: Solving 8ϵ² = 0.0133 yields ϵ ≈ 0.04
  • Corresponding Bayes Error Rate: 0.5 - 0.04 = 0.46
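
The last two steps invert the threshold relation 8ϵ² = criterion value. A minimal sketch of that inversion, with an illustrative function name, is:

```python
import math

def implied_epsilon(criterion: float) -> float:
    """Boundary value of eps at which 8 * eps**2 equals the criterion value."""
    return math.sqrt(criterion / 8.0)

eps = implied_epsilon(0.0133)
print(round(eps, 3), round(0.5 - eps, 3))  # 0.041 0.459
```

Rounded to two decimals, this reproduces the ϵ ≈ 0.04 and Bayes error rate ≈ 0.46 quoted above.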

K-S Test Comparison:

  • p-value = 7.587×10⁻⁸, so the null hypothesis of identical distributions is rejected at any conventional significance level
  • However, this paper's criterion indicates that under Bayes error rate standard of 0.46, distributions are sufficiently close

Histogram Analysis (Figure 2):

  • Distributions of ŷ and y have similar shapes
  • Supports the "sufficiently close" conclusion

This case demonstrates:

  1. K-S test yields "rejection" conclusion, but actual distributions are already quite close
  2. This paper's criterion provides positive "sufficiently close" conclusion, better matching practical needs
  3. The threshold is interpretable: a Bayes error rate of 0.46 is close to the 0.5 of random guessing

Experimental Findings

  1. Method Effectiveness: The criterion correctly distinguishes distribution pairs with different similarities
  2. Sample Size Requirements:
    • Effect of n₂ is O(n₂⁻¹/²), requiring relatively large values (10⁷ in experiments)
    • Effect of n₁ is O(n₁⁻¹), typically 10⁴ is sufficient
    • Consistent with theoretical analysis (Theorem 4)
  3. Dimensionality Effect:
    • High-dimensional full-depth partitioning requires exponential sample sizes
    • Two-dimensional marginal distribution strategy is practical compromise
  4. Comparison with Hypothesis Testing:
    • K-S test is overly sensitive with large samples
    • This paper's criterion provides interpretable "sufficiently close" judgment
  5. Threshold Reasonableness:
    • ϵ = 0.05 (corresponding threshold 0.02) is reasonable standard choice
    • Can be adjusted based on application needs (e.g., ϵ = 0.1 corresponds to 0.08)

Related Work

1. Two-Sample Comparison Methods

Richardson and Weiss (2018):

  • Closest method to this paper
  • Employs fixed region method
  • Uses binomial distribution collection rather than multinomial distribution
  • Finally uses z-test for evaluation

Johnson and Dasu (1998):

  • Divides high-dimensional data into categorical and continuous variables
  • Uses multiple testing to judge similarity

2. K-S Test Extensions

Press and Teukolsky (1988): Two-dimensional K-S test

Hagen et al. (2020): High-dimensional K-S distance

Loudin and Miettinen (2003):

  • Compresses high-dimensional distributions to one dimension
  • Uses one-dimensional K-S test

3. Kernel Methods

Gretton et al. (2007):

  • Applies reproducing kernel Hilbert space theory
  • Measures distribution similarity through function similarity
  • Ultimately employs traditional hypothesis testing

4. Generative Model Evaluation

Theis et al. (2015):

  • Evaluates probabilistic image generation models
  • Points out that different evaluation methods may lead to completely different conclusions

Borji (2018):

  • Comprehensive survey of evaluation metrics for generative adversarial networks
  • Some methods applicable to two-sample problems

Advantages of This Paper

  1. No Explicit Density Required: Applicable to complex models (deep learning, Bayesian models)
  2. Positive Conclusions: Can judge "sufficiently close" rather than only "different"
  3. Interpretable Threshold: Based on Bayes error rate with statistical meaning
  4. Theoretical Guarantees: Provides asymptotic bias analysis and efficiency comparison
  5. Practicality: Computed directly from samples, easy to implement

Conclusions and Discussion

Main Conclusions

  1. Theoretical Contributions:
    • Establishes general relationship between f-divergence and Bayes error rate (Theorem 1)
    • Proves asymptotic superiority of moving region method (Theorems 2, 3)
    • Provides bias upper bound for two-sample problem estimators (Theorem 4)
  2. Method Contributions:
    • Proposes practical criterion based on discretized Hellinger distance
    • Threshold δ* = 8ϵ² has clear statistical interpretation
    • Complete criterion includes bias correction, directly applicable
  3. Experimental Validation:
    • Multivariate normal distribution experiments validate method effectiveness and sample size requirements
    • Bayesian model case demonstrates practical application value
    • Comparison with K-S test shows advantages of "positive conclusions"

Limitations

  1. Sample Size Requirements:
    • n₂ needs to be relatively large (O(n₂⁻¹/²) effect)
    • While model samples are typically easy to obtain, computational cost remains
  2. Curse of Dimensionality:
    • Full-depth partitioning infeasible in high dimensions
    • Requires dimensionality reduction strategies (e.g., two-dimensional marginal distributions)
    • May lose high-dimensional dependency structure information
  3. Incomplete High-Dimensional Theory:
    • Theorem 3's O(n⁻²) superiority only fully proven for scalar case (k=1)
    • High-dimensional case (k≥2) O(n⁻²) superiority not rigorously proven
  4. Threshold Selection:
    • Choice of ϵ (0.05 or 0.01) remains somewhat subjective
    • While based on Bayes error rate, different applications may require different standards
  5. Distribution Assumptions:
    • Method designed for continuous distributions
    • Requires adjustment for mixed (discrete + continuous) distributions

Future Directions

  1. High-Dimensional Theory: Complete asymptotic theory for moving region method when k≥2
  2. Adaptive Region Partitioning:
    • Adaptively select partition number p and depth l based on data characteristics
    • Non-uniform partitioning strategies
  3. Multi-Sample Extension: Generalize to simultaneous comparison of multiple distributions
  4. Computational Optimization:
    • Efficient implementation for large-scale data
    • Parallel computing strategies
  5. Other Divergences:
    • Study properties of other f-divergences (e.g., χ² divergence)
    • Compare applicable scenarios for different divergences

In-Depth Evaluation

Strengths

  1. Theoretical Rigor:
    • Theorem 1 establishing f-divergence and Bayes error rate relationship has universality and depth
    • Asymptotic analysis (Theorems 2-4) with complete mathematical derivations and detailed proofs
    • Theoretical results provide solid foundation for practice
  2. Method Innovation:
    • Core Innovation: Introduces Bayes error rate into divergence threshold setting, so that otherwise abstract divergence values can be read as a classification error rate
    • Moving region method superiority over fixed regions has theoretical support
    • Hellinger distance choice avoiding technical issues (zero estimates) reflects practical consideration
  3. Practical Value:
    • Criterion (40) has simple form, easy to compute and apply
    • No explicit density function needed, applicable to black-box models (deep learning)
    • Provides "positive conclusions" satisfying practical needs
  4. Experimental Sufficiency:
    • Multivariate normal experiments systematically examine different similarities and sample sizes
    • Bayesian model case demonstrates practical application scenarios
    • K-S test comparison is convincing
  5. Writing Clarity:
    • Clear structure, coherent logic
    • Mathematical notation clearly defined
    • Figures and tables (e.g., Figure 1, Tables 1-6) effectively support arguments

Weaknesses

  1. Incomplete High-Dimensional Theory:
    • Theorem 3 only gives O(n⁻¹) results, O(n⁻²) terms not explicit
    • Moving region method superiority for k≥2 not rigorously proven
    • Limits theoretical completeness
  2. Limited Experimental Design:
    • Case 1 only considers normal distributions, limited distribution types
    • Lacks systematic comparison with other two-sample methods (e.g., MMD)
    • High-dimensional experiments only to k=10, higher dimensions unexplored
  3. Method Applicability Limitations:
    • Handling of discrete or mixed distributions not discussed
    • Choice of region number p' and depth l lacks systematic guidance
    • Sample size requirements (particularly n₂) may still be high in some scenarios
  4. Threshold Subjectivity:
    • Choice of ϵ (0.05, 0.01) while having Bayes error rate interpretation still requires user decision
    • Reasonable thresholds may vary greatly across application domains
    • Lacks guidance for threshold selection in specific applications
  5. Missing Computational Complexity Analysis:
    • Algorithm time and space complexity not discussed
    • Scalability to large-scale data not clarified
  6. Theorem 1 Approximation:
    • Computing α(δ) involves complex optimization (Equations 9-10)
    • Actual use involves Taylor expansion approximation (around Figure 1)
    • Approximation error quantification insufficient

Impact

  1. Contribution to Field:
    • Provides new theoretical perspective for distribution similarity evaluation (Bayes error rate connection)
    • Advances application of discretization methods in statistical inference
    • Provides practical tool for generative model evaluation
  2. Practical Value:
    • High Practicality: Applicable to deep generative models (GANs, VAEs), Bayesian models and other scenarios without explicit densities
    • Can be used for model selection, training monitoring, data quality assessment
    • Relatively simple code implementation
  3. Reproducibility:
    • Detailed method description, clear algorithm steps
    • Explicit experimental settings (sample sizes, parameters, etc.)
    • Complete theoretical derivations (proofs in appendix)
    • Recommendation: Providing open-source code would greatly enhance reproducibility
  4. Potential Application Domains:
    • Machine Learning: Generative model evaluation, domain adaptation
    • Statistics: Goodness-of-fit testing, model diagnostics
    • Data Science: Data quality monitoring, A/B testing
    • Scientific Computing: Simulation validation, uncertainty quantification

Applicable Scenarios

Most Suitable Scenarios:

  1. Complex Generative Model Evaluation: Deep neural network generative models (GANs, VAEs, diffusion models)
  2. Bayesian Posterior Evaluation: Comparing MCMC samples with true distributions
  3. Large Samples Available: Model can generate large sample quantities (n₂ >> n₁)
  4. Positive Conclusions Needed: Determining "whether sufficiently good" rather than "whether different"
  5. Continuous Distributions: Method designed for continuous random vectors

Less Suitable Scenarios:

  1. Small Samples: When both n₁ and n₂ are small, bias correction terms may be substantial
  2. Very High Dimensions: Dimension k >> 10 requires special handling (dimensionality reduction)
  3. Discrete Distributions: Requires method adjustment
  4. Exact p-values Needed: This method provides threshold judgment rather than p-values
  5. Real-Time Online Evaluation: Computational cost may be high

Comparison with Other Methods:

  • vs. K-S Test: This method provides positive conclusions and interpretable thresholds
  • vs. AIC/BIC: This method requires no explicit density functions
  • vs. MMD (Maximum Mean Discrepancy): This method has clear statistical interpretation (Bayes error rate)
  • vs. FID (Fréchet Inception Distance): This method does not depend on specific feature extractors

References

Key references cited in this paper include:

  1. Amari (2016): Information Geometry and Its Applications - Information geometric foundations of f-divergence theory
  2. Csiszár (1975): Foundational work on f-divergence
  3. Gretton et al. (2007): Application of kernel methods in two-sample testing
  4. Richardson and Weiss (2018): Closest method to this paper, employing fixed regions
  5. Sheena (2018): Author's prior work proving superiority of moving region method in scalar case
  6. Theis et al. (2015): Comparative study of generative model evaluation methods
  7. Borji (2018): Comprehensive survey of GAN evaluation metrics

Overall Assessment: This is an excellent paper with rigorous theory and practical methods. The core innovation lies in introducing the Bayes error rate into divergence threshold setting, so that an otherwise abstract statistic can be read as a classification error rate. The method is particularly suitable for evaluating complex models without explicit density functions, filling an important gap in the field. The main limitations are the incomplete high-dimensional theory and limited experimental coverage, but these do not diminish its academic value and practicality. Readers applying the method should pay attention to the sample size requirements (particularly n₂) and the dimensionality limitations, and use dimensionality reduction strategies when necessary.