Strong consistency of pseudo-likelihood parameter estimator for univariate Gaussian mixture models

Lember, Kangro, Kuljus
We consider a new method for estimating the parameters of univariate Gaussian mixture models. The method relies on a nonparametric density estimator $\hat{f}_n$ (typically a kernel estimator). For every set of Gaussian mixture components, $\hat{f}_n$ is used to find the best set of mixture weights. That set is obtained by minimizing the $L_2$ distance between $\hat{f}_n$ and the Gaussian mixture density with the given component parameters. The densities together with the obtained weights are then plugged in to the likelihood function, resulting in the so-called pseudo-likelihood function. The final parameter estimators are the parameter values that maximize the pseudo-likelihood function together with the corresponding weights. The advantages of the pseudo-likelihood over the full likelihood are: 1) its arguments are the means and variances only, mixture weights are also functions of the means and variances; 2) unlike the likelihood function, it is always bounded above. Thus, the maximizer of the pseudo-likelihood function -- referred to as the pseudo-likelihood estimator -- always exists. In this article, we prove that the pseudo-likelihood estimator is strongly consistent.

Basic Information

  • Paper ID: 2510.14482
  • Title: Strong consistency of pseudo-likelihood parameter estimator for univariate Gaussian mixture models
  • Authors: Jüri Lember, Raul Kangro, Kristi Kuljus (Institute of Mathematics and Statistics, University of Tartu, Estonia)
  • Classification: math.ST stat.TH
  • Publication Date: October 16, 2025
  • Paper Link: https://arxiv.org/abs/2510.14482

Abstract

This paper proposes a new method for estimating the parameters of univariate Gaussian mixture models. The method is based on a nonparametric density estimator $\hat{f}_n$ (typically a kernel estimator). For each given set of Gaussian mixture component parameters, the optimal mixing weights are found by minimizing the $L_2$ distance between $\hat{f}_n$ and the Gaussian mixture density. The component densities, together with the obtained weights, are then plugged into the likelihood function, forming the so-called pseudo-likelihood function. The final parameter estimators are the parameter values that maximize the pseudo-likelihood function, together with the corresponding weights. Compared to the full likelihood, the pseudo-likelihood has two advantages: 1) its arguments are the means and variances only, since the mixing weights are themselves functions of the means and variances; 2) unlike the likelihood function, it is always bounded above. Therefore the maximizer of the pseudo-likelihood function, called the pseudo-likelihood estimator, always exists. The paper proves that this estimator is strongly consistent.

Research Background and Motivation

Problem Background

  1. Unboundedness of likelihood for Gaussian mixture models: The likelihood function of a Gaussian mixture model is unbounded, a well-known problem. When a component mean coincides with an observation and the variance of that component approaches zero, the likelihood tends to infinity.
  2. Limitations of existing solutions:
    • Restricting the parameter space
    • Using sieve methods
    • Penalized maximum likelihood estimation
    • Bayesian methods
    • Profile likelihood, etc.

    These methods typically require imposing restrictions or penalty terms on variances.
  3. Research motivation:
    • Provide a method that does not require any restrictions on parameters
    • Maintain similarity with standard maximum likelihood estimation
    • Ensure existence and consistency of the estimator
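The unboundedness described in item 1 can be seen numerically. The following is a minimal sketch, assuming a hypothetical five-point sample and a two-component mixture with fixed weights of 0.5; the helper names are illustrative and do not come from the paper. One component mean is placed exactly on the first observation and its standard deviation is shrunk.

```python
import math

def normal_pdf(y, mu, sigma):
    """Density of N(mu, sigma^2) at y."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_log_likelihood(ys, weights, mus, sigmas):
    """Ordinary (full) log-likelihood of a Gaussian mixture."""
    total = 0.0
    for y in ys:
        dens = sum(w * normal_pdf(y, m, s)
                   for w, m, s in zip(weights, mus, sigmas))
        total += math.log(dens)
    return total

# Hypothetical toy sample, chosen only for illustration.
ys = [-1.3, -0.4, 0.2, 0.9, 2.1]

# Component 1 sits on the first observation; component 2 is a broad N(0, 1).
sigmas_to_try = (1e-1, 1e-2, 1e-4, 1e-8)
log_liks = [
    mixture_log_likelihood(ys, [0.5, 0.5], [ys[0], 0.0], [sigma, 1.0])
    for sigma in sigmas_to_try
]

for sigma, value in zip(sigmas_to_try, log_liks):
    print(f"sigma_1 = {sigma:g}: log-likelihood = {value:.3f}")
```

In this sketch the log-likelihood keeps increasing as the first standard deviation shrinks, growing by roughly ln 10 for each tenfold reduction, so no finite maximizer exists without further restrictions.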

Why It Matters

  • Gaussian mixture models are widely applied in statistics and machine learning
  • The unbounded likelihood problem hinders the application of standard MLE
  • There is a need for theoretically reliable and practically feasible estimation methods

Core Contributions

  1. Proposes the pseudo-likelihood method: A novel parameter estimation method that determines mixing weights through $L_2$ distance minimization and then constructs the pseudo-likelihood function.
  2. Proves strong consistency: Under i.i.d. sample assumptions, proves the strong consistency of the pseudo-likelihood estimator: $\hat{\theta}_n \xrightarrow{a.s.} \theta^*$ and $v_n(\hat{\theta}_n) \xrightarrow{a.s.} w^*$.
  3. No parameter restrictions: The method does not require imposing lower bounds on variances or other constraints.
  4. Theoretical framework: Establishes a theoretical framework for handling unbounded means and vanishing or unbounded variances.

Methodology Details

Problem Definition

Given i.i.d. observations $Y_1, \ldots, Y_n$ from a $k$-component univariate Gaussian mixture distribution, the goal is to estimate:

  • Component parameters: $\theta_i = (\mu_i, \sigma_i)$, $i = 1, \ldots, k$
  • Mixing weights: $w_i > 0$, $\sum_{i=1}^k w_i = 1$

The true density is: $f(\cdot) = \sum_{i=1}^k w_i^* g(\theta_i^*, \cdot)$, where $g(\theta_i, \cdot)$ denotes the Gaussian density with mean $\mu_i$ and standard deviation $\sigma_i$.

Model Architecture

Step One: Weight Estimation

For given parameters $\theta = (\theta_1, \ldots, \theta_k)$, determine weights by minimizing the $L_2$ distance:

$$v_n(\theta) := \arg\inf_{w \in S_k} \left\| \hat{f}_n(\cdot) - \sum_{i=1}^k w_i g(\theta_i, \cdot) \right\|$$

where $S_k$ is the $(k-1)$-dimensional simplex and $\hat{f}_n$ is a nonparametric density estimator.
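For $k = 2$ the weight step has a closed form, which the following minimal sketch illustrates. It assumes (these are assumptions of the sketch, not statements from the paper) that $\hat{f}_n$ is a Gaussian-kernel estimator with bandwidth `h` and that the simplex includes its boundary. Writing the objective as $\|(\hat{f}_n - g_2) - w (g_1 - g_2)\|^2$, the unconstrained minimizer is $\langle \hat{f}_n - g_2, g_1 - g_2 \rangle / \|g_1 - g_2\|^2$, and all inner products are available analytically because the integral of a product of two Gaussian densities is again a Gaussian density evaluated at the difference of the means. All function names are hypothetical.

```python
import math
import random

def normal_pdf(x, mu, sd):
    """Density of N(mu, sd^2) at x."""
    z = (x - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def gauss_inner(mu_a, sd_a, mu_b, sd_b):
    """L2 inner product of the N(mu_a, sd_a^2) and N(mu_b, sd_b^2) densities."""
    return normal_pdf(mu_a - mu_b, 0.0, math.hypot(sd_a, sd_b))

def kde_inner(ys, h, mu, sd):
    """L2 inner product of a Gaussian-kernel estimate (bandwidth h) with N(mu, sd^2)."""
    return sum(gauss_inner(y, h, mu, sd) for y in ys) / len(ys)

def l2_weights(ys, h, theta1, theta2):
    """Minimise ||f_hat - w*g1 - (1-w)*g2|| over w in [0, 1] (case k = 2)."""
    (mu1, sd1), (mu2, sd2) = theta1, theta2
    g11 = gauss_inner(mu1, sd1, mu1, sd1)
    g22 = gauss_inner(mu2, sd2, mu2, sd2)
    g12 = gauss_inner(mu1, sd1, mu2, sd2)
    denom = g11 - 2.0 * g12 + g22            # ||g1 - g2||^2
    if denom <= 1e-15:
        return 0.5, 0.5                      # g1 == g2: every w gives the same density
    # <f_hat - g2, g1 - g2>
    numer = kde_inner(ys, h, mu1, sd1) - kde_inner(ys, h, mu2, sd2) - g12 + g22
    w1 = min(1.0, max(0.0, numer / denom))   # 1-D convex quadratic: clipping is exact
    return w1, 1.0 - w1

# Hypothetical data: 0.3 * N(-2, 0.5^2) + 0.7 * N(2, 1^2).
rng = random.Random(0)
ys = [rng.gauss(-2.0, 0.5) if rng.random() < 0.3 else rng.gauss(2.0, 1.0)
      for _ in range(2000)]

w_true_theta = l2_weights(ys, 0.3, (-2.0, 0.5), (2.0, 1.0))
w_spike = l2_weights(ys, 0.3, (ys[0], 1e-6), (2.0, 1.0))
print(w_true_theta, w_spike)
```

At the data-generating component parameters the first weight should land near 0.3, with some bias from kernel smoothing. In the second call a near-degenerate component centred on an observation receives a weight close to zero, because $\|g_1\|^2$ grows like $1/\sigma_1$ while the numerator stays bounded. For general $k$ the same objective is a quadratic program over the simplex and would need a numerical solver.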

Step Two: Pseudo-likelihood Construction

Substitute the obtained weights into the likelihood function:

$$L_n(\theta) := \prod_{t=1}^n \left( \sum_{i=1}^k v_{n,i}(\theta) g(\theta_i, Y_t) \right)$$

Log pseudo-likelihood function:

$$\ell_n(\theta) := \frac{1}{n} \sum_{t=1}^n \ln\left( v_n(\theta) g(\theta, Y_t) \right)$$

where $v_n(\theta) g(\theta, \cdot)$ abbreviates the mixture density $\sum_{i=1}^k v_{n,i}(\theta) g(\theta_i, \cdot)$, so that $\ell_n(\theta) = \frac{1}{n} \ln L_n(\theta)$.

Step Three: Parameter Estimation

The pseudo-likelihood estimator is defined as any $\hat{\theta}_n$ that satisfies

$$\ell_n(\hat{\theta}_n) \geq \sup_{\theta \in \Theta_o} \ell_n(\theta) - \epsilon_n$$

where $\epsilon_n \searrow 0$.
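The three steps can be put together in a small end-to-end sketch. It is illustrative only and rests on several assumptions that are not from the paper: $k = 2$, a Gaussian-kernel $\hat{f}_n$ with fixed bandwidth `h = 0.3`, a closed simplex, simulated data, and a coarse grid search standing in for a proper optimizer. The candidate set deliberately contains a near-degenerate component centred on an observation, to show that the pseudo-likelihood stays finite there. All names are hypothetical.

```python
import itertools
import math
import random

LOG_SQRT_2PI = 0.5 * math.log(2.0 * math.pi)

def log_pdf(x, mu, sd):
    """Log-density of N(mu, sd^2) at x (stays finite for tiny sd)."""
    z = (x - mu) / sd
    return -0.5 * z * z - math.log(sd) - LOG_SQRT_2PI

def inner(mu_a, sd_a, mu_b, sd_b):
    """L2 inner product of two Gaussian densities."""
    return math.exp(log_pdf(mu_a - mu_b, 0.0, math.hypot(sd_a, sd_b)))

def weights(ys, h, th1, th2):
    """Step one for k = 2: clipped closed-form L2 projection."""
    g11, g22, g12 = inner(*th1, *th1), inner(*th2, *th2), inner(*th1, *th2)
    denom = g11 - 2.0 * g12 + g22
    if denom <= 1e-15:
        return 0.5, 0.5                      # identical components
    f1 = sum(inner(y, h, *th1) for y in ys) / len(ys)
    f2 = sum(inner(y, h, *th2) for y in ys) / len(ys)
    w1 = min(1.0, max(0.0, (f1 - f2 - g12 + g22) / denom))
    return w1, 1.0 - w1

def log_pseudo_likelihood(ys, h, theta):
    """Step two: (1/n) sum_t ln(sum_i v_i(theta) g(theta_i, Y_t)), via log-sum-exp."""
    ws = weights(ys, h, *theta)
    total = 0.0
    for y in ys:
        terms = [math.log(w) + log_pdf(y, mu, sd)
                 for w, (mu, sd) in zip(ws, theta) if w > 0.0]
        top = max(terms)
        total += top + math.log(sum(math.exp(t - top) for t in terms))
    return total / len(ys)

# Hypothetical data: 0.3 * N(-2, 0.5^2) + 0.7 * N(2, 1^2).
rng = random.Random(1)
ys = [rng.gauss(-2.0, 0.5) if rng.random() < 0.3 else rng.gauss(2.0, 1.0)
      for _ in range(400)]
h = 0.3

# Step three: maximise over a coarse grid plus one spike placed on an observation.
components = list(itertools.product([-2.0, 0.0, 2.0], [0.5, 1.0]))
components.append((ys[0], 1e-6))
candidates = list(itertools.product(components, repeat=2))
scores = {theta: log_pseudo_likelihood(ys, h, theta) for theta in candidates}
theta_hat = max(scores, key=scores.get)
print(theta_hat, weights(ys, h, *theta_hat), scores[theta_hat])
```

Because the weights are recomputed for every candidate $\theta$, the search runs over means and standard deviations only. The $\epsilon_n$ slack in the definition means an exact maximizer is not required, although nothing in this sketch guarantees that a coarse grid is accurate to within $\epsilon_n$.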

Technical Innovations

  1. Two-step estimation strategy:
    • Step one uses $L_2$ distance to estimate weights
    • Step two uses likelihood method to estimate component parameters
    • This combination ensures boundedness of the objective function
  2. Uniqueness of weights: Although the weights $v_n(\theta)$ may not be unique, the density $v_n(\theta) g(\theta, \cdot)$ is unique (Lemma 2.1).
  3. Treatment of parameter space: Handles parameter non-identifiability (e.g., permutation invariance) through the concept of equivalence classes.
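The permutation invariance mentioned in item 3 is easy to check directly. The sketch below is a hedged illustration: it shows that relabelling components leaves the mixture density unchanged, and it picks one representative per permutation class by sorting. The sorting rule is an illustrative device chosen here, not the paper's construction of equivalence classes, and the function names are hypothetical.

```python
import math

def normal_pdf(x, mu, sd):
    """Density of N(mu, sd^2) at x."""
    z = (x - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def mixture_density(x, weights, theta):
    """Mixture density sum_i w_i g(theta_i, x)."""
    return sum(w * normal_pdf(x, mu, sd) for w, (mu, sd) in zip(weights, theta))

def canonical(weights, theta):
    """Pick one representative of the permutation class by sorting on (mu, sd)."""
    pairs = sorted(zip(theta, weights))
    return [w for _, w in pairs], [t for t, _ in pairs]

# The same mixture written with the two component labels swapped.
weights_a, theta_a = [0.3, 0.7], [(-2.0, 0.5), (2.0, 1.0)]
weights_b, theta_b = [0.7, 0.3], [(2.0, 1.0), (-2.0, 0.5)]

for x in (-3.0, 0.0, 1.5):
    print(x, mixture_density(x, weights_a, theta_a),
          mixture_density(x, weights_b, theta_b))
print(canonical(weights_a, theta_a) == canonical(weights_b, theta_b))
```

Consistency statements such as $\hat{\theta}_n \to \theta^*$ are therefore only meaningful up to such relabelling.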

Theoretical Analysis

Main Theorem

Theorem 2.1 (Strong Consistency): Assume that $\hat{f}_n \xrightarrow{a.s.} f$ in the $L_2$ sense and that there exists $C < \infty$ such that $P(\|\hat{f}_n\|_\infty < C \text{ eventually}) = 1$. Then:

$$\hat{\theta}_n \xrightarrow{a.s.} \theta^*, \quad v_n(\hat{\theta}_n) \xrightarrow{a.s.} w^*, \quad v_n(\hat{\theta}_n) g(\hat{\theta}_n, \cdot) \xrightarrow{a.s.} f(\cdot)$$

Proof Strategy

1. Compactification of Parameter Space

Proposition 3.1: Proves the existence of constants $0 < u < U < \infty$ and $N < \infty$ such that, for sufficiently large $n$, at least one component $i(n)$ satisfies:

$$|\mu_{i(n)}^n| < N, \quad u \leq \sigma_{i(n)}^n \leq U$$

This ensures that $\hat{\theta}_n$ eventually belongs to a bounded parameter space $\Theta_o(u,U,N)$.

2. Generalization of Strong Law of Large Numbers

Lemma 4.1: Generalizes the strong law of large numbers to handle sequences of sample-dependent random functions $h_n$.

3. Uniform Convergence

Proposition 6.1: Establishes uniform convergence of the criterion function:

$$\sup_{\theta \in \Theta_o(u,U,N)} |\ell_n(\theta) - \ell(\theta)| \xrightarrow{a.s.} 0$$

4. Treatment of Limiting Cases

Proposition 5.1: Handles cases where parameters approach boundaries (zero variance, infinite variance, infinite mean).

Technical Challenges

  1. Unbounded parameters: Must handle cases where means tend to infinity and variances tend to zero or infinity.
  2. Randomness of weights: The weights $v_n(\theta)$ depend on the random $\hat{f}_n$, so the standard strong law of large numbers cannot be applied directly.
  3. Uniform convergence: Must establish uniform convergence over the entire parameter space, not just pointwise convergence.

Comparison with Existing Methods

  1. Variance-restricted MLE:
    • Chen (2017): Assumes all component variances are equal
    • Tanaka & Takemura (2006): Requires the standard deviation lower bound $\exp[-n^d]$
    • Tanaka (2009): Imposes penalties on variance ratios
  2. Distance-based estimation:
    • Existing approaches estimate the entire mixture model by distance minimization
    • This paper uses distance method only for weights and likelihood method for component parameters
  3. Doubly smoothed likelihood:
    • Seo & Lindsay (2010, 2013): Smooths both empirical measure and specified distribution
    • High computational complexity, requires Monte Carlo estimation

Advantages of This Paper

  1. Theoretical guarantees: Provides strong consistency proof
  2. Computational efficiency: Can be solved using standard optimization tools
  3. No parameter restrictions: Does not require variance constraints
  4. Preserves likelihood properties: Stays as close as possible to standard MLE properties

Extensibility Discussion

Beyond the i.i.d. Case

The paper discusses applicability of the method in more general settings:

  1. Hidden Markov models: When $X_1, X_2, \ldots$ is a stationary ergodic process with $Y_t \mid X_t = i \sim N(\theta_i)$
  2. General latent variable models: As long as ergodicity conditions are satisfied
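To make the hidden Markov setting in item 1 concrete, the following minimal sketch simulates a two-state Gaussian HMM. The transition matrix, emission parameters, and function name are hypothetical choices made for illustration. For this chain the stationary distribution is (2/3, 1/3), so under stationarity each $Y_t$ is marginally a Gaussian mixture with those weights, even though the observations are dependent.

```python
import random

def simulate_gaussian_hmm(n, transition, theta, seed=0):
    """Simulate hidden states X_t and observations Y_t | X_t = i ~ N(mu_i, sd_i^2)."""
    rng = random.Random(seed)
    k = len(theta)
    state = rng.randrange(k)      # arbitrary start, not drawn from the stationary law
    xs, ys = [], []
    for _ in range(n):
        # Move to the next state according to the current row of the matrix.
        state = rng.choices(range(k), weights=transition[state])[0]
        mu, sd = theta[state]
        xs.append(state)
        ys.append(rng.gauss(mu, sd))
    return xs, ys

# Hypothetical two-state chain; its stationary distribution is (2/3, 1/3).
transition = [[0.9, 0.1],
              [0.2, 0.8]]
theta = [(-2.0, 0.5), (2.0, 1.0)]

xs, ys = simulate_gaussian_hmm(20000, transition, theta, seed=0)
frac_state0 = xs.count(0) / len(xs)
print(f"fraction of time in state 0: {frac_state0:.3f}")
```

The empirical fraction of time spent in state 0 should be close to 2/3, which is the mixing weight the estimator would be targeting for the first emission component in this sketch.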

Practical Applications

  • Signal denoising (generalization of DUDE method)
  • Emission parameter estimation for hidden Markov models
  • General latent variable models

Conclusions and Discussion

Main Conclusions

  1. The pseudo-likelihood estimator converges strongly to true parameters under mild conditions
  2. The method avoids the unboundedness problem of traditional MLE
  3. No artificial parameter restrictions are needed

Limitations

  1. Kernel estimator requirements: Requires $\hat{f}_n \xrightarrow{a.s.} f$ and $\|\hat{f}_n\|_\infty$ bounded
  2. Bandwidth selection: The bandwidth of the kernel estimator must decrease sufficiently slowly
  3. Computational complexity: For general $k$, the weight optimization problem has no closed-form solution
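The first two limitations concern the density estimator itself. The following minimal sketch builds a Gaussian-kernel estimator on simulated data and inspects its supremum on a grid. The bandwidth rate $n^{-1/5}$, the simulated mixture, and the function name are assumptions made for illustration; whether that rate meets the paper's conditions is not asserted here.

```python
import math
import random

def gaussian_kde(ys, h):
    """Return the Gaussian-kernel density estimate f_hat_n as a function of x."""
    n = len(ys)
    norm = 1.0 / (n * h * math.sqrt(2.0 * math.pi))
    def f_hat(x):
        return norm * sum(math.exp(-0.5 * ((x - y) / h) ** 2) for y in ys)
    return f_hat

# Hypothetical data: 0.3 * N(-2, 0.5^2) + 0.7 * N(2, 1^2).
rng = random.Random(0)
ys = [rng.gauss(-2.0, 0.5) if rng.random() < 0.3 else rng.gauss(2.0, 1.0)
      for _ in range(500)]
h = len(ys) ** (-0.2)                           # assumed bandwidth rate n^(-1/5)
f_hat = gaussian_kde(ys, h)

grid = [-8.0 + 0.02 * i for i in range(801)]    # evaluation grid on [-8, 8]
values = [f_hat(x) for x in grid]
sup_on_grid = max(values)
mass_on_grid = 0.02 * sum(values)               # Riemann sum of the estimate
trivial_bound = 1.0 / (h * math.sqrt(2.0 * math.pi))
print(sup_on_grid, trivial_bound, mass_on_grid)
```

The deterministic bound $\|\hat{f}_n\|_\infty \le 1/(h\sqrt{2\pi})$ always holds for a Gaussian kernel, but it diverges as $h \to 0$, so on its own it does not give a constant $C$ that works for all large $n$. This is one way to see why the boundedness requirement interacts with how fast the bandwidth shrinks.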

Future Directions

  1. Establishment of asymptotic normality
  2. Generalization to multivariate cases
  3. Consistency under more general dependence structures
  4. Study of finite sample properties

In-Depth Evaluation

Strengths

  1. Theoretical rigor: Provides complete strong consistency proof, addressing various technical challenges
  2. Methodological innovation: Cleverly combines distance and likelihood methods to solve a classical problem
  3. Practical value: Method is computationally feasible without parameter constraints
  4. Clear presentation: Well-structured paper with clear proof strategy

Weaknesses

  1. Strong assumptions: Requires strong convergence conditions for kernel estimators
  2. Computational efficiency: Weight optimization problem may be computationally complex
  3. Finite sample properties: Lacks analysis of finite sample behavior
  4. Experimental validation: Paper is primarily theoretical, lacking numerical experiments

Impact

  1. Academic contribution: Provides new theoretical framework for Gaussian mixture model estimation
  2. Practical value: Solves important problems in practical applications
  3. Methodological significance: Demonstrates effectiveness of combining different criterion functions

Applicable Scenarios

  • Gaussian mixture model parameter estimation, especially with many components
  • Application scenarios requiring avoidance of parameter constraints
  • Emission parameter estimation for hidden Markov models
  • Density estimation in signal processing and pattern recognition

References

The paper cites 21 important references covering:

  • Classical mixture model theory (Teicher, 1963)
  • MLE consistency theory (Chen, 2017; van der Vaart, 2000)
  • Kernel density estimation theory (Silverman, 1978)
  • Distance-based estimation methods (Cutler & Cordero-Brana, 1996)
  • Related pseudo-likelihood methods (Kangro et al., 2025)

These references provide a solid foundation for the theoretical development of this paper.