2025-11-22T01:28:15.129039

EM Approaches to Nonparametric Estimation for Mixture of Linear Regressions

Welbaum, Qiao
In a mixture of linear regression model, the regression coefficients are treated as random vectors that may follow either a continuous or discrete distribution. We propose two Expectation-Maximization (EM) algorithms to estimate this prior distribution. The first algorithm solves a kernelized version of the nonparametric maximum likelihood estimation (NPMLE). This method not only recovers continuous prior distributions but also accurately estimates the number of clusters when the prior is discrete. The second algorithm, designed to approximate the NPMLE, targets prior distributions with a density. It also performs well for discrete priors when combined with a post-processing step. We study the convergence properties of both algorithms and demonstrate their effectiveness through simulations and applications to real datasets.
academic

EM Approaches to Nonparametric Estimation for Mixture of Linear Regressions

Basic Information

  • Paper ID: 2510.14890
  • Title: EM Approaches to Nonparametric Estimation for Mixture of Linear Regressions
  • Authors: Andrew Welbaum, Wanli Qiao (George Mason University)
  • Classification: stat.ME stat.ML
  • Publication Date: October 17, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.14890

Abstract

In mixture linear regression models, regression coefficients are treated as random vectors that may follow continuous or discrete distributions. This paper proposes two Expectation-Maximization (EM) algorithms to estimate such prior distributions. The first algorithm solves a kernelized version of the nonparametric maximum likelihood estimation (NPMLE), which can recover continuous prior distributions and accurately estimate the number of clusters when the prior is discrete. The second algorithm aims to approximate the NPMLE for prior distributions with density. Combined with post-processing steps, it also performs well on discrete priors. The convergence properties of both algorithms are studied, and their effectiveness is demonstrated through simulations and real data applications.

Research Background and Motivation

Problem Definition

Mixture linear regression models extend multivariate linear regression by allowing coefficient vectors to have continuous or discrete prior distributions. This model has broad applications when response variables and covariates may have personalized or clustered linear relationships, including market segmentation, medical research, educational research, and various industrial and economic studies.

Model Setup

Consider n independent observations (x1,y1),,(xn,yn)Rd×R(x_1, y_1), \ldots, (x_n, y_n) \in \mathbb{R}^d \times \mathbb{R} generated by the following model: yi=xiTβi+σziy_i = x_i^T \beta_i + \sigma z_i where β1,,βniidG\beta_1, \ldots, \beta_n \stackrel{iid}{\sim} G^*, z1,,zniidN(0,1)z_1, \ldots, z_n \stackrel{iid}{\sim} N(0,1), σ>0\sigma > 0 is known, and GG^* is an unknown probability distribution on Rd\mathbb{R}^d.

Research Motivation

  1. Limitations of Existing Methods: Traditional EM algorithms require prior knowledge of the number of components K, while NPMLE-based methods (e.g., Jiang and Guntuboyina 2025), though theoretically consistent, often fail to accurately detect the true number of components in practice.
  2. Practical Requirements: There is a need for methods that can handle both continuous distributions and automatically detect the number of components in discrete distributions.
  3. Clustering Applications: When GG^* is discrete, clustering observations based on estimated results is necessary.

Core Contributions

  1. Proposes EM-NPMLE Algorithm: Converges to NPMLE for prior distributions with density.
  2. Proposes EM-NPKMLE Algorithm: Through constrained optimization with kernel density estimation, automatically detects the number of components in discrete distributions.
  3. Theoretical Guarantees: Establishes convergence properties for both algorithms.
  4. Post-processing Strategy: Introduces mean shift and SCMS post-processing methods for handling special structures.
  5. Practical Validation: Verifies method effectiveness on simulations and real data.

Methodology Details

Task Definition

Given observed data {(xi,yi)}i=1n\{(x_i, y_i)\}_{i=1}^n, the objective is to estimate the unknown prior distribution GG^*, and subsequently:

  1. Perform nonparametric estimation for continuous distributions
  2. Automatically determine the number of components and estimate parameters for discrete distributions
  3. Perform clustering based on estimated results

EM-NPMLE Algorithm (Method 1)

Applicable Scenario: GG^* has a density function gg^*

Algorithm Flow:

  1. E-step: Compute posterior density fi(t+1)(β)=ϕσ(yixiTβ)g(t)(β)Rdϕσ(yixiTβ)g(t)(β)dβf_i^{(t+1)}(\beta) = \frac{\phi_\sigma(y_i - x_i^T\beta)g^{(t)}(\beta)}{\int_{\mathbb{R}^d} \phi_\sigma(y_i - x_i^T\beta)g^{(t)}(\beta)d\beta}
  2. M-step: Update density estimate g(t+1)=1ni=1nfi(t+1)g^{(t+1)} = \frac{1}{n}\sum_{i=1}^n f_i^{(t+1)}

Theoretical Properties:

  • Theorem 2.1: Under appropriate conditions, G(t)G^{(t)} weakly converges to the unique NPMLE G^\hat{G}.

EM-NPKMLE Algorithm (Method 2)

Core Idea: Restrict optimization to the kernel density estimation set Gkde\mathcal{G}_{kde}: Gkde={1nhd=1nv(β~2h2):β~1,,β~nRd}\mathcal{G}_{kde} = \left\{\frac{1}{nh^d}\sum_{\ell=1}^n v\left(\frac{\|\cdot - \tilde{\beta}_\ell\|^2}{h^2}\right) : \tilde{\beta}_1, \ldots, \tilde{\beta}_n \in \mathbb{R}^d\right\}

Algorithm Structure: Nested-loop EM algorithm

  • Outer loop: EM iterations update the distribution
  • Inner loop: Gradient ascent optimizes kernel density estimation parameters

Key Update Formula: ν(r+1)=ξ(ν(r);β(t),x,y)=A(ν(r);β(t),x,y)C(ν(r),β(t),x,y)\nu_\ell^{(r+1)} = \xi(\nu_\ell^{(r)}; \beta^{(t)}, x, y) = \frac{A(\nu_\ell^{(r)}; \beta^{(t)}, x, y)}{C(\nu_\ell^{(r)}, \beta^{(t)}, x, y)}

where AA and CC are determined by gradient calculations.

Technical Innovations

  1. Adaptive Step Size: Gradient ascent uses adaptive step size 1/C(ν(r),β(t),x,y)1/C(\nu_\ell^{(r)}, \beta^{(t)}, x, y), eliminating manual tuning.
  2. Bandwidth Selection: Bandwidth selection strategy based on maximum smoothness principle, avoiding spurious modes.
  3. Post-processing Flexibility: Tailored post-processing methods for different prior structures.

Experimental Setup

Simulation Data

Simulation 1: Three-component discrete distribution

  • Components: y=3xy = 3-x, y=1+1.5xy = 1+1.5x, y=1+0.5xy = -1+0.5x
  • Weights: (0.3, 0.3, 0.4)
  • Noise: σ=0.5\sigma = 0.5
  • Sample sizes: 500 to 10,000

Simulation 2: Continuous distribution

  • Uniform distribution on two concentric circles: 12×Uniform{B(1)}+12×Uniform{B(2)}\frac{1}{2} \times \text{Uniform}\{B(1)\} + \frac{1}{2} \times \text{Uniform}\{B(2)\}

Evaluation Metrics

  1. Adjusted Rand Index (ARI): Clustering quality
  2. Component Detection Accuracy: Proportion of correctly identifying true number of components
  3. Wasserstein-2 Distance: Distribution estimation quality
  4. Bias and Standard Deviation: Parameter estimation precision

Comparison Methods

  1. CGM Method: Conditional gradient method from Jiang and Guntuboyina (2025)
  2. EM-NPMLE + Mean Shift: Post-processed version
  3. Oracle Method: Theoretical upper bound with known true distribution

Implementation Details

  • Kernel function: Gaussian kernel
  • Bandwidth: Selected based on maximum smoothness principle
  • Initialization: Uniform distribution or EM-NPMLE output
  • Convergence criterion: L2L_2 distance below preset threshold

Experimental Results

Main Results

Simulation 1 Results (sample size 10,000):

  • EM-NPKMLE: ARI=0.651, component detection rate=99.5%, W-2 distance=0.288
  • EM-NPMLE+Mean Shift: ARI=0.662, component detection rate=100%, W-2 distance=0.265
  • CGM: ARI=0.596, component detection rate=0%, average components=7.57

Key Findings:

  1. Both EM-NPKMLE and EM-NPMLE+Mean Shift consistently estimate the true number of components
  2. CGM method systematically overestimates the number of components
  3. As sample size increases, all estimates converge to true values

Parameter Estimation Accuracy

Coefficient estimates for three components (n=10,000):

  • Component 1: True value (3,-1), estimate (-0.112, 0.004)±(0.011, 0.010)
  • Component 2: True value (1,1.5), estimate (-0.115, 0.013)±(0.018, 0.012)
  • Component 3: True value (-1,0.5), estimate (0.113, 0.027)±(0.013, 0.010)

Computational Efficiency Comparison

GEM-NPKMLE (single inner loop) versus full EM-NPKMLE:

  • Time: 15.4 minutes vs 115.9 minutes (n=5000)
  • Performance: Essentially equivalent (large sample regime)

Real Data Applications

CO2-GDP Data:

  • Detected 2 main components with weights 0.484 and 0.358
  • Coefficients: (0.022, 0.179) and (-0.070, 0.343)
  • Consistent with main components from CGM method

Music Tone Perception Data:

  • Detected 2 components, consistent with music theory expectations
  • Components correspond to theoretical predictions of y=xy=x and y=2y=2

NPMLE Research

  • Classical Work: Kiefer and Wolfowitz (1956) first described NPMLE for mixture models
  • Recent Progress: Jiang and Zhang (2009), Koenker and Mizera (2014), Jiang and Guntuboyina (2025), etc.

EM Algorithm Development

  • Modern EM: Formalized by Dempster et al. (1977)
  • Mixture Regression: Extended to clustered linear regression by DeSarbo and Cron (1988)
  • Component Number Estimation: Traditional methods based on AIC, BIC and other information criteria

Advantages of This Work

  1. No Need to Preset Component Number: Compared to traditional EM algorithms
  2. Accurate Component Detection: Compared to existing NPMLE methods
  3. Unified Framework: Handles both continuous and discrete distributions simultaneously

Conclusions and Discussion

Main Conclusions

  1. EM-NPKMLE Algorithm can automatically detect the true number of components in discrete distributions, avoiding overestimation problems of traditional methods.
  2. Convergence Guarantees: Both algorithms have theoretical convergence guarantees.
  3. Strong Practicality: Demonstrates good performance on both simulations and real data.
  4. Computational Efficiency: The GEM variant provides a good balance between efficiency and accuracy.

Limitations

  1. Bandwidth Selection: Requires appropriate bandwidth selection strategy; current methods may not be optimal.
  2. Local Optimality: Gradient ascent may get trapped in local optima.
  3. High-Dimensional Challenges: Performance in high-dimensional settings requires further investigation.
  4. Distribution Judgment: In practice, it is difficult to pre-determine whether the distribution is continuous or discrete.

Future Directions

  1. Adaptive Bandwidth: Develop adaptive bandwidth strategies for different iterations or dimensions.
  2. Theoretical Analysis: Deepen investigation of theoretical properties of EM-NPKMLE.
  3. Extended Applications: Generalize to general mixture distribution models.
  4. Computational Optimization: Further improve computational efficiency of the algorithm.

In-Depth Evaluation

Strengths

  1. Strong Methodological Innovation: The kernel density estimation-constrained NPMLE is a novel approach.
  2. High Practical Value: Solves the practical problem of automatic component number detection.
  3. Solid Theoretical Foundation: Provides convergence proofs.
  4. Comprehensive Experiments: Includes both simulation and real data validation.
  5. Clear Writing: Detailed algorithm description and rigorous mathematical derivations.

Weaknesses

  1. Bandwidth Dependence: Algorithm performance is relatively sensitive to bandwidth selection.
  2. Computational Complexity: Nested-loop structure has relatively high computational cost.
  3. High-Dimensional Scalability: Lacks systematic study in high-dimensional settings.
  4. Limited Comparisons: Primarily compares with CGM method; lacks more baselines.

Impact

  1. Theoretical Contribution: Provides new perspectives for nonparametric estimation in mixture regression.
  2. Practical Value: Has direct applications in clustering and distribution estimation.
  3. Reproducibility: Detailed algorithm description facilitates reproduction.
  4. Extensibility: Framework can be extended to other mixture models.

Applicable Scenarios

  1. Market Segmentation: Analysis of behavior patterns across different consumer groups.
  2. Medical Research: Analysis of treatment response in patient subgroups.
  3. Economic Research: Analysis of economic growth patterns along different development paths.
  4. Machine Learning: Clustered regression and semi-supervised learning.

References

  1. Jiang, H. and Guntuboyina, A. (2025). A nonparametric maximum likelihood approach to mixture of regression.
  2. Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm.
  3. Kiefer, J. and Wolfowitz, J. (1956). Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters.
  4. Leisch, F. (2004). FlexMix: A general framework for finite mixture models and latent class regression in R.

Overall Assessment: This is a high-quality statistical methodology paper that proposes innovative EM algorithms to address important problems in mixture linear regression. The methods have solid theoretical foundations and demonstrate good practical performance, providing valuable tools for related fields. Despite some limitations, its contributions are significant with good academic and applied value.