2025-11-22T01:28:15.129039

EM Approaches to Nonparametric Estimation for Mixture of Linear Regressions

Welbaum, Qiao

In a mixture of linear regression model, the regression coefficients are treated as random vectors that may follow either a continuous or discrete distribution. We propose two Expectation-Maximization (EM) algorithms to estimate this prior distribution. The first algorithm solves a kernelized version of the nonparametric maximum likelihood estimation (NPMLE). This method not only recovers continuous prior distributions but also accurately estimates the number of clusters when the prior is discrete. The second algorithm, designed to approximate the NPMLE, targets prior distributions with a density. It also performs well for discrete priors when combined with a post-processing step. We study the convergence properties of both algorithms and demonstrate their effectiveness through simulations and applications to real datasets.

academic

EM Approaches to Nonparametric Estimation for Mixture of Linear Regressions

Basic Information

Paper ID: 2510.14890
Title: EM Approaches to Nonparametric Estimation for Mixture of Linear Regressions
Authors: Andrew Welbaum, Wanli Qiao (George Mason University)
Classification: stat.ME stat.ML
Publication Date: October 17, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.14890

Abstract

In mixture linear regression models, regression coefficients are treated as random vectors that may follow continuous or discrete distributions. This paper proposes two Expectation-Maximization (EM) algorithms to estimate such prior distributions. The first algorithm solves a kernelized version of the nonparametric maximum likelihood estimation (NPMLE), which can recover continuous prior distributions and accurately estimate the number of clusters when the prior is discrete. The second algorithm aims to approximate the NPMLE for prior distributions with density. Combined with post-processing steps, it also performs well on discrete priors. The convergence properties of both algorithms are studied, and their effectiveness is demonstrated through simulations and real data applications.

Research Background and Motivation

Problem Definition

Mixture linear regression models extend multivariate linear regression by allowing coefficient vectors to have continuous or discrete prior distributions. This model has broad applications when response variables and covariates may have personalized or clustered linear relationships, including market segmentation, medical research, educational research, and various industrial and economic studies.

Model Setup

Consider n independent observations $(x_1, y_1), \ldots, (x_n, y_n) \in \mathbb{R}^d \times \mathbb{R}$ generated by the following model: $y_i = x_i^T \beta_i + \sigma z_i$ where $\beta_1, \ldots, \beta_n \stackrel{iid}{\sim} G^*$ , $z_1, \ldots, z_n \stackrel{iid}{\sim} N(0,1)$ , $\sigma > 0$ is known, and $G^*$ is an unknown probability distribution on $\mathbb{R}^d$ .

Research Motivation

Limitations of Existing Methods: Traditional EM algorithms require prior knowledge of the number of components K, while NPMLE-based methods (e.g., Jiang and Guntuboyina 2025), though theoretically consistent, often fail to accurately detect the true number of components in practice.
Practical Requirements: There is a need for methods that can handle both continuous distributions and automatically detect the number of components in discrete distributions.
Clustering Applications: When $G^*$ is discrete, clustering observations based on estimated results is necessary.

Core Contributions

Proposes EM-NPMLE Algorithm: Converges to NPMLE for prior distributions with density.
Proposes EM-NPKMLE Algorithm: Through constrained optimization with kernel density estimation, automatically detects the number of components in discrete distributions.
Theoretical Guarantees: Establishes convergence properties for both algorithms.
Post-processing Strategy: Introduces mean shift and SCMS post-processing methods for handling special structures.
Practical Validation: Verifies method effectiveness on simulations and real data.

Methodology Details

Task Definition

Given observed data $\{(x_i, y_i)\}_{i=1}^n$ , the objective is to estimate the unknown prior distribution $G^*$ , and subsequently:

Perform nonparametric estimation for continuous distributions
Automatically determine the number of components and estimate parameters for discrete distributions
Perform clustering based on estimated results

EM-NPMLE Algorithm (Method 1)

Applicable Scenario: $G^*$ has a density function $g^*$

Algorithm Flow:

E-step: Compute posterior density $f_i^{(t+1)}(\beta) = \frac{\phi_\sigma(y_i - x_i^T\beta)g^{(t)}(\beta)}{\int_{\mathbb{R}^d} \phi_\sigma(y_i - x_i^T\beta)g^{(t)}(\beta)d\beta}$
M-step: Update density estimate $g^{(t+1)} = \frac{1}{n}\sum_{i=1}^n f_i^{(t+1)}$

Theoretical Properties:

Theorem 2.1: Under appropriate conditions, $G^{(t)}$ weakly converges to the unique NPMLE $\hat{G}$ .

EM-NPKMLE Algorithm (Method 2)

Core Idea: Restrict optimization to the kernel density estimation set $\mathcal{G}_{kde}$ : $\mathcal{G}_{kde} = \left\{\frac{1}{nh^d}\sum_{\ell=1}^n v\left(\frac{\|\cdot - \tilde{\beta}_\ell\|^2}{h^2}\right) : \tilde{\beta}_1, \ldots, \tilde{\beta}_n \in \mathbb{R}^d\right\}$

Algorithm Structure: Nested-loop EM algorithm

Outer loop: EM iterations update the distribution
Inner loop: Gradient ascent optimizes kernel density estimation parameters

Key Update Formula: $\nu_\ell^{(r+1)} = \xi(\nu_\ell^{(r)}; \beta^{(t)}, x, y) = \frac{A(\nu_\ell^{(r)}; \beta^{(t)}, x, y)}{C(\nu_\ell^{(r)}, \beta^{(t)}, x, y)}$

where $A$ and $C$ are determined by gradient calculations.

Technical Innovations

Adaptive Step Size: Gradient ascent uses adaptive step size $1/C(\nu_\ell^{(r)}, \beta^{(t)}, x, y)$ , eliminating manual tuning.
Bandwidth Selection: Bandwidth selection strategy based on maximum smoothness principle, avoiding spurious modes.
Post-processing Flexibility: Tailored post-processing methods for different prior structures.

Experimental Setup

Simulation Data

Simulation 1: Three-component discrete distribution

Components: $y = 3-x$ , $y = 1+1.5x$ , $y = -1+0.5x$
Weights: (0.3, 0.3, 0.4)
Noise: $\sigma = 0.5$
Sample sizes: 500 to 10,000

Simulation 2: Continuous distribution

Uniform distribution on two concentric circles: $\frac{1}{2} \times \text{Uniform}\{B(1)\} + \frac{1}{2} \times \text{Uniform}\{B(2)\}$

Evaluation Metrics

Adjusted Rand Index (ARI): Clustering quality
Component Detection Accuracy: Proportion of correctly identifying true number of components
Wasserstein-2 Distance: Distribution estimation quality
Bias and Standard Deviation: Parameter estimation precision

Comparison Methods

CGM Method: Conditional gradient method from Jiang and Guntuboyina (2025)
EM-NPMLE + Mean Shift: Post-processed version
Oracle Method: Theoretical upper bound with known true distribution

Implementation Details

Kernel function: Gaussian kernel
Bandwidth: Selected based on maximum smoothness principle
Initialization: Uniform distribution or EM-NPMLE output
Convergence criterion: $L_2$ distance below preset threshold

Experimental Results

Main Results

Simulation 1 Results (sample size 10,000):

EM-NPKMLE: ARI=0.651, component detection rate=99.5%, W-2 distance=0.288
EM-NPMLE+Mean Shift: ARI=0.662, component detection rate=100%, W-2 distance=0.265
CGM: ARI=0.596, component detection rate=0%, average components=7.57

Key Findings:

Both EM-NPKMLE and EM-NPMLE+Mean Shift consistently estimate the true number of components
CGM method systematically overestimates the number of components
As sample size increases, all estimates converge to true values

Parameter Estimation Accuracy

Coefficient estimates for three components (n=10,000):

Component 1: True value (3,-1), estimate (-0.112, 0.004)±(0.011, 0.010)
Component 2: True value (1,1.5), estimate (-0.115, 0.013)±(0.018, 0.012)
Component 3: True value (-1,0.5), estimate (0.113, 0.027)±(0.013, 0.010)

Computational Efficiency Comparison

GEM-NPKMLE (single inner loop) versus full EM-NPKMLE:

Time: 15.4 minutes vs 115.9 minutes (n=5000)
Performance: Essentially equivalent (large sample regime)

Real Data Applications

CO2-GDP Data:

Detected 2 main components with weights 0.484 and 0.358
Coefficients: (0.022, 0.179) and (-0.070, 0.343)
Consistent with main components from CGM method

Music Tone Perception Data:

Detected 2 components, consistent with music theory expectations
Components correspond to theoretical predictions of $y=x$ and $y=2$

NPMLE Research

Classical Work: Kiefer and Wolfowitz (1956) first described NPMLE for mixture models
Recent Progress: Jiang and Zhang (2009), Koenker and Mizera (2014), Jiang and Guntuboyina (2025), etc.

EM Algorithm Development

Modern EM: Formalized by Dempster et al. (1977)
Mixture Regression: Extended to clustered linear regression by DeSarbo and Cron (1988)
Component Number Estimation: Traditional methods based on AIC, BIC and other information criteria

Advantages of This Work

No Need to Preset Component Number: Compared to traditional EM algorithms
Accurate Component Detection: Compared to existing NPMLE methods
Unified Framework: Handles both continuous and discrete distributions simultaneously

Conclusions and Discussion

Main Conclusions

EM-NPKMLE Algorithm can automatically detect the true number of components in discrete distributions, avoiding overestimation problems of traditional methods.
Convergence Guarantees: Both algorithms have theoretical convergence guarantees.
Strong Practicality: Demonstrates good performance on both simulations and real data.
Computational Efficiency: The GEM variant provides a good balance between efficiency and accuracy.

Limitations

Bandwidth Selection: Requires appropriate bandwidth selection strategy; current methods may not be optimal.
Local Optimality: Gradient ascent may get trapped in local optima.
High-Dimensional Challenges: Performance in high-dimensional settings requires further investigation.
Distribution Judgment: In practice, it is difficult to pre-determine whether the distribution is continuous or discrete.

Future Directions

Adaptive Bandwidth: Develop adaptive bandwidth strategies for different iterations or dimensions.
Theoretical Analysis: Deepen investigation of theoretical properties of EM-NPKMLE.
Extended Applications: Generalize to general mixture distribution models.
Computational Optimization: Further improve computational efficiency of the algorithm.

In-Depth Evaluation

Strengths

Strong Methodological Innovation: The kernel density estimation-constrained NPMLE is a novel approach.
High Practical Value: Solves the practical problem of automatic component number detection.
Solid Theoretical Foundation: Provides convergence proofs.
Comprehensive Experiments: Includes both simulation and real data validation.
Clear Writing: Detailed algorithm description and rigorous mathematical derivations.

Weaknesses

Bandwidth Dependence: Algorithm performance is relatively sensitive to bandwidth selection.
Computational Complexity: Nested-loop structure has relatively high computational cost.
High-Dimensional Scalability: Lacks systematic study in high-dimensional settings.
Limited Comparisons: Primarily compares with CGM method; lacks more baselines.

Impact

Theoretical Contribution: Provides new perspectives for nonparametric estimation in mixture regression.
Practical Value: Has direct applications in clustering and distribution estimation.
Reproducibility: Detailed algorithm description facilitates reproduction.
Extensibility: Framework can be extended to other mixture models.

Applicable Scenarios

Market Segmentation: Analysis of behavior patterns across different consumer groups.
Medical Research: Analysis of treatment response in patient subgroups.
Economic Research: Analysis of economic growth patterns along different development paths.
Machine Learning: Clustered regression and semi-supervised learning.

References

Jiang, H. and Guntuboyina, A. (2025). A nonparametric maximum likelihood approach to mixture of regression.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm.
Kiefer, J. and Wolfowitz, J. (1956). Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters.
Leisch, F. (2004). FlexMix: A general framework for finite mixture models and latent class regression in R.

Overall Assessment: This is a high-quality statistical methodology paper that proposes innovative EM algorithms to address important problems in mixture linear regression. The methods have solid theoretical foundations and demonstrate good practical performance, providing valuable tools for related fields. Despite some limitations, its contributions are significant with good academic and applied value.