In a mixture of linear regression model, the regression coefficients are treated as random vectors that may follow either a continuous or discrete distribution. We propose two Expectation-Maximization (EM) algorithms to estimate this prior distribution. The first algorithm solves a kernelized version of the nonparametric maximum likelihood estimation (NPMLE). This method not only recovers continuous prior distributions but also accurately estimates the number of clusters when the prior is discrete. The second algorithm, designed to approximate the NPMLE, targets prior distributions with a density. It also performs well for discrete priors when combined with a post-processing step. We study the convergence properties of both algorithms and demonstrate their effectiveness through simulations and applications to real datasets.
- Paper ID: 2510.14890
- Title: EM Approaches to Nonparametric Estimation for Mixture of Linear Regressions
- Authors: Andrew Welbaum, Wanli Qiao (George Mason University)
- Classification: stat.ME stat.ML
- Publication Date: October 17, 2025 (arXiv preprint)
- Paper Link: https://arxiv.org/abs/2510.14890
In mixture linear regression models, regression coefficients are treated as random vectors that may follow continuous or discrete distributions. This paper proposes two Expectation-Maximization (EM) algorithms to estimate such prior distributions. The first algorithm solves a kernelized version of the nonparametric maximum likelihood estimation (NPMLE), which can recover continuous prior distributions and accurately estimate the number of clusters when the prior is discrete. The second algorithm aims to approximate the NPMLE for prior distributions with density. Combined with post-processing steps, it also performs well on discrete priors. The convergence properties of both algorithms are studied, and their effectiveness is demonstrated through simulations and real data applications.
Mixture linear regression models extend multivariate linear regression by allowing coefficient vectors to have continuous or discrete prior distributions. This model has broad applications when response variables and covariates may have personalized or clustered linear relationships, including market segmentation, medical research, educational research, and various industrial and economic studies.
Consider n independent observations (x1,y1),…,(xn,yn)∈Rd×R generated by the following model:
yi=xiTβi+σzi
where β1,…,βn∼iidG∗, z1,…,zn∼iidN(0,1), σ>0 is known, and G∗ is an unknown probability distribution on Rd.
- Limitations of Existing Methods: Traditional EM algorithms require prior knowledge of the number of components K, while NPMLE-based methods (e.g., Jiang and Guntuboyina 2025), though theoretically consistent, often fail to accurately detect the true number of components in practice.
- Practical Requirements: There is a need for methods that can handle both continuous distributions and automatically detect the number of components in discrete distributions.
- Clustering Applications: When G∗ is discrete, clustering observations based on estimated results is necessary.
- Proposes EM-NPMLE Algorithm: Converges to NPMLE for prior distributions with density.
- Proposes EM-NPKMLE Algorithm: Through constrained optimization with kernel density estimation, automatically detects the number of components in discrete distributions.
- Theoretical Guarantees: Establishes convergence properties for both algorithms.
- Post-processing Strategy: Introduces mean shift and SCMS post-processing methods for handling special structures.
- Practical Validation: Verifies method effectiveness on simulations and real data.
Given observed data {(xi,yi)}i=1n, the objective is to estimate the unknown prior distribution G∗, and subsequently:
- Perform nonparametric estimation for continuous distributions
- Automatically determine the number of components and estimate parameters for discrete distributions
- Perform clustering based on estimated results
Applicable Scenario: G∗ has a density function g∗
Algorithm Flow:
- E-step: Compute posterior density
fi(t+1)(β)=∫Rdϕσ(yi−xiTβ)g(t)(β)dβϕσ(yi−xiTβ)g(t)(β)
- M-step: Update density estimate
g(t+1)=n1∑i=1nfi(t+1)
Theoretical Properties:
- Theorem 2.1: Under appropriate conditions, G(t) weakly converges to the unique NPMLE G^.
Core Idea: Restrict optimization to the kernel density estimation set Gkde:
Gkde={nhd1∑ℓ=1nv(h2∥⋅−β~ℓ∥2):β~1,…,β~n∈Rd}
Algorithm Structure: Nested-loop EM algorithm
- Outer loop: EM iterations update the distribution
- Inner loop: Gradient ascent optimizes kernel density estimation parameters
Key Update Formula:
νℓ(r+1)=ξ(νℓ(r);β(t),x,y)=C(νℓ(r),β(t),x,y)A(νℓ(r);β(t),x,y)
where A and C are determined by gradient calculations.
- Adaptive Step Size: Gradient ascent uses adaptive step size 1/C(νℓ(r),β(t),x,y), eliminating manual tuning.
- Bandwidth Selection: Bandwidth selection strategy based on maximum smoothness principle, avoiding spurious modes.
- Post-processing Flexibility: Tailored post-processing methods for different prior structures.
Simulation 1: Three-component discrete distribution
- Components: y=3−x, y=1+1.5x, y=−1+0.5x
- Weights: (0.3, 0.3, 0.4)
- Noise: σ=0.5
- Sample sizes: 500 to 10,000
Simulation 2: Continuous distribution
- Uniform distribution on two concentric circles: 21×Uniform{B(1)}+21×Uniform{B(2)}
- Adjusted Rand Index (ARI): Clustering quality
- Component Detection Accuracy: Proportion of correctly identifying true number of components
- Wasserstein-2 Distance: Distribution estimation quality
- Bias and Standard Deviation: Parameter estimation precision
- CGM Method: Conditional gradient method from Jiang and Guntuboyina (2025)
- EM-NPMLE + Mean Shift: Post-processed version
- Oracle Method: Theoretical upper bound with known true distribution
- Kernel function: Gaussian kernel
- Bandwidth: Selected based on maximum smoothness principle
- Initialization: Uniform distribution or EM-NPMLE output
- Convergence criterion: L2 distance below preset threshold
Simulation 1 Results (sample size 10,000):
- EM-NPKMLE: ARI=0.651, component detection rate=99.5%, W-2 distance=0.288
- EM-NPMLE+Mean Shift: ARI=0.662, component detection rate=100%, W-2 distance=0.265
- CGM: ARI=0.596, component detection rate=0%, average components=7.57
Key Findings:
- Both EM-NPKMLE and EM-NPMLE+Mean Shift consistently estimate the true number of components
- CGM method systematically overestimates the number of components
- As sample size increases, all estimates converge to true values
Coefficient estimates for three components (n=10,000):
- Component 1: True value (3,-1), estimate (-0.112, 0.004)±(0.011, 0.010)
- Component 2: True value (1,1.5), estimate (-0.115, 0.013)±(0.018, 0.012)
- Component 3: True value (-1,0.5), estimate (0.113, 0.027)±(0.013, 0.010)
GEM-NPKMLE (single inner loop) versus full EM-NPKMLE:
- Time: 15.4 minutes vs 115.9 minutes (n=5000)
- Performance: Essentially equivalent (large sample regime)
CO2-GDP Data:
- Detected 2 main components with weights 0.484 and 0.358
- Coefficients: (0.022, 0.179) and (-0.070, 0.343)
- Consistent with main components from CGM method
Music Tone Perception Data:
- Detected 2 components, consistent with music theory expectations
- Components correspond to theoretical predictions of y=x and y=2
- Classical Work: Kiefer and Wolfowitz (1956) first described NPMLE for mixture models
- Recent Progress: Jiang and Zhang (2009), Koenker and Mizera (2014), Jiang and Guntuboyina (2025), etc.
- Modern EM: Formalized by Dempster et al. (1977)
- Mixture Regression: Extended to clustered linear regression by DeSarbo and Cron (1988)
- Component Number Estimation: Traditional methods based on AIC, BIC and other information criteria
- No Need to Preset Component Number: Compared to traditional EM algorithms
- Accurate Component Detection: Compared to existing NPMLE methods
- Unified Framework: Handles both continuous and discrete distributions simultaneously
- EM-NPKMLE Algorithm can automatically detect the true number of components in discrete distributions, avoiding overestimation problems of traditional methods.
- Convergence Guarantees: Both algorithms have theoretical convergence guarantees.
- Strong Practicality: Demonstrates good performance on both simulations and real data.
- Computational Efficiency: The GEM variant provides a good balance between efficiency and accuracy.
- Bandwidth Selection: Requires appropriate bandwidth selection strategy; current methods may not be optimal.
- Local Optimality: Gradient ascent may get trapped in local optima.
- High-Dimensional Challenges: Performance in high-dimensional settings requires further investigation.
- Distribution Judgment: In practice, it is difficult to pre-determine whether the distribution is continuous or discrete.
- Adaptive Bandwidth: Develop adaptive bandwidth strategies for different iterations or dimensions.
- Theoretical Analysis: Deepen investigation of theoretical properties of EM-NPKMLE.
- Extended Applications: Generalize to general mixture distribution models.
- Computational Optimization: Further improve computational efficiency of the algorithm.
- Strong Methodological Innovation: The kernel density estimation-constrained NPMLE is a novel approach.
- High Practical Value: Solves the practical problem of automatic component number detection.
- Solid Theoretical Foundation: Provides convergence proofs.
- Comprehensive Experiments: Includes both simulation and real data validation.
- Clear Writing: Detailed algorithm description and rigorous mathematical derivations.
- Bandwidth Dependence: Algorithm performance is relatively sensitive to bandwidth selection.
- Computational Complexity: Nested-loop structure has relatively high computational cost.
- High-Dimensional Scalability: Lacks systematic study in high-dimensional settings.
- Limited Comparisons: Primarily compares with CGM method; lacks more baselines.
- Theoretical Contribution: Provides new perspectives for nonparametric estimation in mixture regression.
- Practical Value: Has direct applications in clustering and distribution estimation.
- Reproducibility: Detailed algorithm description facilitates reproduction.
- Extensibility: Framework can be extended to other mixture models.
- Market Segmentation: Analysis of behavior patterns across different consumer groups.
- Medical Research: Analysis of treatment response in patient subgroups.
- Economic Research: Analysis of economic growth patterns along different development paths.
- Machine Learning: Clustered regression and semi-supervised learning.
- Jiang, H. and Guntuboyina, A. (2025). A nonparametric maximum likelihood approach to mixture of regression.
- Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm.
- Kiefer, J. and Wolfowitz, J. (1956). Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters.
- Leisch, F. (2004). FlexMix: A general framework for finite mixture models and latent class regression in R.
Overall Assessment: This is a high-quality statistical methodology paper that proposes innovative EM algorithms to address important problems in mixture linear regression. The methods have solid theoretical foundations and demonstrate good practical performance, providing valuable tools for related fields. Despite some limitations, its contributions are significant with good academic and applied value.