Sampling the Bayesian Elastic Net

Hans, Liu

The Bayesian elastic net regression model is characterized by the regression coefficient prior distribution, the negative log density of which corresponds to the elastic net penalty function. While Markov chain Monte Carlo (MCMC) methods exist for sampling from the posterior of the regression coefficients given the penalty parameters, full Bayesian inference that incorporates uncertainty about the penalty parameters remains a challenge due to an intractable integral in the posterior density function. Though sampling methods have been proposed that avoid computing this integral, all correctly-specified methods for full Bayesian inference that have appeared in the literature involve at least one "Metropolis-within-Gibbs" update, requiring tuning of proposal distributions. The computational landscape is complicated by the fact that two forms of the Bayesian elastic net prior have been introduced, and two representations (with and without data augmentation) of the prior suggest different MCMC algorithms. We review the forms and representations of the prior, discuss all combinations of these different treatments for the first time, and introduce one combination of form and representation that has yet to appear in the literature. We introduce MCMC algorithms for full Bayesian inference for all treatments of the prior. The algorithms allow for direct sampling of all parameters without any "Metropolis-within-Gibbs" steps. The key to the new approach is a careful transformation of the parameter space and an analysis of the resulting full conditional density functions that allows for efficient rejection sampling. We make empirical comparisons between our approaches and existing MCMC samplers for different data structures.

Basic Information

Paper ID: 2501.00594
Title: Sampling the Bayesian Elastic Net
Authors: Christopher M. Hans, Ningyi Liu
Classification: stat.CO stat.ME
Publication Date: December 2024
Paper Link: https://arxiv.org/abs/2501.00594

Abstract

The Bayesian elastic net regression model is characterized by a prior distribution on the regression coefficients whose negative log-density corresponds to the elastic net penalty function. While MCMC methods exist for sampling from the posterior distribution of the regression coefficients given the penalty parameters, full Bayesian inference that incorporates uncertainty in the penalty parameters remains challenging because of an intractable integral in the posterior density function. Although sampling methods have been proposed that avoid computing this integral, all correctly specified methods for full Bayesian inference in the literature involve at least one "Metropolis-within-Gibbs" update, which requires tuning a proposal distribution. The computational landscape is further complicated by the existence of two forms of the Bayesian elastic net prior in the literature and two representations of the prior (with and without data augmentation) that suggest different MCMC algorithms. This paper reviews the prior forms and representations, discusses all combinations of these treatments for the first time, and introduces a combination of form and representation that has not previously appeared in the literature. It introduces MCMC algorithms for full Bayesian inference under all prior treatments, allowing direct sampling of all parameters without any "Metropolis-within-Gibbs" steps.

Research Background and Motivation

Core Problem

The Bayesian elastic net regression model has become a popular regression method across many research fields. The model is characterized by a prior distribution on regression coefficients whose negative log-density corresponds to the elastic net penalty function:

$\pi_c(\beta \mid \sigma^2, \lambda_1, \lambda_2) \propto \exp\left\{-\frac{1}{2\sigma^2}\left(\lambda_2\beta^T\beta + \lambda_1\|\beta\|_1\right)\right\}$

Computational Challenges

Intractable Integrals: The normalizing constant of the prior distribution contains the term $\Phi(-\lambda_1/(2\sigma\sqrt{\lambda_2}))^{-p}$, where $\Phi(\cdot)$ is the standard normal cumulative distribution function, an integral with no closed-form expression. Because its argument depends on $\sigma^2$, $\lambda_1$, and $\lambda_2$, the term cannot be dropped when these parameters are treated as unknown.
Parameterization Complexity: Two different prior parameterization forms exist in the literature:
- Commonly-scaled: Both $\lambda_2\beta^T\beta$ and $\lambda_1\|\beta\|_1$ are scaled by $2\sigma^2$
- Differentially-scaled: Different terms use different scaling factors
Representation Diversity: Each parameterization form has two representations:
- Direct representation: Without data augmentation
- Data augmentation representation: Introducing latent variables in a hierarchical model
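The origin of the $\Phi(\cdot)$ term can be checked numerically. Completing the square in the commonly-scaled prior for a single coefficient gives a closed-form constant that contains $\Phi(-\theta)$ with $\theta = \lambda_1/(2\sigma\sqrt{\lambda_2})$. The sketch below compares that expression with a brute-force integral. The function names and the parameter values in the usage line are illustrative assumptions, not taken from the paper.

```python
import math

def std_normal_cdf(x):
    """Standard normal CDF, written with the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def closed_form_constant(sigma2, lam1, lam2):
    """Per-coefficient normalizing constant obtained by completing the square."""
    sigma = math.sqrt(sigma2)
    theta = lam1 / (2.0 * sigma * math.sqrt(lam2))
    return (2.0 * math.sqrt(2.0 * math.pi) * sigma / math.sqrt(lam2)
            * math.exp(theta ** 2 / 2.0) * std_normal_cdf(-theta))

def numerical_constant(sigma2, lam1, lam2, upper=40.0, n=200_000):
    """Midpoint-rule integral of exp{-(lam2*b^2 + lam1*|b|) / (2*sigma2)}."""
    h = upper / n
    total = 0.0
    for i in range(n):
        b = (i + 0.5) * h
        total += math.exp(-(lam2 * b * b + lam1 * b) / (2.0 * sigma2))
    return 2.0 * total * h  # the integrand is symmetric about zero

# Hypothetical parameter values, chosen only for the comparison.
exact = closed_form_constant(1.5, 2.0, 0.7)
approx = numerical_constant(1.5, 2.0, 0.7)
```

The two values should agree to several decimal places. With $p$ independent coefficients the constant is raised to the power $p$, which is where the exponent $-p$ in the term above comes from.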

Limitations of Existing Methods

All existing correctly specified methods require at least one Metropolis-Hastings update step, which demands:

Specification and adjustment of the proposal distribution
Selection of step-size parameters for random walks
Potential issues with slow convergence and poor mixing
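To make the tuning burden concrete, the following is a minimal sketch of a generic random-walk Metropolis update for a positive parameter, of the kind used inside a "Metropolis-within-Gibbs" sampler. It is not the update from any of the cited papers: the Gamma target is a stand-in, and `step` is the hypothetical tuning parameter that the user would have to choose.

```python
import math
import random

def rw_metropolis_step(log_target, x, step, rng):
    """One random-walk Metropolis update on log(x) for a positive parameter x."""
    proposal = x * math.exp(step * rng.gauss(0.0, 1.0))
    # The log(proposal) - log(x) term is the Jacobian of the log transform.
    log_ratio = (log_target(proposal) + math.log(proposal)
                 - log_target(x) - math.log(x))
    if math.log(1.0 - rng.random()) < log_ratio:
        return proposal, True
    return x, False

def acceptance_rate(log_target, step, n_iter=2000, seed=0):
    """Fraction of accepted proposals for a given step size."""
    rng = random.Random(seed)
    x, accepted = 1.0, 0
    for _ in range(n_iter):
        x, ok = rw_metropolis_step(log_target, x, step, rng)
        accepted += ok
    return accepted / n_iter

def log_gamma_target(x):
    """Unnormalized log density of a Gamma(shape=3, rate=1) stand-in target."""
    return 2.0 * math.log(x) - x

rate_small_step = acceptance_rate(log_gamma_target, 0.01)
rate_large_step = acceptance_rate(log_gamma_target, 5.0)
```

A very small step accepts almost every proposal but barely moves, while a very large step rejects most proposals. Both extremes mix poorly, so the step size has to be tuned for each problem.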

Core Contributions

Comprehensive Review: First comprehensive review of all combinations of Bayesian elastic net prior forms and representations, introducing a new combination (differentially-scaled direct representation)
Parameter Space Transformation: Proposes parameter space transformations that confine the $\Phi(\cdot)$ term to a single full conditional distribution
Tuning-Free MCMC Algorithm: Develops MCMC algorithms that require no "Metropolis-within-Gibbs" steps and therefore no tuning of proposal distributions
Efficient Rejection Sampling: Designs efficient rejection sampling algorithms with automatically-tuned piecewise exponential proposal distributions based on log-concavity analysis
Theoretical Guarantees: Provides theoretical results on log-concavity of key distributions and mode bounds

Methodology Details

Problem Definition

Under the normal linear regression model $y = X\beta + \varepsilon$ (where $\varepsilon \sim N(0, \sigma^2I_n)$), the goal is full Bayesian elastic net inference, which accounts for uncertainty in the penalty parameters $\lambda_1, \lambda_2$ and the error variance $\sigma^2$.

Core Technical Innovations

1. Parameter Space Transformation

Transformation under commonly-scaled prior: $(\sigma^2, \lambda_1, \lambda_2) \rightarrow (u_1 = \sigma^2, u_2 = \sqrt{\lambda_2}/\sigma, \theta = \lambda_1/(2\sigma\sqrt{\lambda_2}))$

Transformation under differentially-scaled prior: $(\lambda_2, \lambda_1) \rightarrow (u_2 = \sqrt{\lambda_2}, \theta = \lambda_1/\sqrt{\lambda_2})$

Key advantages of these transformations:

Concentrate the $\Phi(\cdot)$ term into the full conditional distribution of a single parameter $\theta$
Yield log-concave full conditional distributions, facilitating efficient sampling
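The two transformations can be written out as forward and inverse maps. In the sketch below the forward maps follow the formulas stated above, while the inverse maps are derived here by solving those formulas algebraically and are not quoted from the paper. The function names are illustrative.

```python
import math

def to_transformed_common(sigma2, lam1, lam2):
    """(sigma^2, lambda_1, lambda_2) -> (u1, u2, theta), commonly-scaled prior."""
    sigma = math.sqrt(sigma2)
    u1 = sigma2
    u2 = math.sqrt(lam2) / sigma
    theta = lam1 / (2.0 * sigma * math.sqrt(lam2))
    return u1, u2, theta

def from_transformed_common(u1, u2, theta):
    """Inverse map: sqrt(lambda_2) = u2 * sigma, so lambda_1 = 2 * theta * u2 * u1."""
    sigma2 = u1
    lam2 = u2 ** 2 * u1
    lam1 = 2.0 * theta * u2 * u1
    return sigma2, lam1, lam2

def to_transformed_diff(lam1, lam2):
    """(lambda_1, lambda_2) -> (u2, theta), differentially-scaled prior."""
    u2 = math.sqrt(lam2)
    return u2, lam1 / u2

def from_transformed_diff(u2, theta):
    """Inverse map, returned in the order (lambda_1, lambda_2)."""
    return theta * u2, u2 ** 2

# Round trip on hypothetical values: the original parameters are recovered.
u1, u2, theta = to_transformed_common(2.0, 1.2, 0.8)
sigma2, lam1, lam2 = from_transformed_common(u1, u2, theta)
```

Under the commonly-scaled map, $\theta$ is exactly the argument of $\Phi(-\cdot)$ in the prior normalizing constant, which is why the $\Phi(\cdot)$ term depends on the transformed parameters only through $\theta$.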

2. Rejection Sampling Algorithm

Specialized rejection sampling methods designed for densities of the form: $f(x) \propto \Phi(-x)^{-q}x^{a-1}e^{-bx^2-cx-d/x}, \quad x > 0$

Key Theoretical Results:

Proposition 1: When $q \in \{1,2,\ldots\}$, $a \geq 1$, $b \geq q/2$, and $c > 0$, $f(x)$ is integrable and log-concave
Proposition 2: Provides exact bounds on the mode $x^*$ , facilitating construction of envelope functions for rejection sampling
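The log-concavity claim in Proposition 1 can be probed numerically. The sketch below evaluates the unnormalized log density on a grid and reports the largest second difference, which should be negative for a concave function. It is a numerical illustration rather than a proof. The grid limits, the default $d = 0$, and the parameter values in the usage lines are assumptions made here.

```python
import math

def log_phi_neg(x):
    """log Phi(-x), computed with erfc for accuracy in the upper tail."""
    return math.log(0.5 * math.erfc(x / math.sqrt(2.0)))

def log_f(x, q, a, b, c, d=0.0):
    """Unnormalized log density of f(x) for x > 0."""
    return (-q * log_phi_neg(x) + (a - 1.0) * math.log(x)
            - b * x * x - c * x - d / x)

def max_second_difference(q, a, b, c, d=0.0, lo=0.05, hi=8.0, step=0.01):
    """Largest second difference of log f on a grid; negative suggests concavity."""
    n = int(round((hi - lo) / step))
    vals = [log_f(lo + i * step, q, a, b, c, d) for i in range(n + 1)]
    return max(vals[i - 1] - 2.0 * vals[i] + vals[i + 1] for i in range(1, n))

# b = q/2 sits on the boundary allowed by Proposition 1.
inside = max_second_difference(q=8, a=1.0, b=4.0, c=1.0)
# b < q/2 violates the condition, and concavity fails on this grid.
outside = max_second_difference(q=8, a=1.0, b=0.5, c=1.0)
```

The intuition is that $-\log\Phi(-x)$ is convex with second derivative strictly below 1, so the factor $\Phi(-x)^{-q}$ adds less than $q$ to the curvature of $\log f$, and the term $-bx^2$ with $b \geq q/2$ removes at least $q$.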

3. Full Conditional Distributions

The full conditional distributions after transformation include:

Generalized Inverse Gaussian Distribution (GIG): $u_1 | \text{other parameters} \sim \text{GIG}(\alpha, \beta, \gamma)$

Modified Half-Normal Distribution (MHN): $u_2 | \text{other parameters} \sim \text{MHN}(\alpha, \beta, \gamma)$

Distribution with $\Phi(\cdot)$ term: $\pi(\theta \mid \text{other parameters}) \propto \Phi(-\theta)^{-p}\theta^{L-1}e^{-p\theta^2/2-\theta c}$

Algorithm Flow

Initialization: Set initial parameter values
Iterative Sampling:
- Sample GIG distribution using Devroye (2014) method
- Sample MHN distribution using Sun et al. (2023) method or new rejection sampling method
- Sample distribution with $\Phi(\cdot)$ term using adaptive rejection sampling
Regression Coefficient Update: Update $\beta$ according to selected representation (direct or data augmentation)
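The rejection sampling step can be illustrated with a generic tangent-line construction: for a log-concave density, tangents to the log density lie above it, so their minimum gives a piecewise exponential envelope that is easy to sample. The sketch below is a simplified stand-in, not the authors' algorithm. It uses fixed, hand-picked tangent points instead of the automatically tuned envelope described in the paper, and the target parameters are illustrative values satisfying Proposition 1.

```python
import math
import random

def make_envelope(h, dh, points):
    """Tangent lines to a concave log density h at sorted points in (0, inf)."""
    hs = [h(x) for x in points]
    ss = [dh(x) for x in points]
    if ss[-1] > -1e-12:
        raise ValueError("last tangent must slope downward")
    z = [0.0]  # breakpoints: 0, pairwise tangent intersections, +infinity
    for i in range(len(points) - 1):
        num = hs[i + 1] - hs[i] - ss[i + 1] * points[i + 1] + ss[i] * points[i]
        z.append(num / (ss[i] - ss[i + 1]))
    z.append(math.inf)
    shift = max(hs)  # keeps the exponentials in a safe range
    masses = []
    for i, (x, hx, s) in enumerate(zip(points, hs, ss)):
        width = z[i + 1] - z[i]
        left = math.exp(hx + s * (z[i] - x) - shift)  # envelope height at z[i]
        masses.append(left * (width if abs(s) < 1e-12 else math.expm1(s * width) / s))
    return points, hs, ss, z, masses

def sample(h, env, rng):
    """Draw one value from exp(h) by rejection from the envelope."""
    points, hs, ss, z, masses = env
    while True:
        i = rng.choices(range(len(masses)), weights=masses)[0]
        s, width = ss[i], z[i + 1] - z[i]
        u = rng.random()
        if abs(s) < 1e-12:
            x = z[i] + u * width
        else:
            # Inverse CDF of a density proportional to exp(s * x) on the segment.
            x = z[i] + math.log1p(u * math.expm1(s * width)) / s
        envelope = hs[i] + s * (x - points[i])
        if math.log(1.0 - rng.random()) <= h(x) - envelope:
            return x

Q, A, B, C = 3, 2.0, 1.5, 3.0  # illustrative values, with B = Q / 2

def h(x):
    """log of Phi(-x)^(-Q) * x^(A-1) * exp(-B*x^2 - C*x), up to a constant."""
    return (-Q * math.log(0.5 * math.erfc(x / math.sqrt(2.0)))
            + (A - 1.0) * math.log(x) - B * x * x - C * x)

def dh(x):
    """Derivative of h; phi(x) / Phi(-x) is the normal hazard function."""
    hazard = (math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
              / (0.5 * math.erfc(x / math.sqrt(2.0))))
    return Q * hazard + (A - 1.0) / x - 2.0 * B * x - C

env = make_envelope(h, dh, [0.3, 0.8, 1.5, 3.0, 6.0])
rng = random.Random(0)
draws = [sample(h, env, rng) for _ in range(1000)]
```

No proposal scale has to be chosen by the user, since the envelope is built from the target itself. One limitation of this sketch is that `math.erfc` underflows for arguments beyond roughly 27, so `h` fails for $x$ above about 38. A production implementation would need a tail-stable evaluation of $\log\Phi(-x)$.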

Experimental Setup

Datasets

Four simulation settings from Zou and Hastie (2005):

Simulation 1: $n=20$, $p=8$, $\beta=(3,1.5,0,0,2,0,0,0)^T$, $\sigma=3$
Simulation 2: $n=20$, $p=8$, $\beta_j=0.85$ for $j=1,\ldots,8$, $\sigma=3$
Simulation 3: $n=100$, $p=40$, high-dimensional setting, $\sigma=15$
Simulation 4: $n=100$, $p=40$, block-diagonal covariance structure, $\sigma=15$

Fifty datasets generated for each setting for comparison.
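As an illustration of the first setting, the sketch below generates one dataset with the Simulation 1 values of $n$, $\beta$, and $\sigma$. The predictor correlation $0.5^{|i-j|}$ is an assumption carried over from the first example in Zou and Hastie (2005), since this summary does not state it. The function name and seed handling are also illustrative.

```python
import random

def simulate_dataset(n, beta, sigma, rho=0.5, seed=0):
    """Draw (X, y) from y = X beta + eps with AR(1)-correlated normal predictors."""
    rng = random.Random(seed)
    scale = (1.0 - rho * rho) ** 0.5
    X, y = [], []
    for _ in range(n):
        row = [rng.gauss(0.0, 1.0)]
        for _j in range(len(beta) - 1):
            # AR(1) recursion: unit variances and corr(x_i, x_j) = rho ** |i - j|.
            row.append(rho * row[-1] + scale * rng.gauss(0.0, 1.0))
        X.append(row)
        y.append(sum(b * x for b, x in zip(beta, row)) + rng.gauss(0.0, sigma))
    return X, y

# Simulation 1 values of n, beta and sigma from the list above.
beta_sim1 = [3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0]
X, y = simulate_dataset(n=20, beta=beta_sim1, sigma=3.0, seed=1)
```

Repeating the call with fifty different seeds would give fifty replicate datasets for this setting.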

Evaluation Metrics

Effective Sample Size (ESS) used as the measure of MCMC algorithm efficiency, computed via R package mcmcse.
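For readers unfamiliar with ESS, the following is a rough batch-means sketch of the idea: the chain length is scaled by the ratio of the sample variance to an estimate of the asymptotic variance of the sample mean. It is not a reimplementation of the mcmcse package, and the batch-size rule (square root of the chain length) is an assumption made for the example.

```python
import math
import random

def ess_batch_means(chain):
    """Univariate ESS estimate: n * (sample variance) / (batch-means variance)."""
    n = len(chain)
    b = int(math.sqrt(n))   # batch size
    a = n // b              # number of batches
    used = chain[:a * b]
    mean = sum(used) / (a * b)
    sample_var = sum((x - mean) ** 2 for x in used) / (a * b - 1)
    batch_means = [sum(used[k * b:(k + 1) * b]) / b for k in range(a)]
    asym_var = b * sum((m - mean) ** 2 for m in batch_means) / (a - 1)
    return a * b * sample_var / asym_var

# A strongly autocorrelated AR(1) chain has far fewer effective draws
# than an independent sample of the same length.
rng = random.Random(0)
iid_chain = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
ar_chain = [0.0]
for _ in range(9_999):
    ar_chain.append(0.9 * ar_chain[-1] + rng.gauss(0.0, 1.0))
ess_iid = ess_batch_means(iid_chain)
ess_ar = ess_batch_means(ar_chain)
```

An ESS improvement for one sampler over another then means that, for the same number of iterations, its draws carry more information about the posterior mean of the parameter in question.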

Comparison Methods

RS: Rejection sampling method proposed in this paper (weak prior RS-W and strong prior RS-S)
MH: Metropolis-Hastings method from Hans (2011) (MH-W and MH-S)
EX: Exchange algorithm from Wang and Wang (2023) (EX and EX-B)

Implementation Details

MCMC iterations: 10,000 (100 burn-in)
Prior settings:
- Weak prior: $L=\nu_1=R=\nu_2=1$
- Strong prior: $L=6$, $\nu_1=4$, $R=2$, $\nu_2=4$

Experimental Results

Main Results

Low-Dimensional Settings (Simulations 1 and 2, p=8)

RS method shows significantly better performance on non-zero regression coefficients, with ESS improvement distribution strongly right-skewed
Similar performance across methods for zero regression coefficients
RS-S achieves average improvement up to 149.86% on $\lambda_1$ parameter

High-Dimensional Settings (Simulations 3 and 4, p=40)

Simulation 3: EX method shows better overall performance, but RS method's ESS reduction is typically modest (<20%)
Simulation 4: RS-S performs comparably or slightly better than EX on non-zero coefficients

Key Findings

Parameter-Specific Performance:
- $\beta$ parameters: RS method shows clear advantages in low dimensions, reasonable performance in high dimensions
- $\sigma^2, \lambda_1, \lambda_2$: RS-S generally performs well in most cases
Tuning Sensitivity:
- EX-B (poorly tuned exchange algorithm) demonstrates importance of tuning parameters
- RS method completely avoids tuning requirements
Prior Impact:
- Strong prior (RS-S) typically outperforms weak prior (RS-W)
- Particularly evident in sampling efficiency for $\lambda_1$ parameter

Performance Comparison Table (Average ESS Improvement Percentage)

Parameter	Sim 1 RS-S	Sim 2 RS-S	Sim 3 RS-S	Sim 4 RS-S
$\beta_1$	59.73%	5.87%	-15.2%	2.1%
$\sigma^2$	21.79%	19.83%	-40.95%	-42.93%
$\lambda_1$	149.86%	166.75%	90.42%	58.47%
$\lambda_2$	11.9%	18.39%	-53.17%	-39.56%

Development of Bayesian Regularized Regression

Lasso Connection: Tibshirani (1996) first established connection between Bayesian posterior mode and penalized optimization
Elastic Net Extension: Li and Lin (2010), Hans (2011), Kyung et al. (2010) and others developed Bayesian elastic net
Adaptive Methods: Griffin and Brown (2007), Leng et al. (2014) and others studied Bayesian versions of adaptive lasso

Progress in Computational Methods

Data Augmentation: Scale mixture representation from Park and Casella (2008)
Variational Inference: Approximate methods avoiding MCMC
Exchange Algorithm: Clever approach from Wang and Wang (2023) avoiding computation of $\Phi(\cdot)$

Conclusions and Discussion

Main Conclusions

Method Effectiveness: The proposed rejection sampling method successfully eliminates tuning requirements, providing competitive or superior performance in most cases
Theoretical Contribution: Parameter transformation and log-concavity analysis provide new theoretical foundations for Bayesian elastic net computation
Practical Value: The automatic nature of the algorithm makes it more suitable for practical applications

Limitations

High-Dimensional Performance: The relative advantages of the method are less pronounced in some high-dimensional settings compared to low-dimensional cases
Prior Restrictions: Log-concavity requirement of $L \geq 1$ limits use of certain priors
Parameterization Dependence: Performance is sensitive to parameterization choices

Future Directions

Improved High-Dimensional Performance: Combining partial collapse sampling and generalized Gibbs steps
Extension to Other Models: Extending the method to generalized linear models and other regularization methods
Theoretical Optimization: Exploring alternative parameterizations that may improve Markov chain dynamics

In-Depth Evaluation

Strengths

Technical Innovation: Clever parameter transformation and rejection sampling design based on log-concavity demonstrate high originality
Theoretical Rigor: Provides complete mathematical proofs and theoretical guarantees
Practical Value: Elimination of tuning requirements significantly enhances method usability
Comprehensive Comparison: Systematic comparison of all existing methods, filling gaps in the literature

Weaknesses

Complexity Trade-off: While avoiding tuning, the method itself has higher theoretical complexity
Scope of Applicability: Restrictions under certain prior settings may affect method generalizability
High-Dimensional Challenges: Performance in high-dimensional settings still has room for improvement

Impact

Academic Contribution: Provides important advances in computational methods for Bayesian regularized regression
Practical Application: Tuning-free nature makes the method more accessible to practitioners
Methodological Value: Parameter transformation approach may inspire computational methods for other complex Bayesian models

Applicable Scenarios

Elastic net regression analysis requiring complete Bayesian inference
Automated analysis pipelines sensitive to MCMC tuning
Moderate-dimensional regression problems (p < 100)
Applications requiring quantification of penalty parameter uncertainty

References

Key references include:

Li, Q. and Lin, N. (2010). The Bayesian elastic net. Bayesian Analysis, 5, 151-170.
Hans, C. (2011). Elastic net regression modeling with the orthant normal prior. Journal of the American Statistical Association, 106, 1383-1393.
Wang, H.-B. and Wang, J. (2023). An exact sampler for fully Bayesian elastic net. Computational Statistics, 38, 1721-1734.
Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, B, 67, 301-320.
