Sampling the Bayesian Elastic Net

Hans, Liu
The Bayesian elastic net regression model is characterized by the regression coefficient prior distribution, the negative log density of which corresponds to the elastic net penalty function. While Markov chain Monte Carlo (MCMC) methods exist for sampling from the posterior of the regression coefficients given the penalty parameters, full Bayesian inference that incorporates uncertainty about the penalty parameters remains a challenge due to an intractable integral in the posterior density function. Though sampling methods have been proposed that avoid computing this integral, all correctly-specified methods for full Bayesian inference that have appeared in the literature involve at least one "Metropolis-within-Gibbs" update, requiring tuning of proposal distributions. The computational landscape is complicated by the fact that two forms of the Bayesian elastic net prior have been introduced, and two representations (with and without data augmentation) of the prior suggest different MCMC algorithms. We review the forms and representations of the prior, discuss all combinations of these different treatments for the first time, and introduce one combination of form and representation that has yet to appear in the literature. We introduce MCMC algorithms for full Bayesian inference for all treatments of the prior. The algorithms allow for direct sampling of all parameters without any "Metropolis-within-Gibbs" steps. The key to the new approach is a careful transformation of the parameter space and an analysis of the resulting full conditional density functions that allows for efficient rejection sampling. We make empirical comparisons between our approaches and existing MCMC samplers for different data structures.

Basic Information

  • Paper ID: 2501.00594
  • Title: Sampling the Bayesian Elastic Net
  • Authors: Christopher M. Hans, Ningyi Liu
  • Classification: stat.CO stat.ME
  • Publication Date: December 2024
  • Paper Link: https://arxiv.org/abs/2501.00594

Abstract

The Bayesian elastic net regression model is characterized through a prior distribution on regression coefficients, whose negative log-density corresponds to the elastic net penalty function. While MCMC methods exist for sampling from the posterior distribution of regression coefficients given penalty parameters, complete Bayesian inference that incorporates uncertainty in the penalty parameters remains challenging due to an intractable integral in the posterior density function. Although sampling methods have been proposed to avoid computing this integral, all correctly specified complete Bayesian inference methods in the literature involve at least one "Metropolis-within-Gibbs" update requiring tuning of the proposal distribution. The computational landscape is further complicated by the existence of two forms of Bayesian elastic net priors in the literature, and two representations of the prior (with and without data augmentation) that suggest different MCMC algorithms. This paper reviews the prior forms and representations, discusses for the first time all combinations of these different treatments, and introduces a combination of form and representation not previously appearing in the literature. We introduce MCMC algorithms for complete Bayesian inference for all prior treatments, allowing direct sampling of all parameters without any "Metropolis-within-Gibbs" steps.

Research Background and Motivation

Core Problem

The Bayesian elastic net regression model has become a popular regression method across many research fields. The model is characterized by a prior distribution on regression coefficients whose negative log-density corresponds to the elastic net penalty function:

$$\pi_c(\beta \mid \sigma^2, \lambda_1, \lambda_2) \propto \exp\left\{-\frac{1}{2\sigma^2}\left(\lambda_2\beta^T\beta + \lambda_1\|\beta\|_1\right)\right\}$$
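As a minimal sketch of the prior kernel above, the following Python function evaluates the unnormalized log density for a coefficient vector. The function name and argument names are illustrative choices, not taken from the paper, and the normalizing constant is deliberately omitted.

```python
def log_prior_unnormalized(beta, sigma2, lam1, lam2):
    """Unnormalized log density of the commonly-scaled elastic net prior.

    Hypothetical helper: evaluates only the exponent of the kernel,
    -(lam2 * beta'beta + lam1 * ||beta||_1) / (2 * sigma2).
    """
    l2_term = sum(b * b for b in beta)   # beta^T beta
    l1_term = sum(abs(b) for b in beta)  # ||beta||_1
    return -(lam2 * l2_term + lam1 * l1_term) / (2.0 * sigma2)


# Example call: the exponent is zero at beta = 0 and decreases away from it.
value_at_zero = log_prior_unnormalized([0.0, 0.0], sigma2=1.0, lam1=1.0, lam2=1.0)
value_elsewhere = log_prior_unnormalized([1.0, -2.0], sigma2=2.0, lam1=3.0, lam2=0.5)
```

Because both penalty terms share the factor $1/(2\sigma^2)$, this corresponds to the commonly-scaled form of the prior.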

Computational Challenges

  1. Intractable Integrals: The normalizing constant of the prior distribution contains the term $\Phi(-\lambda_1/(2\sigma\sqrt{\lambda_2}))^{-p}$, where $\Phi(\cdot)$ is the standard normal cumulative distribution function, an integral with no closed-form expression.
  2. Parameterization Complexity: Two different prior parameterization forms exist in the literature:
    • Commonly-scaled: Both $\lambda_2\beta^T\beta$ and $\lambda_1\|\beta\|_1$ are scaled by $2\sigma^2$
    • Differentially-scaled: Different terms use different scaling factors
  3. Representation Diversity: Each parameterization form has two representations:
    • Direct representation: Without data augmentation
    • Data augmentation representation: Introducing latent variables in a hierarchical model
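The $\Phi(\cdot)$ term from item 1 has no closed form, but it is easy to evaluate numerically. The sketch below does so with the standard library's `math.erfc`, and shows that the term depends on $(\sigma^2, \lambda_1, \lambda_2)$ only through the ratio $\lambda_1/(2\sigma\sqrt{\lambda_2})$. The function name is hypothetical, and this direct formula is an assumption suited to moderate arguments only; it would underflow for very large ratios.

```python
import math


def log_phi_term(sigma2, lam1, lam2, p):
    """Log of Phi(-lam1 / (2 * sigma * sqrt(lam2)))^(-p).

    Hypothetical helper. Uses Phi(-x) = 0.5 * erfc(x / sqrt(2)).
    """
    ratio = lam1 / (2.0 * math.sqrt(sigma2) * math.sqrt(lam2))
    phi_neg = 0.5 * math.erfc(ratio / math.sqrt(2.0))
    return -p * math.log(phi_neg)


# Two different parameter settings with the same ratio (0.5) give the same value.
term_a = log_phi_term(sigma2=4.0, lam1=2.0, lam2=1.0, p=8)
term_b = log_phi_term(sigma2=1.0, lam1=1.0, lam2=1.0, p=8)
```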

Limitations of Existing Methods

All existing correctly specified methods require at least one Metropolis-Hastings update step, which demands:

  • Specification and adjustment of the proposal distribution
  • Selection of step-size parameters for random walks
  • Potential issues with slow convergence and poor mixing

Core Contributions

  1. Comprehensive Review: First comprehensive review of all combinations of Bayesian elastic net prior forms and representations, introducing a new combination (differentially-scaled direct representation)
  2. Parameter Space Transformation: Proposes parameter space transformations that confine the complex $\Phi(\cdot)$ term to a single complete conditional distribution
  3. Tuning-Free MCMC Algorithm: Develops MCMC algorithms requiring no "Metropolis-within-Gibbs" steps, avoiding proposal distribution adjustment issues
  4. Efficient Rejection Sampling: Designs efficient rejection sampling algorithms with automatically-tuned piecewise exponential proposal distributions based on log-concavity analysis
  5. Theoretical Guarantees: Provides theoretical results on log-concavity of key distributions and mode bounds

Methodology Details

Problem Definition

Under the normal linear regression model $y = X\beta + \varepsilon$, where $\varepsilon \sim N(0, \sigma^2 I_n)$, the goal is complete Bayesian elastic net inference that models uncertainty in the penalty parameters $\lambda_1, \lambda_2$ and the error variance $\sigma^2$.

Core Technical Innovations

1. Parameter Space Transformation

Transformation under commonly-scaled prior: $(\sigma^2, \lambda_1, \lambda_2) \rightarrow (u_1 = \sigma^2,\ u_2 = \sqrt{\lambda_2}/\sigma,\ \theta = \lambda_1/(2\sigma\sqrt{\lambda_2}))$

Transformation under differentially-scaled prior: $(\lambda_2, \lambda_1) \rightarrow (u_2 = \sqrt{\lambda_2},\ \theta = \lambda_1/\sqrt{\lambda_2})$

Key advantages of these transformations:

  • Concentrate the $\Phi(\cdot)$ term into the complete conditional distribution of a single parameter $\theta$
  • Yield log-concave complete conditional distributions, facilitating efficient sampling
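The two transformations and their inverses can be written out directly. The sketch below is an illustration with hypothetical function names; the inverse maps follow by solving the stated definitions for the original parameters ($\lambda_2 = u_1 u_2^2$ and $\lambda_1 = 2 u_1 u_2 \theta$ in the commonly-scaled case, $\lambda_2 = u_2^2$ and $\lambda_1 = \theta u_2$ in the differentially-scaled case).

```python
import math


def to_transformed_common(sigma2, lam1, lam2):
    """(sigma^2, lam1, lam2) -> (u1, u2, theta) for the commonly-scaled prior."""
    sigma = math.sqrt(sigma2)
    u1 = sigma2
    u2 = math.sqrt(lam2) / sigma
    theta = lam1 / (2.0 * sigma * math.sqrt(lam2))
    return u1, u2, theta


def from_transformed_common(u1, u2, theta):
    """Inverse map: (u1, u2, theta) -> (sigma^2, lam1, lam2)."""
    sigma2 = u1
    lam2 = u1 * u2 * u2
    lam1 = 2.0 * u1 * u2 * theta
    return sigma2, lam1, lam2


def to_transformed_diff(lam1, lam2):
    """(lam1, lam2) -> (u2, theta) for the differentially-scaled prior."""
    u2 = math.sqrt(lam2)
    return u2, lam1 / u2


def from_transformed_diff(u2, theta):
    """Inverse map: (u2, theta) -> (lam1, lam2)."""
    return theta * u2, u2 * u2


# Round trip on arbitrary positive values.
original = (4.0, 3.0, 2.0)
recovered = from_transformed_common(*to_transformed_common(*original))
```

In both cases $\theta$ is exactly the argument that appears inside $\Phi(\cdot)$, which is why the $\Phi(\cdot)$ term ends up in the conditional of $\theta$ alone.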

2. Rejection Sampling Algorithm

Specialized rejection sampling methods are designed for densities of the form: $f(x) \propto \Phi(-x)^{-q} x^{a-1} e^{-bx^2 - cx - d/x}, \quad x > 0$

Key Theoretical Results:

  • Proposition 1: When $q \in \{1,2,\ldots\}$, $a \geq 1$, $b \geq q/2$, and $c > 0$, $f(x)$ is integrable and log-concave
  • Proposition 2: Provides exact bounds on the mode $x^*$, facilitating construction of envelope functions for rejection sampling
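The log-concavity claim in Proposition 1 can be checked numerically for a given parameter setting. The sketch below evaluates $\log f$ on a grid and confirms that its second differences are negative. It is a sanity check, not a proof, and it assumes $d \geq 0$, which this summary does not state explicitly. Function names and the grid bounds are illustrative.

```python
import math


def log_f(x, q, a, b, c, d):
    """Unnormalized log density:
    -q*log Phi(-x) + (a-1)*log x - b*x^2 - c*x - d/x, for x > 0."""
    log_phi_neg = math.log(0.5 * math.erfc(x / math.sqrt(2.0)))
    return (-q * log_phi_neg + (a - 1.0) * math.log(x)
            - b * x * x - c * x - d / x)


def max_second_difference(q, a, b, c, d, lo=0.1, hi=6.0, step=0.01):
    """Largest second difference of log_f on an evenly spaced grid.

    A negative return value is consistent with log-concavity on [lo, hi].
    """
    n = int(round((hi - lo) / step))
    xs = [lo + i * step for i in range(n + 1)]
    vals = [log_f(x, q, a, b, c, d) for x in xs]
    return max(vals[i - 1] - 2.0 * vals[i] + vals[i + 1] for i in range(1, n))


# Parameters satisfying the stated conditions: q integer, a >= 1, b >= q/2, c > 0.
worst = max_second_difference(q=3, a=2.0, b=2.0, c=1.0, d=0.5)
is_consistent_with_log_concavity = worst < 0.0
```

The $-q \log \Phi(-x)$ term is convex, so it is the condition $b \geq q/2$ that lets the $-bx^2$ term dominate and keep the sum concave.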

3. Complete Conditional Distributions

The transformed complete conditional distributions include:

Generalized Inverse Gaussian Distribution (GIG): $u_1 \mid \text{other parameters} \sim \text{GIG}(\alpha, \beta, \gamma)$

Modified Half-Normal Distribution (MHN): $u_2 \mid \text{other parameters} \sim \text{MHN}(\alpha, \beta, \gamma)$

Distribution with $\Phi(\cdot)$ term: $\pi(\theta \mid \text{other parameters}) \propto \Phi(-\theta)^{-p}\theta^{L-1}e^{-\theta^2/2-\theta c}$
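To illustrate how log-concavity enables tuning-free rejection sampling, the sketch below samples from the family $f(x) \propto \Phi(-x)^{-q} x^{a-1} e^{-bx^2-cx-d/x}$ using a piecewise exponential envelope built from tangent lines to $\log f$. This is a simplified stand-in, not the paper's automatically tuned proposal: the tangent points are supplied by hand, must straddle the mode, and must not coincide with it. All function names are hypothetical, and $d \geq 0$ is assumed.

```python
import math
import random

SQRT2 = math.sqrt(2.0)


def log_f(x, q, a, b, c, d):
    """Unnormalized log density of the Phi-containing family, x > 0."""
    log_phi_neg = math.log(0.5 * math.erfc(x / SQRT2))
    return -q * log_phi_neg + (a - 1.0) * math.log(x) - b * x * x - c * x - d / x


def dlog_f(x, q, a, b, c, d):
    """Derivative of log_f; phi(x)/Phi(-x) is the standard normal hazard."""
    hazard = (math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
              / (0.5 * math.erfc(x / SQRT2)))
    return q * hazard + (a - 1.0) / x - 2.0 * b * x - c + d / (x * x)


def build_envelope(points, params):
    """Tangent lines y = icpt_j + slope_j * x at sorted `points`."""
    hs = [log_f(t, *params) for t in points]
    ss = [dlog_f(t, *params) for t in points]
    if ss[-1] >= 0.0 or 0.0 in ss:
        raise ValueError("need nonzero slopes and a last point right of the mode")
    icpts = [h - s * t for h, s, t in zip(hs, ss, points)]
    # Adjacent tangents intersect where the envelope switches lines.
    breaks = [0.0]
    for j in range(len(points) - 1):
        breaks.append((icpts[j + 1] - icpts[j]) / (ss[j] - ss[j + 1]))
    breaks.append(math.inf)
    return icpts, ss, breaks, max(hs)


def sample_f(n, params, points, rng):
    """Draw n values from f by rejection from the tangent envelope."""
    icpts, ss, breaks, shift = build_envelope(points, params)
    # Mass of each exponential piece, up to the common factor exp(shift).
    masses = [math.exp(ic - shift)
              * (math.exp(s * breaks[j + 1]) - math.exp(s * breaks[j])) / s
              for j, (ic, s) in enumerate(zip(icpts, ss))]
    draws = []
    while len(draws) < n:
        j = rng.choices(range(len(masses)), weights=masses)[0]
        s, lo, hi = ss[j], breaks[j], breaks[j + 1]
        u = rng.random()
        # Inverse-CDF draw from a density proportional to exp(s * x) on [lo, hi].
        x = math.log(math.exp(s * lo) + u * (math.exp(s * hi) - math.exp(s * lo))) / s
        if x <= 0.0:
            continue  # guard against the boundary of the support
        # Tangents lie above a concave log_f, so this log-ratio is <= 0.
        if math.log(1.0 - rng.random()) <= log_f(x, *params) - (icpts[j] + s * x):
            draws.append(x)
    return draws


params = (2, 2.0, 1.5, 1.0, 0.5)  # (q, a, b, c, d) with b >= q/2
draws = sample_f(1000, params, [0.3, 0.8, 1.5, 3.0], random.Random(0))
```

No proposal scale or step size appears anywhere: the envelope is determined entirely by the target's own values and slopes, which is the property that removes the tuning burden of a Metropolis step.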

Algorithm Flow

  1. Initialization: Set initial parameter values
  2. Iterative Sampling:
    • Sample GIG distribution using Devroye (2014) method
    • Sample MHN distribution using Sun et al. (2023) method or new rejection sampling method
    • Sample distribution with $\Phi(\cdot)$ term using adaptive rejection sampling
  3. Regression Coefficient Update: Update $\beta$ according to selected representation (direct or data augmentation)
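The flow above can be sketched as a plain Gibbs loop in which every step is a direct draw from a full conditional. The skeleton below is only a structural illustration for the commonly-scaled case: the four `draw_*` callables are hypothetical placeholders that a user would supply (for example a GIG sampler, an MHN sampler, a rejection sampler for $\theta$, and a $\beta$ update), and none of them is implemented here.

```python
def gibbs_sweeps(n_iter, init, draw_u1, draw_u2, draw_theta, draw_beta):
    """Run n_iter Gibbs sweeps and return the list of visited states.

    `init` is a dict with keys "u1", "u2", "theta", "beta".
    Each draw_* callable takes the current state dict and returns a new
    value for its parameter (hypothetical interface, assumed for this sketch).
    """
    state = dict(init)
    chain = []
    for _ in range(n_iter):
        state["u1"] = draw_u1(state)        # e.g. a GIG draw
        state["u2"] = draw_u2(state)        # e.g. an MHN draw
        state["theta"] = draw_theta(state)  # e.g. rejection sampling
        state["beta"] = draw_beta(state)    # direct or data-augmented update
        chain.append(dict(state))           # store a copy of the state
    return chain


# Toy deterministic "samplers", used only to show the calling convention.
toy_chain = gibbs_sweeps(
    3,
    {"u1": 1.0, "u2": 1.0, "theta": 1.0, "beta": [0.0]},
    draw_u1=lambda s: s["u1"] + 1.0,
    draw_u2=lambda s: s["u2"] * 2.0,
    draw_theta=lambda s: s["u1"] + s["u2"],
    draw_beta=lambda s: [s["theta"]],
)
```

Each update sees the values drawn earlier in the same sweep, and there is no accept/reject decision at the level of the chain, so no proposal distribution needs to be tuned.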

Experimental Setup

Datasets

Four simulation settings from Zou and Hastie (2005):

  1. Simulation 1: $n=20$, $p=8$, $\beta=(3,1.5,0,0,2,0,0,0)^T$, $\sigma=3$
  2. Simulation 2: $n=20$, $p=8$, $\beta_j=0.85$ for $j=1,\ldots,8$, $\sigma=3$
  3. Simulation 3: $n=100$, $p=40$, high-dimensional setting, $\sigma=15$
  4. Simulation 4: $n=100$, $p=40$, block-diagonal covariance structure, $\sigma=15$

Fifty datasets generated for each setting for comparison.
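A dataset in the style of Simulation 1 could be generated as follows. The values of $n$, $\beta$, and $\sigma$ come from the list above; the predictor correlation $0.5^{|i-j|}$ is an assumption taken from the usual description of the Zou and Hastie (2005) design and is not stated in this summary. The function name is hypothetical.

```python
import math
import random


def simulate_dataset(n, beta, sigma, rho, rng):
    """Simulate y = X beta + eps with eps ~ N(0, sigma^2).

    Predictors have unit variance and corr(x_i, x_j) = rho^|i-j| (assumed),
    generated sequentially as an AR(1) process across columns.
    """
    p = len(beta)
    X, y = [], []
    for _ in range(n):
        row = [rng.gauss(0.0, 1.0)]
        for _ in range(1, p):
            innovation = rng.gauss(0.0, 1.0)
            row.append(rho * row[-1] + math.sqrt(1.0 - rho * rho) * innovation)
        mean = sum(b * x for b, x in zip(beta, row))
        X.append(row)
        y.append(mean + rng.gauss(0.0, sigma))
    return X, y


beta_sim1 = [3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0]
# Fifty replicate datasets, one seed per dataset for reproducibility.
datasets = [simulate_dataset(20, beta_sim1, 3.0, 0.5, random.Random(seed))
            for seed in range(50)]
```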

Evaluation Metrics

Effective Sample Size (ESS) used as the measure of MCMC algorithm efficiency, computed via R package mcmcse.
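To make the metric concrete, the sketch below estimates ESS for a single chain as $n / (1 + 2\sum_k \hat\rho_k)$, truncating the sum at the first non-positive sample autocorrelation. This is a simple textbook-style estimator chosen for illustration; it is not the estimator implemented in `mcmcse`, and the helper names are hypothetical.

```python
import random


def effective_sample_size(chain):
    """Autocorrelation-based ESS estimate for a scalar chain."""
    n = len(chain)
    mean = sum(chain) / n
    centered = [x - mean for x in chain]
    var = sum(c * c for c in centered) / n
    if var == 0.0:
        raise ValueError("ESS is undefined for a constant chain")
    rho_sum = 0.0
    for lag in range(1, n):
        acov = sum(centered[i] * centered[i + lag] for i in range(n - lag)) / n
        rho = acov / var
        if rho <= 0.0:  # simple truncation rule (an assumption of this sketch)
            break
        rho_sum += rho
    return n / (1.0 + 2.0 * rho_sum)


def ar1_chain(n, phi, rng):
    """Toy AR(1) chain used only to illustrate the effect of autocorrelation."""
    x, out = 0.0, []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out


ess_independent = effective_sample_size(ar1_chain(2000, 0.0, random.Random(0)))
ess_sticky = effective_sample_size(ar1_chain(2000, 0.9, random.Random(0)))
```

A strongly autocorrelated chain yields a much smaller ESS than an independent one of the same length, which is why ESS serves as a common yardstick across samplers.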

Comparison Methods

  1. RS: Rejection sampling method proposed in this paper (weak prior RS-W and strong prior RS-S)
  2. MH: Metropolis-Hastings method from Hans (2011) (MH-W and MH-S)
  3. EX: Exchange algorithm from Wang and Wang (2023) (EX and EX-B)

Implementation Details

  • MCMC iterations: 10,000 (100 burn-in)
  • Prior settings:
    • Weak prior: $L=\nu_1=R=\nu_2=1$
    • Strong prior: $L=6$, $\nu_L=4$, $R=2$, $\nu_R=4$

Experimental Results

Main Results

Low-Dimensional Settings (Simulations 1 and 2, p=8)

  • RS method shows significantly better performance on non-zero regression coefficients, with ESS improvement distribution strongly right-skewed
  • Similar performance across methods for zero regression coefficients
  • RS-S achieves average ESS improvements of 149.86% (Simulation 1) and 166.75% (Simulation 2) on the $\lambda_1$ parameter

High-Dimensional Settings (Simulations 3 and 4, p=40)

  • Simulation 3: EX method shows better overall performance, but RS method's ESS reduction is typically modest (<20%)
  • Simulation 4: RS-S performs comparably or slightly better than EX on non-zero coefficients

Key Findings

  1. Parameter-Specific Performance:
    • $\beta$ parameters: RS method shows clear advantages in low dimensions, reasonable performance in high dimensions
    • $\sigma^2, \lambda_1, \lambda_2$: RS-S generally performs well in most cases
  2. Tuning Sensitivity:
    • EX-B (poorly tuned exchange algorithm) demonstrates importance of tuning parameters
    • RS method completely avoids tuning requirements
  3. Prior Impact:
    • Strong prior (RS-S) typically outperforms weak prior (RS-W)
    • Particularly evident in sampling efficiency for the $\lambda_1$ parameter

Performance Comparison Table (Average ESS Improvement Percentage)

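| Parameter | Sim 1 RS-S | Sim 2 RS-S | Sim 3 RS-S | Sim 4 RS-S |
| --- | --- | --- | --- | --- |
| $\beta_1$ | 59.73% | 5.87% | -15.2% | 2.1% |
| $\sigma^2$ | 21.79% | 19.83% | -40.95% | -42.93% |
| $\lambda_1$ | 149.86% | 166.75% | 90.42% | 58.47% |
| $\lambda_2$ | 11.9% | 18.39% | -53.17% | -39.56% |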
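One plausible reading of the percentages above is the relative change in ESS against a reference sampler, averaged over the replicate datasets. The summary does not spell out the exact definition or the reference method for each column, so the formula below is an assumption, and the function names and numbers are purely illustrative.

```python
def percent_improvement(ess_new, ess_ref):
    """Relative ESS change in percent; negative values mean a reduction."""
    return 100.0 * (ess_new - ess_ref) / ess_ref


def average_improvement(pairs):
    """Mean percent improvement over (ess_new, ess_ref) pairs, one per dataset."""
    values = [percent_improvement(new, ref) for new, ref in pairs]
    return sum(values) / len(values)


# Made-up ESS pairs for two datasets, only to show the arithmetic.
example = average_improvement([(150.0, 100.0), (80.0, 100.0)])
```

Under this reading, a value above 100% means the ESS more than doubled relative to the reference, while a negative entry means the reference sampler was more efficient for that parameter.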

Development of Bayesian Regularized Regression

  1. Lasso Connection: Tibshirani (1996) first established connection between Bayesian posterior mode and penalized optimization
  2. Elastic Net Extension: Li and Lin (2010), Hans (2011), Kyung et al. (2010) and others developed Bayesian elastic net
  3. Adaptive Methods: Griffin and Brown (2007), Leng et al. (2014) and others studied Bayesian versions of adaptive lasso

Progress in Computational Methods

  • Data Augmentation: Scale mixture representation from Park and Casella (2008)
  • Variational Inference: Approximate methods avoiding MCMC
  • Exchange Algorithm: Approach from Wang and Wang (2023) that avoids computation of $\Phi(\cdot)$

Conclusions and Discussion

Main Conclusions

  1. Method Effectiveness: The proposed rejection sampling method successfully eliminates tuning requirements, providing competitive or superior performance in most cases
  2. Theoretical Contribution: Parameter transformation and log-concavity analysis provide new theoretical foundations for Bayesian elastic net computation
  3. Practical Value: The automatic nature of the algorithm makes it more suitable for practical applications

Limitations

  1. High-Dimensional Performance: The relative advantages of the method are less pronounced in some high-dimensional settings compared to low-dimensional cases
  2. Prior Restrictions: The log-concavity requirement $L \geq 1$ limits the use of certain priors
  3. Parameterization Dependence: Performance is sensitive to parameterization choices

Future Directions

  1. Improved High-Dimensional Performance: Combining partial collapse sampling and generalized Gibbs steps
  2. Extension to Other Models: Extending the method to generalized linear models and other regularization methods
  3. Theoretical Optimization: Exploring alternative parameterizations that may improve Markov chain dynamics

In-Depth Evaluation

Strengths

  1. Technical Innovation: Clever parameter transformation and rejection sampling design based on log-concavity demonstrate high originality
  2. Theoretical Rigor: Provides complete mathematical proofs and theoretical guarantees
  3. Practical Value: Elimination of tuning requirements significantly enhances method usability
  4. Comprehensive Comparison: Systematic comparison of all existing methods, filling gaps in the literature

Weaknesses

  1. Complexity Trade-off: While avoiding tuning, the method itself has higher theoretical complexity
  2. Scope of Applicability: Restrictions under certain prior settings may affect method generalizability
  3. High-Dimensional Challenges: Performance in high-dimensional settings still has room for improvement

Impact

  1. Academic Contribution: Provides important advances in computational methods for Bayesian regularized regression
  2. Practical Application: Tuning-free nature makes the method more accessible to practitioners
  3. Methodological Value: Parameter transformation approach may inspire computational methods for other complex Bayesian models

Applicable Scenarios

  • Elastic net regression analysis requiring complete Bayesian inference
  • Automated analysis pipelines sensitive to MCMC tuning
  • Moderate-dimensional regression problems (p < 100)
  • Applications requiring quantification of penalty parameter uncertainty

References

Key references include:

  • Li, Q. and Lin, N. (2010). The Bayesian elastic net. Bayesian Analysis, 5, 151-170.
  • Hans, C. (2011). Elastic net regression modeling with the orthant normal prior. Journal of the American Statistical Association, 106, 1383-1393.
  • Wang, H.-B. and Wang, J. (2023). An exact sampler for fully Bayesian elastic net. Computational Statistics, 38, 1721-1734.
  • Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, B, 67, 301-320.