
Bayesian Double Descent

Polson, Sokolov
Double descent is a phenomenon of over-parameterized statistical models, such as deep neural networks, whose risk function has a re-descending property. As the complexity of the model increases, the risk first exhibits a U-shaped region due to the traditional bias-variance trade-off. Then, when the number of parameters equals the number of observations, the model becomes one of interpolation and the risk can be unbounded. Finally, in the over-parameterized region, the risk re-descends: the double descent effect. Our goal is to show that this has a natural Bayesian interpretation. We also show that this is not in conflict with the traditional Occam's razor, under which simpler models are preferred to complex ones, all else being equal. Our theoretical foundations use Bayesian model selection, the Dickey-Savage density ratio, and connect generalized ridge regression and global-local shrinkage methods with double descent. We illustrate our approach for high dimensional neural networks and provide detailed treatments of infinite Gaussian means models and non-parametric regression. Finally, we conclude with directions for future research.

Basic Information

  • Paper ID: 2507.07338
  • Title: Bayesian Double Descent
  • Authors: Nick Polson (University of Chicago Booth School), Vadim Sokolov (George Mason University)
  • Classification: stat.ML, cs.LG, stat.CO
  • Publication Date: First Draft: December 25, 2024; This Draft: October 16, 2025
  • Paper Link: https://arxiv.org/abs/2507.07338

Abstract

Double descent is a re-descent phenomenon exhibited in the risk function of overparameterized statistical models (such as deep neural networks). As model complexity increases, the risk function exhibits a U-shaped region due to the traditional bias-variance tradeoff. When the number of parameters equals the number of observations, the model becomes an interpolating model with potentially unbounded risk. Subsequently, the risk decreases again in the overparameterized region—this is the double descent effect. This paper aims to demonstrate that this phenomenon has a natural Bayesian interpretation and prove that it does not conflict with the traditional Occam's razor principle. The theoretical foundation employs Bayesian model selection, Dickey-Savage density ratios, and connects generalized ridge regression and global-local shrinkage methods to double descent.

Research Background and Motivation

Core Problems

  1. Missing Bayesian Interpretation of Double Descent: The double descent phenomenon has been primarily studied from a frequentist perspective, lacking a systematic Bayesian theoretical framework
  2. Apparent Conflict Between Occam's Razor and Double Descent: Bayesian methods favor simpler models, while double descent suggests that complex models may perform better
  3. Insufficient Theoretical Understanding of Overparameterized Models: When the number of parameters exceeds the sample size, traditional statistical theory breaks down

Research Significance

  1. Theoretical Unification: Provides a unified Bayesian theoretical framework for the double descent phenomenon
  2. Practical Guidance: Offers theoretical support for modern machine learning methods such as deep learning
  3. Methodological Contribution: Bridges classical statistical theory and modern machine learning practice

Limitations of Existing Methods

  1. Limitations of Frequentist Perspective: Existing research primarily focuses on minimum L2-norm estimators, overlooking the role of prior regularization
  2. Failure of BIC Approximation: When p > n, Laplace approximation (BIC) performs poorly
  3. Invalidity of Empirical Risk Bounds: For interpolators, empirical risk is zero, rendering traditional bounds meaningless

Core Contributions

  1. Establishes Bayesian Theoretical Framework for Double Descent: Proves that the conditional prior p(θ_M|M) is the key driver of the double descent phenomenon
  2. Resolves Occam's Razor Paradox: Demonstrates that Bayesian Occam's razor does not conflict with the double descent phenomenon
  3. Connects Classical Methods with Modern Techniques: Links generalized ridge regression and global-local shrinkage methods to double descent
  4. Provides Computational Equivalence Theorem: Achieves computational equivalence of nested models through Dickey-Savage density ratios
  5. Extends to Neural Networks: Applies the theoretical framework to high-dimensional neural network regression

Methodology Details

Task Definition

The paper studies how risk behaves in overparameterized regression models, in particular the double descent of the Bayesian risk R(M) as the model complexity M varies.

Bayesian Double Descent Definition: Let R(M) = E_{y,θ|M}(θ̂_M(y) - θ)² be the Bayes risk of the estimator θ̂_M under model M, with the expectation taken over both the data y and the conditional prior p(θ_M|M). Bayesian double descent occurs when R(M) descends again for M > n.

Theoretical Framework

1. Bayesian Model Complexity Framework

Joint Posterior Decomposition:

p(θ_M, M | D) = p(θ_M | M, D) p(M | D)

Evidence (Marginal Likelihood):

p(D|M) = ∫_{Θ_M} p(D | θ_M, M)p(θ_M|M)dθ_M
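For a Gaussian linear model the evidence integral has a closed form, which makes a convenient sanity check. The sketch below is illustrative only: the model y = Xθ + ε with θ ~ N(0, τ²I) and ε ~ N(0, σ²I), and the names `log_evidence`, `sigma2`, and `tau2`, are assumptions introduced here and are not taken from the paper.

```python
import numpy as np

def log_evidence(X, y, sigma2=1.0, tau2=1.0):
    """log p(y | M) for y = X theta + eps (assumed model, not from the paper).

    Prior: theta ~ N(0, tau2 * I); noise: eps ~ N(0, sigma2 * I).
    """
    n = X.shape[0]
    # Integrating theta out analytically gives y ~ N(0, sigma2*I + tau2*X X^T).
    cov = sigma2 * np.eye(n) + tau2 * X @ X.T
    _, logdet = np.linalg.slogdet(cov)
    quad = y @ np.linalg.solve(cov, y)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + quad)
```

Because the marginal covariance is n × n, the same expression can be evaluated when the number of columns of X exceeds n, which is the regime where the prior term matters most.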

Key Insight: The conditional prior p(θ_M|M) influences Bayesian risk through the marginalization process, acting as implicit regularization in the overparameterized region.

2. Model Nesting and Computational Equivalence Theorem

Theorem 3.1 (Model Nesting and Computational Equivalence): Suppose the following consistency conditions hold for a submodel m nested in the full model M:

  • p(θ_m|m) = p(θ_m|θ_{m+1:M} = 0, M)
  • p(y|θ_m, m) = p(y|θ_m, θ_{m+1:M} = 0, M)

Then the function estimate under submodel m can be computed from the overparameterized full model M by conditioning on the extra coefficients being zero:

f̂_m(x) = E[f(x)|θ_{m+1:M} = 0, M, y]

Dickey-Savage Density Ratio:

p(y|m)/p(y|M) = p(θ_{m+1:M} = 0|y, M)/p(θ_{m+1:M} = 0|M)
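The density ratio can be checked numerically in a conjugate setting. The sketch below assumes a Gaussian linear model with known noise variance and an independent N(0, τ²) prior on every coefficient, so that the nesting conditions hold automatically; the simulated data, the helper names, and the values of `sigma2` and `tau2` are hypothetical choices for illustration.

```python
import numpy as np

def log_mvn_pdf(x, mean, cov):
    """Log density of a multivariate normal N(mean, cov) at x."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    quad = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (x.size * np.log(2 * np.pi) + logdet + quad)

def log_evidence(X, y, sigma2, tau2):
    """log p(y | model): theta integrated out under theta ~ N(0, tau2 * I)."""
    n = X.shape[0]
    return log_mvn_pdf(y, np.zeros(n), sigma2 * np.eye(n) + tau2 * X @ X.T)

def savage_dickey_log_ratio(X, y, m, sigma2, tau2):
    """log[p(theta_rest = 0 | y, M) / p(theta_rest = 0 | M)] under the full model."""
    M = X.shape[1]
    post_cov = np.linalg.inv(X.T @ X / sigma2 + np.eye(M) / tau2)
    post_mean = post_cov @ X.T @ y / sigma2
    r = M - m  # number of coefficients set to zero in the submodel
    zeros = np.zeros(r)
    log_post_at_0 = log_mvn_pdf(zeros, post_mean[m:], post_cov[m:, m:])
    log_prior_at_0 = log_mvn_pdf(zeros, zeros, tau2 * np.eye(r))
    return log_post_at_0 - log_prior_at_0

# Simulated data (hypothetical): only the first m of M columns carry signal.
rng = np.random.default_rng(0)
n, M, m = 30, 6, 3
X = rng.normal(size=(n, M))
y = X[:, :m] @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=n)
sigma2, tau2 = 0.25, 1.0

direct = log_evidence(X[:, :m], y, sigma2, tau2) - log_evidence(X, y, sigma2, tau2)
via_ratio = savage_dickey_log_ratio(X, y, m, sigma2, tau2)
print(direct, via_ratio)  # the two log ratios should agree up to rounding
```

The point of the identity is that only the full model M needs to be fitted: the evidence of every nested submodel follows from the full-model posterior and prior evaluated at zero.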

3. Limitations of BIC Approximation

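When p < n, the Laplace approximation yields the BIC:

log p(D|M) ≈ log p(D|θ̂, M) - (p/2) log n

where θ̂ is the maximum likelihood estimate and p is the number of parameters in model M. However, when p > n, this approximation fails, and the influence of the prior p(θ|M) on Bayesian risk becomes significant.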

Generalized Ridge Regression Connection

Orthogonal Decomposition Representation

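Using the singular value decomposition X = PΛQ^T of the design matrix, so that Q^T X^T X Q = Λ², the generalized ridge estimate of each canonical coefficient γ_i shrinks the least-squares estimate γ̂_i: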

γ*_i = (λ²_i)/(λ²_i + k_i) γ̂_i

where k_i is the local shrinkage parameter, corresponding to the local scale of the global-local shrinkage model.
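The componentwise shrinkage can be sketched in a few lines of NumPy. This is a minimal illustration that assumes X has full column rank (so every λ_i > 0); the function name `generalized_ridge` and the argument `k` (a vector of per-component penalties k_i) are hypothetical.

```python
import numpy as np

def generalized_ridge(X, y, k):
    """Generalized ridge estimate computed in the canonical (SVD) coordinates.

    Assumes X has full column rank; k holds one penalty k_i per singular value.
    """
    # Thin SVD: X = P @ diag(lam) @ Qt
    P, lam, Qt = np.linalg.svd(X, full_matrices=False)
    gamma_hat = (P.T @ y) / lam                      # least squares in canonical coordinates
    gamma_star = lam**2 / (lam**2 + k) * gamma_hat   # componentwise shrinkage
    return Qt.T @ gamma_star                         # rotate back to the original coordinates
```

Setting every k_i to zero recovers ordinary least squares, and a common value k_i = k gives standard ridge regression; letting k_i differ by component is what links the estimator to local scales in global-local shrinkage priors.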

Optimal Shrinkage Parameters

Maximizing the marginal likelihood of z_i given k_i and σ² yields:

k̂_i = (λ²_i σ²)/(z²_i - σ²) for z²_i > σ²
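The summary does not define z_i or the prior behind k_i. One reading that reproduces both the shrinkage formula and the expression for k̂_i is sketched below; the definitions z_i = λ_i γ̂_i and γ_i ~ N(0, σ²/k_i) are assumptions made for this derivation.

```latex
% Assumptions (not stated in the summary):
%   z_i = \lambda_i \hat\gamma_i   and   \gamma_i \sim N(0, \sigma^2 / k_i).
\begin{align*}
\hat\gamma_i \mid \gamma_i &\sim N\!\left(\gamma_i,\ \sigma^2/\lambda_i^2\right), \qquad
\gamma_i \mid k_i \sim N\!\left(0,\ \sigma^2/k_i\right) \\
z_i \mid k_i, \sigma^2 &\sim N\!\left(0,\ \sigma^2\,(1 + \lambda_i^2/k_i)\right) \\
z_i^2 = \sigma^2\,(1 + \lambda_i^2/\hat k_i)
&\;\Longrightarrow\;
\hat k_i = \frac{\lambda_i^2\,\sigma^2}{z_i^2 - \sigma^2}, \qquad z_i^2 > \sigma^2 .
\end{align*}
```

The third line matches the marginal variance to the single observed z_i², which is the maximum likelihood solution for a zero-mean normal. Under the same assumptions, when z_i² ≤ σ² the likelihood increases as k_i grows without bound, so that component is shrunk all the way to zero.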

Neural Network Extension

Hierarchical Bayesian Specification:

y_i = Σ_{j=1}^M θ_j φ_j(x_i; w) + ε_i
θ_j ~ N(0, σ²_j)
w ~ p(w)
σ²_j ~ p(σ²_j)

This allows adaptive learning of basis functions while maintaining the Bayesian model selection framework.
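A heavily simplified sketch of this specification follows. It is not the paper's procedure: the basis weights w are drawn once from an assumed p(w) and held fixed rather than learned, all σ²_j share one fixed value `prior_var`, and the tanh basis, the scale 5.0, and the function name are hypothetical choices.

```python
import numpy as np

def fit_basis_model(x, y, M=50, sigma2=0.09, prior_var=1.0, seed=0):
    """Posterior-mean predictor for y = sum_j theta_j * phi_j(x; w) + eps.

    Simplifying assumptions: w is sampled once and fixed, and every
    theta_j has the same prior variance prior_var.
    """
    rng = np.random.default_rng(seed)
    a = rng.normal(0.0, 5.0, M)  # slopes of the tanh units (assumed p(w))
    b = rng.normal(0.0, 5.0, M)  # offsets of the tanh units (assumed p(w))

    def features(t):
        # phi_j(t; w) = tanh(a_j * t + b_j), one column per basis function
        return np.tanh(np.outer(np.asarray(t, dtype=float), a) + b)

    Phi = features(x)
    # Conjugate Gaussian posterior for theta given the fixed basis.
    post_cov = np.linalg.inv(Phi.T @ Phi / sigma2 + np.eye(M) / prior_var)
    post_mean = post_cov @ Phi.T @ y / sigma2

    def predict(t):
        return features(t) @ post_mean

    return predict
```

With M larger than the number of observations, the prior variance is the only thing keeping the fit from interpolating the noise, which is the role the conditional prior plays in the overparameterized region.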

Experimental Setup

Polynomial Regression Experiment

Data Generation:

  • True function: y_i = sin(5x_i) + ε, ε ~ N(0, 0.3²)
  • Sample size: n = 20
  • Model complexity: d = 1, 2, ..., 50

Basis Function Selection: Uses Legendre polynomial basis functions, providing numerically stable orthogonal bases.

Estimation Method: Uses Moore-Penrose pseudoinverse, providing minimum-norm solutions in the overparameterized regime.
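The experiment described above can be approximated with the following sketch. Several details are assumptions because the summary does not give them: inputs are drawn uniformly on [-1, 1], the test set is a noise-free grid of 200 points, the seed is arbitrary, and a degree-d fit uses the d + 1 Legendre polynomials P_0, ..., P_d.

```python
import numpy as np
from numpy.polynomial import legendre

def risk_curve(n=20, max_degree=50, noise=0.3, seed=0):
    """Train and test MSE of minimum-norm Legendre fits for d = 1..max_degree."""
    rng = np.random.default_rng(seed)
    x_train = rng.uniform(-1.0, 1.0, n)               # assumed input distribution
    y_train = np.sin(5 * x_train) + rng.normal(0.0, noise, n)
    x_test = np.linspace(-1.0, 1.0, 200)              # assumed test grid
    y_test = np.sin(5 * x_test)                       # noise-free targets

    train_mse, test_mse = [], []
    for d in range(1, max_degree + 1):
        Phi = legendre.legvander(x_train, d)          # shape (n, d + 1)
        # The pseudoinverse gives least squares when d + 1 <= n and the
        # minimum-norm interpolating solution when d + 1 > n.
        theta = np.linalg.pinv(Phi) @ y_train
        train_mse.append(np.mean((Phi @ theta - y_train) ** 2))
        pred = legendre.legvander(x_test, d) @ theta
        test_mse.append(np.mean((pred - y_test) ** 2))
    return np.array(train_mse), np.array(test_mse)

train, test = risk_curve()
```

Plotting `test` against d on a log scale is the natural way to look for the three regimes the summary describes; the exact location and height of the peak will depend on the seed and on the assumed input distribution.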

Bayesian Polynomial Regression

Young Method:

  • Prior: C = diag(δ², τ²/λ²₁, ..., τ²/λ²_q)
  • Posterior: θ | D, σ², C ~ N(θ̂_post, Σ_post)
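The summary does not spell out θ̂_post and Σ_post. If C is read as the covariance of a zero-mean Gaussian prior on θ, and Φ denotes the polynomial design matrix (a symbol introduced here for illustration), the standard conjugate update would be:

```latex
% Assumptions: \theta \sim N(0, C) and y \mid \theta \sim N(\Phi\theta, \sigma^2 I).
\begin{align*}
\Sigma_{\text{post}} &= \left(\sigma^{-2}\,\Phi^\top \Phi + C^{-1}\right)^{-1}, \\
\hat\theta_{\text{post}} &= \sigma^{-2}\,\Sigma_{\text{post}}\,\Phi^\top y .
\end{align*}
```

Under this reading, the entries τ²/λ²_j of C act as per-coefficient prior variances, so a larger λ_j shrinks the corresponding polynomial term more strongly.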

Deaton Method:

  • Ordering constraint: σ²₀ ≥ σ²₁ ≥ ... ≥ σ²_p
  • Pool Adjacent Violators Algorithm (PAVA) adjustment of unconstrained MAP estimates
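The PAVA step can be sketched generically as a least-squares projection onto non-increasing sequences. The function below is an illustration of the algorithm itself, not the paper's code; how the unconstrained variance estimates are obtained is left out, and the function name and optional weights are hypothetical.

```python
import numpy as np

def pava_nonincreasing(v, w=None):
    """Weighted least-squares projection of v onto non-increasing sequences."""
    v = np.asarray(v, dtype=float)
    w = np.ones_like(v) if w is None else np.asarray(w, dtype=float)
    blocks = []  # each block: [weighted mean, total weight, number of points]
    for value, weight in zip(v, w):
        blocks.append([value, weight, 1])
        # Pool while the newest block exceeds its predecessor (an ordering violation).
        while len(blocks) > 1 and blocks[-1][0] > blocks[-2][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            total = w1 + w2
            blocks.append([(w1 * m1 + w2 * m2) / total, total, c1 + c2])
    # Expand each pooled block back to one value per original point.
    return np.concatenate([np.full(c, m) for m, _, c in blocks])

print(pava_nonincreasing([3.0, 1.0, 2.0, 0.5]))  # [3.  1.5 1.5 0.5]
```

In the example, the pair (1.0, 2.0) violates the ordering and is replaced by its average, while the values that already respect the constraint are left unchanged.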

Experimental Results

Double Descent Phenomenon Verification

Three Stages:

  1. Classical Region (d < 5): Increasing complexity reduces bias and test error
  2. Interpolation Crisis (d ≈ n = 20): Test error reaches peak, model perfectly fits training data but generalizes poorly
  3. Overparameterized Region (d > 30): Test error decreases again, extreme overparameterization improves generalization

Key Findings

  1. Implicit Regularization Effect: Minimum-norm solutions exhibit implicit bias toward simple functions in overparameterized settings
  2. Bayesian Advantage: Through appropriate prior specification, Bayesian methods perform well across all regions
  3. Computational Efficiency: Can directly use the largest possible model, avoiding time-consuming model selection

Marginal Likelihood Behavior

When the data are generated from a polynomial of true degree p_true = 10, the marginal likelihood peaks at that degree, which illustrates the Bayesian Occam's razor at work.

Related Work

Frequentist Research

  1. Belkin et al. (2019): First observation of double descent in linear regression
  2. Bach (2024): Extension to stochastic regression models
  3. Hastie et al. (2022): Study of interpolator properties

Bayesian Methods

  1. MacKay (1992): Bayesian interpolation and hyperparameter regularization
  2. Polson & Scott (2012): Global-local shrinkage framework
  3. Young (1977), Deaton (1980): Bayesian methods for polynomial regression

Bias-Variance Tradeoff

  1. Geman et al. (1992): Bias-variance tradeoff in neural networks
  2. Efron & Morris (1973): Advantages of shrinkage estimators

Conclusions and Discussion

Main Conclusions

  1. Theoretical Unification: The double descent phenomenon has a natural Bayesian interpretation, driven by the conditional prior p(θ_M|M)
  2. Occam's Razor Compatibility: Marginal likelihood still favors simpler models, but conditional priors can provide good risk properties in the overparameterized region
  3. Practical Guidance: Recommends using the largest possible model, relying on automatic regularization of the Bayesian framework

Limitations

  1. Prior Specification Challenge: Requires specifying joint parameter priors over complex spaces
  2. Computational Complexity: Marginal likelihood computation for neural network basis functions is difficult
  3. Theoretical Gap: Complete theoretical analysis in high dimensions remains to be developed

Future Directions

  1. Adaptive Priors: Develop prior specifications that automatically adjust to data structure
  2. Deep Learning Extension: Extend the framework to deep learning where parameter numbers far exceed sample size
  3. Computational Methods: Develop efficient approximate inference techniques for high-dimensional settings

In-Depth Evaluation

Strengths

  1. Theoretical Innovation: First to provide a systematic Bayesian theoretical framework for the double descent phenomenon
  2. Problem Resolution: Elegantly resolves the apparent conflict between Occam's razor and double descent
  3. Method Connection: Successfully bridges classical statistical methods and modern machine learning
  4. Illustrative Experiments: The polynomial regression experiments clearly demonstrate the theoretical predictions

Weaknesses

  1. Application Limitations: Primarily limited to relatively simple regression settings; deep learning applications require further development
  2. Computational Challenges: Practical computation in high dimensions remains difficult
  3. Prior Sensitivity: Method success highly depends on appropriate prior selection

Impact

  1. Theoretical Contribution: Provides important Bayesian perspective for understanding modern machine learning phenomena
  2. Practical Value: Offers theoretical support for using overparameterized models
  3. Research Inspiration: Opens new directions for applying Bayesian methods in modern machine learning

Applicable Scenarios

  1. Regression Problems: Particularly high-dimensional regression and function approximation
  2. Model Selection: Scenarios requiring selection among multiple complexity levels
  3. Uncertainty Quantification: Applications requiring both prediction and uncertainty estimation

References

Key works cited in the paper include:

  • Belkin et al. (2019): Pioneering work on double descent phenomenon
  • MacKay (1992): Classical literature on Bayesian interpolation
  • Polson & Scott (2012): Global-local shrinkage methods
  • Young (1977), Deaton (1980): Early work on Bayesian polynomial regression

This paper is theoretically significant, providing a new Bayesian perspective for understanding the double descent phenomenon in modern machine learning. While challenges remain in practical applications, it establishes a solid theoretical foundation for future research.