
Bayesian Double Descent

Polson, Sokolov

Double descent is a phenomenon of over-parameterized statistical models, such as deep neural networks, whose risk function has a re-descending property. As the complexity of the model increases, the risk first exhibits a U-shaped region due to the traditional bias-variance trade-off; then, when the number of parameters equals the number of observations, the model becomes one of interpolation and the risk can be unbounded; finally, in the over-parameterized region, the risk re-descends: the double descent effect. Our goal is to show that this has a natural Bayesian interpretation. We also show that this is not in conflict with the traditional Occam's razor: simpler models are preferred to complex ones, all else being equal. Our theoretical foundations use Bayesian model selection, the Dickey-Savage density ratio, and connect generalized ridge regression and global-local shrinkage methods with double descent. We illustrate our approach for high dimensional neural networks and provide detailed treatments of infinite Gaussian means models and non-parametric regression. Finally, we conclude with directions for future research.

Basic Information

Paper ID: 2507.07338
Title: Bayesian Double Descent
Authors: Nick Polson (University of Chicago Booth School), Vadim Sokolov (George Mason University)
Classification: stat.ML cs.LG stat.CO
Publication Date: First Draft: December 25, 2024; This Draft: October 16, 2025
Paper Link: https://arxiv.org/abs/2507.07338

Abstract

Double descent is a re-descent phenomenon exhibited in the risk function of overparameterized statistical models (such as deep neural networks). As model complexity increases, the risk function exhibits a U-shaped region due to the traditional bias-variance tradeoff. When the number of parameters equals the number of observations, the model becomes an interpolating model with potentially unbounded risk. Subsequently, the risk decreases again in the overparameterized region—this is the double descent effect. This paper aims to demonstrate that this phenomenon has a natural Bayesian interpretation and prove that it does not conflict with the traditional Occam's razor principle. The theoretical foundation employs Bayesian model selection, Dickey-Savage density ratios, and connects generalized ridge regression and global-local shrinkage methods to double descent.

Research Background and Motivation

Core Problems

Missing Bayesian Interpretation of Double Descent: The double descent phenomenon has been primarily studied from a frequentist perspective, lacking a systematic Bayesian theoretical framework
Apparent Conflict Between Occam's Razor and Double Descent: Bayesian methods favor simpler models, while double descent suggests that complex models may perform better
Insufficient Theoretical Understanding of Overparameterized Models: When the number of parameters exceeds the sample size, traditional statistical theory breaks down

Research Significance

Theoretical Unification: Provides a unified Bayesian theoretical framework for the double descent phenomenon
Practical Guidance: Offers theoretical support for modern machine learning methods such as deep learning
Methodological Contribution: Bridges classical statistical theory and modern machine learning practice

Limitations of Existing Methods

Limitations of Frequentist Perspective: Existing research primarily focuses on minimum L2-norm estimators, overlooking the role of prior regularization
Failure of BIC Approximation: When p > n, Laplace approximation (BIC) performs poorly
Invalidity of Empirical Risk Bounds: For interpolators, empirical risk is zero, rendering traditional bounds meaningless

Core Contributions

Establishes Bayesian Theoretical Framework for Double Descent: Proves that the conditional prior p(θ_M|M) is the key driver of the double descent phenomenon
Resolves Occam's Razor Paradox: Demonstrates that Bayesian Occam's razor does not conflict with the double descent phenomenon
Connects Classical Methods with Modern Techniques: Links generalized ridge regression and global-local shrinkage methods to double descent
Provides Computational Equivalence Theorem: Achieves computational equivalence of nested models through Dickey-Savage density ratios
Extends to Neural Networks: Applies the theoretical framework to high-dimensional neural network regression

Methodology Details

Task Definition

Studies the behavior of risk functions in overparameterized regression models, particularly the double descent phenomenon of Bayesian risk R(M) as model complexity M varies:

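Bayesian Double Descent Definition: Let R(M) = E_{y,θ|M}(θ̂_M(y) - θ)² be the Bayes risk of the estimator θ̂_M under model M, averaged over the data y and over the conditional prior p(θ_M|M). Bayesian double descent means that R(M) descends again once the model dimension M exceeds the sample size n.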

Theoretical Framework

1. Bayesian Model Complexity Framework

Joint Posterior Decomposition:

P(θ_M, M | D) = P(θ_M | M, D)P(M | D)

Evidence (Marginal Likelihood):

p(D|M) = ∫_{Θ_M} p(D | θ_M, M)p(θ_M|M)dθ_M

Key Insight: The conditional prior p(θ_M|M) influences Bayesian risk through the marginalization process, acting as implicit regularization in the overparameterized region.
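The role of the conditional prior in the evidence integral can be illustrated with a minimal sketch. It assumes a linear Gaussian model y = Xθ + ε with ε ~ N(0, σ²I) and an isotropic prior θ ~ N(0, τ²I); neither this prior nor the name log_evidence nor the values of σ and τ come from the paper. Under these assumptions the integral has the closed form y ~ N(0, σ²I + τ²XXᵀ).

```python
import numpy as np

def log_evidence(X, y, sigma=0.3, tau=1.0):
    """Log marginal likelihood of y for y = X theta + eps,
    eps ~ N(0, sigma^2 I), theta ~ N(0, tau^2 I) (assumed prior)."""
    n = X.shape[0]
    # Integrating theta out leaves a zero-mean Gaussian on y.
    cov = sigma**2 * np.eye(n) + tau**2 * X @ X.T
    _, logdet = np.linalg.slogdet(cov)
    quad = y @ np.linalg.solve(cov, y)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + quad)

rng = np.random.default_rng(0)
n, p = 20, 50                          # more parameters than observations
X = rng.normal(size=(n, p)) / np.sqrt(p)
y = X @ rng.normal(size=p) + 0.3 * rng.normal(size=n)
print(log_evidence(X, y))
```

The evidence stays finite even though p > n, because the prior makes the integral over θ proper. In this sketch that is the sense in which the prior regularizes the overparameterized model.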

2. Model Nesting and Computational Equivalence Theorem

Theorem 3.1 (Model Nesting and Computational Equivalence): Suppose submodel m is nested in the full model M through the consistency conditions:

p(θ_m|m) = p(θ_m|θ_{m+1:M} = 0, M)
p(y|θ_m, m) = p(y|θ_m, θ_{m+1:M} = 0, M)

Then the function estimate of submodel m can be computed from the overparameterized full model M:

f̂_m(x) = E[f(x)|θ_{m+1:M} = 0, M, y]

Dickey-Savage Density Ratio:

p(y|m)/p(y|M) = p(θ_{m+1:M} = 0|y, M)/p(θ_{m+1:M} = 0|M)
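The density ratio can be checked numerically in the simplest nested pair. The sketch below is an illustration under stated assumptions, not the paper's setup: observations y_i ~ N(θ, σ²), a full model M with prior θ ~ N(0, τ²), and a submodel m that fixes θ = 0 (so θ_m is empty and θ_{m+1:M} is the scalar θ). The function names and data values are hypothetical.

```python
import math

def log_normal_pdf(x, mean, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def log_bayes_factor_direct(y, sigma2, tau2):
    """log p(y|m) - log p(y|M), each evidence computed in closed form."""
    n, s, ss = len(y), sum(y), sum(v * v for v in y)
    log_ev_m = sum(log_normal_pdf(v, 0.0, sigma2) for v in y)
    # Under M, y ~ N(0, sigma2 * I + tau2 * 1 1^T); determinant and
    # quadratic form follow from the rank-one structure.
    logdet = (n - 1) * math.log(sigma2) + math.log(sigma2 + n * tau2)
    quad = (ss - tau2 * s * s / (sigma2 + n * tau2)) / sigma2
    log_ev_M = -0.5 * (n * math.log(2 * math.pi) + logdet + quad)
    return log_ev_m - log_ev_M

def log_bayes_factor_savage_dickey(y, sigma2, tau2):
    """log of posterior density at theta = 0 over prior density at theta = 0."""
    n, s = len(y), sum(y)
    post_var = 1.0 / (1.0 / tau2 + n / sigma2)
    post_mean = post_var * s / sigma2
    return log_normal_pdf(0.0, post_mean, post_var) - log_normal_pdf(0.0, 0.0, tau2)

y = [0.3, -0.1, 0.4, 0.2, -0.2]
print(log_bayes_factor_direct(y, 1.0, 4.0))
print(log_bayes_factor_savage_dickey(y, 1.0, 4.0))
```

Both routes give the same value, so the evidence of the submodel never has to be computed separately: the posterior of the full model evaluated at the restriction is enough.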

3. Limitations of BIC Approximation

When p < n, the Laplace approximation yields the BIC:

log p(D|M) ≈ log p(D|θ̂, M) - (p/2)log n

where p is the number of parameters and θ̂ is the maximum likelihood estimate. However, when p > n, this approximation fails, and the influence of the prior p(θ|M) on Bayesian risk becomes significant.
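A sketch of where the penalty term comes from, using the standard Laplace argument rather than anything specific to the paper. Here p denotes the number of free parameters (the multiplier of the log n penalty), H is the negative Hessian of log p(D|θ, M) + log p(θ|M) at the mode θ̂, and I_1 is the per-observation Fisher information; these symbols are introduced only for this illustration.

```latex
\begin{aligned}
\log p(D \mid M)
  &\approx \log p(D \mid \hat\theta, M) + \log p(\hat\theta \mid M)
     + \frac{p}{2}\log 2\pi - \frac{1}{2}\log\lvert H \rvert, \\
\log\lvert H \rvert
  &\approx p \log n + \log\lvert I_1(\hat\theta) \rvert
  \quad\text{when } H \approx n\, I_1(\hat\theta), \\
\log p(D \mid M)
  &\approx \log p(D \mid \hat\theta, M) - \frac{p}{2}\log n + O(1).
\end{aligned}
```

The last step drops the prior term as O(1). In a linear model with p > n, the likelihood curvature XᵀX has rank at most n, so H is invertible only because of the prior's curvature, and the prior term can no longer be discarded.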

Generalized Ridge Regression Connection

Orthogonal Decomposition Representation

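Singular value decomposition of the design matrix, X = PΛQ^T, so that Q^T X^T X Q = Λ². In the canonical coordinates γ = Q^T θ, with least-squares estimates γ̂_i, generalized ridge regression shrinks each component separately: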

γ*_i = (λ²_i)/(λ²_i + k_i) γ̂_i

where k_i is the local shrinkage parameter, corresponding to the local scale of the global-local shrinkage model.
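The shrinkage rule can be read as a posterior mean. The derivation below is a sketch under an assumed canonical model that the summary does not state explicitly: γ̂_i given γ_i is N(γ_i, σ²/λ²_i), and the prior is γ_i ~ N(0, σ²/k_i).

```latex
\begin{aligned}
\hat\gamma_i \mid \gamma_i &\sim \mathcal{N}\!\left(\gamma_i,\ \sigma^2/\lambda_i^2\right),
\qquad
\gamma_i \sim \mathcal{N}\!\left(0,\ \sigma^2/k_i\right), \\
\gamma_i^{*} = \mathbb{E}[\gamma_i \mid \hat\gamma_i]
 &= \frac{\lambda_i^2/\sigma^2}{\lambda_i^2/\sigma^2 + k_i/\sigma^2}\,\hat\gamma_i
  = \frac{\lambda_i^2}{\lambda_i^2 + k_i}\,\hat\gamma_i .
\end{aligned}
```

Setting k_i = k for every i recovers ordinary ridge regression, while letting k_i vary by component gives the local scales of a global-local prior.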

Optimal Shrinkage Parameters

Maximizing the marginal likelihood p(z_i|k_i, σ²) over k_i gives:

k̂_i = (λ²_i σ²)/(z²_i - σ²) for z²_i > σ²
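A minimal sketch of how this estimate acts on a single component. It assumes z_i = λ_i γ̂_i with marginal distribution N(0, σ²(1 + λ²_i/k_i)), which is consistent with the formula above but is not spelled out in the summary. The treatment of z²_i ≤ σ² as k̂_i = ∞ (full shrinkage) and the function names are likewise assumptions made for illustration.

```python
import math

def optimal_k(lam2, z, sigma2):
    """Marginal-likelihood estimate of the local shrinkage parameter k_i.

    lam2 is lambda_i^2, z is z_i, sigma2 is the noise variance.
    """
    z2 = z * z
    if z2 <= sigma2:
        # Assumed boundary case: the marginal variance cannot drop below
        # sigma2, so the maximizer sits at k = infinity.
        return math.inf
    return lam2 * sigma2 / (z2 - sigma2)

def shrink_factor(lam2, k):
    """Multiplier lambda^2 / (lambda^2 + k) applied to the least-squares estimate."""
    if math.isinf(k):
        return 0.0
    return lam2 / (lam2 + k)

k_hat = optimal_k(lam2=4.0, z=2.0, sigma2=1.0)
print(k_hat, shrink_factor(4.0, k_hat))
```

Substituting k̂_i into the shrinkage factor gives 1 - σ²/z²_i when z²_i > σ², and zero otherwise under the boundary assumption. Components with a weak signal relative to noise are therefore shrunk hardest, whatever the value of λ_i.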

Neural Network Extension

Hierarchical Bayesian Specification:

y_i = Σ_{j=1}^M θ_j φ_j(x_i; w) + ε_i
θ_j ~ N(0, σ²_j)
w ~ p(w)
σ²_j ~ p(σ²_j)

This allows adaptive learning of basis functions while maintaining the Bayesian model selection framework.
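Conditional on the basis parameters w and the variances σ²_j, the specification is linear in θ, so the posterior mean of θ is a generalized ridge estimate. The sketch below illustrates only that conditional step. It fixes w at a single random draw instead of learning it, uses tanh units as the basis φ_j, and picks σ²_j = 1/M; all three choices, and the function names, are assumptions for illustration, not the paper's choices.

```python
import numpy as np

def tanh_features(x, w, b):
    """Basis matrix with entries phi_j(x_i; w) = tanh(w_j * x_i + b_j), shape (n, M)."""
    return np.tanh(np.outer(x, w) + b)

def theta_posterior_mean(Phi, y, prior_var, noise_var):
    """Posterior mean of theta for theta_j ~ N(0, prior_var[j]) and Gaussian noise."""
    A = Phi.T @ Phi + noise_var * np.diag(1.0 / prior_var)
    return np.linalg.solve(A, Phi.T @ y)

rng = np.random.default_rng(1)
n, M = 20, 100                          # M basis functions, M > n
x = np.linspace(-1.0, 1.0, n)
y = np.sin(5 * x) + 0.3 * rng.normal(size=n)

w = 3.0 * rng.normal(size=M)            # one draw from an assumed p(w)
b = rng.normal(size=M)
Phi = tanh_features(x, w, b)

prior_var = np.full(M, 1.0 / M)         # assumed sigma_j^2
theta_hat = theta_posterior_mean(Phi, y, prior_var, noise_var=0.3**2)
print(theta_hat.shape)
```

The matrix Φ^T Φ is singular when M > n, and the prior term σ² diag(1/σ²_j) is what makes the system solvable. A full treatment would also average over or optimize w and σ²_j.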

Experimental Setup

Polynomial Regression Experiment

Data Generation:

True function: f(x) = sin(5x), observed as y_i = sin(5x_i) + ε_i, ε_i ~ N(0, 0.3²)
Sample size: n = 20
Model complexity: d = 1, 2, ..., 50

Basis Function Selection: Uses Legendre polynomial basis functions, providing numerically stable orthogonal bases.

Estimation Method: Uses Moore-Penrose pseudoinverse, providing minimum-norm solutions in the overparameterized regime.
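The experiment described above can be sketched as follows. Several details are assumptions, since the summary does not give them: the training inputs are equally spaced on [-1, 1], d is read as the number of Legendre basis functions (degrees 0 to d - 1), and test error is measured against the noiseless function on a grid. The seed and the name fit_min_norm are arbitrary.

```python
import numpy as np
from numpy.polynomial import legendre

rng = np.random.default_rng(0)
n, sigma = 20, 0.3
x_train = np.linspace(-1.0, 1.0, n)              # assumed design points
y_train = np.sin(5 * x_train) + sigma * rng.normal(size=n)
x_test = np.linspace(-1.0, 1.0, 200)
y_test = np.sin(5 * x_test)                      # noiseless target

def fit_min_norm(d):
    """Fit d Legendre basis functions; return (train MSE, test MSE)."""
    X = legendre.legvander(x_train, d - 1)       # shape (n, d)
    # The pseudoinverse gives least squares when d <= n and the
    # minimum-norm interpolant when d > n.
    theta = np.linalg.pinv(X) @ y_train
    X_test = legendre.legvander(x_test, d - 1)
    train_mse = np.mean((X @ theta - y_train) ** 2)
    test_mse = np.mean((X_test @ theta - y_test) ** 2)
    return train_mse, test_mse

results = [fit_min_norm(d) for d in range(1, 51)]
for d in (5, 20, 50):
    print(d, results[d - 1])
```

Plotting the second entry of each pair against d is the natural way to look for the three regimes. The exact shape of the curve depends on the seed and on the assumed design, so the sketch makes no claim about specific error values.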

Bayesian Polynomial Regression

Young Method:

Prior: C = diag(δ², τ²/λ²₁, ..., τ²/λ²_q)
Posterior: θ | D, σ², C ~ N(θ̂_post, Σ_post)

Deaton Method:

Ordering constraint: σ²₀ ≥ σ²₁ ≥ ... ≥ σ²_p
Pool Adjacent Violators Algorithm (PAVA) adjustment of unconstrained MAP estimates
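The ordering step can be sketched with a generic pool-adjacent-violators routine. This is the textbook unweighted least-squares version, shown for illustration; whether the paper weights the estimates or applies the adjustment on another scale is not stated in the summary, and the function names are hypothetical.

```python
def pava_nondecreasing(values):
    """Least-squares projection of a sequence onto non-decreasing sequences."""
    blocks = []                                   # each block is [sum, count]
    for v in values:
        blocks.append([float(v), 1])
        # Merge while the previous block mean exceeds the current block mean.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted

def pava_nonincreasing(values):
    """Enforce v_0 >= v_1 >= ... by reversing, fitting, and reversing back."""
    return pava_nondecreasing(list(values)[::-1])[::-1]

unconstrained = [3.0, 1.0, 2.0]                   # hypothetical variance estimates
print(pava_nonincreasing(unconstrained))          # → [3.0, 1.5, 1.5]
```

Adjacent estimates that violate the ordering are replaced by their average, so the adjusted sequence satisfies the constraint while staying as close as possible to the unconstrained estimates in squared error.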

Experimental Results

Double Descent Phenomenon Verification

Three Stages:

Classical Region (d < 5): Increasing complexity reduces bias and test error
Interpolation Crisis (d ≈ n = 20): Test error reaches peak, model perfectly fits training data but generalizes poorly
Overparameterized Region (d > 30): Test error decreases again, extreme overparameterization improves generalization

Key Findings

Implicit Regularization Effect: Minimum-norm solutions exhibit implicit bias toward simple functions in overparameterized settings
Bayesian Advantage: Through appropriate prior specification, Bayesian methods perform well across all regions
Computational Efficiency: Can directly use the largest possible model, avoiding time-consuming model selection

Marginal Likelihood Behavior

For models with true polynomial degree p_true = 10, marginal likelihood peaks at the corresponding complexity, validating the effectiveness of Bayesian Occam's razor.

Related Work

Frequentist Research

Belkin et al. (2019): First observation of double descent in linear regression
Bach (2024): Extension to stochastic regression models
Hastie et al. (2022): Study of interpolator properties

Bayesian Methods

MacKay (1992): Bayesian interpolation and hyperparameter regularization
Polson & Scott (2012): Global-local shrinkage framework
Young (1977), Deaton (1980): Bayesian methods for polynomial regression

Bias-Variance Tradeoff

Geman et al. (1992): Bias-variance tradeoff in neural networks
Efron & Morris (1973): Advantages of shrinkage estimators

Conclusions and Discussion

Main Conclusions

Theoretical Unification: The double descent phenomenon has a natural Bayesian interpretation, driven by the conditional prior p(θ_M|M)
Occam's Razor Compatibility: Marginal likelihood still favors simpler models, but conditional priors can provide good risk properties in the overparameterized region
Practical Guidance: Recommends using the largest possible model, relying on automatic regularization of the Bayesian framework

Limitations

Prior Specification Challenge: Requires specifying joint parameter priors over complex spaces
Computational Complexity: Marginal likelihood computation for neural network basis functions is difficult
Theoretical Gap: Complete theoretical analysis in high dimensions remains to be developed

Future Directions

Adaptive Priors: Develop prior specifications that automatically adjust to data structure
Deep Learning Extension: Extend the framework to deep learning where parameter numbers far exceed sample size
Computational Methods: Develop efficient approximate inference techniques for high-dimensional settings

In-Depth Evaluation

Strengths

Theoretical Innovation: First to provide a systematic Bayesian theoretical framework for the double descent phenomenon
Problem Resolution: Elegantly resolves the apparent conflict between Occam's razor and double descent
Method Connection: Successfully bridges classical statistical methods and modern machine learning
Supporting Experiments: The polynomial regression experiments clearly illustrate the theoretical predictions

Weaknesses

Application Limitations: Primarily limited to relatively simple regression settings; deep learning applications require further development
Computational Challenges: Practical computation in high dimensions remains difficult
Prior Sensitivity: Method success highly depends on appropriate prior selection

Impact

Theoretical Contribution: Provides important Bayesian perspective for understanding modern machine learning phenomena
Practical Value: Offers theoretical support for using overparameterized models
Research Inspiration: Opens new directions for applying Bayesian methods in modern machine learning

Applicable Scenarios

Regression Problems: Particularly high-dimensional regression and function approximation
Model Selection: Scenarios requiring selection among multiple complexity levels
Uncertainty Quantification: Applications requiring both prediction and uncertainty estimation

References

This paper cites numerous important works, including:

Belkin et al. (2019): Pioneering work on double descent phenomenon
MacKay (1992): Classical literature on Bayesian interpolation
Polson & Scott (2012): Global-local shrinkage methods
Young (1977), Deaton (1980): Early work on Bayesian polynomial regression

This paper is theoretically significant, providing a new Bayesian perspective for understanding the double descent phenomenon in modern machine learning. While challenges remain in practical applications, it establishes a solid theoretical foundation for future research.