Double descent is a phenomenon of over-parameterized statistical models, such as deep neural networks, whose risk function has a re-descending property. As the complexity of the model increases, the risk first exhibits a U-shaped region due to the traditional bias-variance trade-off. Then, as the number of parameters equals the number of observations, the model becomes one of interpolation, where the risk can be unbounded. Finally, in the over-parameterized region, the risk re-descends: the double descent effect. Our goal is to show that this has a natural Bayesian interpretation. We also show that this is not in conflict with the traditional Occam's razor: simpler models are preferred to complex ones, all else being equal. Our theoretical foundations use Bayesian model selection and the Dickey-Savage density ratio, and connect generalized ridge regression and global-local shrinkage methods with double descent. We illustrate our approach for high dimensional neural networks and provide detailed treatments of infinite Gaussian means models and non-parametric regression. Finally, we conclude with directions for future research.
Double descent is a re-descent phenomenon in the risk function of overparameterized statistical models such as deep neural networks. As model complexity increases, the risk first traces a U-shaped region due to the traditional bias-variance tradeoff. When the number of parameters equals the number of observations, the model becomes an interpolating model with potentially unbounded risk. The risk then decreases again in the overparameterized region; this second decrease is the double descent effect. The paper aims to show that this phenomenon has a natural Bayesian interpretation and that it does not conflict with the traditional Occam's razor principle. The theoretical foundation uses Bayesian model selection and Dickey-Savage density ratios, and connects generalized ridge regression and global-local shrinkage methods to double descent.
Missing Bayesian Interpretation of Double Descent: The double descent phenomenon has been primarily studied from a frequentist perspective, lacking a systematic Bayesian theoretical framework
Apparent Conflict Between Occam's Razor and Double Descent: Bayesian methods favor simpler models, while double descent suggests that complex models may perform better
Insufficient Theoretical Understanding of Overparameterized Models: When the number of parameters exceeds the sample size, traditional statistical theory breaks down
Limitations of Frequentist Perspective: Existing research primarily focuses on minimum L2-norm estimators, overlooking the role of prior regularization
Failure of BIC Approximation: When p > n, the Laplace approximation that underlies BIC performs poorly
Invalidity of Empirical Risk Bounds: For interpolators, empirical risk is zero, rendering traditional bounds meaningless
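The last two limitations can be made concrete with a small numerical sketch. It is an illustration under assumed settings, not the paper's experiment: the sizes n = 20 and p = 50, the Gaussian design, and the helper name min_norm_fit are all hypothetical. It fits the minimum L2-norm estimator through the pseudoinverse and checks that, once p > n, the training error is numerically zero, which is why empirical risk carries no information about an interpolator.

```python
import numpy as np


def min_norm_fit(X, y):
    """Least-squares fit via the pseudoinverse.

    When p > n and X has full row rank, this is the minimum L2-norm
    solution among all coefficient vectors that interpolate the data.
    """
    return np.linalg.pinv(X) @ y


rng = np.random.default_rng(0)
n, p = 20, 50  # hypothetical sizes chosen so that p > n
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

theta_hat = min_norm_fit(X, y)
train_mse = float(np.mean((X @ theta_hat - y) ** 2))

# Build a second interpolator by adding a vector from the null space of X.
# It fits the training data equally well but has a larger L2 norm.
null_dir = rng.normal(size=p)
null_dir -= np.linalg.pinv(X) @ (X @ null_dir)  # remove the row-space part
theta_other = theta_hat + null_dir
other_mse = float(np.mean((X @ theta_other - y) ** 2))

print(f"training MSE, min-norm fit:       {train_mse:.2e}")
print(f"training MSE, other interpolator: {other_mse:.2e}")
print(f"L2 norms: {np.linalg.norm(theta_hat):.3f} vs {np.linalg.norm(theta_other):.3f}")
```

Both fits have zero empirical risk, so the training data alone cannot distinguish them. The choice among interpolators comes from the norm penalty, which is the role that prior regularization plays in the Bayesian account.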
Establishes Bayesian Theoretical Framework for Double Descent: Proves that the conditional prior p(θ_M|M) is the key driver of the double descent phenomenon
Resolves Occam's Razor Paradox: Demonstrates that Bayesian Occam's razor does not conflict with the double descent phenomenon
Connects Classical Methods with Modern Techniques: Links generalized ridge regression and global-local shrinkage methods to double descent
Provides Computational Equivalence Theorem: Achieves computational equivalence of nested models through Dickey-Savage density ratios
Extends to Neural Networks: Applies the theoretical framework to high-dimensional neural network regression
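For reference, the density ratio named in the contributions above is usually stated as follows; in the literature it is more often called the Savage-Dickey density ratio. This is the textbook identity, not a restatement of the paper's equivalence theorem, and the symbols are assumptions introduced here: M_0 is the nested model obtained from M_1 by fixing θ = θ_0, and φ denotes the remaining parameters shared by both models.

```latex
\mathrm{BF}_{01}
  = \frac{p(y \mid M_0)}{p(y \mid M_1)}
  = \frac{p(\theta = \theta_0 \mid y, M_1)}{p(\theta = \theta_0 \mid M_1)},
\qquad \text{provided } p(\phi \mid M_0) = p(\phi \mid \theta = \theta_0, M_1).
```

The practical appeal is that the Bayes factor for the nested model can be read off from the posterior and prior densities of the larger model at θ_0, without fitting M_0 separately.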
The paper studies the behavior of risk functions in overparameterized regression models, in particular the double descent of the Bayesian risk R(M) as the model complexity M varies:
Bayesian Double Descent Definition: Let R(M) = E_{y,θ|M}(θ̂_M(y) - θ)² be the conditional prior Bayesian risk of the estimator θ̂_M under model M. Bayesian double descent occurs when R(M) re-descends for M > n, that is, once the model complexity exceeds the sample size n.
Key Insight: The conditional prior p(θ_M|M) influences Bayesian risk through the marginalization process, acting as implicit regularization in the overparameterized region.
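To see how a prior enters a risk of the form E_{y,θ}(θ̂(y) - θ)², the following sketch estimates it by Monte Carlo in the simplest conjugate case. The setting is an assumption made for illustration and is much simpler than the paper's models: a scalar θ ~ N(0, τ²), one observation y | θ ~ N(θ, σ²), and the posterior mean θ̂(y) = τ²/(τ² + σ²) · y as the estimator. The function names are hypothetical. In this case the Bayes risk has the closed form τ²σ²/(τ² + σ²), which the simulation should approximately reproduce.

```python
import numpy as np


def bayes_risk_mc(tau2, sigma2, n_draws=200_000, seed=0):
    """Monte Carlo estimate of E_{y,theta}(theta_hat(y) - theta)^2.

    theta is drawn from the prior N(0, tau2), y from N(theta, sigma2),
    and theta_hat is the posterior mean under that prior.
    """
    rng = np.random.default_rng(seed)
    theta = rng.normal(0.0, np.sqrt(tau2), size=n_draws)
    y = theta + rng.normal(0.0, np.sqrt(sigma2), size=n_draws)
    shrink = tau2 / (tau2 + sigma2)  # posterior-mean shrinkage factor
    theta_hat = shrink * y
    return float(np.mean((theta_hat - theta) ** 2))


def bayes_risk_exact(tau2, sigma2):
    """Closed-form Bayes risk of the posterior mean in the normal-normal model."""
    return tau2 * sigma2 / (tau2 + sigma2)


mc = bayes_risk_mc(tau2=1.0, sigma2=1.0)
exact = bayes_risk_exact(tau2=1.0, sigma2=1.0)
print(f"Monte Carlo: {mc:.4f}, closed form: {exact:.4f}")
```

The expectation averages over both the data and the prior draw of θ, so changing τ² changes the risk even though the likelihood is untouched. That dependence on the prior is the mechanism the key insight attributes to p(θ_M|M).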
When the data are generated from a polynomial of true degree p_true = 10, the marginal likelihood peaks at that degree, which is consistent with Bayesian Occam's razor.
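A comparison of this kind can be sketched as follows. This is not the paper's experiment: the Gaussian prior θ ~ N(0, τ² I), the known noise variance σ², the Legendre basis, the sample size, and the function name log_evidence are all assumptions chosen so that the marginal likelihood is available in closed form, since under them y ~ N(0, σ² I + τ² X Xᵀ). Only the true degree of 10 is taken from the text.

```python
import numpy as np


def log_evidence(X, y, tau2=1.0, sigma2=0.1):
    """Log marginal likelihood of y under y = X theta + noise.

    Assumes theta ~ N(0, tau2 * I) and noise ~ N(0, sigma2 * I), so that
    marginally y ~ N(0, sigma2 * I + tau2 * X X^T).
    """
    n = len(y)
    C = sigma2 * np.eye(n) + tau2 * (X @ X.T)
    _, logdet = np.linalg.slogdet(C)  # C is positive definite
    quad = y @ np.linalg.solve(C, y)
    return float(-0.5 * (n * np.log(2.0 * np.pi) + logdet + quad))


rng = np.random.default_rng(1)
n, true_degree = 100, 10  # true degree from the text; n is an assumption
x = np.linspace(-1.0, 1.0, n)
coef = rng.normal(size=true_degree + 1)
signal = np.polynomial.legendre.legval(x, coef)
y = signal + rng.normal(scale=np.sqrt(0.1), size=n)

# Score each candidate degree by its log marginal likelihood.
log_ev = {
    d: log_evidence(np.polynomial.legendre.legvander(x, d), y)
    for d in range(0, 21)
}
best_degree = max(log_ev, key=log_ev.get)
print("degree with highest log evidence:", best_degree)
```

The log-determinant term grows as columns are added to X, which penalizes extra flexibility that the data do not need. This is the Occam effect the marginal likelihood is expected to show, although where the maximum falls in any single simulated data set depends on the noise and the prior scale.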
Theoretical Unification: The double descent phenomenon has a natural Bayesian interpretation, driven by the conditional prior p(θ_M|M)
Occam's Razor Compatibility: Marginal likelihood still favors simpler models, but conditional priors can provide good risk properties in the overparameterized region
Practical Guidance: The paper recommends using the largest feasible model and relying on the automatic regularization that the Bayesian framework provides
This paper cites numerous important works, including:
Belkin et al. (2019): Pioneering work on the double descent phenomenon
MacKay (1992): Classical literature on Bayesian interpolation
Polson & Scott (2012): Global-local shrinkage methods
Young (1977), Deaton (1980): Early work on Bayesian polynomial regression
This paper is theoretically significant, providing a new Bayesian perspective for understanding the double descent phenomenon in modern machine learning. While challenges remain in practical applications, it establishes a solid theoretical foundation for future research.