Two-Point Deterministic Equivalence for Stochastic Gradient Dynamics in Linear Models
Atanasov, Bordelon, Zavatone-Veth et al.
We derive a novel deterministic equivalence for the two-point function of a random matrix resolvent. Using this result, we give a unified derivation of the performance of a wide variety of high-dimensional linear models trained with stochastic gradient descent. This includes high-dimensional linear regression, kernel regression, and linear random feature models. Our results include previously known asymptotics as well as novel ones.
academic
Two-Point Deterministic Equivalence for Stochastic Gradient Dynamics in Linear Models
Title: Two-Point Deterministic Equivalence for Stochastic Gradient Dynamics in Linear Models
Authors: Alexander Atanasov, Blake Bordelon, Jacob A. Zavatone-Veth, Courtney Paquette, Cengiz Pehlevan (from Harvard University, McGill University, and other institutions)
This paper proposes a novel deterministic equivalence theory for two-point functions of random matrix analytic resolvents. Based on this result, the authors provide a unified derivation of performance characteristics for various high-dimensional linear models under stochastic gradient descent (SGD) training, including high-dimensional linear regression, kernel regression, and linear random feature models. The results encompass known asymptotic behaviors as well as new theoretical findings.
A core phenomenon in modern deep learning is that model performance exhibits predictable power-law behavior (neural scaling laws) as data scale, model size, and computational resources increase. Understanding the theoretical foundations of this scaling behavior is an important challenge in machine learning theory.
Need for a Unified Theoretical Framework: Existing work has separately studied the effects of finite width, finite data, and SGD noise using different methods (such as dynamic mean-field theory DMFT and deterministic equivalence techniques), lacking a unified framework.
Understanding Dynamics: Most theoretical analyses focus on static (infinite time) limits, with insufficient understanding of the training dynamics process.
Non-Commutativity Challenge: When the data covariance matrix Σ, empirical covariance Σ̂, and random feature matrix FF^T do not commute, traditional single-point deterministic equivalence methods fail.
Single-Point Deterministic Equivalence: Can only handle cases where matrices commute (e.g., infinite data P→∞ or linear regression without random features).
DMFT Method: While capable of handling general cases, it has high technical complexity and lacks direct connection to random matrix theory.
Scattered Results: Different works use different techniques to obtain partial results, lacking a unified mathematical framework.
This paper aims to develop a two-point deterministic equivalence theory to provide a unified mathematical framework for analyzing the complete dynamic behavior of SGD in high-dimensional linear models, including the joint effects of finite data, finite model size, and SGD noise.
Novel Two-Point Deterministic Equivalence Theory: First systematic derivation of deterministic equivalence formulas for two-point functions of random matrix analytic resolvents at different parameters (λ, λ').
Unified Dynamic Analysis Framework: Decomposes SGD dynamics into a forcing term (gradient flow) and an SGD kernel term, with analysis in the frequency domain via Fourier transform.
Recovery and Extension of Existing Results:
Recovers results obtained by Bordelon et al. 16 through DMFT
Recovers results by Paquette et al. 17 using single-point deterministic equivalence
Extends to new scenarios such as covariate shift
Connection to Free Probability Theory: Reveals a new interpretation of the S-transform as a response function in dynamic systems, establishing a bridge between deterministic equivalence and DMFT.
Planar Graph Expansion Technique: Systematically derives two-point equivalence formulas using planar graph expansions and free cumulants.
For random matrices (λ+AB)−1M(λ′+BA)−1, where A, M are deterministic matrices and B is a white Wishart matrix free from A, there exists a deterministic equivalence:
New Interpretation of S-Transform: Reveals the S-transform as a physical response function in dynamic systems, connecting free probability theory with dynamical systems theory.
Hierarchical Renormalization: In random feature models, frequencies undergo successive renormalizations ω→ω1→ω2, each corresponding to a random source.
Soft Limit Recovery of Statics: Elegantly recovers static results via limt→∞F(t)=limω,ω′→0(iω)(iω′)F(ω,ω′).
Note: This is a purely theoretical work, with verification primarily through mathematical derivation. Experimental validation mainly references numerical experiments from related works 16, 17.
Unified Framework: Two-point deterministic equivalence provides a unified mathematical framework for analyzing finite data, finite model size, and SGD noise jointly.
Theoretical Completeness: Recovers all known results (static ridge regression, DMFT dynamics, single-point deterministic equivalence) and extends to new scenarios (dynamics of covariate shift).
Methodological Contribution: The combination of planar graph expansion and free probability theory provides new computational tools for random matrix theory.
Physical Insights: Reveals the deep meaning of S-transform as a response function and establishes a bridge between deterministic equivalence and DMFT.
16 B. Bordelon, A. Atanasov, and C. Pehlevan, "A dynamical model of neural scaling laws," ICML 2024.
17 E. Paquette, C. Paquette, L. Xiao, and J. Pennington, "4 + 3 phases of compute-optimal neural scaling laws," arXiv:2405.15074, 2024.
20 A. Atanasov, J. A. Zavatone-Veth, and C. Pehlevan, "Scaling and renormalization in high-dimensional regression," arXiv:2405.00592, 2024.
24 M. Potters and J.-P. Bouchaud, "A first course in random matrix theory," Cambridge University Press, 2020.
26 A. Atanasov, J. A. Zavatone-Veth, and C. Pehlevan, "Risk and cross validation in ridge regression with correlated samples," arXiv:2408.04607, 2024.
Overall Assessment: This is an excellent paper with exceptional theoretical depth, providing a unified and elegant mathematical framework for SGD dynamics in high-dimensional linear models. The derivation of two-point deterministic equivalence is an important theoretical contribution, and the planar graph method demonstrates strong technical prowess. While direct applications are limited and readability presents challenges, the work has significant long-term value for machine learning theory development. Subsequent work should supplement numerical verification, provide practical algorithms, and explore extensions to non-linear models.