Two-Point Deterministic Equivalence for Stochastic Gradient Dynamics in Linear Models

Atanasov, Bordelon, Zavatone-Veth et al.

We derive a novel deterministic equivalence for the two-point function of a random matrix resolvent. Using this result, we give a unified derivation of the performance of a wide variety of high-dimensional linear models trained with stochastic gradient descent. This includes high-dimensional linear regression, kernel regression, and linear random feature models. Our results include previously known asymptotics as well as novel ones.

Basic Information

Paper ID: 2502.05074
Title: Two-Point Deterministic Equivalence for Stochastic Gradient Dynamics in Linear Models
Authors: Alexander Atanasov, Blake Bordelon, Jacob A. Zavatone-Veth, Courtney Paquette, Cengiz Pehlevan (from Harvard University, McGill University, and other institutions)
Classification: cond-mat.dis-nn, cs.LG, stat.ML
Publication Date: arXiv v3, November 10, 2025
Paper Link: https://arxiv.org/abs/2502.05074v3

Abstract

This paper proposes a novel deterministic equivalence for two-point functions of random matrix resolvents. Based on this result, the authors give a unified derivation of the performance of various high-dimensional linear models trained with stochastic gradient descent (SGD), including high-dimensional linear regression, kernel regression, and linear random feature models. The results encompass known asymptotic behaviors as well as new theoretical findings.

Research Background and Motivation

Problems to Address

A core phenomenon in modern deep learning is that model performance exhibits predictable power-law behavior (neural scaling laws) as data scale, model size, and computational resources increase. Understanding the theoretical foundations of this scaling behavior is an important challenge in machine learning theory.

Importance of the Problem

Need for a Unified Theoretical Framework: Existing work has studied the effects of finite width, finite data, and SGD noise separately, using different methods (such as dynamical mean-field theory (DMFT) and deterministic equivalence techniques), and lacks a unified framework.
Understanding Dynamics: Most theoretical analyses focus on static (infinite-time) limits, leaving the training dynamics poorly understood.
Non-Commutativity Challenge: When the data covariance matrix Σ, empirical covariance Σ̂, and random feature matrix FF^T do not commute, traditional single-point deterministic equivalence methods fail.

Limitations of Existing Methods

Single-Point Deterministic Equivalence: Can only handle cases where matrices commute (e.g., infinite data P→∞ or linear regression without random features).
DMFT Method: While capable of handling general cases, it has high technical complexity and lacks direct connection to random matrix theory.
Scattered Results: Different works use different techniques to obtain partial results, lacking a unified mathematical framework.

Research Motivation

This paper aims to develop a two-point deterministic equivalence theory to provide a unified mathematical framework for analyzing the complete dynamic behavior of SGD in high-dimensional linear models, including the joint effects of finite data, finite model size, and SGD noise.

Core Contributions

Novel Two-Point Deterministic Equivalence Theory: First systematic derivation of deterministic equivalence formulas for two-point functions of random matrix resolvents evaluated at different parameters (λ, λ').
Unified Dynamic Analysis Framework: Decomposes SGD dynamics into a forcing term (gradient flow) and an SGD kernel term, with analysis in the frequency domain via Fourier transform.
Recovery and Extension of Existing Results:
- Recovers results obtained by Bordelon et al. 16 through DMFT
- Recovers results by Paquette et al. 17 using single-point deterministic equivalence
- Extends to new scenarios such as covariate shift
Connection to Free Probability Theory: Reveals a new interpretation of the S-transform as a response function in dynamic systems, establishing a bridge between deterministic equivalence and DMFT.
Planar Graph Expansion Technique: Systematically derives two-point equivalence formulas using planar graph expansions and free cumulants.

Detailed Methodology

Task Definition

Consider two classes of models:

1. Linear Regression: $f(x) = x^\top w$

2. Linear Random Feature Model: $f(x) = x^\top Fv = w^\top x, \quad w = Fv$

Where:

Input $x \in \mathbb{R}^D \sim \mathcal{N}(0, \Sigma)$
Random feature matrix $F \in \mathbb{R}^{D \times N}$ with i.i.d. elements $\sim \mathcal{N}(0, 1/N)$
Labels generated by a teacher model: $y_\mu = \bar{w}^\top x_\mu + \epsilon_\mu$ , where $\epsilon_\mu \sim \mathcal{N}(0, \sigma_\epsilon^2)$

Training Objective: Minimize empirical risk $\hat{R} = \frac{1}{P}\sum_{\mu=1}^P (y_\mu - f(x_\mu))^2$

via SGD updates (batch size B, learning rate η): $v_{t+1} = v_t - \eta \nabla_v \hat{R}_{B_t}$

Performance Metrics:

Training loss: $\hat{R}_t = \Delta w_t^\top \hat{\Sigma} \Delta w_t$
Test loss: $R_t = \Delta w_t^\top \Sigma \Delta w_t$
Where $\Delta w_t = \bar{w} - w_t$
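
The setup above can be simulated directly. The following is a minimal NumPy sketch, not the authors' code: the power-law spectrum chosen for $\Sigma$, the problem sizes, the zero initialization of $v$, and the function name `simulate_sgd` are all illustrative assumptions.

```python
import numpy as np

def simulate_sgd(D=40, N=30, P=200, B=8, eta=0.05, steps=400,
                 noise_std=0.1, seed=0):
    """Mini-batch SGD on a linear random feature model f(x) = x^T F v."""
    rng = np.random.default_rng(seed)
    # Assumed power-law spectrum for Sigma (illustrative choice only).
    spectrum = 1.0 / np.arange(1, D + 1)
    Sigma = np.diag(spectrum)
    X = rng.normal(size=(P, D)) * np.sqrt(spectrum)  # rows x_mu ~ N(0, Sigma)
    F = rng.normal(size=(D, N)) / np.sqrt(N)         # entries ~ N(0, 1/N)
    w_bar = rng.normal(size=D)                       # teacher weights
    y = X @ w_bar + noise_std * rng.normal(size=P)   # noisy teacher labels
    Sigma_hat = X.T @ X / P                          # empirical covariance

    v = np.zeros(N)
    train_loss, test_loss = [], []
    for _ in range(steps):
        dw = w_bar - F @ v                           # Delta w_t
        train_loss.append(dw @ Sigma_hat @ dw)       # Delta w^T Sigma_hat Delta w
        test_loss.append(dw @ Sigma @ dw)            # Delta w^T Sigma Delta w
        idx = rng.choice(P, size=B, replace=False)   # sample a mini-batch
        Phi = X[idx] @ F                             # batch features, shape (B, N)
        residual = y[idx] - Phi @ v
        grad = -2.0 * Phi.T @ residual / B           # gradient of batch MSE in v
        v = v - eta * grad
    return np.array(train_loss), np.array(test_loss)

train_curve, test_curve = simulate_sgd()
```

Averaging such curves over seeds gives empirical trajectories against which asymptotic predictions for $\hat{R}_t$ and $R_t$ could be compared.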

Core Theoretical Framework

1. Simplified Model of SGD Dynamics

By tracking second-order moments of weight differences $C_t = \mathbb{E}_{B_t}[\Delta w_t \Delta w_t^\top]$ , a Volterra integral equation is obtained in the continuous time limit:

$C_t \simeq e^{-\eta t FF^\top \hat{\Sigma}} \bar{w}\bar{w}^\top e^{-\eta t \hat{\Sigma}FF^\top} + \chi \int_0^t e^{-2(t-s)FF^\top\hat{\Sigma}} FF^\top \hat{\Sigma} FF^\top \text{Tr}[C_s\hat{\Sigma}]ds$

where $\chi = \eta/B$ is the SGD temperature parameter.

2. Forcing Term and Kernel Term Decomposition

Test loss can be decomposed as:

$R_t = \underbrace{\bar{w}^\top e^{-t\hat{\Sigma}FF^\top} \Sigma e^{-tFF^\top\hat{\Sigma}} \bar{w}}_{F(t) \text{ - forcing term}} + \underbrace{\chi \int_0^t \text{tr}[e^{-2(t-s)FF^\top\hat{\Sigma}}FF^\top\hat{\Sigma}FF^\top\Sigma]}_{K(t-s) \text{ - SGD kernel term}} \hat{R}_s ds$
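
The decomposition has the structure of a Volterra convolution equation. As a hedged illustration, the sketch below solves a generic scalar equation $y(t) = f(t) + \int_0^t K(t-s) y(s) ds$ on a uniform grid with the trapezoid rule. The function name `solve_volterra` and the example forcing and kernel are assumptions; in the paper's setting the forcing and kernel would come from the deterministic equivalents, which are not computed here.

```python
import numpy as np

def solve_volterra(forcing, kernel, dt):
    """Solve y(t) = f(t) + int_0^t K(t - s) y(s) ds on a uniform grid.

    forcing[i] = f(i * dt), kernel[i] = K(i * dt). The integral uses the
    trapezoid rule, so y[i] appears on both sides and is solved for.
    """
    n = len(forcing)
    y = np.zeros(n)
    y[0] = forcing[0]
    for i in range(1, n):
        # Interior points: sum_{j=1}^{i-1} K[i - j] * y[j]
        interior = np.dot(kernel[i - 1:0:-1], y[1:i])
        # Half weight on the s = 0 endpoint
        acc = 0.5 * kernel[i] * y[0] + interior
        # Half weight on the s = t_i endpoint is moved to the left-hand side
        y[i] = (forcing[i] + dt * acc) / (1.0 - 0.5 * dt * kernel[0])
    return y

# Illustrative inputs: a decaying forcing term and a decaying kernel.
t = np.linspace(0.0, 5.0, 501)
dt = t[1] - t[0]
loss = solve_volterra(np.exp(-t), 0.2 * np.exp(-2.0 * t), dt)
```

The same routine applies to the closed equation for the training loss; the test loss then follows from one further convolution of the test kernel with that solution.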

Key Insight: In Fourier space, all randomness enters through products of resolvents:

$F(\omega, \omega') = \bar{w}^\top (\hat{\Sigma}FF^\top + i\omega)^{-1} \Sigma (FF^\top\hat{\Sigma} + i\omega')^{-1} \bar{w}$

When matrices do not commute, evaluation of two-point functions at different frequencies $(\omega, \omega')$ is required.
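
The link between matrix exponentials in time and resolvents in frequency can be checked numerically. The sketch below is an illustration under assumptions: it uses a small symmetric positive definite matrix `A` (a stand-in, not the paper's $\hat{\Sigma}FF^\top$) and the one-sided transform $\int_0^\infty e^{-i\omega t} e^{-tA} dt = (A + i\omega)^{-1}$, evaluated with a truncated trapezoid rule.

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.normal(size=(5, 5))
A = G @ G.T / 5 + 0.5 * np.eye(5)       # symmetric, eigenvalues >= 0.5
evals, evecs = np.linalg.eigh(A)

omega = 0.7
ts = np.linspace(0.0, 40.0, 20001)      # e^{-tA} is negligible by t = 40
dt = ts[1] - ts[0]
weights = np.full(ts.shape, dt)
weights[0] = weights[-1] = 0.5 * dt     # trapezoid end-point weights

# In the eigenbasis the integrand is exp(-(lambda_k + i*omega) * t).
decay = np.exp(-(evals[None, :] + 1j * omega) * ts[:, None])
per_mode = weights @ decay              # approximates 1 / (lambda_k + i*omega)
fourier = (evecs * per_mode) @ evecs.T  # rotate back to the original basis

resolvent = np.linalg.inv(A + 1j * omega * np.eye(5))
err = np.max(np.abs(fourier - resolvent))
```

A product of two exponentials at the same time $t$, as in the forcing term, transforms into a product of two such resolvents at independent frequencies, which is why a two-point function is needed.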

Derivation of Two-Point Deterministic Equivalence

Core Theorem

For the random matrix product $(\lambda+AB)^{-1}M(\lambda'+BA)^{-1}$ , where A and M are deterministic matrices and B is a white Wishart matrix free from A, there exists a deterministic equivalence:

$(\lambda+AB)^{-1}M(\lambda'+BA)^{-1} \simeq S_B S'_B \left[ G_A M G'_A + G_A A G'_A \frac{q \text{tr}[AG_A M G'_A]}{1 - q \text{df}_2(\kappa, \kappa')} \right]$

where:

$S_B = S_B(\text{df}_1^{AB}(\lambda))$ is the S-transform of B
$G_A = (\kappa + A)^{-1}$ , $\kappa = \lambda S_B$ is the signal capture threshold
$\text{df}_2(\kappa, \kappa') = \text{tr}[A^2 G_A G'_A]$ is the second-order degree of freedom
$q = N/P$ is the Wishart parameter
Primed quantities ( $S'_B$ , $G'_A$ , $\kappa'$ ) are the same objects evaluated at $\lambda'$

Derivation Strategy (Planar Graph Expansion)

Orthogonal Averaging: Write $B = OB'O^\top$ (B' diagonal) and average over the orthogonal group O.
Irreducible Graph Expansion: Expand the resolvent as chains of irreducible graphs connected through A/λ:

Diagram (simplified):
[1/S_B] --A/λ--> [1/S_B] --A/λ--> ...

Connected Graph Summation: Each irreducible graph is a sum of fully connected graphs involving free cumulants $\kappa_B^{(n)}$ :

$\frac{1}{S_B} = \sum_{n=1}^\infty \kappa_B^{(n)} \text{tr}[G_A BA]^{n-1}$

Handling M Insertion: Terms containing M produce self-consistent equations:

$X_M = S_B S'_B R_B[g, g'] \left( \text{tr}[G_A M G'_A] + X_M \text{tr}[G_A A^2 G'_A] \right)$

where the mixed R-transform $R_B[g, g'] = \sum_{n=1}^\infty \sum_{a+b=n} \kappa_B^{(n)} g^{a-1} g'^{b-1}$

Wishart Case Simplification: Due to $\kappa_B^{(a+b)} = q\kappa_B^{(a)}\kappa_B^{(b)}$ , the mixed R-transform factorizes.
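
As a worked check of the factorization, assume the sums run over $a, b \geq 1$ and that the white Wishart matrix is normalized so that its free cumulants are $\kappa_B^{(n)} = q^{n-1}$ (a standard normalization, taken here as an assumption since it is not stated above). Then $\kappa_B^{(a+b)} = q^{a+b-1} = q \kappa_B^{(a)} \kappa_B^{(b)}$ , and the double sum splits into a product of two single sums:

$R_B[g, g'] = \sum_{a,b \geq 1} \kappa_B^{(a+b)} g^{a-1} g'^{b-1} = q \left( \sum_{a \geq 1} \kappa_B^{(a)} g^{a-1} \right) \left( \sum_{b \geq 1} \kappa_B^{(b)} g'^{b-1} \right) = q R_B(g) R_B(g') = \frac{q}{(1 - qg)(1 - qg')}$

where $R_B(g) = \sum_{n \geq 1} \kappa_B^{(n)} g^{n-1} = 1/(1 - qg)$ is the ordinary R-transform under the same assumption.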

Application to Linear Models

Linear Regression (Without Random Features)

Forcing Term (Dual Frequency): $F(\omega, \omega') = \frac{S_W S'_W}{1-\gamma(\omega_1, \omega'_1)} \bar{w}^\top (i\omega_1 + \Sigma)^{-1} \Sigma (i\omega'_1 + \Sigma)^{-1} \bar{w}$

where:

$S_W = 1/(1 - \frac{D}{P}\text{df}_1)$ is the S-transform of the Wishart matrix
$\omega_1 = S_W \omega$ is the renormalized frequency
$\gamma = \frac{D}{P}\text{df}_2(\omega_1, \omega'_1)$
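
The renormalization is defined implicitly, because $S_W$ depends on $\text{df}_1$ evaluated at the renormalized argument. The sketch below shows one way to solve it, under assumptions: a real ridge parameter `lam` stands in for $i\omega$, $\text{df}_1(\kappa) = \frac{1}{D}\text{Tr}[\Sigma(\Sigma+\kappa)^{-1}]$ , and the names `solve_kappa` and `df1` are hypothetical. It then compares the deterministic value of $\text{df}_1$ with the same quantity computed from one sampled empirical covariance.

```python
import numpy as np

def solve_kappa(spectrum, P, lam, iters=200):
    """Bisection for kappa * (1 - (1/P) * sum_i s_i / (s_i + kappa)) = lam."""
    lo, hi = 0.0, lam + spectrum.sum() / P  # residual <= 0 at lo, >= 0 at hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        resid = mid * (1.0 - np.sum(spectrum / (spectrum + mid)) / P) - lam
        if resid > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def df1(spectrum, kappa):
    """Normalized first degrees of freedom, (1/D) Tr[Sigma (Sigma + kappa)^-1]."""
    return np.mean(spectrum / (spectrum + kappa))

rng = np.random.default_rng(0)
D, P, lam = 200, 400, 0.1
spectrum = np.ones(D)                   # isotropic Sigma, chosen for simplicity
kappa = solve_kappa(spectrum, P, lam)
S_W = kappa / lam                       # since kappa = lam * S_W

# Empirical counterpart from one draw of the data.
X = rng.normal(size=(P, D)) * np.sqrt(spectrum)
emp_eigs = np.linalg.eigvalsh(X.T @ X / P)
df1_empirical = np.mean(emp_eigs / (emp_eigs + lam))
df1_equivalent = df1(spectrum, kappa)
```

Bisection is used rather than fixed-point iteration because the left-hand side is convex in $\kappa$ and vanishes at zero, so the positive root is unique and always bracketed.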

SGD Kernel Term (Single Frequency Sufficient): $K(\omega) \simeq \text{Tr}[\Sigma^2(\Sigma + i\omega_1)^{-1}]$

Linear Random Feature Model

Requires two applications of deterministic equivalence (first for data, then for features):

Forcing Term: $F(\omega, \omega') \simeq \frac{SS'}{1-\gamma_1} \left[ \bar{w}^\top (i\omega_2+\Sigma)^{-1}\Sigma(i\omega'_2+\Sigma)^{-1}\bar{w} + \text{correction term} \right]$

where $\omega_2 = S_{FF^\top} S_W \omega$ undergoes two renormalizations.

Key Technique: Uses the push-through identity $A(BA+\lambda)^{-1} = (AB+\lambda)^{-1}A$ to simplify expressions.
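
The push-through identity follows from $(AB+\lambda)A = A(BA+\lambda)$ and is easy to confirm numerically. The sketch below uses small random stand-ins for $FF^\top$ and $\hat{\Sigma}$ ; the sizes and the value of `lam` are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, P, lam = 6, 4, 10, 0.3
F = rng.normal(size=(D, N)) / np.sqrt(N)
X = rng.normal(size=(P, D))
A = F @ F.T                 # stand-in for F F^T (positive semidefinite)
B = X.T @ X / P             # stand-in for the empirical covariance
I = np.eye(D)

lhs = A @ np.linalg.inv(B @ A + lam * I)
rhs = np.linalg.inv(A @ B + lam * I) @ A
gap = np.max(np.abs(lhs - rhs))   # should be at floating-point round-off level
```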

Technical Innovations

Dual-Frequency Analysis: First systematic treatment of joint dependence on $(\omega, \omega')$ , capturing non-commutativity effects.
Planar Graph Method: Clearly organizes complex matrix averaging calculations through graph-theoretic language.
New Interpretation of S-Transform: Reveals the S-transform as a physical response function in dynamic systems, connecting free probability theory with dynamical systems theory.
Hierarchical Renormalization: In random feature models, frequencies undergo successive renormalizations $\omega \to \omega_1 \to \omega_2$ , each corresponding to a random source.
Soft Limit Recovery of Statics: Elegantly recovers static results via $\lim_{t\to\infty} F(t) = \lim_{\omega,\omega'\to 0} (i\omega)(i\omega')F(\omega,\omega')$ .

Experimental Setup

Note: This is a purely theoretical work, with verification primarily through mathematical derivation. Experimental validation mainly references numerical experiments from related works 16, 17.

Theoretical Verification Strategy

Comparison with Known Results:
- Verify recovery of known single-point deterministic equivalence in special cases (e.g., λ=λ')
- Verify static limit recovers known results for ridge regression 20
Internal Consistency Checks:
- Verify that results obtained by differentiating single-point formulas match two-point formulas at λ=λ'
- Verify that different derivation paths (single-frequency vs. dual-frequency) yield identical results
Comparison with DMFT Results:
- Confirm that formulas in this paper exactly match DMFT results by Bordelon et al. 16
- Establish correspondence between response functions and S-transforms

Theoretical Applicability Range

Asymptotic Regime: $D, N, P \to \infty$ with fixed ratios $D/N, D/P$
Data Structure: $\text{Tr}(\Sigma) = \Theta(D^\zeta)$ , $0 \leq \zeta \leq 1$
Batch Size Scaling: $B = \Theta(D^\zeta)$ to maintain stable dynamics
Learning Rate: $\eta = \Theta(1)$ independent of dimension

Experimental Results

Main Theoretical Results

1. Consistency Verification

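Recovery of Single-Point Limit (Appendix A.1): For $\hat{\Sigma}(\lambda+\hat{\Sigma})^{-2}$ , taking $\lambda=\lambda'$ in the two-point formula yields:

$\hat{\Sigma}(\hat{\Sigma}+\lambda)^{-2} \simeq \frac{d\kappa}{d\lambda} \Sigma(\Sigma+\kappa)^{-2}$

This is consistent with differentiating the single-point formula $\hat{\Sigma}(\hat{\Sigma}+\lambda)^{-1} \simeq \Sigma(\Sigma+\kappa)^{-1}$ with respect to $\lambda$ : the left side gives $-\hat{\Sigma}(\hat{\Sigma}+\lambda)^{-2}$ and the right side gives $-\frac{d\kappa}{d\lambda}\Sigma(\Sigma+\kappa)^{-2}$ .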

2. Recovery of Static Limit

In the $t \to \infty$ limit (corresponding to $\omega, \omega' \to 0$ ), the forcing term recovers known results for ridge regression:

$\lim_{t\to\infty} R_t = \kappa^2 \bar{w}^\top \Sigma (\Sigma+\kappa)^{-2} \bar{w} + \sigma_\epsilon^2$

where $\kappa$ satisfies the self-consistent equation $\kappa = \lim_{\omega\to 0} S_B(\text{df}_1^\Sigma(\kappa)) \cdot \omega$

3. Covariate Shift Results

For cases where the test distribution $\Sigma'$ differs from the training distribution $\Sigma$ , the static generalization error is:

$E_{\Sigma',\bar{w}}^{OOD} \simeq \kappa^2 \left[ \bar{w}^\top (\Sigma+\kappa)^{-1}\Sigma'(\Sigma+\kappa)^{-1}\bar{w} + \bar{w}^\top \Sigma(\Sigma+\kappa)^{-2}\bar{w} \frac{\gamma'}{1-\gamma} \right] + \sigma_\epsilon^2 \frac{\gamma'}{1-\gamma}$

where $\gamma' = \frac{D}{P}\text{tr}[\Sigma(\Sigma+\kappa)^{-1}\Sigma'(\Sigma+\kappa)^{-1}]$

This recovers and extends results by Patil et al. 40 and Canatar et al. 41 to the dynamic case.
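
The static covariate-shift formula can be evaluated numerically once $\kappa$ is known. The sketch below is an illustration under assumptions: a real ridge `lam`, the normalized trace $\text{tr} = \frac{1}{D}\text{Tr}$ , $\gamma = \frac{D}{P}\text{df}_2(\kappa,\kappa)$ , and hypothetical function names `solve_kappa` and `ood_error`.

```python
import numpy as np

def solve_kappa(spectrum, P, lam, iters=200):
    """Bisection for kappa * (1 - (1/P) * sum_i s_i / (s_i + kappa)) = lam."""
    lo, hi = 0.0, lam + spectrum.sum() / P
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        resid = mid * (1.0 - np.sum(spectrum / (spectrum + mid)) / P) - lam
        if resid > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def ood_error(Sigma, Sigma_test, w_bar, P, lam, noise_var):
    """Static out-of-distribution error for train covariance Sigma and
    test covariance Sigma_test, following the formula quoted above."""
    D = Sigma.shape[0]
    kappa = solve_kappa(np.linalg.eigvalsh(Sigma), P, lam)
    G = np.linalg.inv(Sigma + kappa * np.eye(D))
    gamma = np.trace(Sigma @ G @ Sigma @ G) / P             # (D/P) df_2
    gamma_shift = np.trace(Sigma @ G @ Sigma_test @ G) / P  # gamma'
    ratio = gamma_shift / (1.0 - gamma)
    signal = (w_bar @ G @ Sigma_test @ G @ w_bar
              + (w_bar @ Sigma @ G @ G @ w_bar) * ratio)
    return kappa ** 2 * signal + noise_var * ratio

rng = np.random.default_rng(1)
D, P = 30, 60
Sigma = np.diag(1.0 / np.arange(1, D + 1))   # assumed power-law train spectrum
w_bar = rng.normal(size=D)
in_dist = ood_error(Sigma, Sigma, w_bar, P, lam=0.05, noise_var=0.2)
shifted = ood_error(Sigma, np.eye(D), w_bar, P, lam=0.05, noise_var=0.2)
```

Setting `Sigma_test = Sigma` collapses the expression to $\frac{1}{1-\gamma}\left[\kappa^2 \bar{w}^\top \Sigma(\Sigma+\kappa)^{-2}\bar{w} + \sigma_\epsilon^2 \gamma\right]$ , which is a convenient sanity check on any implementation.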

Comparison with Existing Work

Method	Finite P	Finite N	Dynamics	Covariate Shift	Technical Approach
Bordelon et al. 16	✓	✓	✓	✗	DMFT
Paquette et al. 17	✓	✗	✓	✗	Single-point DE
This Work	✓	✓	✓	✓	Two-point DE

Key Theoretical Findings

Structure of SGD Kernel Term:
- Training kernel $\hat{K}$ and test kernel $K$ differ only by an additional term
- This additional term is non-negative as $\omega \to 0$ , explaining SGD's additional regularization effect on training loss
Dynamic Generalization of GCV:
- Empirical loss and population loss differ by factor $S_W S'_W$ under gradient flow
- This is the natural dynamic generalization of generalized cross-validation (GCV)
Physical Meaning of Response Functions:
- Response functions $R_1, R_3$ in DMFT correspond to $1/S_W, 1/S_{FF^\top}$
- S-transforms encode the system's response to frequency perturbations
Multi-Scale Renormalization:
- Frequencies are successively renormalized by randomness in data and features
- Each layer of randomness introduces an S-transform factor

Related Work

Random Matrix Theory and Deterministic Equivalence

Single-Point Deterministic Equivalence:
- Knowles & Yin 29: Established anisotropic local law
- Louart et al. 30: Application to neural network analysis
- Bach 28: Analysis of double descent phenomenon
- Atanasov et al. 20: Systematic review of scaling and renormalization in high-dimensional regression
Free Probability Theory:
- Potters & Bouchaud 24: Random matrix theory textbook
- Multiplicativity of the S-transform: $S_{A*B} = S_A S_B$ for free $A$ and $B$ (free multiplicative convolution)

Neural Scaling Laws

Empirical Observations:
- Kaplan et al. 2: Scaling laws for language models
- Hoffmann et al. 3: Chinchilla optimal training
- Hestness et al. 1: Predictability of deep learning scaling
Theoretical Analysis:
- Bordelon et al. 16: DMFT analysis of random feature model scaling
- Paquette et al. 17: Identification of 4+3 compute-optimal phases
- Lin et al. 18: Scaling laws in linear regression

SGD Dynamics Analysis

Kernel Methods:
- Lin & Rosasco 13: Optimal rates for multi-pass SGD
- Pillaud-Vivien et al. 14: Statistical optimality for hard learning problems
Simplified Models:
- Bordelon & Pehlevan 21: Learning curves on structured features
- Paquette et al. 35-37: Exact risk trajectories for high-dimensional SGD
- Canatar et al. 34: Spectral bias and task-model alignment

High-Dimensional Statistics

Ridge Regression:
- Hastie et al. 25: Surprising phenomena in high-dimensional ridgeless interpolation
- Defilippis et al. 32: Dimension-free deterministic equivalence
- Misiakiewicz & Saeed 33: Non-asymptotic theory
Covariate Shift:
- Patil et al. 40: Optimal ridge regularization for OOD prediction
- Canatar et al. 41: OOD generalization in kernel regression

Conclusions and Discussion

Main Conclusions

Unified Framework: Two-point deterministic equivalence provides a unified mathematical framework for analyzing finite data, finite model size, and SGD noise jointly.
Theoretical Completeness: Recovers all known results (static ridge regression, DMFT dynamics, single-point deterministic equivalence) and extends to new scenarios (dynamics of covariate shift).
Methodological Contribution: The combination of planar graph expansion and free probability theory provides new computational tools for random matrix theory.
Physical Insights: Reveals the deep meaning of S-transform as a response function and establishes a bridge between deterministic equivalence and DMFT.

Limitations

Asymptotic Nature:
- Results are exact in the $D, N, P \to \infty$ limit
- Error bounds for finite dimensions not provided (though numerical experiments 16,17 show good approximations)
- Non-planar graphs (corresponding to fluctuations and subleading corrections) not analyzed
Model Restrictions:
- Applicable only to linear models and linear random features
- Feature matrix F must be Gaussian random
- Data covariance Σ must satisfy certain spectral conditions
Technical Assumptions:
- Requires discarding certain SGD terms (middle term in Eq III.1)
- Batch size must scale as $B = \Theta(D^\zeta)$
- Learning rate must remain $\eta = \Theta(1)$
Rigor:
- Equivalence of simplified model (Eq III.2) not rigorously proven, mainly citing prior work 21, 35-37
- Derivation of quantitative error bounds left for future work

Future Directions

Extension to Non-Linear Models:
- Two-point equivalence for shallow neural networks
- Non-linear versions of kernel methods
Finite-Dimension Corrections:
- Derive 1/N, 1/P correction terms
- Establish quantitative error bounds 24, 29-33
More General Randomness:
- Non-Gaussian feature matrices
- Structured random matrices (e.g., circulant, Toeplitz)
Optimization Algorithms:
- Extension to momentum, Adam, and other optimizers
- Analysis of adaptive learning rates
Practical Applications:
- Use theory to guide hyperparameter selection
- Predict performance of large-scale models

In-Depth Evaluation

Strengths

Theoretical Depth:
- First systematic derivation of two-point deterministic equivalence, filling an important gap in random matrix theory
- Planar graph method elegantly organizes complex calculations with strong extensibility
- Establishes profound connections between multiple mathematical fields (random matrices, free probability, dynamical systems, statistical physics)
Unification:
- Single framework unifies multiple previously independent results
- Equivalence of different technical approaches (DMFT vs. deterministic equivalence) clarified
- Smooth transitions from static to dynamic, finite to infinite
Technical Innovation:
- Introduction of mixed R-transform cleverly handles coupling of two parameters
- Hierarchical renormalization idea clearly demonstrates effects of multiple random sources
- Fourier space analysis transforms complex temporal evolution into algebraic problems
Completeness:
- Exhaustive appendix contains all variant formulas
- Multiple consistency checks verify theoretical correctness
- Clear notation system and diagrams aid understanding
Potential Impact:
- Provides toolbox for analyzing more complex models
- May inspire new numerical algorithms (fast simulation based on deterministic equivalence)
- Provides theoretical foundation for understanding deep learning scaling laws

Weaknesses

Readability Challenges:
- Requires deep background in random matrix theory
- Complex notation system (multi-level subscripts, multiple S-transforms)
- Main results (Eq IV.2, VI.2) have a complex form that is hard to grasp intuitively
Insufficient Experimental Verification:
- Paper provides no new numerical experiments
- Completely relies on verification from cited works 16, 17
- Lacks systematic assessment of theoretical prediction accuracy (e.g., errors under different D, N, P)
Limited Application Guidance:
- Theoretical results require solving complex self-consistent equations (e.g., computing κ)
- No practical algorithms or code implementations provided
- Unclear practical implications for actual deep learning
Reasonableness of Technical Assumptions:
- Argument for discarding middle term in Eq III.1 not sufficiently rigorous (especially for ζ=0 case)
- Applicable conditions of simplified model not fully characterized
- Assumptions on data structure (spectral decay rate) relatively strong
Generalization Limitations:
- Gaussian assumption often violated in practice
- Large gap between linear models and actual neural networks
- Batch size scaling requirements may be impractical

Impact Assessment

Contribution to Academia:

Theoretical Foundation: Provides new tools for high-dimensional statistics and machine learning theory, expected to be widely cited
Methodology: Planar graph method and two-point technique may inspire research on other problems
Unified Perspective: Connects multiple research communities (statistical physics, random matrices, machine learning theory)

Practical Value:

Short-term: Primarily theoretical value, limited direct applications
Medium-term: May guide model design and hyperparameter selection (e.g., optimal P/N ratio)
Long-term: Provides theoretical foundation for understanding and predicting large-scale model behavior

Reproducibility:

Theoretical derivations detailed, in principle fully reproducible
Lack of a code implementation raises the barrier to practical application
Numerical verification depends on prior work, independent verification requires additional effort

Applicable Scenarios

Most Suitable Scenarios:

High-Dimensional Linear Models: Regression problems with large P, N, D and fixed ratios
Theoretical Analysis: Theoretical research requiring precise asymptotic behavior
Scaling Law Prediction: Predicting model performance trends with scale changes
Covariate Shift: Scenarios where training and test distributions differ

Less Suitable Scenarios:

Small Sample Problems: Asymptotic theory inapplicable
Non-Linear Deep Networks: Requires further theory extension
Non-Gaussian Data: Theory assumptions violated
Real-Time Applications: Self-consistent equation solving may be slow

Potential Application Directions:

Performance prediction in neural architecture search
Data acquisition strategy optimization (when to stop collecting data)
Theoretical guidance for model compression and knowledge distillation
Theoretical foundation for transfer learning and domain adaptation

Selected References

16 B. Bordelon, A. Atanasov, and C. Pehlevan, "A dynamical model of neural scaling laws," ICML 2024.

17 E. Paquette, C. Paquette, L. Xiao, and J. Pennington, "4 + 3 phases of compute-optimal neural scaling laws," arXiv:2405.15074, 2024.

20 A. Atanasov, J. A. Zavatone-Veth, and C. Pehlevan, "Scaling and renormalization in high-dimensional regression," arXiv:2405.00592, 2024.

24 M. Potters and J.-P. Bouchaud, "A first course in random matrix theory," Cambridge University Press, 2020.

26 A. Atanasov, J. A. Zavatone-Veth, and C. Pehlevan, "Risk and cross validation in ridge regression with correlated samples," arXiv:2408.04607, 2024.

Overall Assessment: This is an excellent paper with exceptional theoretical depth, providing a unified and elegant mathematical framework for SGD dynamics in high-dimensional linear models. The derivation of two-point deterministic equivalence is an important theoretical contribution, and the planar graph method demonstrates strong technical prowess. While direct applications are limited and readability presents challenges, the work has significant long-term value for machine learning theory development. Subsequent work should supplement numerical verification, provide practical algorithms, and explore extensions to non-linear models.