Two-Point Deterministic Equivalence for Stochastic Gradient Dynamics in Linear Models

Atanasov, Bordelon, Zavatone-Veth et al.
We derive a novel deterministic equivalence for the two-point function of a random matrix resolvent. Using this result, we give a unified derivation of the performance of a wide variety of high-dimensional linear models trained with stochastic gradient descent. This includes high-dimensional linear regression, kernel regression, and linear random feature models. Our results include previously known asymptotics as well as novel ones.

Basic Information

  • Paper ID: 2502.05074
  • Title: Two-Point Deterministic Equivalence for Stochastic Gradient Dynamics in Linear Models
  • Authors: Alexander Atanasov, Blake Bordelon, Jacob A. Zavatone-Veth, Courtney Paquette, Cengiz Pehlevan (from Harvard University, McGill University, and other institutions)
  • Classification: cond-mat.dis-nn, cs.LG, stat.ML
  • Publication Date: arXiv v3, November 10, 2025
  • Paper Link: https://arxiv.org/abs/2502.05074v3

Abstract

This paper derives a novel deterministic equivalence for the two-point function of a random matrix resolvent. Based on this result, the authors give a unified derivation of the performance of a range of high-dimensional linear models trained with stochastic gradient descent (SGD), including high-dimensional linear regression, kernel regression, and linear random feature models. The results include previously known asymptotics as well as new ones.

Research Background and Motivation

Problems to Address

A core phenomenon in modern deep learning is that model performance exhibits predictable power-law behavior (neural scaling laws) as data scale, model size, and computational resources increase. Understanding the theoretical foundations of this scaling behavior is an important challenge in machine learning theory.

Importance of the Problem

  1. Need for a Unified Theoretical Framework: Existing work has studied the effects of finite width, finite data, and SGD noise separately, using different methods such as dynamical mean-field theory (DMFT) and deterministic equivalence, without a unified framework.
  2. Understanding Dynamics: Most theoretical analyses focus on static (infinite time) limits, with insufficient understanding of the training dynamics process.
  3. Non-Commutativity Challenge: When the data covariance matrix Σ, empirical covariance Σ̂, and random feature matrix FF^T do not commute, traditional single-point deterministic equivalence methods fail.

Limitations of Existing Methods

  • Single-Point Deterministic Equivalence: Can only handle cases where matrices commute (e.g., infinite data P→∞ or linear regression without random features).
  • DMFT Method: While capable of handling general cases, it has high technical complexity and lacks direct connection to random matrix theory.
  • Scattered Results: Different works use different techniques to obtain partial results, lacking a unified mathematical framework.

Research Motivation

This paper aims to develop a two-point deterministic equivalence theory to provide a unified mathematical framework for analyzing the complete dynamic behavior of SGD in high-dimensional linear models, including the joint effects of finite data, finite model size, and SGD noise.

Core Contributions

  1. Novel Two-Point Deterministic Equivalence Theory: First systematic derivation of deterministic equivalence formulas for the two-point function of a random matrix resolvent, i.e. a product of two resolvents evaluated at different parameters $(\lambda, \lambda')$.
  2. Unified Dynamic Analysis Framework: Decomposes SGD dynamics into a forcing term (gradient flow) and an SGD kernel term, with analysis in the frequency domain via Fourier transform.
  3. Recovery and Extension of Existing Results:
    • Recovers results obtained by Bordelon et al. 16 through DMFT
    • Recovers results by Paquette et al. 17 using single-point deterministic equivalence
    • Extends to new scenarios such as covariate shift
  4. Connection to Free Probability Theory: Gives a new interpretation of the S-transform as a response function of a dynamical system, establishing a bridge between deterministic equivalence and DMFT.
  5. Planar Graph Expansion Technique: Systematically derives two-point equivalence formulas using planar graph expansions and free cumulants.

Detailed Methodology

Task Definition

Consider two classes of models:

1. Linear Regression: $f(x) = x^\top w$

2. Linear Random Feature Model: $f(x) = x^\top F v = w^\top x, \quad w = F v$

Where:

  • Input $x \in \mathbb{R}^D$ with $x \sim \mathcal{N}(0, \Sigma)$
  • Random feature matrix $F \in \mathbb{R}^{D \times N}$ with i.i.d. entries $\sim \mathcal{N}(0, 1/N)$
  • Labels generated by a teacher model: $y_\mu = \bar{w}^\top x_\mu + \epsilon_\mu$, where $\epsilon_\mu \sim \mathcal{N}(0, \sigma_\epsilon^2)$

Training Objective: Minimize the empirical risk $\hat{R} = \frac{1}{P}\sum_{\mu=1}^P (y_\mu - f(x_\mu))^2$

via SGD updates (batch size $B$, learning rate $\eta$): $v_{t+1} = v_t - \eta \nabla_v \hat{R}_{B_t}$

Performance Metrics:

  • Training loss: $\hat{R}_t = \Delta w_t^\top \hat{\Sigma} \Delta w_t$
  • Test loss: $R_t = \Delta w_t^\top \Sigma \Delta w_t$
  • Where $\Delta w_t = \bar{w} - w_t$
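To make the setup concrete, the following is a minimal simulation sketch of the linear random feature model trained with mini-batch SGD. It is illustrative only: the isotropic choice $\Sigma = I$, the dimensions, the step count, and the function name `simulate_sgd` are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def simulate_sgd(D=20, N=40, P=200, B=10, eta=0.01, steps=5000, noise=0.1, seed=0):
    """Mini-batch SGD on f(x) = x^T F v with a linear teacher (illustrative sizes)."""
    rng = np.random.default_rng(seed)
    sigma_diag = np.ones(D)                          # assumption: Sigma = I
    X = rng.standard_normal((P, D)) * np.sqrt(sigma_diag)
    w_bar = rng.standard_normal(D)
    w_bar /= np.linalg.norm(w_bar)                   # unit-norm teacher
    y = X @ w_bar + noise * rng.standard_normal(P)
    F = rng.standard_normal((D, N)) / np.sqrt(N)     # i.i.d. N(0, 1/N) entries
    Sigma_hat = X.T @ X / P                          # empirical covariance
    v = np.zeros(N)
    train, test = [], []
    for _ in range(steps):
        dw = w_bar - F @ v                           # Delta w_t = w_bar - w_t
        train.append(dw @ Sigma_hat @ dw)            # training loss as defined above
        test.append(dw @ (sigma_diag * dw))          # test loss as defined above
        idx = rng.choice(P, size=B, replace=False)   # sample the mini-batch B_t
        resid = y[idx] - X[idx] @ (F @ v)
        grad = -2.0 / B * F.T @ (X[idx].T @ resid)   # gradient of the batch risk in v
        v = v - eta * grad
    return np.array(train), np.array(test)

if __name__ == "__main__":
    train, test = simulate_sgd()
    print("initial test loss:", test[0], "final test loss:", test[-1])
```

Averaging such trajectories over seeds gives an empirical curve that asymptotic predictions can be compared against.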

Core Theoretical Framework

1. Simplified Model of SGD Dynamics

By tracking the second moment of the weight error, $C_t = \mathbb{E}_{B_t}[\Delta w_t \Delta w_t^\top]$, one obtains a Volterra integral equation in the continuous-time limit:

$$C_t \simeq e^{-\eta t FF^\top \hat{\Sigma}} \bar{w}\bar{w}^\top e^{-\eta t \hat{\Sigma}FF^\top} + \chi \int_0^t e^{-2(t-s)FF^\top\hat{\Sigma}} FF^\top \hat{\Sigma} FF^\top \, \text{Tr}[C_s\hat{\Sigma}] \, ds$$

where $\chi = \eta/B$ is the SGD temperature parameter.

2. Forcing Term and Kernel Term Decomposition

Test loss can be decomposed as:

$$R_t = \underbrace{\bar{w}^\top e^{-t\hat{\Sigma}FF^\top} \Sigma \, e^{-tFF^\top\hat{\Sigma}} \bar{w}}_{F(t)\text{: forcing term}} + \int_0^t \underbrace{\chi \, \text{tr}[e^{-2(t-s)FF^\top\hat{\Sigma}}FF^\top\hat{\Sigma}FF^\top\Sigma]}_{K(t-s)\text{: SGD kernel term}} \hat{R}_s \, ds$$
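Once a forcing function and a kernel are known, an equation of this convolution form can be integrated numerically. The sketch below is a generic scalar Volterra solver on a uniform time grid using a left-rectangle rule; it is an assumption-laden illustration rather than the paper's procedure. In particular, it presumes the training loss obeys an equation of the same form with its own forcing and kernel, and all function names are hypothetical.

```python
import math

def solve_volterra(forcing, kernel, h):
    """Solve x_n = f_n + h * sum_{m<n} k_{n-m} x_m on a uniform grid of spacing h.

    forcing[n] = f(n*h) and kernel[n] = k(n*h); left-rectangle quadrature.
    """
    n_steps = len(forcing)
    x = [0.0] * n_steps
    for n in range(n_steps):
        acc = 0.0
        for m in range(n):
            acc += kernel[n - m] * x[m]
        x[n] = forcing[n] + h * acc
    return x

def add_convolution(forcing, kernel, x, h):
    """Return f_n + h * sum_{m<n} k_{n-m} x_m for an already-known x."""
    return [forcing[n] + h * sum(kernel[n - m] * x[m] for m in range(n))
            for n in range(len(x))]

if __name__ == "__main__":
    # Sanity check: x(t) = 1 + int_0^t x(s) ds has the solution x(t) = exp(t).
    h = 1e-3
    ones = [1.0] * 1001
    x = solve_volterra(ones, ones, h)
    print(x[-1], math.exp(1.0))
```

Under the stated assumption, one would first call `solve_volterra` with the training-loss forcing and kernel to get $\hat{R}_s$, then call `add_convolution` with $F$ and $K$ to get the test loss $R_t$.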

Key Insight: In Fourier space, all randomness enters through products of resolvents:

$$F(\omega, \omega') = \bar{w}^\top (\hat{\Sigma}FF^\top + i\omega)^{-1} \Sigma \, (FF^\top\hat{\Sigma} + i\omega')^{-1} \bar{w}$$

When the matrices do not commute, the two resolvents cannot be merged into a single one, so the two-point function must be evaluated at distinct frequencies $(\omega, \omega')$.
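The link between exponentials in time and resolvents in frequency can be checked in the scalar case. The sketch below assumes the one-sided transform convention $\int_0^\infty e^{-i\omega t} e^{-at}\,dt = (a + i\omega)^{-1}$, which matches the $(\cdot + i\omega)^{-1}$ factors above; the convention and the function name are assumptions made for illustration.

```python
import cmath

def one_sided_transform(a, omega, t_max=40.0, h=1e-3):
    """Trapezoid estimate of int_0^t_max exp(-i*omega*t) * exp(-a*t) dt for a > 0."""
    n = int(round(t_max / h))
    total = 0.0 + 0.0j
    for k in range(n + 1):
        t = k * h
        weight = 0.5 if k in (0, n) else 1.0      # trapezoid end-point weights
        total += weight * cmath.exp(-(a + 1j * omega) * t)
    return total * h

if __name__ == "__main__":
    a, omega = 0.7, 1.3
    print(one_sided_transform(a, omega), 1.0 / (a + 1j * omega))
```

In the scalar (commuting) case the two-point function is simply the product of two such factors, one at $\omega$ and one at $\omega'$; the matrix case is harder precisely because the factors no longer commute.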

Derivation of Two-Point Deterministic Equivalence

Core Theorem

For random matrices of the form $(\lambda+AB)^{-1}M(\lambda'+BA)^{-1}$, where $A$ and $M$ are deterministic matrices and $B$ is a white Wishart matrix free of $A$, there exists a deterministic equivalence:

$$(\lambda+AB)^{-1}M(\lambda'+BA)^{-1} \simeq S_B S'_B \left[ G_A M G'_A + G_A A G'_A \frac{q \, \text{tr}[AG_A M G'_A]}{1 - q \, \text{df}_2(\kappa, \kappa')} \right]$$

where primed quantities are the same objects evaluated at $\lambda'$, and:

  • $S_B = S_B(\text{df}_1^{AB}(\lambda))$ is the S-transform of $B$
  • $G_A = (\kappa + A)^{-1}$, where $\kappa = \lambda S_B$ is the signal capture threshold
  • $\text{df}_2(\kappa, \kappa') = \text{tr}[A^2 G_A G'_A]$ is the second-order degrees of freedom
  • $q = N/P$ is the Wishart parameter

Derivation Strategy (Planar Graph Expansion)

  1. Orthogonal Averaging: Write $B = O B' O^\top$ ($B'$ diagonal) and average over the orthogonal matrix $O$.
  2. Irreducible Graph Expansion: Expand the resolvent as chains of irreducible graphs connected through $A/\lambda$:

Diagram (simplified):
[1/S_B] --A/λ--> [1/S_B] --A/λ--> ...

  3. Connected Graph Summation: Each irreducible graph is a sum of fully connected graphs involving the free cumulants $\kappa_B^{(n)}$:

$$\frac{1}{S_B} = \sum_{n=1}^\infty \kappa_B^{(n)} \, \text{tr}[G_A BA]^{n-1}$$

  4. Handling the M Insertion: Terms containing $M$ produce a self-consistent equation:

$$X_M = S_B S'_B R_B[g, g'] \left( \text{tr}[G_A M G'_A] + X_M \, \text{tr}[G_A A^2 G'_A] \right)$$

where the mixed R-transform is $R_B[g, g'] = \sum_{n=1}^\infty \sum_{a+b=n} \kappa_B^{(n)} g^{a-1} g'^{b-1}$.

  5. Wishart Case Simplification: Because $\kappa_B^{(a+b)} = q \, \kappa_B^{(a)} \kappa_B^{(b)}$, the mixed R-transform factorizes.
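The factorization in the Wishart case can be verified numerically with truncated sums. The sketch below assumes the standard normalization in which the free cumulants of a white Wishart matrix are $\kappa^{(n)} = q^{n-1}$ (so that $R(g) = 1/(1 - qg)$), and it assumes the inner sum runs over $a, b \geq 1$; both are assumptions of this example, as are the function names.

```python
def wishart_free_cumulant(n, q):
    """Assumed normalization: kappa^(n) = q**(n-1), so kappa^(1) = 1."""
    return q ** (n - 1)

def r_transform(g, q, n_max=200):
    """Truncated R(g) = sum_{n>=1} kappa^(n) g^(n-1)."""
    return sum(wishart_free_cumulant(n, q) * g ** (n - 1) for n in range(1, n_max + 1))

def mixed_r_transform(g, g_prime, q, n_max=200):
    """Truncated sum over n and a + b = n (a, b >= 1) of kappa^(n) g^(a-1) g'^(b-1)."""
    total = 0.0
    for n in range(2, n_max + 1):
        for a in range(1, n):
            b = n - a
            total += wishart_free_cumulant(n, q) * g ** (a - 1) * g_prime ** (b - 1)
    return total

if __name__ == "__main__":
    q, g, g_prime = 0.5, 0.3, -0.2
    lhs = mixed_r_transform(g, g_prime, q)
    rhs = q * r_transform(g, q) * r_transform(g_prime, q)
    print(lhs, rhs)   # the two values should agree up to truncation and rounding
```

With these cumulants the relation $\kappa^{(a+b)} = q\,\kappa^{(a)}\kappa^{(b)}$ holds term by term, which is why the double sum collapses to $q\,R(g)\,R(g')$.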

Application to Linear Models

Linear Regression (Without Random Features)

Forcing Term (Dual Frequency): $F(\omega, \omega') = \frac{S_W S'_W}{1-\gamma(\omega_1, \omega'_1)} \, \bar{w}^\top (i\omega_1 + \Sigma)^{-1} \Sigma \, (i\omega'_1 + \Sigma)^{-1} \bar{w}$

where:

  • $S_W = 1/(1 - \frac{D}{P}\text{df}_1)$ is the S-transform of the Wishart matrix
  • $\omega_1 = S_W \omega$ is the renormalized frequency
  • $\gamma = \frac{D}{P}\text{df}_2(\omega_1, \omega'_1)$

SGD Kernel Term (Single Frequency Sufficient): $K(\omega) \simeq \text{Tr}[\Sigma^2(\Sigma + i\omega_1)^{-1}]$
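The linear-regression forcing-term formula can be spot-checked by Monte Carlo. The sketch below makes several assumptions for illustration: real regularizers $\lambda, \lambda'$ stand in for $i\omega, i\omega'$, the quadratic form in $\bar{w}$ is replaced by a normalized trace (an average over isotropic $\bar{w}$), $\text{df}_1$ and $\text{df}_2$ are taken as normalized traces, and $\Sigma$ is diagonal with a hypothetical spectrum. The function names are also hypothetical.

```python
import numpy as np

def solve_kappa(lam, spectrum, q):
    """Bisection for kappa = lam / (1 - q * df1(kappa)), df1 = mean(s / (s + kappa))."""
    def f(kappa):
        return kappa * (1.0 - q * np.mean(spectrum / (spectrum + kappa))) - lam
    lo, hi = lam, lam + q * np.mean(spectrum)     # f(lo) < 0 <= f(hi)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def two_point_check(D=300, P=600, lam=0.5, lam_p=1.0, seed=0):
    rng = np.random.default_rng(seed)
    q = D / P
    s = np.linspace(0.5, 1.5, D)                  # assumed eigenvalues of Sigma
    X = rng.standard_normal((P, D)) * np.sqrt(s)  # rows x ~ N(0, diag(s))
    Sigma_hat = X.T @ X / P
    eye = np.eye(D)
    G = np.linalg.inv(Sigma_hat + lam * eye)
    G_p = np.linalg.inv(Sigma_hat + lam_p * eye)
    empirical = np.trace(G @ np.diag(s) @ G_p) / D
    kappa, kappa_p = solve_kappa(lam, s, q), solve_kappa(lam_p, s, q)
    S, S_p = kappa / lam, kappa_p / lam_p         # kappa = lam * S
    gamma = q * np.mean(s**2 / ((s + kappa) * (s + kappa_p)))
    predicted = S * S_p / (1.0 - gamma) * np.mean(s / ((s + kappa) * (s + kappa_p)))
    return empirical, predicted

if __name__ == "__main__":
    print(two_point_check())
```

The two returned numbers should agree up to finite-size corrections, which shrink as $D$ and $P$ grow at fixed ratio.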

Linear Random Feature Model

Requires two applications of deterministic equivalence (first for data, then for features):

Forcing Term: $F(\omega, \omega') \simeq \frac{SS'}{1-\gamma_1} \left[ \bar{w}^\top (i\omega_2+\Sigma)^{-1}\Sigma(i\omega'_2+\Sigma)^{-1}\bar{w} + \text{correction term} \right]$

where $\omega_2 = S_{FF^\top} S_W \omega$ undergoes two renormalizations.

Key Technique: Uses the push-through identity $A(BA+\lambda)^{-1} = (AB+\lambda)^{-1}A$ to simplify expressions.
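The push-through identity is a purely algebraic fact and is easy to confirm numerically. The sketch below uses random positive semidefinite matrices so that both inverses are well conditioned; the matrix sizes and the function name are arbitrary choices for illustration.

```python
import numpy as np

def push_through_gap(n=6, lam=0.3, seed=0):
    """Max entrywise gap between A (BA + lam)^{-1} and (AB + lam)^{-1} A."""
    rng = np.random.default_rng(seed)
    M = rng.standard_normal((n, n))
    Q = rng.standard_normal((n, n))
    A = M @ M.T / n                      # positive semidefinite
    B = Q @ Q.T / n                      # positive semidefinite
    eye = np.eye(n)
    left = A @ np.linalg.inv(B @ A + lam * eye)
    right = np.linalg.inv(A @ B + lam * eye) @ A
    return np.max(np.abs(left - right))

if __name__ == "__main__":
    print(push_through_gap())            # should be at the level of rounding error
```

The identity follows from $A(BA + \lambda) = (AB + \lambda)A$ after multiplying by the two inverses, which is why it lets resolvents of $AB$ and $BA$ be traded for one another.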

Technical Innovations

  1. Dual-Frequency Analysis: First systematic treatment of the joint dependence on $(\omega, \omega')$, capturing non-commutativity effects.
  2. Planar Graph Method: Organizes complex matrix averaging calculations in graph-theoretic language.
  3. New Interpretation of S-Transform: Identifies the S-transform as a physical response function of a dynamical system, connecting free probability theory with dynamical systems theory.
  4. Hierarchical Renormalization: In random feature models, frequencies undergo successive renormalizations $\omega \to \omega_1 \to \omega_2$, one for each source of randomness.
  5. Soft Limit Recovery of Statics: Recovers static results via $\lim_{t\to\infty} F(t) = \lim_{\omega,\omega'\to 0} (i\omega)(i\omega')F(\omega,\omega')$.
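The soft-limit rule in the last item is a two-frequency version of the final-value theorem. A toy scalar illustration, under assumptions made only for this example: take $f(t) = c + e^{-at}$, whose one-sided transform is $c/(i\omega) + 1/(a + i\omega)$ (with an infinitesimal damping implied for the constant part), and treat the two-point function as the product of two such transforms.

```python
def transform(c, a, omega):
    """One-sided transform of f(t) = c + exp(-a*t): c/(i w) + 1/(a + i w)."""
    return c / (1j * omega) + 1.0 / (a + 1j * omega)

def soft_limit(c, a, omega, omega_p):
    """(i w)(i w') times the toy two-point function f~(w) * f~(w')."""
    two_point = transform(c, a, omega) * transform(c, a, omega_p)
    return (1j * omega) * (1j * omega_p) * two_point

if __name__ == "__main__":
    c, a = 0.3, 1.0
    # As both frequencies shrink, the value approaches c**2 = lim f(t)**2.
    print(soft_limit(c, a, 1e-6, 2e-6), c ** 2)
```

Each factor $i\omega\,\tilde{f}(\omega) = c + i\omega/(a + i\omega)$ tends to $c$, so only the non-decaying part of the signal survives the limit.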

Experimental Setup

Note: This is a purely theoretical work, with verification primarily through mathematical derivation. Experimental validation mainly references numerical experiments from related works 16, 17.

Theoretical Verification Strategy

  1. Comparison with Known Results:
    • Verify recovery of known single-point deterministic equivalence in special cases (e.g., λ=λ')
    • Verify static limit recovers known results for ridge regression 20
  2. Internal Consistency Checks:
    • Verify that results obtained by differentiating single-point formulas match two-point formulas at λ=λ'
    • Verify that different derivation paths (single-frequency vs. dual-frequency) yield identical results
  3. Comparison with DMFT Results:
    • Confirm that formulas in this paper exactly match DMFT results by Bordelon et al. 16
    • Establish correspondence between response functions and S-transforms

Theoretical Applicability Range

  • Asymptotic Regime: $D, N, P \to \infty$ with fixed ratios $D/N$, $D/P$
  • Data Structure: $\text{Tr}(\Sigma) = \Theta(D^\zeta)$, $0 \leq \zeta \leq 1$
  • Batch Size Scaling: $B = \Theta(D^\zeta)$ to maintain stable dynamics
  • Learning Rate: $\eta = \Theta(1)$, independent of dimension

Experimental Results

Main Theoretical Results

1. Consistency Verification

Recovery of Single-Point Limit (Appendix A.1): For $\hat{\Sigma}(\lambda+\hat{\Sigma})^{-2}$, taking $\lambda=\lambda'$ in the two-point formula yields:

$$\hat{\Sigma}(\hat{\Sigma}+\lambda)^{-2} \simeq \frac{d\kappa}{d\lambda} \Sigma(\Sigma+\kappa)^{-2}$$

This agrees with differentiating the single-point formula $\hat{\Sigma}(\hat{\Sigma}+\lambda)^{-1} \simeq \Sigma(\Sigma+\kappa)^{-1}$ with respect to $\lambda$, since $\kappa$ depends on $\lambda$.
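The derivative $d\kappa/d\lambda$ can be evaluated in closed form. Differentiating $\kappa\,(1 - q\,\text{df}_1(\kappa)) = \lambda$ gives $d\kappa/d\lambda = 1/(1 - q\,\text{df}_2(\kappa))$ with $\text{df}_2 = \text{tr}[\Sigma^2(\Sigma+\kappa)^{-2}]$. The sketch below checks this against a finite difference; it assumes the linear-regression setting with $q = D/P$, normalized traces, and a hypothetical power-law spectrum for $\Sigma$.

```python
def df1(kappa, spectrum):
    return sum(s / (s + kappa) for s in spectrum) / len(spectrum)

def df2(kappa, spectrum):
    return sum((s / (s + kappa)) ** 2 for s in spectrum) / len(spectrum)

def solve_kappa(lam, spectrum, q, iters=200):
    """Bisection for kappa * (1 - q * df1(kappa)) = lam on [lam, lam + q * mean(s)]."""
    def f(kappa):
        return kappa * (1.0 - q * df1(kappa, spectrum)) - lam
    lo, hi = lam, lam + q * sum(spectrum) / len(spectrum)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def dkappa_dlambda(lam, spectrum, q):
    """Closed form 1 / (1 - q * df2(kappa)) evaluated at the solved kappa."""
    kappa = solve_kappa(lam, spectrum, q)
    return 1.0 / (1.0 - q * df2(kappa, spectrum))

if __name__ == "__main__":
    spectrum = [1.0 / (i + 1) for i in range(50)]   # assumed power-law eigenvalues
    q, lam, eps = 0.5, 0.1, 1e-5
    fd = (solve_kappa(lam + eps, spectrum, q) - solve_kappa(lam - eps, spectrum, q)) / (2 * eps)
    print(fd, dkappa_dlambda(lam, spectrum, q))
```

The factor $1/(1 - q\,\text{df}_2)$ is the same $1/(1-\gamma)$ that appears in the two-point formulas at coincident arguments.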

2. Recovery of Static Limit

In the $t \to \infty$ limit (corresponding to $\omega, \omega' \to 0$), the forcing term recovers known results for ridge regression:

$$\lim_{t\to\infty} R_t = \kappa^2 \bar{w}^\top \Sigma (\Sigma+\kappa)^{-2} \bar{w} + \sigma_\epsilon^2$$

where $\kappa$ satisfies the self-consistent equation $\kappa = \lim_{\omega\to 0} S_B(\text{df}_1^\Sigma(\kappa)) \cdot \omega$

3. Covariate Shift Results

For cases where the test covariance $\Sigma'$ differs from the training covariance $\Sigma$, the static generalization error is:

$$E_{\Sigma',\bar{w}}^{OOD} \simeq \kappa^2 \left[ \bar{w}^\top (\Sigma+\kappa)^{-1}\Sigma'(\Sigma+\kappa)^{-1}\bar{w} + \bar{w}^\top \Sigma(\Sigma+\kappa)^{-2}\bar{w} \, \frac{\gamma'}{1-\gamma} \right] + \sigma_\epsilon^2 \frac{\gamma'}{1-\gamma}$$

where $\gamma' = \frac{D}{P}\text{tr}[\Sigma(\Sigma+\kappa)^{-1}\Sigma'(\Sigma+\kappa)^{-1}]$

This recovers and extends results by Patil et al. 40 and Canatar et al. 41 to the dynamic case.
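A direct transcription of the covariate-shift expression is sketched below for the simplest case, where $\Sigma$ and $\Sigma'$ are diagonal in the same basis. This commuting restriction, the reading of $\text{tr}$ as a normalized trace, $\gamma = \frac{D}{P}\text{tr}[\Sigma^2(\Sigma+\kappa)^{-2}]$, and the function name `ood_risk` are assumptions of the example; $\kappa$ is taken as a given input.

```python
def ood_risk(kappa, train_spec, test_spec, w_bar, noise_var, q):
    """Static OOD error for diagonal Sigma (train_spec) and Sigma' (test_spec).

    q = D / P; w_bar holds the teacher coefficients in the shared eigenbasis.
    """
    D = len(train_spec)
    gamma = q * sum((s / (s + kappa)) ** 2 for s in train_spec) / D
    gamma_p = q * sum(s * sp / (s + kappa) ** 2
                      for s, sp in zip(train_spec, test_spec)) / D
    bias_shift = sum(w * w * sp / (s + kappa) ** 2
                     for w, s, sp in zip(w_bar, train_spec, test_spec))
    bias_train = sum(w * w * s / (s + kappa) ** 2
                     for w, s in zip(w_bar, train_spec))
    ratio = gamma_p / (1.0 - gamma)
    return kappa ** 2 * (bias_shift + bias_train * ratio) + noise_var * ratio

if __name__ == "__main__":
    spec = [1.0, 0.5, 0.25, 0.125]
    shifted = [0.125, 0.25, 0.5, 1.0]     # hypothetical test spectrum
    w_bar = [1.0, 1.0, 1.0, 1.0]
    print(ood_risk(0.2, spec, spec, w_bar, 0.01, 0.5))
    print(ood_risk(0.2, spec, shifted, w_bar, 0.01, 0.5))
```

Setting `test_spec` equal to `train_spec` gives $\gamma' = \gamma$, and the expression collapses to $\frac{\kappa^2}{1-\gamma}\bar{w}^\top\Sigma(\Sigma+\kappa)^{-2}\bar{w} + \sigma_\epsilon^2\frac{\gamma}{1-\gamma}$, the in-distribution form.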

Comparison with Existing Work

Comparison axes: Finite P, Finite N, Dynamics, Covariate Shift, Technical Approach

Method | Technical Approach
Bordelon et al. 16 | DMFT
Paquette et al. 17 | Single-point DE
This Work | Two-point DE

Key Theoretical Findings

  1. Structure of SGD Kernel Term:
    • Training kernel $\hat{K}$ and test kernel $K$ differ only by an additional term
    • This additional term is non-negative as $\omega \to 0$, explaining SGD's additional regularization effect on training loss
  2. Dynamic Generalization of GCV:
    • Empirical loss and population loss differ by the factor $S_W S'_W$ under gradient flow
    • This is the natural dynamic generalization of generalized cross-validation (GCV)
  3. Physical Meaning of Response Functions:
    • Response functions $R_1, R_3$ in DMFT correspond to $1/S_W, 1/S_{FF^\top}$
    • S-transforms encode the system's response to frequency perturbations
  4. Multi-Scale Renormalization:
    • Frequencies are successively renormalized by randomness in data and features
    • Each layer of randomness introduces an S-transform factor

Random Matrix Theory and Deterministic Equivalence

  1. Single-Point Deterministic Equivalence:
    • Knowles & Yin 29: Established anisotropic local law
    • Louart et al. 30: Application to neural network analysis
    • Bach 28: Analysis of double descent phenomenon
    • Atanasov et al. 20: Systematic review of scaling and renormalization in high-dimensional regression
  2. Free Probability Theory:
    • Potters & Bouchaud 24: Random matrix theory textbook
    • Properties of S-transform: $S_{A*B} = S_A S_B$ (multiplicative under free multiplicative convolution, i.e. for products of free matrices)

Neural Scaling Laws

  1. Empirical Observations:
    • Kaplan et al. 2: Scaling laws for language models
    • Hoffmann et al. 3: Chinchilla optimal training
    • Hestness et al. 1: Predictability of deep learning scaling
  2. Theoretical Analysis:
    • Bordelon et al. 16: DMFT analysis of random feature model scaling
    • Paquette et al. 17: Identification of 4+3 compute-optimal phases
    • Lin et al. 18: Scaling laws in linear regression

SGD Dynamics Analysis

  1. Kernel Methods:
    • Lin & Rosasco 13: Optimal rates for multi-pass SGD
    • Pillaud-Vivien et al. 14: Statistical optimality for hard learning problems
  2. Simplified Models:
    • Bordelon & Pehlevan 21: Learning curves on structured features
    • Paquette et al. 35-37: Exact risk trajectories for high-dimensional SGD
    • Canatar et al. 34: Spectral bias and task-model alignment

High-Dimensional Statistics

  1. Ridge Regression:
    • Hastie et al. 25: Surprising phenomena in high-dimensional ridgeless interpolation
    • Defilippis et al. 32: Dimension-free deterministic equivalence
    • Misiakiewicz & Saeed 33: Non-asymptotic theory
  2. Covariate Shift:
    • Patil et al. 40: Optimal ridge regularization for OOD prediction
    • Canatar et al. 41: OOD generalization in kernel regression

Conclusions and Discussion

Main Conclusions

  1. Unified Framework: Two-point deterministic equivalence provides a unified mathematical framework for analyzing finite data, finite model size, and SGD noise jointly.
  2. Theoretical Completeness: Recovers all known results (static ridge regression, DMFT dynamics, single-point deterministic equivalence) and extends to new scenarios (dynamics of covariate shift).
  3. Methodological Contribution: The combination of planar graph expansion and free probability theory provides new computational tools for random matrix theory.
  4. Physical Insights: Reveals the deep meaning of S-transform as a response function and establishes a bridge between deterministic equivalence and DMFT.

Limitations

  1. Asymptotic Nature:
    • Results are exact in the $D, N, P \to \infty$ limit
    • Error bounds for finite dimensions not provided (though numerical experiments 16,17 show good approximations)
    • Non-planar graphs (corresponding to fluctuations and subleading corrections) not analyzed
  2. Model Restrictions:
    • Applicable only to linear models and linear random features
    • Feature matrix F must be Gaussian random
    • Data covariance Σ must satisfy certain spectral conditions
  3. Technical Assumptions:
    • Requires discarding certain SGD terms (middle term in Eq III.1)
    • Batch size must scale as $B = \Theta(D^\zeta)$
    • Learning rate must remain $\eta = \Theta(1)$
  4. Rigor:
    • Equivalence of simplified model (Eq III.2) not rigorously proven, mainly citing prior work 21, 35-37
    • Derivation of quantitative error bounds left for future work

Future Directions

  1. Extension to Non-Linear Models:
    • Two-point equivalence for shallow neural networks
    • Non-linear versions of kernel methods
  2. Finite-Dimension Corrections:
    • Derive 1/N, 1/P correction terms
    • Establish quantitative error bounds 24, 29-33
  3. More General Randomness:
    • Non-Gaussian feature matrices
    • Structured random matrices (e.g., circulant, Toeplitz)
  4. Optimization Algorithms:
    • Extension to momentum, Adam, and other optimizers
    • Analysis of adaptive learning rates
  5. Practical Applications:
    • Use theory to guide hyperparameter selection
    • Predict performance of large-scale models

In-Depth Evaluation

Strengths

  1. Theoretical Depth:
    • First systematic derivation of two-point deterministic equivalence, filling an important gap in random matrix theory
    • Planar graph method elegantly organizes complex calculations with strong extensibility
    • Establishes profound connections between multiple mathematical fields (random matrices, free probability, dynamical systems, statistical physics)
  2. Unification:
    • Single framework unifies multiple previously independent results
    • Equivalence of different technical approaches (DMFT vs. deterministic equivalence) clarified
    • Smooth transitions from static to dynamic, finite to infinite
  3. Technical Innovation:
    • Introduction of mixed R-transform cleverly handles coupling of two parameters
    • Hierarchical renormalization idea clearly demonstrates effects of multiple random sources
    • Fourier space analysis transforms complex temporal evolution into algebraic problems
  4. Completeness:
    • Exhaustive appendix contains all variant formulas
    • Multiple consistency checks verify theoretical correctness
    • Clear notation system and diagrams aid understanding
  5. Potential Impact:
    • Provides toolbox for analyzing more complex models
    • May inspire new numerical algorithms (fast simulation based on deterministic equivalence)
    • Provides theoretical foundation for understanding deep learning scaling laws

Weaknesses

  1. Readability Challenges:
    • Requires deep background in random matrix theory
    • Complex notation system (multi-level subscripts, multiple S-transforms)
    • Main results (Eq IV.2, VI.2) have complex form, difficult intuitive understanding
  2. Insufficient Experimental Verification:
    • Paper provides no new numerical experiments
    • Completely relies on verification from cited works 16, 17
    • Lacks systematic assessment of theoretical prediction accuracy (e.g., errors under different D, N, P)
  3. Limited Application Guidance:
    • Theoretical results require solving complex self-consistent equations (e.g., computing κ)
    • No practical algorithms or code implementations provided
    • Unclear practical implications for actual deep learning
  4. Reasonableness of Technical Assumptions:
    • Argument for discarding middle term in Eq III.1 not sufficiently rigorous (especially for ζ=0 case)
    • Applicable conditions of simplified model not fully characterized
    • Assumptions on data structure (spectral decay rate) relatively strong
  5. Generalization Limitations:
    • Gaussian assumption often violated in practice
    • Large gap between linear models and actual neural networks
    • Batch size scaling requirements may be impractical

Impact Assessment

Contribution to Academia:

  • Theoretical Foundation: Provides new tools for high-dimensional statistics and machine learning theory, expected to be widely cited
  • Methodology: Planar graph method and two-point technique may inspire research on other problems
  • Unified Perspective: Connects multiple research communities (statistical physics, random matrices, machine learning theory)

Practical Value:

  • Short-term: Primarily theoretical value, limited direct applications
  • Medium-term: May guide model design and hyperparameter selection (e.g., optimal P/N ratio)
  • Long-term: Provides theoretical foundation for understanding and predicting large-scale model behavior

Reproducibility:

  • Theoretical derivations detailed, in principle fully reproducible
  • Lack of a code implementation raises the barrier to practical application
  • Numerical verification depends on prior work, independent verification requires additional effort

Applicable Scenarios

Most Suitable Scenarios:

  1. High-Dimensional Linear Models: Regression problems with large P, N, D and fixed ratios
  2. Theoretical Analysis: Theoretical research requiring precise asymptotic behavior
  3. Scaling Law Prediction: Predicting model performance trends with scale changes
  4. Covariate Shift: Scenarios where training and test distributions differ

Less Suitable Scenarios:

  1. Small Sample Problems: Asymptotic theory inapplicable
  2. Non-Linear Deep Networks: Requires further theory extension
  3. Non-Gaussian Data: Theory assumptions violated
  4. Real-Time Applications: Self-consistent equation solving may be slow

Potential Application Directions:

  • Performance prediction in neural architecture search
  • Data acquisition strategy optimization (when to stop collecting data)
  • Theoretical guidance for model compression and knowledge distillation
  • Theoretical foundation for transfer learning and domain adaptation

Selected References

16 B. Bordelon, A. Atanasov, and C. Pehlevan, "A dynamical model of neural scaling laws," ICML 2024.

17 E. Paquette, C. Paquette, L. Xiao, and J. Pennington, "4 + 3 phases of compute-optimal neural scaling laws," arXiv:2405.15074, 2024.

20 A. Atanasov, J. A. Zavatone-Veth, and C. Pehlevan, "Scaling and renormalization in high-dimensional regression," arXiv:2405.00592, 2024.

24 M. Potters and J.-P. Bouchaud, "A first course in random matrix theory," Cambridge University Press, 2020.

26 A. Atanasov, J. A. Zavatone-Veth, and C. Pehlevan, "Risk and cross validation in ridge regression with correlated samples," arXiv:2408.04607, 2024.


Overall Assessment: This is an excellent paper with exceptional theoretical depth, providing a unified and elegant mathematical framework for SGD dynamics in high-dimensional linear models. The derivation of two-point deterministic equivalence is an important theoretical contribution, and the planar graph method demonstrates strong technical prowess. While direct applications are limited and readability presents challenges, the work has significant long-term value for machine learning theory development. Subsequent work should supplement numerical verification, provide practical algorithms, and explore extensions to non-linear models.