2025-11-16T03:28:12.300331

The Potential of Second-Order Optimization for LLMs: A Study with Full Gauss-Newton

Abreu, Vyas, Kakade et al.
Recent efforts to accelerate LLM pretraining have focused on computationally-efficient approximations that exploit second-order structure. This raises a key question for large-scale training: how much performance is forfeited by these approximations? To probe this question, we establish a practical upper bound on iteration complexity by applying full Gauss-Newton (GN) preconditioning to transformer models of up to 150M parameters. Our experiments show that full GN updates yield substantial gains over existing optimizers, achieving a 5.4x reduction in training iterations compared to strong baselines like SOAP and Muon. Furthermore, we find that a precise layerwise GN preconditioner, which ignores cross-layer information, nearly matches the performance of the full GN method. Collectively, our results suggest: (1) the GN approximation is highly effective for preconditioning, implying higher-order loss terms may not be critical for convergence speed; (2) the layerwise Hessian structure contains sufficient information to achieve most of these potential gains; and (3) a significant performance gap exists between current approximate methods and an idealized layerwise oracle.
academic

The Potential of Second-Order Optimization for LLMs: A Study with Full Gauss-Newton

Basic Information

  • Paper ID: 2510.09378
  • Title: The Potential of Second-Order Optimization for LLMs: A Study with Full Gauss-Newton
  • Authors: Natalie Abreu (Harvard), Nikhil Vyas (Harvard/OpenAI), Sham Kakade (Harvard), Depen Morwani (Harvard)
  • Classification: cs.LG cs.AI
  • Publication Date: October 10, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.09378

Abstract

This paper investigates the performance loss incurred by computationally efficient approximations of existing second-order optimization methods in large language model (LLM) pretraining. The authors establish practical upper bounds on iteration complexity by applying full Gauss-Newton (GN) preconditioning to a 150M-parameter Transformer model. Experiments demonstrate that full GN updates achieve a 5.4× reduction in training iterations compared to strong baselines such as SOAP and Muon. Furthermore, the exact layer-wise GN preconditioner, which ignores cross-layer information, nearly achieves the performance of the full GN method.

Research Background and Motivation

Problem Definition

As computational demands for LLMs continue to grow, improvements in optimization methods have become a core strategy for enhancing training efficiency. While first-order methods (such as SGD and Adam) are widely used, second-order methods theoretically offer faster convergence rates and superior large-batch scaling capabilities.

Research Motivation

  1. Limitations of Existing Second-Order Methods: Current second-order optimizers (such as Shampoo, SOAP, and Muon) employ Hessian approximations to maintain computational feasibility, yet the performance loss from these approximations remains unclear.
  2. Theory-Practice Gap: Although second-order methods are theoretically superior, the prohibitive storage and computational costs of the full Hessian necessitate the use of approximations in practice.
  3. Core Research Questions: "What are the fundamental performance limits of second-order optimization in LLMs? Which structural properties of the Hessian are necessary to achieve these limits?"

Core Contributions

  1. Establishing Performance Bounds: Establishes practical performance upper bounds for second-order optimization through the full Gauss-Newton method, achieving a 5.4× improvement in iteration complexity compared to SOAP.
  2. Revealing Key Structures: Discovers that layer-wise Hessian structure contains sufficient information to achieve most performance gains, with limited importance of cross-layer curvature information.
  3. Theoretical Insights: Demonstrates that GN approximation is highly effective for preconditioning, suggesting that higher-order loss terms may not be critical for convergence speed.
  4. Batch Size Scaling: Significantly extends the critical batch size, demonstrating near-optimal scaling performance.

Methodology Details

Task Definition

Given model parameters θ, input x, and labels y, define the loss function L(f(θ,x), y). The objective is to minimize expected loss, with focus on iteration complexity (the number of steps required to reach target loss).

Gauss-Newton Method Principles

Mathematical Foundation

The complete Hessian matrix can be decomposed as:

∇²θL(θ) = ∇θf(θ)ᵀ∇²zL(θ)∇θf(θ) + Σₐ(δL/δzₐ)∇²θ[f(θ)]ₐ

where the first term is the Gauss-Newton matrix G, and the second term represents model curvature.

Algorithm Implementation

Algorithm 1: Gauss-Newton Method

  1. Perform first-order Taylor expansion of the model: f⁽¹⁾θₜ(θ,x) := f(θₜ,x) + ∇f(θₜ,x)ᵀ(θ-θₜ)
  2. Convexify the loss: L̃θₜ(θ) := (1/b)Σ₍ₓ,ᵧ₎∈B ℓ(f⁽¹⁾θₜ(θ,x), y)
  3. Construct second-order Taylor approximation: L̃⁽²⁾θₜ(θ)
  4. Solve the least-squares problem: θ̂ = argminθ L̃⁽²⁾θₜ(θ)
  5. Line search: θₜ₊₁ ← θₜ + α*(θ̂ - θₜ)

Memory-Feasible Implementation

To avoid explicit Hessian storage, Jacobian-vector products (JVPs) are employed to implement a functionally equivalent method. The core idea is to optimize a second-order Taylor approximation of the loss function L and a first-order Taylor approximation of the model f.

Variant Methods

GN-prox-linear Method

Directly minimizes loss on the linearized model: θ* = argminθ L̃θₜ(θ), used to investigate the impact of higher-order loss terms.

Layer-wise Gauss-Newton

For each layer l independently:

  1. Compute first-order Taylor expansion for that layer: f⁽¹⁾θₗ,ₜ(θₗ)
  2. Solve: θₗ,ₜ₊₁ = argminθₗ L̃⁽²⁾θₗ,ₜ(θₗ)
  3. Merge updates from all layers and apply line search

Experimental Setup

Datasets and Models

  • Models: 45M and 150M parameter LLaMA architecture
  • Dataset: C4 dataset
  • Sequence Length: 1024

Baseline Methods

  • AdamW: The most widely used LLM optimizer
  • Muon: Method using Newton-Schulz orthogonalization
  • SOAP: Latest variant of Shampoo

Experimental Configuration

  • Inner Optimizer: Muon used to solve least-squares problems
  • Batch Size: Controlled via gradient accumulation, bᵢₙₙₑᵣ = 32(45M) / 128(150M)
  • Learning Rate Schedule: Three strategies—global cosine, global+inner cosine, and constant+inner cosine
  • Regularization: Multiple strategies including weight decay and line search

Experimental Results

Main Results

Iteration Complexity

In experiments reaching loss 3.25:

  • Gauss-Newton: 54 steps
  • SOAP: 292 steps (5.4× difference)
  • Muon: ~16× difference
  • Layer-wise GN: 78 steps (only 1.4× difference)

Batch Size Scaling

In fixed 3B token training:

  • Gauss-Newton maintains good performance at 120M batch size (loss 3.45)
  • AdamW severely degrades at the same batch size (loss >4.4)
  • Critical batch size significantly extended, approaching optimal scaling trends

Ablation Studies

GN vs GN-prox-linear

Both methods show nearly identical performance, indicating limited contribution of higher-order loss terms to performance improvement.

Full GN vs Layer-wise GN

Layer-wise methods approach full GN performance in most settings, demonstrating limited importance of cross-layer curvature information.

Key Findings

  1. Importance of Learning Rate Schedule: Global cosine scheduling performs best at small to medium batch sizes
  2. Necessity of Line Search: Critical for stable convergence of GN methods
  3. Inner Optimizer Selection: Muon outperforms AdamW as inner optimizer

Second-Order Optimization Methods

  • Hessian-free Optimization: Conjugate gradient methods proposed by Martens (2010)
  • Diagonal Hessian Approximation: Lightweight methods such as AdaHessian and Sophia
  • Layer-wise Approximation: Core concept of Shampoo family methods

LLM Optimizer Development

  • Traditional Methods: SGD, Adam family
  • Modern Second-Order Methods: Shampoo outperformed competitors by 28% in AlgoPerf competition
  • Specialized Methods: Muon, SOAP, and other methods designed for LLMs

Conclusions and Discussion

Main Conclusions

  1. Performance Bounds Established: Full GN method provides clear performance targets for second-order optimization
  2. Structural Importance: Layer-wise Hessian structure contains sufficient information to achieve most gains
  3. Approximation Effectiveness: Significant performance gap exists between current approximation methods and idealized layer-wise oracle

Limitations

  1. Computational Overhead: Current implementation is 4-5× slower than standard training
  2. Scale Constraints: Experiments limited to 150M parameter models
  3. Practical Applicability: Primarily serves as an analytical tool rather than a direct practical optimizer

Future Directions

  1. Efficient Implementation: Develop computationally efficient exact second-order methods
  2. Improved Approximations: Enhance layer-wise Hessian approximation methods
  3. Scale Extension: Validate findings on larger models

In-Depth Evaluation

Strengths

  1. Theoretical Depth: Provides important theoretical insights into performance limits of second-order optimization
  2. Experimental Rigor: Extensive hyperparameter search and multiple regularization strategies
  3. Practical Value: Provides clear targets for improving existing second-order methods
  4. Methodological Innovation: Cleverly uses JVPs to avoid explicit Hessian storage

Weaknesses

  1. Computational Cost: High computational overhead limits practical application
  2. Scale Limitations: Not validated on truly large-scale LLMs
  3. Theoretical Analysis: Lacks deep theoretical explanation for why layer-wise approximation is so effective

Impact

  1. Academic Contribution: Provides important benchmarks for second-order optimization research
  2. Practical Guidance: Indicates directions for improving existing methods
  3. Methodological Value: Establishes new framework for evaluating second-order methods

Applicable Scenarios

  • Theoretical analysis of second-order optimization methods
  • Performance benchmarking of new optimization algorithms
  • Optimization choices for large-batch training scenarios

References

This paper cites important works in the optimization field, including:

  • Martens (2010): Pioneering work on Hessian-free optimization
  • Gupta et al. (2018): Shampoo optimizer
  • Jordan et al. (2024): Muon optimizer
  • Vyas et al. (2025): SOAP optimizer

Overall Assessment: This is a high-quality research paper that rigorously establishes performance bounds for second-order optimization in LLM training through careful experimentation, providing important theoretical insights and practical guidance for the field. Despite computational cost and scale limitations, its academic value and guidance for future research are significant.