2025-11-16T03:28:12.300331

The Potential of Second-Order Optimization for LLMs: A Study with Full Gauss-Newton

Abreu, Vyas, Kakade et al.

Recent efforts to accelerate LLM pretraining have focused on computationally-efficient approximations that exploit second-order structure. This raises a key question for large-scale training: how much performance is forfeited by these approximations? To probe this question, we establish a practical upper bound on iteration complexity by applying full Gauss-Newton (GN) preconditioning to transformer models of up to 150M parameters. Our experiments show that full GN updates yield substantial gains over existing optimizers, achieving a 5.4x reduction in training iterations compared to strong baselines like SOAP and Muon. Furthermore, we find that a precise layerwise GN preconditioner, which ignores cross-layer information, nearly matches the performance of the full GN method. Collectively, our results suggest: (1) the GN approximation is highly effective for preconditioning, implying higher-order loss terms may not be critical for convergence speed; (2) the layerwise Hessian structure contains sufficient information to achieve most of these potential gains; and (3) a significant performance gap exists between current approximate methods and an idealized layerwise oracle.

academic

The Potential of Second-Order Optimization for LLMs: A Study with Full Gauss-Newton

Basic Information

Paper ID: 2510.09378
Title: The Potential of Second-Order Optimization for LLMs: A Study with Full Gauss-Newton
Authors: Natalie Abreu (Harvard), Nikhil Vyas (Harvard/OpenAI), Sham Kakade (Harvard), Depen Morwani (Harvard)
Classification: cs.LG cs.AI
Publication Date: October 10, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.09378

Abstract

This paper investigates the performance loss incurred by computationally efficient approximations of existing second-order optimization methods in large language model (LLM) pretraining. The authors establish practical upper bounds on iteration complexity by applying full Gauss-Newton (GN) preconditioning to a 150M-parameter Transformer model. Experiments demonstrate that full GN updates achieve a 5.4× reduction in training iterations compared to strong baselines such as SOAP and Muon. Furthermore, the exact layer-wise GN preconditioner, which ignores cross-layer information, nearly achieves the performance of the full GN method.

Research Background and Motivation

Problem Definition

As computational demands for LLMs continue to grow, improvements in optimization methods have become a core strategy for enhancing training efficiency. While first-order methods (such as SGD and Adam) are widely used, second-order methods theoretically offer faster convergence rates and superior large-batch scaling capabilities.

Research Motivation

Limitations of Existing Second-Order Methods: Current second-order optimizers (such as Shampoo, SOAP, and Muon) employ Hessian approximations to maintain computational feasibility, yet the performance loss from these approximations remains unclear.
Theory-Practice Gap: Although second-order methods are theoretically superior, the prohibitive storage and computational costs of the full Hessian necessitate the use of approximations in practice.
Core Research Questions: "What are the fundamental performance limits of second-order optimization in LLMs? Which structural properties of the Hessian are necessary to achieve these limits?"

Core Contributions

Establishing Performance Bounds: Establishes practical performance upper bounds for second-order optimization through the full Gauss-Newton method, achieving a 5.4× improvement in iteration complexity compared to SOAP.
Revealing Key Structures: Discovers that layer-wise Hessian structure contains sufficient information to achieve most performance gains, with limited importance of cross-layer curvature information.
Theoretical Insights: Demonstrates that GN approximation is highly effective for preconditioning, suggesting that higher-order loss terms may not be critical for convergence speed.
Batch Size Scaling: Significantly extends the critical batch size, demonstrating near-optimal scaling performance.

Methodology Details

Task Definition

Given model parameters θ, input x, and labels y, define the loss function L(f(θ,x), y). The objective is to minimize expected loss, with focus on iteration complexity (the number of steps required to reach target loss).

Gauss-Newton Method Principles

Mathematical Foundation

The complete Hessian matrix can be decomposed as:

∇²θL(θ) = ∇θf(θ)ᵀ∇²zL(θ)∇θf(θ) + Σₐ(δL/δzₐ)∇²θ[f(θ)]ₐ

where the first term is the Gauss-Newton matrix G, and the second term represents model curvature.

Algorithm Implementation

Algorithm 1: Gauss-Newton Method

Perform first-order Taylor expansion of the model: f⁽¹⁾θₜ(θ,x) := f(θₜ,x) + ∇f(θₜ,x)ᵀ(θ-θₜ)
Convexify the loss: L̃θₜ(θ) := (1/b)Σ₍ₓ,ᵧ₎∈B ℓ(f⁽¹⁾θₜ(θ,x), y)
Construct second-order Taylor approximation: L̃⁽²⁾θₜ(θ)
Solve the least-squares problem: θ̂ = argminθ L̃⁽²⁾θₜ(θ)
Line search: θₜ₊₁ ← θₜ + α*(θ̂ - θₜ)

Memory-Feasible Implementation

To avoid explicit Hessian storage, Jacobian-vector products (JVPs) are employed to implement a functionally equivalent method. The core idea is to optimize a second-order Taylor approximation of the loss function L and a first-order Taylor approximation of the model f.

Variant Methods

GN-prox-linear Method

Directly minimizes loss on the linearized model: θ* = argminθ L̃θₜ(θ), used to investigate the impact of higher-order loss terms.

Layer-wise Gauss-Newton

For each layer l independently:

Compute first-order Taylor expansion for that layer: f⁽¹⁾θₗ,ₜ(θₗ)
Solve: θₗ,ₜ₊₁ = argminθₗ L̃⁽²⁾θₗ,ₜ(θₗ)
Merge updates from all layers and apply line search

Experimental Setup

Datasets and Models

Models: 45M and 150M parameter LLaMA architecture
Dataset: C4 dataset
Sequence Length: 1024

Baseline Methods

AdamW: The most widely used LLM optimizer
Muon: Method using Newton-Schulz orthogonalization
SOAP: Latest variant of Shampoo

Experimental Configuration

Inner Optimizer: Muon used to solve least-squares problems
Batch Size: Controlled via gradient accumulation, bᵢₙₙₑᵣ = 32(45M) / 128(150M)
Learning Rate Schedule: Three strategies—global cosine, global+inner cosine, and constant+inner cosine
Regularization: Multiple strategies including weight decay and line search

Experimental Results

Main Results

Iteration Complexity

In experiments reaching loss 3.25:

Gauss-Newton: 54 steps
SOAP: 292 steps (5.4× difference)
Muon: ~16× difference
Layer-wise GN: 78 steps (only 1.4× difference)

Batch Size Scaling

In fixed 3B token training:

Gauss-Newton maintains good performance at 120M batch size (loss 3.45)
AdamW severely degrades at the same batch size (loss >4.4)
Critical batch size significantly extended, approaching optimal scaling trends

Ablation Studies

GN vs GN-prox-linear

Both methods show nearly identical performance, indicating limited contribution of higher-order loss terms to performance improvement.

Full GN vs Layer-wise GN

Layer-wise methods approach full GN performance in most settings, demonstrating limited importance of cross-layer curvature information.

Key Findings

Importance of Learning Rate Schedule: Global cosine scheduling performs best at small to medium batch sizes
Necessity of Line Search: Critical for stable convergence of GN methods
Inner Optimizer Selection: Muon outperforms AdamW as inner optimizer

Second-Order Optimization Methods

Hessian-free Optimization: Conjugate gradient methods proposed by Martens (2010)
Diagonal Hessian Approximation: Lightweight methods such as AdaHessian and Sophia
Layer-wise Approximation: Core concept of Shampoo family methods

LLM Optimizer Development

Traditional Methods: SGD, Adam family
Modern Second-Order Methods: Shampoo outperformed competitors by 28% in AlgoPerf competition
Specialized Methods: Muon, SOAP, and other methods designed for LLMs

Conclusions and Discussion

Main Conclusions

Performance Bounds Established: Full GN method provides clear performance targets for second-order optimization
Structural Importance: Layer-wise Hessian structure contains sufficient information to achieve most gains
Approximation Effectiveness: Significant performance gap exists between current approximation methods and idealized layer-wise oracle

Limitations

Computational Overhead: Current implementation is 4-5× slower than standard training
Scale Constraints: Experiments limited to 150M parameter models
Practical Applicability: Primarily serves as an analytical tool rather than a direct practical optimizer

Future Directions

Efficient Implementation: Develop computationally efficient exact second-order methods
Improved Approximations: Enhance layer-wise Hessian approximation methods
Scale Extension: Validate findings on larger models

In-Depth Evaluation

Strengths

Theoretical Depth: Provides important theoretical insights into performance limits of second-order optimization
Experimental Rigor: Extensive hyperparameter search and multiple regularization strategies
Practical Value: Provides clear targets for improving existing second-order methods
Methodological Innovation: Cleverly uses JVPs to avoid explicit Hessian storage

Weaknesses

Computational Cost: High computational overhead limits practical application
Scale Limitations: Not validated on truly large-scale LLMs
Theoretical Analysis: Lacks deep theoretical explanation for why layer-wise approximation is so effective

Impact

Academic Contribution: Provides important benchmarks for second-order optimization research
Practical Guidance: Indicates directions for improving existing methods
Methodological Value: Establishes new framework for evaluating second-order methods

Applicable Scenarios

Theoretical analysis of second-order optimization methods
Performance benchmarking of new optimization algorithms
Optimization choices for large-batch training scenarios

References

This paper cites important works in the optimization field, including:

Martens (2010): Pioneering work on Hessian-free optimization
Gupta et al. (2018): Shampoo optimizer
Jordan et al. (2024): Muon optimizer
Vyas et al. (2025): SOAP optimizer

Overall Assessment: This is a high-quality research paper that rigorously establishes performance bounds for second-order optimization in LLM training through careful experimentation, providing important theoretical insights and practical guidance for the field. Despite computational cost and scale limitations, its academic value and guidance for future research are significant.