The Potential of Second-Order Optimization for LLMs: A Study with Full Gauss-Newton
Abreu, Vyas, Kakade et al.
Recent efforts to accelerate LLM pretraining have focused on computationally-efficient approximations that exploit second-order structure. This raises a key question for large-scale training: how much performance is forfeited by these approximations? To probe this question, we establish a practical upper bound on iteration complexity by applying full Gauss-Newton (GN) preconditioning to transformer models of up to 150M parameters. Our experiments show that full GN updates yield substantial gains over existing optimizers, achieving a 5.4x reduction in training iterations compared to strong baselines like SOAP and Muon. Furthermore, we find that a precise layerwise GN preconditioner, which ignores cross-layer information, nearly matches the performance of the full GN method. Collectively, our results suggest: (1) the GN approximation is highly effective for preconditioning, implying higher-order loss terms may not be critical for convergence speed; (2) the layerwise Hessian structure contains sufficient information to achieve most of these potential gains; and (3) a significant performance gap exists between current approximate methods and an idealized layerwise oracle.
This paper investigates how much performance is lost to the computationally efficient approximations used by existing second-order optimization methods in large language model (LLM) pretraining. The authors establish a practical upper bound on iteration complexity by applying full Gauss-Newton (GN) preconditioning to Transformer models of up to 150M parameters. Experiments show that full GN updates achieve a 5.4× reduction in training iterations compared to strong baselines such as SOAP and Muon. Furthermore, a precise layer-wise GN preconditioner, which ignores cross-layer information, nearly matches the performance of the full GN method.
As computational demands for LLMs continue to grow, improvements in optimization methods have become a core strategy for enhancing training efficiency. While first-order methods (such as SGD and Adam) are widely used, second-order methods theoretically offer faster convergence rates and superior large-batch scaling capabilities.
Limitations of Existing Second-Order Methods: Current second-order optimizers (such as Shampoo, SOAP, and Muon) employ Hessian approximations to maintain computational feasibility, yet the performance loss from these approximations remains unclear.
Theory-Practice Gap: Although second-order methods are theoretically superior, the prohibitive storage and computational costs of the full Hessian necessitate the use of approximations in practice.
Core Research Questions: "What are the fundamental performance limits of second-order optimization in LLMs? Which structural properties of the Hessian are necessary to achieve these limits?"
Establishing Performance Bounds: Establishes a practical performance upper bound for second-order optimization through the full Gauss-Newton method, achieving a 5.4× reduction in training iterations compared to strong baselines such as SOAP and Muon.
Revealing Key Structures: Finds that layer-wise Hessian structure contains sufficient information to achieve most of the performance gains; cross-layer curvature information appears to matter comparatively little.
Theoretical Insights: Shows empirically that the GN approximation is highly effective for preconditioning, which suggests that higher-order loss terms may not be critical for convergence speed.
Given model parameters θ, input x, and labels y, define the loss function L(f(θ,x), y). The objective is to minimize expected loss, with focus on iteration complexity (the number of steps required to reach target loss).
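As a concrete reading of iteration complexity, the sketch below counts the steps a run needs to first reach a target loss and compares two runs by the ratio of those counts. The function names and the loss curves are made up for illustration; they are not taken from the paper.

```python
from typing import Optional, Sequence


def steps_to_target(losses: Sequence[float], target: float) -> Optional[int]:
    """Return the 1-based step at which the loss first reaches the target.

    Returns None if the run never reaches the target loss.
    """
    for step, loss in enumerate(losses, start=1):
        if loss <= target:
            return step
    return None


def iteration_speedup(
    baseline_losses: Sequence[float],
    candidate_losses: Sequence[float],
    target: float,
) -> Optional[float]:
    """Ratio of baseline steps to candidate steps needed to reach the target."""
    base = steps_to_target(baseline_losses, target)
    cand = steps_to_target(candidate_losses, target)
    if base is None or cand is None:
        return None  # the comparison is undefined if either run misses the target
    return base / cand


# Hypothetical loss curves, one value per optimizer step.
baseline = [4.0, 3.5, 3.2, 3.0, 2.9, 2.8]
candidate = [4.0, 3.1, 2.8]

print(steps_to_target(baseline, 2.8))
print(steps_to_target(candidate, 2.8))
print(iteration_speedup(baseline, candidate, 2.8))
```

Under this definition, a reported "5.4× reduction in training iterations" would correspond to `iteration_speedup` returning 5.4 at the chosen target loss. Note that the ratio counts steps only and says nothing about the cost of each step.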
To avoid storing the Hessian explicitly, the method is implemented with Jacobian-vector products (JVPs), which give a functionally equivalent update without ever forming the matrix. The core idea is to minimize a second-order Taylor approximation of the loss function L with respect to the model outputs, composed with a first-order Taylor approximation of the model f with respect to the parameters θ.
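The two Taylor approximations can be written out as follows. This is the standard Gauss-Newton construction, given here as a sketch; the symbols δ (the parameter update), z (the model output), J, H_L, G, and v are notation introduced for this illustration rather than taken from the paper.

```latex
\begin{align*}
f(\theta + \delta, x) &\approx f(\theta, x) + J\delta,
  \qquad J = \frac{\partial f(\theta, x)}{\partial \theta}, \\
L\bigl(f(\theta + \delta, x), y\bigr) &\approx L(z, y)
  + \nabla_z L(z, y)^{\top} J\delta
  + \tfrac{1}{2}\,\delta^{\top} J^{\top} H_L J\,\delta,
  \qquad z = f(\theta, x),\quad H_L = \nabla_z^{2} L(z, y), \\
G &= J^{\top} H_L J,
  \qquad G v = J^{\top}\bigl(H_L\,(J v)\bigr).
\end{align*}
```

The last identity is what makes a matrix-free implementation possible: the product Gv needs one JVP to form Jv, a Hessian product in the much smaller output space, and one transposed-Jacobian product, so G itself is never stored. Under this notation, a layer-wise preconditioner that ignores cross-layer information would keep only the diagonal blocks of G, one per layer; this is one reading of the description above, not a statement of the paper's exact procedure.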
This paper cites important works in the optimization field, including:
Martens (2010): Pioneering work on Hessian-free optimization
Gupta et al. (2018): Shampoo optimizer
Jordan et al. (2024): Muon optimizer
Vyas et al. (2025): SOAP optimizer
Overall Assessment: This is a high-quality research paper that establishes a practical performance bound for second-order optimization in LLM training through careful experimentation, and it offers useful insights and guidance for the field. Despite the computational cost of full GN and the limited model scale (up to 150M parameters), its value for future research is significant.