Nonlinear discretizations and Newton's method: characterizing stationary points of regression objectives
Rowan
Second-order methods are emerging as promising alternatives to standard first-order optimizers such as gradient descent and ADAM for training neural networks. Though the advantages of including curvature information in computing optimization steps have been celebrated in the scientific machine learning literature, the only second-order methods that have been studied are quasi-Newton, meaning that the Hessian matrix of the objective function is approximated. Though one would expect only to gain from using the true Hessian in place of its approximation, we show that neural network training reliably fails when relying on exact curvature information. The failure modes provide insight into both the geometry of nonlinear discretizations and the distribution of stationary points in the loss landscape, leading us to question the conventional wisdom that the loss landscape is replete with local minima.
Second-order optimization methods are emerging as promising alternatives to first-order optimizers such as gradient descent and ADAM. Although the advantages of incorporating curvature information to compute optimization steps are widely praised in the scientific machine learning literature, all studied second-order methods are quasi-Newton methods that approximate the Hessian matrix of the objective function. While one might expect that replacing the approximation with the true Hessian would only yield benefits, this paper demonstrates that neural network training reliably fails when relying on exact curvature information. These failure modes provide insights into the geometric properties of nonlinear discretizations and the distribution of stationary points in the loss landscape, prompting us to question the conventional wisdom that loss landscapes are filled with local minima.
First-order vs. second-order optimization: Traditionally, neural network training relies primarily on first-order optimization methods such as gradient descent and ADAM, which iteratively update parameters using only gradient information.
Theoretical advantages of second-order methods: Second-order methods use local quadratic approximations of the objective function to determine both the direction and magnitude of optimization steps, offering natural step size suggestions and avoiding oscillations in ill-conditioned regions.
Limitations of existing research: All second-order methods in the scientific machine learning (SciML) literature are quasi-Newton methods (e.g., BFGS, L-BFGS) that use Hessian approximations rather than exact Hessians.
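The contrast between first-order and second-order steps described above can be seen on a toy problem. The following is a minimal sketch, not taken from the paper: the quadratic objective, the matrix `A`, the starting point, and the learning rate are all hypothetical choices made for illustration.

```python
import numpy as np

# Hypothetical ill-conditioned quadratic f(x) = 0.5 * x^T A x (condition number 1000).
A = np.diag([1.0, 1000.0])

def grad(x):
    # Gradient of the quadratic.
    return A @ x

def hess(x):
    # Hessian of the quadratic (constant for this objective).
    return A

x0 = np.array([1.0, 1.0])

# Gradient descent: the step size must stay below 2 / lambda_max to remain stable,
# which makes progress along the low-curvature direction slow.
x_gd = x0.copy()
lr = 1.0 / 1000.0
for _ in range(100):
    x_gd = x_gd - lr * grad(x_gd)

# Newton: solve H p = -g, so the local quadratic model supplies both
# the direction and the length of the step.
x_newton = x0 - np.linalg.solve(hess(x0), grad(x0))

print("gradient descent after 100 steps:", x_gd)
print("Newton after 1 step:", x_newton)
```

On a purely quadratic objective the Newton step lands on the minimizer at once, while gradient descent is still far from it along the flat direction. This only illustrates the theoretical advantage; it says nothing about behavior on neural network objectives, which is the subject of the paper.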
The author questions a fundamental assumption: is using the exact Hessian truly better than using an approximation? Through theoretical analysis and numerical experiments, the author discovers that exact Newton's method exhibits pathological behavior in neural network training, providing new perspectives for understanding the geometry of nonlinear discretizations and the structure of loss landscapes.
Geometric interpretation: Discusses regression problems on manifolds and provides a geometric interpretation of stationary points
Conceptual framework: Conceptualizes neural networks as approximation manifolds that simultaneously construct basis functions and coefficients
Trivial solution identification: Identifies special stationary points of neural network regression objectives—trivial zero solutions
Numerical findings: Demonstrates through experiments that exact Newton's method reliably converges to trivial solutions, even on simple one-dimensional problems
Mechanism explanation: Analyzes differences between quasi-Newton and exact Newton methods, explaining why the former succeeds
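Consider a discrete regression problem in which a target vector $v$ is approximated by a parameterized vector $N(\theta)$, where $\theta$ are the parameters to be determined. In the usual least-squares form (the factor $\tfrac{1}{2}$ is a convention assumed here), the quadratic error objective and its stationary point condition are $L(\theta) = \tfrac{1}{2}\lVert N(\theta) - v \rVert^2$ and $\nabla_\theta L = \left(\frac{\partial N}{\partial \theta}\right)^{\top} (N(\theta) - v) = 0$.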
Linear discretization: Parameter scaling of fixed basis vectors satisfies the Galerkin optimality condition, guaranteeing a unique solution that is a minimum.
Nonlinear discretization: Defines an approximation manifold embedded in high-dimensional space; the stationary point condition requires the error vector to be orthogonal to the tangent space of the approximation space.
The paper's geometric example has eight stationary points (2 minima, 2 maxima, and 4 saddle points) and illustrates that Newton's method shows no preference among the different types of stationary point.
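A much smaller illustration of the same behavior, offered as a sketch rather than as the paper's example: take the unit circle as a one-parameter approximation manifold and a target point outside it. The circle, the target `v`, and the starting angles are hypothetical. The objective then has one minimum (the nearest point on the circle) and one maximum (the farthest point), and at both the error vector is orthogonal to the tangent.

```python
import numpy as np

v = np.array([2.0, 0.0])  # hypothetical target vector, outside the unit circle

def N(theta):
    # Point on the approximation manifold (unit circle).
    return np.array([np.cos(theta), np.sin(theta)])

def tangent(theta):
    # dN/dtheta: tangent vector to the manifold at N(theta).
    return np.array([-np.sin(theta), np.cos(theta)])

def second_derivative(theta):
    # d^2N/dtheta^2 for the circle.
    return -N(theta)

def loss(theta):
    r = N(theta) - v
    return 0.5 * r @ r

def newton(theta, steps=20):
    # Exact Newton iteration on the scalar objective L(theta).
    for _ in range(steps):
        r = N(theta) - v
        g = tangent(theta) @ r                               # L'(theta)
        h = tangent(theta) @ tangent(theta) + second_derivative(theta) @ r  # L''(theta)
        theta = theta - g / h
    return theta

theta_a = newton(0.5)  # starts near the minimum at theta = 0
theta_b = newton(2.5)  # starts near the maximum at theta = pi

for th in (theta_a, theta_b):
    # At a stationary point the error is orthogonal to the tangent space.
    print(th, loss(th), tangent(th) @ (N(th) - v))
```

Started near the maximum, the iteration converges to it just as readily as it converges to the minimum from the other starting angle: Newton's method seeks zeros of the gradient and does not distinguish between them.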
Reformulating an MLP neural network as:
$$N(x,\theta) = \sum_{k=1}^{|\theta^O|} \theta^O_k \, h_k(x;\theta^I)$$
where $\theta = [\theta^I, \theta^O]$ is decomposed into "inner" and "outer" parameters, with inner parameters defining basis functions and outer parameters serving as scaling coefficients.
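The inner/outer decomposition can be made concrete with a one-hidden-layer network. This is a minimal sketch under assumptions that are not from the paper: a tanh activation, a width of 8, a sine target, and the helper names `basis`, `network`, and `inner_grad_w` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 50)
v = np.sin(np.pi * x)  # hypothetical target sampled on the grid

def basis(x, theta_I):
    # Basis functions h_k(x; theta_I): one tanh unit per column.
    w, b = theta_I
    return np.tanh(np.outer(x, w) + b)  # shape (len(x), width)

def network(x, theta_I, theta_O):
    # N(x, theta) = sum_k theta_O[k] * h_k(x; theta_I)
    return basis(x, theta_I) @ theta_O

width = 8
theta_I = (rng.normal(size=width), rng.normal(size=width))

# With the inner parameters frozen, the basis is fixed and fitting the
# outer parameters is an ordinary linear least-squares problem.
H = basis(x, theta_I)
theta_O_fit, *_ = np.linalg.lstsq(H, v, rcond=None)
residual = network(x, theta_I, theta_O_fit) - v

# Galerkin-type condition: the residual is orthogonal to every basis vector.
print("max |H^T r| =", np.abs(H.T @ residual).max())

def inner_grad_w(theta_I, theta_O):
    # Gradient of 0.5 * ||N - v||^2 with respect to the inner weights w.
    Hm = basis(x, theta_I)
    r = Hm @ theta_O - v
    dh_dw = (1.0 - Hm**2) * x[:, None]  # derivative of tanh(w x + b) w.r.t. w
    return theta_O * (dh_dw.T @ r)      # each entry carries a factor theta_O[k]

print("inner gradient at theta_O = 0:", inner_grad_w(theta_I, np.zeros(width)))
```

The last line shows that the gradient with respect to the inner weights vanishes whenever all outer coefficients are zero, since each entry is multiplied by the corresponding coefficient. Whether such a point is fully stationary also depends on the gradient with respect to the outer parameters, which this sketch does not examine.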
All existing second-order methods in the SciML literature are quasi-Newton methods; according to the author, this paper is the first to systematically study the behavior of exact Newton's method in neural network training.
Failure of exact Newton's method: Exact Hessian information causes neural network training to reliably fail, converging to trivial saddle point solutions
Success mechanism of quasi-Newton methods: Quasi-Newton methods succeed not because of Hessian approximation, but because of built-in ascent protection mechanisms
Loss landscape characteristics: Saddle points dominate in high-dimensional neural network loss landscapes, which calls into question the conventional view that "local minima are abundant"
Geometric insights: Nonlinear discretizations create embedded manifolds where stationary point conditions have clear geometric interpretations
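The difference between an unguarded Newton iteration and one with protection against ascent can be sketched on a two-variable function with one saddle and two minima. Everything here is an assumption made for illustration: the objective, the starting point, and the safeguard itself (fall back to the negative gradient when the Newton direction is not a descent direction, then backtrack). It is a crude stand-in, not the mechanism of BFGS or L-BFGS and not the paper's experiment.

```python
import numpy as np

def f(p):
    # Hypothetical objective: saddle at (0, 0), minima at (0, 1) and (0, -1).
    x, y = p
    return 0.5 * x**2 + 0.25 * y**4 - 0.5 * y**2

def grad(p):
    x, y = p
    return np.array([x, y**3 - y])

def hess(p):
    x, y = p
    return np.diag([1.0, 3.0 * y**2 - 1.0])

def exact_newton(p, steps=50):
    # Pure Newton: jump to the stationary point of the local quadratic model.
    for _ in range(steps):
        p = p - np.linalg.solve(hess(p), grad(p))
    return p

def safeguarded_newton(p, steps=50, c=1e-4):
    for _ in range(steps):
        g = grad(p)
        if np.linalg.norm(g) < 1e-12:
            break
        d = -np.linalg.solve(hess(p), g)
        if g @ d >= 0.0:
            d = -g  # Newton direction points uphill: use steepest descent instead
        t = 1.0
        for _halving in range(30):  # backtracking on a sufficient-decrease test
            if f(p + t * d) <= f(p) + c * t * (g @ d):
                break
            t *= 0.5
        p = p + t * d
    return p

start = np.array([1.0, 0.3])
p_newton = exact_newton(start)
p_safe = safeguarded_newton(start)

print("exact Newton:", p_newton, "f =", f(p_newton))
print("safeguarded: ", p_safe, "f =", f(p_safe))
print("Hessian eigenvalues at the Newton limit:", np.linalg.eigvalsh(hess(p_newton)))
```

From this starting point the unguarded iteration settles on the saddle at the origin, where the Hessian has a negative eigenvalue, while the safeguarded variant is pushed away from it and reaches a minimum. The sketch is consistent with the summary's claim that descent protection, rather than the quality of the curvature estimate, is what separates the two behaviors, but it does not establish that claim for neural networks.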
Theoretical innovation: First systematic study of pathological behavior of exact Newton's method in neural network training, challenging conventional wisdom
Geometric insights: Provides geometric interpretation of nonlinear discretizations and stationary points, enhancing understanding of loss landscapes
Experimental sufficiency: Clear hierarchical experimental design from simple geometric examples to complex neural networks
Practical value: Explains the true reasons for quasi-Newton method success, providing guidance for optimizer design
Physics-informed neural networks (Raissi et al., various PINNs applications)
Neural network theory (spectral bias, SIREN, Fourier features)
High-dimensional optimization theory (saddle point problems, Dauphin et al.)
Overall Assessment: This is an excellent paper with profound theoretical insights. Through counterintuitive findings, it challenges the conventional wisdom that exact Hessians are necessarily superior, and it offers new perspectives on the geometric nature of neural network optimization. While the experimental scale is relatively limited, its theoretical contributions and its explanation of optimizer design principles have significant academic value.