Nonlinear discretizations and Newton's method: characterizing stationary points of regression objectives

Rowan
Second-order methods are emerging as promising alternatives to standard first-order optimizers such as gradient descent and ADAM for training neural networks. Though the advantages of including curvature information in computing optimization steps have been celebrated in the scientific machine learning literature, the only second-order methods that have been studied are quasi-Newton, meaning that the Hessian matrix of the objective function is approximated. Though one would expect only to gain from using the true Hessian in place of its approximation, we show that neural network training reliably fails when relying on exact curvature information. The failure modes provide insight both into the geometry of nonlinear discretizations as well as the distribution of stationary points in the loss landscape, leading us to question the conventional wisdom that the loss landscape is replete with local minima.

Basic Information

  • Paper ID: 2510.11987
  • Title: Nonlinear discretizations and Newton's method: characterizing stationary points of regression objectives
  • Author: Conor Rowan (University of Colorado Boulder)
  • Category: cs.LG (Machine Learning)
  • Publication Date: October 13, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.11987

Abstract

Second-order optimization methods are emerging as promising alternatives to first-order optimizers such as gradient descent and ADAM. Although the advantages of incorporating curvature information to compute optimization steps are widely praised in the scientific machine learning literature, all studied second-order methods are quasi-Newton methods that approximate the Hessian matrix of the objective function. While one might expect that replacing the approximation with the true Hessian would only yield benefits, this paper demonstrates that neural network training reliably fails when relying on exact curvature information. These failure modes provide insights into the geometric properties of nonlinear discretizations and the distribution of stationary points in the loss landscape, prompting us to question the conventional wisdom that loss landscapes are filled with local minima.

Research Background and Motivation

Problem Background

  1. First-order vs. second-order optimization: Neural network training traditionally relies on first-order optimizers such as gradient descent and ADAM, which update parameters iteratively using only gradient information.
  2. Theoretical advantages of second-order methods: Second-order methods use local quadratic approximations of the objective function to determine both the direction and magnitude of optimization steps, offering natural step size suggestions and avoiding oscillations in ill-conditioned regions.
  3. Limitations of existing research: All second-order methods in the scientific machine learning (SciML) literature are quasi-Newton methods (e.g., BFGS, L-BFGS) that use Hessian approximations rather than exact Hessians.

Research Motivation

The author questions a fundamental assumption: is using the exact Hessian truly better than using an approximation? Through theoretical analysis and numerical experiments, the author discovers that exact Newton's method exhibits pathological behavior in neural network training, providing new perspectives for understanding the geometry of nonlinear discretizations and the structure of loss landscapes.

Core Contributions

  1. Geometric interpretation: Discusses regression problems on manifolds and provides a geometric interpretation of stationary points
  2. Conceptual framework: Conceptualizes neural networks as approximation manifolds that simultaneously construct basis functions and coefficients
  3. Trivial solution identification: Identifies special stationary points of neural network regression objectives—trivial zero solutions
  4. Numerical findings: Demonstrates through experiments that exact Newton's method reliably converges to trivial solutions, even on simple one-dimensional problems
  5. Mechanism explanation: Analyzes differences between quasi-Newton and exact Newton methods, explaining why the former succeeds

Methodology Details

Task Definition

Consider a discrete regression problem where a target vector v needs to be approximated by a parameterized vector N(θ), where θ are parameters to be determined. The standard quadratic error objective and its stationary point condition are:

$$L(\theta) = \frac{1}{2}\|N(\theta) - v\|^2, \quad \frac{\partial L}{\partial \theta_k} = (N(\theta) - v) \cdot \frac{\partial N}{\partial \theta_k} = 0$$

Geometric Understanding of Nonlinear Discretizations

Linear vs. Nonlinear Discretization Comparison

Linear discretization: Parameter scaling of fixed basis vectors satisfies the Galerkin optimality condition, guaranteeing a unique solution that is a minimum.

Nonlinear discretization: Defines an approximation manifold embedded in high-dimensional space; the stationary point condition requires the error vector to be orthogonal to the tangent space of the approximation space.
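The orthogonality condition in the linear case can be checked numerically. The sketch below is illustrative only: the basis matrix `B`, the target `v`, and their sizes are hypothetical and not taken from the paper. It solves a small least-squares problem and confirms that the error is orthogonal to every basis vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical fixed basis: three basis vectors in R^5, stored as columns.
B = rng.standard_normal((5, 3))
v = rng.standard_normal(5)  # hypothetical target vector

# Linear discretization N(theta) = B @ theta; least squares gives the minimizer.
theta, *_ = np.linalg.lstsq(B, v, rcond=None)
error = B @ theta - v

# Galerkin condition: the error is orthogonal to every basis vector.
orthogonality = B.T @ error
print(orthogonality)  # entries are zero up to round-off
```

In the nonlinear case the columns of `B` are replaced by the tangent vectors of the manifold at the current parameters, which change with the parameters, so the same orthogonality condition can hold at several points.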

Geometric Example Analysis

Unit circle example: $N(\theta) = \begin{bmatrix} \cos(\theta) \\ \sin(\theta) \end{bmatrix}, \quad v = \begin{bmatrix} 2 \\ 2 \end{bmatrix}$

Stationary point condition: $\frac{\partial L}{\partial \theta} = 2(\sin(\theta) - \cos(\theta)) = 0$

Solutions: $\theta = \pi/4, 5\pi/4$, where the former is a minimum and the latter is a maximum.
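A minimal sketch of Newton's method on the unit circle example, using the gradient $2(\sin\theta - \cos\theta)$ stated above and its derivative $2(\cos\theta + \sin\theta)$. The starting points and iteration count are arbitrary choices made for illustration.

```python
import math

def grad(theta):
    # dL/dtheta for the unit circle example with v = (2, 2)
    return 2.0 * (math.sin(theta) - math.cos(theta))

def hess(theta):
    # second derivative of L with respect to theta
    return 2.0 * (math.cos(theta) + math.sin(theta))

def newton(theta, steps=50):
    # plain Newton iteration on the stationarity condition grad(theta) = 0
    for _ in range(steps):
        theta -= grad(theta) / hess(theta)
    return theta

theta_min = newton(1.0)  # start near pi/4
theta_max = newton(4.0)  # start near 5*pi/4

print(theta_min, hess(theta_min))  # positive curvature: minimum
print(theta_max, hess(theta_max))  # negative curvature: maximum
```

The iteration finds whichever stationary point is nearby; nothing in the update distinguishes the minimum from the maximum.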

Elliptic torus example: $N(\theta) = \begin{bmatrix} (R + r\cos(\theta_2))\cos(\theta_1) \\ (R + r\cos(\theta_2))\sin(\theta_1) \\ r\sin(\theta_2) \end{bmatrix}$

This example has eight stationary points: 2 minima, 2 maxima, and 4 saddle points. It illustrates that Newton's method shows no preference among the different types of stationary points.

Neural Network Regression Analysis

MLP Structure Interpretation

An MLP neural network can be reformulated as: $N(x; \theta) = \sum_{k=1}^{|\theta^O|} \theta^O_k h_k(x; \theta^I)$

where $\theta = [\theta^I, \theta^O]$ is decomposed into "inner" and "outer" parameters, with inner parameters defining basis functions and outer parameters serving as scaling coefficients.

Theoretical Analysis of Trivial Solutions

When $N(x; \theta) = 0$, the stationary point condition for the continuous objective $L(\theta) = \frac{1}{2}\int_0^1 (N(x; \theta) - v(x))^2 dx$ becomes: $\frac{\partial L}{\partial \theta} = -\int_0^1 v(x) \frac{\partial N}{\partial \theta} dx = 0$

This can be satisfied in two ways:

  1. Fitting basis functions orthogonal to the target function
  2. Setting outer parameters $\theta^O = 0$
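The role of the inner and outer parameters at a zero solution can be illustrated with a small numerical sketch. It assumes a hypothetical network with a single hidden layer of 10 tanh neurons (simpler than the architecture used in the experiments), the objective $L = \frac{1}{2}\int_0^1 (N - v)^2 dx$, and a simple Riemann sum for the integral; the names `w`, `b`, and `theta_O` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 100)          # uniform integration grid
dx = x[1] - x[0]
v = 2.0 * np.sin(4.0 * np.pi * x)       # target function

# Hypothetical one-hidden-layer network with 10 tanh neurons.
w = rng.standard_normal(10)             # inner parameters (weights)
b = rng.standard_normal(10)             # inner parameters (biases)
theta_O = np.zeros(10)                  # outer parameters set to zero

h = np.tanh(np.outer(x, w) + b)         # basis functions h_k(x), shape (100, 10)
N = h @ theta_O                         # network output, identically zero here
err = N - v

# Gradients of L = 0.5 * integral (N - v)^2 dx via a Riemann sum.
dh_dw = (1.0 - h**2) * x[:, None]       # d h_k / d w_k
grad_w = theta_O * (err @ dh_dw) * dx   # scaled by theta_O, so it vanishes
grad_O = (err @ h) * dx                 # equals -integral v * h_k dx

print(np.abs(grad_w).max())             # 0.0
print(grad_O)                           # nonzero unless h_k is orthogonal to v
```

In this sketch, zero outer parameters remove the inner-parameter gradient entirely, and the remaining outer-parameter gradient is the inner product of each basis function with the target.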

Experimental Setup

Experimental Configuration

  • Network architecture: MLP with two hidden layers of 10 neurons each
  • Activation functions: Hyperbolic tangent for the standard MLP; sine for the SIREN networks
  • Parameter initialization: PyTorch built-in Xavier initialization
  • Optimization algorithm: Modified Newton's method (Levenberg-Marquardt algorithm)
  • Numerical integration: Uniform grid with 100 equally-spaced points

Modified Newton's Method

$$\theta_{k+1} = \theta_k - \eta \left(\frac{\partial^2 L}{\partial \theta \partial \theta} + \epsilon I\right)^{-1} \left(\frac{\partial L}{\partial \theta}\right)$$

where $0 < \eta < 1$ is a step size relaxation parameter and $\epsilon > 0$ introduces convexity to avoid excessively large steps.
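A minimal sketch of this update rule, applied to a hypothetical two-parameter objective $f(x, y) = x^2 - y^2$ whose only stationary point is a saddle at the origin. The test objective, starting point, and iteration count are assumptions chosen for illustration; the values of $\eta$ and $\epsilon$ match those reported for the standard MLP experiments.

```python
import numpy as np

def modified_newton_step(theta, grad, hess, eta=5e-2, eps=5e-2):
    """One step: theta - eta * (H + eps * I)^{-1} g."""
    H = hess(theta) + eps * np.eye(theta.size)
    return theta - eta * np.linalg.solve(H, grad(theta))

# Hypothetical test objective with a saddle at the origin: f(x, y) = x^2 - y^2.
def grad(t):
    return np.array([2.0 * t[0], -2.0 * t[1]])

def hess(t):
    return np.array([[2.0, 0.0], [0.0, -2.0]])

theta = np.array([1.0, 1.0])
for _ in range(500):
    theta = modified_newton_step(theta, grad, hess)

print(np.linalg.norm(theta))  # close to zero: the iterate has reached the saddle
```

Along the $y$ direction the objective decreases away from the origin, so a descent method would move away from it. The Newton step instead moves toward the stationary point in both directions, because the small shift $\epsilon$ does not change the sign of the negative eigenvalue.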

Experimental Results

Standard MLP Regression Experiments

Target function: $v(x) = 2\sin(4\pi x)$

Parameter settings: $\eta = \epsilon = 5 \times 10^{-2}$, $T = 1 \times 10^{-5}$

Main findings:

  • Newton's method converges to trivial solutions, learning basis functions orthogonal to the target function
  • 9 out of 10 runs obtain trivial solutions
  • Basis functions are primarily constant functions and forms like $\sin(\pi x) + c$
  • Hessian eigenvalue analysis confirms saddle point solutions

SIREN Network Experiments

Network configuration: Sine activation functions with $\omega_0 = 4$

Parameter settings: $\eta = 5 \times 10^{-2}$, $\epsilon = 1 \times 10^{-1}$

Results:

  • Still converges to trivial solutions, but basis functions become high-frequency non-redundant functions
  • 4 out of 5 runs obtain trivial solutions
  • Demonstrates that mitigating spectral bias does not prevent convergence to trivial solutions

Fourier Feature Embedding Experiments

Input layer: $\gamma(x) = [\sin(2\pi Bx), \cos(2\pi Bx)]^T$

Parameters: $\sigma^2 = 1.5$, $f = 10$

Results:

  • Approximately half of the runs converge to trivial solutions
  • Most remaining runs fail to converge
  • High-frequency basis functions still cannot avoid the problem

Physics-Informed Neural Networks (PINNs) Experiments

One-dimensional boundary value problem

$$\frac{\partial^2 u}{\partial x^2} + v(x) = 0, \quad u(0) = u(1) = 0$$

Strong form loss: $L(\theta) = \frac{1}{2} \int_0^1 \left(\frac{\partial^2 N(x; \theta)}{\partial x^2} + v(x)\right)^2 dx$

Results: All 5 runs converge to trivial solutions, learning basis functions whose second derivatives are orthogonal to the source term.
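To give a sense of what a trivial solution costs in the strong-form loss, the sketch below evaluates that loss on a 100-point grid for the zero function and for an exact solution. It rests on several assumptions: the source term is taken to be $v(x) = 2\sin(4\pi x)$ (the summary does not state it for this problem), the second derivative is approximated by central differences rather than by differentiating a network, and the integral is a Riemann sum over interior points.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 100)          # 100 equally spaced points
h = x[1] - x[0]
v = 2.0 * np.sin(4.0 * np.pi * x)       # assumed source term

def strong_form_loss(u):
    # Central-difference second derivative at interior grid points.
    u_xx = (u[2:] - 2.0 * u[1:-1] + u[:-2]) / h**2
    residual = u_xx + v[1:-1]
    # Riemann-sum approximation of 0.5 * integral residual^2 dx.
    return 0.5 * np.sum(residual**2) * h

# Exact solution of u'' + v = 0 with u(0) = u(1) = 0 for the assumed v.
u_exact = np.sin(4.0 * np.pi * x) / (8.0 * np.pi**2)
u_trivial = np.zeros_like(x)

loss_exact = strong_form_loss(u_exact)
loss_trivial = strong_form_loss(u_trivial)
print(loss_exact, loss_trivial)
```

Under these assumptions the zero function has a loss of about $\frac{1}{2}\int_0^1 v^2 dx = 1$, while the exact solution has a loss near zero, limited only by the finite-difference error.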

Two-dimensional diffusion-reaction problem

$$\nabla^2 u + u + v(x) = 0, \quad x \in [0,1]^2$$

Comparative experiments: Newton's method converges to trivial solutions, while ADAM successfully solves the differential equation.

Hessian Eigenvalue Statistical Analysis

By randomly generating $10^5$ Hessian matrices of size 140×140, with entries drawn from independent standard normal distributions, the study found:

  • None of the matrices is definite: every sample has both positive and negative eigenvalues
  • Supports the hypothesis that saddle points dominate in high-dimensional loss landscapes
  • Explains the phenomenon of Newton's method reliably converging to saddle points
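A scaled-down sketch of this kind of experiment follows. It draws 200 matrices rather than $10^5$ to keep the run time short, and it symmetrizes each Gaussian matrix so that it can stand in for a Hessian; the symmetrization step and the seed are assumptions of this sketch, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 140, 200   # 200 samples here for speed

definite = 0
for _ in range(trials):
    A = rng.standard_normal((n, n))
    H = 0.5 * (A + A.T)             # symmetrize (an assumption of this sketch)
    eig = np.linalg.eigvalsh(H)     # eigenvalues in ascending order
    if eig[0] > 0 or eig[-1] < 0:   # all positive or all negative
        definite += 1

print(definite, "of", trials, "matrices are definite")
```

For a matrix of this size, all 140 eigenvalues would have to share a sign, which is extremely unlikely for random symmetric Gaussian matrices.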

Application of Quasi-Newton Methods in SciML

  1. L-BFGS applications: Airfoil geometry optimization while learning flow distributions
  2. Hybrid optimizers: Hybrid methods combining L-BFGS and ADAM
  3. BFGS family comparisons: Performance improvements of self-scaling BFGS variants
  4. Gradient conflict resolution: Quasi-Newton methods naturally resolve gradient conflicts between different terms in loss functions
  5. Preconditioning strategies: Novel quasi-Newton preconditioning methods

Comparison with Exact Newton's Method

All existing second-order methods in the literature are quasi-Newton methods; this paper is the first to systematically study the behavior of exact Newton's method in neural network training.

Conclusions and Discussion

Main Conclusions

  1. Failure of exact Newton's method: Exact Hessian information causes neural network training to reliably fail, converging to trivial saddle point solutions
  2. Success mechanism of quasi-Newton methods: Quasi-Newton methods succeed not because of Hessian approximation, but because of built-in ascent protection mechanisms
  3. Loss landscape characteristics: Saddle points dominate in high-dimensional neural network loss landscapes, questioning the conventional view that "local minima are abundant"
  4. Geometric insights: Nonlinear discretizations create embedded manifolds where stationary point conditions have clear geometric interpretations

Key Insights

True advantages of quasi-Newton methods:

  • BFGS/L-BFGS enforce curvature conditions, maintaining positive-definite Hessian approximations
  • Avoid the attraction to saddle points exhibited by exact Newton's method, whose steps move uphill along negative-curvature directions
  • Only utilize curvature information that aids minimization, ignoring negative curvature
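The curvature condition can be illustrated with a textbook BFGS update of a Hessian approximation $B$. In the sketch below the update is skipped whenever $s^T y \le 0$, which is one common safeguard (a line search satisfying the Wolfe conditions is another). The quadratic test Hessian and the three trial steps are hypothetical.

```python
import numpy as np

def bfgs_update(B, s, y, tol=1e-10):
    """BFGS update of the Hessian approximation B; skipped if s^T y <= tol."""
    sy = s @ y
    if sy <= tol:                   # non-positive curvature along s: ignore it
        return B
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / sy

# Hypothetical true Hessian with one negative-curvature direction.
H_true = np.diag([2.0, -2.0])
steps = [np.array([1.0, 0.5]), np.array([0.5, 1.0]), np.array([1.0, -0.5])]

B = np.eye(2)
skipped = 0
for s in steps:
    y = H_true @ s                  # gradient change for a quadratic objective
    B_new = bfgs_update(B, s, y)
    if B_new is B:
        skipped += 1
    B = B_new

min_eig = np.linalg.eigvalsh(B).min()
print(skipped, min_eig)             # one step skipped; B stays positive definite
```

Although the true Hessian is indefinite, the approximation remains positive definite, so the resulting quasi-Newton direction is always a descent direction.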

Limitations

  1. Simple examples: Numerical experiments are relatively simple; behavior on complex practical problems may differ
  2. Theoretical depth: Theoretical explanations for non-uniqueness of trivial solutions and specific convergence mechanisms warrant deeper investigation
  3. Practical applicability: Primarily theoretical insights with limited direct guidance for practical applications

Future Directions

  1. Loss landscape theory: Deeper understanding of the geometric structure of neural network loss landscapes
  2. Optimizer design: Novel second-order optimizers based on negative curvature handling
  3. Convergence analysis: Convergence theory for different optimizers on high-dimensional non-convex problems
  4. Practical applications: Verification of findings on more complex scientific computing problems

In-Depth Evaluation

Strengths

  1. Theoretical innovation: First systematic study of pathological behavior of exact Newton's method in neural network training, challenging conventional wisdom
  2. Geometric insights: Provides geometric interpretation of nonlinear discretizations and stationary points, enhancing understanding of loss landscapes
  3. Experimental sufficiency: Clear hierarchical experimental design from simple geometric examples to complex neural networks
  4. Practical value: Explains the true reasons for quasi-Newton method success, providing guidance for optimizer design

Weaknesses

  1. Experimental scale: Neural network experiments are relatively simple, lacking validation on large-scale practical applications
  2. Theoretical depth: Theoretical analysis of convergence mechanisms to trivial solutions could be more rigorous
  3. Solution proposals: Primarily identifies problems with limited exploration of improvement methods
  4. Generalizability: Universal applicability of conclusions requires broader verification

Impact

  1. Academic contribution: Provides new perspectives for optimization theory and neural network training
  2. Practical guidance: Explains design principles of second-order optimization methods
  3. Research inspiration: Initiates deeper investigation into the geometric structure of loss landscapes

Applicable Scenarios

  1. Scientific machine learning: Physics-informed neural networks and other scientific computing applications
  2. Optimizer research: Theoretical analysis and improvement of second-order optimization methods
  3. Educational research: Teaching cases for optimization theory and neural network geometry

References

The paper cites 30 relevant references, covering:

  • Classical optimization theory textbooks (Nocedal & Wright, Ruszczynski)
  • Neural network optimization methods (ADAM, BFGS family)
  • Physics-informed neural networks (Raissi et al., various PINNs applications)
  • Neural network theory (spectral bias, SIREN, Fourier features)
  • High-dimensional optimization theory (saddle point problems, Dauphin et al.)

Overall Assessment: This is an excellent paper whose counterintuitive findings challenge the conventional wisdom that exact Hessians are necessarily superior to their approximations, and its geometric analysis offers a new perspective on neural network optimization. Although the experiments are limited in scale, the theoretical contributions and the explanation of optimizer design principles have significant academic value.