2025-11-24T20:55:23.989588

Nonlinear discretizations and Newton's method: characterizing stationary points of regression objectives

Conor Rowan

Second-order methods are emerging as promising alternatives to standard first-order optimizers such as gradient descent and ADAM for training neural networks. Though the advantages of including curvature information in computing optimization steps have been celebrated in the scientific machine learning literature, the only second-order methods that have been studied are quasi-Newton, meaning that the Hessian matrix of the objective function is approximated. Though one would expect only to gain from using the true Hessian in place of its approximation, we show that neural network training reliably fails when relying on exact curvature information. The failure modes provide insight both into the geometry of nonlinear discretizations as well as the distribution of stationary points in the loss landscape, leading us to question the conventional wisdom that the loss landscape is replete with local minima.

Basic information

Paper ID: 2510.11987
Title: Nonlinear discretizations and Newton's method: characterizing stationary points of regression objectives
Author: Conor Rowan (University of Colorado Boulder)
Category: cs.LG (Machine Learning)
Published: October 13, 2025 (arXiv preprint)
Paper link: https://arxiv.org/abs/2510.11987

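Abstract

Second-order optimization methods are emerging as promising alternatives to first-order optimizers such as gradient descent and ADAM. Although the scientific machine learning literature has celebrated the benefits of using curvature information to compute optimization steps, the second-order methods studied so far are all quasi-Newton methods, which approximate the Hessian of the objective. One would expect only to gain from replacing the approximation with the true Hessian, yet this paper shows that neural network training reliably fails when it relies on exact curvature information. The failure modes offer insight into the geometry of nonlinear discretizations and the distribution of stationary points in the loss landscape, and call into question the conventional view that the loss landscape is replete with local minima.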

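Background and motivation

Problem background

First-order vs. second-order optimization: neural network training has traditionally relied on first-order methods such as ADAM, which update the parameters iteratively along the steepest-descent direction.
Theoretical advantages of second-order methods: second-order methods use a local quadratic approximation of the objective to determine both the direction and the size of each step, so they suggest a step size naturally and avoid oscillation in ill-conditioned regions.
Limitations of existing work: all second-order methods in the scientific machine learning (SciML) literature are quasi-Newton methods (e.g. BFGS, L-BFGS), which use an approximation of the Hessian rather than the exact Hessian.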

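Research motivation

The author questions a basic assumption: is the exact Hessian really better than an approximation? Through theoretical analysis and numerical experiments, the author finds that exact Newton's method behaves pathologically in neural network training, which offers a new perspective on the geometry of nonlinear discretizations and the structure of the loss landscape.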

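Core contributions

Geometric interpretation: discusses regression problems on manifolds and gives a geometric interpretation of stationary points
Conceptual framework: conceptualizes a neural network as an approximation manifold that constructs basis functions and coefficients simultaneously
Identification of the trivial solution: identifies a special stationary point of neural network regression objectives, the trivial zero solution
Numerical findings: shows experimentally that exact Newton's method reliably converges to the trivial solution, even on simple one-dimensional problems
Mechanistic explanation: analyzes the differences between quasi-Newton and exact Newton methods and explains why the former succeed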

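Method details

Task definition

Consider a discrete regression problem in which a target vector $v$ is approximated by a parameterized vector $N(\theta)$, where $\theta$ denotes the parameters to be determined. The standard quadratic error objective and its stationarity condition are:

$L(\theta) = \frac{1}{2}\|N(\theta) - v\|^2, \quad \frac{\partial L}{\partial \theta_k} = (N(\theta) - v) \cdot \frac{\partial N}{\partial \theta_k} = 0$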

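Geometric understanding of nonlinear discretizations

Linear vs. nonlinear discretizations

Linear discretization: the parameters scale fixed basis vectors, and the solution satisfies the Galerkin optimality condition, which guarantees a unique solution that is a minimum.

Nonlinear discretization: the parameterization defines an approximation manifold embedded in a higher-dimensional space, and the stationarity condition requires the error vector to be orthogonal to the tangent space of that manifold.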
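
As a minimal sketch of the linear case, the snippet below fits a target in three dimensions with two fixed basis vectors by solving the normal equations, then checks that the error is orthogonal to each basis vector (the Galerkin condition). The basis vectors, the target, and the function names are illustrative assumptions and do not come from the paper.

```python
# Linear discretization: N(theta) = theta[0] * b1 + theta[1] * b2 with fixed basis vectors.
# The basis vectors and target below are illustrative assumptions.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fit_linear(b1, b2, v):
    """Solve the 2x2 normal equations (B^T B) theta = B^T v by Cramer's rule."""
    a11, a12, a22 = dot(b1, b1), dot(b1, b2), dot(b2, b2)
    r1, r2 = dot(b1, v), dot(b2, v)
    det = a11 * a22 - a12 * a12  # nonzero when b1 and b2 are linearly independent
    return [(r1 * a22 - a12 * r2) / det, (a11 * r2 - a12 * r1) / det]

b1 = [1.0, 0.0, 1.0]
b2 = [0.0, 1.0, 1.0]
v = [2.0, 2.0, 1.0]

theta = fit_linear(b1, b2, v)
approx = [theta[0] * p + theta[1] * q for p, q in zip(b1, b2)]
error = [n - t for n, t in zip(approx, v)]

# Galerkin condition: the error is orthogonal to every basis vector.
print(theta, dot(error, b1), dot(error, b2))
```

Because the basis is fixed, the stationarity condition is a linear system with a single solution; in the nonlinear case the basis itself moves with the parameters, which is what allows several stationary points to coexist.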

Geometric examples

Unit circle example: $N(\theta) = \begin{bmatrix} \cos(\theta) \\ \sin(\theta) \end{bmatrix}, \quad v = \begin{bmatrix} 2 \\ 2 \end{bmatrix}$

Stationarity condition: $\frac{\partial L}{\partial \theta} = 2(\sin(\theta) - \cos(\theta)) = 0$

The solutions are $\theta = \pi/4$ and $\theta = 5\pi/4$; the former is a minimum and the latter a maximum.
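
The unit circle example is small enough to run by hand. The sketch below applies a plain Newton iteration to the stationarity condition, using the second derivative $2(\cos(\theta) + \sin(\theta))$ obtained by differentiating the condition once more. The two starting points and the iteration count are illustrative assumptions.

```python
import math

# Unit circle example: N(theta) = (cos(theta), sin(theta)), v = (2, 2),
# with L = 0.5 * ||N - v||^2 so that dL/dtheta = 2 * (sin(theta) - cos(theta)).

def grad(theta):
    return 2.0 * (math.sin(theta) - math.cos(theta))

def hess(theta):
    return 2.0 * (math.cos(theta) + math.sin(theta))

def newton(theta, steps=20):
    """Plain Newton iteration on the stationarity condition dL/dtheta = 0."""
    for _ in range(steps):
        theta -= grad(theta) / hess(theta)
    return theta

# The starting points are illustrative assumptions.
theta_a = newton(0.5)  # starts near pi/4
theta_b = newton(4.2)  # starts near 5*pi/4

for theta in (theta_a, theta_b):
    kind = "minimum" if hess(theta) > 0 else "maximum"
    print(f"theta = {theta:.6f}, curvature = {hess(theta):+.3f} -> {kind}")
```

Newton's method only looks for a zero of the gradient, so the second run settles on the maximum just as readily as the first settles on the minimum.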

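Elliptical torus example: $N(\theta) = \begin{bmatrix} (R + r\cos(\theta_2))\cos(\theta_1) \\ (R + r\cos(\theta_2))e\sin(\theta_1) \\ r\sin(\theta_2) \end{bmatrix}$

This example exhibits 8 stationary points (2 minima, 2 maxima, and 4 saddle points) and demonstrates that Newton's method has no preference among the different types of stationary point.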

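Neural network regression analysis

Interpreting the MLP structure

The MLP neural network is rewritten as: $N(x; \theta) = \sum_{k=1}^{|\theta^O|} \theta^O_k h_k(x; \theta^I)$

where $\theta = [\theta^I, \theta^O]$ is split into "inner" and "outer" parameters: the inner parameters define the basis functions and the outer parameters act as scaling coefficients.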
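
The inner/outer split can be sketched as two small functions: one evaluates the basis functions from the hidden layers, the other takes their weighted sum. The layer sizes, parameter values, function names, and the absence of an output bias are illustrative assumptions.

```python
import math

# A tiny tanh MLP split into inner parameters (hidden layers) and outer parameters
# (last-layer weights). Sizes and values are illustrative assumptions.

def basis(x, inner):
    """Evaluate the basis functions h_k(x; theta_I): the last hidden layer's activations."""
    h = [x]
    for weights, biases in inner:
        h = [math.tanh(sum(w * a for w, a in zip(row, h)) + b)
             for row, b in zip(weights, biases)]
    return h

def network(x, inner, outer):
    """N(x; theta) = sum_k theta_O[k] * h_k(x; theta_I)."""
    return sum(c * h for c, h in zip(outer, basis(x, inner)))

inner = [
    ([[1.0], [-2.0]], [0.0, 0.5]),             # layer 1: 1 input -> 2 neurons
    ([[0.5, -1.0], [1.5, 0.3]], [0.1, -0.2]),  # layer 2: 2 -> 2 neurons
]
outer = [0.7, -1.2]

x = 0.3
h = basis(x, inner)
print(h, network(x, inner, outer))
```

The output is linear in `outer` and nonlinear in `inner`, so the network behaves like a linear discretization whose basis is itself being trained.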

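Theoretical analysis of the trivial solution

When $N(x; \theta) = 0$, the stationarity condition becomes: $\frac{\partial L}{\partial \theta} = -\int_0^1 v(x) \frac{\partial N}{\partial \theta} dx = 0$

Two mechanisms satisfy it:

Fitting basis functions that are orthogonal to the target function, which zeroes the components $\int_0^1 v(x) h_k(x; \theta^I) dx$ associated with the outer parameters
Setting the outer parameters $\theta^O = 0$, which zeroes the network output and the components associated with the inner parameters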
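
The effect of zero outer parameters can be checked numerically. The sketch below uses a one-hidden-layer tanh network (a simplification of the MLP discussed here), a sine target, and a simple grid sum in place of the integral; all of these, along with the parameter values, are illustrative assumptions. With the outer parameters at zero, every inner-parameter gradient carries a factor of an outer parameter and vanishes, while each outer-parameter gradient is minus the projection of the target onto a basis function.

```python
import math

# One-hidden-layer sketch: N(x) = sum_k c[k] * tanh(w[k] * x + b[k]) on [0, 1].
# The target, grid sum, and parameter values are illustrative assumptions.

M = 100
xs = [i / (M - 1) for i in range(M)]
dx = 1.0 / (M - 1)

def target(x):
    return math.sin(2.0 * math.pi * x)

def gradients(w, b, c):
    """Gradient of L = 0.5 * sum_i (N(x_i) - v(x_i))^2 * dx w.r.t. (w, b, c)."""
    gw, gb, gc = [0.0] * len(w), [0.0] * len(w), [0.0] * len(w)
    for x in xs:
        h = [math.tanh(wk * x + bk) for wk, bk in zip(w, b)]
        err = sum(ck * hk for ck, hk in zip(c, h)) - target(x)
        for k, hk in enumerate(h):
            gc[k] += err * hk * dx                          # outer: the basis function itself
            gb[k] += err * c[k] * (1.0 - hk * hk) * dx      # inner: scaled by c[k]
            gw[k] += err * c[k] * (1.0 - hk * hk) * x * dx  # inner: scaled by c[k]
    return gw, gb, gc

# Zero outer parameters, generic inner parameters: the inner gradients vanish,
# but the outer gradients are generally nonzero.
gw, gb, gc = gradients(w=[1.0, -2.0], b=[0.3, 0.1], c=[0.0, 0.0])
print(gw, gb, gc)

# Constant basis functions (w = 0) are orthogonal to this zero-mean target,
# so the whole gradient vanishes: a trivial stationary point with N = 0.
gw0, gb0, gc0 = gradients(w=[0.0, 0.0], b=[0.3, 0.1], c=[0.0, 0.0])
print(gw0, gb0, gc0)
```

In this sketch the trivial stationary point appears once both ingredients are present: zero outer parameters and basis functions orthogonal to the target.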

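Experimental setup

Experimental configuration

Network architecture: MLP with two hidden layers of 10 neurons each
Activation function: hyperbolic tangent, or sine for SIREN networks
Parameter initialization: PyTorch's built-in Xavier initialization
Optimization algorithm: modified Newton's method (Levenberg-Marquardt algorithm)
Numerical integration: uniform grid of 100 equally spaced points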

Modified Newton's method

$\theta_{k+1} = \theta_k - \eta \left(\frac{\partial^2 L}{\partial \theta \partial \theta} + \epsilon I\right)^{-1} \left(\frac{\partial L}{\partial \theta}\right)$
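
A minimal sketch of one such update follows, reading $\eta$ as a step size and $\epsilon$ as a damping coefficient (the summary does not define them, so this reading is an assumption). The small Gaussian-elimination solver, the function names, and the saddle-shaped test objective are also illustrative assumptions, not the paper's implementation.

```python
# Damped Newton step: theta_new = theta - eta * (H + eps * I)^{-1} g.
# The linear solver and the test objective are illustrative assumptions.

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (small dense systems)."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def modified_newton_step(theta, grad, hess, eta=1.0, eps=0.0):
    n = len(theta)
    damped = [[hess[i][j] + (eps if i == j else 0.0) for j in range(n)] for i in range(n)]
    step = solve(damped, grad)
    return [t - eta * s for t, s in zip(theta, step)]

# Saddle objective L = 0.5 * (x^2 - y^2): gradient (x, -y), Hessian diag(1, -1).
theta = [1.0, 2.0]
saddle = modified_newton_step(theta, grad=[theta[0], -theta[1]],
                              hess=[[1.0, 0.0], [0.0, -1.0]])
print(saddle)  # undamped Newton lands on the saddle point at the origin
```

With `eps=0.0` and `eta=1.0` the step solves the local quadratic model exactly, which for this objective is the saddle point itself; a gradient-descent step from the same start would instead move away from the origin along the `y` direction.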