2025-11-19T10:07:13.697330

Nonlinearly Preconditioned Gradient Methods: Momentum and Stochastic Analysis

Oikonomidis, Quan, Patrinos

We study nonlinearly preconditioned gradient methods for smooth nonconvex optimization problems, focusing on sigmoid preconditioners that inherently perform a form of gradient clipping akin to the widely used gradient clipping technique. Building upon this idea, we introduce a novel heavy ball-type algorithm and provide convergence guarantees under a generalized smoothness condition that is less restrictive than traditional Lipschitz smoothness, thus covering a broader class of functions. Additionally, we develop a stochastic variant of the base method and study its convergence properties under different noise assumptions. We compare the proposed algorithms with baseline methods on diverse tasks from machine learning including neural network training.

academic

Basic Information

Paper ID: 2510.11312
Title: Nonlinearly Preconditioned Gradient Methods: Momentum and Stochastic Analysis
Authors: Konstantinos Oikonomidis, Jan Quan, Panagiotis Patrinos (KU Leuven)
Category: math.OC (Optimization and Control)
Venue: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
Paper link: https://arxiv.org/abs/2510.11312

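Abstract

This paper studies nonlinearly preconditioned gradient methods for smooth nonconvex optimization, focusing on sigmoid preconditioners that inherently perform a form of gradient clipping. Building on this idea, the authors introduce a novel heavy ball-type algorithm and provide convergence guarantees under a generalized smoothness condition that is less restrictive than classical Lipschitz smoothness and therefore covers a broader class of functions. They also develop a stochastic variant of the base method and study its convergence properties under different noise assumptions.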

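Research Background and Motivation

Problem addressed: Classical gradient descent (GD) and stochastic gradient descent (SGD) require careful tuning or expensive line-search strategies in modern machine learning applications whose objectives do not satisfy the global Lipschitz-gradient assumption.
Why it matters: Most cost functions in modern deep learning do not satisfy the classical Lipschitz-gradient assumption, and gradient clipping has become standard practice for stabilizing neural network training in tasks such as language modeling.
Limitations of existing methods:
- Standard GD/SGD struggle to converge on problems that go beyond Lipschitz smoothness
- Existing theoretical analyses of gradient clipping are mostly restricted to specific smoothness conditions
- Momentum methods lack an analysis in more general settings
Motivation: Unify gradient clipping methods within the nonlinear preconditioning framework and extend the theory to a more general analysis that covers momentum and stochastic variants.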

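Core Contributions

Extension of the anisotropic gradient descent method: heavy-ball momentum is added to the base iteration, and convergence guarantees are studied in the general nonconvex setting.
Stochastic extension: a stochastic version of the base method is analyzed under different noise assumptions, including conditions weaker than bounded variance.
Theoretical contributions:
- Convergence of the momentum algorithm is proved under the anisotropic descent inequality
- A linear convergence rate is proved under a generalized PL condition
- The stochastic method is analyzed under new noise assumptions
Experimental validation: the proposed methods perform well on a range of machine learning tasks, including neural network training and matrix factorization.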

Method Details

Task Definition

Consider the general minimization problem $\min_{x \in \mathbb{R}^n} f(x)$, where $f: \mathbb{R}^n \to \mathbb{R}$ is a smooth and possibly nonconvex function.

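Core Framework: Nonlinearly Preconditioned Gradient Methods

Base method: $x^{k+1} = x^k - \gamma \nabla \phi^*(\nabla f(x^k))$

Here $\phi: \mathbb{R}^n \to \mathbb{R}$ is a convex reference function, $\phi^*$ is its convex conjugate, and $\nabla \phi^*$ acts as the preconditioner, with step size $\gamma > 0$.

Key idea: when the reference function $\phi$ is strongly convex with a bounded domain, the map $\nabla \phi^*$ sends $\mathbb{R}^n$ into the unit $n$-ball, so every update has length below $\gamma$ and gradient clipping arises naturally.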
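The clipping effect can be illustrated with a minimal sketch. It assumes one particular reference function, $\phi(x) = -\sqrt{1 - \|x\|^2}$ on the unit ball, whose conjugate is $\phi^*(y) = \sqrt{1 + \|y\|^2}$ and hence $\nabla\phi^*(y) = y / \sqrt{1 + \|y\|^2}$. This choice, the function names, and the toy quartic objective are illustrative assumptions rather than the setup used in the paper.

```python
import math

def sigmoid_preconditioner(g):
    """Return g / sqrt(1 + ||g||^2), i.e. grad phi^* for the ASSUMED
    reference function phi(x) = -sqrt(1 - ||x||^2) on the unit ball."""
    norm_sq = sum(gi * gi for gi in g)
    scale = 1.0 / math.sqrt(1.0 + norm_sq)
    return [scale * gi for gi in g]

def npgm_step(x, grad_f, gamma):
    """One base iteration: x_next = x - gamma * grad phi^*(grad f(x))."""
    d = sigmoid_preconditioner(grad_f(x))
    return [xi - gamma * di for xi, di in zip(x, d)]

def grad_quartic(x):
    """Gradient of the toy objective f(x) = sum(x_i^4) / 4 (hypothetical
    example); this gradient is not globally Lipschitz."""
    return [xi ** 3 for xi in x]

x = [10.0, -20.0]
x_next = npgm_step(x, grad_quartic, gamma=0.5)

# The raw gradient has norm in the thousands, yet the step stays below gamma.
step_length = math.sqrt(sum((a - b) ** 2 for a, b in zip(x_next, x)))
```

For small gradients this preconditioner is close to the identity, so the step behaves like plain gradient descent; for large gradients its output approaches a unit vector, so the step resembles normalized (clipped) gradient descent.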

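Algorithm 1: Nonlinearly preconditioned gradient method with momentum (m-NPGM)

Input: choose x⁰ ∈ ℝⁿ and γ, β > 0 (with β < 1, so that step 1 is a convex combination); set m⁻¹ = 0 ∈ ℝⁿ
Repeat for k = 0, 1, ... until convergence:
1. Compute mᵏ = βmᵏ⁻¹ + (1-β)∇φ*(∇f(xᵏ))
2. Compute xᵏ⁺¹ = xᵏ - γmᵏ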
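The two steps of Algorithm 1 can be sketched in a few lines. The preconditioner $y \mapsto y / \sqrt{1 + \|y\|^2}$, the toy objective, the parameter values, and the fixed iteration count (standing in for a convergence test) are all assumptions made for illustration; the paper's experiments may use different choices.

```python
import math

def precondition(g):
    """ASSUMED preconditioner grad phi^*(g) = g / sqrt(1 + ||g||^2)."""
    scale = 1.0 / math.sqrt(1.0 + sum(gi * gi for gi in g))
    return [scale * gi for gi in g]

def m_npgm(grad_f, x0, gamma, beta, num_iters):
    """Run num_iters iterations of the momentum scheme and return the iterate."""
    x = list(x0)
    m = [0.0] * len(x)  # m^{-1} = 0
    for _ in range(num_iters):
        d = precondition(grad_f(x))
        # Step 1: momentum buffer as a convex combination of old buffer and new direction.
        m = [beta * mi + (1.0 - beta) * di for mi, di in zip(m, d)]
        # Step 2: move along the momentum buffer.
        x = [xi - gamma * mi for xi, mi in zip(x, m)]
    return x

def grad_quartic(x):
    """Gradient of the hypothetical toy objective f(x) = sum(x_i^4) / 4."""
    return [xi ** 3 for xi in x]

x_final = m_npgm(grad_quartic, x0=[3.0, -2.0], gamma=0.1, beta=0.9, num_iters=500)
```

Because each preconditioned direction has norm below 1 and the buffer starts at zero, every buffer is a convex combination of vectors in the unit ball, so each update has length below γ regardless of how large the raw gradient is.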

Equivalent form: $x^{k+1} = x^k - (1-\beta)\gamma\nabla\phi^*(\nabla f(x^k)) + \beta(x^k - x^{k-1})$
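The heavy-ball form follows by eliminating the momentum buffer. A short derivation, using only the update $x^{k+1} = x^k - \gamma m^k$ and the recursion $m^k = \beta m^{k-1} + (1-\beta)\nabla\phi^*(\nabla f(x^k))$, and adopting the convention $x^{-1} = x^0$ so that $m^{-1} = 0$:

$$
\begin{aligned}
m^{k-1} &= \tfrac{1}{\gamma}\,(x^{k-1} - x^{k}), \\
x^{k+1} &= x^k - \gamma\beta\, m^{k-1} - \gamma(1-\beta)\,\nabla\phi^*(\nabla f(x^k)) \\
        &= x^k + \beta\,(x^k - x^{k-1}) - (1-\beta)\gamma\,\nabla\phi^*(\nabla f(x^k)).
\end{aligned}
$$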

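Anisotropic Descent Inequality

Definition: A function $f$ satisfies the anisotropic descent property relative to $\phi$ with constant $L > 0$ if, for all $x, \bar{x} \in \mathbb{R}^n$: $f(x) \leq f(\bar{x}) + \frac{1}{L} \star \phi(x - \bar{y}) - \frac{1}{L} \star \phi(\bar{x} - \bar{y})$ where $\bar{y} = \bar{x} - \frac{1}{L}\nabla\phi^*(\nabla f(\bar{x}))$ . Here $\star$ is understood as epi-scaling, $(\lambda \star \phi)(x) = \lambda\,\phi(x/\lambda)$ for $\lambda > 0$.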
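As a sanity check, the inequality reduces to the classical descent lemma for the Euclidean reference function. The computation below assumes $\phi = \tfrac{1}{2}\|\cdot\|^2$ and reads $\star$ as epi-scaling, $(\lambda \star \phi)(x) = \lambda\,\phi(x/\lambda)$; both are stated here as assumptions for the worked example. In that case $\nabla\phi^*$ is the identity, $\bar{y} = \bar{x} - \tfrac{1}{L}\nabla f(\bar{x})$, and $(\tfrac{1}{L} \star \phi)(z) = \tfrac{L}{2}\|z\|^2$, so

$$
\begin{aligned}
f(x) &\leq f(\bar{x}) + \tfrac{L}{2}\left\|x - \bar{x} + \tfrac{1}{L}\nabla f(\bar{x})\right\|^2 - \tfrac{L}{2}\left\|\tfrac{1}{L}\nabla f(\bar{x})\right\|^2 \\
     &= f(\bar{x}) + \langle \nabla f(\bar{x}),\, x - \bar{x} \rangle + \tfrac{L}{2}\|x - \bar{x}\|^2,
\end{aligned}
$$

which is the usual upper bound implied by an $L$-Lipschitz gradient. Other reference functions replace the quadratic term with a different shape, which is how the condition can cover functions beyond Lipschitz smoothness.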