An Adaptive Algorithm for Bilevel Optimization on Riemannian Manifolds

Shi, Xiao, Jiang

Existing methods for solving Riemannian bilevel optimization (RBO) problems require prior knowledge of the problem's first- and second-order information and curvature parameter of the Riemannian manifold to determine step sizes, which poses practical limitations when these parameters are unknown or computationally infeasible to obtain. In this paper, we introduce the Adaptive Riemannian Hypergradient Descent (AdaRHD) algorithm for solving RBO problems. To our knowledge, AdaRHD is the first method to incorporate a fully adaptive step size strategy that eliminates the need for problem-specific parameters in RBO. We prove that AdaRHD achieves an $\mathcal{O}(1/\epsilon)$ iteration complexity for finding an $\epsilon$-stationary point, thus matching the complexity of existing non-adaptive methods. Furthermore, we demonstrate that substituting exponential mappings with retraction mappings maintains the same complexity bound. Experiments demonstrate that AdaRHD achieves comparable performance to existing non-adaptive approaches while exhibiting greater robustness.

Basic Information

Paper ID: 2504.06042
Title: An Adaptive Algorithm for Bilevel Optimization on Riemannian Manifolds
Authors: Xu Shi, Rufeng Xiao, Rujun Jiang (School of Data Science, Fudan University)
Classification: math.OC (Optimization and Control)
Conference: NeurIPS 2025 (39th Conference on Neural Information Processing Systems)
Paper Link: https://arxiv.org/abs/2504.06042

Abstract

Existing methods for solving Riemannian bilevel optimization (RBO) problems require prior knowledge of first-order and second-order information of the problem, as well as curvature parameters of the Riemannian manifold, to determine step sizes, which imposes practical limitations when these parameters are unknown or computationally infeasible to obtain. This paper proposes the Adaptive Riemannian Hypergradient Descent (AdaRHD) algorithm to solve RBO problems. To the best of our knowledge, AdaRHD is the first method to employ a fully adaptive step-size strategy in RBO, eliminating the need for problem-specific parameters. We prove that AdaRHD achieves O(1/ε) iteration complexity for finding an ε-stationary point, matching the complexity of existing non-adaptive methods. Furthermore, we demonstrate that replacing the exponential map with a retraction map maintains the same complexity bound. Experiments show that AdaRHD exhibits stronger robustness while achieving comparable performance to existing non-adaptive methods.

Research Background and Motivation

Problem Background

Bilevel optimization problems have broad applications in machine learning, including reinforcement learning, meta-learning, hyperparameter optimization, and adversarial learning. Riemannian bilevel optimization (RBO) extends bilevel optimization to Riemannian manifolds with the general form:

$\min_{x \in \mathcal{M}_x} F(x) := f(x, y^*(x))$ $\text{s.t. } y^*(x) = \arg\min_{y \in \mathcal{M}_y} g(x,y)$

where $\mathcal{M}_x, \mathcal{M}_y$ are Riemannian manifolds, $f,g$ are smooth functions, and $g(x,y)$ is geodesically strongly convex with respect to $y$.

Limitations of Existing Methods

Parameter Dependency: Existing RBO methods (such as RHGD, RieBO) require prior knowledge of strong convexity parameters, Lipschitz constants, and curvature parameters to determine step sizes
Practical Constraints: These parameters are often difficult to estimate or computationally expensive to calculate in practical applications
Insufficient Robustness: Fixed step-size strategies are sensitive to initialization and problem conditioning

Research Motivation

The core motivation of this paper is to design a fully adaptive RBO algorithm that can:

Operate without prior knowledge of problem-specific parameters
Automatically adjust step sizes to adapt to problem characteristics
Maintain theoretical complexity comparable to non-adaptive methods
Provide stronger practical robustness

Core Contributions

First Adaptive RBO Algorithm: Proposes AdaRHD, the first Riemannian bilevel optimization algorithm employing a fully adaptive step-size strategy, eliminating dependence on strong convexity, Lipschitz constants, and curvature parameters
Matching Theoretical Complexity: Proves that AdaRHD achieves O(1/ε) iteration complexity for finding ε-stationary points, matching the complexity of existing non-adaptive methods
Retraction Map Extension: Demonstrates that replacing the exponential map with a computationally more efficient retraction map maintains the same complexity guarantees
Experimental Validation: Verifies algorithm effectiveness and robustness on multiple RBO problems, including Riemannian meta-representation learning and robust optimization problems

Method Details

Task Definition

Consider the Riemannian bilevel optimization problem:

Upper-level problem: Minimize $F(x) = f(x, y^*(x))$ on manifold $\mathcal{M}_x$
Lower-level problem: For given $x$, solve $y^*(x) = \arg\min_y g(x,y)$ on manifold $\mathcal{M}_y$
Constraints: $g(x,y)$ is geodesically strongly convex with respect to $y$; $f$ is not required to be convex

Core Technique: Riemannian Hypergradient

The Riemannian hypergradient is defined as: $G_F(x) = G_x f(x, y^*(x)) - G^2_{xy}g(x, y^*(x))[H^{-1}_y g(x, y^*(x))[G_y f(x, y^*(x))]]$

Because $y^*(x)$ and the inverse Hessian $H^{-1}_y g$ are expensive to compute exactly, an approximate Riemannian hypergradient is used: $\hat{G}_F(x, \hat{y}, \hat{v}) = G_x f(x, \hat{y}) - G^2_{xy}g(x, \hat{y})[\hat{v}]$

where $\hat{y}$ is an approximate solution to the lower-level problem and $\hat{v}$ is an approximate solution to the linear system $H_y g(x, \hat{y})[v] = G_y f(x, \hat{y})$.
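As a sanity check on the hypergradient formula, the sketch below evaluates it in the flat Euclidean case, where Riemannian gradients and Hessians reduce to ordinary ones. The quadratic objectives, the matrices `A` and `B`, the vector `c`, and the weight `lam` are illustrative assumptions and are not taken from the paper. The result is compared against a central finite-difference gradient of $F$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_y = 3, 4

# Hypothetical toy problem (assumption, not from the paper):
#   g(x, y) = 0.5 * y^T A y - y^T B x          (strongly convex in y)
#   f(x, y) = 0.5 * ||y - c||^2 + 0.5 * lam * ||x||^2
M = rng.standard_normal((n_y, n_y))
A = M @ M.T + n_y * np.eye(n_y)   # symmetric positive definite Hessian of g in y
B = rng.standard_normal((n_y, n_x))
c = rng.standard_normal(n_y)
lam = 0.1

def lower_solution(x):
    """Exact lower-level minimizer y*(x) = A^{-1} B x."""
    return np.linalg.solve(A, B @ x)

def upper_objective(x):
    """F(x) = f(x, y*(x))."""
    y = lower_solution(x)
    return 0.5 * np.sum((y - c) ** 2) + 0.5 * lam * np.sum(x ** 2)

def hypergradient(x, y, v):
    """G_x f - G_xy g [v], where v solves H_y g [v] = G_y f."""
    grad_x_f = lam * x
    cross_xy_g = -B.T            # derivative of grad_x g = -B^T y with respect to y
    return grad_x_f - cross_xy_g @ v

x = rng.standard_normal(n_x)
y_star = lower_solution(x)
v_star = np.linalg.solve(A, y_star - c)   # H_y g = A, G_y f = y - c
hg = hypergradient(x, y_star, v_star)

# Central finite differences of F as an independent reference.
h = 1e-6
fd = np.array([
    (upper_objective(x + h * e) - upper_objective(x - h * e)) / (2 * h)
    for e in np.eye(n_x)
])
print(np.max(np.abs(hg - fd)))
```

Replacing `y_star` and `v_star` with inexact solutions gives the approximate hypergradient $\hat{G}_F$; its error then depends on how accurately the lower-level problem and the linear system are solved.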

AdaRHD Algorithm Architecture

Algorithm 1: Main Steps of AdaRHD

Lower-level Problem Solving: Using adaptive gradient descent
- Step size update: $b^2_{k+1} = b^2_k + \|G_y g(x_t, y^k_t)\|^2$
- Iteration update: $y^{k+1}_t = \text{Exp}_{y^k_t}(-\frac{1}{b_{k+1}} G_y g(x_t, y^k_t))$
Linear System Solving: Two strategies
- Gradient Descent: Adaptive step size similar to the lower-level problem
- Conjugate Gradient: Using tangent space conjugate gradient method
Upper-level Update: Adaptive hypergradient descent
- Step size update: $a^2_{t+1} = a^2_t + \|\hat{G}_F(x_t, y^{K_t}_t, v^{N_t}_t)\|^2$
- Iteration update: $x_{t+1} = \text{Exp}_{x_t}(-\frac{1}{a_{t+1}} \hat{G}_F(x_t, y^{K_t}_t, v^{N_t}_t))$
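The three steps above can be sketched in a few lines for the flat Euclidean case, where $\text{Exp}_x(v) = x + v$. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy quadratic problem, the initial values `a0` and `b0`, the inner stopping tolerance `tol`, and the choice to reset `b` at every outer iteration are all hypothetical.

```python
import numpy as np

# Hypothetical toy problem in flat space (assumption, not from the paper):
#   lower level: g(x, y) = 0.5 * y^T A y - y^T x      (strongly convex in y)
#   upper level: f(x, y) = 0.5 * ||y - c||^2 + 0.5 * lam * ||x||^2
A = np.diag([2.0, 5.0])
c = np.array([1.0, -1.0])
lam = 0.5

def adaptive_gd(grad, z0, b0=0.1, tol=1e-8, max_iter=10_000):
    """Gradient descent with step 1/b_{k+1}, where b_{k+1}^2 = b_k^2 + ||grad||^2."""
    z, b_sq = z0.copy(), b0 ** 2
    for _ in range(max_iter):
        g = grad(z)
        if np.linalg.norm(g) <= tol:   # assumed inner stopping rule
            break
        b_sq += g @ g
        z = z - g / np.sqrt(b_sq)
    return z

def ada_hypergradient_descent(x0, a0=0.1, eps=1e-6, max_outer=500):
    x, a_sq = x0.copy(), a0 ** 2
    hg = np.zeros_like(x0)
    for _ in range(max_outer):
        # 1) Lower level: approximately minimize g(x, .) over y.
        y = adaptive_gd(lambda y_: A @ y_ - x, np.zeros(2))
        # 2) Linear system H_y g [v] = G_y f, here A v = y - c, by adaptive GD.
        v = adaptive_gd(lambda v_: A @ v_ - (y - c), np.zeros(2))
        # 3) Approximate hypergradient G_x f - G_xy g [v]; G_xy g = -I for this toy g.
        hg = lam * x + v
        if hg @ hg <= eps:
            break
        a_sq += hg @ hg                # a_{t+1}^2 = a_t^2 + ||hypergradient||^2
        x = x - hg / np.sqrt(a_sq)     # flat-space stand-in for Exp_x(-hg / a_{t+1})
    return x, hg

x_final, hg_final = ada_hypergradient_descent(np.array([3.0, -3.0]))
print(x_final, hg_final @ hg_final)
```

No Lipschitz or strong-convexity constant appears anywhere in the sketch: when a step is too large, the gradient norm grows, which inflates `b` or `a` and shrinks the following steps. On a curved manifold the additions would be replaced by the exponential map or a retraction, and the gradients by their Riemannian counterparts.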

Technical Innovations

Cumulative Gradient Norm Strategy: Uses the reciprocal of the square root of the accumulated squared Riemannian gradient norms (the quantities $1/b_{k+1}$ and $1/a_{t+1}$ above) as the adaptive step size, requiring no prior knowledge of problem parameters
Three-level Adaptation: Applies adaptive step sizes to upper-level, lower-level, and linear system solving, forming a complete adaptive framework
Retraction Map Variant: Provides a version using retraction maps instead of exponential maps, reducing per-iteration computational cost
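To make the difference between the two maps concrete, the sketch below compares the exponential map and a projection-type retraction on the unit sphere. The sphere is chosen only because both maps have simple closed forms; it is an illustrative assumption and not one of the manifolds used in the paper's experiments. The function names are hypothetical.

```python
import numpy as np

def sphere_exp(x, v):
    """Exponential map on the unit sphere: follow the great circle from x along v."""
    nv = np.linalg.norm(v)
    if nv < 1e-16:
        return x
    return np.cos(nv) * x + np.sin(nv) * (v / nv)

def sphere_retract(x, v):
    """Projection retraction: step in the ambient space, then renormalize."""
    z = x + v
    return z / np.linalg.norm(z)

x = np.array([1.0, 0.0, 0.0])
v = np.array([0.0, 0.3, -0.4])   # tangent at x, since x @ v == 0

# Distance between the two maps as the step is scaled down.
scales = (1e-1, 1e-2, 1e-3)
gaps = [np.linalg.norm(sphere_exp(x, s * v) - sphere_retract(x, s * v)) for s in scales]
for s, gap in zip(scales, gaps):
    print(s, gap)
```

Both maps return points on the sphere, and the gap between them shrinks much faster than the step itself, which is the first-order agreement that defines a retraction. The retraction avoids the trigonometric evaluation here; on manifolds such as Stiefel or SPD matrices the saving is typically larger because the exponential map involves matrix exponentials.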
Theoretical Guarantees: Rigorous convergence analysis addressing technical challenges posed by Riemannian manifold geometry

Experimental Setup

Datasets and Problems

Simple Matrix Similarity Problems: Optimization on Stiefel and SPD manifolds
- Data scales: n=100 and n=1000
- Parameter settings: d=50, r=20, λ=0.01
Deep Meta-representation Learning: AFEW emotion recognition dataset
- 3-layer SPD network architecture
- 7 emotion categories, 1747 training samples
- Imbalanced class distribution
Robust Optimization Problems:
- Robust Karcher mean problem
- Robust maximum likelihood estimation problem

Comparison Methods

RHGD-20/50: Riemannian hypergradient descent with maximum iterations of 20/50 for lower-level problem
AdaRHD-GD: AdaRHD using gradient descent to solve linear systems
AdaRHD-CG: AdaRHD using conjugate gradient to solve linear systems
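The two AdaRHD variants differ only in how the linear system $H_y g[v] = G_y f$ is solved. The sketch below shows a standard conjugate gradient solver that touches the Hessian only through Hessian-vector products. It is written for the flat Euclidean case, where the tangent space is the whole space; the matrix `H`, the right-hand side `rhs`, and the function name are hypothetical stand-ins, and the paper's tangent-space version is not reproduced here.

```python
import numpy as np

def conjugate_gradient(hvp, rhs, tol=1e-10, max_iter=None):
    """Solve H v = rhs for symmetric positive definite H, given only u -> H u."""
    max_iter = 2 * rhs.size if max_iter is None else max_iter
    v = np.zeros_like(rhs)
    r = rhs.copy()            # residual rhs - H v for the initial guess v = 0
    p = r.copy()              # first search direction
    rs = r @ r
    n_hvp = 0                 # number of Hessian-vector products used
    for _ in range(max_iter):
        if np.sqrt(rs) <= tol:
            break
        Hp = hvp(p)
        n_hvp += 1
        alpha = rs / (p @ Hp)
        v = v + alpha * p
        r = r - alpha * Hp
        rs_new = r @ r
        p = r + (rs_new / rs) * p   # next direction, H-conjugate to the previous ones
        rs = rs_new
    return v, n_hvp

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
H = M @ M.T + 5.0 * np.eye(5)   # hypothetical SPD lower-level Hessian
rhs = rng.standard_normal(5)    # stands in for G_y f
v_hat, n_hvp = conjugate_gradient(lambda u: H @ u, rhs)
print(n_hvp, np.linalg.norm(H @ v_hat - rhs))
```

Conjugate gradient needs no step size at all, so it fits the parameter-free design, and it generally reaches a given residual with fewer Hessian-vector products than gradient descent on the same quadratic.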

Evaluation Metrics

Upper-level objective function value
Hypergradient estimation error
Validation accuracy
Convergence time and iteration count

Experimental Results

Main Results

Simple Problem Experiments:

AdaRHD exhibits faster convergence speed at both data scales
Lower hypergradient estimation error, particularly for AdaRHD-CG
Computational time advantages, especially on large-scale problems

Robustness Analysis:

AdaRHD demonstrates significant robustness under different initial step-size settings
RHGD fails with large step sizes (5, 1, 0.5), while AdaRHD converges stably
AdaRHD-CG achieves 85% validation accuracy fastest

Key Findings

Robustness Advantage: AdaRHD is insensitive to initial step-size selection, whereas RHGD completely fails with inappropriate step sizes
Efficiency Improvement: Although AdaRHD requires more outer iterations, the adaptive strategy results in competitive overall computation time
Method Selection: AdaRHD-CG outperforms AdaRHD-GD in accuracy and robustness, though the latter converges faster initially

Theoretical Analysis

Complexity Results

Theorem 3.1: Under standard assumptions, AdaRHD satisfies: $\frac{1}{T}\sum_{t=0}^{T-1} \|G_F(x_t)\|^2_{x_t} \leq \frac{C}{T} = O\left(\frac{1}{T}\right)$

Corollary 3.1: Complexity for finding an ε-stationary point:

Total iterations: T = O(1/ε)
Gradient complexity: $G_f = O(1/ε)$ , $G_g = O(1/ε^2)$
Hessian-vector product complexity: O(1/ε²) for AdaRHD-GD, Õ(1/ε) for AdaRHD-CG
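The step from Theorem 3.1 to the outer iteration count can be spelled out as follows. This assumes that an ε-stationary point is one with $\|G_F(x)\|^2_x \leq \epsilon$, which is the convention under which a $1/T$ rate yields $O(1/\epsilon)$; $C$ is the constant from the theorem.

```latex
\min_{0 \le t \le T-1} \|G_F(x_t)\|_{x_t}^2
\;\le\; \frac{1}{T}\sum_{t=0}^{T-1} \|G_F(x_t)\|_{x_t}^2
\;\le\; \frac{C}{T}
\;\le\; \epsilon
\quad \text{whenever} \quad T \ge \frac{C}{\epsilon}.
```

The gradient and Hessian-vector product counts then follow by multiplying the number of outer iterations by the inner iterations used per outer step.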

Technical Challenges

Geometric Structure: Curvature of Riemannian manifolds introduces additional analytical complexity
Triangle Distance Bounds: Requires using Riemannian manifold-specific triangle distance bounds rather than Euclidean counterparts
Adaptive Step-size Analysis: Adaptive strategies may cause divergence behavior initially, requiring rigorous theoretical treatment

Related Work

Bilevel Optimization

Euclidean bilevel optimization: AID, ITD, Neumann series, conjugate gradient methods
Recent adaptive methods: D-TFBO

Riemannian Optimization

Classical methods: Riemannian gradient descent, nonlinear conjugate gradient, variance-reduced stochastic gradient
Adaptive methods: RASA, RAMSGrad, Riemannian SAM

Riemannian Bilevel Optimization

RieBO/RieSBO: Deterministic and stochastic Riemannian bilevel optimization
RHGD: Riemannian hypergradient descent framework
RF2SA: Fully first-order stochastic method

Conclusions and Discussion

Main Conclusions

AdaRHD is the first fully adaptive Riemannian bilevel optimization algorithm, eliminating dependence on problem-specific parameters
Theoretically achieves O(1/ε) complexity matching non-adaptive methods
Experiments validate algorithm effectiveness and significant robustness advantages

Limitations

Complexity Gap: Gradient and Hessian-vector product complexity are higher by a factor of 1/ε compared to non-adaptive methods
Assumption Constraints: Still requires geodesic strong convexity of the lower-level problem
Single-loop vs. Double-loop: Currently only considers double-loop algorithms

Future Directions

Single-loop Algorithms: Design adaptive single-loop Riemannian bilevel optimization algorithms
Stochastic Settings: Extend to stochastic Riemannian bilevel optimization
Non-strongly Convex Lower Level: Handle lower-level objectives that are geodesically convex but not strongly convex
Complexity Optimization: Explore adaptive strategies to eliminate the 1/ε gap

In-depth Evaluation

Strengths

Theoretical Innovation: First to achieve full adaptivity in RBO with rigorous theoretical analysis
Practical Value: Significantly improves algorithm robustness and usability
Technical Depth: Successfully addresses technical challenges posed by Riemannian geometry
Comprehensive Experiments: Thorough validation across multiple application scenarios

Weaknesses

Complexity Cost: Adaptivity comes at the expense of additional computational complexity
Assumption Limitations: Still requires relatively strong assumptions
Application Scope: Primarily focused on specific Riemannian manifolds

Impact

Academic Contribution: Provides important progress at the intersection of Riemannian optimization and bilevel optimization
Practical Value: Offers more robust tools for Riemannian bilevel optimization in practical applications
Future Research: Establishes foundation for further adaptive Riemannian optimization research

Applicable Scenarios

Riemannian meta-learning and neural architecture search
Image segmentation and low-rank adaptation
Robust statistics and geometric machine learning
Any application requiring bilevel optimization under manifold constraints

This paper makes a significant contribution to Riemannian bilevel optimization: it achieves full adaptivity in algorithm design for the first time, matches the O(1/ε) iteration complexity of non-adaptive methods, and substantially improves practical robustness. Although its gradient and Hessian-vector product complexities are higher than those of non-adaptive methods, its theoretical innovations and practical value make it an important advance in the field.