Existing methods for solving Riemannian bilevel optimization (RBO) problems require prior knowledge of the problem's first- and second-order information and curvature parameter of the Riemannian manifold to determine step sizes, which poses practical limitations when these parameters are unknown or computationally infeasible to obtain. In this paper, we introduce the Adaptive Riemannian Hypergradient Descent (AdaRHD) algorithm for solving RBO problems. To our knowledge, AdaRHD is the first method to incorporate a fully adaptive step size strategy that eliminates the need for problem-specific parameters in RBO. We prove that AdaRHD achieves an $\mathcal{O}(1/ε)$ iteration complexity for finding an $ε$-stationary point, thus matching the complexity of existing non-adaptive methods. Furthermore, we demonstrate that substituting exponential mappings with retraction mappings maintains the same complexity bound. Experiments demonstrate that AdaRHD achieves comparable performance to existing non-adaptive approaches while exhibiting greater robustness.
- Paper ID: 2504.06042
- Title: An Adaptive Algorithm for Bilevel Optimization on Riemannian Manifolds
- Authors: Xu Shi, Rufeng Xiao, Rujun Jiang (School of Data Science, Fudan University)
- Classification: math.OC (Optimization and Control)
- Conference: NeurIPS 2025 (39th Conference on Neural Information Processing Systems)
- Paper Link: https://arxiv.org/abs/2504.06042
Existing methods for solving Riemannian bilevel optimization (RBO) problems require prior knowledge of first-order and second-order information of the problem, as well as curvature parameters of the Riemannian manifold, to determine step sizes. This imposes practical limitations when these parameters are unknown or computationally infeasible to obtain. This paper proposes the Adaptive Riemannian Hypergradient Descent (AdaRHD) algorithm to solve RBO problems. To the best of the authors' knowledge, AdaRHD is the first method to employ a fully adaptive step-size strategy in RBO, eliminating the need for problem-specific parameters. The authors prove that AdaRHD achieves O(1/ε) iteration complexity for finding an ε-stationary point, matching the complexity of existing non-adaptive methods. Furthermore, they show that replacing the exponential map with a retraction map maintains the same complexity bound. Experiments show that AdaRHD is more robust than existing non-adaptive methods while achieving comparable performance.
Bilevel optimization problems have broad applications in machine learning, including reinforcement learning, meta-learning, hyperparameter optimization, and adversarial learning. Riemannian bilevel optimization (RBO) extends bilevel optimization to Riemannian manifolds with the general form:
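$\min_{x \in \mathcal{M}_x} F(x) := f(x, y^*(x)) \quad \text{s.t.} \quad y^*(x) = \arg\min_{y \in \mathcal{M}_y} g(x, y)$
where $\mathcal{M}_x$ and $\mathcal{M}_y$ are Riemannian manifolds, $f$ and $g$ are smooth functions, and $g(x, y)$ is geodesically strongly convex with respect to $y$.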
- Parameter Dependency: Existing RBO methods (such as RHGD, RieBO) require prior knowledge of strong convexity parameters, Lipschitz constants, and curvature parameters to determine step sizes
- Practical Constraints: These parameters are often difficult to estimate or computationally expensive to calculate in practical applications
- Insufficient Robustness: Fixed step-size strategies are sensitive to initialization and problem conditioning
The core motivation of this paper is to design a fully adaptive RBO algorithm that can:
- Operate without prior knowledge of problem-specific parameters
- Automatically adjust step sizes to adapt to problem characteristics
- Maintain theoretical complexity comparable to non-adaptive methods
- Provide stronger practical robustness
- First Adaptive RBO Algorithm: Proposes AdaRHD, the first Riemannian bilevel optimization algorithm employing a fully adaptive step-size strategy, eliminating dependence on strong convexity, Lipschitz constants, and curvature parameters
- Matching Theoretical Complexity: Proves that AdaRHD achieves O(1/ε) iteration complexity for finding ε-stationary points, matching the complexity of existing non-adaptive methods
- Retraction Extension: Demonstrates that replacing the exponential map with a computationally cheaper retraction map preserves the same complexity guarantees
- Experimental Validation: Verifies algorithm effectiveness and robustness on multiple RBO problems, including Riemannian meta-representation learning and robust optimization problems
Consider the Riemannian bilevel optimization problem:
- Upper-level problem: Minimize $F(x) = f(x, y^*(x))$ on manifold $\mathcal{M}_x$
- Lower-level problem: For given $x$, solve $y^*(x) = \arg\min_{y} g(x, y)$ on manifold $\mathcal{M}_y$
- Constraints: g(x,y) is geodesically strongly convex with respect to y; f is not required to be convex
The Riemannian hypergradient is defined as:
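$\mathcal{G}F(x) = \mathcal{G}_x f(x, y^*(x)) - \mathcal{G}^2_{xy} g(x, y^*(x))\left[\mathcal{H}_y^{-1} g(x, y^*(x))\left[\mathcal{G}_y f(x, y^*(x))\right]\right]$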
Due to computational difficulty, an approximate Riemannian hypergradient is used:
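$\widehat{\mathcal{G}}F(x, \hat{y}, \hat{v}) = \mathcal{G}_x f(x, \hat{y}) - \mathcal{G}^2_{xy} g(x, \hat{y})[\hat{v}]$
where $\hat{y}$ is an approximate solution to the lower-level problem and $\hat{v}$ is an approximate solution to the linear system $\mathcal{H}_y g(x, \hat{y})[v] = \mathcal{G}_y f(x, \hat{y})$, which replaces the inverse-Hessian term in the exact hypergradient.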
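As a sanity check on the formula, the following minimal sketch evaluates the approximate hypergradient on a toy problem where both manifolds are the flat real line, so Riemannian gradients reduce to ordinary derivatives. The objectives, the constants `A`, `B`, `C`, `LAM`, and all function names are assumptions made for illustration; none of them come from the paper.

```python
# Toy instance on the flat manifold R (an assumption for illustration):
#   g(x, y) = 0.5 * C * (y - A * x)**2          (strongly convex in y, y*(x) = A * x)
#   f(x, y) = 0.5 * (y - B)**2 + 0.5 * LAM * x**2
A, B, C, LAM = 2.0, 1.0, 3.0, 0.1

def grad_x_f(x, y):
    return LAM * x

def grad_y_f(x, y):
    return y - B

def hess_yy_g(x, y):
    return C

def cross_xy_g(x, y):
    # Derivative of grad_y g = C * (y - A * x) with respect to x.
    return -C * A

def approx_hypergradient(x, y_hat, v_hat):
    """G_x f(x, y_hat) - G_xy^2 g(x, y_hat)[v_hat] in the scalar case."""
    return grad_x_f(x, y_hat) - cross_xy_g(x, y_hat) * v_hat

def exact_hypergradient(x):
    """Plug in the exact lower-level solution and the exact linear-system solution."""
    y_star = A * x
    v_star = grad_y_f(x, y_star) / hess_yy_g(x, y_star)
    return approx_hypergradient(x, y_star, v_star)

def true_derivative(x):
    """Derivative of F(x) = 0.5 * (A*x - B)**2 + 0.5 * LAM * x**2, computed by hand."""
    return A * (A * x - B) + LAM * x
```

With exact `y_hat` and `v_hat` the estimate coincides with the derivative of $F$; with an inexact `y_hat` the estimate is off by an amount that scales with the lower-level error, which is why the accuracy of the two inner solves matters.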
Algorithm 1: Main Steps of AdaRHD
- Lower-level Problem Solving: Using adaptive gradient descent
  - Step size update: $b_{k+1}^2 = b_k^2 + \|\mathcal{G}_y g(x_t, y_t^k)\|^2$
  - Iteration update: $y_t^{k+1} = \mathrm{Exp}_{y_t^k}\left(-\frac{1}{b_{k+1}} \mathcal{G}_y g(x_t, y_t^k)\right)$
- Linear System Solving: Two strategies
  - Gradient Descent: Adaptive step size similar to the lower-level problem
  - Conjugate Gradient: Using tangent space conjugate gradient method
- Upper-level Update: Adaptive hypergradient descent
  - Step size update: $a_{t+1}^2 = a_t^2 + \|\widehat{\mathcal{G}}F(x_t, y_t^{K_t}, v_t^{N_t})\|^2$
  - Iteration update: $x_{t+1} = \mathrm{Exp}_{x_t}\left(-\frac{1}{a_{t+1}} \widehat{\mathcal{G}}F(x_t, y_t^{K_t}, v_t^{N_t})\right)$
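The three steps above can be sketched on a small toy problem. This is an illustration under stated assumptions, not the authors' implementation: the upper-level variable lives on the unit circle, the lower-level variable lives in flat $\mathbb{R}^2$ (where the exponential map is plain addition), the objectives and the constants `D` and `P` are hypothetical, the inner loops stop on a gradient-norm tolerance, and the inner accumulators restart from `b0` and `c0` at every outer iteration.

```python
import math

D = (1.0, 4.0)  # hypothetical curvatures of the toy lower-level objective
P = (0.0, 2.0)  # hypothetical target point in the toy upper-level objective

def norm(v):
    return math.sqrt(sum(c * c for c in v))

def project(x, w):
    """Project w onto the tangent space of the unit circle at x."""
    dot = sum(a * b for a, b in zip(x, w))
    return tuple(wi - dot * xi for wi, xi in zip(w, x))

def exp_circle(x, v):
    """Exponential map on the unit circle for a tangent vector v at x."""
    n = norm(v)
    if n == 0.0:
        return x
    return tuple(math.cos(n) * xi + math.sin(n) * vi / n for xi, vi in zip(x, v))

def adaptive_descent(z, grad, acc0, tol, max_iter):
    """Flat-space descent with step 1 / sqrt(acc0^2 + sum of squared gradient norms)."""
    acc_sq = acc0 * acc0
    for _ in range(max_iter):
        g = grad(z)
        if norm(g) <= tol:
            break
        acc_sq += norm(g) ** 2
        step = 1.0 / math.sqrt(acc_sq)
        z = tuple(zi - step * gi for zi, gi in zip(z, g))
    return z

def adarhd_toy(x, y, outer_iters=40, a0=1.0, b0=1.0, c0=1.0, tol=1e-8):
    a_sq = a0 * a0
    for _ in range(outer_iters):
        # Lower level: g(x, y) = 0.5 * sum_i D_i * (y_i - x_i)^2, warm-started at y.
        y = adaptive_descent(
            y, lambda u: tuple(d * (ui - xi) for d, ui, xi in zip(D, u, x)),
            b0, tol, 10_000)
        # Linear system H v = grad_y f with H = diag(D) and f(x, y) = 0.5 * |y - P|^2.
        rhs = tuple(yi - pi for yi, pi in zip(y, P))
        v = adaptive_descent(
            (0.0, 0.0), lambda u: tuple(d * ui - ri for d, ui, ri in zip(D, u, rhs)),
            c0, tol, 10_000)
        # Approximate hypergradient: G_x f = 0 and G_xy^2 g[v] = -Proj_x(D v) here.
        hyper = project(x, tuple(d * vi for d, vi in zip(D, v)))
        # Upper level: accumulate the squared norm, then step along the geodesic.
        a_sq += norm(hyper) ** 2
        x = exp_circle(x, tuple(-h / math.sqrt(a_sq) for h in hyper))
    return x, y

x_final, y_final = adarhd_toy(x=(1.0, 0.0), y=(0.0, 0.0))
```

In this toy problem $y^*(x) = x$, so $F(x) = \frac{1}{2}\|x - P\|^2$ and the minimizer on the circle is $(0, 1)$. No Lipschitz or strong convexity constant is passed in anywhere; the accumulators alone determine the step sizes.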
- Cumulative Gradient Norm Strategy: Uses the reciprocal of the square root of the accumulated squared Riemannian gradient norms as the adaptive step size, requiring no prior knowledge of problem parameters
- Three-level Adaptation: Applies adaptive step sizes to upper-level, lower-level, and linear system solving, forming a complete adaptive framework
- Retraction Map Optimization: Provides a version using retraction maps instead of exponential maps, reducing per-iteration computational cost
- Theoretical Guarantees: Rigorous convergence analysis addressing technical challenges posed by Riemannian manifold geometry
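To make the difference between an exponential map and a retraction concrete, here is a minimal sketch on the unit sphere, chosen only because both maps have simple closed forms there. The function names are illustrative. The retraction shown is the standard projection retraction (move in the ambient space, then renormalize), which may differ from the retractions used in the paper's experiments.

```python
import math

def exp_sphere(x, v):
    """Exponential map on the unit sphere: follow the great circle from x along v."""
    n = math.sqrt(sum(c * c for c in v))
    if n == 0.0:
        return tuple(x)
    return tuple(math.cos(n) * xi + math.sin(n) * vi / n for xi, vi in zip(x, v))

def retract_sphere(x, v):
    """Projection retraction: step in the ambient space, then renormalize."""
    moved = [xi + vi for xi, vi in zip(x, v)]
    n = math.sqrt(sum(c * c for c in moved))
    return tuple(c / n for c in moved)

x = (1.0, 0.0, 0.0)
v = (0.0, 0.01, 0.0)  # tangent vector at x, since it is orthogonal to x

p_exp = exp_sphere(x, v)
p_ret = retract_sphere(x, v)
gap = math.sqrt(sum((a - b) ** 2 for a, b in zip(p_exp, p_ret)))
```

Both points lie on the sphere, and for a short step the two maps nearly coincide. On the sphere the savings are minor, but on manifolds such as the Stiefel manifold the exponential map requires a matrix exponential, whereas common retractions need only a QR or polar factorization.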
- Simple Matrix Similarity Problems: Optimization on Stiefel and SPD manifolds
  - Data scales: n=100 and n=1000
  - Parameter settings: d=50, r=20, λ=0.01
- Deep Meta-representation Learning: AFEW emotion recognition dataset
  - 3-layer SPD network architecture
  - 7 emotion categories, 1747 training samples
  - Imbalanced class distribution
- Robust Optimization Problems:
  - Robust Karcher mean problem
  - Robust maximum likelihood estimation problem
- RHGD-20/50: Riemannian hypergradient descent with maximum iterations of 20/50 for lower-level problem
- AdaRHD-GD: AdaRHD using gradient descent to solve linear systems
- AdaRHD-CG: AdaRHD using conjugate gradient to solve linear systems
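The conjugate gradient variant solves the linear system using only Hessian-vector products. Below is a minimal sketch of textbook conjugate gradient in flat coordinates, where `hvp` is an assumed callable returning the Hessian applied to a vector. In the Riemannian setting the vectors would be tangent vectors and the dot products would use the Riemannian metric; this sketch does not reproduce the paper's tangent-space implementation.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def conjugate_gradient(hvp, rhs, tol=1e-10, max_iter=50):
    """Solve H v = rhs for a symmetric positive definite operator given as hvp."""
    v = [0.0] * len(rhs)
    r = list(rhs)          # residual rhs - H v at the starting point v = 0
    p = list(r)            # first search direction
    rs = dot(r, r)
    for _ in range(max_iter):
        if math.sqrt(rs) <= tol:
            break
        hp = hvp(p)
        alpha = rs / dot(p, hp)
        v = [vi + alpha * pi for vi, pi in zip(v, p)]
        r = [ri - alpha * hi for ri, hi in zip(r, hp)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return v

# Illustrative 2x2 positive definite "Hessian" and right-hand side.
H = [[4.0, 1.0], [1.0, 3.0]]
v_hat = conjugate_gradient(lambda p: [dot(row, p) for row in H], [1.0, 2.0])
```

For this system the exact solution is $(1/11, 7/11)$. Conjugate gradient needs no step size at all, which fits the parameter-free goal, and its faster convergence is consistent with the lower Hessian-vector product complexity reported for AdaRHD-CG.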
- Upper-level objective function value
- Hypergradient estimation error
- Validation accuracy
- Convergence time and iteration count
Simple Problem Experiments:
- AdaRHD exhibits faster convergence speed at both data scales
- Lower hypergradient estimation error, particularly for AdaRHD-CG
- Computational time advantages, especially on large-scale problems
Robustness Analysis:
- AdaRHD demonstrates significant robustness under different initial step-size settings
- RHGD fails with large step sizes (5, 1, 0.5), while AdaRHD converges stably
- AdaRHD-CG achieves 85% validation accuracy fastest
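The step-size sensitivity can be illustrated on a one-dimensional quadratic. This is a toy Euclidean demonstration under assumed values (curvature 4, starting point 0.01, initial step size 1 for both methods); it does not reproduce the paper's experiments.

```python
import math

def fixed_step_descent(y, step, iters, curvature=4.0):
    """Gradient descent on 0.5 * curvature * y^2 with a constant step size."""
    for _ in range(iters):
        y -= step * curvature * y
    return y

def adaptive_descent(y, b0, iters, curvature=4.0):
    """Same objective, step size 1 / sqrt(b0^2 + sum of squared gradients)."""
    acc_sq = b0 * b0
    for _ in range(iters):
        grad = curvature * y
        acc_sq += grad * grad
        y -= grad / math.sqrt(acc_sq)
    return y

y_fixed = fixed_step_descent(0.01, step=1.0, iters=100)
y_adaptive = adaptive_descent(0.01, b0=1.0, iters=100)
```

With a fixed step of 1 the iterate is multiplied by $-3$ at every step and blows up. The adaptive rule starts from the same step size and can also grow at first, but the growing gradients inflate the accumulator until the step becomes small enough for the iterates to contract.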
- Robustness Advantage: AdaRHD is insensitive to initial step-size selection, whereas RHGD completely fails with inappropriate step sizes
- Efficiency Improvement: Although AdaRHD requires more outer iterations, the adaptive strategy results in competitive overall computation time
- Method Selection: AdaRHD-CG outperforms AdaRHD-GD in accuracy and robustness, though the latter converges faster initially
Theorem 3.1: Under standard assumptions, AdaRHD satisfies:
$\frac{1}{T}\sum_{t=0}^{T-1} \|\mathcal{G}F(x_t)\|_{x_t}^2 \le \frac{C}{T} = \mathcal{O}\left(\frac{1}{T}\right)$
Corollary 3.1: Complexity for finding ε-stationary point:
- Total iterations: T = O(1/ε)
- Gradient complexity: $G_f = \mathcal{O}(1/ε)$, $G_g = \mathcal{O}(1/ε^2)$
- Hessian-vector product complexity: O(1/ε²) for AdaRHD-GD, Õ(1/ε) for AdaRHD-CG
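The step from Theorem 3.1 to the iteration count can be written out as follows, assuming (as the $\mathcal{O}(1/ε)$ rate suggests) that an ε-stationary point is one whose squared hypergradient norm is at most ε, and writing $C$ for the constant in the theorem:

```latex
\min_{0 \le t \le T-1} \|\mathcal{G}F(x_t)\|_{x_t}^2
\;\le\; \frac{1}{T}\sum_{t=0}^{T-1} \|\mathcal{G}F(x_t)\|_{x_t}^2
\;\le\; \frac{C}{T}
\;\le\; \epsilon
\qquad \text{whenever} \qquad T \ge \frac{C}{\epsilon}.
```

The first inequality holds because a minimum never exceeds an average, so some iterate among the first $T = \lceil C/\epsilon \rceil$ is ε-stationary.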
- Geometric Structure: Curvature of Riemannian manifolds introduces additional analytical complexity
- Triangle Distance Bounds: Requires using Riemannian manifold-specific triangle distance bounds rather than Euclidean counterparts
- Adaptive Step-size Analysis: Adaptive step sizes can be too large in early iterations and cause the iterates to diverge temporarily, which requires rigorous theoretical treatment
- Euclidean bilevel optimization: AID, ITD, Neumann series, conjugate gradient methods
- Recent adaptive methods: D-TFBO
- Classical methods: Riemannian gradient descent, nonlinear conjugate gradient, variance-reduced stochastic gradient
- Adaptive methods: RASA, RAMSGrad, Riemannian SAM
- RieBO/RieSBO: Deterministic and stochastic Riemannian bilevel optimization
- RHGD: Riemannian hypergradient descent framework
- RF2SA: Fully first-order stochastic method
- AdaRHD is the first fully adaptive Riemannian bilevel optimization algorithm, eliminating dependence on problem-specific parameters
- Theoretically achieves O(1/ε) complexity matching non-adaptive methods
- Experiments validate algorithm effectiveness and significant robustness advantages
- Complexity Gap: Gradient and Hessian-vector product complexity are higher by a factor of 1/ε compared to non-adaptive methods
- Assumption Constraints: Still requires geodesic strong convexity of the lower-level problem
- Single-loop vs. Double-loop: Currently only considers double-loop algorithms
- Single-loop Algorithms: Design adaptive single-loop Riemannian bilevel optimization algorithms
- Stochastic Settings: Extend to stochastic Riemannian bilevel optimization
- General Convexity: Handle lower-level objectives that are geodesically convex but not strongly convex
- Complexity Optimization: Explore adaptive strategies to eliminate the 1/ε gap
- Theoretical Innovation: First to achieve full adaptivity in RBO with rigorous theoretical analysis
- Practical Value: Significantly improves algorithm robustness and usability
- Technical Depth: Successfully addresses technical challenges posed by Riemannian geometry
- Comprehensive Experiments: Thorough validation across multiple application scenarios
- Complexity Cost: Adaptivity comes at the expense of additional computational complexity
- Assumption Limitations: Still requires relatively strong assumptions
- Application Scope: Primarily focused on specific Riemannian manifolds
- Academic Contribution: Provides important progress at the intersection of Riemannian optimization and bilevel optimization
- Practical Value: Offers more robust tools for Riemannian bilevel optimization in practical applications
- Future Research: Establishes foundation for further adaptive Riemannian optimization research
- Riemannian meta-learning and neural architecture search
- Image segmentation and low-rank adaptation
- Robust statistics and geometric machine learning
- Any application requiring bilevel optimization under manifold constraints
This paper contributes the first fully adaptive algorithm for Riemannian bilevel optimization, matching the O(1/ε) iteration complexity of non-adaptive methods while substantially improving practical robustness. Although adaptivity raises the gradient and Hessian-vector product complexity, the combination of theoretical guarantees and practical value makes it an important advance in the field.