2025-11-11T21:07:14.953280

Sparse Polyak: an adaptive step size rule for high-dimensional M-estimation

Qiao, Maros
We propose and study Sparse Polyak, a variant of Polyak's adaptive step size, designed to solve high-dimensional statistical estimation problems where the problem dimension is allowed to grow much faster than the sample size. In such settings, the standard Polyak step size performs poorly, requiring an increasing number of iterations to achieve optimal statistical precision-even when, the problem remains well conditioned and/or the achievable precision itself does not degrade with problem size. We trace this limitation to a mismatch in how smoothness is measured: in high dimensions, it is no longer effective to estimate the Lipschitz smoothness constant. Instead, it is more appropriate to estimate the smoothness restricted to specific directions relevant to the problem (restricted Lipschitz smoothness constant). Sparse Polyak overcomes this issue by modifying the step size to estimate the restricted Lipschitz smoothness constant. We support our approach with both theoretical analysis and numerical experiments, demonstrating its improved performance.
academic

Sparse Polyak: an adaptive step size rule for high-dimensional M-estimation

Basic Information

  • Paper ID: 2509.09802
  • Title: Sparse Polyak: an adaptive step size rule for high-dimensional M-estimation
  • Authors: Tianqi Qiao (Texas A&M University), Marie Maros (Texas A&M University)
  • Classification: math.OC cs.LG stat.ML
  • Publication Time/Conference: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
  • Paper Link: https://arxiv.org/abs/2509.09802

Abstract

This paper proposes and investigates Sparse Polyak, a variant of the Polyak adaptive step size specifically designed for high-dimensional statistical estimation problems where problem dimensionality grows much faster than sample size. In this regime, standard Polyak step sizes perform poorly, requiring increasingly more iterations to achieve optimal statistical accuracy—even when the problem remains well-conditioned and/or the achievable accuracy itself does not degrade with problem scale. The paper attributes this limitation to a mismatch in how smoothness is measured: in high dimensions, estimating the global Lipschitz smoothness constant becomes ineffective. Instead, it is more appropriate to estimate smoothness restricted to specific directions relevant to the problem (restricted Lipschitz smoothness constant). Sparse Polyak overcomes this issue by modifying the step size to estimate the restricted Lipschitz smoothness constant.

Research Background and Motivation

Problem Definition

This paper studies high-dimensional statistical estimation problems:

min_{θ∈R^d: ||θ||_0≤s} f(θ) = (1/n) Σ_{i=1}^n ℓ(z_i, θ)

where dimension d grows rapidly relative to sample size n, i.e., d/n → ∞.

Core Challenges

  1. High-dimensional challenges: In high-dimensional settings, traditional Lipschitz smoothness constant estimation fails, leading to overly conservative step size selection
  2. Performance degradation: Standard Polyak step sizes show significant performance deterioration as problem dimensionality increases, even when problem conditioning remains unchanged
  3. Missing rate invariance: Existing methods fail to maintain convergence guarantees equivalent to fixed step size IHT

Research Motivation

  • Iterative Hard Thresholding (IHT) algorithms perform excellently in high-dimensional sparse recovery but require knowledge of the restricted Lipschitz smoothness (RSS) constant L̄
  • Existing adaptive step size methods lack theoretical guarantees and practical performance in high-dimensional settings
  • There is a need for a method that adaptively adjusts step sizes while maintaining rate invariance

Core Contributions

  1. First high-dimensional adaptive step size rule: Proposes the first adaptive step size rule that performs well in high-dimensional settings and maintains rate invariance
  2. Theoretical innovation: Identifies the fundamental problem of smoothness measurement in high dimensions and proposes estimating restricted Lipschitz smoothness constants rather than global constants
  3. Convergence guarantees: Establishes linear convergence rates comparable to known optimal fixed step sizes, achieving optimal statistical accuracy
  4. Broad applicability: Provides theoretical guarantees for multiple statistical models (logistic regression, linear regression, matrix regression, etc.)
  5. Support recovery: Provides support recovery guarantees under signal-to-noise ratio conditions

Method Details

Problem Formulation

Consider the constrained optimization problem:

min_{θ∈R^d: ||θ||_0≤s} f(θ) = (1/n) Σ_{i=1}^n ℓ(z_i, θ)

where:

  • θ is a d-dimensional parameter vector constrained to have at most s nonzero elements
  • f(θ) is the empirical risk function
  • The goal is to efficiently solve this in high-dimensional settings (d/n → ∞)

Core Algorithm: Sparse Polyak Step Size

Algorithm 1: IHT with Sparse Polyak Step-Size

Input: function f, target function value f̂, sparsity parameter s, number of iterations T
Initialize: θ_0 ∈ R^d, ||θ_0||_0 ≤ s
for t = 0 to T-1 do:
    Compute step size: γ_t = max{f(θ_t) - f̂, 0} / (5||HT_s(∇f(θ_t))||²)
    Update: θ_{t+1} = HT_s(θ_t - γ_t∇f(θ_t))
end for

Key Innovation Points

  1. Modified step size formula:
    • Traditional Polyak: γ_t = (f(θ_t) - f̂) / ||∇f(θ_t)||²
    • Sparse Polyak: γ_t = (f(θ_t) - f̂) / ||HT_s(∇f(θ_t))||²
  2. Hard thresholding operator: HT_s retains the s components with largest magnitude and sets others to zero
  3. Theoretical foundation:
    • Leverages restricted strong convexity (RSC) and restricted smoothness (RSS) conditions
    • L̄ = L + 3τs, μ̄ = μ - 3τs, κ̄ = L̄/μ̄

Technical Innovation Points

  1. Dimension-independent step size: By using ||HT_s(∇f(θ_t))||² instead of ||∇f(θ_t)||², avoids scaling related to dimension d
  2. Restricted direction estimation: Focuses on smoothness in sparse directions rather than the entire space
  3. Double-loop algorithm: Provides Algorithm 2 for handling unknown target function values

Theoretical Analysis

Main Theorems

Theorem 1 (Main Convergence Result): Under RSC and RSS assumptions, when s ≥ (240κ̄)²s*:

  • Linear convergence: ||θ_{t+1} - θ̂||² ≤ (1 - 1/(80κ̄))||θ_t - θ̂||²
  • Final accuracy: O(||HT_s(∇f(θ̂))||²/μ̄²)

Corollary 1 (Support Recovery): Under signal-to-noise ratio condition |θ̂|_min ≥ 7||HT_s(∇f(θ̂))||/μ̄, the algorithm can accurately recover the support set.

Statistical Model Guarantees

Sparse Logistic Regression (Corollary 2):

  • Convergence rate: (1 - 1/(80κ̄))
  • Statistical accuracy: O(s log d / n)
  • Probability guarantee: at least 1 - e^{-c_0 n} - 2/d

Matrix Regression (Corollary 3):

  • Applicable to low-rank matrix recovery
  • Similar convergence guarantees and statistical accuracy

Experimental Setup

Synthetic Data Experiments

  • Dimension settings: d ∈ {5000, 10000, 20000}
  • Sparsity: s* = 300
  • Sample size: n = ⌈α s log d⌉, α = 5
  • Design matrix: Time series structure with correlation parameter ω = 0.5
  • Noise settings: Linear regression σ² = 0.25, logistic regression generated according to model (4)

Real Data Experiments

  • Linear regression: Large-scale Wave Energy Farm dataset (120 samples, 149 features)
  • Logistic regression: Molecule Musk dataset (120 samples, 166 features)
  • Sparsity: s = 20

Comparison Methods

  • IHT with fixed step size γ = 2/(3L̄)
  • Classical Polyak step size
  • Grid search optimized fixed step size

Experimental Results

Main Results

  1. Rate invariance verification:
    • Sparse Polyak maintains consistent convergence behavior across different dimensions d
    • Classical Polyak shows significant performance degradation as dimension increases
    • When log(d)/n remains constant, iteration count remains essentially unchanged
  2. Convergence speed comparison:
    • Sparse Polyak typically converges faster than fixed step size
    • Advantage is more pronounced in logistic regression (due to local curvature adaptation)
    • Choice of target value f̂ directly affects achievable accuracy
  3. Real data performance:
    • Logistic regression task: Sparse Polyak > fixed step size > classical Polyak
    • Linear regression task: Optimal fixed step size slightly better, but Sparse Polyak still outperforms classical Polyak

Key Findings

  1. Dimension scaling issue: In high dimensions, ||∇f(θ_t)|| ≤ √(d/s)||HT_s(∇f(θ_t))||, with equality in the worst case
  2. Step size conservatism: Using full gradient norm leads to overly conservative step sizes that worsen with increasing d
  3. Adaptivity advantage: Adaptive step sizes can exploit local curvature information, performing better on problems with non-uniform curvature

Sparse Recovery Algorithms

  • IHT and variants: Blumensath & Davies (2009), Jain et al. (2014)
  • Other methods: Matching Pursuit, OMP, CoSaMP
  • Accelerated versions: Khanna & Kyrillidis (2018), Zhou et al. (2018)

Adaptive Step Size Methods

  • Polyak step size: Polyak (1969), recent revival Ren et al. (2022)
  • Local Lipschitz methods: Malitsky & Mishchenko (2020, 2024)
  • Constrained problems: Cheng & Li (2012), Devanathan & Boyd (2024)

High-dimensional Statistics

  • Theoretical foundations: Agarwal et al. (2012), Loh & Wainwright (2015)
  • Condition requirements: Development of RSC/RSS and RIP conditions

Conclusions and Discussion

Main Conclusions

  1. Method effectiveness: Sparse Polyak successfully addresses the challenge of adaptive step sizes in high-dimensional settings
  2. Theoretical completeness: Provides convergence guarantees comparable to optimal fixed step sizes
  3. Practical value: Demonstrates good performance across multiple statistical models
  4. Rate invariance: Achieves performance stability as problem scale grows

Limitations

  1. Constant factors: The factor 1/5 in the step size formula may be an artifact of analysis and may not be necessary in practice
  2. Initialization requirements: Some results require specific initialization conditions
  3. Condition restrictions: Still requires RSC/RSS conditions, though more relaxed than RIP conditions
  4. Parameter selection: Requires prior knowledge or estimation of target function value f̂

Future Directions

  1. Extension to other algorithms: Authors conjecture that BB methods and others may also require similar improvements
  2. Weaker conditions: Further relax theoretical assumptions
  3. Practical applications: Validate method effectiveness on more real-world problems
  4. Computational optimization: Reduce computational complexity and improve practicality

In-Depth Evaluation

Strengths

  1. Precise problem identification: Accurately identifies the core issue of adaptive step sizes in high dimensions
  2. Important theoretical contribution: First to provide theoretical guarantees for adaptive step sizes in high-dimensional sparse problems
  3. Simple and effective method: Simple modification but significant effectiveness
  4. Comprehensive experiments: Includes theoretical verification, synthetic data, and real data experiments
  5. Clear writing: Logical structure and complete technical details

Weaknesses

  1. Strong conditions: The requirement s ≥ (240κ̄)²s* may be overly restrictive in some applications
  2. Constant analysis: Some constants may not be tight enough, affecting practical performance
  3. Limited scope: Primarily targets sparse problems; extensibility to other structured problems is unclear
  4. Computational overhead: Each iteration requires hard thresholding operations, increasing computational cost

Impact

  1. Theoretical value: Makes important contributions to high-dimensional optimization theory
  2. Practical significance: Provides new tools for sparse problems in machine learning
  3. Inspirational: Provides insights for improving other adaptive methods in high-dimensional settings
  4. Reproducibility: Complete theoretical analysis and clear algorithm description facilitate implementation

Applicable Scenarios

  1. High-dimensional sparse regression: Feature selection, genomic data analysis
  2. Compressed sensing: Signal reconstruction, image processing
  3. Machine learning: Sparsification in deep learning, neural network pruning
  4. Statistical learning: High-dimensional statistical inference, variable selection

References

Key references include:

  1. Jain et al. (2014) - IHT theoretical foundations
  2. Agarwal et al. (2012) - RSC/RSS conditions
  3. Polyak (1969) - Original Polyak step size
  4. Loh & Wainwright (2015) - High-dimensional statistics theory
  5. Malitsky & Mishchenko (2020) - Modern adaptive methods

Overall Assessment: This is a high-quality theoretical paper that proposes an innovative solution to an important problem in high-dimensional optimization. The theoretical analysis is rigorous, experimental validation is comprehensive, and it makes significant contributions to the field. While there are some technical limitations, overall it represents important progress in this research area.