2025-11-11T21:07:14.953280

Sparse Polyak: an adaptive step size rule for high-dimensional M-estimation

Qiao, Maros

We propose and study Sparse Polyak, a variant of Polyak's adaptive step size, designed to solve high-dimensional statistical estimation problems where the problem dimension is allowed to grow much faster than the sample size. In such settings, the standard Polyak step size performs poorly, requiring an increasing number of iterations to achieve optimal statistical precision-even when, the problem remains well conditioned and/or the achievable precision itself does not degrade with problem size. We trace this limitation to a mismatch in how smoothness is measured: in high dimensions, it is no longer effective to estimate the Lipschitz smoothness constant. Instead, it is more appropriate to estimate the smoothness restricted to specific directions relevant to the problem (restricted Lipschitz smoothness constant). Sparse Polyak overcomes this issue by modifying the step size to estimate the restricted Lipschitz smoothness constant. We support our approach with both theoretical analysis and numerical experiments, demonstrating its improved performance.

academic

Sparse Polyak: an adaptive step size rule for high-dimensional M-estimation

Basic Information

Paper ID: 2509.09802
Title: Sparse Polyak: an adaptive step size rule for high-dimensional M-estimation
Authors: Tianqi Qiao (Texas A&M University), Marie Maros (Texas A&M University)
Classification: math.OC cs.LG stat.ML
Publication Time/Conference: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
Paper Link: https://arxiv.org/abs/2509.09802

Abstract

This paper proposes and investigates Sparse Polyak, a variant of the Polyak adaptive step size specifically designed for high-dimensional statistical estimation problems where problem dimensionality grows much faster than sample size. In this regime, standard Polyak step sizes perform poorly, requiring increasingly more iterations to achieve optimal statistical accuracy—even when the problem remains well-conditioned and/or the achievable accuracy itself does not degrade with problem scale. The paper attributes this limitation to a mismatch in how smoothness is measured: in high dimensions, estimating the global Lipschitz smoothness constant becomes ineffective. Instead, it is more appropriate to estimate smoothness restricted to specific directions relevant to the problem (restricted Lipschitz smoothness constant). Sparse Polyak overcomes this issue by modifying the step size to estimate the restricted Lipschitz smoothness constant.

Research Background and Motivation

Problem Definition

This paper studies high-dimensional statistical estimation problems:

min_{θ∈R^d: ||θ||_0≤s} f(θ) = (1/n) Σ_{i=1}^n ℓ(z_i, θ)

where dimension d grows rapidly relative to sample size n, i.e., d/n → ∞.

Core Challenges

High-dimensional challenges: In high-dimensional settings, traditional Lipschitz smoothness constant estimation fails, leading to overly conservative step size selection
Performance degradation: Standard Polyak step sizes show significant performance deterioration as problem dimensionality increases, even when problem conditioning remains unchanged
Missing rate invariance: Existing methods fail to maintain convergence guarantees equivalent to fixed step size IHT

Research Motivation

Iterative Hard Thresholding (IHT) algorithms perform excellently in high-dimensional sparse recovery but require knowledge of the restricted Lipschitz smoothness (RSS) constant L̄
Existing adaptive step size methods lack theoretical guarantees and practical performance in high-dimensional settings
There is a need for a method that adaptively adjusts step sizes while maintaining rate invariance

Core Contributions

First high-dimensional adaptive step size rule: Proposes the first adaptive step size rule that performs well in high-dimensional settings and maintains rate invariance
Theoretical innovation: Identifies the fundamental problem of smoothness measurement in high dimensions and proposes estimating restricted Lipschitz smoothness constants rather than global constants
Convergence guarantees: Establishes linear convergence rates comparable to known optimal fixed step sizes, achieving optimal statistical accuracy
Broad applicability: Provides theoretical guarantees for multiple statistical models (logistic regression, linear regression, matrix regression, etc.)
Support recovery: Provides support recovery guarantees under signal-to-noise ratio conditions

Method Details

Problem Formulation

Consider the constrained optimization problem:

min_{θ∈R^d: ||θ||_0≤s} f(θ) = (1/n) Σ_{i=1}^n ℓ(z_i, θ)

where:

θ is a d-dimensional parameter vector constrained to have at most s nonzero elements
f(θ) is the empirical risk function
The goal is to efficiently solve this in high-dimensional settings (d/n → ∞)

Core Algorithm: Sparse Polyak Step Size

Algorithm 1: IHT with Sparse Polyak Step-Size

Input: function f, target function value f̂, sparsity parameter s, number of iterations T
Initialize: θ_0 ∈ R^d, ||θ_0||_0 ≤ s
for t = 0 to T-1 do:
    Compute step size: γ_t = max{f(θ_t) - f̂, 0} / (5||HT_s(∇f(θ_t))||²)
    Update: θ_{t+1} = HT_s(θ_t - γ_t∇f(θ_t))
end for

Key Innovation Points

Modified step size formula:
- Traditional Polyak: γ_t = (f(θ_t) - f̂) / ||∇f(θ_t)||²
- Sparse Polyak: γ_t = (f(θ_t) - f̂) / ||HT_s(∇f(θ_t))||²
Hard thresholding operator: HT_s retains the s components with largest magnitude and sets others to zero
Theoretical foundation:
- Leverages restricted strong convexity (RSC) and restricted smoothness (RSS) conditions
- L̄ = L + 3τs, μ̄ = μ - 3τs, κ̄ = L̄/μ̄

Technical Innovation Points

Dimension-independent step size: By using ||HT_s(∇f(θ_t))||² instead of ||∇f(θ_t)||², avoids scaling related to dimension d
Restricted direction estimation: Focuses on smoothness in sparse directions rather than the entire space
Double-loop algorithm: Provides Algorithm 2 for handling unknown target function values

Theoretical Analysis

Main Theorems

Theorem 1 (Main Convergence Result): Under RSC and RSS assumptions, when s ≥ (240κ̄)²s*:

Linear convergence: ||θ_{t+1} - θ̂||² ≤ (1 - 1/(80κ̄))||θ_t - θ̂||²
Final accuracy: O(||HT_s(∇f(θ̂))||²/μ̄²)

Corollary 1 (Support Recovery): Under signal-to-noise ratio condition |θ̂|_min ≥ 7||HT_s(∇f(θ̂))||/μ̄, the algorithm can accurately recover the support set.

Statistical Model Guarantees

Sparse Logistic Regression (Corollary 2):

Convergence rate: (1 - 1/(80κ̄))
Statistical accuracy: O(s log d / n)
Probability guarantee: at least 1 - e^{-c_0 n} - 2/d

Matrix Regression (Corollary 3):

Applicable to low-rank matrix recovery
Similar convergence guarantees and statistical accuracy

Experimental Setup

Synthetic Data Experiments

Dimension settings: d ∈ {5000, 10000, 20000}
Sparsity: s* = 300
Sample size: n = ⌈α s log d⌉, α = 5
Design matrix: Time series structure with correlation parameter ω = 0.5
Noise settings: Linear regression σ² = 0.25, logistic regression generated according to model (4)

Real Data Experiments

Linear regression: Large-scale Wave Energy Farm dataset (120 samples, 149 features)
Logistic regression: Molecule Musk dataset (120 samples, 166 features)
Sparsity: s = 20

Comparison Methods

IHT with fixed step size γ = 2/(3L̄)
Classical Polyak step size
Grid search optimized fixed step size

Experimental Results

Main Results

Rate invariance verification:
- Sparse Polyak maintains consistent convergence behavior across different dimensions d
- Classical Polyak shows significant performance degradation as dimension increases
- When log(d)/n remains constant, iteration count remains essentially unchanged
Convergence speed comparison:
- Sparse Polyak typically converges faster than fixed step size
- Advantage is more pronounced in logistic regression (due to local curvature adaptation)
- Choice of target value f̂ directly affects achievable accuracy
Real data performance:
- Logistic regression task: Sparse Polyak > fixed step size > classical Polyak
- Linear regression task: Optimal fixed step size slightly better, but Sparse Polyak still outperforms classical Polyak

Key Findings

Dimension scaling issue: In high dimensions, ||∇f(θ_t)|| ≤ √(d/s)||HT_s(∇f(θ_t))||, with equality in the worst case
Step size conservatism: Using full gradient norm leads to overly conservative step sizes that worsen with increasing d
Adaptivity advantage: Adaptive step sizes can exploit local curvature information, performing better on problems with non-uniform curvature

Sparse Recovery Algorithms

IHT and variants: Blumensath & Davies (2009), Jain et al. (2014)
Other methods: Matching Pursuit, OMP, CoSaMP
Accelerated versions: Khanna & Kyrillidis (2018), Zhou et al. (2018)

Adaptive Step Size Methods

Polyak step size: Polyak (1969), recent revival Ren et al. (2022)
Local Lipschitz methods: Malitsky & Mishchenko (2020, 2024)
Constrained problems: Cheng & Li (2012), Devanathan & Boyd (2024)

High-dimensional Statistics

Theoretical foundations: Agarwal et al. (2012), Loh & Wainwright (2015)
Condition requirements: Development of RSC/RSS and RIP conditions

Conclusions and Discussion

Main Conclusions

Method effectiveness: Sparse Polyak successfully addresses the challenge of adaptive step sizes in high-dimensional settings
Theoretical completeness: Provides convergence guarantees comparable to optimal fixed step sizes
Practical value: Demonstrates good performance across multiple statistical models
Rate invariance: Achieves performance stability as problem scale grows

Limitations

Constant factors: The factor 1/5 in the step size formula may be an artifact of analysis and may not be necessary in practice
Initialization requirements: Some results require specific initialization conditions
Condition restrictions: Still requires RSC/RSS conditions, though more relaxed than RIP conditions
Parameter selection: Requires prior knowledge or estimation of target function value f̂

Future Directions

Extension to other algorithms: Authors conjecture that BB methods and others may also require similar improvements
Weaker conditions: Further relax theoretical assumptions
Practical applications: Validate method effectiveness on more real-world problems
Computational optimization: Reduce computational complexity and improve practicality

In-Depth Evaluation

Strengths

Precise problem identification: Accurately identifies the core issue of adaptive step sizes in high dimensions
Important theoretical contribution: First to provide theoretical guarantees for adaptive step sizes in high-dimensional sparse problems
Simple and effective method: Simple modification but significant effectiveness
Comprehensive experiments: Includes theoretical verification, synthetic data, and real data experiments
Clear writing: Logical structure and complete technical details

Weaknesses

Strong conditions: The requirement s ≥ (240κ̄)²s* may be overly restrictive in some applications
Constant analysis: Some constants may not be tight enough, affecting practical performance
Limited scope: Primarily targets sparse problems; extensibility to other structured problems is unclear
Computational overhead: Each iteration requires hard thresholding operations, increasing computational cost

Impact

Theoretical value: Makes important contributions to high-dimensional optimization theory
Practical significance: Provides new tools for sparse problems in machine learning
Inspirational: Provides insights for improving other adaptive methods in high-dimensional settings
Reproducibility: Complete theoretical analysis and clear algorithm description facilitate implementation

Applicable Scenarios

High-dimensional sparse regression: Feature selection, genomic data analysis
Compressed sensing: Signal reconstruction, image processing
Machine learning: Sparsification in deep learning, neural network pruning
Statistical learning: High-dimensional statistical inference, variable selection

References

Key references include:

Jain et al. (2014) - IHT theoretical foundations
Agarwal et al. (2012) - RSC/RSS conditions
Polyak (1969) - Original Polyak step size
Loh & Wainwright (2015) - High-dimensional statistics theory
Malitsky & Mishchenko (2020) - Modern adaptive methods

Overall Assessment: This is a high-quality theoretical paper that proposes an innovative solution to an important problem in high-dimensional optimization. The theoretical analysis is rigorous, experimental validation is comprehensive, and it makes significant contributions to the field. While there are some technical limitations, overall it represents important progress in this research area.