2025-11-14T20:13:11.521057

Minimax Optimal Kernel Two-Sample Tests with Random Features

Mukherjee, Sriperumbudur
Reproducing Kernel Hilbert Space (RKHS) embedding of probability distributions has proved to be an effective approach, via MMD (maximum mean discrepancy), for nonparametric hypothesis testing problems involving distributions defined over general (non-Euclidean) domains. While a substantial amount of work has been done on this topic, only recently have minimax optimal two-sample tests been constructed that incorporate, unlike MMD, both the mean element and a regularized version of the covariance operator. However, as with most kernel algorithms, the optimal test scales cubically in the sample size, limiting its applicability. In this paper, we propose a spectral-regularized two-sample test based on random Fourier feature (RFF) approximation and investigate the trade-offs between statistical optimality and computational efficiency. We show the proposed test to be minimax optimal if the approximation order of RFF (which depends on the smoothness of the likelihood ratio and the decay rate of the eigenvalues of the integral operator) is sufficiently large. We develop a practically implementable permutation-based version of the proposed test with a data-adaptive strategy for selecting the regularization parameter. Finally, through numerical experiments on simulated and benchmark datasets, we demonstrate that the proposed RFF-based test is computationally efficient and performs almost similarly (with a small drop in power) to the exact test.
academic

Minimax Optimal Kernel Two-Sample Tests with Random Features

Basic Information

  • Paper ID: 2502.20755
  • Title: Minimax Optimal Kernel Two-Sample Tests with Random Features
  • Authors: Soumya Mukherjee, Bharath K. Sriperumbudur (Pennsylvania State University)
  • Classification: math.ST cs.LG stat.ML stat.TH
  • Publication Date: October 17, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2502.20755

Abstract

This paper proposes a spectrally regularized two-sample test based on random Fourier features (RFF) approximation for the two-sample testing problem based on reproducing kernel Hilbert space (RKHS) embeddings. The method maintains statistical optimality while significantly reducing computational complexity from cubic to near-linear, and provides a fully data-driven practical implementation.

Research Background and Motivation

Core Problem

Two-sample testing is a fundamental problem in statistics, aiming to determine whether two random samples originate from equal probability distributions. Traditional parametric and nonparametric testing methods face significant limitations when dealing with high-dimensional data or distributions on non-Euclidean domains.

Limitations of Existing Methods

  1. Suboptimality of MMD tests: Although the maximum mean discrepancy (MMD) test is widely applied, it lacks minimax optimality and only considers mean embeddings while ignoring covariance operator information
  2. Computational bottleneck of spectral regularization methods: Recently proposed spectrally regularized MMD tests achieve minimax optimality but have O(n³) computational complexity, limiting their application to large-scale data
  3. Difficulty in parameter selection: The choice of regularization parameters depends on unknown distributional properties, lacking data-driven adaptive strategies

Research Motivation

This paper aims to significantly improve the computational efficiency of spectrally regularized two-sample tests through random Fourier features approximation while maintaining statistical optimality, and develop practical adaptive versions.

Core Contributions

  1. Computationally efficient and statistically optimal test: Proposes an RFF-based spectrally regularized two-sample test that reduces computational complexity from O(n³) to O(nl² + nld) while maintaining minimax optimality under sufficient conditions
  2. Theoretical guarantees: Establishes theoretical connections between RFF approximation order and statistical optimality, proving minimax optimality of the test when the number of features l satisfies specific conditions
  3. Practical adaptive version: Develops a fully data-driven version based on permutation testing, including adaptive selection strategies for regularization parameters and kernel functions
  4. Comprehensive experimental validation: Validates the method's effectiveness through synthetic and benchmark datasets, demonstrating good trade-offs between computational efficiency and statistical performance

Method Details

Task Definition

Given independent samples X₁:N and Y₁:M from distributions P and Q, test the hypothesis H₀: P = Q vs H₁: P ≠ Q.

Core Method Architecture

1. Spectral Regularization Framework

For kernel function K, define the spectrally regularized discrepancy:

ηλ(P,Q) = 4⟨Tgλ(T)u, u⟩_{L²(R)}

where T is the integral operator, u = dP/dR - 1 is the likelihood ratio deviation, and gλ is the regularization function.

2. Random Fourier Features Approximation

For kernels of the form K(x,y) = ∫φ(x,θ)φ(y,θ)dΞ(θ), construct the approximate kernel:

Kₗ(x,y) = (1/l)∑ᵢ₌₁ˡ φ(x,θᵢ)φ(y,θᵢ)

3. Approximate Test Statistic

Based on the approximate kernel Kₗ, construct the test statistic:

η̂λ,l = (1/[n(n-1)m(m-1)]) ∑ᵢ≠ⱼ ∑ᵢ'≠ⱼ' t(Xᵢ,Xⱼ,Yᵢ',Yⱼ')

where the function t involves the square root of the regularized covariance operator.

Technical Innovations

1. Theoretical Innovations

  • Minimax optimality conditions: Establishes precise relationships between RFF feature count l and statistical optimality
  • Polynomial and exponential decay cases: Separately analyzes different decay patterns of integral operator eigenvalues
  • Non-asymptotic analysis: Provides finite-sample performance guarantees

2. Algorithmic Innovations

  • Adaptive regularization: Implements data-driven selection of regularization parameters through joint testing
  • Kernel function adaptation: Extends to adaptive selection of multiple kernel functions
  • Permutation test implementation: Provides fully data-driven critical value computation

3. Computational Innovations

  • Efficient algorithms: Detailed computational complexity analysis and optimized implementation
  • Parallelization: Natural parallelization properties of permutation testing

Experimental Setup

Datasets

  1. Synthetic Data:
    • Gaussian mean shift: P = N(0,I), Q = N(μ,I)
    • Gaussian scale shift: P = N(0,I), Q = N(0,σ²I)
    • Cauchy median shift: Cauchy distributions with different medians
  2. Real Data:
    • MNIST dataset: 7×7 pixel handwritten digit images, dimension d=49
    • Distribution difference detection across different digit subsets

Evaluation Metrics

  • Statistical Power: Probability of correctly rejecting the null hypothesis under the alternative
  • Computational Time: Algorithm runtime comparison
  • Type I Error: Controlled at α=0.05 level

Comparison Methods

  • Exact adaptive test: Spectral regularization test based on complete kernel matrix
  • Different RFF feature counts: Performance comparison for l ∈ {1,3,5,7,9,15,20}

Implementation Details

  • Regularization function: Showalter regularizer gλ(x) = (1-e^(-x/λ))/x
  • Kernel functions: Gaussian and Laplacian kernels with adaptive bandwidth selection
  • Number of permutations: B=550-800 for RFF version, B'=250-450 for exact version

Experimental Results

Main Results

1. Statistical Performance

  • Gaussian mean shift: Power approaches exact method when l≥7
  • Gaussian scale shift: Good performance achieved when l≥5
  • Cauchy distribution: Requires more features (l≥10-15) to handle heavy-tailed distributions
  • MNIST data: Good performance on complex real data when l≥15

2. Computational Efficiency

Significant computational time savings:

  • Gaussian experiments: RFF method requires only 33-44% of computation time
  • Scale shift: 30-40% time consumption
  • Cauchy experiments: Only 5-6% of computation time
  • MNIST: 5-15% of computation time

3. Theoretical Verification

Experimental results verify theoretical predictions:

  • Feature count requirements correlate with data dimensionality and distribution complexity
  • Computational-statistical trade-off aligns with theoretical analysis

Ablation Studies

Comparisons across different RFF feature counts verify:

  1. Existence of feature count thresholds
  2. Power loss from insufficient features
  3. Optimal trade-offs with appropriate feature counts

Experimental Findings

  1. Dimensionality effect: High-dimensional data requires more RFF features
  2. Distribution type impact: Heavy-tailed distributions require more features for performance guarantees
  3. Practical thresholds: Theoretically required feature counts can be moderately reduced in practice

Kernel Two-Sample Testing

  • MMD tests: Foundational work by Gretton et al. (2006, 2012)
  • Minimax optimality: Recent advances by Li & Yuan (2024), Schrab et al. (2023)
  • Spectral regularization: Theoretical breakthrough by Hagrass et al. (2024)

Random Features Approximation

  • RFF theory: Foundational framework by Rahimi & Recht (2007)
  • MMD acceleration: Applications by Zhao & Meng (2015), Choi & Kim (2024)
  • Computational-statistical trade-off: Theoretical analysis by Sriperumbudur & Sterge (2022)

Advantages over Existing Work

Main advantages compared to existing work:

  1. More general kernel functions: Not limited to translation-invariant kernels
  2. Broader applicability: Supports non-Euclidean domains
  3. Stronger theoretical guarantees: Non-asymptotic minimax optimality
  4. More practical algorithms: Fully data-driven implementation

Conclusions and Discussion

Main Conclusions

  1. Theoretical contribution: First to establish minimax optimality theory for spectrally regularized two-sample tests under RFF approximation
  2. Algorithmic contribution: Provides computationally efficient and statistically optimal practical algorithms
  3. Empirical validation: Validates method effectiveness across multiple data types

Limitations

  1. Feature count selection: While theoretical guidance is provided, practical selection still requires empirical adjustment
  2. Kernel function dependence: Performance still depends on kernel function choice
  3. Heavy-tailed distributions: May require large feature counts for extreme heavy-tailed distributions

Future Directions

  1. Alternative approximation methods: Explore Nyström and other approximation techniques
  2. Cheap permutation testing: Combine with cheap permutation testing to further reduce computational cost
  3. Automatic hyperparameter tuning: Develop more intelligent hyperparameter selection strategies

In-Depth Evaluation

Strengths

  1. Theoretical rigor: Provides complete non-asymptotic theoretical analysis including sufficient conditions and minimax optimality proofs
  2. Strong practicality: Algorithms are fully data-driven, requiring no prior knowledge
  3. Comprehensive experiments: Covers synthetic and real data with multiple distribution types
  4. Clear presentation: Technical details are thorough with complete proofs

Weaknesses

  1. Complexity analysis: While asymptotic complexity is reduced, constant factors may be large
  2. Parameter sensitivity: Selection of regularization parameters and feature counts remains somewhat sensitive
  3. Heavy-tailed handling: Performance on heavy-tailed distributions could be improved

Impact

  1. Theoretical value: Provides new theoretical framework for computational-statistical trade-offs in kernel methods
  2. Practical value: Has important application prospects in large-scale two-sample testing
  3. Methodological contribution: RFF approximation application in statistical testing provides new insights for related research

Applicable Scenarios

  1. Large-scale data: Computational advantages are significant when sample sizes are large
  2. High-dimensional data: Shows substantial advantages over traditional methods in high-dimensional settings
  3. Real-time applications: Computational efficiency improvements enable real-time testing
  4. Nonparametric settings: Applicable to general cases where distributional form is unknown

References

  1. Gretton, A., et al. (2012). A kernel two-sample test. JMLR.
  2. Hagrass, O., et al. (2024). Spectral regularized kernel two-sample tests. Annals of Statistics.
  3. Rahimi, A., & Recht, B. (2007). Random features for large-scale kernel machines. NIPS.
  4. Choi, I., & Kim, I. (2024). Computational-statistical trade-off in kernel two-sample testing with random Fourier features.
  5. Sriperumbudur, B. K., & Sterge, N. (2022). Approximate kernel PCA: Computational versus statistical trade-off. Annals of Statistics.

Overall Assessment: This is a high-quality theoretical statistics paper that successfully applies random features approximation techniques to spectrally regularized two-sample testing, significantly improving computational efficiency while maintaining statistical optimality. The paper's theoretical analysis is thorough and detailed, experimental validation is comprehensive, and it has important value for both statistical learning theory and practical applications.