2025-11-14T20:13:11.521057

Minimax Optimal Kernel Two-Sample Tests with Random Features

Mukherjee, Sriperumbudur

Reproducing Kernel Hilbert Space (RKHS) embedding of probability distributions has proved to be an effective approach, via MMD (maximum mean discrepancy), for nonparametric hypothesis testing problems involving distributions defined over general (non-Euclidean) domains. While a substantial amount of work has been done on this topic, only recently have minimax optimal two-sample tests been constructed that incorporate, unlike MMD, both the mean element and a regularized version of the covariance operator. However, as with most kernel algorithms, the optimal test scales cubically in the sample size, limiting its applicability. In this paper, we propose a spectral-regularized two-sample test based on random Fourier feature (RFF) approximation and investigate the trade-offs between statistical optimality and computational efficiency. We show the proposed test to be minimax optimal if the approximation order of RFF (which depends on the smoothness of the likelihood ratio and the decay rate of the eigenvalues of the integral operator) is sufficiently large. We develop a practically implementable permutation-based version of the proposed test with a data-adaptive strategy for selecting the regularization parameter. Finally, through numerical experiments on simulated and benchmark datasets, we demonstrate that the proposed RFF-based test is computationally efficient and performs almost similarly (with a small drop in power) to the exact test.

academic

Minimax Optimal Kernel Two-Sample Tests with Random Features

Basic Information

Paper ID: 2502.20755
Title: Minimax Optimal Kernel Two-Sample Tests with Random Features
Authors: Soumya Mukherjee, Bharath K. Sriperumbudur (Pennsylvania State University)
Classification: math.ST cs.LG stat.ML stat.TH
Publication Date: October 17, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2502.20755

Abstract

This paper proposes a spectrally regularized two-sample test based on random Fourier features (RFF) approximation for the two-sample testing problem based on reproducing kernel Hilbert space (RKHS) embeddings. The method maintains statistical optimality while significantly reducing computational complexity from cubic to near-linear, and provides a fully data-driven practical implementation.

Research Background and Motivation

Core Problem

Two-sample testing is a fundamental problem in statistics, aiming to determine whether two random samples originate from equal probability distributions. Traditional parametric and nonparametric testing methods face significant limitations when dealing with high-dimensional data or distributions on non-Euclidean domains.

Limitations of Existing Methods

Suboptimality of MMD tests: Although the maximum mean discrepancy (MMD) test is widely applied, it lacks minimax optimality and only considers mean embeddings while ignoring covariance operator information
Computational bottleneck of spectral regularization methods: Recently proposed spectrally regularized MMD tests achieve minimax optimality but have O(n³) computational complexity, limiting their application to large-scale data
Difficulty in parameter selection: The choice of regularization parameters depends on unknown distributional properties, lacking data-driven adaptive strategies

Research Motivation

This paper aims to significantly improve the computational efficiency of spectrally regularized two-sample tests through random Fourier features approximation while maintaining statistical optimality, and develop practical adaptive versions.

Core Contributions

Computationally efficient and statistically optimal test: Proposes an RFF-based spectrally regularized two-sample test that reduces computational complexity from O(n³) to O(nl² + nld) while maintaining minimax optimality under sufficient conditions
Theoretical guarantees: Establishes theoretical connections between RFF approximation order and statistical optimality, proving minimax optimality of the test when the number of features l satisfies specific conditions
Practical adaptive version: Develops a fully data-driven version based on permutation testing, including adaptive selection strategies for regularization parameters and kernel functions
Comprehensive experimental validation: Validates the method's effectiveness through synthetic and benchmark datasets, demonstrating good trade-offs between computational efficiency and statistical performance

Method Details

Task Definition

Given independent samples X₁:N and Y₁:M from distributions P and Q, test the hypothesis H₀: P = Q vs H₁: P ≠ Q.

Core Method Architecture

1. Spectral Regularization Framework

For kernel function K, define the spectrally regularized discrepancy:

ηλ(P,Q) = 4⟨Tgλ(T)u, u⟩_{L²(R)}

where T is the integral operator, u = dP/dR - 1 is the likelihood ratio deviation, and gλ is the regularization function.

2. Random Fourier Features Approximation

For kernels of the form K(x,y) = ∫φ(x,θ)φ(y,θ)dΞ(θ), construct the approximate kernel:

Kₗ(x,y) = (1/l)∑ᵢ₌₁ˡ φ(x,θᵢ)φ(y,θᵢ)

3. Approximate Test Statistic

Based on the approximate kernel Kₗ, construct the test statistic:

η̂λ,l = (1/[n(n-1)m(m-1)]) ∑ᵢ≠ⱼ ∑ᵢ'≠ⱼ' t(Xᵢ,Xⱼ,Yᵢ',Yⱼ')

where the function t involves the square root of the regularized covariance operator.

Technical Innovations

1. Theoretical Innovations

Minimax optimality conditions: Establishes precise relationships between RFF feature count l and statistical optimality
Polynomial and exponential decay cases: Separately analyzes different decay patterns of integral operator eigenvalues
Non-asymptotic analysis: Provides finite-sample performance guarantees

2. Algorithmic Innovations

Adaptive regularization: Implements data-driven selection of regularization parameters through joint testing
Kernel function adaptation: Extends to adaptive selection of multiple kernel functions
Permutation test implementation: Provides fully data-driven critical value computation

3. Computational Innovations

Efficient algorithms: Detailed computational complexity analysis and optimized implementation
Parallelization: Natural parallelization properties of permutation testing

Experimental Setup

Datasets

Synthetic Data:
- Gaussian mean shift: P = N(0,I), Q = N(μ,I)
- Gaussian scale shift: P = N(0,I), Q = N(0,σ²I)
- Cauchy median shift: Cauchy distributions with different medians
Real Data:
- MNIST dataset: 7×7 pixel handwritten digit images, dimension d=49
- Distribution difference detection across different digit subsets

Evaluation Metrics

Statistical Power: Probability of correctly rejecting the null hypothesis under the alternative
Computational Time: Algorithm runtime comparison
Type I Error: Controlled at α=0.05 level

Comparison Methods

Exact adaptive test: Spectral regularization test based on complete kernel matrix
Different RFF feature counts: Performance comparison for l ∈ {1,3,5,7,9,15,20}

Implementation Details

Regularization function: Showalter regularizer gλ(x) = (1-e^(-x/λ))/x
Kernel functions: Gaussian and Laplacian kernels with adaptive bandwidth selection
Number of permutations: B=550-800 for RFF version, B'=250-450 for exact version

Experimental Results

Main Results

1. Statistical Performance

Gaussian mean shift: Power approaches exact method when l≥7
Gaussian scale shift: Good performance achieved when l≥5
Cauchy distribution: Requires more features (l≥10-15) to handle heavy-tailed distributions
MNIST data: Good performance on complex real data when l≥15

2. Computational Efficiency

Significant computational time savings:

Gaussian experiments: RFF method requires only 33-44% of computation time
Scale shift: 30-40% time consumption
Cauchy experiments: Only 5-6% of computation time
MNIST: 5-15% of computation time

3. Theoretical Verification

Experimental results verify theoretical predictions:

Feature count requirements correlate with data dimensionality and distribution complexity
Computational-statistical trade-off aligns with theoretical analysis

Ablation Studies

Comparisons across different RFF feature counts verify:

Existence of feature count thresholds
Power loss from insufficient features
Optimal trade-offs with appropriate feature counts

Experimental Findings

Dimensionality effect: High-dimensional data requires more RFF features
Distribution type impact: Heavy-tailed distributions require more features for performance guarantees
Practical thresholds: Theoretically required feature counts can be moderately reduced in practice

Kernel Two-Sample Testing

MMD tests: Foundational work by Gretton et al. (2006, 2012)
Minimax optimality: Recent advances by Li & Yuan (2024), Schrab et al. (2023)
Spectral regularization: Theoretical breakthrough by Hagrass et al. (2024)

Random Features Approximation

RFF theory: Foundational framework by Rahimi & Recht (2007)
MMD acceleration: Applications by Zhao & Meng (2015), Choi & Kim (2024)
Computational-statistical trade-off: Theoretical analysis by Sriperumbudur & Sterge (2022)

Advantages over Existing Work

Main advantages compared to existing work:

More general kernel functions: Not limited to translation-invariant kernels
Broader applicability: Supports non-Euclidean domains
Stronger theoretical guarantees: Non-asymptotic minimax optimality
More practical algorithms: Fully data-driven implementation

Conclusions and Discussion

Main Conclusions

Theoretical contribution: First to establish minimax optimality theory for spectrally regularized two-sample tests under RFF approximation
Algorithmic contribution: Provides computationally efficient and statistically optimal practical algorithms
Empirical validation: Validates method effectiveness across multiple data types

Limitations

Feature count selection: While theoretical guidance is provided, practical selection still requires empirical adjustment
Kernel function dependence: Performance still depends on kernel function choice
Heavy-tailed distributions: May require large feature counts for extreme heavy-tailed distributions

Future Directions

Alternative approximation methods: Explore Nyström and other approximation techniques
Cheap permutation testing: Combine with cheap permutation testing to further reduce computational cost
Automatic hyperparameter tuning: Develop more intelligent hyperparameter selection strategies

In-Depth Evaluation

Strengths

Theoretical rigor: Provides complete non-asymptotic theoretical analysis including sufficient conditions and minimax optimality proofs
Strong practicality: Algorithms are fully data-driven, requiring no prior knowledge
Comprehensive experiments: Covers synthetic and real data with multiple distribution types
Clear presentation: Technical details are thorough with complete proofs

Weaknesses

Complexity analysis: While asymptotic complexity is reduced, constant factors may be large
Parameter sensitivity: Selection of regularization parameters and feature counts remains somewhat sensitive
Heavy-tailed handling: Performance on heavy-tailed distributions could be improved

Impact

Theoretical value: Provides new theoretical framework for computational-statistical trade-offs in kernel methods
Practical value: Has important application prospects in large-scale two-sample testing
Methodological contribution: RFF approximation application in statistical testing provides new insights for related research

Applicable Scenarios

Large-scale data: Computational advantages are significant when sample sizes are large
High-dimensional data: Shows substantial advantages over traditional methods in high-dimensional settings
Real-time applications: Computational efficiency improvements enable real-time testing
Nonparametric settings: Applicable to general cases where distributional form is unknown

References

Gretton, A., et al. (2012). A kernel two-sample test. JMLR.
Hagrass, O., et al. (2024). Spectral regularized kernel two-sample tests. Annals of Statistics.
Rahimi, A., & Recht, B. (2007). Random features for large-scale kernel machines. NIPS.
Choi, I., & Kim, I. (2024). Computational-statistical trade-off in kernel two-sample testing with random Fourier features.
Sriperumbudur, B. K., & Sterge, N. (2022). Approximate kernel PCA: Computational versus statistical trade-off. Annals of Statistics.

Overall Assessment: This is a high-quality theoretical statistics paper that successfully applies random features approximation techniques to spectrally regularized two-sample testing, significantly improving computational efficiency while maintaining statistical optimality. The paper's theoretical analysis is thorough and detailed, experimental validation is comprehensive, and it has important value for both statistical learning theory and practical applications.