2025-11-22T21:43:16.336737

A Martingale Kernel Two-Sample Test

Chatterjee, Ramdas

The Maximum Mean Discrepancy (MMD) is a widely used multivariate distance metric for two-sample testing. The standard MMD test statistic has an intractable null distribution typically requiring costly resampling or permutation approaches for calibration. In this work we leverage a martingale interpretation of the estimated squared MMD to propose martingale MMD (mMMD), a quadratic-time statistic which has a limiting standard Gaussian distribution under the null. Moreover we show that the test is consistent against any fixed alternative and for large sample sizes, mMMD offers substantial computational savings over the standard MMD test, with only a minor loss in power.

academic

A Martingale Kernel Two-Sample Test

Basic Information

Paper ID: 2510.11853
Title: A Martingale Kernel Two-Sample Test
Authors: Anirban Chatterjee (University of Chicago), Aaditya Ramdas (Carnegie Mellon University)
Classification: stat.ME, math.ST, stat.TH
Publication Date: October 13, 2025
Paper Link: https://arxiv.org/abs/2510.11853

Abstract

Maximum Mean Discrepancy (MMD) is a widely-used multivariate distance metric in two-sample testing. The standard MMD test statistic has an intractable null distribution, typically requiring expensive resampling or permutation methods for calibration. This paper leverages the martingale interpretation of the squared MMD estimator to propose martingale MMD (mMMD)—a quadratic-time statistic with an asymptotically standard Gaussian distribution under the null hypothesis. Furthermore, we prove that the test is consistent against any fixed alternative hypothesis. For large sample sizes, mMMD provides significant computational savings compared to standard MMD testing with minimal power loss.

Research Background and Motivation

Problem Description

Two-sample testing is a classical problem in statistics, aiming to test whether two distributions P and Q are equal based on independent samples: $H_0: P = Q \quad \text{vs} \quad H_1: P \neq Q$

Limitations of Existing Methods

Parametric methods: Often fail under model misspecification or on non-Euclidean data
Classical nonparametric methods: Primarily applicable to univariate data; multivariate extensions are challenging
Standard MMD testing: The null distribution is a sum of infinitely weighted χ² variables with weights depending on unknown distributions, requiring computationally intensive resampling or permutation methods

Research Motivation

MMD demonstrates excellent performance in detecting distributional differences in general domains via kernel methods
Determining the threshold τα is a key practical challenge in MMD testing
Existing moment-based parametric approximations lack consistency or accuracy guarantees
There is a need for an efficient alternative method with a tractable null distribution

Core Contributions

Proposes mMMD test: A novel MMD variant based on martingale structure with standard Gaussian null distribution
Theoretical guarantees:
- Proves asymptotic normality under the null hypothesis (Theorems 2, 3)
- Establishes consistency against fixed alternatives (Theorems 6, 7)
- Provides distributional convergence under alternatives (Theorem 8)
Computational efficiency: Avoids resampling while maintaining O(n²) complexity with significantly reduced actual runtime
Extended applications:
- Multi-kernel testing (mmMMD)
- Generalized statistic family Tn,γ, encompassing standard MMD and mMMD as special cases

Methodology Details

Task Definition

Given independent samples from two distributions P and Q on a metric space X:

Xn = {X₁, ..., Xn} ~ P
Yn = {Y₁, ..., Yn} ~ Q

Objective: Test H₀: P = Q vs H₁: P ≠ Q

Core Idea: Martingale Structure

Key observation: A modified form of the squared MMD estimator possesses a martingale structure.

Witness function approach:

Theoretically optimal witness function: f₀ = (νP - νQ)/‖νP - νQ‖K
For each 2 ≤ i ≤ n, estimate using historical data: $\hat{f}_i = \frac{1}{i}\sum_{j=1}^{i-1}[K(X_j, \cdot) - K(Y_j, \cdot)]$

mMMD Statistic

$T_n := \frac{1}{n}\sum_{i=2}^n \langle \hat{f}_i, K(X_i, \cdot) - K(Y_i, \cdot) \rangle_K$

Using the kernel trick, this simplifies to: $T_n = \frac{1}{n}\sum_{i=2}^n \frac{1}{i}\sum_{j=1}^{i-1}[K(X_i, X_j) - K(X_i, Y_j) - K(X_j, Y_i) + K(Y_i, Y_j)]$

Standardized Statistic

To achieve asymptotic normality, define the variance estimator: $\sigma_n^2 := \frac{1}{n^2}\sum_{i=2}^n \left(\frac{1}{i}\sum_{j=1}^{i-1}K(X_i, X_j) - K(X_i, Y_j) - K(X_j, Y_i) + K(Y_i, Y_j)\right)^2$

Final test statistic: $\eta_n = T_n/\sigma_n$

Testing Rule

$\Psi_n := \mathbf{1}\{\eta_n > z_{1-\alpha}\}$ where z₁₋α is the (1-α)-quantile of the standard normal distribution.

Technical Innovations

Martingale structure identification: First identification of martingale difference sequences in MMD estimators
Resampling avoidance: Directly obtains standard Gaussian distribution via martingale central limit theorem
Dimension independence: Under appropriate conditions, null distribution is independent of data dimensionality
Unified framework: The Tn,γ family unifies multiple MMD variants

Experimental Setup

Theoretical Verification Experiments

Null distribution verification:

Dimensions: d ∈ {10, 100, 250, 500}
Data distributions: Nd(0d, Id) and td(10)
Kernels: Gaussian and Laplacian (median heuristic bandwidth)
Sample size: n = 200, 2000 repetitions

Power Comparison Experiments

Setup:

P = Nd(0d, Id), Q = Nd(μd,j,ε, Id)
Configurations: (d,j,ε) = (10,5,0.3), (50,5,0.3), (100,5,0.5)
Comparison methods: Standard MMD, Linear-time MMD (LMMD), Block MMD (BMMD), Cross MMD (xMMD), BetMMD

Real Data Experiments

MNIST dataset:

5 digit pair comparisons with gradually increasing overlap
100 samples per group, 100 repetitions
Significance level: α = 0.05

Multi-kernel Experiments

Configuration:

mmMMD Gauss: 3 Gaussian kernels, bandwidths (1,2,4)λmed
mmMMD Laplace: 3 Laplacian kernels, same bandwidths
mmMMD Mixed: Mixed Gaussian and Laplacian kernels

Experimental Results

Null Distribution Verification

Main finding: The empirical distribution of ηn closely matches the standard Gaussian distribution across all settings
Robustness: Results demonstrate robustness to data distribution, kernel choice, and dimensionality
Comparative advantage: Contrasts sharply with the complex null distribution of standard MMD

Power Comparison

Method	(10,5,0.3)	(50,5,0.3)	(100,5,0.5)
mMMD	0.85	0.78	0.82
MMD	0.92	0.85	0.89
xMMD	0.83	0.76	0.80
BMMD	0.65	0.58	0.62
LMMD	0.45	0.38	0.42

Key findings:

mMMD power is comparable to standard MMD, outperforming other computationally efficient variants
Performance comparable to xMMD while avoiding sample splitting

Computational Efficiency

Sample Size	mMMD	MMD	LMMD	BMMD	xMMD
100	0.0008±0.0007	0.0817±0.0078	0.0007±0.0003	0.0006±0.0003	0.0004±0.0001
200	0.0026±0.0010	0.3150±0.0227	0.0023±0.0010	0.0020±0.0008	0.0011±0.0007
300	0.0072±0.0023	0.8335±0.0501	0.0058±0.0020	0.0050±0.0020	0.0022±0.0013

Results: mMMD is approximately 100 times faster than standard MMD, comparable to other efficient methods.

MNIST Experiment Results

Trend: Power decreases across all methods as group index increases (overlap increases)
Performance ranking: mMMD and xMMD > BMMD > LMMD
Practical significance: Validates theoretical advantages on real data

Development of Kernel Two-Sample Testing

Early methods: Conservative approaches based on large deviation bounds
Spectral methods: Spectral approximation by Gretton et al. (2009), requiring strong assumptions
Incomplete U-statistics: Linear-time MMD, block MMD, etc.
Sample splitting strategies: Kübler et al. (2022), Shekhar et al. (2022)

Advantages Relative to This Work

Theoretical completeness: Establishes distributional theory under both null and alternative hypotheses
Computational efficiency: Avoids computational burden of permutation testing
Practicality: No sample splitting required, preserves complete sample information

Conclusions and Discussion

Main Conclusions

Theoretical contribution: First utilization of martingale structure to construct MMD test with standard Gaussian null distribution
Practical value: Significantly reduces computational cost while maintaining good statistical performance
Extensibility: Framework extends to multi-kernel settings and more general statistic families

Limitations

Theoretical limitations:
- Median heuristic bandwidth selection lacks theoretical support
- Minimax optimality for γ > 1/2 remains undetermined
Practical limitations:
- Still requires O(n²) computational complexity
- Power slightly lower than standard MMD in some settings

Future Directions

Theoretical extensions:
- Theoretical guarantees for data-dependent kernels
- Applicability to more general kernel functions
- Complete characterization of minimax optimality
Method improvements:
- Integration with kernel approximation techniques to reduce complexity
- Extension to independence testing
- Applications to distance-based testing

In-Depth Evaluation

Strengths

Strong novelty: Martingale perspective represents a novel contribution to MMD research
Theoretical rigor: Complete asymptotic theory including Berry-Esseen type convergence rates
High practical value: Addresses the computational bottleneck of MMD testing
Comprehensive experiments: Full assessment from theoretical verification to real applications
Clear presentation: Good balance between technical details and intuitive explanation

Weaknesses

Theoretical gaps: Incomplete theoretical analysis of data-dependent bandwidth
Power loss: Power is indeed lower than standard MMD in some cases
Limited scope: Primarily validated on Euclidean spaces
Computational complexity: Still O(n²), no fundamental improvement achieved

Impact

Academic value: Provides new perspective for MMD theory, potentially inspiring more martingale-based methods
Practical value: Directly applicable to large-scale two-sample testing tasks
Reproducibility: Method is simple and clear, easy to implement and verify
Extensibility: Framework has good potential for extension

Applicable Scenarios

Large-scale data: Computational efficiency advantages are pronounced
High-dimensional data: Dimension-independent null distribution property is advantageous
Real-time applications: Avoids computational demands of permutation testing
Multi-kernel scenarios: mmMMD has advantages when kernel selection is uncertain

References

Gretton, A., et al. (2012a). A kernel two-sample test. JMLR, 13(1), 723-773.
Shekhar, S., Kim, I., & Ramdas, A. (2022). A permutation-free kernel two-sample test. NeurIPS, 35, 18168-18180.
Li, T. & Yuan, M. (2024). On the optimality of Gaussian kernel based nonparametric tests against smooth alternatives. JMLR, 25(334), 1-62.
Fan, X. & Shao, Q. M. (2018). Berry–Esseen bounds for self-normalized martingales. Communications in Mathematics and Statistics, 6(1), 13-27.

Summary: This is a high-quality theoretical statistics paper that provides a novel solution to the classical MMD testing problem through clever identification of martingale structure. The theoretical contributions are solid, experimental validation is comprehensive, and it possesses significant academic and practical value.