The Maximum Mean Discrepancy (MMD) is a widely used multivariate distance metric for two-sample testing. The standard MMD test statistic has an intractable null distribution typically requiring costly resampling or permutation approaches for calibration. In this work we leverage a martingale interpretation of the estimated squared MMD to propose martingale MMD (mMMD), a quadratic-time statistic which has a limiting standard Gaussian distribution under the null. Moreover we show that the test is consistent against any fixed alternative and for large sample sizes, mMMD offers substantial computational savings over the standard MMD test, with only a minor loss in power.
- Paper ID: 2510.11853
- Title: A Martingale Kernel Two-Sample Test
- Authors: Anirban Chatterjee (University of Chicago), Aaditya Ramdas (Carnegie Mellon University)
- Classification: stat.ME, math.ST, stat.TH
- Publication Date: October 13, 2025
- Paper Link: https://arxiv.org/abs/2510.11853
Maximum Mean Discrepancy (MMD) is a widely-used multivariate distance metric in two-sample testing. The standard MMD test statistic has an intractable null distribution, typically requiring expensive resampling or permutation methods for calibration. This paper leverages the martingale interpretation of the squared MMD estimator to propose martingale MMD (mMMD)—a quadratic-time statistic with an asymptotically standard Gaussian distribution under the null hypothesis. Furthermore, we prove that the test is consistent against any fixed alternative hypothesis. For large sample sizes, mMMD provides significant computational savings compared to standard MMD testing with minimal power loss.
Two-sample testing is a classical problem in statistics, aiming to test whether two distributions P and Q are equal based on independent samples:
H0:P=QvsH1:P=Q
- Parametric methods: Often fail under model misspecification or on non-Euclidean data
- Classical nonparametric methods: Primarily applicable to univariate data; multivariate extensions are challenging
- Standard MMD testing: The null distribution is a sum of infinitely weighted χ² variables with weights depending on unknown distributions, requiring computationally intensive resampling or permutation methods
- MMD demonstrates excellent performance in detecting distributional differences in general domains via kernel methods
- Determining the threshold τα is a key practical challenge in MMD testing
- Existing moment-based parametric approximations lack consistency or accuracy guarantees
- There is a need for an efficient alternative method with a tractable null distribution
- Proposes mMMD test: A novel MMD variant based on martingale structure with standard Gaussian null distribution
- Theoretical guarantees:
- Proves asymptotic normality under the null hypothesis (Theorems 2, 3)
- Establishes consistency against fixed alternatives (Theorems 6, 7)
- Provides distributional convergence under alternatives (Theorem 8)
- Computational efficiency: Avoids resampling while maintaining O(n²) complexity with significantly reduced actual runtime
- Extended applications:
- Multi-kernel testing (mmMMD)
- Generalized statistic family Tn,γ, encompassing standard MMD and mMMD as special cases
Given independent samples from two distributions P and Q on a metric space X:
- Xn = {X₁, ..., Xn} ~ P
- Yn = {Y₁, ..., Yn} ~ Q
Objective: Test H₀: P = Q vs H₁: P ≠ Q
Key observation: A modified form of the squared MMD estimator possesses a martingale structure.
Witness function approach:
- Theoretically optimal witness function: f₀ = (νP - νQ)/‖νP - νQ‖K
- For each 2 ≤ i ≤ n, estimate using historical data:
f^i=i1∑j=1i−1[K(Xj,⋅)−K(Yj,⋅)]
Tn:=n1∑i=2n⟨f^i,K(Xi,⋅)−K(Yi,⋅)⟩K
Using the kernel trick, this simplifies to:
Tn=n1∑i=2ni1∑j=1i−1[K(Xi,Xj)−K(Xi,Yj)−K(Xj,Yi)+K(Yi,Yj)]
To achieve asymptotic normality, define the variance estimator:
σn2:=n21∑i=2n(i1∑j=1i−1K(Xi,Xj)−K(Xi,Yj)−K(Xj,Yi)+K(Yi,Yj))2
Final test statistic:
ηn=Tn/σn
Ψn:=1{ηn>z1−α}
where z₁₋α is the (1-α)-quantile of the standard normal distribution.
- Martingale structure identification: First identification of martingale difference sequences in MMD estimators
- Resampling avoidance: Directly obtains standard Gaussian distribution via martingale central limit theorem
- Dimension independence: Under appropriate conditions, null distribution is independent of data dimensionality
- Unified framework: The Tn,γ family unifies multiple MMD variants
Null distribution verification:
- Dimensions: d ∈ {10, 100, 250, 500}
- Data distributions: Nd(0d, Id) and td(10)
- Kernels: Gaussian and Laplacian (median heuristic bandwidth)
- Sample size: n = 200, 2000 repetitions
Setup:
- P = Nd(0d, Id), Q = Nd(μd,j,ε, Id)
- Configurations: (d,j,ε) = (10,5,0.3), (50,5,0.3), (100,5,0.5)
- Comparison methods: Standard MMD, Linear-time MMD (LMMD), Block MMD (BMMD), Cross MMD (xMMD), BetMMD
MNIST dataset:
- 5 digit pair comparisons with gradually increasing overlap
- 100 samples per group, 100 repetitions
- Significance level: α = 0.05
Configuration:
- mmMMD Gauss: 3 Gaussian kernels, bandwidths (1,2,4)λmed
- mmMMD Laplace: 3 Laplacian kernels, same bandwidths
- mmMMD Mixed: Mixed Gaussian and Laplacian kernels
- Main finding: The empirical distribution of ηn closely matches the standard Gaussian distribution across all settings
- Robustness: Results demonstrate robustness to data distribution, kernel choice, and dimensionality
- Comparative advantage: Contrasts sharply with the complex null distribution of standard MMD
| Method | (10,5,0.3) | (50,5,0.3) | (100,5,0.5) |
|---|
| mMMD | 0.85 | 0.78 | 0.82 |
| MMD | 0.92 | 0.85 | 0.89 |
| xMMD | 0.83 | 0.76 | 0.80 |
| BMMD | 0.65 | 0.58 | 0.62 |
| LMMD | 0.45 | 0.38 | 0.42 |
Key findings:
- mMMD power is comparable to standard MMD, outperforming other computationally efficient variants
- Performance comparable to xMMD while avoiding sample splitting
| Sample Size | mMMD | MMD | LMMD | BMMD | xMMD |
|---|
| 100 | 0.0008±0.0007 | 0.0817±0.0078 | 0.0007±0.0003 | 0.0006±0.0003 | 0.0004±0.0001 |
| 200 | 0.0026±0.0010 | 0.3150±0.0227 | 0.0023±0.0010 | 0.0020±0.0008 | 0.0011±0.0007 |
| 300 | 0.0072±0.0023 | 0.8335±0.0501 | 0.0058±0.0020 | 0.0050±0.0020 | 0.0022±0.0013 |
Results: mMMD is approximately 100 times faster than standard MMD, comparable to other efficient methods.
- Trend: Power decreases across all methods as group index increases (overlap increases)
- Performance ranking: mMMD and xMMD > BMMD > LMMD
- Practical significance: Validates theoretical advantages on real data
- Early methods: Conservative approaches based on large deviation bounds
- Spectral methods: Spectral approximation by Gretton et al. (2009), requiring strong assumptions
- Incomplete U-statistics: Linear-time MMD, block MMD, etc.
- Sample splitting strategies: Kübler et al. (2022), Shekhar et al. (2022)
- Theoretical completeness: Establishes distributional theory under both null and alternative hypotheses
- Computational efficiency: Avoids computational burden of permutation testing
- Practicality: No sample splitting required, preserves complete sample information
- Theoretical contribution: First utilization of martingale structure to construct MMD test with standard Gaussian null distribution
- Practical value: Significantly reduces computational cost while maintaining good statistical performance
- Extensibility: Framework extends to multi-kernel settings and more general statistic families
- Theoretical limitations:
- Median heuristic bandwidth selection lacks theoretical support
- Minimax optimality for γ > 1/2 remains undetermined
- Practical limitations:
- Still requires O(n²) computational complexity
- Power slightly lower than standard MMD in some settings
- Theoretical extensions:
- Theoretical guarantees for data-dependent kernels
- Applicability to more general kernel functions
- Complete characterization of minimax optimality
- Method improvements:
- Integration with kernel approximation techniques to reduce complexity
- Extension to independence testing
- Applications to distance-based testing
- Strong novelty: Martingale perspective represents a novel contribution to MMD research
- Theoretical rigor: Complete asymptotic theory including Berry-Esseen type convergence rates
- High practical value: Addresses the computational bottleneck of MMD testing
- Comprehensive experiments: Full assessment from theoretical verification to real applications
- Clear presentation: Good balance between technical details and intuitive explanation
- Theoretical gaps: Incomplete theoretical analysis of data-dependent bandwidth
- Power loss: Power is indeed lower than standard MMD in some cases
- Limited scope: Primarily validated on Euclidean spaces
- Computational complexity: Still O(n²), no fundamental improvement achieved
- Academic value: Provides new perspective for MMD theory, potentially inspiring more martingale-based methods
- Practical value: Directly applicable to large-scale two-sample testing tasks
- Reproducibility: Method is simple and clear, easy to implement and verify
- Extensibility: Framework has good potential for extension
- Large-scale data: Computational efficiency advantages are pronounced
- High-dimensional data: Dimension-independent null distribution property is advantageous
- Real-time applications: Avoids computational demands of permutation testing
- Multi-kernel scenarios: mmMMD has advantages when kernel selection is uncertain
- Gretton, A., et al. (2012a). A kernel two-sample test. JMLR, 13(1), 723-773.
- Shekhar, S., Kim, I., & Ramdas, A. (2022). A permutation-free kernel two-sample test. NeurIPS, 35, 18168-18180.
- Li, T. & Yuan, M. (2024). On the optimality of Gaussian kernel based nonparametric tests against smooth alternatives. JMLR, 25(334), 1-62.
- Fan, X. & Shao, Q. M. (2018). Berry–Esseen bounds for self-normalized martingales. Communications in Mathematics and Statistics, 6(1), 13-27.
Summary: This is a high-quality theoretical statistics paper that provides a novel solution to the classical MMD testing problem through clever identification of martingale structure. The theoretical contributions are solid, experimental validation is comprehensive, and it possesses significant academic and practical value.