Exact conditional tests for contingency tables require sampling from fibers with fixed margins. Classical Markov basis MCMC is general but often impractical: computing full Markov bases that connect all fibers of a given constraint matrix can be infeasible and the resulting chains may converge slowly, especially in sparse settings or in presence of structural zeros. We introduce a SAT-based alternative that encodes fibers as Boolean circuits which allows modern SAT samplers to generate tables randomly. We analyze the sampling bias that SAT samplers may introduce, provide diagnostics, and propose practical mitigation. We propose hybrid MCMC schemes that combine SAT proposals with local moves to ensure correct stationary distributions which do not necessarily require connectivity via local moves which is particularly beneficial in presence of structural zeros. Across benchmarks, including small and involved tables with many structural zeros where pure Markov-basis methods underperform, our methods deliver reliable conditional p-values and often outperform samplers that rely on precomputed Markov bases.
- Paper ID: 2511.05709
- Title: SAT-sampling for statistical significance testing in sparse contingency tables
- Authors: Patrick Scharpfenecker, Tobias Windisch (University of Applied Sciences Kempten, Germany)
- Classification: stat.ME (Statistics - Methodology), math.CO (Mathematics - Combinatorics), stat.CO (Statistics - Computation)
- Publication Date: November 7, 2025
- Paper Link: https://arxiv.org/abs/2511.05709
This paper proposes a novel SAT solver-based approach to replace traditional Markov basis MCMC methods for exact conditional testing in contingency tables. The conventional approach requires computing complete Markov bases connecting all fibers, which is often infeasible and converges slowly in sparse settings or when structural zeros are present. The authors encode fibers as Boolean circuits and utilize modern SAT samplers to randomly generate tables, analyze sampling bias potentially introduced by SAT samplers, and propose practical mitigation strategies. Through a hybrid MCMC scheme combining SAT proposals and local moves, the method ensures correct stationary distributions, particularly suitable for cases with structural zeros.
Exact conditional inference for contingency tables is an important problem in statistics, particularly for independence testing. Such problems require sampling from fibers under fixed marginal constraints, i.e., finding non-negative integer tables u satisfying linear constraints Au=b.
Traditional Markov basis MCMC methods face two major bottlenecks:
- Computational Complexity: Computing complete Markov bases for realistic models and table sizes is typically computationally prohibitive or entirely infeasible
- Convergence Issues: Even when bases are available, induced moves may mix slowly, requiring substantial tuning effort
- Structural Zero Problem: Structural zeros and other constraints increase Markov basis size and complicate connectivity
The authors observe that modern SAT solvers perform excellently on large structured instances, particularly conflict-driven clause learning (CDCL) solvers. Recent advances in SAT sampling techniques (such as UniGen3, CMSGen, etc.) provide new possibilities for solving fiber sampling problems.
- SAT Encoding Method: Proposes an efficient method for encoding fiber constraints as Boolean circuits, converting them to CNF form via Tseitin transformation while maintaining sparsity and enabling strong unit propagation in CDCL solvers
- Sampling Bias Analysis: Quantifies the extent and structure of sampling bias in state-of-the-art SAT samplers, develops practical mitigation techniques to improve accuracy of conditional p-values
- Hybrid MCMC Scheme: Proposes two hybrid schemes An(M) and Pn,k(M) combining SAT proposals and local moves, ensuring correct stationary distributions
- Performance Improvement: Demonstrates performance advantages over traditional Markov basis methods on benchmark tests including small complex tables with structural zeros
Given constraint matrix A∈Nk×d and marginal vector b∈Zk, the objective is to sample from fiber FA,b={u∈Nd:Au=b} to approximate conditional p-values:
Eρ[f]=∑u∈FA,bf(u)ρ(u)
where ρ(v)∼v1!⋯vd!1, f(v)=1X(v)≥X(uobs)
- Constraint Representation: Reformulates linear constraints Au=b as a series of addition, multiplication, and equality checks
- Bit Representation: Uses l bits to represent each entry, where l=⌈log2(maxi,j,Ai,j>0Ai,jbi)⌉
- Circuit Construction: Constructs Boolean circuit C of size poly(k,d,l)
Uses classical Tseitin encoding to convert circuit C to CNF form F, satisfying:
- C(u1,…,ud)=1 if and only if there exist y1,…,ym such that F(u1,…,ud,y1,…,ym)=1
- Establishes bijection between FA,b∩[2l−1]d and satisfying assignments of F
One step out of every n steps uses SAT sampler, remaining steps use predefined move set M:
- Alternates between SAT steps and Markov basis moves
- Maintains low SAT step proportion to mitigate structural bias
Manages k random walks in parallel:
- First uses SAT sampler to sample n independent initial points from fiber
- Then executes k random walks using M
- Randomly selects one walk to continue for n steps at each iteration
For SAT proposals, computes acceptance probability:
pW(u,v)=min{1,W(u,v)W(v,u)⋅∏i=1dvi!ui!}
- Independence Models Id1×⋯×dk: d1×d2×⋯×dk independence models
- Quasi-Independence Models QId1×⋯×dk(S): Independence models with structural zeros S
- No Three-Factor Interaction Models N3Fd: No three-factor interaction models on d×d×d tables
Uses evaluation scheme from Algorithm 1:
- Generates T=100 initial samples
- Runs Fisher test on each sample
- Measures steps required to converge to conditional p-value (rather than sample count)
- Evaluates steps needed to achieve 0.005 accuracy
- Uses CMSGen as primary SAT sampler (faster than UniGen3)
- Implements general gradient descent method for MLE computation
- Uses L-BFGS optimization, monitors marginal difference ∥A(uobs−u^(θ))∥2
Experimental results demonstrate SAT-based method outperforms Markov basis method in multiple scenarios, particularly in:
- Sparse Tables: Excels when sample size is small or structural zeros are present
- Complex Structures: An(M) scheme outperforms Pn,k(M) on large problem instances
- Structural Zero Handling: Ensures convergence to correct p-values without requiring complete Markov bases (e.g., Graver bases)
- N3F4 Model: Hybrid method significantly outperforms pure Markov method at sample sizes 80 and 100
- QI5×5 Model: Convergence to true p-values verified through complete fiber enumeration
- QI10×10 Model: Demonstrates faster convergence across various sample sizes
Figure 2 demonstrates strong structural bias in pure SAT method, rendering it unsuitable for direct p-value approximation, but hybrid schemes successfully mitigate this issue, achieving convergence to correct p-values.
- Markov Basis MCMC: Seminal work by Diaconis and Sturmfels (1998)
- Dynamic Markov Bases: On-the-fly move computation by Dobra (2011)
- Lattice Basis Methods: Lattice basis moves by Hazelton and Karimi (2024)
- RUMBA Method: High-dimensional lattice point sampling by Bakenhus and Petrović (2024)
- Problem-Specific Strategies: Independence testing for large sparse tables by Zhang (2019)
- Heat-Bath Methods: Dynamic move length adjustment by Stanley and Windisch (2018)
- SAT Solvers: High-performance solvers including CryptoMiniSat5, Kissat
- SAT Sampling: Sampling tools including UniGen3, CMSGen, SMT-Sampler
- SAT-based method provides effective alternative for fiber sampling, particularly suitable for sparse settings with structural zeros
- Hybrid MCMC scheme successfully mitigates structural bias from SAT samplers
- Method significantly outperforms traditional Markov basis methods in complex scenarios involving small sample sizes or abundant structural zeros
- Computational Overhead: Single SAT sampling step may require more time than single local move
- Memory Requirements: Boolean encoding of large design matrices and rich constraint sets may grow rapidly
- Hyperparameter Tuning: Hybrid method introduces hyperparameters requiring tuning (e.g., number of walks, steps per walk)
- More efficient encoding methods for ultra-high-dimensional constraint systems
- Adaptive hyperparameter selection strategies
- Integration with other modern sampling techniques
- Strong Innovation: First systematic application of SAT technology to fiber sampling problems
- Solid Theory: Provides rigorous sampling bias analysis and mitigation strategies
- Comprehensive Experiments: Covers multiple model categories and scenarios with thorough benchmarking
- High Practical Value: Particularly applicable to sparse and structural zero scenarios where traditional methods fail
- Scalability Limitations: For extremely large-scale problems, memory requirements of Boolean encoding may become bottleneck
- Parameter Sensitivity: Performance of hybrid scheme depends on hyperparameter selection, lacking automatic tuning guidance
- Limited Comparisons: Primarily compares with traditional Markov basis methods, lacks comparison with other modern approaches
- Academic Contribution: Opens new directions for cross-disciplinary research between statistical computation and combinatorial optimization
- Practical Value: Provides practical tools for analyzing complex contingency tables
- Reproducibility: Commits to open-source implementation, facilitating method dissemination
- Analysis of sparse contingency tables with abundant structural zeros
- High-dimensional constraint problems where traditional Markov basis computation is infeasible
- Scenarios requiring rapid exploration of distant fiber regions
- Exact conditional testing with small sample sizes
This paper cites important literature from statistics, combinatorial mathematics, and computer science, including:
- Diaconis and Sturmfels (1998): Seminal work on algebraic algorithms for sampling conditional distributions
- Modern SAT solver literature: CryptoMiniSat5, UniGen3, CMSGen, etc.
- Statistical computation methods: Related research on Markov bases, dynamic bases, and lattice bases