2025-11-11T16:28:09.601154

SAT-sampling for statistical significance testing in sparse contingency tables

Scharpfenecker, Windisch
Exact conditional tests for contingency tables require sampling from fibers with fixed margins. Classical Markov basis MCMC is general but often impractical: computing full Markov bases that connect all fibers of a given constraint matrix can be infeasible and the resulting chains may converge slowly, especially in sparse settings or in presence of structural zeros. We introduce a SAT-based alternative that encodes fibers as Boolean circuits which allows modern SAT samplers to generate tables randomly. We analyze the sampling bias that SAT samplers may introduce, provide diagnostics, and propose practical mitigation. We propose hybrid MCMC schemes that combine SAT proposals with local moves to ensure correct stationary distributions which do not necessarily require connectivity via local moves which is particularly beneficial in presence of structural zeros. Across benchmarks, including small and involved tables with many structural zeros where pure Markov-basis methods underperform, our methods deliver reliable conditional p-values and often outperform samplers that rely on precomputed Markov bases.
academic

SAT-sampling for statistical significance testing in sparse contingency tables

Basic Information

  • Paper ID: 2511.05709
  • Title: SAT-sampling for statistical significance testing in sparse contingency tables
  • Authors: Patrick Scharpfenecker, Tobias Windisch (University of Applied Sciences Kempten, Germany)
  • Classification: stat.ME (Statistics - Methodology), math.CO (Mathematics - Combinatorics), stat.CO (Statistics - Computation)
  • Publication Date: November 7, 2025
  • Paper Link: https://arxiv.org/abs/2511.05709

Abstract

This paper proposes a novel SAT solver-based approach to replace traditional Markov basis MCMC methods for exact conditional testing in contingency tables. The conventional approach requires computing complete Markov bases connecting all fibers, which is often infeasible and converges slowly in sparse settings or when structural zeros are present. The authors encode fibers as Boolean circuits and utilize modern SAT samplers to randomly generate tables, analyze sampling bias potentially introduced by SAT samplers, and propose practical mitigation strategies. Through a hybrid MCMC scheme combining SAT proposals and local moves, the method ensures correct stationary distributions, particularly suitable for cases with structural zeros.

Research Background and Motivation

Problem Definition

Exact conditional inference for contingency tables is an important problem in statistics, particularly for independence testing. Such problems require sampling from fibers under fixed marginal constraints, i.e., finding non-negative integer tables uu satisfying linear constraints Au=bAu = b.

Limitations of Existing Methods

Traditional Markov basis MCMC methods face two major bottlenecks:

  1. Computational Complexity: Computing complete Markov bases for realistic models and table sizes is typically computationally prohibitive or entirely infeasible
  2. Convergence Issues: Even when bases are available, induced moves may mix slowly, requiring substantial tuning effort
  3. Structural Zero Problem: Structural zeros and other constraints increase Markov basis size and complicate connectivity

Research Motivation

The authors observe that modern SAT solvers perform excellently on large structured instances, particularly conflict-driven clause learning (CDCL) solvers. Recent advances in SAT sampling techniques (such as UniGen3, CMSGen, etc.) provide new possibilities for solving fiber sampling problems.

Core Contributions

  1. SAT Encoding Method: Proposes an efficient method for encoding fiber constraints as Boolean circuits, converting them to CNF form via Tseitin transformation while maintaining sparsity and enabling strong unit propagation in CDCL solvers
  2. Sampling Bias Analysis: Quantifies the extent and structure of sampling bias in state-of-the-art SAT samplers, develops practical mitigation techniques to improve accuracy of conditional p-values
  3. Hybrid MCMC Scheme: Proposes two hybrid schemes An(M)A_n(M) and Pn,k(M)P_{n,k}(M) combining SAT proposals and local moves, ensuring correct stationary distributions
  4. Performance Improvement: Demonstrates performance advantages over traditional Markov basis methods on benchmark tests including small complex tables with structural zeros

Methodology Details

Task Definition

Given constraint matrix ANk×dA \in \mathbb{N}^{k \times d} and marginal vector bZkb \in \mathbb{Z}^k, the objective is to sample from fiber FA,b={uNd:Au=b}F_{A,b} = \{u \in \mathbb{N}^d : Au = b\} to approximate conditional p-values:

Eρ[f]=uFA,bf(u)ρ(u)E_\rho[f] = \sum_{u \in F_{A,b}} f(u)\rho(u)

where ρ(v)1v1!vd!\rho(v) \sim \frac{1}{v_1! \cdots v_d!}, f(v)=1X(v)X(uobs)f(v) = \mathbf{1}_{X(v) \geq X(u^{obs})}

SAT Encoding Architecture

Boolean Circuit Encoding

  1. Constraint Representation: Reformulates linear constraints Au=bAu = b as a series of addition, multiplication, and equality checks
  2. Bit Representation: Uses ll bits to represent each entry, where l=log2(maxi,j,Ai,j>0biAi,j)l = \lceil \log_2(\max_{i,j,A_{i,j}>0} \frac{b_i}{A_{i,j}}) \rceil
  3. Circuit Construction: Constructs Boolean circuit CC of size poly(k,d,l)\text{poly}(k,d,l)

Tseitin Transformation

Uses classical Tseitin encoding to convert circuit CC to CNF form FF, satisfying:

  • C(u1,,ud)=1C(u_1, \ldots, u_d) = 1 if and only if there exist y1,,ymy_1, \ldots, y_m such that F(u1,,ud,y1,,ym)=1F(u_1, \ldots, u_d, y_1, \ldots, y_m) = 1
  • Establishes bijection between FA,b[2l1]dF_{A,b} \cap [2^l-1]^d and satisfying assignments of FF

Hybrid MCMC Schemes

An(M)A_n(M) Scheme

One step out of every nn steps uses SAT sampler, remaining steps use predefined move set MM:

  • Alternates between SAT steps and Markov basis moves
  • Maintains low SAT step proportion to mitigate structural bias

Pn,k(M)P_{n,k}(M) Scheme

Manages kk random walks in parallel:

  • First uses SAT sampler to sample nn independent initial points from fiber
  • Then executes kk random walks using MM
  • Randomly selects one walk to continue for nn steps at each iteration

Metropolis-Hastings Correction

For SAT proposals, computes acceptance probability: pW(u,v)=min{1,W(v,u)W(u,v)i=1dui!vi!}p_W(u,v) = \min\left\{1, \frac{W(v,u)}{W(u,v)} \cdot \prod_{i=1}^d \frac{u_i!}{v_i!}\right\}

Experimental Setup

Model Categories

  1. Independence Models Id1××dkI_{d_1 \times \cdots \times d_k}: d1×d2××dkd_1 \times d_2 \times \cdots \times d_k independence models
  2. Quasi-Independence Models QId1××dk(S)QI_{d_1 \times \cdots \times d_k}(S): Independence models with structural zeros SS
  3. No Three-Factor Interaction Models N3FdN3F_d: No three-factor interaction models on d×d×dd \times d \times d tables

Evaluation Scheme

Uses evaluation scheme from Algorithm 1:

  • Generates T=100T=100 initial samples
  • Runs Fisher test on each sample
  • Measures steps required to converge to conditional p-value (rather than sample count)
  • Evaluates steps needed to achieve 0.005 accuracy

Implementation Details

  • Uses CMSGen as primary SAT sampler (faster than UniGen3)
  • Implements general gradient descent method for MLE computation
  • Uses L-BFGS optimization, monitors marginal difference A(uobsu^(θ))2\|A(u^{obs} - \hat{u}(\theta))\|_2

Experimental Results

Main Results

Experimental results demonstrate SAT-based method outperforms Markov basis method in multiple scenarios, particularly in:

  1. Sparse Tables: Excels when sample size is small or structural zeros are present
  2. Complex Structures: An(M)A_n(M) scheme outperforms Pn,k(M)P_{n,k}(M) on large problem instances
  3. Structural Zero Handling: Ensures convergence to correct p-values without requiring complete Markov bases (e.g., Graver bases)

Specific Performance

  • N3F4 Model: Hybrid method significantly outperforms pure Markov method at sample sizes 80 and 100
  • QI5×5 Model: Convergence to true p-values verified through complete fiber enumeration
  • QI10×10 Model: Demonstrates faster convergence across various sample sizes

Sampling Bias Analysis

Figure 2 demonstrates strong structural bias in pure SAT method, rendering it unsuitable for direct p-value approximation, but hybrid schemes successfully mitigate this issue, achieving convergence to correct p-values.

Traditional Methods

  • Markov Basis MCMC: Seminal work by Diaconis and Sturmfels (1998)
  • Dynamic Markov Bases: On-the-fly move computation by Dobra (2011)
  • Lattice Basis Methods: Lattice basis moves by Hazelton and Karimi (2024)

Modern Developments

  • RUMBA Method: High-dimensional lattice point sampling by Bakenhus and Petrović (2024)
  • Problem-Specific Strategies: Independence testing for large sparse tables by Zhang (2019)
  • Heat-Bath Methods: Dynamic move length adjustment by Stanley and Windisch (2018)

SAT Technology

  • SAT Solvers: High-performance solvers including CryptoMiniSat5, Kissat
  • SAT Sampling: Sampling tools including UniGen3, CMSGen, SMT-Sampler

Conclusions and Discussion

Main Conclusions

  1. SAT-based method provides effective alternative for fiber sampling, particularly suitable for sparse settings with structural zeros
  2. Hybrid MCMC scheme successfully mitigates structural bias from SAT samplers
  3. Method significantly outperforms traditional Markov basis methods in complex scenarios involving small sample sizes or abundant structural zeros

Limitations

  1. Computational Overhead: Single SAT sampling step may require more time than single local move
  2. Memory Requirements: Boolean encoding of large design matrices and rich constraint sets may grow rapidly
  3. Hyperparameter Tuning: Hybrid method introduces hyperparameters requiring tuning (e.g., number of walks, steps per walk)

Future Directions

  1. More efficient encoding methods for ultra-high-dimensional constraint systems
  2. Adaptive hyperparameter selection strategies
  3. Integration with other modern sampling techniques

In-Depth Evaluation

Strengths

  1. Strong Innovation: First systematic application of SAT technology to fiber sampling problems
  2. Solid Theory: Provides rigorous sampling bias analysis and mitigation strategies
  3. Comprehensive Experiments: Covers multiple model categories and scenarios with thorough benchmarking
  4. High Practical Value: Particularly applicable to sparse and structural zero scenarios where traditional methods fail

Weaknesses

  1. Scalability Limitations: For extremely large-scale problems, memory requirements of Boolean encoding may become bottleneck
  2. Parameter Sensitivity: Performance of hybrid scheme depends on hyperparameter selection, lacking automatic tuning guidance
  3. Limited Comparisons: Primarily compares with traditional Markov basis methods, lacks comparison with other modern approaches

Impact

  1. Academic Contribution: Opens new directions for cross-disciplinary research between statistical computation and combinatorial optimization
  2. Practical Value: Provides practical tools for analyzing complex contingency tables
  3. Reproducibility: Commits to open-source implementation, facilitating method dissemination

Applicable Scenarios

  1. Analysis of sparse contingency tables with abundant structural zeros
  2. High-dimensional constraint problems where traditional Markov basis computation is infeasible
  3. Scenarios requiring rapid exploration of distant fiber regions
  4. Exact conditional testing with small sample sizes

References

This paper cites important literature from statistics, combinatorial mathematics, and computer science, including:

  • Diaconis and Sturmfels (1998): Seminal work on algebraic algorithms for sampling conditional distributions
  • Modern SAT solver literature: CryptoMiniSat5, UniGen3, CMSGen, etc.
  • Statistical computation methods: Related research on Markov bases, dynamic bases, and lattice bases