2025-11-18T15:28:13.400087

Local Causal Discovery for Statistically Efficient Causal Inference

Schubert, Claassen, Magliacane
Causal discovery methods can identify valid adjustment sets for causal effect estimation for a pair of target variables, even when the underlying causal graph is unknown. Global causal discovery methods focus on learning the whole causal graph and therefore enable the recovery of optimal adjustment sets, i.e., sets with the lowest asymptotic variance, but they quickly become computationally prohibitive as the number of variables grows. Local causal discovery methods offer a more scalable alternative by focusing on the local neighborhood of the target variables, but are restricted to statistically suboptimal adjustment sets. In this work, we propose Local Optimal Adjustments Discovery (LOAD), a sound and complete causal discovery approach that combines the computational efficiency of local methods with the statistical optimality of global methods. First, LOAD identifies the causal relation between the targets and tests if the causal effect is identifiable by using only local information. If it is identifiable, it then finds the optimal adjustment set by leveraging local causal discovery to infer the mediators and their parents. Otherwise, it returns the locally valid parent adjustment sets based on the learned local structure. In our experiments on synthetic and realistic data LOAD outperforms global methods in scalability, while providing more accurate effect estimation than local methods.
academic

Local Causal Discovery for Statistically Efficient Causal Inference

Basic Information

  • Paper ID: 2510.14582
  • Title: Local Causal Discovery for Statistically Efficient Causal Inference
  • Authors: Mátyás Schubert (University of Amsterdam), Tom Claassen (Radboud University Nijmegen), Sara Magliacane (University of Amsterdam)
  • Classification: stat.ML cs.AI cs.LG
  • Publication Date: October 16, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.14582v1

Abstract

Causal discovery methods can identify valid adjustment sets for estimating causal effects between a pair of target variables, even when the underlying causal graph is unknown. Global causal discovery methods focus on learning the entire causal graph and thus can recover optimal adjustment sets (i.e., those with the lowest asymptotic variance), but they quickly become computationally intractable as the number of variables grows. Local causal discovery methods offer more scalable alternatives by focusing on the local neighborhood of target variables, but are limited to statistically suboptimal adjustment sets. In this work, the authors propose Local Optimal Adjustment Discovery (LOAD), a reliable and complete causal discovery method that combines the computational efficiency of local methods with the statistical optimality of global methods.

Research Background and Motivation

Problem Definition

In causal inference, estimating the causal effect between two variables is a core task. When the underlying causal graph is unknown, causal discovery methods are needed to identify valid adjustment sets for causal effect estimation. Existing methods face a fundamental trade-off:

  1. The Dilemma of Global Methods: Global causal discovery methods (e.g., PC algorithm) can learn the complete causal graph and recover optimal adjustment sets, but their computational complexity grows exponentially with the number of variables, making them infeasible for large-scale problems.
  2. Limitations of Local Methods: Local causal discovery methods (e.g., MB-by-MB, LDECC) are computationally efficient but can only recover suboptimal adjustment sets, resulting in higher asymptotic variance in causal effect estimation.

Research Motivation

The authors identify the following problems with existing local methods:

  • The LocalPC algorithm is unreliable in identifying adjacent variables and may incorrectly identify non-adjacent spouses as adjacent
  • The LDECC algorithm is incomplete and cannot orient all orientable edges in some cases
  • The LDP algorithm may incorrectly report effects as unidentifiable when they are actually identifiable with zero effect

Therefore, there is a need for a new method that maintains the computational efficiency of local methods while achieving the statistical optimality of global methods.

Core Contributions

  1. Developed methods for determining causal effect identifiability based on local information: Proposed necessary and sufficient conditions for determining whether a causal effect is identifiable using only local information.
  2. Proposed the LOAD algorithm: A reliable and complete method that identifies optimal adjustment sets using only local information around variables.
  3. Comprehensive experimental evaluation: Evaluated LOAD on synthetic and real data, demonstrating its ability to recover high-quality adjustment sets at low computational cost.
  4. Theoretical guarantees: Proved the reliability and completeness of LOAD in determining causal effect identifiability and finding optimal adjustment sets.

Methodology Details

Task Definition

Given a pair of target variables X and Y, the objectives are:

  1. Determine the causal relationship between X and Y (explicit ancestor, possible ancestor, or definite non-ancestor)
  2. Determine whether the causal effect is identifiable
  3. If identifiable, find the optimal adjustment set; otherwise, return a locally valid parental adjustment set

LOAD Algorithm Architecture

The LOAD algorithm consists of five main steps:

Step 1: Determine Causal Relationship Between Target Variables

Using the LocalRelate algorithm (Algorithm 1), relationships are determined through the following theorems:

  • Explicit Ancestor Relationship (Theorem 4.1): For any two distinct nodes X and Y in CPDAG G, X ∈ ExplAn_G(Y) if and only if X ⊥̸⊥ Y | Pa_G(X) ∪ Sib_G(X)
  • Definite Non-Ancestor Relationship (Theorem 4.2): X is a definite non-ancestor of Y if and only if X ⊥⊥ Y | Pa_G(X)

Step 2: Test Causal Effect Identifiability

An adaptive test based on local information is proposed:

Lemma 4.3: For X ∈ PossAn_G(Y) in CPDAG G, G is adjustment-amenable with respect to (X,Y) if and only if:

∀V ∈ Sib_G(X) : V ⊥⊥ Y | Pa_G(V) ∪ {X}

This condition can be efficiently detected through the LocalAmenTest algorithm (Algorithm 2).

Steps 3-5: Construct Optimal Adjustment Set

If the causal effect is identifiable, LOAD constructs the optimal adjustment set through the following steps:

  1. Find explicit descendants: Identify all explicit descendants of T
  2. Identify mediator nodes: Find nodes that are both explicit descendants of T and explicit ancestors of O
  3. Construct optimal adjustment set:
    Oset_G(T,O) = Pa_G(Cn_G(T,O)) \ (Cn_G(T,O) ∪ {T})
    

Technical Innovations

  1. Local Amenability Test: First proposes necessary and sufficient conditions for testing amenability using only local information, avoiding the need to check all possible directed paths.
  2. Caching Mechanism: An improved MB-by-MB algorithm uses caching to reuse Markov blankets and local structures identified in previous runs, significantly improving computational efficiency.
  3. Theoretical Completeness: Proves that LOAD is reliable and complete in determining causal relationships, identifiability, and optimal adjustment sets.

Experimental Setup

Datasets

  1. Synthetic Data:
    • Randomly generated Erdős–Rényi graphs
    • Number of variables: 100-1000
    • Expected degree: d=2, maximum degree: dmax=10
    • Sample size: nD=10000
  2. Real Networks:
    • MAGIC-NIAB network: 44 nodes, average degree 3
    • ANDES network: 223 nodes, average degree 3.03

Evaluation Metrics

  1. Computational Efficiency: Number of conditional independence tests
  2. Adjustment Set Quality: F1 score of optimal adjustment sets
  3. Causal Effect Estimation Quality: Intervention distance

Comparison Methods

  • Global Methods: PC, MARVEL, SNAP(∞)
  • Local Methods: MB-by-MB+, LDECC+, LDP+ (extended versions)

Implementation Details

  • Significance level: α = 0.01
  • Three types of conditional independence tests: oracle d-separation, Fisher-Z test, G² test
  • Each setting run 100 times, excluding the best and worst 5 results

Experimental Results

Main Results

Computational Efficiency

LOAD consistently requires fewer conditional independence tests than global methods across all settings, with only slightly more than local methods:

  • At 1000 nodes, LOAD requires 9.43×10³ tests, while PC requires 542.52×10³
  • Compared to MB-by-MB+'s 5.64×10³ tests, LOAD's additional overhead is reasonable

Adjustment Set Quality (F1 Score)

  • Oracle Setting: LOAD achieves perfect F1=1.0, comparable to global methods
  • Fisher-Z Test: LOAD outperforms baseline methods across all node counts, with F1 scores approximately 0.91-0.95
  • G² Test: LOAD performs suboptimally but remains the second-best method

Intervention Distance

LOAD achieves the lowest intervention distance in most settings:

  • Oracle setting: 0.003 (comparable to PC and SNAP)
  • Fisher-Z test: 0.014-0.026 (best)
  • G² test: 0.022-0.036 (second best, only behind PC)

Real Data Results

On the MAGIC-NIAB network:

  • LOAD achieves the best F1 score (0.62)
  • Achieves the lowest intervention distance (0.007)
  • Number of conditional independence tests (4.35×10³) falls between local and global methods

Ablation Studies

  1. Known Treatment-Outcome Relationship: When background knowledge is provided, LOAD* surpasses PC on binary data
  2. Identifiable Target Pairs: Result patterns remain consistent when ensuring causal effects are identifiable
  3. Parameter Sensitivity: LOAD is robust to different sample sizes and expected degrees

Global Causal Discovery Methods

  • PC Algorithm: Classical constraint-based method, but with high computational complexity
  • MARVEL: Recursive method, still difficult to scale to hundreds of variables
  • SNAP: Progressive identification and removal of definite non-ancestors, but still requires causal discovery on all possible ancestors

Local Causal Discovery Methods

  • MB-by-MB: Sequential Markov blanket discovery, but limited to suboptimal adjustment sets
  • LDECC: Efficient collider checking, but with reliability and completeness issues
  • LDP: Learning adjustment sets through partitioning, but still potentially suboptimal with restrictive assumptions

Advantages of This Work

LOAD is the first method to simultaneously achieve:

  1. Use only local information
  2. Recover optimal adjustment sets
  3. Provide theoretical guarantees (reliability and completeness)

Conclusions and Discussion

Main Conclusions

  1. LOAD successfully combines the computational efficiency of local methods with the statistical optimality of global methods
  2. The proposed local amenability test provides an efficient method for determining causal effect identifiability
  3. LOAD demonstrates superior performance across various data types and network structures

Limitations

  1. Causal Sufficiency Assumption: The current version assumes no latent confounders or selection bias
  2. Computational Bottleneck in Large-Scale Networks: Markov blanket search may still become a computational bottleneck on extremely large graphs
  3. Limited Performance on Binary Data: Performance is limited when using G² tests on binary data

Future Directions

  1. Extension to Causal Insufficiency Settings: Handling cases with latent confounders
  2. Optimize Markov Blanket Discovery: Further improve computational efficiency for large-scale networks
  3. Improve Finite-Sample Performance: Particularly improve performance on binary data

In-Depth Evaluation

Strengths

  1. Significant Theoretical Contributions: First to propose amenability testing based solely on local information, with important theoretical value
  2. Strong Practicality: Achieves statistical optimality while maintaining computational efficiency, solving a key problem in practical applications
  3. Comprehensive Experiments: Covers multiple data types, network scales, and evaluation metrics with convincing results
  4. Algorithm Completeness: Provides theoretical guarantees of reliability and completeness with rigorous algorithm design

Weaknesses

  1. Restrictive Assumptions: The causal sufficiency assumption may not hold in practical applications
  2. Scalability Issues: While better than global methods, still faces computational challenges on extremely large-scale networks
  3. Finite-Sample Performance: Performance is not sufficiently stable in certain finite-sample settings

Impact

  1. Academic Value: Provides new theoretical frameworks and algorithm design insights for the causal discovery field
  2. Practical Value: Important for practical applications requiring causal effect estimation
  3. Reproducibility: Provides detailed algorithm descriptions and experimental settings for easy reproduction and extension

Applicable Scenarios

  1. Medium-Scale Causal Inference: Causal effect estimation tasks with hundreds to thousands of variables
  2. Computationally Constrained Environments: Applications requiring balance between computational efficiency and statistical performance
  3. Causally Sufficient Environments: Observational studies without important latent confounders

References

The paper cites important literature in causal inference, including:

  • Pearl (2009): Causality - classic textbook on causal inference
  • Spirtes et al. (2000): Foundational work on constraint-based causal discovery
  • Henckel et al. (2022): Graphical criteria for optimal adjustment sets
  • Perković et al. (2015): Definition and properties of amenability

Overall Assessment: This is a high-quality causal inference paper with important contributions at both theoretical and practical levels. The LOAD algorithm elegantly resolves the trade-off between computational efficiency and statistical optimality in causal discovery, with significant academic value and promising application prospects.