2025-11-12T13:34:14.831387

Efficient & Correct Predictive Equivalence for Decision Trees

Marques-Silva, Ignatiev

The Rashomon set of decision trees (DTs) has important uses. Recent work showed that DTs computing the same classification function, i.e. predictively equivalent DTs, can represent a significant fraction of the Rashomon set. Such redundancy is undesirable. For example, feature importance based on the Rashomon set becomes inaccurate due to the existence of predictively equivalent DTs, i.e. DTs with the same prediction for every possible input. In recent work, McTavish et al. proposed solutions for several computational problems related to DTs, including that of deciding whether DTs are predictively equivalent. The approach of McTavish et al. consists of applying the well-known Quine-McCluskey (QM) method to obtain minimum-size DNF (disjunctive normal form) representations of DTs, which are then used for comparing DTs for predictive equivalence. Furthermore, the minimum-size DNF representation was also applied to computing explanations for the predictions made by DTs, and to finding predictions in the presence of missing data. However, the problem of formula minimization is hard for the second level of the polynomial hierarchy, and the QM method may exhibit worst-case exponential running time and space. This paper first demonstrates that there exist decision trees that trigger the worst-case exponential running time and space of the QM method. Second, the paper shows that the QM method may incorrectly decide predictive equivalence if two key constraints are not respected, one of which may be difficult to guarantee formally. Third, the paper shows that every problem to which the smallest DNF representation has been applied can be solved in time polynomial in the size of the DT. The experiments confirm that, for DTs that trigger the worst case of the QM method, the algorithms proposed in this paper are orders of magnitude faster than the ones proposed by McTavish et al.

Basic information

Paper ID: 2509.17774
Title: Efficient & Correct Predictive Equivalence for Decision Trees
Authors: João Marques-Silva (ICREA & University of Lleida), Alexey Ignatiev (Monash University)
Categories: cs.AI cs.LG cs.LO
Publication date / venue: Journal of Machine Learning Research 23 (2025) 1-35
Paper link: https://arxiv.org/abs/2509.17774

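Abstract

The Rashomon set of decision trees has important applications. Recent work shows that decision trees computing the same classification function (i.e. predictively equivalent decision trees) can account for a large fraction of the Rashomon set. This redundancy is undesirable; for example, feature importance based on the Rashomon set becomes inaccurate when predictively equivalent decision trees are present. McTavish et al. recently proposed solutions to several computational problems on decision trees, including deciding predictive equivalence. Their approach uses the well-known Quine-McCluskey (QM) method to obtain a minimum-size DNF representation of each decision tree, which is then used to compare trees for predictive equivalence. However, formula minimization is hard for the second level of the polynomial hierarchy, and the QM method may exhibit worst-case exponential running time and space. The paper first shows that there exist decision trees that trigger this worst-case exponential behavior of the QM method; second, that the QM method may decide predictive equivalence incorrectly if two key constraints are not respected; and finally, that every problem to which the minimum-size DNF representation has been applied can be solved in time polynomial in the size of the decision tree.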

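Rashomon set optimization: In machine learning, the Rashomon set contains multiple models with similar performance. Predictively equivalent decision trees create redundancy in this set, which degrades the accuracy of feature-importance estimates.
Interpretability requirements: Decision trees are widely regarded as interpretable models, but even optimal decision trees need formal explanations, especially in high-risk applications.
Computational efficiency: Existing methods face severe computational bottlenecks on large decision trees.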

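Limitations of existing methods

The method proposed by McTavish et al. is based on the Quine-McCluskey (QM) algorithm and has the following problems:

Computational complexity: the QM method tackles formula minimization, a Σ₂ᵖ-hard problem, and needs exponential time and space in the worst case
Correctness: it may produce wrong results when certain constraints are not satisfied
Practical feasibility: the QM method is known to scale poorly on problems with tens of variables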

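Core contributions

Theoretical analysis: proves that there exist decision trees that trigger the worst-case exponential complexity of the QM method
Correctness analysis: shows that the QM method can decide predictive equivalence incorrectly
Efficient algorithms: proposes polynomial-time algorithms for the completeness, succinctness and predictive-equivalence problems
Experimental validation: on decision trees that trigger the QM worst case, the new algorithms are orders of magnitude faster than the existing ones
Theoretical connections: relates predictive equivalence to logic-based explanations and importance measures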

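Method details

Task definition

Given two decision trees T₁ and T₂, decide whether they are predictively equivalent, i.e. whether:

∀(x ∈ F). (κ_T₁(x) = κ_T₂(x))

where F is the feature space and κ_T is the classification function computed by tree T.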
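
To make the definition concrete, the sketch below checks it literally by enumerating every point of a small feature space. It assumes binary features and represents each tree as a plain Python function; `equivalent_by_enumeration`, `tree_a`, `tree_b` and `tree_c` are hypothetical names introduced here for illustration. This is not the paper's algorithm: the enumeration is exponential in the number of features and only serves as a reference check on tiny examples.

```python
from itertools import product
from typing import Callable, Sequence

# A classifier maps a point of the (assumed binary) feature space to a class.
Classifier = Callable[[Sequence[int]], int]


def equivalent_by_enumeration(kappa1: Classifier, kappa2: Classifier,
                              num_features: int) -> bool:
    """Return True iff kappa1(x) == kappa2(x) for every x in {0,1}^num_features."""
    return all(
        kappa1(x) == kappa2(x)
        for x in product((0, 1), repeat=num_features)
    )


# Hypothetical tree that tests x[0] first; computes x[0] AND x[1].
def tree_a(x: Sequence[int]) -> int:
    if x[0] == 0:
        return 0
    return 1 if x[1] == 1 else 0


# Hypothetical tree that tests x[1] first; also computes x[0] AND x[1].
def tree_b(x: Sequence[int]) -> int:
    if x[1] == 0:
        return 0
    return 1 if x[0] == 1 else 0


# Hypothetical tree computing x[0] OR x[1].
def tree_c(x: Sequence[int]) -> int:
    if x[0] == 1:
        return 1
    return 1 if x[1] == 1 else 0
```

Here `tree_a` and `tree_b` differ in structure but agree on all four inputs, so they are predictively equivalent, while `tree_a` and `tree_c` disagree on the input (1, 0).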

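Core technical framework

1. Weak abductive explanation (WAXp) approach

The paper proposes polynomial-time algorithms based on WAXps:

Algorithm 1: Path consistency check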

def ConsistentPath(A, P, T):
    # Check whether partial assignment A is consistent with path P of tree T
    for each feature i:
        combine the literals on feature i from A and from P
        if the combined literals are inconsistent: return False
    return True
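
A minimal runnable sketch of this check follows, under assumptions that are not in the source: both the partial assignment and the path are dictionaries mapping a feature name to the set of values that feature may still take, and two literals on the same feature are consistent when those sets intersect. The name `consistent_path` and the `Literals` type are hypothetical, and the tree argument `T` is dropped because this representation does not need it.

```python
from typing import Dict, FrozenSet, Hashable

# Assumed representation: feature name -> set of values the feature may take.
Literals = Dict[str, FrozenSet[Hashable]]


def consistent_path(assignment: Literals, path: Literals) -> bool:
    """Return True iff some point satisfies both the assignment and the path."""
    for feature, allowed_on_path in path.items():
        allowed_by_assignment = assignment.get(feature)
        if allowed_by_assignment is None:
            # The assignment leaves this feature free, so no conflict arises.
            continue
        if not (allowed_by_assignment & allowed_on_path):
            # No common value remains for this feature.
            return False
    return True
```

With this representation the check costs time linear in the number of features tested on the path, ignoring the cost of the set intersections.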

Algorithm 2: WAXp check

def IsWAXp(A, c, T):
    # Decide whether partial assignment A is a WAXp for class c
    for each path P in T:
        if Class(P) != c and ConsistentPath(A, P, T):
            return False  # A is consistent with a path that predicts another class
    return True
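
The WAXp check can be sketched in runnable form as below. The representation is an assumption made for illustration: a tree is a list of `(literals, class)` pairs, one per root-to-leaf path, and literals map a feature name to the set of values it may take. The names `is_waxp`, `_consistent` and `AND_TREE` are hypothetical.

```python
from typing import Dict, FrozenSet, Hashable, List, Tuple

Literals = Dict[str, FrozenSet[Hashable]]
Tree = List[Tuple[Literals, Hashable]]  # one (path literals, class) pair per leaf


def _consistent(assignment: Literals, path: Literals) -> bool:
    """True iff no feature constrained by both sides has an empty intersection."""
    return all(
        feature not in assignment or (assignment[feature] & allowed)
        for feature, allowed in path.items()
    )


def is_waxp(assignment: Literals, target_class: Hashable, tree: Tree) -> bool:
    """True iff every path consistent with the assignment predicts target_class."""
    for path_literals, path_class in tree:
        if path_class != target_class and _consistent(assignment, path_literals):
            return False  # a path predicting another class is still reachable
    return True


ZERO, ONE = frozenset({0}), frozenset({1})

# Hypothetical tree computing x1 AND x2, testing x1 first.
AND_TREE: Tree = [
    ({"x1": ZERO}, 0),
    ({"x1": ONE, "x2": ZERO}, 0),
    ({"x1": ONE, "x2": ONE}, 1),
]
```

On `AND_TREE`, fixing `x1 = 0` alone suffices to guarantee class 0, whereas fixing `x2 = 1` alone does not guarantee class 1 because the path `x1 = 0` remains reachable.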

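2. Predictive equivalence algorithm

Algorithm 5: Predictive equivalence check

def PredictivelyEquivalent(T1, T2):
    for P1 in Paths(T1):
        c1 = Class(P1)
        A1 = Literals(P1)  # build a partial assignment from the path
        for P2 in Paths(T2):
            c2 = Class(P2)
            if c1 != c2 and ConsistentPath(A1, P2, T2):
                return False  # found a witness of non-equivalence
    return True  # no witness of non-equivalence exists, so the trees are equivalent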
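
The pairwise path comparison can be tried out with the following sketch. It rests on assumptions that do not come from the source: each tree is a list of `(literals, class)` pairs with literals mapping a feature name to its set of admissible values, and every path is itself satisfiable (no path constrains a feature to the empty set). The names `predictively_equivalent`, `_consistent` and the three example trees are hypothetical.

```python
from typing import Dict, FrozenSet, Hashable, List, Tuple

Literals = Dict[str, FrozenSet[Hashable]]
Tree = List[Tuple[Literals, Hashable]]  # one (path literals, class) pair per leaf


def _consistent(assignment: Literals, path: Literals) -> bool:
    """True iff no feature constrained by both sides has an empty intersection."""
    return all(
        feature not in assignment or (assignment[feature] & allowed)
        for feature, allowed in path.items()
    )


def predictively_equivalent(tree1: Tree, tree2: Tree) -> bool:
    """True iff no pair of mutually consistent paths predicts different classes."""
    for literals1, class1 in tree1:
        for literals2, class2 in tree2:
            if class1 != class2 and _consistent(literals1, literals2):
                return False  # some input follows both paths but gets two classes
    return True


ZERO, ONE = frozenset({0}), frozenset({1})

# x1 AND x2, testing x1 first.
AND_X1_FIRST: Tree = [
    ({"x1": ZERO}, 0),
    ({"x1": ONE, "x2": ZERO}, 0),
    ({"x1": ONE, "x2": ONE}, 1),
]
# x1 AND x2, testing x2 first: different structure, same function.
AND_X2_FIRST: Tree = [
    ({"x2": ZERO}, 0),
    ({"x2": ONE, "x1": ZERO}, 0),
    ({"x2": ONE, "x1": ONE}, 1),
]
# x1 OR x2.
OR_TREE: Tree = [
    ({"x1": ONE}, 1),
    ({"x1": ZERO, "x2": ZERO}, 0),
    ({"x1": ZERO, "x2": ONE}, 1),
]
```

The two nested loops visit each pair of paths once, so the running time of this sketch is polynomial in the sizes of the two trees, with no DNF minimization step involved.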