RFOD: Random Forest-based Outlier Detection for Tabular Data

Ang, Yao, Bao et al.

Outlier detection in tabular data is crucial for safeguarding data integrity in high-stakes domains such as cybersecurity, financial fraud detection, and healthcare, where anomalies can cause serious operational and economic impacts. Despite advances in both data mining and deep learning, many existing methods struggle with mixed-type tabular data, often relying on encoding schemes that lose important semantic information. Moreover, they frequently lack interpretability, offering little insight into which specific values cause anomalies. To overcome these challenges, we introduce RFOD, a novel Random Forest-based Outlier Detection framework tailored for tabular data. Rather than modeling a global joint distribution, RFOD reframes anomaly detection as a feature-wise conditional reconstruction problem, training dedicated random forests for each feature conditioned on the others. This design robustly handles heterogeneous data types while preserving the semantic integrity of categorical features. To further enable precise and interpretable detection, RFOD combines Adjusted Gower's Distance (AGD) for cell-level scoring, which adapts to skewed numerical data and accounts for categorical confidence, with Uncertainty-Weighted Averaging (UWA) to aggregate cell-level scores into robust row-level anomaly scores. Extensive experiments on 15 real-world datasets demonstrate that RFOD consistently outperforms state-of-the-art baselines in detection accuracy while offering superior robustness, scalability, and interpretability for mixed-type tabular data.

Basic Information

Paper ID: 2510.08747
Title: RFOD: Random Forest-based Outlier Detection for Tabular Data
Authors: Yihao Ang, Peicheng Yao, Yifan Bao, Yushuo Feng, Qiang Huang, Anthony K. H. Tung, Zhiyong Huang
Categories: cs.LG (Machine Learning), cs.DB (Database)
Published: October 9, 2025 (arXiv preprint)
Paper link: https://arxiv.org/abs/2510.08747

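Abstract

Outlier detection in tabular data is essential for safeguarding data integrity in high-stakes domains such as cybersecurity, financial fraud detection, and healthcare. Despite continued advances in data mining and deep learning, existing methods still struggle with mixed-type tabular data: they often rely on encoding schemes that lose important semantic information, and they lack interpretability. To address these problems, the paper proposes RFOD, a random forest-based outlier detection framework designed specifically for tabular data. RFOD recasts outlier detection as a feature-wise conditional reconstruction problem and trains a dedicated random forest for each feature, which lets it handle heterogeneous data types robustly. The method combines Adjusted Gower's Distance (AGD) for cell-level scoring with Uncertainty-Weighted Averaging (UWA) for aggregating cell-level scores into row-level anomaly scores. Extensive experiments on 15 real-world datasets show that RFOD consistently outperforms state-of-the-art baselines in detection accuracy while also offering strong robustness, scalability, and interpretability.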

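Background and Motivation

Problem Definition

Outlier detection aims to identify instances that deviate significantly from the dominant distribution of the data, which is critical in high-stakes domains such as cybersecurity, financial fraud detection, and healthcare. Undetected anomalies can distort analyses, hide key insights, and disrupt operations.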

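Limitations of Existing Methods

Traditional data mining methods:
- Methods such as LOF, Isolation Forest, and OCSVM typically rely on global proximity or statistical heuristics
- They often treat features independently and cannot capture contextual anomalies that arise from multivariate relationships
- They offer little native support for mixed-type data

Deep learning methods:
- Methods such as Deep SVDD, DevNet, and ICL mostly assume purely numerical input
- They depend on preprocessing (for example, one-hot encoding) that can lose semantic detail
- Their black-box nature hinders interpretability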

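Research Motivation

Existing methods perform inconsistently on mixed-type tabular data, and no unified solution delivers both high detection accuracy and interpretability. The paper aims to develop an outlier detection framework that:

- handles mixed-type data natively
- provides fine-grained interpretability
- maintains high detection accuracy and computational efficiency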

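Core Contributions

Feature-wise conditional reconstruction paradigm: recasts outlier detection as a feature-wise conditional reconstruction problem, avoiding the limitations of modeling a global joint distribution
RFOD framework: a random forest-based outlier detection framework with four core modules:
- Feature-specific random forests
- Forest pruning mechanism
- Adjusted Gower's Distance (AGD)
- Uncertainty-Weighted Averaging (UWA)
AGD distance measure: an improved distance measure that adapts to skewed numerical distributions and accounts for the prediction confidence of categorical features
Strong experimental performance: the best average performance across 15 real-world datasets, with AUC-ROC gains of up to 9.1% over the best competing method and an average 91.2% reduction in test-time latency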

Method Details

Task Definition

Given a training set $\mathbf{X}_{train} \in \mathbb{R}^{n \times d}$ and a test set $\mathbf{X}_{test} \in \mathbb{R}^{m \times d}$, the goal is to compute:

Cell-level anomaly score matrix: $\mathbf{S}_{cell} = [s_{i,j}] \in \mathbb{R}^{m \times d}$
Row-level anomaly score vector: $\mathbf{s}_{row} = [s_{row,1}, \ldots, s_{row,m}] \in \mathbb{R}^m$

Model Architecture

1. Feature-Specific Random Forests

Using a leave-one-feature-out decomposition, a dedicated random forest $\mathbf{RF}_j$ is trained for each feature $\mathbf{x}_j$:

$\mathbf{RF}_j: \mathbf{X}^j_{train} \in \mathbb{R}^{n \times (d-1)} \rightarrow \mathbf{y}^j_{train} \in \mathbb{R}^n$

where $\mathbf{X}^j_{train} = \mathbf{X}_{train} \setminus \{\mathbf{x}_j\}$ and $\mathbf{y}^j_{train} = \mathbf{x}_j$.
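The leave-one-feature-out decomposition can be illustrated with a minimal sketch that only builds the $d$ supervised tasks $(\mathbf{X}^j_{train}, \mathbf{y}^j_{train})$ and leaves model fitting out. The function name `leave_one_feature_out` and the toy table are hypothetical, and the choice of a regressor for numeric targets versus a classifier for categorical targets is an assumption inferred from the scoring rules, not something this summary spells out.

```python
from typing import List, Tuple

Row = List[object]


def leave_one_feature_out(X: List[Row], j: int) -> Tuple[List[Row], List[object]]:
    """Split table X into (X^j, y^j): every column except j, and column j itself."""
    inputs = [row[:j] + row[j + 1:] for row in X]
    target = [row[j] for row in X]
    return inputs, target


# Hypothetical mixed-type table: duration (numeric), protocol (categorical), bytes (numeric).
X_train = [
    [34, "tcp", 1200.0],
    [51, "udp", 300.0],
    [29, "tcp", 950.0],
]

# One (inputs, target) pair per feature. Under the stated assumption, a forest
# RF_j would be fit on pair j: a regressor if column j is numeric, a classifier
# if it is categorical.
tasks = [leave_one_feature_out(X_train, j) for j in range(len(X_train[0]))]
```

Each task has $d-1$ input columns and one target column, so categorical values can stay in their original form when they act as the target instead of being encoded away.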

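2. Forest Pruning

Using out-of-bag (OOB) validation, only the best-performing trees are retained:

$\text{Prune}(\mathbf{RF}) = \{T_{U(i)} | 1 \leq i \leq \lfloor\beta \cdot t\rfloor\}$

where $\beta \in (0,1]$ is the retention ratio, $t$ is the number of trees in the forest, and $U$ is the tree index ordering sorted by OOB score in descending order.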
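The pruning rule amounts to ranking trees by OOB score and keeping the top $\lfloor\beta \cdot t\rfloor$. The sketch below assumes one OOB score per tree is already available (higher is better); how each tree's OOB score is computed is not described here, and the name `prune_forest` is hypothetical.

```python
from typing import List, Sequence


def prune_forest(oob_scores: Sequence[float], beta: float) -> List[int]:
    """Return indices of the floor(beta * t) trees with the highest OOB scores."""
    if not 0.0 < beta <= 1.0:
        raise ValueError("beta must lie in (0, 1]")
    t = len(oob_scores)
    keep = int(beta * t)  # floor for non-negative values
    # U: tree indices sorted by OOB score, best first.
    order = sorted(range(t), key=lambda i: oob_scores[i], reverse=True)
    return order[:keep]


# Hypothetical per-tree OOB scores for a forest of t = 4 trees.
kept = prune_forest([0.71, 0.90, 0.65, 0.88], beta=0.5)
```

With $\beta = 0.5$ and four trees, the two highest-scoring trees (indices 1 and 3) survive; $\beta = 1$ keeps the whole forest.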

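3. Adjusted Gower's Distance (AGD)

Numerical features: $AGD^{(num)}(x_{i,j}, \hat{x}_{i,j}) = \frac{|x_{i,j} - \hat{x}_{i,j}|}{Q_{1-\alpha}(\mathbf{x}_j) - Q_\alpha(\mathbf{x}_j)}$

Categorical features: $AGD^{(cat)}(x_{i,j}, \hat{x}_{i,j}) = 1 - p_{x_{i,j}}$

where $Q_\alpha(\mathbf{x}_j)$ is the $\alpha$-quantile of feature $\mathbf{x}_j$ and $p_{x_{i,j}}$ is the predicted probability of the true category.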
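Both AGD cases can be written down directly from the formulas. In this sketch the quantile uses linear interpolation between order statistics, which is an assumption (the interpolation rule is not specified here), and the helper names `quantile`, `agd_numeric`, and `agd_categorical` are hypothetical.

```python
from typing import Dict, Sequence


def quantile(values: Sequence[float], q: float) -> float:
    """Quantile of a non-empty sequence via linear interpolation (assumed rule)."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def agd_numeric(x: float, x_hat: float, column: Sequence[float], alpha: float) -> float:
    """|x - x_hat| scaled by the inter-quantile range Q_{1-alpha} - Q_alpha of the column."""
    spread = quantile(column, 1.0 - alpha) - quantile(column, alpha)
    if spread == 0:
        raise ValueError("inter-quantile range is zero; column is (nearly) constant")
    return abs(x - x_hat) / spread


def agd_categorical(x: str, class_probs: Dict[str, float]) -> float:
    """1 minus the predicted probability of the observed category."""
    return 1.0 - class_probs.get(x, 0.0)


# A skewed column: the single extreme value 1000 does not inflate the denominator.
column = [0.0, 10.0, 20.0, 30.0, 1000.0]
num_score = agd_numeric(x=12.0, x_hat=22.0, column=column, alpha=0.25)
cat_score = agd_categorical("udp", {"tcp": 0.8, "udp": 0.2})
```

Dividing by the full range (1000) as in the classical Gower distance would shrink the same reconstruction error to 0.01, whereas the inter-quantile range (20) yields 0.5, which is how the adjustment copes with skewed columns.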

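4. Uncertainty-Weighted Averaging (UWA)

An uncertainty matrix $\mathbf{U} = [u_{i,j}]$ is computed, where $u_{i,j}$ is the standard deviation of the individual tree predictions.

Confidence weights: $\mathbf{W} = \mathbf{1}_{m \times d} - \tilde{\mathbf{U}}$

Final row-level score: $s_{row,i} = \frac{1}{d} \sum_{j=1}^d w_{i,j} \cdot s_{i,j}$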
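The aggregation step can be sketched as follows. The text does not define $\tilde{\mathbf{U}}$, so this example assumes it is $\mathbf{U}$ min-max scaled to $[0, 1]$ within each feature column; that normalization, along with the names `normalize_columns` and `uwa_row_scores`, is an assumption made purely for illustration.

```python
from typing import List

Matrix = List[List[float]]


def normalize_columns(U: Matrix) -> Matrix:
    """Min-max scale each column of U to [0, 1] (ASSUMED form of U-tilde)."""
    scaled_cols = []
    for col in zip(*U):
        lo, hi = min(col), max(col)
        span = hi - lo
        scaled_cols.append([(u - lo) / span if span > 0 else 0.0 for u in col])
    return [list(row) for row in zip(*scaled_cols)]


def uwa_row_scores(S: Matrix, U: Matrix) -> List[float]:
    """Row score: (1/d) * sum_j (1 - u_tilde[i][j]) * s[i][j]."""
    U_tilde = normalize_columns(U)
    scores = []
    for s_row, u_row in zip(S, U_tilde):
        d = len(s_row)
        scores.append(sum((1.0 - u) * s for s, u in zip(s_row, u_row)) / d)
    return scores


# Hypothetical cell scores S and tree-prediction standard deviations U (m = 2, d = 2).
S = [[0.2, 0.8], [0.6, 0.4]]
U = [[0.0, 4.0], [2.0, 0.0]]
row_scores = uwa_row_scores(S, U)
```

In the toy input, the large cell score 0.8 in the first row coincides with the highest uncertainty in its column, so its weight drops to zero and the row score is driven only by the confidently predicted cell.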