
RFOD: Random Forest-based Outlier Detection for Tabular Data

Ang, Yao, Bao et al.

Outlier detection in tabular data is crucial for safeguarding data integrity in high-stakes domains such as cybersecurity, financial fraud detection, and healthcare, where anomalies can cause serious operational and economic impacts. Despite advances in both data mining and deep learning, many existing methods struggle with mixed-type tabular data, often relying on encoding schemes that lose important semantic information. Moreover, they frequently lack interpretability, offering little insight into which specific values cause anomalies. To overcome these challenges, we introduce RFOD, a novel Random Forest-based Outlier Detection framework tailored for tabular data. Rather than modeling a global joint distribution, RFOD reframes anomaly detection as a feature-wise conditional reconstruction problem, training dedicated random forests for each feature conditioned on the others. This design robustly handles heterogeneous data types while preserving the semantic integrity of categorical features. To further enable precise and interpretable detection, RFOD combines Adjusted Gower's Distance (AGD) for cell-level scoring, which adapts to skewed numerical data and accounts for categorical confidence, with Uncertainty-Weighted Averaging (UWA) to aggregate cell-level scores into robust row-level anomaly scores. Extensive experiments on 15 real-world datasets demonstrate that RFOD consistently outperforms state-of-the-art baselines in detection accuracy while offering superior robustness, scalability, and interpretability for mixed-type tabular data.


Basic Information

Paper ID: 2510.08747
Title: RFOD: Random Forest-based Outlier Detection for Tabular Data
Authors: Yihao Ang, Peicheng Yao, Yifan Bao, Yushuo Feng, Qiang Huang, Anthony K. H. Tung, Zhiyong Huang
Categories: cs.LG (Machine Learning), cs.DB (Database)
Publication Date: October 9, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.08747

Abstract

Outlier detection in tabular data is critical for ensuring data integrity in high-risk domains such as cybersecurity, financial fraud detection, and healthcare. Despite advances in data mining and deep learning techniques, existing methods face challenges in handling mixed-type tabular data, often relying on encoding schemes that lose important semantic information and lacking interpretability. To address these issues, this paper proposes RFOD, a Random Forest-based outlier detection framework specifically designed for tabular data. RFOD redefines outlier detection as a feature-level conditional reconstruction problem, training dedicated random forests for each feature to achieve robust handling of heterogeneous data types. The method combines Adjusted Gower Distance (AGD) for cell-level scoring and Uncertainty-Weighted Averaging (UWA) for row-level anomaly score aggregation. Extensive experiments on 15 real-world datasets demonstrate that RFOD consistently outperforms state-of-the-art baseline methods in detection accuracy while providing superior robustness, scalability, and interpretability.

Research Background and Motivation

Problem Definition

Outlier detection aims to identify instances in data that significantly deviate from the dominant distribution, which is critical in high-risk domains such as cybersecurity, financial fraud detection, and healthcare. Undetected anomalies can lead to distorted analysis, obscure critical insights, and compromise operations.

Limitations of Existing Methods

Traditional Data Mining Methods:
- Methods such as LOF, Isolation Forest, and OCSVM typically rely on global proximity or statistical heuristics
- Often process features independently, failing to capture contextual anomalies in multivariate relationships
- Limited native support for mixed-type data

Deep Learning Methods:
- Methods such as Deep SVDD, DevNet, and ICL primarily assume purely numerical inputs
- Rely on preprocessing (e.g., one-hot encoding) that may lose semantic details
- Black-box nature hinders interpretability

Research Motivation

Existing methods show inconsistent performance on mixed-type tabular data and lack a unified solution that provides both high detection accuracy and interpretability. This work aims to develop an outlier detection framework that can:

Natively handle mixed-type data
Provide fine-grained interpretability
Maintain high detection accuracy and computational efficiency

Core Contributions

Feature-Level Conditional Reconstruction Paradigm: Proposes a novel paradigm that redefines outlier detection as a feature-level conditional reconstruction problem, avoiding the limitations of global joint distribution modeling
RFOD Framework: Designs a Random Forest-based outlier detection framework with four core modules:
- Feature-specific random forests
- Forest pruning mechanism
- Adjusted Gower Distance (AGD)
- Uncertainty-Weighted Averaging (UWA)
AGD Distance Metric: Proposes an improved distance metric that adapts to skewed numerical distributions and categorical feature confidence
Superior Experimental Performance: Achieves best average performance on 15 real-world datasets, with AUC-ROC improvements up to 9.1% over the best competing methods and average test time latency reduction of 91.2%

Methodology Details

Task Definition

Given a training set $\mathbf{X}_{train} \in \mathbb{R}^{n \times d}$ and a test set $\mathbf{X}_{test} \in \mathbb{R}^{m \times d}$, the objective is to compute:

Cell-level anomaly score matrix: $\mathbf{S}_{cell} = [s_{i,j}] \in \mathbb{R}^{m \times d}$
Row-level anomaly score vector: $\mathbf{s}_{row} = [s_{row,1}, \ldots, s_{row,m}] \in \mathbb{R}^m$

Model Architecture

1. Feature-Specific Random Forests

Employs a leave-one-feature-out decomposition strategy, training a dedicated random forest $\mathbf{RF}_j$ for each feature $\mathbf{x}_j$: $\mathbf{RF}_j: \mathbf{X}^j_{train} \in \mathbb{R}^{n \times (d-1)} \rightarrow \mathbf{y}^j_{train} \in \mathbb{R}^n$

where $\mathbf{X}^j_{train} = \mathbf{X}_{train} \setminus \{\mathbf{x}_j\}$ is the training set with feature $j$ removed and $\mathbf{y}^j_{train} = \mathbf{x}_j$ is the held-out feature used as the prediction target.
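The decomposition above can be sketched in a few lines of Python. This is only an illustration of how the $d$ prediction tasks are formed, not the authors' implementation: the function name `leave_one_feature_out` and the toy table are hypothetical, and the step of fitting a forest on each pair is left out.

```python
from typing import List, Sequence, Tuple

Row = Sequence[object]  # one record; cells may be numerical or categorical


def leave_one_feature_out(rows: Sequence[Row], j: int) -> Tuple[List[list], List[object]]:
    """Split a table into (X^j, y^j): every column except j, and column j itself."""
    inputs = [[v for k, v in enumerate(row) if k != j] for row in rows]
    target = [row[j] for row in rows]
    return inputs, target


# Toy mixed-type table (hypothetical data): duration, protocol, bytes sent.
table = [
    (34, "tcp", 120.0),
    (51, "udp", 80.5),
    (29, "tcp", 300.2),
]

# One (inputs, target) pair per feature; a forest RF_j would be fitted on pair j.
tasks = [leave_one_feature_out(table, j) for j in range(len(table[0]))]
```

Each of the $d$ tasks is independent of the others, which is what makes per-feature training straightforward to parallelize.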

2. Forest Pruning

Retains the best-performing trees according to out-of-bag (OOB) validation: $\text{Prune}(\mathbf{RF}) = \{T_{U(i)} \mid 1 \leq i \leq \lfloor\beta \cdot t\rfloor\}$

where $t$ is the number of trees in the forest, $\beta \in (0,1]$ is the retention ratio, and $U$ orders the tree indices by OOB score in descending order, so that $T_{U(1)}$ is the tree with the highest OOB score.
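A minimal sketch of the selection rule, assuming the per-tree OOB scores have already been computed and that higher means better. The function name `prune_by_oob` and the example scores are hypothetical.

```python
import math
from typing import List, Sequence


def prune_by_oob(oob_scores: Sequence[float], beta: float) -> List[int]:
    """Return the indices of the floor(beta * t) trees with the highest OOB scores."""
    if not 0.0 < beta <= 1.0:
        raise ValueError("beta must lie in (0, 1]")
    t = len(oob_scores)
    keep = math.floor(beta * t)
    # U: tree indices sorted by OOB score, best first (ties keep their original order).
    order = sorted(range(t), key=lambda i: oob_scores[i], reverse=True)
    return order[:keep]


# Hypothetical OOB scores for a forest of t = 5 trees.
scores = [0.71, 0.90, 0.55, 0.84, 0.62]
kept = prune_by_oob(scores, beta=0.5)  # floor(0.5 * 5) = 2 trees retained
```

Note that a small $\beta$ combined with a small forest can round down to zero retained trees, so a practical implementation would likely enforce a minimum of one.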

3. Adjusted Gower Distance (AGD)

For Numerical Features: $AGD^{(num)}(x_{i,j}, \hat{x}_{i,j}) = \frac{|x_{i,j} - \hat{x}_{i,j}|}{Q_{1-\alpha}(\mathbf{x}_j) - Q_\alpha(\mathbf{x}_j)}$

For Categorical Features: $AGD^{(cat)}(x_{i,j}, \hat{x}_{i,j}) = 1 - p_{x_{i,j}}$

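where $Q_\alpha(\mathbf{x}_j)$ denotes the $\alpha$-quantile of feature $\mathbf{x}_j$, and $p_{x_{i,j}}$ is the probability that the forest assigns to the observed category $x_{i,j}$.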
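The two AGD formulas can be sketched as follows. This is an illustrative reading of the formulas, not the authors' code: the helper `quantile` uses linear interpolation between order statistics, which is one of several common quantile conventions, and scoring an unseen category with probability 0 is an assumption. All function names and sample values are hypothetical.

```python
from typing import Mapping, Sequence


def quantile(values: Sequence[float], q: float) -> float:
    """Empirical q-quantile with linear interpolation (assumed convention)."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])


def agd_num(x: float, x_hat: float, column: Sequence[float], alpha: float) -> float:
    """|x - x_hat| divided by the inter-quantile range Q_{1-alpha} - Q_alpha."""
    spread = quantile(column, 1.0 - alpha) - quantile(column, alpha)
    if spread == 0.0:
        raise ValueError("inter-quantile range is zero; the score is undefined")
    return abs(x - x_hat) / spread


def agd_cat(x: str, class_probs: Mapping[str, float]) -> float:
    """1 - p_x, with p_x the predicted probability of the observed category."""
    # Assumption: a category the forest never predicts gets probability 0.
    return 1.0 - class_probs.get(x, 0.0)


column = [float(v) for v in range(101)]  # hypothetical feature values 0..100
num_score = agd_num(x=150.0, x_hat=50.0, column=column, alpha=0.01)
cat_score = agd_cat("udp", {"tcp": 0.9, "udp": 0.1})
```

Dividing by an inter-quantile range rather than the full min-max range, as in the classical Gower distance, keeps a handful of extreme values from compressing every other score toward zero.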

4. Uncertainty-Weighted Averaging (UWA)

Computes an uncertainty matrix $\mathbf{U} = [u_{i,j}]$, where $u_{i,j}$ is the standard deviation of the tree predictions for cell $(i,j)$.

Confidence weights: $\mathbf{W} = \mathbf{1}_{m \times d} - \tilde{\mathbf{U}}$, where $\tilde{\mathbf{U}}$ denotes $\mathbf{U}$ after normalization.

Final row-level score: $s_{row,i} = \frac{1}{d} \sum_{j=1}^d w_{i,j} \cdot s_{i,j}$
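The aggregation can be sketched as below. The normalization scheme is an assumption: this sketch rescales $\mathbf{U}$ with a global min-max transform so that the weights fall in $[0,1]$, which may differ from the paper's choice (for example, per-feature scaling). Function names and the example matrices are hypothetical.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def min_max_normalize(u: Matrix) -> List[List[float]]:
    """Rescale every entry of U to [0, 1] (assumed normalization scheme)."""
    flat = [v for row in u for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # No variation in uncertainty: every cell gets full confidence.
        return [[0.0 for _ in row] for row in u]
    return [[(v - lo) / (hi - lo) for v in row] for row in u]


def uwa_row_scores(cell_scores: Matrix, uncertainty: Matrix) -> List[float]:
    """s_row,i = (1/d) * sum_j (1 - u~_ij) * s_ij."""
    u_tilde = min_max_normalize(uncertainty)
    row_scores = []
    for s_row, u_row in zip(cell_scores, u_tilde):
        d = len(s_row)
        weighted = sum((1.0 - u) * s for s, u in zip(s_row, u_row))
        row_scores.append(weighted / d)
    return row_scores


# Hypothetical cell scores and uncertainties for m = 2 rows, d = 2 features.
S = [[0.2, 0.8], [0.1, 0.1]]
U = [[0.0, 1.0], [0.5, 0.0]]
rows = uwa_row_scores(S, U)  # row 0: (1*0.2 + 0*0.8)/2, row 1: (0.5*0.1 + 1*0.1)/2
```

In the first row the large cell score 0.8 contributes nothing because the forest is maximally uncertain about that cell, which illustrates how UWA discounts deviations the model cannot predict reliably.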

Technical Innovations

Conditional Reconstruction vs. Global Modeling: Avoids the curse of dimensionality in global joint distribution modeling in high-dimensional spaces
Native Mixed-Type Data Support: Handles mixed numerical and categorical features without complex encoding
Adaptive Distance Metric: AGD adapts to skewed distributions through quantile normalization and handles categorical uncertainty through confidence-aware matching
Uncertainty-Aware Aggregation: UWA leverages prediction variance from the ensemble structure to dynamically adjust feature weights

Experimental Setup

Datasets

Uses 15 public tabular datasets covering cybersecurity, finance, and healthcare domains; five of them are listed below:

Domain	Dataset	Samples	Features	Anomaly Ratio
Cybersecurity	Backdoor	95,329	42	2.44%
Cybersecurity	DoS	109,353	42	14.95%
Cybersecurity	KDD	4,898,430	41	19.86%
Finance	Bank	45,211	16	11.70%
Healthcare	Arrhythmia	452	279	45.80%

Evaluation Metrics

AUC-ROC: Measures ranking quality of anomaly scores
AUC-PR: Emphasizes precision and recall, particularly suitable for class imbalance
F1-Score and Accuracy: Threshold-based classification performance metrics
Log-Loss: Evaluates calibration of anomaly probabilities
Training Time and Test Time: Assess efficiency and scalability
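To make "ranking quality" concrete, AUC-ROC equals the probability that a randomly chosen anomaly receives a higher score than a randomly chosen normal row, with ties counted as one half. The sketch below computes it by direct pairwise comparison. It is quadratic in the number of rows and meant only as an illustration; the function name and the example data are hypothetical.

```python
from typing import Sequence


def auc_roc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Fraction of (anomaly, normal) pairs ranked correctly; ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]  # anomalies
    neg = [s for s, y in zip(scores, labels) if y == 0]  # normal rows
    if not pos or not neg:
        raise ValueError("need at least one anomaly and one normal row")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical row-level scores and ground-truth labels (1 = anomaly).
example_auc = auc_roc([0.9, 0.2, 0.6, 0.4], [1, 0, 0, 1])  # 3 of 4 pairs correct
```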

Baseline Methods

Data Mining Baselines: ECOD, LOF, IF, OCSVM, OT
Deep Learning Baselines: Deep SVDD, SLAD, DevNet, DIF, ICL

Implementation Details

Deep model training epochs: 50
Environment: Intel Xeon Platinum 8480C @3.80GHz, 256GB RAM, NVIDIA H200 GPU
RFOD parameters: $\alpha \in [0.01, 0.02]$ (AGD sensitivity), $\beta$ adaptively selected via OOB validation

Experimental Results

Main Results

RFOD demonstrates excellent performance across all evaluation metrics:

Average Ranking: Ranks in top 2 across all 5 metrics, with 1st place in AUC-ROC and F1
Performance Improvement: Average AUC-PR improvement of 46.7% over data mining methods and average AUC-ROC improvement of 24.8% over deep learning methods
Consistency: Outperforms each baseline method on 80-100% of datasets

Ablation Studies

Validates the importance of each module:

Forest Pruning: Significantly improves performance on Bank and Ethereum datasets, reducing overfitting
AGD: Most critical component; removing it reduces AUC-ROC from 0.96 to 0.41 on DoS dataset
UWA: Provides stable performance improvements on large-scale datasets such as Backdoor and DoS

Case Analysis

Using the Pima healthcare dataset as an example:

Cell-Level Interpretability: Heatmaps show RFOD can precisely locate anomalous feature combinations
Row-Level Interpretability: RFOD's predicted values fall in the high-density regions of the distribution of normal data, while the actual values of anomalies lie in the distribution tails
Comparative Analysis: OCSVM and DIF produce uniform high activations, making it difficult to isolate true anomaly sources

Efficiency Analysis

Training Time: Orders of magnitude faster than deep learning methods, supporting parallelization
Test Time: Average reduction of 91.2% in test latency
Scalability: Linear scaling demonstrated on KDD dataset from 1% to 100% data scale

Related Work

Data Mining Methods

Traditional methods such as LOF, IF, and OCSVM primarily rely on statistical or proximity-based criteria, but typically assume feature independence and struggle to capture multivariate interactions.

Deep Learning Methods

Methods such as Deep SVDD, DevNet, and ICL can learn complex representations but are primarily designed for numerical inputs, requiring preprocessing for mixed-type data and lacking interpretability.

Advantages of This Work

RFOD combines the interpretability of tree methods with the robustness of ensemble learning, avoiding the limitations of global modeling through feature-level conditional modeling while providing native support for mixed-type data.

Conclusions and Discussion

Main Conclusions

RFOD successfully addresses outlier detection in mixed-type tabular data through feature-level conditional reconstruction
The design of AGD and UWA significantly improves detection accuracy and robustness
The method maintains high accuracy while providing superior interpretability and computational efficiency

Limitations

Parameter Sensitivity: While the $\alpha$ parameter is relatively stable, some tuning is still required
Memory Overhead: Training independent forests for each feature may create memory pressure on extremely high-dimensional data
Categorical Feature Handling: Processing high-cardinality categorical features may require further optimization

Future Directions

Explore more efficient feature selection and dimensionality reduction techniques
Investigate applications in streaming data and online learning scenarios
Extend to time series and graph-structured data

In-Depth Evaluation

Strengths

Methodological Innovation: The feature-level conditional reconstruction paradigm is a novel and effective approach
Experimental Comprehensiveness: Comprehensive comparison across 15 datasets and 10 baseline methods
Interpretability: Provides dual-level interpretability at both cell and row levels
Practical Value: Achieves good balance between efficiency and accuracy

Weaknesses

Theoretical Analysis: Lacks in-depth theoretical analysis of convergence and complexity
Extreme Scenarios: Performance on extremely high-dimensional or highly imbalanced data requires further verification
Parameter Guidance: Lacks more systematic principles for parameter selection

Impact

Academic Contribution: Provides a new research direction for outlier detection in tabular data
Practical Value: Has direct application potential in critical domains such as finance and healthcare
Reproducibility: Clear algorithm description facilitates implementation and reproduction

Applicable Scenarios

Outlier detection in mixed-type tabular data
High-risk decision scenarios requiring interpretability
Real-time anomaly monitoring on medium-scale data
Feature importance analysis and root cause analysis

References

The paper cites important works in the outlier detection field, including:

Classical Methods: LOF (Breunig et al., 2000), Isolation Forest (Liu et al., 2008)
Deep Learning Methods: Deep SVDD (Ruff et al., 2018), DevNet (Pang et al., 2019)
Distance Metrics: Gower's Distance (Gower, 1971)
Evaluation Benchmarks: ADBench (Han et al., 2022)

Overall Assessment: This is a high-quality research paper on outlier detection that proposes an innovative methodological framework with comprehensive experimental validation and strong potential for practical application. The method's advantages in interpretability and efficiency make it competitive for real-world deployment.