Outlier detection in tabular data is crucial for safeguarding data integrity in high-stakes domains such as cybersecurity, financial fraud detection, and healthcare, where anomalies can cause serious operational and economic impacts. Despite advances in both data mining and deep learning, many existing methods struggle with mixed-type tabular data, often relying on encoding schemes that lose important semantic information. Moreover, they frequently lack interpretability, offering little insight into which specific values cause anomalies. To overcome these challenges, we introduce RFOD, a novel Random Forest-based Outlier Detection framework tailored for tabular data. Rather than modeling a global joint distribution, RFOD reframes anomaly detection as a feature-wise conditional reconstruction problem, training dedicated random forests for each feature conditioned on the others. This design robustly handles heterogeneous data types while preserving the semantic integrity of categorical features. To further enable precise and interpretable detection, RFOD combines Adjusted Gower's Distance (AGD) for cell-level scoring, which adapts to skewed numerical data and accounts for categorical confidence, with Uncertainty-Weighted Averaging (UWA) to aggregate cell-level scores into robust row-level anomaly scores. Extensive experiments on 15 real-world datasets demonstrate that RFOD consistently outperforms state-of-the-art baselines in detection accuracy while offering superior robustness, scalability, and interpretability for mixed-type tabular data.
RFOD: Random Forest-based Outlier Detection for Tabular Data
- Paper ID: 2510.08747
- Title: RFOD: Random Forest-based Outlier Detection for Tabular Data
- Authors: Yihao Ang, Peicheng Yao, Yifan Bao, Yushuo Feng, Qiang Huang, Anthony K. H. Tung, Zhiyong Huang
- Categories: cs.LG (Machine Learning), cs.DB (Database)
- Publication Date: October 9, 2025 (arXiv preprint)
- Paper Link: https://arxiv.org/abs/2510.08747
Outlier detection in tabular data is critical for ensuring data integrity in high-risk domains such as cybersecurity, financial fraud detection, and healthcare. Despite advances in data mining and deep learning techniques, existing methods face challenges in handling mixed-type tabular data, often relying on encoding schemes that lose important semantic information and lacking interpretability. To address these issues, this paper proposes RFOD, a Random Forest-based outlier detection framework specifically designed for tabular data. RFOD redefines outlier detection as a feature-level conditional reconstruction problem, training dedicated random forests for each feature to achieve robust handling of heterogeneous data types. The method combines Adjusted Gower Distance (AGD) for cell-level scoring and Uncertainty-Weighted Averaging (UWA) for row-level anomaly score aggregation. Extensive experiments on 15 real-world datasets demonstrate that RFOD consistently outperforms state-of-the-art baseline methods in detection accuracy while providing superior robustness, scalability, and interpretability.
Outlier detection aims to identify instances in data that significantly deviate from the dominant distribution, which is critical in high-risk domains such as cybersecurity, financial fraud detection, and healthcare. Undetected anomalies can lead to distorted analysis, obscure critical insights, and compromise operations.
- Traditional Data Mining Methods:
- Methods such as LOF, Isolation Forest, and OCSVM typically rely on global proximity or statistical heuristics
- Often process features independently, failing to capture contextual anomalies in multivariate relationships
- Limited native support for mixed-type data
- Deep Learning Methods:
- Methods such as Deep SVDD, DevNet, and ICL primarily assume purely numerical inputs
- Rely on preprocessing (e.g., one-hot encoding) that may lose semantic details
- Black-box nature hinders interpretability
Existing methods show inconsistent performance on mixed-type tabular data and lack a unified solution that provides both high detection accuracy and interpretability. This work aims to develop an outlier detection framework that can:
- Natively handle mixed-type data
- Provide fine-grained interpretability
- Maintain high detection accuracy and computational efficiency
- Feature-Level Conditional Reconstruction Paradigm: Proposes a novel paradigm that redefines outlier detection as a feature-level conditional reconstruction problem, avoiding the limitations of global joint distribution modeling
- RFOD Framework: Designs a Random Forest-based outlier detection framework with four core modules:
- Feature-specific random forests
- Forest pruning mechanism
- Adjusted Gower Distance (AGD)
- Uncertainty-Weighted Averaging (UWA)
- AGD Distance Metric: Proposes an improved distance metric that adapts to skewed numerical distributions and categorical feature confidence
- Superior Experimental Performance: Achieves the best average performance on 15 real-world datasets, with AUC-ROC improvements of up to 9.1% over the best competing methods and an average test-time latency reduction of 91.2%
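Given a training set $X_{\text{train}} \in \mathbb{R}^{n \times d}$ and a test set $X_{\text{test}} \in \mathbb{R}^{m \times d}$, the objective is to compute:
- Cell-level anomaly score matrix: $S_{\text{cell}} = [s_{i,j}] \in \mathbb{R}^{m \times d}$
- Row-level anomaly score vector: $s_{\text{row}} = [s_{\text{row},1}, \ldots, s_{\text{row},m}] \in \mathbb{R}^{m}$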
Employs a leave-one-feature-out decomposition strategy, training a dedicated random forest RFj for each feature xj:
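$$\mathrm{RF}_j : X_{\text{train}}^{j} \in \mathbb{R}^{n \times (d-1)} \rightarrow y_{\text{train}}^{j} \in \mathbb{R}^{n}$$
where $X_{\text{train}}^{j} = X_{\text{train}} \setminus \{x_j\}$ and $y_{\text{train}}^{j} = x_j$.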
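The decomposition above can be sketched in a few lines. This is a minimal illustration of the data split only, not the authors' implementation: the helper name `leave_one_feature_out` and the toy table are hypothetical, and no forest is actually fitted here. Presumably a regression forest would be fit for each numerical target and a classification forest for each categorical one.

```python
from typing import List, Sequence, Tuple

Row = Sequence[object]


def leave_one_feature_out(
    rows: Sequence[Row], j: int
) -> Tuple[List[List[object]], List[object]]:
    """Split a table into inputs (all columns except j) and target (column j)."""
    inputs = [[v for k, v in enumerate(row) if k != j] for row in rows]
    target = [row[j] for row in rows]
    return inputs, target


# Toy mixed-type table with columns (age, protocol, bytes); values are illustrative.
train = [
    (34, "tcp", 120.0),
    (51, "udp", 80.5),
    (29, "tcp", 300.0),
]

# One (inputs, target) pair per feature; a dedicated forest would be fit on each pair.
tasks = [leave_one_feature_out(train, j) for j in range(len(train[0]))]
for j, (X_j, y_j) in enumerate(tasks):
    print(j, X_j, y_j)
```

Note that categorical columns stay as raw labels in this split, which mirrors the summary's point that no lossy encoding is needed before the forests see the data.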
Retains optimal trees based on out-of-bag (OOB) validation:
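$$\mathrm{Prune}(\mathrm{RF}) = \{\, T_{U(i)} \mid 1 \le i \le \lfloor \beta \cdot t \rfloor \,\}$$
where $\beta \in (0, 1]$ is the retention ratio, $t$ is the number of trees in the forest, and $U$ is the ordering of tree indices by OOB score, in descending order.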
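As a rough sketch of this pruning rule, the snippet below keeps the best-scoring fraction of trees. It assumes one OOB score per tree is already available and that higher is better; the function name `prune_forest` and the toy scores are hypothetical, and the trees are stand-in strings rather than fitted models.

```python
import math
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def prune_forest(trees: Sequence[T], oob_scores: Sequence[float], beta: float) -> List[T]:
    """Keep the floor(beta * t) trees with the highest OOB scores."""
    if not 0.0 < beta <= 1.0:
        raise ValueError("beta must lie in (0, 1]")
    if len(trees) != len(oob_scores):
        raise ValueError("one OOB score per tree is required")
    t = len(trees)
    keep = math.floor(beta * t)
    # U: tree indices ordered by OOB score, best first.
    order = sorted(range(t), key=lambda i: oob_scores[i], reverse=True)
    return [trees[i] for i in order[:keep]]


trees = ["T0", "T1", "T2", "T3", "T4"]
scores = [0.71, 0.90, 0.65, 0.88, 0.80]
print(prune_forest(trees, scores, beta=0.5))  # floor(0.5 * 5) = 2 trees kept
```

With a small forest and a small retention ratio, the floor can round the kept count down to zero, so a practical implementation would probably enforce a minimum of one tree.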
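For Numerical Features:
$$\mathrm{AGD}^{(\text{num})}(x_{i,j}, \hat{x}_{i,j}) = \frac{|x_{i,j} - \hat{x}_{i,j}|}{Q_{1-\alpha}(x_j) - Q_{\alpha}(x_j)}$$
For Categorical Features:
$$\mathrm{AGD}^{(\text{cat})}(x_{i,j}, \hat{x}_{i,j}) = 1 - p_{x_{i,j}}$$
where $p_{x_{i,j}}$ is the predicted probability of the true category.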
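The two AGD formulas can be illustrated with a small stdlib-only sketch. Several details are assumptions of this example rather than statements from the paper: the quantile estimator (linear interpolation), the zero score returned for a constant column, and the treatment of an unseen category as probability 0. All function names are hypothetical.

```python
from typing import Dict, Sequence


def quantile(values: Sequence[float], q: float) -> float:
    """Empirical quantile with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def agd_numerical(x: float, x_hat: float, column: Sequence[float], alpha: float) -> float:
    """|x - x_hat| scaled by the inter-quantile range Q_{1-alpha} - Q_alpha."""
    spread = quantile(column, 1.0 - alpha) - quantile(column, alpha)
    if spread == 0.0:
        # Assumption of this sketch: a constant column contributes no score.
        return 0.0
    return abs(x - x_hat) / spread


def agd_categorical(true_category: str, predicted_proba: Dict[str, float]) -> float:
    """1 - p, where p is the predicted probability of the observed category."""
    # Assumption of this sketch: a category the model never saw gets p = 0.
    return 1.0 - predicted_proba.get(true_category, 0.0)


column = [float(v) for v in range(101)]  # toy training column 0..100
print(agd_numerical(50.0, 60.0, column, alpha=0.01))  # 10 / (99 - 1)
print(agd_categorical("udp", {"tcp": 0.9, "udp": 0.1}))
```

Scaling by an inter-quantile range instead of the full min-max range keeps a handful of extreme training values from shrinking every score in a skewed column, which is the adaptation the summary attributes to AGD.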
Computes an uncertainty matrix $U = [u_{i,j}]$, where $u_{i,j}$ is the standard deviation of the tree predictions.
Confidence weights: $W = \mathbf{1}_{m \times d} - \tilde{U}$
Final row-level score:
$$s_{\text{row},i} = \frac{1}{d} \sum_{j=1}^{d} w_{i,j} \cdot s_{i,j}$$
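The aggregation step can be sketched as follows. The summary does not spell out how the uncertainty matrix is normalized, so this example assumes min-max scaling over the whole matrix and uses the population standard deviation; both choices, and all function names, are assumptions made for illustration.

```python
from statistics import pstdev
from typing import List, Sequence

Matrix = List[List[float]]


def uncertainty_matrix(tree_predictions: Sequence[Sequence[Sequence[float]]]) -> Matrix:
    """u[i][j] = std. dev. of the per-tree predictions for cell (i, j).

    tree_predictions[i][j] holds one prediction per tree.
    """
    return [[pstdev(cell) for cell in row] for row in tree_predictions]


def normalize(u: Matrix) -> Matrix:
    """Min-max scale to [0, 1] over the whole matrix (assumption of this sketch)."""
    flat = [v for row in u for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in u]
    return [[(v - lo) / (hi - lo) for v in row] for row in u]


def row_scores(cell_scores: Matrix, u: Matrix) -> List[float]:
    """s_row[i] = (1/d) * sum_j (1 - u_tilde[i][j]) * s[i][j]."""
    u_tilde = normalize(u)
    scores = []
    for s_row, u_row in zip(cell_scores, u_tilde):
        d = len(s_row)
        weighted = sum((1.0 - unc) * s for s, unc in zip(s_row, u_row))
        scores.append(weighted / d)
    return scores


cell_scores = [[0.2, 0.8], [0.1, 0.1]]
u = [[0.0, 1.0], [0.5, 0.0]]
print(row_scores(cell_scores, u))
```

In the toy input, the 0.8 cell score in the first row is fully discounted because the trees disagreed most on that cell, so an uncertain reconstruction cannot by itself push a row toward the top of the ranking.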
- Conditional Reconstruction vs. Global Modeling: Sidesteps the curse of dimensionality that global joint-distribution modeling faces in high-dimensional spaces
- Native Mixed-Type Data Support: Handles mixed numerical and categorical features without complex encoding
- Adaptive Distance Metric: AGD adapts to skewed distributions through quantile normalization and handles categorical uncertainty through confidence-aware matching
- Uncertainty-Aware Aggregation: UWA leverages prediction variance from the ensemble structure to dynamically adjust feature weights
Uses 15 public tabular datasets covering cybersecurity, finance, and healthcare domains:
| Domain | Dataset | Samples | Features | Anomaly Ratio |
|---|---|---|---|---|
| Cybersecurity | Backdoor | 95,329 | 42 | 2.44% |
| Cybersecurity | DoS | 109,353 | 42 | 14.95% |
| Cybersecurity | KDD | 4,898,430 | 41 | 19.86% |
| Finance | Bank | 45,211 | 16 | 11.70% |
| Healthcare | Arrhythmia | 452 | 279 | 45.80% |
- AUC-ROC: Measures ranking quality of anomaly scores
- AUC-PR: Emphasizes precision and recall, particularly suitable for class imbalance
- F1-Score and Accuracy: Threshold-based classification performance metrics
- Log-Loss: Evaluates calibration of anomaly probabilities
- Training Time and Test Time: Assess efficiency and scalability
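To make the ranking interpretation of AUC-ROC concrete, here is a small pairwise sketch: it estimates the probability that a randomly chosen anomaly receives a higher score than a randomly chosen normal row. The function name `auc_roc` and the toy labels and scores are hypothetical and unrelated to the paper's results; the quadratic loop is for clarity, not efficiency.

```python
from typing import Sequence


def auc_roc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Fraction of (anomaly, normal) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]            # 1 marks an anomaly
scores = [0.1, 0.4, 0.35, 0.8]   # row-level anomaly scores
print(auc_roc(labels, scores))   # 3 of 4 pairs are ordered correctly -> 0.75
```

Because only the ordering of scores matters here, AUC-ROC needs no decision threshold, unlike F1-Score and Accuracy.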
Data Mining Baselines: ECOD, LOF, IF, OCSVM, OT
Deep Learning Baselines: Deep SVDD, SLAD, DevNet, DIF, ICL
- Deep model training epochs: 50
- Environment: Intel Xeon Platinum 8480C @3.80GHz, 256GB RAM, NVIDIA H200 GPU
- RFOD parameters: α∈[0.01,0.02] (AGD sensitivity), β adaptively selected via OOB validation
RFOD demonstrates excellent performance across all evaluation metrics:
- Average Ranking: Ranks in top 2 across all 5 metrics, with 1st place in AUC-ROC and F1
- Performance Improvement: Average AUC-PR improvement of 46.7% over data mining methods and average AUC-ROC improvement of 24.8% over deep learning methods
- Consistency: Outperforms each baseline method on 80-100% of datasets
Validates the importance of each module:
- Forest Pruning: Significantly improves performance on Bank and Ethereum datasets, reducing overfitting
- AGD: Most critical component; removing it reduces AUC-ROC from 0.96 to 0.41 on the DoS dataset
- UWA: Provides stable performance improvements on large-scale datasets such as Backdoor and DoS
Using the Pima healthcare dataset as an example:
- Cell-Level Interpretability: Heatmaps show RFOD can precisely locate anomalous feature combinations
- Row-Level Interpretability: Predicted values fall in high-density regions of the distribution of normal samples, while the actual anomalous values lie in the distribution tails
- Comparative Analysis: OCSVM and DIF produce uniform high activations, making it difficult to isolate true anomaly sources
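One simple way to read cell-level scores for a flagged row is to rank its features by score, as sketched below. The feature names and score values are hypothetical placeholders, not figures from the paper's Pima case study, and `top_cells` is an illustrative helper rather than part of RFOD.

```python
from typing import List, Sequence, Tuple


def top_cells(
    cell_scores: Sequence[float], feature_names: Sequence[str], k: int = 3
) -> List[Tuple[str, float]]:
    """Return the k features with the highest cell-level anomaly scores."""
    ranked = sorted(zip(feature_names, cell_scores), key=lambda p: p[1], reverse=True)
    return ranked[:k]


# Hypothetical cell scores for a single flagged row.
names = ["Glucose", "BMI", "Insulin", "Age"]
row_cell_scores = [0.05, 0.62, 0.91, 0.10]
print(top_cells(row_cell_scores, names, k=2))
```

A heatmap over the full cell-score matrix conveys the same information for many rows at once; the ranking is just its per-row summary.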
- Training Time: Orders of magnitude faster than deep learning methods, supporting parallelization
- Test Time: Average reduction of 91.2% in test latency
- Scalability: Linear scaling demonstrated on KDD dataset from 1% to 100% data scale
Traditional methods such as LOF, IF, and OCSVM primarily rely on statistical or proximity-based criteria, but typically assume feature independence and struggle to capture multivariate interactions.
Methods such as Deep SVDD, DevNet, and ICL can learn complex representations but are primarily designed for numerical inputs, requiring preprocessing for mixed-type data and lacking interpretability.
RFOD combines the interpretability of tree methods with the robustness of ensemble learning, avoiding the limitations of global modeling through feature-level conditional modeling while providing native support for mixed-type data.
- RFOD successfully addresses outlier detection in mixed-type tabular data through feature-level conditional reconstruction
- The design of AGD and UWA significantly improves detection accuracy and robustness
- The method maintains high accuracy while providing superior interpretability and computational efficiency
- Parameter Sensitivity: While the α parameter is relatively stable, some tuning is still required
- Memory Overhead: Training independent forests for each feature may create memory pressure on extremely high-dimensional data
- Categorical Feature Handling: Processing high-cardinality categorical features may require further optimization
- Explore more efficient feature selection and dimensionality reduction techniques
- Investigate applications in streaming data and online learning scenarios
- Extend to time series and graph-structured data
- Methodological Innovation: The feature-level conditional reconstruction paradigm is a novel and effective approach
- Experimental Comprehensiveness: Comprehensive comparison across 15 datasets and 10 baseline methods
- Interpretability: Provides dual-level interpretability at both cell and row levels
- Practical Value: Achieves good balance between efficiency and accuracy
- Theoretical Analysis: Lacks in-depth theoretical analysis of convergence and complexity
- Extreme Scenarios: Performance on extremely high-dimensional or highly imbalanced data requires further verification
- Parameter Guidance: Lacks more systematic principles for parameter selection
- Academic Contribution: Provides a new research direction for outlier detection in tabular data
- Practical Value: Has direct application potential in critical domains such as finance and healthcare
- Reproducibility: Clear algorithm description facilitates implementation and reproduction
- Outlier detection in mixed-type tabular data
- High-risk decision scenarios requiring interpretability
- Real-time anomaly monitoring on medium-scale data
- Feature importance analysis and root cause analysis
The paper cites important works in the outlier detection field, including:
- Classical Methods: LOF (Breunig et al., 2000), Isolation Forest (Liu et al., 2008)
- Deep Learning Methods: Deep SVDD (Ruff et al., 2018), DevNet (Pang et al., 2019)
- Distance Metrics: Gower's Distance (Gower, 1971)
- Evaluation Benchmarks: ADBench (Han et al., 2022)
Overall Assessment: This is a high-quality research paper on outlier detection that proposes an innovative methodological framework with comprehensive experimental validation and strong potential for practical application. The method's advantages in interpretability and efficiency make it competitive for real-world deployment.