Fault detection is essential in complex industrial systems to prevent failures and optimize performance by distinguishing abnormal from normal operating conditions. With the growing availability of condition monitoring data, data-driven approaches have been increasingly applied to detecting system faults. However, these methods typically require large, diverse, and representative training datasets that capture the full range of operating scenarios, an assumption rarely met in practice, particularly in the early stages of deployment.
Industrial systems often operate under highly variable and evolving conditions, making it difficult to collect comprehensive training data. This variability results in a distribution shift between training and testing data, as future operating conditions may diverge from those previously observed. Such domain shifts hinder the generalization of traditional models, limiting their ability to transfer knowledge across time and system instances and ultimately leading to performance degradation in practical deployments.
To address these challenges, we propose a novel method for continuous test-time domain adaptation, designed to support robust early-stage fault detection in the presence of domain shifts and limited representativeness of training data. Our proposed framework, Test-time domain Adaptation for Robust fault Detection (TARD), explicitly separates input features into system parameters and sensor measurements. It employs a dedicated domain adaptation module that handles each input type with a different strategy, enabling more targeted and effective adaptation to evolving operating conditions. We validate our approach on two real-world case studies from multi-phase flow facilities, where it delivers substantial improvements in both fault detection accuracy and model robustness over existing domain adaptation methods under real-world variability.
TARD: Test-time Domain Adaptation for Robust Fault Detection under Evolving Operating Conditions
- Paper ID: 2507.16354
- Title: TARD: Test-time Domain Adaptation for Robust Fault Detection under Evolving Operating Conditions
- Authors: Han Sun, Olga Fink (EPFL)
- Classification: stat.AP (Statistics - Applications)
- Publication Date: October 13, 2025 (arXiv v2)
- Paper Link: https://arxiv.org/abs/2507.16354
Fault detection in industrial systems is crucial for preventing failures and optimizing performance. With the increasing availability of condition monitoring data, data-driven methods have been widely adopted for fault detection. However, these methods typically require large-scale, diverse, and representative training datasets, which are difficult to obtain in practice, particularly during early deployment stages. Industrial systems often operate under highly variable and continuously evolving conditions, leading to distribution shifts between training and test data. To address these challenges, this paper proposes TARD, a novel continuous test-time domain adaptation method specifically designed to support robust early fault detection under domain shift and limited training data conditions.
- Data Scarcity: Industrial systems, particularly newly deployed or refurbished equipment, lack comprehensive historical data, with fault data being extremely scarce
- Domain Shift Challenges: Operating conditions differ significantly between equipment units and within the same system over time, violating the i.i.d. assumption of traditional machine learning
- Dynamic Environments: Industrial systems operate in continuously evolving environments, requiring continuous adaptation rather than discrete domain adaptation
- Early fault detection is critical for optimizing system performance, minimizing maintenance costs, and reducing asset unavailability
- Existing methods suffer from high false positive rates and reduced detection accuracy when facing distribution shifts
- There is a need to support fleet-level knowledge transfer, transferring experience from data-rich systems to data-scarce new systems
- Traditional Domain Adaptation: Requires substantial source and target domain data, typically requiring labeled fault data
- Static Adaptation: Most methods assume discrete static domain characteristics, unable to handle continuously evolving operating conditions
- Test-time Adaptation Risks: Existing TTA methods may incorrectly adapt fault patterns to normal behavior
- Proposes TARD Framework: A continuous test-time domain adaptation framework specifically designed for unsupervised fault detection, completely independent of labeled fault data
- Innovative Feature Separation Strategy: Explicitly separates input variables into control parameters and sensor measurements, employing specialized adaptation strategies for each category
- Practical Framework: Requires only limited normal samples from the target system, suitable for early deployment and fleet-level knowledge transfer
- Empirical Validation: Validates the method's effectiveness through real case studies on two multiphase flow facilities
Given:
- Rich healthy training data from source system: $X^s = [x_1^s, \dots, x_n^s]$
- Limited normal data from target domain: $X^t = [x_1^t, \dots, x_m^t]$
Objective: Achieve robust fault detection in target domain t, considering:
- Both domains lack fault training data
- Limited target domain data availability
- Continuous distribution shifts during inference
Input data is divided into two groups: X=[x,w]
- Control Variables w: System condition control variables set by operators or control systems
- Sensor Measurements x: Sensor signals monitoring system components and reflecting real-time system status
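The split of each input into sensor measurements and control variables can be illustrated with a small sketch. The function name `split_features` and the idea of selecting control variables by column index are assumptions made for illustration; the paper summary does not specify how the split is implemented.

```python
from typing import List, Sequence, Tuple


def split_features(
    sample: Sequence[float], control_idx: Sequence[int]
) -> Tuple[List[float], List[float]]:
    """Split one input row X into (x, w).

    x: sensor measurements (all columns not listed in control_idx)
    w: control variables (columns listed in control_idx, in that order)
    The index-based selection is a hypothetical convention for this sketch.
    """
    control = set(control_idx)
    w = [sample[i] for i in control_idx]
    x = [value for i, value in enumerate(sample) if i not in control]
    return x, w


# Toy row with four columns; columns 0 and 3 play the role of setpoints.
x, w = split_features([1.0, 2.0, 3.0, 4.0], control_idx=[0, 3])
```

Keeping the two groups apart from the start makes it possible to route only `w` through the adaptation step later on.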
Employs autoencoder fθ as the reconstruction model, trained on source domain normal data:
$$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{n}\sum_{i=1}^{n}\left(x_i^s - \hat{x}_i^s\right)^2$$
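As a minimal sketch of the reconstruction loss, the function below averages the squared reconstruction error over the n source samples. Treating the square of a vector-valued sample as the sum of squared per-feature differences is an assumption here; deep learning frameworks often average over features as well, which only rescales the loss.

```python
from typing import Sequence


def mse_loss(
    x: Sequence[Sequence[float]], x_hat: Sequence[Sequence[float]]
) -> float:
    """Mean over samples of the squared error between x and its reconstruction."""
    if len(x) != len(x_hat) or not x:
        raise ValueError("x and x_hat must be non-empty and of equal length")
    total = 0.0
    for row, row_hat in zip(x, x_hat):
        # Squared Euclidean distance for one sample (assumed convention).
        total += sum((a - b) ** 2 for a, b in zip(row, row_hat))
    return total / len(x)


# Two samples: the first is reconstructed exactly, the second is off by 2
# in one feature, so the loss is (0 + 4) / 2.
loss = mse_loss([[1.0, 2.0], [3.0, 4.0]], [[1.0, 2.0], [3.0, 6.0]])
```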
Introduces adaptation module hϕ, rather than directly modifying the reconstruction model:
- Input: Control variables w and predictions from the pretrained autoencoder
- Output: Compensation term Δx
- Design Rationale: Avoids adapting to potentially faulty data distributions
- Frozen Main Model: Pretrained autoencoder fθ remains frozen during adaptation
- AdaBN Layers: Integrates adaptive batch normalization layers in the adaptation module, updating mean and variance based on batch statistics
- Separated Adaptation: Adaptation applied only to control variables, protecting anomaly detection capability of sensor measurements
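The adaptation idea in the list above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the adaptation module is reduced to a single hypothetical linear map, the compensation term is assumed to be added to the frozen autoencoder's prediction, and `adabn_normalize` and `compensate` are names invented for this example. The AdaBN-style step simply normalizes the control variables with the mean and variance of the current batch.

```python
import math
from typing import List, Sequence


def adabn_normalize(
    batch: Sequence[Sequence[float]], eps: float = 1e-5
) -> List[List[float]]:
    """Normalize each feature with the statistics of the current batch."""
    n = len(batch)
    dims = len(batch[0])
    means = [sum(row[d] for row in batch) / n for d in range(dims)]
    variances = [
        sum((row[d] - means[d]) ** 2 for row in batch) / n for d in range(dims)
    ]
    return [
        [(row[d] - means[d]) / math.sqrt(variances[d] + eps) for d in range(dims)]
        for row in batch
    ]


def compensate(
    w_batch: Sequence[Sequence[float]],
    x_hat_batch: Sequence[Sequence[float]],
    weights: Sequence[Sequence[float]],
    eps: float = 1e-5,
) -> List[List[float]]:
    """Return x_hat + delta_x for every sample in the batch.

    x_hat_batch holds outputs of the frozen autoencoder; they are only read.
    weights has one row per sensor, with one coefficient per entry of the
    concatenated input [normalized w, x_hat] (hypothetical linear module).
    """
    w_norm = adabn_normalize(w_batch, eps)
    corrected = []
    for w_n, x_hat in zip(w_norm, x_hat_batch):
        features = list(w_n) + list(x_hat)
        delta_x = [sum(c * f for c, f in zip(row, features)) for row in weights]
        corrected.append([p + d for p, d in zip(x_hat, delta_x)])
    return corrected


# One control variable, one sensor; the module only reacts to w here.
out = compensate([[0.0], [2.0]], [[10.0], [20.0]], weights=[[0.5, 0.0]])
```

Because only the control variables pass through the batch-statistics normalization, a fault that shows up in the sensor measurements cannot shift those statistics, which is the rationale given for the separated design.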
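$$r_i = \frac{|\hat{X}_i - X_i|}{\bar{X}_{t\_\text{training}}}$$
$$s_i = \frac{1}{k}\sum_{j=1}^{k} r_{ij} + \max_{1 \le j \le k} r_{ij}$$
$$s_{i\_\text{smooth}} = \frac{1}{l}\sum_{q=0}^{l-1} s_{i+q}$$
$$s_{i\_\text{smooth}} > \alpha \cdot \bar{r}_{t\_\text{training}}$$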
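The four scoring steps can be traced with a short worked sketch. The interpretation of the symbols is an assumption: the normalizer is taken to be the per-variable mean of the target training data, k the number of monitored variables, l the smoothing window length, and the threshold reference the mean residual level on the target training data. All function names are hypothetical.

```python
from typing import List, Sequence


def normalized_residuals(
    x: Sequence[float], x_hat: Sequence[float], train_mean: Sequence[float]
) -> List[float]:
    """Absolute reconstruction error per variable, scaled by the training mean."""
    return [abs(p - v) / m for v, p, m in zip(x, x_hat, train_mean)]


def anomaly_score(residuals: Sequence[float]) -> float:
    """Mean plus maximum of the k per-variable residuals at one time step."""
    return sum(residuals) / len(residuals) + max(residuals)


def smooth(scores: Sequence[float], window: int) -> List[float]:
    """Forward moving average over `window` consecutive scores."""
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def detect(smoothed: Sequence[float], alpha: float, r_bar_train: float) -> List[bool]:
    """Flag a fault wherever the smoothed score exceeds alpha times the reference."""
    return [s > alpha * r_bar_train for s in smoothed]


r = normalized_residuals([1.0, 4.0], [1.5, 3.0], [2.0, 4.0])  # [0.25, 0.25]
s = anomaly_score(r)                                          # 0.25 + 0.25
flags = detect(smooth([1.0, 2.0, 3.0, 4.0], window=2), alpha=2.0, r_bar_train=1.0)
```

Adding the maximum to the mean keeps a fault that affects a single variable from being averaged away, while the moving average suppresses isolated spikes.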
- Monitored Variables: 24 process variables (pressure, flow rate, liquid level, density, temperature, valve position)
- Control Variables: Air and water flow rate setpoints
- Fault Types: 6 types (air line blockage, water line blockage, top separator inlet blockage, direct bypass opening, slug flow conditions, 2-inch line pressurization)
- Sampling Frequency: 1 Hz
- Monitored Variables: 15 process variables
- Operating Conditions: 20 different air and water flow rate combinations
- Fault Types: 3 types (air leak, air blockage, flow diversion)
- Sampling Frequency: 1 Hz
- Accuracy: Overall prediction correctness rate
- F1 Score: Harmonic mean of precision and recall
- AUC: Area under the ROC curve
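For reference, the three metrics can be computed from binary labels (1 = fault) as in the sketch below. It uses the standard definitions; the pairwise formulation of AUC is chosen for brevity and is an illustration rather than the evaluation code used in the paper.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Harmonic mean of precision and recall for the positive (fault) class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random fault sample scores above a random normal one."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
```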
- Baseline: Model trained only on source domain
- AdaBN: Adaptive batch normalization
- MMD: Maximum mean discrepancy
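To make the MMD baseline concrete, the sketch below computes a biased estimate of the squared maximum mean discrepancy between two sample sets with a Gaussian (RBF) kernel. The kernel choice and the bandwidth parameter `gamma` are assumptions for illustration; the summary does not state which kernel the baseline uses.

```python
import math
from typing import Sequence


def rbf_kernel(a: Sequence[float], b: Sequence[float], gamma: float = 1.0) -> float:
    """Gaussian kernel exp(-gamma * ||a - b||^2)."""
    return math.exp(-gamma * sum((u - v) ** 2 for u, v in zip(a, b)))


def mean_kernel(
    xs: Sequence[Sequence[float]], ys: Sequence[Sequence[float]], gamma: float
) -> float:
    """Average kernel value over all pairs (x, y)."""
    total = sum(rbf_kernel(x, y, gamma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(
    xs: Sequence[Sequence[float]], ys: Sequence[Sequence[float]], gamma: float = 1.0
) -> float:
    """Biased estimate of MMD^2: E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    return (
        mean_kernel(xs, xs, gamma)
        + mean_kernel(ys, ys, gamma)
        - 2.0 * mean_kernel(xs, ys, gamma)
    )


source = [[0.0], [1.0]]
shifted = [[5.0], [6.0]]
same = mmd_squared(source, source)     # identical samples give zero
apart = mmd_squared(source, shifted)   # shifted samples give a positive value
```

Discrepancy-minimization methods use a quantity of this kind as a training penalty so that source and target features become statistically similar.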
- Optimizer: Adam, learning rate 1e-5
- Batch Size: 128
- Training Epochs: 500 for autoencoder, 50 for adaptation module
- Architecture: 3-layer fully connected encoder and decoder, dimensions 50-50-10
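The reported settings can be collected in a small configuration object, sketched below. The class and field names are hypothetical, and reading "50-50-10" as the output widths of the three encoder layers is an assumption; the decoder layout is not modelled here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TrainingConfig:
    """Hyperparameters as reported in the summary; names are illustrative."""
    optimizer: str = "adam"
    learning_rate: float = 1e-5
    batch_size: int = 128
    autoencoder_epochs: int = 500
    adaptation_epochs: int = 50
    encoder_dims: Tuple[int, ...] = (50, 50, 10)  # assumed layer output widths


def layer_shapes(input_dim: int, dims: Tuple[int, ...]) -> List[Tuple[int, int]]:
    """(in_features, out_features) for each fully connected layer."""
    shapes = []
    previous = input_dim
    for width in dims:
        shapes.append((previous, width))
        previous = width
    return shapes


cfg = TrainingConfig()
# Example with 24 inputs, matching the number of monitored variables
# listed for the first dataset.
encoder_shapes = layer_shapes(24, cfg.encoder_dims)
```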
| Fault Type | Baseline | AdaBN | MMD | TARD |
|---|---|---|---|---|
| Air Line Blockage | F1: 0.43 | F1: 0.43 | F1: 0.47 | F1: 0.70 |
| Water Line Blockage | F1: 0.67 | F1: 0.62 | F1: 0.69 | F1: 0.76 |
| Top Separator Blockage | F1: 0.63 | F1: 0.65 | F1: 0.64 | F1: 0.79 |
| Direct Bypass Opening | F1: 0.53 | F1: 0.60 | F1: 0.56 | F1: 0.69 |
| Slug Flow Conditions | F1: 0.85 | F1: 0.88 | F1: 0.89 | F1: 0.92 |
| 2-inch Line Pressurization | F1: 0.94 | F1: 0.98 | F1: 1.00 | F1: 1.00 |
| Fault Type | Baseline | AdaBN | MMD | TARD |
|---|---|---|---|---|
| Air Leak | F1: 0.62 | F1: 0.36 | F1: 0.51 | F1: 0.76 |
| Air Blockage | F1: 0.93 | F1: 0.88 | F1: 0.96 | F1: 0.94 |
| Flow Diversion | F1: 0.11 | F1: 0.51 | F1: 0.51 | F1: 0.69 |
Under different operating conditions for the Cranfield top separator blockage case:
- Variable Conditions: TARD performs best in dynamic environments (F1: 0.86 vs MMD: 0.79)
- Steady-state Conditions: TARD maintains advantages across most steady-state conditions
Deep ensemble validation (10 independent models) confirms high confidence in TARD detection results, with uncertainty bands remaining narrow during fault detection (standard deviation approximately 0.8).
- 100-dimensional Sensors: F1 improved from 0.42 to 0.67
- 1000-dimensional Sensors: F1 improved from 0.10 to 0.48
- Inference Latency: Maintained within real-time monitoring requirements (<2ms)
- Probabilistic Models: Gaussian mixture models, energy-based models
- One-class Classification: Discriminative boundary methods such as support vector machines
- Reconstruction Methods: Reconstruction error-based methods such as autoencoders
- Homogeneous Sub-fleets: Similarity clustering-based methods
- Functional Representation Learning: Methods learning overall fleet behavior
- Limitations: Depend on sufficient similarity assumptions
- Discrepancy Minimization Methods: Statistical distance minimization such as MMD
- Adversarial Methods: Domain discriminator networks such as DANN
- Test-time Adaptation: Methods such as Tent and SHOT
- Challenges: Require labeled data, assume static domains, may adapt to fault data
- TARD successfully addresses three major challenges in industrial fault detection: lack of labeled fault data, limited target domain data, and continuous domain shift
- The feature separation strategy effectively distinguishes between operating condition changes and actual faults
- Significantly outperforms existing domain adaptation methods on two real industrial datasets
- Parameter Tuning: Fault detection sensitivity parameter α requires manual setting
- Major System Changes: Lacks protective mechanisms for handling permanent major system changes
- Temporal Dynamics: Current residual smoothing strategy may lose important temporal details
- Automatic Protection Mechanisms: Develop methods to detect major domain shifts and trigger adaptation module retraining
- Adaptive Parameter Adjustment: Methods for automatically adjusting sensitivity parameter α
- Time Series Analysis: Introduce specialized time series models to analyze complex patterns in residual sequences
- Strong Practicality: Addresses real challenges in industry, requiring only limited normal data
- Technical Innovation: Clever and effective design of feature separation and specialized adaptation strategies
- Comprehensive Experiments: Full validation with two real industrial datasets plus high-dimensional synthetic data
- Solid Theoretical Foundation: Clear problem definition and method motivation
- Limited Scope: Primarily validated on multiphase flow systems; generalization to other industrial systems remains to be verified
- Theoretical Analysis: Lacks theoretical guarantees on method convergence and stability
- Computational Overhead: While inference time is reported, detailed computational complexity analysis is lacking
- Hyperparameter Sensitivity: Insufficient sensitivity analysis for critical hyperparameters (e.g., α, window length l)
- Academic Contribution: Provides new research directions for industrial fault detection
- Practical Value: Directly applicable to industrial deployment, particularly for early monitoring of new equipment
- Reproducibility: Provides detailed implementation details and algorithm descriptions
- Newly Deployed Systems: Industrial equipment with limited historical data
- Fleet Management: Scenarios requiring cross-device knowledge transfer
- Dynamic Environments: Industrial systems with continuously changing operating conditions
- Critical Infrastructure: Important industrial systems sensitive to false alarms
The paper cites 51 related references covering important works in fault detection, domain adaptation, and deep learning, providing a solid theoretical foundation for the research.
Overall Assessment: This is a high-quality applied statistics paper that successfully applies domain adaptation techniques to the important practical problem of industrial fault detection. The method design is sound, experimental validation is comprehensive, and it possesses strong practical value and academic significance.