Modern machine learning (ML) has grown into a tightly coupled, full-stack ecosystem that combines hardware, software, network, and applications. Many users rely on cloud providers for elastic, isolated, and cost-efficient resources. Unfortunately, these platforms as a service use virtualization, which means operators have little insight into the users' workloads. This hinders resource optimization by the operator, which is essential to ensure cost efficiency and minimize execution time. In this paper, we argue that workload knowledge is unnecessary for system-level optimization. We propose Reveal, which takes a hardware-centric approach, relying only on hardware signals, which are fully accessible to operators. Using low-level signals collected from the system, Reveal detects anomalies through an unsupervised learning pipeline. The pipeline is developed by analyzing over 30 popular ML models on various hardware platforms, ensuring adaptability to emerging workloads and unknown deployment patterns. Using Reveal, we successfully identified both network and system configuration issues, accelerating the DeepSeek model by 5.97%.
- Paper ID: 2510.26008
- Title: Detecting Anomalies in Systems for AI Using Hardware Telemetry
- Authors: Ziji Chen, Steven W. D. Chien, Peng Qian, Noa Zilberman (University of Oxford)
- Classification: cs.PF (Performance), cs.AR (Computer Architecture), cs.DC (Distributed Computing), cs.LG (Machine Learning)
- Publication Date: October 31, 2025 (arXiv v2)
- Paper Link: https://arxiv.org/abs/2510.26008v2
Modern machine learning has evolved into a tightly coupled, full-stack ecosystem combining hardware, software, networking, and applications. Many users rely on cloud providers for elastic, isolated, and cost-efficient resources. However, these platform-as-a-service offerings rely on virtualization, so operators lack visibility into user workloads. This impedes resource optimization by operators, which is critical for cost efficiency and short execution times. The paper argues that system-level optimization does not require workload knowledge. It presents Reveal, a hardware-centric approach that relies solely on hardware signals fully accessible to operators. By analyzing the behavior of 30+ popular ML models across various hardware platforms, the authors developed an unsupervised learning pipeline to detect anomalies. Using Reveal, they identified network and system configuration issues and accelerated the DeepSeek model by 5.97%.
- Missing Observability: Virtualization abstracts the underlying hardware from users and, in turn, leaves operators without high-level workload information, which makes system-level optimization difficult
- Performance Bottleneck Detection Challenges: ML workloads exhibit tight hardware-software coupling, where minor inefficiencies can cascade into system-level performance degradation
- Limitations of Existing Tools: They require application-level integration, incur high runtime overhead (up to 90.2%), and offer limited coverage
- Specialized accelerators like GPUs are expensive (tens of thousands of dollars per GPU)
- Cloud AI resource demand projected to grow 30% annually through 2030
- Even minor configuration errors can cause 1.5x performance degradation
- Distributed training heavily depends on collective communication, vulnerable to network issues
- High-level Observability Dependency: Most tools require application-level information, unavailable in virtualized environments
- High Overhead: Plumber adds 21% overhead, and RL-Scope inflates GPU kernel launch time by 90.2%
- Rule-driven Detection: Requires workload-specific threshold tuning, poor portability
- Limited Coverage: Framework analyzers typically cover only applications and framework runtime
- Propose Reveal Framework: A hardware-centric analysis and anomaly detection framework with high portability, deployability, and analytical accuracy
- Identify Key Performance Indicators: Determine a set of low-level performance metrics representing ML workload behavior on hardware, with all collected datasets open-sourced
- Develop Unsupervised Detection Pipeline: Successfully detect performance issues in containerized ML workloads, identify system bottlenecks, and accelerate DeepSeek by 5.97%
Input: Host-level hardware telemetry data (CPU, GPU, memory, network, storage metrics)
Output: Anomalous window detection, subsystem attribution, root cause analysis reports
Constraints: Use only hardware-level signals accessible to operators, without requiring high-level workload knowledge
- Collects approximately 150 unique metric types using perf, procfs, nvidia-smi, and standard Linux tools
- Scales to 700+ time series channels when replicated across CPU cores and GPUs
- CPU overhead maintained below 1.5%
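To make the procfs part of the collection concrete, here is a minimal sketch of how a CPU busy percentage could be derived from two samples of `/proc/stat`. It is not Reveal's collector: the function names are hypothetical, and treating `iowait` as idle time is an assumption made for this illustration.

```python
def parse_cpu_line(line):
    """Return (busy, total) jiffies from one 'cpu...' line of /proc/stat."""
    fields = [int(v) for v in line.split()[1:]]
    # Field order: user nice system idle iowait irq softirq steal guest guest_nice
    # Assumption: iowait is counted as idle time.
    idle = fields[3] + (fields[4] if len(fields) > 4 else 0)
    # Guest time is already included in user/nice, so only sum the first 8 fields.
    total = sum(fields[:8])
    return total - idle, total


def busy_percent(prev_line, curr_line):
    """CPU busy percentage between two samples of the same /proc/stat line."""
    prev_busy, prev_total = parse_cpu_line(prev_line)
    curr_busy, curr_total = parse_cpu_line(curr_line)
    delta_total = curr_total - prev_total
    if delta_total <= 0:
        return 0.0
    return 100.0 * (curr_busy - prev_busy) / delta_total


def read_cpu_line(path="/proc/stat"):
    """Read the aggregate 'cpu' line (Linux only)."""
    with open(path) as f:
        return f.readline()
```

Calling `read_cpu_line()` twice, one sampling interval apart, and passing both lines to `busy_percent` yields one data point. Repeating this per core (`cpu0`, `cpu1`, ...) is how a single metric type multiplies into many time series channels.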
- Metric Filtering: Correlation-driven pruning, retaining approximately 60% of metrics at a correlation threshold of |r| = 0.5
- Derived Metrics: Compute IPC (execution throughput), branch misprediction rate, cache miss rate, etc.
- Sliding Window: 3-second window with 1-second stride, extracting statistical and temporal features
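The three preprocessing steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the paper's pipeline: the greedy redundancy rule in `prune_correlated` is one plausible reading of "correlation-driven pruning", the feature set in `window_features` is a guess at "statistical and temporal features", and all function names are hypothetical.

```python
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 for a constant series."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return 0.0 if sx == 0 or sy == 0 else cov / (sx * sy)


def prune_correlated(series, threshold=0.5):
    """Assumed rule: keep a metric only if |r| <= threshold against every kept metric."""
    kept = []
    for name in series:
        if all(abs(pearson(series[name], series[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


def ipc(instructions, cycles):
    """Derived metric: instructions per cycle, guarding against zero cycles."""
    return [i / c if c else 0.0 for i, c in zip(instructions, cycles)]


def window_features(values, rate_hz, window_s=3, stride_s=1):
    """Slide a window_s-second window with a stride_s-second stride over samples."""
    size, step = window_s * rate_hz, stride_s * rate_hz
    feats = []
    for start in range(0, len(values) - size + 1, step):
        w = values[start:start + size]
        feats.append({
            "mean": mean(w),
            "std": pstdev(w),
            "min": min(w),
            "max": max(w),
            "slope": (w[-1] - w[0]) / (size - 1),  # crude temporal trend
        })
    return feats
```

With a 100 ms sampling interval (`rate_hz=10`), each window holds 30 samples and consecutive windows overlap by 20.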
Reveal employs three complementary unsupervised methods:
- Z-score: Standardized deviation detection, flagging windows whose score exceeds the 99th percentile
- Mahalanobis Distance in PCA Subspace: Accounts for metric correlations and scale differences
- Isolation Forest: Tree-based ensemble method with 1% contamination rate
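As a rough illustration of the first two detectors, the sketch below scores a feature matrix (rows are windows, columns are features) with NumPy. Several choices are assumptions rather than details from the paper: aggregating z-scores by their per-window maximum, the number of principal components, and all function names. The third detector is omitted here; scikit-learn's `IsolationForest(contamination=0.01)` would be one way to add it.

```python
import numpy as np


def _standardize(X):
    """Zero-mean, unit-variance columns; constant columns are left at zero."""
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)
    return (X - mu) / sigma


def zscore_scores(X):
    """Per-window score: largest absolute z-score across features (assumed aggregation)."""
    return np.abs(_standardize(X)).max(axis=1)


def pca_mahalanobis_scores(X, n_components=2):
    """Mahalanobis distance measured inside the top principal-component subspace."""
    Z = _standardize(X)
    # Rows of vt are principal directions; s holds the singular values.
    _, s, vt = np.linalg.svd(Z, full_matrices=False)
    proj = Z @ vt[:n_components].T
    var = (s[:n_components] ** 2) / (len(X) - 1)  # variance along each component
    return np.sqrt((proj ** 2 / var).sum(axis=1))


def flag_top_percentile(scores, pct=99.0):
    """Flag windows whose score exceeds the given percentile of all scores."""
    return scores > np.percentile(scores, pct)
```

Dividing each projected coordinate by the variance of its component is what makes the distance insensitive to metric scale and correlation, which is the property the list above attributes to this detector.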
- Hardware-centric Approach: Entirely based on hardware signals, avoiding dependence on high-level observability
- Multi-detector Fusion: Reduces false positives and improves detection accuracy through inter-detector consistency
- Subsystem Attribution: Maps anomalies to specific hardware subsystems (CPU, GPU, memory, network, storage)
- Cross-layer Analysis: Single anomalous windows may involve multiple correlated signals, providing stronger anomaly evidence
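Fusion and attribution can be pictured with a small sketch. The paper's exact fusion rule is not given in this summary, so the majority vote below is an assumption, and the prefix table mapping metric names to subsystems is hypothetical (real names depend on the collector).

```python
from collections import Counter

# Hypothetical prefix-to-subsystem table, for illustration only.
SUBSYSTEM_PREFIXES = {
    "cpu_": "CPU",
    "gpu_": "GPU",
    "mem_": "memory",
    "net_": "network",
    "disk_": "storage",
}


def fuse_votes(detector_flags, min_votes=2):
    """Keep windows flagged by at least min_votes detectors.

    detector_flags maps a detector name to the set of window indices it flagged.
    """
    votes = Counter(w for flagged in detector_flags.values() for w in flagged)
    return sorted(w for w, n in votes.items() if n >= min_votes)


def attribute(deviating_metrics):
    """Count the deviating metrics of one window per subsystem, most frequent first."""
    counts = Counter()
    for metric in deviating_metrics:
        for prefix, subsystem in SUBSYSTEM_PREFIXES.items():
            if metric.startswith(prefix):
                counts[subsystem] += 1
                break
        else:
            counts["unknown"] += 1
    return counts.most_common()
```

A window reported by a single detector is dropped, which is one simple way to trade a little recall for fewer false positives. A window whose deviating metrics span several subsystems is the cross-layer case described in the last bullet.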
- ML Applications: 30+ popular models including BERT, BART, ResNet, ViT, VGG, DeepSeek, LLaMA, Mistral
- Task Types: Text classification, table question-answering, image classification, semantic segmentation
- Datasets: GLUE/SST2, WikiSQL, PASCAL VOC, CIFAR, MNIST
- Runs: 10 executions per workload to ensure statistical reliability
- HPC Cluster:
  - Dual-node with NVIDIA Tesla V100 GPUs (32GB), Intel Xeon Platinum 8628 CPUs
  - Single-node with four NVIDIA H100 GPUs (96GB HBM3), Intel Sapphire Rapids CPUs
- Local Cluster:
  - 9 servers with AMD EPYC 7443P CPUs (24 cores), 256GB memory
  - 99 containers in distributed training setup
- Detection Accuracy: Accuracy of anomalous window identification
- Subsystem Attribution: Ability to correctly map to hardware subsystems
- Performance Improvement: End-to-end runtime improvement
- Overhead Assessment: CPU usage, storage requirements, detector runtime
- CPU Overhead: 1.2-1.4% at 100ms sampling interval, drops below 0.6% at 600ms
- Storage Requirements: 42-43 KB/s/host before filtering, 14-22 KB/s after filtering
- Detection Latency: Feature extraction 1.46±0.02s, end-to-end 2.26±0.17s
- Metric Stability: 99.75% of workload-metric pairs show statistically significant similarity (p<0.05)
- Cross-configuration Consistency: Median IoU of 0.50 between default and fine-grained settings, hit rate 0.92
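The two consistency figures measure different things, which is why an IoU of 0.50 can coexist with a hit rate of 0.92. The sketch below uses the standard set-based IoU; the hit-rate definition (a reference window counts as hit if the other configuration flags a window within a small tolerance) is an assumption, since this summary does not give the paper's definition.

```python
def iou(a, b):
    """Intersection over union of two sets of anomalous window indices."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def hit_rate(reference, candidate, tolerance=1):
    """Fraction of reference windows with a candidate window within `tolerance` windows."""
    reference, candidate = set(reference), set(candidate)
    if not reference:
        return 1.0
    hits = sum(any(abs(r - c) <= tolerance for c in candidate) for r in reference)
    return hits / len(reference)
```

IoU penalizes every window that only one configuration flags, while a tolerant, one-sided hit rate only asks whether each anomaly in the reference configuration is found again nearby.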
- Detection: Windows 118-123 show IPC decline and increased L3 miss cycles
- Analysis: Cross-socket memory and PCIe traffic cause increased latency
- Fix: NUMA-aware binding, pinning processes to single NUMA node
- Result: DeepSeek-7B fine-tuning improved from 1823.4±46.1s to 1714.6±70.0s (5.97% improvement)
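For readers unfamiliar with NUMA-aware binding, the following sketch pins a process to the CPUs of one NUMA node using the Linux sysfs topology files and `os.sched_setaffinity`. The mechanism the authors used is not stated in this summary, so this is only one possible approach; the function names are hypothetical and the parser assumes the plain `a-b,c` cpulist format.

```python
import os


def parse_cpulist(text):
    """Parse a Linux cpulist string such as '0-11,24-35' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def pin_to_numa_node(node=0, pid=0):
    """Restrict a process (0 = the caller) to the CPUs of one NUMA node. Linux only."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        cpus = parse_cpulist(f.read())
    os.sched_setaffinity(pid, cpus)
    return cpus
```

This sets CPU affinity only. Memory placement then follows the kernel's default local-allocation policy; an explicit memory policy is commonly set from the command line with `numactl --cpunodebind=0 --membind=0`.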
- Detection: Increased CPU Busy%, burst in ib0 TX/RX traffic, decreased GPU power
- Analysis: A single queue pair (QP) configuration causes a completion-processing bottleneck
- Fix: Increase the configuration from one QP to two QPs
- Result: Runtime improved from 1825.4±46.1s to 1769.3±16.7s (3.1% improvement)
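The reported percentages follow directly from the mean runtimes quoted in the two results above. A short check, with a hypothetical helper name:

```python
def relative_improvement(before_s, after_s):
    """Percentage reduction in runtime relative to the baseline."""
    return 100.0 * (before_s - after_s) / before_s


numa = relative_improvement(1823.4, 1714.6)  # NUMA-aware binding case
qp = relative_improvement(1825.4, 1769.3)    # one QP -> two QPs case
print(f"NUMA binding: {numa:.2f}%  QP change: {qp:.1f}%")
# prints "NUMA binding: 5.97%  QP change: 3.1%"
```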
- Detection: CPU Busy% variance and IRQ counter anomalies
- Fix: Enable irqbalance service for automatic interrupt load distribution
- Result: TCP retransmission anomalies reduced from 6.07% to 3.51%
- Detection: Cross-node memory usage anomalies
- Analysis: Pre-allocated 1GiB HugePages reported as "used" memory
- Fix: Revert to the default 2MiB HugePages allocation
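The effect behind this case is that a pre-allocated HugePages pool is taken out of `MemFree`, so a naive "total minus free" figure counts it as used even when no page in the pool is in use. The sketch below parses `/proc/meminfo` text and subtracts the pool; the function names are hypothetical, and this is an illustration of the accounting rather than the paper's tooling.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo text into {field: integer}; sizes are in kB."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if rest.strip():
            info[key.strip()] = int(rest.split()[0])
    return info


def used_kb_excluding_hugepages(info):
    """MemTotal - MemFree with the pre-allocated HugePages pool subtracted."""
    # Prefer the Hugetlb field (all page sizes) when the kernel reports it;
    # otherwise fall back to the default-size pool.
    pool_kb = info.get(
        "Hugetlb",
        info.get("HugePages_Total", 0) * info.get("Hugepagesize", 0),
    )
    return info["MemTotal"] - info["MemFree"] - pool_kb
```

With 64 pre-allocated 1GiB pages, for example, 64 GiB would otherwise appear as used memory on an idle node.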
- Detection Capability: Distinguish workload-inherent retransmissions from failure-induced ones
- Analysis Depth: Provide cross-layer context from transport layer counters to CPU IRQ spikes and GPU stalls
- HPC Cluster: CPU-side signals (Bzy_MHz, IRQ) dominate, contributing >50% of anomaly features
- Local Cluster: Anomalies concentrated in memory and I/O subsystems, with writeback surges and dirty page accumulation
- Cross-environment: TCP retransmissions appear in both environments, typically related to NCCL imbalance
According to Table 1 in the paper, existing methods fall into three categories:
- Application-level Analyzers: TensorFlow Profiler, PyTorch Profiler - require code instrumentation
- System Tools: AWS SageMaker, Prometheus - rule-based detection
- Low-level Tracing: BCC/eBPF tools, RL-Scope - high overhead or limited coverage
- No Instrumentation Required: Entirely based on host-level telemetry
- Full Subsystem Coverage: CPU, GPU, memory, network, storage
- Automatic Anomaly Detection: Unsupervised ML methods
- Hardware Attribution: Map anomalies to specific hardware components
- Hardware-centric Approach is Feasible: Effective ML workload anomaly detection using only hardware signals
- Unsupervised Detection is Effective: Combination of three detectors accurately identifies multiple anomaly types
- Practical Performance Gains: Successfully identifies and fixes configuration issues, achieving significant performance improvements
- High Portability: 91% of the code is reusable across platforms
- Static Configuration: Currently uses fixed sampling rates and window sizes, cannot adapt to workload dynamics
- Passive Detection: Can only detect anomalies, cannot automatically resolve issues
- Manual Remediation: Requires operator intervention for problem fixing
- Adaptive Sampling: Adjust sampling frequency based on heuristics
- Automatic Remediation: Explore lightweight runtime interventions, such as automatic IRQ rebalancing triggers
- Extended Detectors: Explore additional unsupervised anomaly detection methods
- Strong Innovation: First to propose pure hardware signal-based ML anomaly detection, addressing cloud environment observability challenges
- Comprehensive Experiments: Testing 30+ models across multiple hardware platforms with rich datasets
- High Practical Value: Low overhead (<1.5% CPU), high portability (91% code reuse)
- Convincing Results: 5.97% actual performance improvement validates method effectiveness
- Open-source Contribution: Provides complete datasets and toolkits
- Detection Latency: 2.26-second end-to-end latency may be unsuitable for real-time applications
- Feature Engineering: Metric selection and feature extraction processes are relatively complex, requiring domain expertise
- Evaluation Scope: Primarily tested in academic environments; production environment complexity may present new challenges
- Root Cause Analysis Depth: While capable of subsystem attribution, specific root cause analysis still requires manual intervention
- Academic Contribution: Provides new research direction for ML system performance monitoring
- Practical Value: Offers cloud service providers non-intrusive monitoring solutions without accessing user workloads
- Reproducibility: Open-source code and datasets support research reproduction and extension
- Cloud Service Providers: Need performance optimization without accessing user workloads
- HPC Centers: Need to monitor and diagnose ML workload performance issues
- Edge Computing: Lightweight monitoring in resource-constrained environments
- Research Institutions: ML system performance analysis and optimization research
The paper cites 77 related references covering:
- ML performance analysis tools: Hotline, RL-Scope, Plumber, etc.
- Anomaly detection methods: Isolation Forest, PCA, Mahalanobis distance, etc.
- System monitoring: Prometheus, AWS CloudWatch, etc.
- ML frameworks: PyTorch, TensorFlow, etc.
Overall Assessment: This is a high-quality systems research paper proposing an innovative hardware-centric anomaly detection method addressing practical ML workload monitoring challenges in cloud environments. The experimental design is comprehensive, results are convincing, and the work holds significant value for both academia and industry.