Detecting Anomalies in Machine Learning Infrastructure via Hardware Telemetry

Chen, Chien, Qian et al.
Modern machine learning (ML) has grown into a tightly coupled, full-stack ecosystem that combines hardware, software, network, and applications. Many users rely on cloud providers for elastic, isolated, and cost-efficient resources. Unfortunately, these platforms as a service use virtualization, which means operators have little insight into the users' workloads. This hinders resource optimizations by the operator, which is essential to ensure cost efficiency and minimize execution time. In this paper, we argue that workload knowledge is unnecessary for system-level optimization. We propose Reveal, which takes a hardware-centric approach, relying only on hardware signals - fully accessible by operators. Using low-level signals collected from the system, Reveal detects anomalies through an unsupervised learning pipeline. The pipeline is developed by analyzing over 30 popular ML models on various hardware platforms, ensuring adaptability to emerging workloads and unknown deployment patterns. Using Reveal, we successfully identified both network and system configuration issues, accelerating the DeepSeek model by 5.97%.

Basic Information

  • Paper ID: 2510.26008
  • Title: Detecting Anomalies in Machine Learning Infrastructure via Hardware Telemetry
  • Authors: Ziji Chen, Steven W. D. Chien, Peng Qian, Noa Zilberman (University of Oxford)
  • Classification: cs.PF (Performance), cs.AR (Computer Architecture), cs.DC (Distributed Computing), cs.LG (Machine Learning)
  • Publication Date: October 31, 2025 (arXiv v2)
  • Paper Link: https://arxiv.org/abs/2510.26008v2

Abstract

Modern machine learning has evolved into a tightly coupled, full-stack ecosystem combining hardware, software, networking, and applications. Many users rely on cloud providers for elastic, isolated, and cost-efficient resources. However, these platform-as-a-service offerings rely on virtualization, so operators have little visibility into user workloads. This impedes resource optimization by operators, which is critical for ensuring cost efficiency and minimizing execution time. This paper argues that system-level optimization does not require workload knowledge. We present Reveal, a hardware-centric approach that relies solely on hardware signals fully accessible to operators. Reveal detects anomalies through an unsupervised learning pipeline, developed by analyzing 30+ popular ML models across various hardware platforms. Using Reveal, we successfully identified network and system configuration issues, accelerating the DeepSeek model by 5.97%.

Research Background and Motivation

Core Problems

  1. Missing Observability: Virtualization in cloud platforms hides user workloads from operators, who therefore lack the high-level workload information that system-level optimization usually depends on
  2. Performance Bottleneck Detection Challenges: ML workloads exhibit tight hardware-software coupling, where minor inefficiencies can cascade into system-level performance degradation
  3. Limitations of Existing Tools: Existing tools require application-level integration, incur high runtime overhead (up to 90.2%), and offer limited coverage

Problem Significance

  • Specialized accelerators like GPUs are expensive (tens of thousands of dollars per GPU)
  • Cloud AI resource demand projected to grow 30% annually through 2030
  • Even minor configuration errors can cause 1.5x performance degradation
  • Distributed training heavily depends on collective communication, vulnerable to network issues

Limitations of Existing Approaches

  1. High-level Observability Dependency: Most tools require application-level information, unavailable in virtualized environments
  2. High Overhead: Plumber adds 21% overhead, and RL-Scope increases GPU kernel launch time by 90.2%
  3. Rule-driven Detection: Requires workload-specific threshold tuning, poor portability
  4. Limited Coverage: Framework analyzers typically cover only applications and framework runtime

Core Contributions

  1. Propose Reveal Framework: A hardware-centric analysis and anomaly detection framework with high portability, deployability, and analytical accuracy
  2. Identify Key Performance Indicators: Determine a set of low-level performance metrics representing ML workload behavior on hardware, with all collected datasets open-sourced
  3. Develop Unsupervised Detection Pipeline: Successfully detect performance issues in containerized ML workloads, identify system bottlenecks, and accelerate DeepSeek by 5.97%

Methodology Details

Task Definition

  • Input: Host-level hardware telemetry data (CPU, GPU, memory, network, storage metrics)
  • Output: Anomalous window detection, subsystem attribution, root cause analysis reports
  • Constraints: Use only hardware-level signals accessible to operators, without requiring high-level workload knowledge

Model Architecture

1. Telemetry Collector

  • Collects approximately 150 unique metric types using perf, procfs, nvidia-smi, and standard Linux tools
  • Scales to 700+ time series channels when replicated across CPU cores and GPUs
  • CPU overhead maintained below 1.5%
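
To make the procfs part of the collector concrete, the sketch below derives a per-CPU Busy% channel from two /proc/stat snapshots. It is a minimal illustration and not Reveal's collector: the function names and the sample snapshots are hypothetical, and the summary does not say which tool Reveal actually uses for this metric. Only the counter order (user, nice, system, idle, iowait, irq, softirq, steal) comes from the Linux proc documentation.

```python
# Sketch: derive a CPU Busy% channel from two /proc/stat snapshots.
# Counter order per proc(5): user nice system idle iowait irq softirq steal.

def parse_cpu_lines(stat_text):
    """Return {cpu_name: [first eight jiffy counters]} for every 'cpu*' line."""
    counters = {}
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0].startswith("cpu"):
            counters[fields[0]] = [int(v) for v in fields[1:9]]
    return counters

def busy_percent(before, after):
    """Busy% per CPU between two snapshots; idle and iowait count as not busy."""
    result = {}
    for name, new in after.items():
        deltas = [n - o for n, o in zip(new, before[name])]
        total = sum(deltas)
        idle = deltas[3] + deltas[4]
        result[name] = 100.0 * (total - idle) / total if total else 0.0
    return result

# Hypothetical snapshots taken one sampling interval apart.
SAMPLE_T0 = "cpu  100 0 50 800 50 0 0 0 0 0\ncpu0 100 0 50 800 50 0 0 0 0 0\n"
SAMPLE_T1 = "cpu  160 0 70 860 60 0 0 0 0 0\ncpu0 160 0 70 860 60 0 0 0 0 0\n"

busy = busy_percent(parse_cpu_lines(SAMPLE_T0), parse_cpu_lines(SAMPLE_T1))
print(busy)
```

Because every core gets its own `cpuN` line, one metric type becomes one channel per core, which is how roughly 150 metric types can grow into 700+ time series.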

2. Metric Reanalysis and Feature Extraction

  • Metric Filtering: Correlation-driven pruning, retaining approximately 60% of metrics at |r|=0.5 threshold
  • Derived Metrics: Compute IPC (execution throughput), branch misprediction rate, cache miss rate, etc.
  • Sliding Window: 3-second window with 1-second stride, extracting statistical and temporal features
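
The three steps above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions rather than the paper's implementation: it assumes that pruning drops a metric when it is strongly correlated (|r| ≥ 0.5) with one already kept, that samples arrive every 100 ms (so a 3-second window is 30 samples and a 1-second stride is 10), and that the per-window features are simple statistics. All function names and the synthetic series are hypothetical.

```python
import math
import statistics

def pearson(x, y):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def prune_correlated(series, threshold=0.5):
    """Greedily keep a metric only if |r| with every kept metric is below threshold."""
    kept = []
    for name in series:
        if all(abs(pearson(series[name], series[k])) < threshold for k in kept):
            kept.append(name)
    return kept

def ratio(numerator, denominator):
    """Derived metric such as IPC = instructions / cycles."""
    return numerator / denominator if denominator else 0.0

def sliding_windows(values, size, stride):
    """Overlapping windows of `size` samples, advancing by `stride` samples."""
    return [values[i:i + size] for i in range(0, len(values) - size + 1, stride)]

def window_features(window):
    """Simple statistical and temporal features for one window."""
    return {
        "mean": statistics.fmean(window),
        "std": statistics.pstdev(window),
        "min": min(window),
        "max": max(window),
        "slope": (window[-1] - window[0]) / (len(window) - 1),
    }

# Synthetic channels (hypothetical values), 50 samples at an assumed 100 ms interval.
series = {
    "cycles": [float(i) for i in range(50)],
    "instructions": [2.0 * i + 1.0 for i in range(50)],  # redundant with cycles
    "irq": [float((i * 7) % 5) for i in range(50)],
}
kept = prune_correlated(series)
windows = sliding_windows(series["irq"], size=30, stride=10)
features = [window_features(w) for w in windows]
ipc = ratio(200.0, 100.0)
print(kept, len(windows), features[0], ipc)
```

In this toy run the `instructions` channel is dropped because it is perfectly correlated with `cycles`, while the weakly correlated `irq` channel survives.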

3. Anomaly Detection Engine

Employs three complementary unsupervised methods:

  • Z-score: Standardized deviation detection, flagging windows exceeding 99th percentile
  • Mahalanobis Distance in PCA Subspace: Accounts for metric correlations and scale differences
  • Isolation Forest: Tree-based ensemble method with 1% contamination rate
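
As a rough illustration of the first two detectors, the NumPy sketch below scores windows by their largest absolute z-score and by squared Mahalanobis distance in a PCA subspace, then flags scores above the 99th percentile. The number of components, the use of the maximum z-score per window, and the synthetic data are assumptions made for the example and are not taken from the paper. Isolation Forest is not reimplemented here; an off-the-shelf implementation such as scikit-learn's `IsolationForest` with `contamination=0.01` would be one natural way to fill that slot.

```python
import numpy as np

def zscore_scores(features):
    """Largest absolute z-score across features for each window (row)."""
    mu = features.mean(axis=0)
    sigma = features.std(axis=0)
    sigma[sigma == 0] = 1.0  # avoid division by zero on constant features
    return np.abs((features - mu) / sigma).max(axis=1)

def pca_mahalanobis_scores(features, n_components=2):
    """Squared Mahalanobis distance measured in the top principal components."""
    centered = features - features.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    projected = centered @ eigvecs[:, order]
    # In the PCA basis the covariance is diagonal, so the distance is a scaled sum.
    return ((projected ** 2) / eigvals[order]).sum(axis=1)

def flag_top_percentile(scores, percentile=99.0):
    """Boolean mask of windows whose score exceeds the given percentile."""
    return scores > np.percentile(scores, percentile)

# Synthetic feature matrix: 500 windows x 6 features, one injected anomaly.
rng = np.random.default_rng(0)
windows = rng.normal(size=(500, 6))
windows[123] += 12.0

z_scores = zscore_scores(windows)
m_scores = pca_mahalanobis_scores(windows)
z_flags = flag_top_percentile(z_scores)
m_flags = flag_top_percentile(m_scores)
print(int(z_scores.argmax()), int(m_scores.argmax()))
```

Percentile thresholds always flag about 1% of windows, which is one reason to cross-check detectors against each other before reporting an anomaly.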

Technical Innovations

  1. Hardware-centric Approach: Entirely based on hardware signals, avoiding dependence on high-level observability
  2. Multi-detector Fusion: Reduces false positives and improves detection accuracy through inter-detector consistency
  3. Subsystem Attribution: Maps anomalies to specific hardware subsystems (CPU, GPU, memory, network, storage)
  4. Cross-layer Analysis: Single anomalous windows may involve multiple correlated signals, providing stronger anomaly evidence
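
A minimal sketch of how fusion and attribution could fit together is shown below. The summary only says that inter-detector consistency is used, so the majority-vote rule here is an assumption, and the metric names and prefix table are hypothetical placeholders.

```python
from collections import Counter

# Hypothetical prefix-to-subsystem table; real metric names will differ.
SUBSYSTEM_PREFIXES = {
    "cpu_": "CPU", "irq_": "CPU",
    "gpu_": "GPU",
    "mem_": "memory", "numa_": "memory",
    "net_": "network", "tcp_": "network",
    "disk_": "storage",
}

def fuse(detector_flags, min_votes=2):
    """Window indices flagged by at least `min_votes` detectors (assumed rule)."""
    votes = Counter(i for flags in detector_flags.values() for i in flags)
    return sorted(i for i, n in votes.items() if n >= min_votes)

def attribute(top_metrics):
    """Count the deviating metrics per subsystem for one anomalous window."""
    counts = Counter()
    for metric in top_metrics:
        for prefix, subsystem in SUBSYSTEM_PREFIXES.items():
            if metric.startswith(prefix):
                counts[subsystem] += 1
                break
    return counts.most_common()

flags = {
    "zscore": {118, 119, 200},
    "mahalanobis": {118, 119, 120},
    "iforest": {119, 120, 301},
}
anomalous = fuse(flags)
ranking = attribute(["cpu_ipc", "mem_l3_miss_cycles", "numa_remote_access", "net_ib0_tx"])
print(anomalous, ranking)
```

Windows 200 and 301 are dropped because only one detector flagged them, and the surviving windows are attributed first to the memory subsystem because two of the four deviating metrics belong to it.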

Experimental Setup

Dataset

  • ML Applications: 30+ popular models including BERT, BART, ResNet, ViT, VGG, DeepSeek, LLaMA, Mistral
  • Task Types: Text classification, table question-answering, image classification, semantic segmentation
  • Datasets: GLUE/SST2, WikiSQL, PASCAL VOC, CIFAR, MNIST
  • Runs: 10 executions per workload to ensure statistical reliability

Experimental Environment

  1. HPC Cluster:
    • Dual-node with NVIDIA Tesla V100 GPUs (32GB), Intel Xeon Platinum 8628 CPUs
    • Single-node with four NVIDIA H100 GPUs (96GB HBM3), Intel Sapphire Rapids CPUs
  2. Local Cluster:
    • 9 servers with AMD EPYC 7443P CPUs (24 cores), 256GB memory
    • 99 containers in distributed training setup

Evaluation Metrics

  • Detection Accuracy: Accuracy of anomalous window identification
  • Subsystem Attribution: Ability to correctly map to hardware subsystems
  • Performance Improvement: End-to-end runtime improvement
  • Overhead Assessment: CPU usage, storage requirements, detector runtime

Experimental Results

Main Results

Performance Overhead

  • CPU Overhead: 1.2-1.4% at 100ms sampling interval, drops below 0.6% at 600ms
  • Storage Requirements: 42-43 KB/s/host before filtering, 14-22 KB/s after filtering
  • Detection Latency: Feature extraction 1.46±0.02s, end-to-end 2.26±0.17s
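
To put the storage rates in perspective, the short calculation below converts them to daily volume per host. It assumes decimal units (1 GB = 10^6 KB) and that the filtered rate is also per host, neither of which the summary states explicitly.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def daily_gb(kb_per_second, hosts=1):
    """Daily telemetry volume in decimal gigabytes (1 GB = 1e6 KB)."""
    return kb_per_second * SECONDS_PER_DAY * hosts / 1e6

raw_low, raw_high = daily_gb(42), daily_gb(43)
filtered_low, filtered_high = daily_gb(14), daily_gb(22)
print(f"raw: {raw_low:.2f}-{raw_high:.2f} GB/day/host")
print(f"filtered: {filtered_low:.2f}-{filtered_high:.2f} GB/day/host")
```

Under these assumptions, unfiltered telemetry amounts to roughly 3.6 to 3.7 GB per host per day, and filtering brings it down to about 1.2 to 1.9 GB.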

Anomaly Detection Performance

  • Metric Stability: 99.75% of workload-metric pairs show statistically significant similarity (p<0.05)
  • Cross-configuration Consistency: Median IoU of 0.50 between default and fine-grained settings, hit rate 0.92
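
The two consistency measures can be illustrated on sets of anomalous window indices. IoU is the standard intersection-over-union; the hit-rate definition below (the share of windows from one run that have a match within a small tolerance in the other) is an assumption, since the summary does not define it. The window indices are made up.

```python
def iou(a, b):
    """Intersection over union of two sets of anomalous window indices."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def hit_rate(reference, candidate, tolerance=1):
    """Share of reference windows with a candidate window within `tolerance`."""
    reference, candidate = set(reference), set(candidate)
    if not reference:
        return 1.0
    hits = sum(any(abs(r - c) <= tolerance for c in candidate) for r in reference)
    return hits / len(reference)

# Hypothetical detections under two sampling configurations.
default_run = {118, 119, 120, 121, 200}
fine_run = {119, 120, 121, 122, 201, 305}

overlap = iou(default_run, fine_run)
hits = hit_rate(default_run, fine_run)
print(overlap, hits)
```

The example shows why a moderate IoU can coexist with a high hit rate: windows shifted by one position lower the exact overlap but still count as hits.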

Case Studies

Case 1: NUMA Anomaly (Memory Subsystem)

  • Detection: Windows 118-123 show IPC decline and increased L3 miss cycles
  • Analysis: Cross-socket memory and PCIe traffic cause increased latency
  • Fix: NUMA-aware binding, pinning processes to single NUMA node
  • Result: DeepSeek-7B fine-tuning improved from 1823.4±46.1s to 1714.6±70.0s (5.97% improvement)

Case 2: NCCL-QP Configuration Error (Network Subsystem)

  • Detection: Increased CPU Busy%, burst in ib0 TX/RX traffic, decreased GPU power
  • Analysis: Single QP configuration causes completion processing bottleneck
  • Fix: Increase from 1QP to 2QP configuration
  • Result: Runtime improved from 1825.4±46.1s to 1769.3±16.7s (3.1% improvement)
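
The percentages reported for Case 1 and Case 2 follow from the mean runtimes as a relative reduction, which the short check below reproduces. The function name is illustrative; the runtimes are the means quoted in the two cases.

```python
def improvement_percent(before_s, after_s):
    """Relative runtime reduction, in percent of the original runtime."""
    return 100.0 * (before_s - after_s) / before_s

numa = improvement_percent(1823.4, 1714.6)  # Case 1: NUMA-aware binding
qp = improvement_percent(1825.4, 1769.3)    # Case 2: 1QP -> 2QP
print(f"NUMA fix: {numa:.2f}%, QP fix: {qp:.1f}%")
```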

Case 3: IRQ Imbalance (CPU Subsystem)

  • Detection: CPU Busy% variance and IRQ counter anomalies
  • Fix: Enable irqbalance service for automatic interrupt load distribution
  • Result: TCP retransmission anomalies reduced from 6.07% to 3.51%

Case 4: HugePages Configuration Error (Memory Subsystem)

  • Detection: Cross-node memory usage anomalies
  • Analysis: Pre-allocated 1GiB HugePages reported as "used" memory
  • Fix: Configure to default 2MiB HugePages allocation

Case 5: Packet Loss Injection Test (Network Subsystem)

  • Detection Capability: Distinguish workload-inherent retransmissions from failure-induced ones
  • Analysis Depth: Provide cross-layer context from transport layer counters to CPU IRQ spikes and GPU stalls

Anomaly Pattern Analysis

  • HPC Cluster: CPU-side signals (Bzy_MHz, IRQ) dominate, contributing >50% of anomaly features
  • Local Cluster: Anomalies concentrated in memory and I/O subsystems, with writeback surges and dirty page accumulation
  • Cross-environment: TCP retransmissions appear in both environments, typically related to NCCL imbalance

Comparison of Existing Monitoring Methods

According to Table 1 in the paper, existing methods fall into three categories:

  1. Application-level Analyzers: TensorFlow Profiler, PyTorch Profiler - require code instrumentation
  2. System Tools: AWS SageMaker, Prometheus - rule-based detection
  3. Low-level Tracing: BCC/eBPF tools, RL-Scope - high overhead or limited coverage

Reveal's Advantages

  • No Instrumentation Required: Entirely based on host-level telemetry
  • Full Subsystem Coverage: CPU, GPU, memory, network, storage
  • Automatic Anomaly Detection: Unsupervised ML methods
  • Hardware Attribution: Map anomalies to specific hardware components

Conclusions and Discussion

Main Conclusions

  1. Hardware-centric Approach is Feasible: Effective ML workload anomaly detection using only hardware signals
  2. Unsupervised Detection is Effective: Combination of three detectors accurately identifies multiple anomaly types
  3. Practical Performance Gains: Successfully identifies and fixes configuration issues, achieving significant performance improvements
  4. High Portability: 91% of the code is reusable across platforms

Limitations

  1. Static Configuration: Currently uses fixed sampling rates and window sizes, cannot adapt to workload dynamics
  2. Passive Detection: Can only detect anomalies, cannot automatically resolve issues
  3. Manual Remediation: Requires operator intervention for problem fixing

Future Directions

  1. Adaptive Sampling: Adjust sampling frequency based on heuristics
  2. Automatic Remediation: Explore lightweight runtime interventions, such as automatic IRQ rebalancing triggers
  3. Extended Detectors: Explore additional unsupervised anomaly detection methods

In-depth Evaluation

Strengths

  1. Strong Innovation: First to propose pure hardware signal-based ML anomaly detection, addressing cloud environment observability challenges
  2. Comprehensive Experiments: Testing 30+ models across multiple hardware platforms with rich datasets
  3. High Practical Value: Low overhead (below 1.5% CPU), high portability (91% code reuse)
  4. Convincing Results: 5.97% actual performance improvement validates method effectiveness
  5. Open-source Contribution: Provides complete datasets and toolkits

Weaknesses

  1. Detection Latency: 2.26-second end-to-end latency may be unsuitable for real-time applications
  2. Feature Engineering: Metric selection and feature extraction processes are relatively complex, requiring domain expertise
  3. Evaluation Scope: Primarily tested in academic environments; production environment complexity may present new challenges
  4. Root Cause Analysis Depth: While capable of subsystem attribution, specific root cause analysis still requires manual intervention

Impact

  1. Academic Contribution: Provides new research direction for ML system performance monitoring
  2. Practical Value: Offers cloud service providers non-intrusive monitoring solutions without accessing user workloads
  3. Reproducibility: Open-source code and datasets support research reproduction and extension

Applicable Scenarios

  1. Cloud Service Providers: Need performance optimization without accessing user workloads
  2. HPC Centers: Need to monitor and diagnose ML workload performance issues
  3. Edge Computing: Lightweight monitoring in resource-constrained environments
  4. Research Institutions: ML system performance analysis and optimization research

References

The paper cites 77 related references covering:

  • ML performance analysis tools: Hotline, RL-Scope, Plumber, etc.
  • Anomaly detection methods: Isolation Forest, PCA, Mahalanobis distance, etc.
  • System monitoring: Prometheus, AWS CloudWatch, etc.
  • ML frameworks: PyTorch, TensorFlow, etc.

Overall Assessment: This is a high-quality systems research paper proposing an innovative hardware-centric anomaly detection method addressing practical ML workload monitoring challenges in cloud environments. The experimental design is comprehensive, results are convincing, and the work holds significant value for both academia and industry.