RepDL: Bit-level Reproducible Deep Learning Training and Inference

Xie, Zhang, Chen
Non-determinism and non-reproducibility present significant challenges in deep learning, leading to inconsistent results across runs and platforms. These issues stem from two origins: random number generation and floating-point computation. While randomness can be controlled through deterministic configurations, floating-point inconsistencies remain largely unresolved. To address this, we introduce RepDL, an open-source library that ensures deterministic and bitwise-reproducible deep learning training and inference across diverse computing environments. RepDL achieves this by enforcing correct rounding and order invariance in floating-point computation. The source code is available at https://github.com/microsoft/RepDL .

Basic Information

Abstract

Non-determinism and irreproducibility in deep learning lead to inconsistent results across runs and platforms. These issues stem from two root causes: random number generation and floating-point arithmetic. While randomness can be controlled through deterministic configuration, floating-point inconsistency remains inadequately addressed. To this end, the authors introduce RepDL, an open-source library that ensures deterministic and bit-level reproducible deep learning training and inference across different computational environments by enforcing correct rounding and order invariance.

Research Background and Motivation

Problem Definition

Deep learning faces two critical challenges:

  1. Non-determinism: Identical tasks executed repeatedly with identical inputs and systems produce different results (cross-run inconsistency)
  2. Non-reproducibility: Identical tasks executed on different systems produce different results (cross-platform inconsistency)

Problem Significance

These issues result in:

  • Increased complexity in model deployment and debugging in production environments
  • Compromised correctness of cross-platform applications
  • Reduced credibility of published results
  • Diminished trustworthiness of AI systems in sensitive domains

Limitations of Existing Approaches

Although numerous solutions have been proposed by industry and academia, numerical inconsistency remains an open problem in deep learning. Existing approaches primarily focus on controlling random number generators but provide insufficient solutions to floating-point computation issues.

Research Motivation

The authors identify two root causes of the problem: random number generators and floating-point arithmetic. Random number issues have relatively well-established solutions, whereas floating-point computation problems are more complex and require specialized treatment.

Core Contributions

  1. Problem Analysis: Systematically analyzes the sources of non-determinism and non-reproducibility in deep learning, categorizing them into two classes: random number generation and floating-point arithmetic
  2. Design Principles: Proposes two core design principles: correct rounding and order invariance
  3. RepDL Library: Develops RepDL, an open-source library that achieves bit-level reproducible deep learning training and inference
  4. PyTorch Compatibility: Provides PyTorch-compatible APIs supporting deep learning operations, differentiable functions, neural network modules, and optimizers

Methodology Details

Root Cause Analysis

1. Random Number Generators

  • Applications: Weight initialization, data shuffling, dropout regularization, data augmentation
  • Issues: Different seeds, inconsistent RNG algorithms, non-deterministic call sequences in multi-threaded environments
  • Solutions: Employ reproducible RNG algorithms (e.g., MT19937), thread-safe implementation, fixed base seed
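
The RNG remedies listed above can be sketched with Python's standard library, whose random.Random class is an MT19937 generator. This is only an illustration, not RepDL code: BASE_SEED, make_worker_rng, shuffled_indices, and the seed-derivation formula are hypothetical names and choices introduced here.

```python
import random

BASE_SEED = 1234  # hypothetical fixed base seed


def make_worker_rng(base_seed: int, worker_id: int) -> random.Random:
    """Return a private MT19937 generator for one worker.

    Each worker owns its generator, so thread scheduling cannot change
    the sequence of numbers that any single worker draws.
    """
    # Assumption: a simple integer mix is enough to separate worker streams.
    return random.Random(base_seed * 1_000_003 + worker_id)


def shuffled_indices(n: int, rng: random.Random) -> list[int]:
    """Shuffle dataset indices 0..n-1 with the given generator."""
    indices = list(range(n))
    rng.shuffle(indices)
    return indices


# Two runs with the same base seed and worker id give the same order.
run_a = shuffled_indices(10, make_worker_rng(BASE_SEED, worker_id=0))
run_b = shuffled_indices(10, make_worker_rng(BASE_SEED, worker_id=0))
print(run_a == run_b)
```

Giving each worker its own generator, rather than sharing one global generator, removes the dependence on call order between threads.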

2. Floating-Point Arithmetic

Floating-point arithmetic is the more complex cause. It divides into two subcategories:

2.1 Basic Operation Precision

  • Varying precision of basic mathematical function implementations across systems
  • Hardware instruction precision differences (e.g., RCP instruction precision variations among x86 CPUs)

2.2 Computation Order

  • Order sensitivity caused by non-associativity of floating-point operations
  • Non-deterministic factors: atomic operations, dynamic code paths, dynamic batching, and caching
  • Non-reproducibility factors: software variability, compiler optimizations
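
The order sensitivity described above is easy to observe. The sketch below uses Python floats, which are binary64 rather than the float32 that RepDL targets, and adds the same three numbers in two orders. The helper name sum_left_to_right is introduced here for illustration.

```python
def sum_left_to_right(values):
    """Add the values one by one, in the order given."""
    total = 0.0
    for value in values:
        total += value
    return total


values = [0.1, 0.2, 0.3]

forward = sum_left_to_right(values)         # (0.1 + 0.2) + 0.3
backward = sum_left_to_right(values[::-1])  # (0.3 + 0.2) + 0.1

print(forward)              # 0.6000000000000001
print(backward)             # 0.6
print(forward == backward)  # False
```

Any mechanism that changes the order of additions between runs or platforms, such as atomic accumulation or a different tiling, can therefore change the result bits.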

RepDL Design Principles

Principle 1: Correct Rounding for Basic Operations

  • Follows the IEEE-754 definition of correct rounding
  • Each basic operation returns the infinitely precise real result, rounded once under the standard IEEE-754 rounding rules
  • Leaves exactly one admissible result per operation, eliminating ambiguity in numerical precision
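
One way to picture correct rounding is to compute a function at much higher precision and round to the target format a single time. The sketch below does this for the exponential with Python's decimal module. It is not RepDL's implementation: the function name exp_rounded_once and the 50-digit working precision are assumptions, the target is binary64 rather than float32, and a production library would also have to prove that the working precision suffices for every input.

```python
import math
from decimal import Decimal, localcontext


def exp_rounded_once(x: float) -> float:
    """Approximate a correctly rounded exp(x) for a binary64 input.

    The input is converted to Decimal exactly, exp is evaluated with
    50 significant digits, and the result is rounded to binary64 once.
    """
    with localcontext() as ctx:
        ctx.prec = 50  # assumption: far more digits than binary64 can hold
        high_precision = Decimal(x).exp()
    return float(high_precision)


print(exp_rounded_once(1.0) == math.e)  # True
print(exp_rounded_once(0.0))            # 1.0
```

Because the result is defined by the mathematical value and the rounding rule alone, every platform that implements the function this way returns the same bits.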

Principle 2: Order Invariance for Composite Operations

  • Maintains order invariance for combinations of basic operations
  • Implements each composite operation with the same basic operations, applied in the same order, on every platform
  • Assigns different APIs for different computation orders

Implementation Details

1. Ensuring Correct Rounding

  • Utilizes correctly rounded mathematical libraries or high-precision algorithms
  • Implements correctly rounded versions of arithmetic operations, square root, exponential, logarithmic functions, etc.
  • Avoids hardware-dependent implementation variations

2. Fixed Summation Order

Provides two summation orders:

  • Sequential Summation: Default version, cache-friendly, suitable for most cases
  • Pairwise Summation: Alternative version, increases parallelism
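
The two summation orders can be sketched as follows. The functions are plain Python illustrations with names chosen here (sequential_sum, pairwise_sum), not RepDL's kernels, and they operate on binary64 values rather than float32.

```python
def sequential_sum(values):
    """Add elements strictly left to right."""
    total = 0.0
    for value in values:
        total += value
    return total


def pairwise_sum(values):
    """Split the list in half, sum each half recursively, add the halves."""
    if len(values) == 0:
        return 0.0
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return pairwise_sum(values[:mid]) + pairwise_sum(values[mid:])


data = [1e16, 1.0, -1e16, 1.0]

print(sequential_sum(data))  # 1.0
print(pairwise_sum(data))    # 0.0
```

Each function is deterministic on its own, yet the two disagree on this input (the exact sum is 2). This is why a fixed order has to be part of an operation's definition, and why two orders are treated as two different operations.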

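For fully connected and 2D convolutional layers, where B is the batch size, M and N are the output and input feature counts, O and I are the output and input channel counts, W × H is the output feature-map size, and K_w × K_h is the kernel size: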

  • Fully connected layer: t_fc = B × M independent summation tasks, each summing n_fc = N elements
  • Convolutional layer: t_conv = B × O × W × H independent summation tasks, each summing n_conv = I × K_w × K_h elements
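
To make the fully connected case concrete, the sketch below computes a layer as B × M independent sums of N products, each accumulated in index order. It is a naive pure-Python illustration, not RepDL's kernel, and it assumes the conventional reading of the symbols: x has shape B × N and weight has shape M × N.

```python
def fully_connected(x, weight):
    """Naive fully connected layer without bias.

    x is a list of B rows with N values each; weight is a list of M rows
    with N values each. The result has B rows with M values each.
    """
    output = []
    for row in x:              # B samples
        out_row = []
        for w_row in weight:   # M output features
            total = 0.0
            # One summation task: N products added in index order.
            for x_value, w_value in zip(row, w_row):
                total += x_value * w_value
            out_row.append(total)
        output.append(out_row)
    return output


def fc_workload(batch, out_features, in_features):
    """Return (number of independent sums, elements per sum)."""
    return batch * out_features, in_features


print(fully_connected([[1.0, 2.0]], [[3.0, 4.0], [5.0, 6.0]]))  # [[11.0, 17.0]]
print(fc_workload(batch=1, out_features=2, in_features=2))       # (2, 2)
```

The outer two loops are independent of each other and can run in parallel in any arrangement; only the innermost accumulation order affects the result bits.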

3. Computation Graph Definition

  • Explicitly defines computation order using computation graphs
  • Assigns different API names to different computation graph implementations of identical functions
  • Avoids mathematically equivalent but floating-point-different transformations
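
A small example of a transformation that is mathematically equivalent but differs in floating point is replacing a division by a multiplication with the reciprocal. In the sketch below the two computation graphs get two distinct names, in the spirit of the naming rule above. The names div and mul_rcp are hypothetical and are not taken from the RepDL API.

```python
def div(x: float, y: float) -> float:
    """Graph 1: a single division, rounded once."""
    return x / y


def mul_rcp(x: float, y: float) -> float:
    """Graph 2: reciprocal, then multiplication, rounded twice."""
    return x * (1.0 / y)


# n / n is exactly 1 in real arithmetic; collect the n where graph 2 disagrees.
mismatches = [
    n for n in range(1, 100)
    if div(float(n), float(n)) != mul_rcp(float(n), float(n))
]

print(mismatches)
```

Because a compiler or library that silently swaps one graph for the other can change the result, keeping the two under separate names makes the choice explicit to the caller.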

4. Compilation Options

  • Disables options causing unsafe mathematical optimizations
  • Disables floating-point expression contraction, which would otherwise let the compiler fuse a multiplication and an addition into a single FMA operation
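
Contraction matters for reproducibility because a fused multiply-add rounds once, whereas a separate multiplication and addition round twice. The sketch below emulates the fused form with exact rational arithmetic, since not every Python version exposes a hardware FMA. The function names and the input values are chosen here for illustration.

```python
from fractions import Fraction


def multiply_then_add(a: float, b: float, c: float) -> float:
    """Unfused: a * b is rounded, then the sum is rounded again."""
    return a * b + c


def emulated_fma(a: float, b: float, c: float) -> float:
    """Fused: a * b + c is formed exactly and rounded a single time."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


a = 1.0 + 2.0 ** -30
c = -(1.0 + 2.0 ** -29)

print(multiply_then_add(a, a, c))        # 0.0
print(emulated_fma(a, a, c) == 2.0 ** -60)  # True
```

The exact product a * a is 1 + 2^-29 + 2^-60. The unfused form loses the last term when the product is rounded, while the fused form keeps it, so whether the compiler contracts the expression decides the result.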

Experimental Setup

Supported Features

  • Data Types: Supports single-precision floating point (float32) only
  • Compatibility: Provides PyTorch-compatible APIs
  • Operation Support: Deep learning operations, differentiable functions, neural network modules, optimizers

Performance Analysis

Using ResNet-50 as an example:

  • Convolutional layers dominate computational complexity
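  • Several convolutional layers have t_conv = B × 256 × 56 × 56 = B × 802,816 independent summation tasks
  • An NVIDIA A100 GPU contains 6,912 CUDA cores
  • Even with B = 1, t_conv far exceeds the core count, so the independent tasks alone keep every core busy and sequential summation within each task remains efficient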
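
The arithmetic behind this argument can be checked directly. The sketch below uses only the figures quoted above, and the function and constant names are introduced here for illustration.

```python
import math

A100_CUDA_CORES = 6912


def conv_summation_tasks(batch, out_channels=256, out_width=56, out_height=56):
    """Number of independent summation tasks, t_conv = B * O * W * H."""
    return batch * out_channels * out_width * out_height


tasks = conv_summation_tasks(batch=1)
tasks_per_core = tasks / A100_CUDA_CORES
rounds = math.ceil(tasks / A100_CUDA_CORES)

print(tasks)                     # 802816
print(round(tasks_per_core, 1))  # 116.1
print(rounds)                    # 117
```

With more than a hundred tasks per core at batch size 1, parallelism across tasks is already sufficient, so there is little to gain from also parallelizing inside a single sum.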

Experimental Results

Reproducibility Verification

RepDL achieves bit-level consistent results, ensuring:

  • Consistency across multiple executions on the same system
  • Consistency across different CPU or GPU systems
  • Complete reproducibility of training and inference processes
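
Bit-level consistency is a stricter criterion than numerical equality or closeness within a tolerance. A simple way to compare two runs is to hash the raw float32 encoding of their outputs, as in the sketch below. The helper bit_fingerprint is a hypothetical utility written for this summary, not part of RepDL.

```python
import hashlib
import struct


def bit_fingerprint(values):
    """SHA-256 digest of the values encoded as little-endian float32."""
    payload = struct.pack(f"<{len(values)}f", *values)
    return hashlib.sha256(payload).hexdigest()


run_1 = [0.5, -1.25, 3.0]
run_2 = [0.5, -1.25, 3.0]

print(bit_fingerprint(run_1) == bit_fingerprint(run_2))    # True

# 0.0 and -0.0 compare equal numerically but differ in their sign bit.
print(0.0 == -0.0)                                         # True
print(bit_fingerprint([0.0]) == bit_fingerprint([-0.0]))   # False
```

Comparing digests of model outputs or weights from two systems gives a compact check that the results agree bit for bit.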

Performance Impact

  • Switching from non-deterministic libraries to RepDL incurs a minor performance degradation
  • The authors consider this degradation acceptable and expect future optimizations to reduce it

Related Work

The paper draws on several related research areas:

  1. Reproducible Floating-Point Summation Algorithms: Ahrens et al.'s order-independent summation algorithms
  2. Deep Learning Reproducibility: Chen et al.'s work on training reproducible deep learning models
  3. Correctly Rounded Mathematical Libraries: MPFR library and high-performance correctly rounded mathematical libraries
  4. Numerical Precision Analysis: Research on mathematical function accuracy across different precisions

Conclusions and Discussion

Main Conclusions

RepDL addresses floating-point computation issues, providing a foundation for reliable model development and consistent model deployment. The library successfully achieves deterministic and reproducible deep learning across different computational environments.

Limitations

  1. Insufficient Performance Optimization: The current version is not fully optimized and incurs performance losses
  2. Limited Precision Support: Only single precision (float32) is supported; extending support to low-precision types remains challenging
  3. Hardware Specificity: Low-precision computation (e.g., on Tensor Cores) has non-standard, hardware-specific numerical behavior, which makes it difficult to reproduce across devices

Future Directions

  1. Further performance optimization to mitigate performance degradation
  2. Support for low-precision floating-point data types
  3. Standardization of numerical behavior for low-precision computation
  4. Expansion of community contributions and features

In-Depth Evaluation

Strengths

  1. Accurate Problem Identification: Systematically analyzes root causes of deep learning reproducibility issues
  2. Practical Solutions: Provides feasible engineering solutions rather than purely theoretical analysis
  3. Clear Design Principles: The two principles, correct rounding and order invariance, are concise and effective
  4. Good Compatibility: PyTorch API compatibility lowers adoption barriers
  5. Open-Source Contribution: Provides open-source implementation promoting community development

Weaknesses

  1. Limited Experimental Validation: Lacks large-scale experimental verification and performance benchmarking
  2. Insufficient Theoretical Analysis: Theoretical analysis of performance losses is inadequate
  3. Restricted Applicability: Float32-only support limits modern deep learning applications
  4. Missing Comparative Experiments: Lacks comparison with other reproducibility solutions

Impact

  1. Academic Value: Provides important reference for deep learning reproducibility research
  2. Practical Value: Offers solutions for application scenarios requiring strict reproducibility
  3. Industry Impact: May prompt deep learning frameworks to pay more attention to reproducibility

Applicable Scenarios

  1. Scientific Research: Research projects requiring strictly reproducible results
  2. Financial AI: Financial applications with extremely high numerical consistency requirements
  3. Medical AI: Medical diagnostic systems requiring deterministic results
  4. Model Verification: Cross-platform model deployment consistency verification

References

The paper cites 15 related references covering:

  • Reproducible floating-point summation algorithms
  • Deep learning reproducibility research
  • Correctly rounded mathematical libraries
  • IEEE floating-point standards
  • Analysis of randomness and non-determinism in deep learning

Overall Assessment: This is a practical research paper addressing reproducibility issues in deep learning. While it has limitations in experimental validation and theoretical analysis, its proposed solution has significant practical value, particularly for application scenarios requiring strict numerical consistency. The open-source release of the RepDL library provides valuable tools to the community and is expected to advance research in deep learning reproducibility.