Same model, better performance: the impact of shuffling on DNA Language Models benchmarking

Greco, Rawlik
Large Language Models are increasingly popular in genomics due to their potential to decode complex biological sequences. Hence, researchers require a standardized benchmark to evaluate the capabilities of DNA Language Models (DNA LMs). However, evaluating DNA LMs is a complex task that intersects genomics' domain-specific challenges and machine learning methodologies, where seemingly minor implementation details can significantly compromise benchmark validity. We demonstrate this through BEND (Benchmarking DNA Language Models), where hardware-dependent hyperparameters (the number of data loading workers and the buffer size) create spurious performance variations of up to 4% for identical models. The problem stems from inadequate data shuffling interacting with domain-specific data characteristics. Experiments with three DNA language models (HyenaDNA, DNABERT-2, ResNet-LM) show these artifacts affect both absolute performance and relative model rankings. We propose a simple solution: pre-shuffling data before storage eliminates hardware dependencies while maintaining efficiency. This work highlights how standard ML practices can interact unexpectedly with domain-specific data characteristics, with broader implications for benchmark design in specialized domains.

Basic Information

  • Paper ID: 2510.12617
  • Title: Same model, better performance: the impact of shuffling on DNA Language Models benchmarking
  • Authors: Davide Greco, Konrad Rawlik (University of Edinburgh, Baillie Gifford Pandemic Science Hub)
  • Classification: q-bio.GN cs.LG
  • Publication Date: October 15, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.12617

Abstract

Large language models are increasingly popular in genomics due to their potential for decoding complex biological sequences. Consequently, researchers require standardized benchmarks to evaluate the capabilities of DNA language models (DNA LMs). However, evaluating DNA LMs is a complex task involving the intersection of domain-specific challenges in genomics and machine learning methodology, where seemingly minor implementation details can significantly compromise benchmark validity. The authors demonstrate this through BEND (Benchmarking DNA Language Models), where hardware-related hyperparameters—number of data loading workers and buffer size—create spurious performance variations of up to 4% for identical models. The issue stems from the interaction between insufficient data shuffling and domain-specific data characteristics. Experiments using three DNA language models (HyenaDNA, DNABERT-2, ResNet-LM) show that these artifacts affect both absolute performance and relative model rankings. The authors propose a simple solution: pre-shuffling data before storage eliminates hardware dependency while maintaining efficiency.

Research Background and Motivation

Core Problem

The core problem addressed by this research is implementation bias in DNA language model benchmarking. Specifically:

  1. Hardware Dependency: Benchmark results are influenced by hardware-related hyperparameters (worker count, buffer size)
  2. Insufficient Data Shuffling: Due to the special nature of genomic data (spatial dependencies, sequence overlaps), standard machine learning practices may produce unexpected biases
  3. Evaluation Fairness: Researchers with different computational resources may obtain different benchmark results, compromising assessment fairness

Problem Significance

  1. Foundation for Scientific Progress: Standardized benchmarks are fundamental to machine learning scientific progress, enabling researchers to compare methods and track improvements
  2. Challenges in Emerging Fields: In emerging fields like genomics, domain-specific knowledge is scarce and benchmark design principles are still being established
  3. Resource Equity: Ensuring benchmarks do not favor researchers with superior computational resources

Limitations of Existing Approaches

Although the BEND benchmark framework provides a comprehensive suite of supervised genomic tasks, it has the following issues:

  1. Employs complex data loading mechanisms with a two-level shuffling strategy for handling large-scale datasets
  2. Introduces dependencies on hardware-specific hyperparameters
  3. When combined with inherent genomic data characteristics (significant overlap between consecutive DNA sequence samples), this results in insufficient data shuffling

Core Contributions

  1. Discovery and Quantification of Systematic Bias in Benchmarking: Demonstrates that hardware-related hyperparameters can cause performance variations of up to 4% for identical models
  2. Detailed Problem Analysis: Provides in-depth analysis of the interaction between data shuffling mechanisms in the WebDataset framework and genomic data characteristics
  3. Simple and Effective Solution: Proposes a pre-shuffling method that eliminates hardware dependency while maintaining or improving performance across all tasks
  4. Cross-Architecture Validation: Verifies the universality of the problem and effectiveness of the solution across three different DNA language model architectures
  5. Best Practice Guidance for Benchmark Design: Provides concrete empirical insights and recommendations for benchmark design in specialized domains

Methodology Details

Problem Analysis

Data Processing Pipeline in BEND Framework

  1. Embedding Generation: Extract DNA sequences from reference genomes and generate embeddings using language models
  2. Downstream Model Training: Train downstream models using generated embeddings paired with labels
  3. Evaluation: Downstream models process embeddings of test DNA sequences, and their predictions are compared with ground-truth labels

WebDataset Storage and Loading Mechanism

BEND uses the WebDataset framework to store, load, and shuffle embeddings:

  • Shard Storage: Embeddings are stored in .tar files (shards)
  • Worker Assignment: Each shard is assigned to a single worker
  • Buffer Shuffling: Each worker has its own buffer, shuffling only samples from shards assigned to that worker
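To make the buffer shuffling step concrete, the following is a minimal sketch of a streaming shuffle buffer in plain Python. It is an illustration written for this summary, not WebDataset's actual code; the function name `buffer_shuffle` and the sample counts are assumptions. It shows that the sample emitted at position k can only come from the first k + buffer_size items read, so a small buffer leaves samples close to their stored order.

```python
import random
from typing import Iterable, Iterator, List


def buffer_shuffle(stream: Iterable[int], buffer_size: int,
                   rng: random.Random) -> Iterator[int]:
    """Streaming shuffle (hypothetical helper, not WebDataset's implementation).

    Holds at most `buffer_size` items; whenever a new item pushes the buffer
    over that size, one randomly chosen item is emitted.
    """
    buffer: List[int] = []
    for item in stream:
        buffer.append(item)
        if len(buffer) > buffer_size:
            # Move a random element to the end and emit it.
            idx = rng.randrange(len(buffer))
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            yield buffer.pop()
    # End of stream: emit whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer


# Samples are numbered by their position in storage order.
order = list(buffer_shuffle(range(1000), buffer_size=50, rng=random.Random(0)))

# The first batch of 10 samples can only contain items from the first 60 stored.
print(max(order[:10]) < 60)  # prints: True
```

With `buffer_size=0` the sketch degenerates to sequential reading, which corresponds to the "No Buffer" setting discussed in the results.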

Data Access Pattern Analysis

The paper analyzes data access patterns under different configurations through visualization:

  • No Shuffling: Sequential data access
  • BEND (1 worker): Shards accessed sequentially, with samples read sequentially within each shard
  • BEND (maximum workers): Multiple shards accessed in parallel, improving inter-batch sample diversity but not intra-batch diversity
  • Pre-shuffling: Ensures good sample diversity regardless of worker count
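The access patterns listed above can be reproduced with a toy loader. This is a sketch under stated assumptions, not BEND's or WebDataset's code: shards are assigned to workers round-robin, each worker builds its own batches, the main process takes one batch per worker in turn, and buffer shuffling is left out to isolate the effect of the worker count. All function names are hypothetical.

```python
from itertools import zip_longest
from typing import List


def make_shards(n_samples: int, shard_size: int) -> List[List[int]]:
    """Split genome-ordered sample indices 0..n_samples-1 into consecutive shards."""
    return [
        list(range(start, min(start + shard_size, n_samples)))
        for start in range(0, n_samples, shard_size)
    ]


def worker_batches(shards: List[List[int]], n_workers: int,
                   batch_size: int) -> List[List[int]]:
    """Toy loader: worker w reads shards w, w + n_workers, ... in order, builds
    its own batches, and the main process takes one batch per worker in turn."""
    per_worker = []
    for w in range(n_workers):
        samples = [x for shard in shards[w::n_workers] for x in shard]
        per_worker.append(
            [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]
        )
    batches: List[List[int]] = []
    for group in zip_longest(*per_worker):
        batches.extend(b for b in group if b is not None)
    return batches


def batch_span(batch: List[int]) -> int:
    """Distance between the smallest and largest genomic index in a batch."""
    return max(batch) - min(batch)


shards = make_shards(n_samples=1000, shard_size=100)
one_worker = worker_batches(shards, n_workers=1, batch_size=10)
four_workers = worker_batches(shards, n_workers=4, batch_size=10)

# Within a batch, samples remain neighbours in both settings.
print(max(batch_span(b) for b in one_worker),
      max(batch_span(b) for b in four_workers))   # prints: 9 9
# Between consecutive batches, more workers means a larger jump along the genome.
print(one_worker[1][0] - one_worker[0][0],
      four_workers[1][0] - four_workers[0][0])    # prints: 10 100
```

In this toy model, adding workers changes which region consecutive batches come from, but each batch still consists of adjacent samples, which matches the inter-batch versus intra-batch distinction described above.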

Solution: Pre-shuffling Method

Core Concept

Shuffle data annotations before storing them into shards, ensuring samples from any part of the dataset can be stored in any shard.

Implementation Details

  1. Preprocessing Stage: Shuffle sequence annotations before embedding generation
  2. Storage Stage: Store shuffled data into shards
  3. Loading Stage: Normal WebDataset loading process, but since data is pre-shuffled, worker count no longer affects sample diversity
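The three stages above might be sketched as follows. This is a minimal illustration, assuming annotations fit in memory and using integer genomic positions as stand-ins for annotation rows; the helper `pre_shuffle_into_shards` and the fixed seed are hypothetical and not taken from the paper's code.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def pre_shuffle_into_shards(annotations: Sequence[T], shard_size: int,
                            seed: int = 0) -> List[List[T]]:
    """Shuffle annotation rows once, then cut them into shards.

    Hypothetical helper: in a real pipeline each shard would be written to a
    .tar file after embedding generation instead of being kept as a list.
    """
    rows = list(annotations)
    random.Random(seed).shuffle(rows)  # fixed seed keeps the benchmark reproducible
    return [rows[i:i + shard_size] for i in range(0, len(rows), shard_size)]


# Stand-in annotations: sample i sits at genomic position i.
shards = pre_shuffle_into_shards(range(1000), shard_size=100, seed=0)

# Every shard now mixes samples from across the whole dataset, so reading a
# shard sequentially no longer yields neighbouring sequences.
print(min(max(s) - min(s) for s in shards))
```

Because the mixing happens once at storage time, the loader can stay unchanged, and any worker count or buffer size sees shards that are already diverse.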

Advantages

  1. Hardware Independence: Eliminates dependency on worker count and buffer size
  2. Efficiency Preservation: Does not alter BEND's implementation details, maintaining original efficiency
  3. Performance Improvement: Maintains or improves performance across all tasks

Experimental Setup

Datasets

Seven tasks from the BEND benchmark framework:

  • Supervised Tasks: CpG methylation, histone modification, chromatin accessibility, gene discovery, enhancer annotation
  • Unsupervised Tasks: Noncoding variant effect prediction for expression and disease

Models

Three DNA language models with different architectures:

  1. HyenaDNA-tiny-1k: Model based on Hyena architecture
  2. DNABERT-2: BERT-based DNA language model
  3. ResNet-LM: Baseline model proposed by BEND

Evaluation Metrics

  • AUROC: For CpG methylation and histone modification tasks
  • MCC: For gene discovery task
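For readers less familiar with MCC (Matthews correlation coefficient), here is a small sketch of the binary form computed from confusion-matrix counts. It is an assumption that the binary form is the right illustration: if the task is multi-class, the generalised multi-class MCC would apply instead. The function name and the example counts are made up for illustration.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Binary Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero, a common convention.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


print(mcc(tp=50, tn=50, fp=0, fn=0))    # perfect prediction, prints: 1.0
print(mcc(tp=25, tn=25, fp=25, fn=25))  # chance-level prediction, prints: 0.0
```

MCC ranges from -1 to 1, so values such as 0.076 and 0.115 sit close to chance level, and small absolute shifts are large in relative terms.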

Experimental Design

  1. Hyperparameter Impact Experiment: Compare performance impact of different worker counts and buffer sizes
  2. Cross-Architecture Validation: Verify pre-shuffling method effectiveness across three model architectures
  3. Data Characteristic Analysis: Analyze overlap of consecutive sequences across different tasks

Experimental Results

Main Results

Hyperparameter Impact

Table 1: Test Results for HyenaDNA-tiny-1k under Different Hyperparameter Configurations

Task                  Metric  Max Workers  1 Worker  1000 Buffer  No Buffer
CpG Methylation       AUROC   0.878        0.868     -            -
Histone Modification  AUROC   0.766        0.756     -            -
Gene Discovery        MCC     -            -         0.115        0.076

Pre-shuffling Results: Achieved optimal or near-optimal performance across all configurations, eliminating hardware dependency.

Cross-Architecture Validation

Table 2: Comparison of Three Models on CpG Methylation Task (AUROC)

Model             BEND   Pre-shuffled  Improvement
HyenaDNA-tiny-1k  0.868  0.900         +3.2%
DNABERT-2         0.893  0.910         +1.7%
ResNet-LM         0.890  0.919         +2.9%
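The Improvement column can be checked by hand. The short sketch below recomputes it from the two AUROC columns of Table 2 and suggests that the reported figures are absolute differences in AUROC expressed as percentage points, not relative gains. That reading is inferred from the numbers in the table, not stated in the source.

```python
# AUROC values copied from Table 2: (BEND, pre-shuffled).
scores = {
    "HyenaDNA-tiny-1k": (0.868, 0.900),
    "DNABERT-2": (0.893, 0.910),
    "ResNet-LM": (0.890, 0.919),
}


def absolute_points(bend: float, pre: float) -> float:
    """Difference in AUROC, expressed in percentage points."""
    return round((pre - bend) * 100, 1)


def relative_percent(bend: float, pre: float) -> float:
    """Gain relative to the BEND score, in percent."""
    return round((pre - bend) / bend * 100, 1)


for model, (bend, pre) in scores.items():
    print(model, absolute_points(bend, pre), relative_percent(bend, pre))
# HyenaDNA-tiny-1k 3.2 3.7
# DNABERT-2 1.7 1.9
# ResNet-LM 2.9 3.3
```

The absolute differences (3.2, 1.7, 2.9) match the table, while the relative gains are slightly larger.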

Key Findings

Data Overlap Analysis

Table 3: Sequence Overlap Characteristics Across Tasks

Task                     Overlapping Sequence %  Median Overlapping Nucleotide %  Weighted Overlap %
CpG Methylation          51.88%                  87.70%                           45.50%
Histone Modification     17.03%                  19.92%                           3.39%
Gene Discovery           7.09%                   12.39%                           0.88%
Enhancer Annotation      1.75%                   49.27%                           0.86%
Chromatin Accessibility  28.29%                  20.31%                           5.75%

The CpG methylation task exhibits the highest sequence overlap, explaining why this task benefits most from pre-shuffling.
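The Weighted Overlap column in Table 3 appears to be the product of the other two columns (share of overlapping sequences times median share of overlapping nucleotides). This is an observation from the reported numbers, not a definition taken from the source; the sketch below checks it and ranks tasks by the result.

```python
# Values copied from Table 3:
# (overlapping sequence %, median overlapping nucleotide %, weighted overlap %).
overlap = {
    "CpG Methylation": (51.88, 87.70, 45.50),
    "Histone Modification": (17.03, 19.92, 3.39),
    "Gene Discovery": (7.09, 12.39, 0.88),
    "Enhancer Annotation": (1.75, 49.27, 0.86),
    "Chromatin Accessibility": (28.29, 20.31, 5.75),
}


def weighted_overlap(seq_pct: float, nuc_pct: float) -> float:
    """Assumed definition: product of the two percentages, as a percentage."""
    return seq_pct * nuc_pct / 100


for task, (seq_pct, nuc_pct, reported) in overlap.items():
    print(f"{task}: computed {weighted_overlap(seq_pct, nuc_pct):.2f}, "
          f"reported {reported:.2f}")

# Rank tasks from most to least overlap under this assumed definition.
ranked = sorted(overlap, key=lambda t: weighted_overlap(*overlap[t][:2]),
                reverse=True)
print(ranked[0])  # prints: CpG Methylation
```

Under this reading, CpG methylation has roughly eight times the weighted overlap of the next task, which is consistent with it benefiting most from pre-shuffling.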

Model Ranking Changes

Pre-shuffling not only improves absolute performance but also changes relative model rankings:

  • Under BEND Configuration: DNABERT-2 ≈ ResNet-LM > HyenaDNA-tiny-1k
  • After Pre-shuffling: ResNet-LM > DNABERT-2 > HyenaDNA-tiny-1k

Related Work

Benchmarking Frameworks

  • BEND: First comprehensive benchmark framework specifically designed for DNA language models
  • WebDataset: High-performance I/O framework for large-scale deep learning

DNA Language Models

  • HyenaDNA: Long-range genomic sequence modeling at single nucleotide resolution
  • DNABERT-2: Efficient foundation model for multi-species genomes
  • ResNet-LM: Baseline model based on residual networks

Benchmark Design Best Practices

The paper contributes practical experience to the benchmarking field, particularly in specialized domains where standard ML practices may produce unexpected consequences.

Conclusions and Discussion

Main Conclusions

  1. Hardware Dependency Problem: Hyperparameters chosen based on computational resources (worker count and buffer size) inadvertently affect benchmark results
  2. Architecture Independence: Models with different backbone architectures all benefit from proper shuffling, with performance improvements up to 4%
  3. Ranking Impact: Improper shuffling affects not only absolute performance but also relative rankings between models
  4. Simple and Effective Solution: Pre-shuffling data is a simple fix to decouple benchmark performance from hardware-specific hyperparameters

Limitations

  1. Framework-Specific: Research primarily targets the BEND framework; other benchmark frameworks may have different issues
  2. Task Coverage: While multiple tasks are tested, they remain limited to the task set provided by BEND
  3. Model Scope: Only three model architectures are tested, potentially not covering all types of DNA language models

Future Directions

  1. Extension to Other Benchmarks: Apply discovered problems and solutions to other bioinformatics benchmarks
  2. Automated Detection: Develop tools to automatically detect potential biases in benchmark implementations
  3. Comprehensive Best Practice Guidelines: Establish more comprehensive guidance for benchmark design in specialized domains

In-Depth Evaluation

Strengths

  1. High Practical Value: Identifies important problems in actual benchmarking and provides immediately applicable solutions
  2. In-Depth Analysis: Clearly demonstrates problem origins through visualization and quantitative analysis
  3. Sufficient Validation: Verifies problem universality and solution effectiveness across multiple models and tasks
  4. Clear Writing: Well-structured paper with easily understandable problem description and solutions
  5. Open-Source Contribution: Provides publicly available code implementation

Weaknesses

  1. Accidental Problem Discovery: Paper lacks systematic methods to prevent or detect similar problems
  2. Insufficient Theoretical Analysis: Lacks theoretical explanation for why certain tasks are more affected than others
  3. Solution Limitations: While pre-shuffling is effective, it may not apply to all types of sequence data
  4. Computational Cost Analysis: Lacks detailed analysis of computational overhead of pre-shuffling method

Impact

  1. Domain Contribution: Provides important methodological improvements for DNA language model evaluation
  2. Practical Value: Directly improves BEND benchmark reliability, benefiting the entire research community
  3. Reproducibility: Provides detailed implementation and open-source code for easy reproduction and application
  4. Inspirational Value: Offers valuable experience for benchmark design in other specialized domains

Applicable Scenarios

  1. Genomics Research: All DNA language model studies using BEND benchmark
  2. Sequence Modeling: Other time series or sequence modeling tasks involving sequence overlap
  3. Benchmark Design: Benchmark framework design requiring large-scale dataset handling
  4. Distributed Training: Distributed machine learning systems requiring consideration of data loading and shuffling strategies

References

  1. Marin et al. (2024). BEND: Benchmarking DNA language models on biologically meaningful tasks.
  2. Aizman et al. (2020). High performance I/O for large scale deep learning.
  3. Nguyen et al. (2023). HyenaDNA: Long-range genomic sequence modeling at single nucleotide resolution.
  4. Zhou et al. (2023). DNABERT-2: Efficient foundation model and benchmark for multi-species genome.

Summary: This paper identifies and resolves an important practical problem in DNA language model benchmarking. While the problem itself is relatively simple, its impact is far-reaching. The paper's value lies in reminding the research community that seemingly minor implementation details can have significant effects on benchmark results, and in providing a practical fix. This matters for ensuring the fairness and reliability of benchmarking.