
Same model, better performance: the impact of shuffling on DNA Language Models benchmarking

Greco, Rawlik

Large Language Models are increasingly popular in genomics due to their potential to decode complex biological sequences. Hence, researchers require a standardized benchmark to evaluate the capabilities of DNA Language Models (DNA LMs). However, evaluating DNA LMs is a complex task that intersects genomics' domain-specific challenges and machine learning methodologies, where seemingly minor implementation details can significantly compromise benchmark validity. We demonstrate this through BEND (Benchmarking DNA Language Models), where hardware-dependent hyperparameters (number of data loading workers and buffer sizes) create spurious performance variations of up to 4% for identical models. The problem stems from inadequate data shuffling interacting with domain-specific data characteristics. Experiments with three DNA language models (HyenaDNA, DNABERT-2, ResNet-LM) show these artifacts affect both absolute performance and relative model rankings. We propose a simple solution: pre-shuffling data before storage eliminates hardware dependencies while maintaining efficiency. This work highlights how standard ML practices can interact unexpectedly with domain-specific data characteristics, with broader implications for benchmark design in specialized domains.



Basic Information

Paper ID: 2510.12617
Title: Same model, better performance: the impact of shuffling on DNA Language Models benchmarking
Authors: Davide Greco, Konrad Rawlik (University of Edinburgh, Baillie Gifford Pandemic Science Hub)
Classification: q-bio.GN cs.LG
Publication Date: October 15, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.12617

Abstract

Large language models are increasingly popular in genomics due to their potential for decoding complex biological sequences. Consequently, researchers require standardized benchmarks to evaluate the capabilities of DNA language models (DNA LMs). However, evaluating DNA LMs is a complex task involving the intersection of domain-specific challenges in genomics and machine learning methodology, where seemingly minor implementation details can significantly compromise benchmark validity. The authors demonstrate this through BEND (Benchmarking DNA Language Models), where hardware-related hyperparameters (number of data loading workers and buffer size) create spurious performance variations of up to 4% for identical models. The issue stems from the interaction between insufficient data shuffling and domain-specific data characteristics. Experiments using three DNA language models (HyenaDNA, DNABERT-2, ResNet-LM) show that these artifacts affect both absolute performance and relative model rankings. The authors propose a simple solution: pre-shuffling data before storage eliminates hardware dependency while maintaining efficiency.

Research Background and Motivation

Core Problem

The core problem addressed by this research is implementation bias in DNA language model benchmarking. Specifically:

Hardware Dependency: Benchmark results are influenced by hardware-related hyperparameters (worker count, buffer size)
Insufficient Data Shuffling: Due to the special nature of genomic data (spatial dependencies, sequence overlaps), standard machine learning practices may produce unexpected biases
Evaluation Fairness: Researchers with different computational resources may obtain different benchmark results, compromising assessment fairness

Problem Significance

Foundation for Scientific Progress: Standardized benchmarks are fundamental to machine learning scientific progress, enabling researchers to compare methods and track improvements
Challenges in Emerging Fields: In emerging fields like genomics, domain-specific knowledge is scarce and benchmark design principles are still being established
Resource Equity: Ensuring benchmarks do not favor researchers with superior computational resources

Limitations of Existing Approaches

Although the BEND benchmark framework provides a comprehensive suite of supervised genomic tasks, it has the following issues:

Employs complex data loading mechanisms with a two-level shuffling strategy for handling large-scale datasets
Introduces dependencies on hardware-specific hyperparameters
When combined with inherent genomic data characteristics (significant overlap between consecutive DNA sequence samples), results in insufficient data shuffling

Core Contributions

Discovery and Quantification of Systematic Bias in Benchmarking: Demonstrates that hardware-related hyperparameters can cause performance variations of up to 4% for identical models
Detailed Problem Analysis: Provides in-depth analysis of the interaction between data shuffling mechanisms in the WebDataset framework and genomic data characteristics
Simple and Effective Solution: Proposes a pre-shuffling method that eliminates hardware dependency while maintaining or improving performance across all tasks
Cross-Architecture Validation: Verifies the universality of the problem and effectiveness of the solution across three different DNA language model architectures
Best Practice Guidance for Benchmark Design: Provides concrete empirical insights and recommendations for benchmark design in specialized domains

Methodology Details

Problem Analysis

Data Processing Pipeline in BEND Framework

Embedding Generation: Extract DNA sequences from reference genomes and generate embeddings using language models
Downstream Model Training: Train downstream models using generated embeddings paired with labels
Evaluation: Downstream models process embeddings of test DNA sequences and compare with ground truth labels

WebDataset Storage and Loading Mechanism

BEND uses the WebDataset framework to store, load, and shuffle embeddings:

Shard Storage: Embeddings are stored in .tar files (shards)
Worker Assignment: Each shard is assigned to a single worker
Buffer Shuffling: Each worker has its own buffer, shuffling only samples from shards assigned to that worker
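
The effect of a bounded shuffle buffer can be illustrated with a small simulation. This is a simplified model written for this summary, not WebDataset's actual implementation; the function names, the dataset size, and the buffer sizes are all hypothetical. It shows that a small buffer only moves samples locally, so samples that were neighbours in storage order tend to stay close together after shuffling.

```python
import random

def buffer_shuffle(stream, buffer_size, rng):
    """Streaming shuffle with a bounded buffer (simplified model, an assumption)."""
    buffer = []
    for item in stream:
        buffer.append(item)
        if len(buffer) > buffer_size:
            # Buffer is over capacity: emit one randomly chosen element.
            idx = rng.randrange(len(buffer))
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            yield buffer.pop()
    # End of stream: emit whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer

def mean_displacement(order):
    """Average distance between a sample's stored and emitted position."""
    return sum(abs(pos - original) for pos, original in enumerate(order)) / len(order)

rng = random.Random(0)
n_samples = 10_000  # hypothetical dataset stored in genomic order

small = list(buffer_shuffle(range(n_samples), buffer_size=100, rng=rng))
large = list(buffer_shuffle(range(n_samples), buffer_size=n_samples, rng=rng))

print("mean displacement, buffer=100:   ", round(mean_displacement(small), 1))
print("mean displacement, buffer=10000: ", round(mean_displacement(large), 1))
```

In this model a sample can never be emitted more than `buffer_size` positions earlier than it was stored, which is why overlapping neighbouring sequences are likely to end up in the same batch when the buffer is small.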

Data Access Pattern Analysis

The paper analyzes data access patterns under different configurations through visualization:

No Shuffling: Sequential data access
BEND (1 worker): Shards accessed sequentially, samples read sequentially within each shard
BEND (maximum workers): Multiple shards accessed in parallel, improving inter-batch sample diversity but not intra-batch diversity
Pre-shuffling: Ensures good sample diversity regardless of worker count
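
The difference between intra-batch and inter-batch diversity can be sketched with a toy loader. The model below is an assumption made for illustration: shards hold contiguous samples, each shard goes to one worker, every worker builds its own batches, and the loader returns batches from the workers in round-robin order. The buffer shuffle is left out to keep the example short, and all names and sizes are hypothetical.

```python
from itertools import zip_longest

def make_shards(n_samples, shard_size):
    """Contiguous shards, as if samples were stored in genomic order."""
    return [list(range(s, min(s + shard_size, n_samples)))
            for s in range(0, n_samples, shard_size)]

def worker_batches(shards, batch_size):
    """One worker reads its shards in order and cuts the stream into batches."""
    stream = [x for shard in shards for x in shard]
    return [stream[i:i + batch_size] for i in range(0, len(stream), batch_size)]

def loader_batches(shards, n_workers, batch_size):
    """Assign shard i to worker i % n_workers, then interleave worker batches."""
    per_worker = [worker_batches(shards[w::n_workers], batch_size)
                  for w in range(n_workers)]
    batches = []
    for round_ in zip_longest(*per_worker):
        batches.extend(b for b in round_ if b is not None)
    return batches

def mean_intra_batch_span(batches):
    """How far apart the samples inside one batch are, on average."""
    return sum(max(b) - min(b) for b in batches) / len(batches)

def mean_inter_batch_jump(batches):
    """How far consecutive batches are from each other, on average."""
    jumps = [abs(b2[0] - b1[0]) for b1, b2 in zip(batches, batches[1:])]
    return sum(jumps) / len(jumps)

shards = make_shards(n_samples=8000, shard_size=1000)
one_worker = loader_batches(shards, n_workers=1, batch_size=100)
four_workers = loader_batches(shards, n_workers=4, batch_size=100)

for name, batches in [("1 worker", one_worker), ("4 workers", four_workers)]:
    print(name, mean_intra_batch_span(batches), mean_inter_batch_jump(batches))
```

Under these assumptions, adding workers increases the distance between consecutive batches while the spread of samples inside each batch stays the same.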

Solution: Pre-shuffling Method

Core Concept

Shuffle data annotations before storing them into shards, ensuring samples from any part of the dataset can be stored in any shard.

Implementation Details

Preprocessing Stage: Shuffle sequence annotations before embedding generation
Storage Stage: Store shuffled data into shards
Loading Stage: Normal WebDataset loading process, but since data is pre-shuffled, worker count no longer affects sample diversity
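
The three stages above can be sketched as follows. This is a minimal illustration of the idea and not the authors' code: the annotation tuples, the shard size, and the helper names are hypothetical, and shards are represented as plain lists rather than .tar files.

```python
import random

def chunk(items, shard_size):
    """Split a list into consecutive shards of at most shard_size items."""
    return [items[i:i + shard_size] for i in range(0, len(items), shard_size)]

def preshuffle_into_shards(annotations, shard_size, seed=0):
    """Shuffle annotations once, before any embedding is computed or stored."""
    shuffled = list(annotations)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps shards reproducible
    return chunk(shuffled, shard_size)

def start_span(shard):
    """Genomic distance covered by the start coordinates inside one shard."""
    starts = [start for _chrom, start, _end in shard]
    return max(starts) - min(starts)

# Hypothetical annotations: overlapping 512 bp windows placed every 100 bp.
annotations = [("chr1", start, start + 512) for start in range(0, 100_000, 100)]

sequential_shards = chunk(annotations, shard_size=100)
preshuffled_shards = preshuffle_into_shards(annotations, shard_size=100)

print("span per shard, sequential: ", start_span(sequential_shards[0]))
print("span per shard, pre-shuffled:", start_span(preshuffled_shards[0]))
```

Because each shard already mixes samples from across the dataset, any worker reading any shard sees diverse samples, whatever the worker count or buffer size.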

Advantages

Hardware Independence: Eliminates dependency on worker count and buffer size
Efficiency Preservation: Does not alter BEND's implementation details, maintaining original efficiency
Performance Improvement: Maintains or improves performance across all tasks

Experimental Setup

Datasets

Seven tasks from the BEND benchmark framework:

Supervised Tasks: CpG methylation, histone modification, chromatin accessibility, gene discovery, enhancer annotation
Unsupervised Tasks: Noncoding variant effect prediction for expression and disease

Models

Three DNA language models with different architectures:

HyenaDNA-tiny-1k: Model based on Hyena architecture
DNABERT-2: BERT-based DNA language model
ResNet-LM: Baseline model proposed by BEND

Evaluation Metrics

AUROC: For CpG methylation and histone modification tasks
MCC: For gene discovery task
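
For reference, both metrics can be computed in a few lines. The sketch below uses the standard definitions: AUROC as the probability that a positive sample is scored above a negative one, and the binary form of the Matthews correlation coefficient. It is an assumption that the binary form is adequate here; the benchmark's gene discovery task may use a multi-class variant.

```python
import math

def auroc(labels, scores):
    """Probability that a positive outscores a negative (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mcc(labels, preds):
    """Binary Matthews correlation coefficient from the confusion matrix."""
    tp = sum(y == 1 and p == 1 for y, p in zip(labels, preds))
    tn = sum(y == 0 and p == 0 for y, p in zip(labels, preds))
    fp = sum(y == 0 and p == 1 for y, p in zip(labels, preds))
    fn = sum(y == 1 and p == 0 for y, p in zip(labels, preds))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0 when a row or column of the confusion matrix is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0

print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 3 of 4 pairs ranked correctly
print(mcc([1, 1, 0, 0], [1, 0, 0, 0]))
```

AUROC ranges from 0 to 1 with 0.5 for random scores, while MCC ranges from -1 to 1 with 0 for random predictions, so differences on the two scales are not directly comparable.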

Experimental Design

Hyperparameter Impact Experiment: Compare performance impact of different worker counts and buffer sizes
Cross-Architecture Validation: Verify pre-shuffling method effectiveness across three model architectures
Data Characteristic Analysis: Analyze overlap of consecutive sequences across different tasks

Experimental Results

Main Results

Hyperparameter Impact

Table 1: Test Results for HyenaDNA-tiny-1k under Different Hyperparameter Configurations

Task	Metric	Max Workers	1 Worker	1000 Buffer	No Buffer
CpG Methylation	AUROC	0.878	0.868	-	-
Histone Modification	AUROC	0.766	0.756	-	-
Gene Discovery	MCC	-	-	0.115	0.076

Pre-shuffling Results: Achieved optimal or near-optimal performance across all configurations, eliminating hardware dependency.
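
The size of the hardware-dependent gap can be read directly from Table 1. The short calculation below only restates the table's numbers; reading the largest gap (on the MCC scale) as the "up to 4%" figure quoted elsewhere is an interpretation, not something the table states.

```python
# Values copied from Table 1 (HyenaDNA-tiny-1k); missing cells are omitted.
table1 = {
    ("CpG Methylation", "AUROC"): {"Max Workers": 0.878, "1 Worker": 0.868},
    ("Histone Modification", "AUROC"): {"Max Workers": 0.766, "1 Worker": 0.756},
    ("Gene Discovery", "MCC"): {"1000 Buffer": 0.115, "No Buffer": 0.076},
}

# Gap between the best and worst configuration reported for each task.
spreads = {
    task: round(max(results.values()) - min(results.values()), 3)
    for (task, _metric), results in table1.items()
}

for task, spread in spreads.items():
    print(f"{task}: {spread:.3f}")
```

The two AUROC tasks differ by 0.010 between worker settings, and gene discovery differs by 0.039 MCC between buffer settings.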

Cross-Architecture Validation

Table 2: Comparison of Three Models on CpG Methylation Task (AUROC)

Model	BEND	Pre-shuffled	Improvement (absolute, percentage points)
HyenaDNA-tiny-1k	0.868	0.900	+3.2
DNABERT-2	0.893	0.910	+1.7
ResNet-LM	0.890	0.919	+2.9
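
The Improvement column of Table 2 corresponds to the absolute AUROC difference expressed in percentage points, not to the relative change. The arithmetic below recomputes both from the table's BEND and Pre-shuffled columns so the two readings can be compared.

```python
# (BEND, Pre-shuffled) AUROC values copied from Table 2.
table2 = {
    "HyenaDNA-tiny-1k": (0.868, 0.900),
    "DNABERT-2": (0.893, 0.910),
    "ResNet-LM": (0.890, 0.919),
}

improvements = {}
for model, (bend, preshuffled) in table2.items():
    absolute_points = round(100 * (preshuffled - bend), 1)          # percentage points
    relative_pct = round(100 * (preshuffled - bend) / bend, 1)      # relative change
    improvements[model] = (absolute_points, relative_pct)
    print(f"{model}: +{absolute_points} points, +{relative_pct}% relative")
```

The relative gains are slightly larger than the absolute ones because every baseline AUROC is below 1.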

Key Findings

Data Overlap Analysis

Table 3: Sequence Overlap Characteristics Across Tasks

Task	Overlapping Sequence %	Median Overlapping Nucleotide %	Weighted Overlap %
CpG Methylation	51.88%	87.70%	45.50%
Histone Modification	17.03%	19.92%	3.39%
Gene Discovery	7.09%	12.39%	0.88%
Enhancer Annotation	1.75%	49.27%	0.86%
Chromatin Accessibility	28.29%	20.31%	5.75%

The CpG methylation task exhibits the highest sequence overlap on all three measures, which is consistent with this task benefiting most from pre-shuffling.
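
Overlap statistics of this kind could be computed along the following lines. The definitions in `overlap_stats` are assumptions made for illustration (overlap is measured against the preceding sample in storage order, and the weighted value is the product of the other two); the summary does not give the paper's exact definitions. The values in Table 3 are, however, numerically consistent with the weighted column being that product, which the second half of the sketch checks.

```python
from statistics import median

def overlap_stats(intervals):
    """intervals: (start, end) pairs, half-open, in storage order on one chromosome.

    Returns (% of sequences overlapping their predecessor,
             median % of nucleotides shared among overlapping sequences,
             product of the two, in %). Definitions are assumptions.
    """
    fractions = []
    for (s1, e1), (s2, e2) in zip(intervals, intervals[1:]):
        shared = max(0, min(e1, e2) - max(s1, s2))
        if shared > 0:
            fractions.append(shared / (e2 - s2))
    pct_overlapping = 100 * len(fractions) / len(intervals)
    median_pct = 100 * median(fractions) if fractions else 0.0
    return pct_overlapping, median_pct, pct_overlapping * median_pct / 100

# Toy example: two pairs of windows, each pair sharing half of its nucleotides.
toy = [(0, 100), (50, 150), (200, 300), (250, 350)]
print(overlap_stats(toy))

# Rows copied from Table 3: (overlapping %, median nucleotide %, weighted %).
table3 = {
    "CpG Methylation": (51.88, 87.70, 45.50),
    "Histone Modification": (17.03, 19.92, 3.39),
    "Gene Discovery": (7.09, 12.39, 0.88),
    "Enhancer Annotation": (1.75, 49.27, 0.86),
    "Chromatin Accessibility": (28.29, 20.31, 5.75),
}
product_gap = {
    task: abs(seq_pct * nuc_pct / 100 - weighted)
    for task, (seq_pct, nuc_pct, weighted) in table3.items()
}
print(max(product_gap.values()))  # largest mismatch, in percentage points
```

Seen this way, enhancer annotation has a high median overlap but very few overlapping sequences, so its weighted overlap stays below 1%.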

Model Ranking Changes

Pre-shuffling not only improves absolute performance but also changes relative model rankings:

Under BEND Configuration: DNABERT-2 ≈ ResNet-LM > HyenaDNA-tiny-1k
After Pre-shuffling: ResNet-LM > DNABERT-2 > HyenaDNA-tiny-1k
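
The ranking change follows directly from the CpG methylation scores in Table 2, as the short sketch below shows. Note that the 0.003 AUROC gap between DNABERT-2 and ResNet-LM under the BEND configuration is what the text above treats as approximately equal; a strict sort places DNABERT-2 first.

```python
# CpG methylation AUROC values copied from Table 2.
scores = {
    "HyenaDNA-tiny-1k": {"BEND": 0.868, "Pre-shuffled": 0.900},
    "DNABERT-2": {"BEND": 0.893, "Pre-shuffled": 0.910},
    "ResNet-LM": {"BEND": 0.890, "Pre-shuffled": 0.919},
}

def ranking(config):
    """Models ordered from best to worst AUROC under one configuration."""
    return sorted(scores, key=lambda model: scores[model][config], reverse=True)

bend_rank = ranking("BEND")
preshuffled_rank = ranking("Pre-shuffled")

print("BEND:        ", " > ".join(bend_rank))
print("Pre-shuffled:", " > ".join(preshuffled_rank))
print("Ranking changed:", bend_rank != preshuffled_rank)
```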

Related Work

Benchmarking Frameworks

BEND: First comprehensive benchmark framework specifically designed for DNA language models
WebDataset: Large-scale deep learning framework for high-performance I/O

DNA Language Models

HyenaDNA: Long-range genomic sequence modeling at single nucleotide resolution
DNABERT-2: Efficient foundation model for multi-species genomes
ResNet-LM: Baseline model based on residual networks

Benchmark Design Best Practices

The paper contributes practical experience to the benchmarking field, particularly in specialized domains where standard ML practices may produce unexpected consequences.

Conclusions and Discussion

Main Conclusions

Hardware Dependency Problem: Hyperparameters chosen based on computational resources (worker count and buffer size) inadvertently affect benchmark results
Architecture Independence: Models with different backbone architectures all benefit from proper shuffling, with performance improvements up to 4%
Ranking Impact: Improper shuffling affects not only absolute performance but also relative rankings between models
Simple and Effective Solution: Pre-shuffling data is a simple fix to decouple benchmark performance from hardware-specific hyperparameters

Limitations

Framework-Specific: Research primarily targets the BEND framework; other benchmark frameworks may have different issues
Task Coverage: While multiple tasks are tested, they remain limited to the task set provided by BEND
Model Scope: Only three model architectures are tested, potentially not covering all types of DNA language models

Future Directions

Extension to Other Benchmarks: Apply discovered problems and solutions to other bioinformatics benchmarks
Automated Detection: Develop tools to automatically detect potential biases in benchmark implementations
Comprehensive Best Practice Guidelines: Establish more comprehensive guidance for benchmark design in specialized domains

In-Depth Evaluation

Strengths

High Practical Value: Identifies important problems in actual benchmarking and provides immediately applicable solutions
In-Depth Analysis: Clearly demonstrates problem origins through visualization and quantitative analysis
Sufficient Validation: Verifies problem universality and solution effectiveness across multiple models and tasks
Clear Writing: Well-structured paper with easily understandable problem description and solutions
Open-Source Contribution: Provides publicly available code implementation

Weaknesses

Accidental Problem Discovery: Paper lacks systematic methods to prevent or detect similar problems
Insufficient Theoretical Analysis: Lacks theoretical explanation for why certain tasks are more affected than others
Solution Limitations: While pre-shuffling is effective, it may not apply to all types of sequence data
Computational Cost Analysis: Lacks detailed analysis of computational overhead of pre-shuffling method

Impact

Domain Contribution: Provides important methodological improvements for DNA language model evaluation
Practical Value: Directly improves BEND benchmark reliability, benefiting the entire research community
Reproducibility: Provides detailed implementation and open-source code for easy reproduction and application
Inspirational Value: Offers valuable experience for benchmark design in other specialized domains

Applicable Scenarios

Genomics Research: All DNA language model studies using BEND benchmark
Sequence Modeling: Other time series or sequence modeling tasks involving sequence overlap
Benchmark Design: Benchmark framework design requiring large-scale dataset handling
Distributed Training: Distributed machine learning systems requiring consideration of data loading and shuffling strategies

References

Marin et al. (2024). BEND: Benchmarking DNA language models on biologically meaningful tasks.
Aizman et al. (2020). High performance I/O for large scale deep learning.
Nguyen et al. (2023). HyenaDNA: Long-range genomic sequence modeling at single nucleotide resolution.
Zhou et al. (2023). DNABERT-2: Efficient foundation model and benchmark for multi-species genome.

Summary: This paper identifies and resolves an important practical problem in DNA language model benchmarking. While the problem itself is relatively simple, its impact is far-reaching. The paper's value lies in reminding the research community that seemingly minor implementation details can have significant effects on benchmark results, and it provides practical solutions. This is important for ensuring fairness and reliability of benchmarking.