
Beat Detection as Object Detection

Ahn, Jung
Recent beat and downbeat tracking models (e.g., RNNs, TCNs, Transformers) output frame-level activations. We propose reframing this task as object detection, where beats and downbeats are modeled as temporal "objects." Adapting the FCOS detector from computer vision to 1D audio, we replace its original backbone with WaveBeat's temporal feature extractor and add a Feature Pyramid Network to capture multi-scale temporal patterns. The model predicts overlapping beat/downbeat intervals with confidence scores, followed by non-maximum suppression (NMS) to select final predictions. This NMS step serves a similar role to DBNs in traditional trackers, but is simpler and less heuristic. Evaluated on standard music datasets, our approach achieves competitive results, showing that object detection techniques can effectively model musical beats with minimal adaptation.

Beat Tracking as Object Detection

Basic Information

  • Paper ID: 2510.14391
  • Title: Beat Tracking as Object Detection
  • Authors: Jaehoon Ahn (Sogang University), Moon-Ryul Jung (Sogang University)
  • Categories: cs.SD (Sound), cs.AI (Artificial Intelligence), cs.LG (Machine Learning)
  • Publication Date: October 16, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.14391v1

Abstract

Recent beat and downbeat tracking models (such as RNNs, TCNs, and Transformers) output frame-level activation values. This paper proposes redefining the task as an object detection problem, modeling beats and downbeats as temporal "objects." The authors adapt the FCOS detector from computer vision to 1D audio by replacing its original backbone network with WaveBeat's temporal feature extractor and adding a Feature Pyramid Network to capture multi-scale temporal patterns. The model predicts overlapping beat/downbeat intervals with confidence scores, then applies Non-Maximum Suppression (NMS) to select the final predictions. This NMS step serves a similar role to the Dynamic Bayesian Network (DBN) in traditional trackers but is simpler and less heuristic. Evaluation on standard music datasets yields competitive results, showing that object detection techniques can effectively model musical beats with minimal adaptation.

Research Background and Motivation

Problem Definition

Beat tracking is an important research direction in Music Information Retrieval (MIR), involving computational prediction of beat and downbeat positions. Traditional methods have evolved from early onset detection to modern machine learning techniques, including RNNs, LSTMs, TCNs, and Transformers.

Limitations of Existing Methods

  1. Post-processing Complexity: Most modern beat detection networks produce per-frame activation functions, requiring post-processing with Dynamic Bayesian Networks (DBNs) to generate final beat positions
  2. DBN Deficiencies: DBNs are prone to failure during tempo changes and time signature changes, and are overly heuristic
  3. Downbeat Detection Difficulty: Downbeat detection performance is generally inferior to beat detection

Research Motivation

The authors argue that beat tracking can be viewed as a form of object detection in audio, thus attempting to use neural networks specifically designed for object detection to improve beat tracking, particularly downbeat tracking performance.

Core Contributions

  1. Paradigm Innovation: Redefines beat tracking, for the first time, as a 1D temporal object detection problem, modeling beats and downbeats as temporal interval objects
  2. Architecture Adaptation: Successfully adapts the FCOS object detection model to the audio domain, replacing the original ResNet-50 backbone with WaveBeat
  3. Post-processing Simplification: Replaces traditional DBN post-processing with NMS, providing a simpler and less heuristic solution
  4. Performance Improvement: Achieves competitive results on standard music datasets, with particularly strong performance in downbeat detection

Methodology Details

Task Definition

Converts beat detection from locating zero-dimensional points in time into detecting intervals in 1D audio. The input is a raw audio waveform; the output is a set of beat/downbeat interval predictions with confidence scores.

Model Architecture

Overall Design

The BeatFCOS model contains the following key components:

  1. WaveBeat Backbone Network: Replaces the original FCOS ResNet-50, directly processing raw audio waveforms
  2. Feature Pyramid Network (FPN): Captures multi-scale temporal patterns
  3. Three-Head Detector: Used for classification, regression, and leftness prediction respectively

Beat Interval Representation

  • Beat Interval: Time segment between two consecutive beats
  • Downbeat Interval: Time segment between two consecutive downbeats
  • Duplicate Representation: Downbeats appear both as downbeat intervals and as regular beat intervals

WaveBeat and FPN Integration

  • Removes the final convolution and sigmoid layers from WaveBeat
  • Passes outputs from the last two TCN blocks (C7 and C8) to FPN layers P7 and P8
  • Due to memory constraints, only uses outputs from the last two backbone blocks rather than three as in the original FCOS

Technical Innovations

1. Anchor Strategy

  • Size Constraints: Each FPN layer is responsible for intervals of specific temporal scales
  • Sub-box Strategy: Uses left-biased sub-boxes rather than symmetric center regions, focusing on interval start positions
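
A rough sketch of how the two rules above might work together is given below. The summary does not state the actual size ranges or the sub-box width, so `SIZE_RANGES`, `SUB_BOX_RATIO`, and both function names are hypothetical values and helpers chosen only for illustration. The sketch also assumes, as in FCOS, that the coarser level (P8) handles the longer intervals.

```python
from typing import Dict, Optional, Tuple

# Hypothetical interval-length ranges in seconds for each FPN level.
SIZE_RANGES: Dict[str, Tuple[float, float]] = {
    "P7": (0.0, 1.0),            # shorter intervals (assumed: mostly beats)
    "P8": (1.0, float("inf")),   # longer intervals (assumed: mostly downbeats)
}

# Hypothetical fraction of the interval, measured from its left edge,
# inside which a time point counts as a positive training sample.
SUB_BOX_RATIO = 0.25


def assign_level(start: float, end: float) -> Optional[str]:
    """Return the FPN level whose size range contains the interval length."""
    length = end - start
    for level, (low, high) in SIZE_RANGES.items():
        if low <= length < high:
            return level
    return None


def in_left_sub_box(t: float, start: float, end: float) -> bool:
    """True if time t falls in the left-biased sub-box of [start, end]."""
    return start <= t <= start + SUB_BOX_RATIO * (end - start)
```

With these assumed values, a 0.5 s beat interval is assigned to P7 and only points in its first 0.125 s are treated as positives, which concentrates supervision near the interval start.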

2. Leftness Mechanism

Replaces centerness in FCOS, defined as:

\[ \mathrm{leftness}_{1D}(r) = \sqrt{\frac{r_{\mathrm{right}}}{r_{\mathrm{left}} + r_{\mathrm{right}}}} \]

Emphasizes the left edge of beat intervals rather than the center, better aligning with beat localization intuition.
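
To make the difference from centerness concrete, the following sketch evaluates both scores at a few positions inside an interval. It assumes, following the FCOS convention, that r_left and r_right are the distances from a time point to the interval's left and right edges. The `centerness_1d` function is a 1D reduction of the FCOS centerness formula written for this comparison and is not taken from the paper.

```python
import math


def leftness_1d(r_left: float, r_right: float) -> float:
    """Leftness as defined above: 1 at the left edge, 0 at the right edge."""
    total = r_left + r_right
    if total <= 0:
        raise ValueError("interval must have positive length")
    return math.sqrt(r_right / total)


def centerness_1d(r_left: float, r_right: float) -> float:
    """1D reduction of FCOS centerness (an assumption for comparison):
    sqrt(min / max), which is 1 at the center and 0 at either edge."""
    largest = max(r_left, r_right)
    if largest <= 0:
        raise ValueError("interval must have positive length")
    return math.sqrt(min(r_left, r_right) / largest)


# Scores at the left edge, center, and right edge of a 0.5 s interval.
for r_left in (0.0, 0.25, 0.5):
    r_right = 0.5 - r_left
    print(r_left, leftness_1d(r_left, r_right), centerness_1d(r_left, r_right))
```

Leftness peaks where the beat is assumed to occur (the interval start), whereas centerness peaks halfway between two beats.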

3. Loss Function

Total loss comprises three components:

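\[ L_{\mathrm{point}}(k, n) = L_{\mathrm{cls}}(c_{k,n}, \hat{c}_{k,n}, n) + \mathbb{1}_{\{c_{k,n} > 0\}} L_{\mathrm{reg}}(r_{k,n}, \hat{r}_{k,n}, n) + \mathbb{1}_{\{c_{k,n} > 0\}} L_{\mathrm{lft}}(r_{k,n}, \hat{r}_{k,n}, n) \]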
  • Classification Loss: focal loss
  • Regression Loss: 1D-adapted GIoU loss
  • Leftness Loss: binary cross-entropy loss
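
As an illustration of the regression term, here is a minimal sketch of generalized IoU (GIoU) restricted to 1D intervals, using the standard GIoU definition with lengths in place of areas. Whether the paper's 1D adaptation matches this exactly is an assumption, and the function names are hypothetical.

```python
from typing import Tuple

Span = Tuple[float, float]  # (start, end) with start < end


def giou_1d(a: Span, b: Span) -> float:
    """Generalized IoU for two 1D intervals, in the range (-1, 1]."""
    a_start, a_end = a
    b_start, b_end = b
    intersection = max(0.0, min(a_end, b_end) - max(a_start, b_start))
    union = (a_end - a_start) + (b_end - b_start) - intersection
    # Smallest interval that encloses both a and b.
    hull = max(a_end, b_end) - min(a_start, b_start)
    if union <= 0 or hull <= 0:
        raise ValueError("intervals must have positive length")
    iou = intersection / union
    return iou - (hull - union) / hull


def giou_loss_1d(target: Span, predicted: Span) -> float:
    """Loss is 0 for a perfect match and grows as the intervals move apart."""
    return 1.0 - giou_1d(target, predicted)
```

Unlike plain IoU, the hull term still provides a useful signal when the predicted and target intervals do not overlap at all, because the penalty increases with the gap between them.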

Experimental Setup

Datasets

Uses the same datasets as WaveBeat:

  • Training Set: Ballroom, Hainsworth, Beatles, RWC Popular
  • Test Set: GTZAN, SMC
  • Audio Format: 22.05 kHz sampling rate, 2^21 samples per excerpt (approximately 1.6 minutes)
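
The stated excerpt duration follows directly from the sample count and the sampling rate:

\[ \frac{2^{21}\ \text{samples}}{22{,}050\ \text{samples/s}} = \frac{2{,}097{,}152}{22{,}050}\ \text{s} \approx 95.1\ \text{s} \approx 1.59\ \text{min} \]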

Evaluation Metrics

  • F1 Score: Harmonic mean of precision and recall
  • CMLt (Correct Metrical Level, total): Proportion of beats tracked correctly at the annotated metrical level, without requiring the correct beats to form one continuous segment
  • AMLt (Allowed Metrical Levels, total): Same as CMLt, but also accepts allowed alternative metrical levels, such as double or half the annotated tempo and off-beat tracking
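
For the F1 score, a simplified sketch is shown below. It assumes the ±70 ms tolerance window that is conventional in beat tracking evaluation (the summary does not state the window used) and matches estimates to annotations with a greedy one-to-one pass rather than the matching procedure of any particular evaluation library. The function name is hypothetical.

```python
from typing import List


def beat_f1(reference: List[float], estimated: List[float],
            tolerance: float = 0.07) -> float:
    """F1 of estimated beat times against reference beat times (seconds).

    An estimate is a hit if it lies within `tolerance` of a reference beat
    that has not been matched yet; each reference beat is used at most once.
    """
    ref = sorted(reference)
    est = sorted(estimated)
    if not ref or not est:
        return 0.0
    i = j = hits = 0
    while i < len(ref) and j < len(est):
        diff = est[j] - ref[i]
        if abs(diff) <= tolerance:
            hits += 1
            i += 1
            j += 1
        elif diff < 0:
            j += 1  # estimate is too early for this reference beat
        else:
            i += 1  # reference beat was missed
    if hits == 0:
        return 0.0
    precision = hits / len(est)
    recall = hits / len(ref)
    return 2 * precision * recall / (precision + recall)


score = beat_f1([0.5, 1.0, 1.5, 2.0], [0.52, 1.0, 1.7, 2.03])
```

In this example three of the four estimates fall within the window, so precision and recall are both 0.75.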

Comparison Methods

  • WaveBeat (Peak-picking)
  • WaveBeat (DBN)
  • Spectral TCN
  • Hung et al. (Transformer-based)

Implementation Details

  • Optimizer: Adam (lr=1e-3, weight decay=1e-4)
  • Learning Rate Schedule: Reduced by a factor of 10 after 3 epochs without improvement
  • Batch Size: 16
  • Training Environment: Google Colab, NVIDIA A100 40GB GPU
  • Training Strategy: 8-fold cross-validation

Experimental Results

Main Results

Among all WaveBeat variants, BeatFCOS demonstrates strong performance across multiple datasets:

Beat Tracking Performance

  • Ballroom Dataset: F1=0.927, CMLt=0.873, AMLt=0.898
  • Beatles Dataset: F1=0.903, CMLt=0.797, AMLt=0.866
  • RWC Popular Dataset: F1=0.862, CMLt=0.763, AMLt=0.849

Downbeat Tracking Performance

  • Ballroom Dataset: F1=0.807, CMLt=0.697, AMLt=0.756
  • Beatles Dataset: F1=0.762, CMLt=0.579, AMLt=0.659
  • RWC Popular Dataset: F1=0.779, CMLt=0.691, AMLt=0.731

Ablation Studies

Leftness vs Centerness

The leftness mechanism significantly outperforms centerness on nearly all datasets and metrics, particularly in downbeat tracking.

Soft-NMS vs Standard NMS

Soft-NMS consistently improves performance, suggesting it helps preserve valid nearby beat predictions that might be incorrectly suppressed by standard NMS.
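
The idea can be sketched with the Gaussian variant of Soft-NMS, which decays the scores of overlapping candidates instead of discarding them. Which Soft-NMS variant and which parameters the paper uses are not stated in this summary, so `sigma`, `score_threshold`, and the function names below are assumptions.

```python
import math
from typing import List, Tuple

Span = Tuple[float, float]  # (start, end) in seconds


def iou_1d(a: Span, b: Span) -> float:
    """Intersection over union of two 1D intervals."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def soft_nms_1d(intervals: List[Span], scores: List[float],
                sigma: float = 0.5,
                score_threshold: float = 0.001) -> List[Tuple[Span, float]]:
    """Gaussian Soft-NMS: repeatedly take the best interval and decay the
    scores of the others by exp(-iou^2 / sigma) instead of removing them."""
    remaining = list(zip(intervals, scores))
    kept: List[Tuple[Span, float]] = []
    while remaining:
        best_idx = max(range(len(remaining)), key=lambda i: remaining[i][1])
        best_interval, best_score = remaining.pop(best_idx)
        kept.append((best_interval, best_score))
        rescored = []
        for interval, score in remaining:
            overlap = iou_1d(best_interval, interval)
            score *= math.exp(-(overlap ** 2) / sigma)
            if score >= score_threshold:
                rescored.append((interval, score))
        remaining = rescored
    return kept


result = soft_nms_1d([(0.0, 0.5), (0.05, 0.55), (0.5, 1.0)], [0.9, 0.8, 0.7])
```

In this example the near-duplicate interval (0.05 to 0.55 s) is kept but with a heavily reduced score, while the adjacent interval (0.5 to 1.0 s) keeps its original score because it does not overlap the top prediction.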

Backbone Fine-tuning Strategy

Freezing only BatchNorm layers while allowing convolutional weight updates significantly outperforms complete backbone freezing.

NMS Threshold Selection

An IoU threshold of 0.2 is selected in a data-driven manner by analyzing histograms of the IoU values between predicted intervals, avoiding the grid search typically needed to tune DBN parameters.
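
A minimal sketch of standard (hard) NMS on 1D intervals with the 0.2 threshold follows. Reading the final beat times off the left edges of the surviving intervals is an assumption based on the left-edge emphasis described earlier, and the function names are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[float, float]  # (start, end) in seconds


def iou_1d(a: Span, b: Span) -> float:
    """Intersection over union of two 1D intervals."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def nms_1d(intervals: List[Span], scores: List[float],
           iou_threshold: float = 0.2) -> List[int]:
    """Return indices of kept intervals, highest score first.

    An interval is dropped if it overlaps an already kept, higher-scoring
    interval by more than `iou_threshold`.
    """
    order = sorted(range(len(intervals)), key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in order:
        if all(iou_1d(intervals[i], intervals[k]) <= iou_threshold for k in kept):
            kept.append(i)
    return kept


def beat_times(intervals: List[Span], kept: List[int]) -> List[float]:
    """Assumed decoding step: use each kept interval's start as a beat time."""
    return sorted(intervals[i][0] for i in kept)


candidates = [(0.0, 0.5), (0.02, 0.52), (0.5, 1.0), (0.48, 1.0)]
confidences = [0.9, 0.6, 0.8, 0.5]
kept_indices = nms_1d(candidates, confidences)
beats = beat_times(candidates, kept_indices)
```

Because consecutive beat intervals share only an endpoint, their IoU is close to zero, so a low threshold such as 0.2 removes near-duplicates without suppressing neighboring beats.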

Related Work

Traditional Methods

Early beat tracking was based on onset detection, estimating beat position sequences by identifying note onsets.

Deep Learning Methods

  • RNNs/LSTMs: Provide temporal dependency support, representing significant breakthroughs over non-machine learning methods
  • TCNs: Use extensive dilated convolution layers to provide large temporal context
  • Transformers: Use attention to learn how strongly each part of a sequence should be weighted

Post-processing Techniques

Traditional methods commonly use DBNs for post-processing, but face issues such as complex parameter tuning and computational expense.

Conclusions and Discussion

Main Conclusions

  1. The object detection paradigm can be effectively applied to beat tracking tasks
  2. NMS post-processing is simpler and less heuristic than traditional DBN
  3. BeatFCOS demonstrates particularly strong performance in downbeat detection
  4. Data-driven hyperparameter selection is more efficient than grid search

Limitations

  1. Performance Constraints: While competitive, does not consistently surpass SOTA methods on all metrics
  2. Memory Constraints: Memory limitations restrict the FPN to two backbone outputs rather than the three used in the original FCOS
  3. Data Dependency: Method effectiveness is significantly influenced by training data quality

Future Directions

  1. Integrate temporal adjacency constraints to better enforce regular beat spacing
  2. Explore EM-based temporal model learning as a complementary approach
  3. Further optimize architecture to reduce memory requirements

In-Depth Evaluation

Strengths

  1. Strong Innovation: First to apply the object detection paradigm to beat tracking
  2. Solid Technical Foundation: Leftness mechanism is well-designed and aligns with beat localization intuition
  3. Comprehensive Experiments: Includes detailed ablation studies and 8-fold cross-validation
  4. Practical Value: Simplifies post-processing pipeline and reduces parameter tuning complexity

Weaknesses

  1. Limited Performance Gains: Improvements over existing SOTA methods are modest
  2. Limited Applicability: Primarily validated on specific datasets; generalization capability requires further verification
  3. Insufficient Theoretical Analysis: Lacks in-depth theoretical explanation for why object detection suits beat tracking

Impact

  1. Methodological Contribution: Provides new modeling perspectives for the Music Information Retrieval field
  2. Cross-domain Inspiration: Demonstrates potential of computer vision techniques in audio processing
  3. Engineering Value: Simplified post-processing pipeline has practical application value

Applicable Scenarios

  1. Music applications requiring real-time beat detection
  2. Embedded systems sensitive to post-processing complexity
  3. Music analysis tasks with high downbeat detection requirements

References

The paper cites 34 related references covering important works in beat tracking, object detection, deep learning and other domains, providing a solid theoretical foundation for the research.