
Beat Detection as Object Detection

Ahn, Jung

Recent beat and downbeat tracking models (e.g., RNNs, TCNs, Transformers) output frame-level activations. We propose reframing this task as object detection, where beats and downbeats are modeled as temporal "objects." Adapting the FCOS detector from computer vision to 1D audio, we replace its original backbone with WaveBeat's temporal feature extractor and add a Feature Pyramid Network to capture multi-scale temporal patterns. The model predicts overlapping beat/downbeat intervals with confidence scores, followed by non-maximum suppression (NMS) to select final predictions. This NMS step serves a similar role to DBNs in traditional trackers, but is simpler and less heuristic. Evaluated on standard music datasets, our approach achieves competitive results, showing that object detection techniques can effectively model musical beats with minimal adaptation.


Beat Tracking as Object Detection

Basic Information

Paper ID: 2510.14391
Title: Beat Tracking as Object Detection
Authors: Jaehoon Ahn (Sogang University), Moon-Ryul Jung (Sogang University)
Categories: cs.SD (Sound), cs.AI (Artificial Intelligence), cs.LG (Machine Learning)
Publication Date: October 16, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.14391v1

Abstract

Recent beat and downbeat tracking models (such as RNNs, TCNs, and Transformers) output frame-level activations. This paper reframes the task as an object detection problem, modeling beats and downbeats as temporal "objects." The authors adapt the FCOS detector from computer vision to 1D audio, replace its original backbone with WaveBeat's temporal feature extractor, and add a Feature Pyramid Network to capture multi-scale temporal patterns. The model predicts overlapping beat/downbeat intervals with confidence scores, then applies Non-Maximum Suppression (NMS) to select the final predictions. This NMS step plays a role similar to that of the Dynamic Bayesian Network (DBN) in traditional trackers but is simpler and less heuristic. Evaluation on standard music datasets yields competitive results, indicating that object detection techniques can model musical beats effectively with minimal adaptation.

Research Background and Motivation

Problem Definition

Beat tracking is an important research direction in Music Information Retrieval (MIR), involving computational prediction of beat and downbeat positions. Traditional methods have evolved from early onset detection to modern machine learning techniques, including RNNs, LSTMs, TCNs, and Transformers.

Limitations of Existing Methods

Post-processing Complexity: Most modern beat tracking networks produce per-frame activations, which require post-processing with Dynamic Bayesian Networks (DBNs) to yield final beat positions
DBN Deficiencies: DBNs are prone to failure during tempo changes and time signature changes, and are overly heuristic
Downbeat Detection Difficulty: Downbeat detection performance is generally inferior to beat detection

Research Motivation

The authors argue that beat tracking can be viewed as a form of object detection in audio, thus attempting to use neural networks specifically designed for object detection to improve beat tracking, particularly downbeat tracking performance.

Core Contributions

Paradigm Innovation: Reframes beat tracking as a 1D temporal object detection problem for the first time, modeling beats and downbeats as temporal interval objects
Architecture Adaptation: Successfully adapts the FCOS object detection model to the audio domain, replacing the original ResNet-50 backbone with WaveBeat
Post-processing Simplification: Replaces traditional DBN post-processing with NMS, providing a simpler and less heuristic solution
Performance Improvement: Achieves competitive results on standard music datasets, with particularly strong performance in downbeat detection

Methodology Details

Task Definition

The task is recast from detecting beats as isolated time points (0D) to detecting intervals along the 1D time axis of the audio. The input is a raw audio waveform; the output is a set of beat and downbeat interval predictions with confidence scores.

Model Architecture

Overall Design

The BeatFCOS model contains the following key components:

WaveBeat Backbone Network: Replaces the original FCOS ResNet-50, directly processing raw audio waveforms
Feature Pyramid Network (FPN): Captures multi-scale temporal patterns
Three-Head Detector: Used for classification, regression, and leftness prediction respectively

Beat Interval Representation

Beat Interval: Time segment between two consecutive beats
Downbeat Interval: Time segment between two consecutive downbeats
Duplicate Representation: Downbeats appear both as downbeat intervals and as regular beat intervals
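
The interval representation above can be illustrated with a minimal sketch. It assumes annotations arrive as plain lists of times in seconds and that the beat list already contains the downbeat times; the names `to_intervals` and `build_targets` are hypothetical and not taken from the paper.

```python
from typing import List, Tuple

# (start_sec, end_sec, label); this tuple layout is an assumption for illustration
Interval = Tuple[float, float, str]


def to_intervals(times: List[float], label: str) -> List[Interval]:
    """Pair each event with the next one to form (start, end, label) intervals."""
    ordered = sorted(times)
    return [(a, b, label) for a, b in zip(ordered, ordered[1:])]


def build_targets(beats: List[float], downbeats: List[float]) -> List[Interval]:
    """Build beat and downbeat interval targets.

    Assumes `beats` includes the downbeat times, so every downbeat starts
    a beat interval as well as a downbeat interval.
    """
    return to_intervals(beats, "beat") + to_intervals(downbeats, "downbeat")


beats = [0.0, 0.5, 1.0, 1.5, 2.0]
downbeats = [0.0, 2.0]
targets = build_targets(beats, downbeats)
```

In this toy example the time 0.0 starts both the beat interval (0.0, 0.5) and the downbeat interval (0.0, 2.0), which is the duplicate representation described above.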

WaveBeat and FPN Integration

Removes the final convolution and sigmoid layers from WaveBeat
Passes outputs from the last two TCN blocks (C7 and C8) to FPN layers P7 and P8
Due to memory constraints, only uses outputs from the last two backbone blocks rather than three as in the original FCOS

Technical Innovations

1. Anchor Strategy

Size Constraints: Each FPN layer is responsible for intervals of specific temporal scales
Sub-box Strategy: Uses left-biased sub-boxes rather than symmetric center regions, focusing on interval start positions

2. Leftness Mechanism

Replaces centerness in FCOS, defined as:

$$\mathrm{leftness}_{1D}(r) = \sqrt{\frac{r_{\text{right}}}{r_{\text{left}} + r_{\text{right}}}}$$

Emphasizes the left edge of beat intervals rather than the center, better aligning with beat localization intuition.
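
To make the effect of the formula concrete, the sketch below evaluates leftness at three points inside one interval. Here `r_left` and `r_right` are taken to be the distances from a point to the interval's left and right edges. The `centerness_1d` function is an assumed 1D reduction of FCOS centerness, included only for comparison; it is not quoted from the paper.

```python
import math


def leftness_1d(r_left: float, r_right: float) -> float:
    """Leftness from the formula above: 1 at the left edge, 0 at the right edge."""
    total = r_left + r_right
    if total <= 0:
        return 0.0  # degenerate zero-length interval
    return math.sqrt(r_right / total)


def centerness_1d(r_left: float, r_right: float) -> float:
    """Assumed 1D analogue of FCOS centerness: peaks at the interval midpoint."""
    longest = max(r_left, r_right)
    if longest <= 0:
        return 0.0
    return math.sqrt(min(r_left, r_right) / longest)


# Interval [1.0, 1.5] sampled at its left edge, midpoint, and right edge
start, end = 1.0, 1.5
scores = [
    (t, leftness_1d(t - start, end - t), centerness_1d(t - start, end - t))
    for t in (1.0, 1.25, 1.5)
]
```

Leftness is highest at the interval start, which is where the beat itself lies, whereas centerness would instead favor the midpoint between two beats.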

3. Loss Function

Total loss comprises three components:

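$$L_{\text{point}}(k,n) = L_{\text{cls}}(c_{k,n}, \hat{c}_{k,n}, n) + \mathbb{1}\{c_{k,n} > 0\}\, L_{\text{reg}}(r_{k,n}, \hat{r}_{k,n}, n) + \mathbb{1}\{c_{k,n} > 0\}\, L_{\text{lft}}(r_{k,n}, \hat{r}_{k,n}, n)$$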

Classification Loss: focal loss
Regression Loss: 1D-adapted GIoU loss
Leftness Loss: binary cross-entropy loss
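
The following is a rough per-point sketch of how the three terms might combine, not the authors' implementation. It assumes a single binary class, intervals given as (start, end) pairs in seconds, and the common focal-loss defaults alpha=0.25 and gamma=2.0, none of which are stated in this summary. All function names are hypothetical.

```python
import math

EPS = 1e-7  # clamp probabilities away from 0 and 1 before taking logs


def giou_loss_1d(pred, target):
    """1 - GIoU for two 1D intervals given as (start, end)."""
    inter = max(0.0, min(pred[1], target[1]) - max(pred[0], target[0]))
    union = (pred[1] - pred[0]) + (target[1] - target[0]) - inter
    hull = max(pred[1], target[1]) - min(pred[0], target[0])
    if union <= 0 or hull <= 0:
        return 1.0  # degenerate intervals: treated as zero overlap (a simplification)
    return 1.0 - (inter / union - (hull - union) / hull)


def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one predicted probability p and label y in {0, 1}."""
    p = min(max(p, EPS), 1.0 - EPS)
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def bce(p, y):
    """Binary cross-entropy with a soft target y in [0, 1]."""
    p = min(max(p, EPS), 1.0 - EPS)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


def point_loss(cls_prob, cls_target, pred_iv, target_iv, lft_pred, lft_target):
    """Classification always counts; regression and leftness only at positive points."""
    loss = focal_loss(cls_prob, cls_target)
    if cls_target > 0:
        loss += giou_loss_1d(pred_iv, target_iv) + bce(lft_pred, lft_target)
    return loss
```

For a positive point with a predicted interval of (0.0, 0.5) against a target of (0.0, 0.6), the GIoU term is 1 - 0.5/0.6, about 0.17; for a negative point only the focal term contributes.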

Experimental Setup

Datasets

Uses the same datasets as WaveBeat:

Training Set: Ballroom, Hainsworth, Beatles, RWC Popular
Test Set: GTZAN, SMC
Audio Format: 22.05kHz sampling rate, 2^21 sample length (approximately 1.6 minutes)
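
As a quick check of the stated excerpt length, dividing the sample count by the sampling rate gives:

$$\frac{2^{21}\ \text{samples}}{22050\ \text{samples/s}} = \frac{2097152}{22050}\ \text{s} \approx 95.1\ \text{s} \approx 1.6\ \text{min}$$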

Evaluation Metrics

F1 Score: Harmonic mean of precision and recall
CMLt (Correct Metrical Level, total): Proportion of beats tracked correctly at the annotated metrical level, without requiring continuity
AMLt (Allowed Metrical Levels, total): Same as CMLt, but also accepts metrically related variants such as double or half tempo and off-beat tracking
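
For intuition, the F1 score for beat tracking can be sketched as below. The ±70 ms tolerance window is a common convention in beat-tracking evaluation and is an assumption here, since this summary does not state the window used. The greedy matching is a simplification, and reference evaluation libraries may match events differently. The name `beat_f1` is hypothetical.

```python
def beat_f1(reference, estimated, tol=0.07):
    """F1 between reference and estimated beat times (seconds).

    An estimate is a hit if it lies within `tol` seconds of a reference beat;
    each reference beat and each estimate can be matched at most once.
    """
    ref, est = sorted(reference), sorted(estimated)
    i = j = hits = 0
    while i < len(ref) and j < len(est):
        diff = est[j] - ref[i]
        if abs(diff) <= tol:
            hits += 1
            i += 1
            j += 1
        elif diff < 0:
            j += 1  # estimate falls too early to match this or any later reference
        else:
            i += 1  # reference beat was missed
    if hits == 0:
        return 0.0
    precision = hits / len(est)
    recall = hits / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = [0.5, 1.0, 1.5, 2.0]
estimated = [0.52, 1.0, 1.6, 2.01, 2.5]
score = beat_f1(reference, estimated)
```

In this example three of the five estimates are hits (precision 0.6) and three of the four reference beats are found (recall 0.75), so the F1 score is 2/3.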

Comparison Methods

WaveBeat (Peak-picking)
WaveBeat (DBN)
Spectral TCN
Hung et al. (Transformer-based)

Implementation Details

Optimizer: Adam (lr=1e-3, weight decay=1e-4)
Learning Rate Schedule: Reduced by a factor of 10 after 3 epochs without improvement
Batch Size: 16
Training Environment: Google Colab, NVIDIA A100 40GB GPU
Training Strategy: 8-fold cross-validation
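
The learning rate schedule above can be sketched in a framework-free way. This assumes the monitored quantity is a validation loss (lower is better) and that "3 epochs without improvement" means three consecutive non-improving epochs; neither detail is confirmed by this summary, and `PlateauSchedule` is a hypothetical name.

```python
class PlateauSchedule:
    """Divide the learning rate by 10 after `patience` stale epochs in a row."""

    def __init__(self, lr=1e-3, factor=0.1, patience=3):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.stale = 0  # consecutive epochs without improvement

    def step(self, val_loss):
        """Record one epoch's validation loss and return the updated learning rate."""
        if val_loss < self.best:
            self.best = val_loss
            self.stale = 0
        else:
            self.stale += 1
            if self.stale >= self.patience:
                self.lr *= self.factor
                self.stale = 0
        return self.lr


schedule = PlateauSchedule()
history = [schedule.step(v) for v in [1.0, 0.9, 0.95, 0.92, 0.91, 0.8]]
```

Here the best loss of 0.9 is followed by three non-improving epochs, so the rate drops from 1e-3 to 1e-4 on the fifth epoch. Framework schedulers may count patience slightly differently.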

Experimental Results

Main Results

Among all WaveBeat variants, BeatFCOS demonstrates strong performance across multiple datasets:

Beat Tracking Performance

Ballroom Dataset: F1=0.927, CMLt=0.873, AMLt=0.898
Beatles Dataset: F1=0.903, CMLt=0.797, AMLt=0.866
RWC Popular Dataset: F1=0.862, CMLt=0.763, AMLt=0.849

Downbeat Tracking Performance

Ballroom Dataset: F1=0.807, CMLt=0.697, AMLt=0.756
Beatles Dataset: F1=0.762, CMLt=0.579, AMLt=0.659
RWC Popular Dataset: F1=0.779, CMLt=0.691, AMLt=0.731

Ablation Studies

Leftness vs Centerness

The leftness mechanism significantly outperforms centerness on nearly all datasets and metrics, particularly in downbeat tracking.

Soft-NMS vs Standard NMS

Soft-NMS consistently improves performance, suggesting it helps preserve valid nearby beat predictions that might be incorrectly suppressed by standard NMS.
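
The difference between the two suppression schemes can be sketched in 1D as follows. Detections are assumed to be (start, end, score) tuples. The Gaussian decay with sigma=0.5 and the score cut-off of 0.05 are assumptions, since this summary does not say which Soft-NMS variant or settings the paper uses. The IoU threshold of 0.2 is the value reported in this summary.

```python
import math


def iou_1d(a, b):
    """IoU of two intervals; only the first two fields (start, end) are used."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def nms_1d(dets, iou_thr=0.2):
    """Hard NMS: drop any detection overlapping a higher-scoring kept one."""
    keep = []
    for det in sorted(dets, key=lambda d: d[2], reverse=True):
        if all(iou_1d(det, k) <= iou_thr for k in keep):
            keep.append(det)
    return keep


def soft_nms_1d(dets, sigma=0.5, score_thr=0.05):
    """Gaussian Soft-NMS: decay overlapping scores instead of removing them."""
    pool, keep = list(dets), []
    while pool:
        best = max(pool, key=lambda d: d[2])
        pool.remove(best)
        keep.append(best)
        # Decay every remaining score by its overlap with the detection just kept
        pool = [
            (s, e, sc * math.exp(-iou_1d((s, e), best) ** 2 / sigma))
            for s, e, sc in pool
        ]
        pool = [d for d in pool if d[2] >= score_thr]
    return keep


dets = [(0.0, 0.5, 0.9), (0.02, 0.52, 0.8), (0.5, 1.0, 0.85)]
hard = nms_1d(dets)
soft = soft_nms_1d(dets)
```

With these toy detections, hard NMS discards the near-duplicate interval (0.02, 0.52) outright, while Soft-NMS keeps it with a much lower score, leaving the final decision to a score threshold.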

Backbone Fine-tuning Strategy

Freezing only BatchNorm layers while allowing convolutional weight updates significantly outperforms complete backbone freezing.

NMS Threshold Selection

An IoU threshold of 0.2 is selected in a data-driven manner by analyzing histograms of the IoU values among predicted intervals, which avoids the grid search that DBN post-processing requires.

Related Work

Traditional Methods

Early beat tracking was based on onset detection, estimating beat position sequences by identifying note onsets.

Deep Learning Methods

RNNs/LSTMs: Provide temporal dependency support, representing significant breakthroughs over non-machine learning methods
TCNs: Use extensive dilated convolution layers to provide large temporal context
Transformers: Learn weight distributions for important aspects of sequence data

Conclusions

The object detection paradigm can be effectively applied to beat tracking tasks
NMS post-processing is simpler and less heuristic than traditional DBN
BeatFCOS demonstrates particularly strong performance in downbeat detection
Data-driven hyperparameter selection is more efficient than grid search

Limitations

Performance Constraints: Although competitive, the model does not consistently surpass SOTA methods on all metrics
Memory Constraints: Can only use two FPN layers rather than three due to memory limitations
Data Dependency: Method effectiveness is significantly influenced by training data quality

Future Directions

Integrate temporal adjacency constraints to better enforce regular beat spacing
Explore EM-based temporal model learning as a complementary approach
Further optimize architecture to reduce memory requirements

In-Depth Evaluation

Strengths

Strong Innovation: First to apply the object detection paradigm to beat tracking
Solid Technical Foundation: Leftness mechanism is well-designed and aligns with beat localization intuition
Comprehensive Experiments: Includes detailed ablation studies and 8-fold cross-validation
Practical Value: Simplifies post-processing pipeline and reduces parameter tuning complexity

Weaknesses

Limited Performance Gains: Improvements over existing SOTA methods are modest
Limited Applicability: Primarily validated on specific datasets; generalization capability requires further verification
Insufficient Theoretical Analysis: Lacks in-depth theoretical explanation for why object detection suits beat tracking

Impact

Methodological Contribution: Provides new modeling perspectives for the Music Information Retrieval field
Cross-domain Inspiration: Demonstrates potential of computer vision techniques in audio processing
Engineering Value: Simplified post-processing pipeline has practical application value

Applicable Scenarios

Music applications requiring real-time beat detection
Embedded systems sensitive to post-processing complexity
Music analysis tasks with high downbeat detection requirements

References

The paper cites 34 related references covering important works in beat tracking, object detection, deep learning and other domains, providing a solid theoretical foundation for the research.