Recent beat and downbeat tracking models (e.g., RNNs, TCNs, Transformers) output frame-level activations. We propose reframing this task as object detection, where beats and downbeats are modeled as temporal "objects." Adapting the FCOS detector from computer vision to 1D audio, we replace its original backbone with WaveBeat's temporal feature extractor and add a Feature Pyramid Network to capture multi-scale temporal patterns. The model predicts overlapping beat/downbeat intervals with confidence scores, followed by non-maximum suppression (NMS) to select final predictions. This NMS step serves a similar role to DBNs in traditional trackers, but is simpler and less heuristic. Evaluated on standard music datasets, our approach achieves competitive results, showing that object detection techniques can effectively model musical beats with minimal adaptation.
Beat tracking is a core task in Music Information Retrieval (MIR): computationally estimating the positions of beats and downbeats in audio. Methods have evolved from early onset-detection approaches to modern neural models, including RNNs (notably LSTMs), TCNs, and Transformers.
Post-processing Complexity: Most modern beat detection networks produce per-frame activations, which must be post-processed with Dynamic Bayesian Networks (DBNs) to obtain final beat positions.
DBN Deficiencies: DBNs are prone to failure when the tempo or time signature changes, and they rely heavily on heuristics.
Downbeat Detection Difficulty: Downbeat detection generally performs worse than beat detection.
The authors argue that beat tracking can be viewed as a form of object detection in audio. They therefore apply neural networks designed specifically for object detection to beat tracking, with the aim of improving performance, particularly for downbeats.
Paradigm Innovation: Reframes beat tracking, for the first time according to the authors, as a 1D temporal object detection problem in which beats and downbeats are modeled as temporal interval objects.
Architecture Adaptation: Adapts the FCOS object detection model to the audio domain, replacing the original ResNet-50 backbone with WaveBeat.
Post-processing Simplification: Replaces traditional DBN post-processing with NMS, a simpler and less heuristic solution.
Performance Improvement: Achieves competitive results on standard music datasets, with particularly strong performance in downbeat detection.
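To make the FCOS adaptation more concrete, here is a minimal sketch of how FCOS-style regression targets might look when reduced from 2D boxes to 1D intervals. It assumes the standard FCOS formulation (distances from a location to the box edges, plus a centerness score) carries over directly with only left and right distances; the paper's exact targets and scoring may differ. The names `fcos_1d_targets` and `decode_interval` are hypothetical.

```python
import math
from typing import Optional, Tuple


def fcos_1d_targets(
    location: float, interval: Tuple[float, float]
) -> Optional[Tuple[float, float, float]]:
    """Return (left, right, centerness) for a location inside an interval.

    Assumption: this is the 2D FCOS target (l, t, r, b) reduced to (l, r).
    Locations outside the interval are treated as negatives (None).
    """
    start, end = interval
    if not (start <= location <= end):
        return None
    left = location - start   # distance back to the interval start
    right = end - location    # distance forward to the interval end
    longest = max(left, right)
    # 1D analogue of FCOS centerness: 1.0 at the midpoint, 0.0 at an edge.
    centerness = math.sqrt(min(left, right) / longest) if longest > 0 else 0.0
    return left, right, centerness


def decode_interval(location: float, left: float, right: float) -> Tuple[float, float]:
    """Invert the regression targets back into a (start, end) interval."""
    return location - left, location + right


targets = fcos_1d_targets(1.25, (1.0, 2.0))
```

Each time step along the feature sequence that falls inside a ground-truth interval would regress its own `(left, right)` pair, which is why the raw output contains many overlapping interval predictions that later need suppression.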
The task is recast from detecting beats as points in time (0D) to detecting intervals along the 1D audio timeline. The input is the raw audio waveform; the output is a set of beat and downbeat interval predictions, each with a confidence score.
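The point-to-interval conversion can be illustrated with a short sketch. It assumes that each interval runs from one annotated beat to the next and that a beat time is recovered as the left edge of its interval; the summary does not spell out the interval definition, so treat this as one plausible reading. The helper names are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[float, float]


def beats_to_intervals(beat_times: List[float]) -> List[Interval]:
    """Turn beat times (seconds) into (start, end) interval objects.

    Assumption: an interval spans two consecutive beats, so N beats
    yield N - 1 intervals.
    """
    times = sorted(beat_times)
    return list(zip(times, times[1:]))


def intervals_to_beats(intervals: List[Interval]) -> List[float]:
    """Recover beat times as the left edge of each interval (assumed convention)."""
    return sorted(start for start, _ in intervals)


beats = [0.5, 1.0, 1.5, 2.0]
intervals = beats_to_intervals(beats)
recovered = intervals_to_beats(intervals)
```

Under this convention the final beat has no closing interval, so an implementation would need some rule for the last beat of a track. The same conversion would apply to downbeats, giving bar-length intervals.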
Soft-NMS consistently improves performance, suggesting it helps preserve valid nearby beat predictions that might be incorrectly suppressed by standard NMS.
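The difference from standard NMS can be seen in a small 1D sketch. It assumes the Gaussian variant of Soft-NMS, in which an overlapping candidate's score is multiplied by exp(-IoU^2 / sigma) instead of being discarded outright; the summary does not say which variant or which `sigma` the paper uses, so the parameter values here are illustrative only.

```python
import math
from typing import List, Tuple

Interval = Tuple[float, float]


def iou_1d(a: Interval, b: Interval) -> float:
    """Intersection over union of two 1D intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms_1d(
    intervals: List[Interval],
    scores: List[float],
    sigma: float = 0.5,            # illustrative value, not from the paper
    score_threshold: float = 0.001,  # illustrative value, not from the paper
) -> List[Tuple[Interval, float]]:
    """Gaussian Soft-NMS: decay the scores of overlapping intervals."""
    candidates = list(zip(intervals, scores))
    kept = []
    while candidates:
        # Pick the highest-scoring remaining candidate.
        best = max(range(len(candidates)), key=lambda i: candidates[i][1])
        best_interval, best_score = candidates.pop(best)
        kept.append((best_interval, best_score))
        # Down-weight the rest according to their overlap with the pick.
        rescored = []
        for interval, score in candidates:
            overlap = iou_1d(best_interval, interval)
            new_score = score * math.exp(-(overlap ** 2) / sigma)
            if new_score >= score_threshold:
                rescored.append((interval, new_score))
        candidates = rescored
    return kept


kept = soft_nms_1d(
    [(0.0, 0.5), (0.05, 0.55), (0.5, 1.0)],
    [0.9, 0.8, 0.7],
)
```

In this example the near-duplicate interval `(0.05, 0.55)` is not removed; its score is reduced so that it ranks below the non-overlapping neighbour `(0.5, 1.0)`. A final score cutoff then decides which predictions survive.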
By analyzing histograms of the IoU distribution of predicted intervals, the authors select an IoU threshold of 0.2 in a data-driven manner, avoiding the grid search that traditional DBN post-processing requires.
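A rough sketch of this kind of analysis follows. It assumes the histogram is built from pairwise IoUs between raw predicted intervals, which the summary does not confirm, and it pairs the histogram with a plain (hard) 1D NMS that uses the 0.2 threshold. The function names and the toy data are hypothetical.

```python
from itertools import combinations
from typing import List, Tuple

Interval = Tuple[float, float]


def iou_1d(a: Interval, b: Interval) -> float:
    """Intersection over union of two 1D intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def iou_histogram(intervals: List[Interval], num_bins: int = 10) -> List[int]:
    """Count overlapping interval pairs per IoU bin (assumed analysis)."""
    counts = [0] * num_bins
    for a, b in combinations(intervals, 2):
        overlap = iou_1d(a, b)
        if overlap > 0:
            counts[min(int(overlap * num_bins), num_bins - 1)] += 1
    return counts


def nms_1d(
    intervals: List[Interval], scores: List[float], iou_threshold: float = 0.2
) -> List[int]:
    """Hard NMS: keep an interval only if it overlaps no kept one too much."""
    order = sorted(range(len(intervals)), key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in order:
        if all(iou_1d(intervals[i], intervals[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return sorted(kept)


predictions = [(0.0, 0.5), (0.02, 0.52), (0.5, 1.0), (0.45, 1.0)]
confidences = [0.9, 0.6, 0.8, 0.5]
histogram = iou_histogram(predictions)
kept_indices = nms_1d(predictions, confidences, iou_threshold=0.2)
```

In the toy data the IoU values fall into two groups: near-duplicate predictions of the same beat (IoU above 0.9) and slightly overlapping neighbours (IoU below 0.1). A threshold placed in the gap between the groups, such as 0.2, removes the duplicates while keeping adjacent beats.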
The paper cites 34 references covering key work in beat tracking, object detection, deep learning, and related areas, giving the research a solid theoretical foundation.