SongFormer: Scaling Music Structure Analysis with Heterogeneous Supervision
Hao, Yuan, Yao et al.
Music structure analysis (MSA) underpins music understanding and controllable generation, yet progress has been limited by small, inconsistent corpora. We present SongFormer, a scalable framework that learns from heterogeneous supervision. SongFormer (i) fuses short- and long-window self-supervised audio representations to capture both fine-grained and long-range dependencies, and (ii) introduces a learned source embedding to enable training with partial, noisy, and schema-mismatched labels. To support scaling and fair evaluation, we release SongFormDB, the largest MSA corpus to date (over 10k tracks spanning languages and genres), and SongFormBench, a 300-song expert-verified benchmark. On SongFormBench, SongFormer sets a new state of the art in strict boundary detection (HR.5F) and achieves the highest functional label accuracy, while remaining computationally efficient; it surpasses strong baselines and Gemini 2.5 Pro on these metrics and remains competitive under relaxed tolerance (HR3F). Code, datasets, and model are publicly available.
Music structure analysis (MSA) is fundamental to music understanding and controllable generation, yet progress has been limited by small, inconsistent datasets. The paper proposes SongFormer, a scalable framework that learns from heterogeneous supervision. SongFormer (i) fuses short-window and long-window self-supervised audio representations to capture both fine-grained and long-range dependencies, and (ii) introduces a learned source embedding to support training with partial, noisy, and schema-mismatched labels. To enable scaling and fair evaluation, the authors release SongFormDB, the largest MSA corpus to date (over 10,000 tracks spanning languages and genres), and SongFormBench, a 300-song expert-verified benchmark. On SongFormBench, SongFormer sets a new state of the art in strict boundary detection (HR.5F) and achieves the highest functional label accuracy while remaining computationally efficient. It surpasses strong baselines and Gemini 2.5 Pro on these metrics and remains competitive under relaxed tolerance (HR3F).
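To make the two boundary metrics concrete, here is a minimal sketch of a boundary hit-rate F-measure. It assumes the conventional reading of HR.5F and HR3F as hit rates with tolerances of 0.5 s and 3 s respectively. The function name `boundary_hit_rate` and the example boundary times are illustrative, not taken from the paper, and the one-to-one matching here is a simple two-pointer pass over sorted times rather than the evaluation code the authors used.

```python
def boundary_hit_rate(reference, estimated, tolerance=0.5):
    """Return (precision, recall, f_measure) for boundary times in seconds.

    An estimated boundary counts as a hit if it lies within `tolerance`
    seconds of a reference boundary; each boundary is matched at most once.
    """
    ref = sorted(reference)
    est = sorted(estimated)
    i = j = hits = 0
    while i < len(ref) and j < len(est):
        diff = est[j] - ref[i]
        if abs(diff) <= tolerance:
            hits += 1      # match this pair and consume both boundaries
            i += 1
            j += 1
        elif diff < 0:
            j += 1         # estimate is too early for this and later references
        else:
            i += 1         # reference is too early for this and later estimates
    precision = hits / len(est) if est else 0.0
    recall = hits / len(ref) if ref else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f_measure


# Made-up boundary times (seconds), for illustration only.
reference_times = [0.0, 15.2, 45.0, 75.5]
estimated_times = [0.3, 16.0, 44.8, 90.0]

strict = boundary_hit_rate(reference_times, estimated_times, tolerance=0.5)
relaxed = boundary_hit_rate(reference_times, estimated_times, tolerance=3.0)
print("0.5 s tolerance:", strict)
print("3 s tolerance:  ", relaxed)
```

In this toy example the estimate at 16.0 s misses the reference at 15.2 s under the strict tolerance but counts as a hit under the relaxed one, which is why a system with coarse temporal resolution can look competitive on HR3F while trailing on HR.5F.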
Music Structure Analysis (MSA) aims to segment songs into functionally meaningful sections (e.g., intro, verse, chorus) and detect their boundaries, serving as a core task for music understanding and controllable generation. With the rapid development of music generation systems, incorporating MSA as a structural prior has become increasingly important.
Data Scarcity: Public corpora are small and heterogeneous (HarmonixSet, for example, contains only 912 songs); annotation schemas and formats are inconsistent, and access is restricted
Method Limitations: Many systems train from scratch rather than leveraging powerful self-supervised/foundation audio models, and they rely on complex preprocessing such as beat tracking and source separation
Temporal Resolution Issues: General-purpose multimodal LLMs (e.g., Gemini 2.5 Pro) can produce structural annotations but with insufficient temporal resolution for precise boundary detection
This work aims to address the data bottleneck and methodological limitations in MSA, proposing a simple, scalable framework that learns from heterogeneous supervision while maintaining temporal precision.
Local Representation: Audio is divided into consecutive 30 s blocks to obtain fine-grained local features
Global Representation: A 420 s window is processed as a whole to capture global context
Feature Fusion: Fourteen consecutive 30 s blocks are concatenated along the time axis so that they align with the 420 s global representation (14 × 30 s = 420 s); the MuQ and MusicFM representations are fused along the feature dimension
Downsampling: Residual downsampling modules reduce the temporal resolution from 25 Hz to approximately 8.33 Hz (a factor of 3)
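The shape bookkeeping behind these four steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `fake_encoder` is a hypothetical stand-in that returns random features at 25 Hz, the feature sizes 4 and 6 are toy values, concatenating the local and global streams of both encoders along the feature axis is one plausible reading of the fusion step, and plain average pooling replaces the learned residual downsampling modules.

```python
import numpy as np

FRAME_RATE_HZ = 25       # encoder output rate stated in the text
BLOCK_SEC = 30           # short (local) window
WINDOW_SEC = 420         # long (global) window
DOWNSAMPLE_FACTOR = 3    # assumption: 25 Hz / 3 is roughly 8.33 Hz


def fake_encoder(num_seconds, dim, rng):
    """Hypothetical stand-in for an SSL encoder: (frames, dim) random features."""
    return rng.standard_normal((num_seconds * FRAME_RATE_HZ, dim))


def local_stream(dim, rng):
    """Encode 14 consecutive 30 s blocks and concatenate them along time."""
    num_blocks = WINDOW_SEC // BLOCK_SEC  # 14
    blocks = [fake_encoder(BLOCK_SEC, dim, rng) for _ in range(num_blocks)]
    return np.concatenate(blocks, axis=0)


def global_stream(dim, rng):
    """Encode the whole 420 s window in one pass."""
    return fake_encoder(WINDOW_SEC, dim, rng)


def fuse(streams):
    """Concatenate time-aligned streams along the feature dimension."""
    num_frames = streams[0].shape[0]
    assert all(s.shape[0] == num_frames for s in streams), "streams must align"
    return np.concatenate(streams, axis=1)


def downsample(feats, factor):
    """Average-pool over time (a simple substitute for learned downsampling)."""
    usable = feats.shape[0] - feats.shape[0] % factor
    return feats[:usable].reshape(-1, factor, feats.shape[1]).mean(axis=1)


rng = np.random.default_rng(0)
DIM_A, DIM_B = 4, 6  # toy feature sizes for two hypothetical encoders

fused = fuse([
    local_stream(DIM_A, rng), global_stream(DIM_A, rng),
    local_stream(DIM_B, rng), global_stream(DIM_B, rng),
])
reduced = downsample(fused, DOWNSAMPLE_FACTOR)
output_rate_hz = FRAME_RATE_HZ / DOWNSAMPLE_FACTOR

print(fused.shape, reduced.shape, round(output_rate_hz, 2))
```

The alignment works because 14 blocks of 30 s produce the same number of 25 Hz frames as one 420 s window (14 × 750 = 10,500), so the streams can be stacked frame by frame before the time axis is reduced by a factor of 3.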
Compared to existing work, SongFormer is the first to systematically fuse multi-resolution SSL representations and introduce heterogeneous supervision strategies, while constructing the largest MSA dataset to date.
Music structure analysis survey (Nieto et al., 2020)
MuQ and MusicFM self-supervised models (Zhu et al., 2025; Won et al., 2024)
Related deep learning methods (Wang et al., 2022; Kim & Nam, 2023)
Overall Assessment: This is a high-quality paper that makes significant contributions to music structure analysis. The technical approach is innovative and practical, the experimental design is rigorous and comprehensive, and the dataset contribution is substantial and should help the field move forward. The public release of code, datasets, and model also reflects a commendable commitment to open research.