SongFormer: Scaling Music Structure Analysis with Heterogeneous Supervision

Hao, Yuan, Yao et al.
Music structure analysis (MSA) underpins music understanding and controllable generation, yet progress has been limited by small, inconsistent corpora. We present SongFormer, a scalable framework that learns from heterogeneous supervision. SongFormer (i) fuses short- and long-window self-supervised audio representations to capture both fine-grained and long-range dependencies, and (ii) introduces a learned source embedding to enable training with partial, noisy, and schema-mismatched labels. To support scaling and fair evaluation, we release SongFormDB, the largest MSA corpus to date (over 10k tracks spanning languages and genres), and SongFormBench, a 300-song expert-verified benchmark. On SongFormBench, SongFormer sets a new state of the art in strict boundary detection (HR.5F) and achieves the highest functional label accuracy, while remaining computationally efficient; it surpasses strong baselines and Gemini 2.5 Pro on these metrics and remains competitive under relaxed tolerance (HR3F). Code, datasets, and model are publicly available.

Basic Information

  • Paper ID: 2510.02797
  • Title: SongFormer: Scaling Music Structure Analysis with Heterogeneous Supervision
  • Authors: Chunbo Hao, Ruibin Yuan, Jixun Yao, Qixin Deng, Xinyi Bai, Wei Xue, Lei Xie
  • Category: eess.AS (Audio and Speech Processing)
  • Publication Date: October 11, 2025 (arXiv v2)
  • Paper Link: https://arxiv.org/abs/2510.02797

Abstract

Music Structure Analysis (MSA) is fundamental to music understanding and controllable generation, yet progress has been limited by small-scale, inconsistent datasets. This paper proposes SongFormer, a scalable framework that learns from heterogeneous supervision. SongFormer (i) fuses short-window and long-window self-supervised audio representations to capture fine-grained and long-range dependencies, and (ii) introduces a learned source embedding to support training with partial, noisy, and schema-mismatched labels. To enable scaling and fair evaluation, the authors release SongFormDB, the largest MSA corpus to date (over 10,000 multilingual and multi-genre tracks), and SongFormBench, a 300-track expert-verified benchmark. On SongFormBench, SongFormer achieves state-of-the-art strict boundary detection (HR.5F) and the highest functional label accuracy while remaining computationally efficient. It surpasses strong baselines and Gemini 2.5 Pro on these metrics and remains competitive under relaxed tolerance (HR3F).

Research Background and Motivation

Problem Definition

Music Structure Analysis (MSA) aims to segment songs into functionally meaningful sections (e.g., intro, verse, chorus) and detect their boundaries, serving as a core task for music understanding and controllable generation. With the rapid development of music generation systems, incorporating MSA as a structural prior has become increasingly important.

Existing Challenges

  1. Data Scarcity: Public corpora are small and heterogeneous (HarmonixSet, for example, has only 912 songs); annotation schemas and formats are inconsistent, and access is often restricted
  2. Method Limitations: Many systems train from scratch rather than leveraging powerful self-supervised/foundation audio models; they rely on complex preprocessing (beat tracking, source separation)
  3. Temporal Resolution Issues: General-purpose multimodal LLMs (e.g., Gemini 2.5 Pro) can produce structural annotations but with insufficient temporal resolution for precise boundary detection

Research Motivation

This work aims to address the data bottleneck and methodological limitations in MSA, proposing a simple, scalable framework that learns from heterogeneous supervision while maintaining temporal precision.

Core Contributions

  1. Proposes SongFormer Framework: Fuses multi-resolution self-supervised representations (30s and 420s windows) to capture fine-grained and long-range dependencies
  2. Heterogeneous Supervision Strategy: Introduces learned data source embeddings to support training with partial, noisy, and schema-mismatched labels
  3. Constructs Large-Scale Datasets: Releases SongFormDB (over 10,000 tracks) and SongFormBench (300 expert-validated benchmark tracks)
  4. Achieves SOTA Performance: Sets new records in strict boundary detection and functional label accuracy, surpassing strong baselines and Gemini 2.5 Pro

Methodology Details

Task Definition

MSA is modeled as a temporal annotation task, with audio waveform as input and structured annotation sequence as output:

{(t₀, l₀), (t₁, l₁), ..., (tₙ₋₁, lₙ₋₁), (tₙ, end)}

where tᵢ and lᵢ represent the start time and label of segment i, and the final pair (tₙ, end) marks the end time of the last segment.
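As an illustration of this output format, the sketch below converts such a sequence into explicit (start, end, label) segments. The function name `to_segments` and the example times and labels are hypothetical and are not taken from the paper or its code release.

```python
def to_segments(annotation):
    """Convert [(t0, l0), ..., (tn, "end")] into (start, end, label) triples."""
    if not annotation or annotation[-1][1] != "end":
        raise ValueError("annotation must terminate with a (t_n, 'end') entry")
    segments = []
    # Each segment runs from its own start time to the next entry's start time.
    for (start, label), (next_start, _) in zip(annotation, annotation[1:]):
        if next_start <= start:
            raise ValueError("start times must be strictly increasing")
        segments.append((start, next_start, label))
    return segments


# Hypothetical annotation for a 70-second clip.
example = [(0.0, "intro"), (12.5, "verse"), (45.0, "chorus"), (70.0, "end")]
print(to_segments(example))
# [(0.0, 12.5, 'intro'), (12.5, 45.0, 'verse'), (45.0, 70.0, 'chorus')]
```

The terminating (tₙ, end) entry carries no label of its own; it only closes the last segment, which is why n + 1 entries yield n segments.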

Model Architecture

1. Multi-Resolution SSL Representation Fusion

  • Local Representation: Audio is divided into consecutive 30s blocks to obtain fine-grained local features
  • Global Representation: Processes 420s long windows to capture overall global context
  • Feature Fusion: Temporally concatenates 14 consecutive 30s blocks aligned with 420s global representation; fuses MuQ and MusicFM representations in the feature dimension
  • Downsampling: Reduces temporal resolution from 25Hz to approximately 8.33Hz through residual downsampling modules
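The frame arithmetic behind this fusion can be checked with a small sketch. It uses plain Python lists in place of real SSL features; the feature sizes (2 and 3), the helper names, and the downsampling factor of 3 (inferred from 25 Hz / 8.33 Hz) are assumptions made for illustration, and the actual model uses learned residual downsampling modules rather than this shape calculation.

```python
FRAME_RATE_HZ = 25        # SSL feature rate before downsampling (from the text)
LOCAL_WINDOW_S = 30       # short window
GLOBAL_WINDOW_S = 420     # long window
DOWNSAMPLE_FACTOR = 3     # assumption: 25 Hz / 3 is approximately 8.33 Hz


def concat_time(blocks):
    """Join per-block frame lists end to end along the time axis."""
    return [frame for block in blocks for frame in block]


def concat_features(*streams):
    """Join time-aligned streams frame by frame along the feature axis."""
    if len({len(stream) for stream in streams}) != 1:
        raise ValueError("all streams must have the same number of frames")
    return [[x for frame in frames for x in frame] for frames in zip(*streams)]


n_blocks = GLOBAL_WINDOW_S // LOCAL_WINDOW_S        # 14 local blocks per window
frames_per_block = LOCAL_WINDOW_S * FRAME_RATE_HZ   # 750 frames per 30 s block
total_frames = GLOBAL_WINDOW_S * FRAME_RATE_HZ      # 10500 frames per 420 s

# Toy features: 2 local dims and 3 global dims (hypothetical sizes).
local_blocks = [[[0.0, 0.0] for _ in range(frames_per_block)]
                for _ in range(n_blocks)]
global_feats = [[1.0, 1.0, 1.0] for _ in range(total_frames)]

# Local blocks are joined in time, then aligned with the global stream.
fused = concat_features(concat_time(local_blocks), global_feats)
downsampled_frames = total_frames // DOWNSAMPLE_FACTOR
output_rate_hz = FRAME_RATE_HZ / DOWNSAMPLE_FACTOR

print(len(fused), len(fused[0]), downsampled_frames, round(output_rate_hz, 2))
```

Under these assumptions, 14 blocks of 750 frames line up exactly with the 10,500 frames of the 420s window, and a factor-3 reduction leaves 3,500 frames at roughly 8.33Hz.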

2. Heterogeneous Supervision Strategy

  • Data Source Embeddings: Adds learned data source embeddings to downsampled feature sequences, indicating training sample origins
  • Conditional Learning: Model learns source-specific annotation patterns and noise characteristics
  • Fixed Inference: During inference, the data source embedding is fixed to the one for the high-quality HarmonixSet source
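A minimal sketch of this conditioning scheme follows. The source names mirror the dataset subsets described in this summary, but the table of random vectors, the toy hidden size, and the functions `add_source_embedding` and `condition` are hypothetical stand-ins; in the actual model the embeddings are learned jointly with the network.

```python
import random

SOURCES = ["HX", "Private", "Hook", "Gem"]  # one entry per training source
HIDDEN_DIM = 4                              # toy size for illustration

rng = random.Random(0)
# One vector per source. Random here; learned during training in practice.
source_table = {
    name: [rng.gauss(0.0, 0.02) for _ in range(HIDDEN_DIM)] for name in SOURCES
}


def add_source_embedding(frames, source):
    """Add the same source vector to every frame of the feature sequence."""
    emb = source_table[source]
    return [[x + e for x, e in zip(frame, emb)] for frame in frames]


def condition(frames, source, training):
    """Use the sample's own source in training, the HX source at inference."""
    return add_source_embedding(frames, source if training else "HX")


frames = [[0.0] * HIDDEN_DIM for _ in range(3)]
train_out = condition(frames, "Gem", training=True)
infer_out = condition(frames, "Gem", training=False)
print(train_out[0] != infer_out[0])
```

The design intent, as described above, is that source-specific label noise and schema differences are absorbed by the embedding during training, so that selecting the cleanest source at inference steers predictions toward its annotation style.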

3. Transformer Encoder

  • 4-layer Transformer encoder with RoPE positional encoding to capture temporal dependencies
  • Hidden dimension of 512 with two task-specific heads: boundary detection and functional label prediction
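For readers unfamiliar with RoPE, the sketch below shows the general technique on a single vector: consecutive pairs of dimensions are rotated by position-dependent angles, so the dot product between two rotated vectors depends only on their relative offset. This is a generic illustration, not the paper's implementation; the base of 10000 is the commonly used default and is an assumption here.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    if len(vec) % 2:
        raise ValueError("dimension must be even")
    out = []
    for i in range(0, len(vec), 2):
        # Pair k = i / 2 uses frequency base ** (-2k / d) = base ** (-i / d).
        theta = position * base ** (-i / len(vec))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.4]

# Same relative offset (3 frames) at two different absolute positions.
near = dot(rope(q, 5), rope(k, 2))
far = dot(rope(q, 13), rope(k, 10))
print(abs(near - far) < 1e-9)
```

Because attention scores then depend on relative rather than absolute frame positions, this style of encoding is a natural fit for long sequences such as the 420s input window.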

Training Objectives

The total loss function is:

L = λ(L_BCE + λ_TV L_TV) + (1-λ)(L_CE + λ_Focal L_Focal)

where:

  • Boundary Detection: Binary cross-entropy loss + boundary-aware 1D total variation loss (prevents over-smoothing at true boundaries)
  • Functional Prediction: Frame-level cross-entropy loss + softmax focal loss (focuses on uncertain frames)
  • Hyperparameters: λ=0.2, λ_TV=0.05, λ_Focal=0.2
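To make the weighting concrete, the sketch below combines four already-computed loss terms using the formula and hyperparameters above. The function name `total_loss` and the unit loss values are illustrative only; how each individual term is computed is not shown.

```python
LAMBDA = 0.2        # weight of the boundary branch
LAMBDA_TV = 0.05    # weight of the total variation term
LAMBDA_FOCAL = 0.2  # weight of the focal term


def total_loss(l_bce, l_tv, l_ce, l_focal,
               lam=LAMBDA, lam_tv=LAMBDA_TV, lam_focal=LAMBDA_FOCAL):
    """L = lam * (L_BCE + lam_tv * L_TV) + (1 - lam) * (L_CE + lam_focal * L_Focal)."""
    boundary = l_bce + lam_tv * l_tv
    function = l_ce + lam_focal * l_focal
    return lam * boundary + (1.0 - lam) * function


# Effective weight of each term after expanding the formula.
effective = {
    "BCE": LAMBDA,
    "TV": LAMBDA * LAMBDA_TV,
    "CE": 1.0 - LAMBDA,
    "Focal": (1.0 - LAMBDA) * LAMBDA_FOCAL,
}

for name, weight in effective.items():
    print(name, round(weight, 4))
print(round(total_loss(1.0, 1.0, 1.0, 1.0), 4))
```

Expanding the formula gives effective weights of about 0.2 (BCE), 0.01 (TV), 0.8 (CE), and 0.16 (Focal), so with these settings the functional-label branch carries most of the total loss.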

Experimental Setup

Datasets

SongFormDB (Training Set, >10k tracks)

  1. SongForm-HX: 512 training and 200 validation tracks reconstructed from HarmonixSet with refined annotations
  2. SongForm-Private: 4,314 tracks with structure labels derived from lyrics, timestamps corrected using SOFA aligner
  3. SongForm-Hook: 5,933 tracks with precise structural annotations for partial segments
  4. SongForm-Gem: 4,387 tracks across 47 languages with annotations generated using Gemini 2.5 Pro API

SongFormBench (Test Set, 300 tracks)

  • SongFormBench-HarmonixSet: 200 expert-revised HarmonixSet songs
  • SongFormBench-CN: 100 Chinese songs, addressing data scarcity for MSA in Chinese

Evaluation Metrics

  1. HR.5F: F-score of boundary hit rate within 0.5 seconds (strict boundary detection)
  2. HR3F: F-score of boundary hit rate within 3 seconds (loose boundary detection)
  3. ACC: Frame-level functional label accuracy
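The three metrics can be sketched in a few lines. This simplified version pairs sorted boundaries with a greedy one-to-one sweep, whereas standard toolkits such as mir_eval use a maximum bipartite matching, so it is an approximation for illustration rather than the evaluation code behind the reported numbers. The boundary times and labels are made up.

```python
def hit_rate_f(reference, estimated, tolerance):
    """F-measure of boundary hits; each boundary is matched at most once."""
    ref, est = sorted(reference), sorted(estimated)
    if not ref or not est:
        return 0.0
    hits = i = j = 0
    while i < len(ref) and j < len(est):
        if abs(ref[i] - est[j]) <= tolerance:
            hits += 1
            i += 1
            j += 1
        elif est[j] < ref[i]:
            j += 1  # estimate is too early to match any remaining reference
        else:
            i += 1  # reference has no estimate within tolerance
    if hits == 0:
        return 0.0
    precision = hits / len(est)
    recall = hits / len(ref)
    return 2 * precision * recall / (precision + recall)


def frame_accuracy(ref_labels, est_labels):
    """Fraction of frames whose predicted label equals the reference label."""
    if len(ref_labels) != len(est_labels):
        raise ValueError("label sequences must have the same length")
    if not ref_labels:
        return 0.0
    return sum(r == e for r, e in zip(ref_labels, est_labels)) / len(ref_labels)


ref = [0.0, 10.0, 30.0, 50.0]        # hypothetical reference boundaries (s)
est = [0.3, 11.0, 29.8, 52.5, 60.0]  # hypothetical predicted boundaries (s)

print(round(hit_rate_f(ref, est, 0.5), 3))  # HR.5F
print(round(hit_rate_f(ref, est, 3.0), 3))  # HR3F
print(frame_accuracy(["verse", "verse", "chorus", "chorus"],
                     ["verse", "chorus", "chorus", "chorus"]))
```

In this toy case two of the five predictions fall within 0.5 s of a reference boundary (F of about 0.444), while four fall within 3 s (F of about 0.889), which illustrates how a system can look strong on HR3F yet weak on HR.5F.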

Implementation Details

  • Maximum input duration 420s, output frame rate approximately 8.33Hz (after downsampling)
  • Boundaries smoothed with Gaussian kernel (10-frame window, approximately 2.4s)
  • Batch size 8, cosine learning rate schedule (peak 1×10⁻⁴)
  • Single NVIDIA L40 GPU, averaged over three random seeds

Experimental Results

Main Results

SongFormBench-HarmonixSet

Method                 ACC    HR.5F  HR3F
All-In-One             0.740  0.596  0.730
LinkSeg-7Labels        0.780  0.630  0.762
TA (Zhang et al.)      0.787  0.610  0.801
Gemini 2.5 Pro         0.748  0.423  0.813
SongFormer (HX)        0.795  0.703  0.784
SongFormer (HX+P+H+G)  0.807  0.696  0.780

SongFormBench-CN

Method                 ACC    HR.5F  HR3F
All-In-One             0.834  0.563  0.771
Gemini 2.5 Pro         0.806  0.412  0.833
SongFormer (HX+P+H)    0.890  0.690  0.852
SongFormer (HX+P+H+G)  0.891  0.688  0.851

Ablation Studies

  1. Multi-Resolution Representations: Combining 30s and 420s windows outperforms single-window approaches
  2. Data Source Embeddings: Removal decreases ACC from 0.848 to 0.825
  3. Transformer vs. Linear Layers: Transformer backend significantly outperforms simple linear layers
  4. Downsampling Strategy: Moderate downsampling achieves optimal balance between efficiency and accuracy

Key Findings

  1. Strongest Label Accuracy: SongFormer achieves highest ACC on both benchmarks
  2. More Precise Boundary Detection: Provides sharper and more reliable boundary predictions under strict evaluation
  3. Data Scaling Benefits: Increased training data improves robustness, though annotation inaccuracies slightly impact boundary precision
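  4. Superior to LLMs on Strict Metrics: Outperforms Gemini 2.5 Pro on ACC and HR.5F on both benchmarks (e.g., HR.5F of 0.696 vs. 0.423 on SongFormBench-HarmonixSet); on the relaxed HR3F metric, Gemini 2.5 Pro is higher on SongFormBench-HarmonixSet (0.813 vs. 0.780), while SongFormer is higher on SongFormBench-CN (0.851 vs. 0.833)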

Related Work

MSA Method Evolution

  1. Traditional Methods: Rule-based and machine learning approaches using audio features
  2. Deep Learning: CNNs and RNNs for boundary detection and functional annotation
  3. Self-Supervised Learning: Pretrained audio models are increasingly available, though many MSA systems still train from scratch rather than leveraging them

Dataset Development

  • HarmonixSet: 912 Western popular music tracks with high annotation quality but limited scale
  • Other Datasets: Smaller scale, inconsistent annotations, restricted access

Novelty of This Work

Compared to existing work, SongFormer is the first to systematically fuse multi-resolution SSL representations and introduce heterogeneous supervision strategies, while constructing the largest MSA dataset to date.

Conclusions and Discussion

Main Conclusions

  1. SongFormer achieves SOTA performance through multi-resolution SSL fusion and heterogeneous supervision
  2. Large-scale dataset SongFormDB and high-quality benchmark SongFormBench advance the field
  3. The method significantly outperforms existing approaches in strict boundary detection and functional label accuracy

Limitations

  1. Annotation Quality Trade-offs: Introducing additional datasets improves overall performance but annotation inaccuracies impact boundary precision
  2. Computational Complexity: Multi-resolution fusion increases computational overhead in feature extraction
  3. Language Coverage: SongFormBench covers only HarmonixSet songs and Chinese songs, so performance on other languages remains unevaluated

Future Directions

  1. Integrate MSA into controllable music generation and music information retrieval systems
  2. Explore structure analysis for more languages and music genres
  3. Investigate joint optimization of end-to-end music generation and structure analysis

In-Depth Evaluation

Strengths

  1. Strong Technical Innovation: Multi-resolution SSL fusion elegantly balances short and long-range context
  2. Practical Heterogeneous Supervision Strategy: Data source embeddings effectively handle inconsistent annotation quality
  3. Significant Data Contribution: SongFormDB and SongFormBench fill critical gaps in the field
  4. Comprehensive Experiments: Detailed ablation studies validate the effectiveness of each component
  5. Open-Source Friendly: Code, data, and models are publicly available for reproducibility

Weaknesses

  1. Method Complexity: Fusion of multiple SSL models increases system complexity
  2. Evaluation Limitations: Primarily evaluated on popular music; coverage of other genres like classical music is insufficient
  3. Real-Time Processing Analysis: Lacks discussion of real-time processing capabilities and applicability to practical deployment

Impact

  1. Academic Value: Provides new technical paradigm and large-scale data resources for MSA research
  2. Practical Value: Directly applicable to music recommendation, generation, and editing systems
  3. Reproducibility: Complete open-source release ensures reproducibility and future development

Applicable Scenarios

  1. Intelligent recommendation and playlist generation for music streaming platforms
  2. Automatic structure analysis and editing in music production software
  3. Auxiliary teaching tools for music structure theory in music education
  4. Structural constraints for controllable music generation systems

References

Key references include:

  • HarmonixSet dataset (Nieto et al., 2019)
  • Music structure analysis survey (Nieto et al., 2020)
  • MuQ and MusicFM self-supervised models (Zhu et al., 2025; Won et al., 2024)
  • Related deep learning methods (Wang et al., 2022; Kim & Nam, 2023)

Overall Assessment: This is a strong paper with clear contributions to music structure analysis. The approach is simple and practical, the experiments include ablations for each main component, and the dataset and benchmark release is substantial. Publicly releasing the code, data, and models supports reproducibility and follow-up work.