
Performance of heavy-flavour jet identification in Lorentz-boosted topologies in proton-proton collisions at $\sqrt{s}$ = 13 TeV

CMS Collaboration
Measurements in the highly Lorentz-boosted regime provoke increased interest in probing the Higgs boson properties and in searching for particles beyond the standard model at the LHC. In the CMS Collaboration, various boosted-object tagging algorithms, designed to identify hadronic jets originating from a massive particle decaying to $\mathrm{b\overline{b}}$ or $\mathrm{c\overline{c}}$, have been developed and deployed across a range of physics analyses. This paper highlights their performance on simulated events, and summarizes novel calibration techniques using proton-proton collision data collected at $\sqrt{s}$ = 13 TeV during the 2016$-$2018 LHC data-taking period. Three dedicated methods are used for the calibration in multijet events, leveraging either machine learning techniques, the presence of muons within energetic boosted jets, or the reconstruction of hadronically decaying high-energy Z bosons. The calibration results, obtained through a combination of these approaches, are presented and discussed.

Performance of heavy-flavour jet identification in Lorentz-boosted topologies in proton-proton collisions at $\sqrt{s}$ = 13 TeV

Basic Information

  • Paper ID: 2510.10228
  • Title: Performance of heavy-flavour jet identification in Lorentz-boosted topologies in proton-proton collisions at $\sqrt{s}$ = 13 TeV
  • Authors: CMS Collaboration
  • Classification: physics.ins-det hep-ex
  • Publication Date: October 14, 2025
  • Journal: Journal of Instrumentation (under submission)
  • Paper Link: https://arxiv.org/abs/2510.10228

Abstract

This paper investigates the performance of heavy-flavour jet identification in highly Lorentz-boosted topologies, which is crucial for probing Higgs boson properties and searching for beyond-Standard-Model particles at the LHC. The CMS Collaboration has developed several tagging algorithms to identify hadronic jets from the decay of massive particles into $\mathrm{b\overline{b}}$ or $\mathrm{c\overline{c}}$ pairs. The paper demonstrates the performance of these algorithms on simulated events and summarizes novel calibration techniques using proton-proton collision data collected at $\sqrt{s}$ = 13 TeV during the 2016-2018 LHC data-taking period.

Research Background and Motivation

Physics Background

  1. Boosted topologies in high-energy physics: At the TeV scale, heavy particles (such as Higgs bosons and beyond-Standard-Model particles) are produced with high momentum, so their decay products are collimated within a single large-radius jet
  2. Importance of heavy-flavour jet tagging: Accurate identification of $\mathrm{b\overline{b}}$ and $\mathrm{c\overline{c}}$ jets is critical for Higgs physics studies and new physics searches
  3. Calibration requirements: Discrepancies exist between jet tagging efficiencies in simulated events and actual data, necessitating precise data-driven calibration methods

Research Motivation

  1. Precision Standard Model measurements: Precise measurement of Higgs boson decay to heavy-flavour quarks
  2. New physics searches: Discovery of new resonant states decaying to heavy-flavour quark pairs
  3. Detector performance optimization: Enhancement of the CMS detector's physics object reconstruction performance in boosted topologies

Core Contributions

  1. Comprehensive performance assessment: First comprehensive comparison of seven heavy-flavour jet tagging algorithms developed by CMS during Run 2
  2. Innovative calibration methods: Development of three independent data-driven calibration approaches:
    • sfBDT method (machine learning-based gluon-splitting jet selection)
    • μ-tagging method (utilizing soft muons within jets)
    • Boosted Z boson method (employing Z→bb decay)
  3. Precise scale factor measurements: Combination of multiple measurements via the BLUE method, providing high-precision efficiency correction factors
  4. Comprehensive systematic uncertainty assessment: Thorough evaluation of various systematic uncertainty sources and their impacts

Methodology Details

Task Definition

Input: Physical characteristics of large-radius jets (AK8 jets, R = 0.8)
Output: Jet origin classification probabilities (X→bb, X→cc, QCD, etc.)
Objective: Maximize signal efficiency and suppress the QCD multijet background while keeping the tagger output decorrelated from the jet mass

Tagging Algorithm Architecture

1. ParticleNet-MD

  • Architecture: Graph neural network-based particle-level feature processing
  • Input: Kinematic and geometric features of particle flow candidates and secondary vertices
  • Innovation: Permutation-invariant convolutional operations with local feature extraction in η-φ space
  • Output: Mass-decorrelated probability scores
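
The "local feature extraction in η-φ space" mentioned above relies on finding, for each particle, its nearest neighbours in the (η, φ) plane. The following is a minimal sketch of that neighbour search only, not the ParticleNet implementation: the function names and the toy particle list are hypothetical, and the real network also builds learned edge features on top of the neighbour graph.

```python
import math

def delta_phi(phi1, phi2):
    """Signed azimuthal difference wrapped into [-pi, pi)."""
    return (phi1 - phi2 + math.pi) % (2 * math.pi) - math.pi

def delta_r(a, b):
    """Angular distance between two particles given as (eta, phi) tuples."""
    return math.hypot(a[0] - b[0], delta_phi(a[1], b[1]))

def k_nearest(particles, k):
    """For each particle, return the indices of its k nearest neighbours."""
    neighbours = []
    for i, p in enumerate(particles):
        others = [j for j in range(len(particles)) if j != i]
        others.sort(key=lambda j: delta_r(p, particles[j]))
        neighbours.append(others[:k])
    return neighbours

# Toy jet constituents as (eta, phi); values are illustrative only.
toy_particles = [(0.0, 3.1), (0.0, -3.1), (0.5, 0.0)]
toy_neighbours = k_nearest(toy_particles, k=1)
```

The wrap-around in delta_phi matters: the first two toy particles sit on either side of φ = ±π and are close neighbours, although their raw φ values differ by 6.2.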

2. DeepDoubleX

  • Architecture: Combination of 1D convolutional layers and gated recurrent units
  • Feature engineering: Feature selection using layer-wise relevance propagation techniques
  • Mass decorrelation: Achieved through reweighting to match signal jet mass distribution to QCD background
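
As a rough illustration of the reweighting idea, the sketch below derives per-jet weights that make a binned signal mass distribution match the QCD one. It is a one-dimensional toy under stated assumptions: the binning, the function names, and the input lists are hypothetical, and this summary does not say which variables or binning the actual training uses.

```python
import bisect

def bin_index(x, edges):
    """Index of the bin containing x, or None if x is outside the binning."""
    if x < edges[0] or x >= edges[-1]:
        return None
    return bisect.bisect_right(edges, x) - 1

def normalised_hist(values, edges, weights=None):
    """Weighted histogram normalised to unit sum."""
    counts = [0.0] * (len(edges) - 1)
    if weights is None:
        weights = [1.0] * len(values)
    for x, w in zip(values, weights):
        i = bin_index(x, edges)
        if i is not None:
            counts[i] += w
    total = sum(counts)
    return [c / total for c in counts]

def mass_reweighting(signal_mass, qcd_mass, edges):
    """Per-jet weights that reshape the signal mass spectrum to the QCD one."""
    sig = normalised_hist(signal_mass, edges)
    qcd = normalised_hist(qcd_mass, edges)
    # Bins with no signal jets cannot be reweighted; their ratio is set to 0.
    ratio = [q / s if s > 0 else 0.0 for q, s in zip(qcd, sig)]
    weights = []
    for m in signal_mass:
        i = bin_index(m, edges)
        weights.append(ratio[i] if i is not None else 0.0)
    return weights
```

After applying these weights, the signal and QCD samples have the same binned mass shape, so the classifier gains nothing from learning the jet mass itself.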

3. DeepAK8-MD

  • Architecture: Multi-class classifier based on 1D residual convolutional layers
  • Adversarial training: Mass decorrelation implemented using a mass prediction network as a penalty term in the loss function
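
A generic way to write such a penalty, given here only as a schematic form and not as the loss quoted in the paper, is to subtract the loss of the mass prediction network from the classification loss. The symbols $\mathcal{L}_{\text{class}}$, $\mathcal{L}_{\text{mass}}$, and $\lambda$ are notation introduced for this illustration.

```latex
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{class}} - \lambda \, \mathcal{L}_{\text{mass}}, \qquad \lambda > 0
```

Minimizing $\mathcal{L}_{\text{total}}$ rewards the classifier when the mass predictor performs poorly (large $\mathcal{L}_{\text{mass}}$), which pushes the learned features to carry little information about the jet mass. The hyperparameter $\lambda$ sets the trade-off between tagging performance and decorrelation.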

4. Double-b Tagger

  • Architecture: Boosted decision tree (BDT)-based approach
  • Features: High-level track and secondary vertex construction variables

Calibration Methods

1. sfBDT Method

Core concept: Use a BDT to select gluon-splitting (g→bb/cc) jets that resemble signal jets, so that they can serve as proxies for the signal in data

Key innovations:
- Definition of hadron-level N-subjettiness variable τ^h_31 to distinguish signal from background
- Automated sfBDT selection threshold determination procedure
- 81 different selection combinations for systematic uncertainty assessment

2. μ-Tagging Method

Physical principle: Semi-leptonic decay modes of b(c) hadrons produce soft muons
Selection criteria:
- Presence of soft muons with pT > 5 GeV within the jet
- τ21 < 0.3 (selecting double-prong jet structure)
- Relative isolation Irel > 0.15
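
The selection criteria listed above can be sketched as a simple predicate. The dictionary layout and field names (tau21, muons, pt, rel_iso) are hypothetical, and applying the isolation requirement to the muon is an assumption, since this summary does not say which object the isolation refers to. The thresholds are the ones quoted in the list.

```python
def passes_mu_tag_selection(jet, min_muon_pt=5.0, max_tau21=0.3, min_rel_iso=0.15):
    """Return True if the jet satisfies the mu-tagging proxy selection.

    jet is a dict with a "tau21" value and a list of "muons", each muon
    being a dict with "pt" (GeV) and "rel_iso" (hypothetical field names).
    """
    has_soft_muon = any(
        mu["pt"] > min_muon_pt and mu["rel_iso"] > min_rel_iso
        for mu in jet["muons"]
    )
    two_prong_like = jet["tau21"] < max_tau21
    return has_soft_muon and two_prong_like

# Toy jet for illustration only.
toy_jet = {"tau21": 0.25, "muons": [{"pt": 7.0, "rel_iso": 0.4}]}
selected = passes_mu_tag_selection(toy_jet)
```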

3. Boosted Z Boson Method

Signal extraction: Extract Z→bb signal peak from QCD multijet background
Fitting strategy:
- 2D fit (mPNet, pT)
- QCD background modeled with polynomial function
- Simultaneous fit of regions passing and failing tagger selection
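
In a simultaneous pass/fail fit of this kind, the scale factor is commonly introduced as a parameter that moves Z→bb signal events between the two regions while conserving their total. The parameterization below is a sketch of that common approach, not necessarily the one used in the paper; $N_Z$ denotes the total Z→bb yield and $\varepsilon_{\text{sim}}$ the tagging efficiency in simulation.

```latex
\begin{aligned}
N^{\text{pass}}_{Z} &= \mathrm{SF}\,\varepsilon_{\text{sim}}\,N_{Z},\\
N^{\text{fail}}_{Z} &= \left(1-\mathrm{SF}\,\varepsilon_{\text{sim}}\right)N_{Z},
\end{aligned}
\qquad N^{\text{pass}}_{Z}+N^{\text{fail}}_{Z}=N_{Z}
```

Fitting both regions at once constrains SF from the relative size of the Z peak in the passing and failing samples.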

Experimental Setup

Datasets

  • Experimental data: Proton-proton collision data collected by CMS in 2016-2018
    • 2016 pre-VFP: 19.5 fb⁻¹
    • 2016 post-VFP: 16.8 fb⁻¹
    • 2017: 41.5 fb⁻¹
    • 2018: 59.8 fb⁻¹
  • Simulated samples:
    • QCD multijet processes (MADGRAPH5_aMC@NLO)
    • V+jets processes (Z+jets, W+jets)
    • Higgs boson production (HJ-MINLO + PYTHIA)

Evaluation Metrics

  • Signal efficiency: Fraction of correctly tagged X→bb(cc) jets
  • Background efficiency (misidentification rate): Fraction of QCD jets incorrectly tagged as signal; a lower value means stronger background rejection
  • Scale factors (SF): Ratio of data to simulation efficiency SF = ε_data/ε_sim
  • ROC curves: Trade-off relationship between signal efficiency and background efficiency
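
The scale factor definition above can be illustrated with a short calculation. This is a minimal sketch using toy counts: it assumes simple binomial uncertainties and uncorrelated data and simulation efficiencies, whereas the measurements summarized here extract the data efficiency from fits.

```python
import math

def efficiency(n_pass, n_total):
    """Tagging efficiency and its binomial uncertainty."""
    eff = n_pass / n_total
    err = math.sqrt(eff * (1.0 - eff) / n_total)
    return eff, err

def scale_factor(eff_data, err_data, eff_sim, err_sim):
    """SF = eff_data / eff_sim with relative uncertainties added in quadrature."""
    sf = eff_data / eff_sim
    rel_err = math.hypot(err_data / eff_data, err_sim / eff_sim)
    return sf, sf * rel_err

# Toy counts, for illustration only.
eff_data, err_data = efficiency(380, 1000)
eff_sim, err_sim = efficiency(4000, 10000)
sf, sf_err = scale_factor(eff_data, err_data, eff_sim, err_sim)
```

A scale factor below 1 means the simulation overestimates the tagging efficiency, so simulated events passing the tagger are weighted down by SF in an analysis.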

Working Point Definition

Three working points defined for each tagging algorithm:

  • High purity (HP): 40%(bb)/15%(cc) signal efficiency
  • Medium purity (MP): 60%(bb)/30%(cc) signal efficiency
  • Low purity (LP): 80%(bb)/50%(cc) signal efficiency
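
A working point of this kind corresponds to a threshold on the tagger score chosen so that the target fraction of simulated signal jets passes. The sketch below shows one simple way to find such a threshold; the function name and the toy scores are hypothetical, and the X→bb targets are the ones listed above.

```python
def threshold_for_efficiency(signal_scores, target_eff):
    """Score threshold such that about target_eff of signal jets have score >= threshold."""
    ordered = sorted(signal_scores, reverse=True)
    n_keep = max(1, round(target_eff * len(ordered)))
    return ordered[n_keep - 1]

# Target signal efficiencies for X->bb, as quoted in the list above.
BB_WORKING_POINTS = {"HP": 0.40, "MP": 0.60, "LP": 0.80}

# Toy tagger scores for simulated signal jets, for illustration only.
toy_scores = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]
thresholds = {
    name: threshold_for_efficiency(toy_scores, eff)
    for name, eff in BB_WORKING_POINTS.items()
}
```

Tighter working points give higher thresholds: the HP threshold keeps fewer signal jets than the LP one, in exchange for a lower background efficiency.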

Experimental Results

Algorithm Performance Comparison

Algorithm      | X→bb Performance | X→cc Performance | Mass Decorrelation
ParticleNet-MD | Optimal          | Optimal          | Excellent
DeepDoubleX    | Good             | Good             | Good
DeepAK8-MD     | Moderate         | Moderate         | Good
Double-b       | Poor             | -                | Moderate

Scale Factor Measurement Results

ParticleNet-MD X→bb (2018 Data)

pT range (GeV) | HP WP       | MP WP       | LP WP
450-500        | 0.95 ± 0.08 | 0.98 ± 0.06 | 1.02 ± 0.05
500-600        | 0.97 ± 0.09 | 1.00 ± 0.07 | 1.01 ± 0.06
>600           | 0.94 ± 0.11 | 0.99 ± 0.08 | 1.03 ± 0.07

Inter-method Consistency

Results from the three calibration methods remain consistent within uncertainty ranges:

  • sfBDT method: Generally yields higher SF values
  • μ-tagging method: Intermediate SF values with larger uncertainties
  • Boosted Z boson method: Limited by statistics, largest uncertainties

Systematic Uncertainty Decomposition

Primary uncertainty sources (using ParticleNet-MD HP WP as example):

  1. Statistical uncertainty: ~6%
  2. sfBDT selection dependence: ~5%
  3. Reweighting scheme effects: ~9%
  4. Theoretical uncertainty (ISR/FSR): ~1-4%
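
To get a feeling for how these components add up, the sketch below sums them in quadrature. This assumes the sources are uncorrelated, which this summary does not state, and it uses the approximate percentages from the list above, with the ISR/FSR term taken at both ends of its quoted range.

```python
import math

def total_relative_uncertainty(components):
    """Quadrature sum of uncorrelated relative uncertainties (in percent)."""
    return math.sqrt(sum(value ** 2 for value in components.values()))

# Approximate values from the list above, in percent.
base = {"statistical": 6.0, "sfBDT selection": 5.0, "reweighting scheme": 9.0}

total_low = total_relative_uncertainty({**base, "ISR/FSR": 1.0})
total_high = total_relative_uncertainty({**base, "ISR/FSR": 4.0})
```

Under these assumptions the total comes out at roughly 12-13%, and the reweighting term dominates, since the quadrature sum is driven by the largest component.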

Related Work

Traditional Methods

  • BDT with high-level variables: Using manually engineered jet shape variables
  • Simple b-tagging: Based on secondary vertex and track information

Deep Learning Method Evolution

  1. DeepCSV/DeepJet: Deep learning tagging for AK4 jets
  2. CNN approaches: Image-based jet processing
  3. Graph neural networks: Direct particle-level information processing
  4. Transformer architectures: Attention mechanisms in jet tagging

Calibration Method Development

  • Early methods: Based on simple kinematic selections
  • Template fitting: Signal extraction using invariant mass spectra
  • ML-assisted approaches: Using ML methods to improve proxy jet selection

Conclusions and Discussion

Main Conclusions

  1. ParticleNet-MD shows optimal performance: Achieves best performance in both X→bb and X→cc tagging tasks
  2. Neural networks outperform traditional methods: Deep learning methods significantly surpass traditional BDT-based approaches
  3. Calibration methods are effective: Three independent methods provide consistent scale factor measurements
  4. Mass decorrelation successfully implemented: All modern algorithms successfully achieve decorrelation from jet mass

Limitations

  1. Statistical precision constraints: Particularly in high-pT regions and high-purity working points
  2. Systematic uncertainties: Primarily from model dependence in proxy jet selection
  3. Applicability scope: Calibration results mainly applicable to similar boosted topologies
  4. Computational complexity: Deep learning methods incur higher computational costs

Future Directions

  1. Run 3 data analysis: Improved measurement precision using larger statistical samples
  2. New architecture exploration: Novel neural network architectures such as Transformers
  3. End-to-end optimization: Full-chain optimization from detector signals to physics analysis
  4. Real-time applications: Implementation of advanced jet tagging in trigger systems

In-Depth Evaluation

Strengths

  1. Comprehensive scope: First comprehensive comparison of all major CMS heavy-flavour jet tagging algorithms
  2. Methodological innovation: Three independent calibration methods provide mutual verification, enhancing result reliability
  3. Advanced technology: Represents the state-of-the-art in jet tagging technology
  4. High practical value: Provides important calibration tools for CMS physics analyses
  5. Complete uncertainty assessment: Systematic evaluation of various uncertainty sources

Weaknesses

  1. Limited theoretical understanding: Lacks deep physical insight into why certain methods perform better
  2. Insufficient computational efficiency discussion: Limited discussion of computational cost trade-offs among algorithms
  3. Limited generalization assessment: Insufficient evaluation of algorithm generalization across different physics processes
  4. Statistical limitations: Some measurements constrained by statistical precision

Impact

  1. Academic influence: Establishes new standards for jet tagging technology in high-energy physics experiments
  2. Practical value: Directly serves Higgs physics and new physics searches
  3. Technology transfer: Methods generalizable to other experiments and object identification tasks
  4. Industrial application potential: Deep learning techniques applicable to other pattern recognition problems

Applicable Scenarios

  1. Higgs physics research: Precision measurements of H→bb, H→cc decay channels
  2. New physics searches: Discovery of new resonant states decaying to heavy-flavour quarks
  3. Precision measurements: Analyses requiring high-precision heavy-flavour jet identification
  4. Methodological research: Benchmark testing and comparison of jet tagging algorithms

Technical Innovation Highlights

sfBDT Method Innovation

  • Hadron-level τ^h_31 variable: First use of N-subjettiness based on first-generation hadrons to distinguish signal from background
  • Automated threshold selection: Development of algorithm for automatic optimal sfBDT selection determination
  • Multiple selection strategies: Quantification of selection-dependent systematic uncertainties through 81 selection combinations

Combined Measurement Technique

  • BLUE method extension: Extension of Best Linear Unbiased Estimate method to simultaneous fitting across multiple pT intervals
  • Correlation handling: Proper treatment of systematic uncertainty correlations among different methods
  • Cross-validation: Three independent methods provide strong mutual validation
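
For reference, the standard single-observable BLUE combination can be sketched as follows: the weights are proportional to C⁻¹·1, where C is the covariance matrix of the measurements, and the combined variance is 1/(1ᵀC⁻¹1). This is the textbook method only. The paper's extension to a simultaneous treatment of several pT intervals is not reproduced here, and the input values are toy numbers.

```python
def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def blue_combine(values, covariance):
    """Best linear unbiased estimate of one quantity from correlated measurements."""
    u = solve(covariance, [1.0] * len(values))  # u = C^{-1} 1
    norm = sum(u)                               # 1^T C^{-1} 1
    weights = [ui / norm for ui in u]
    estimate = sum(w * x for w, x in zip(weights, values))
    sigma = (1.0 / norm) ** 0.5
    return estimate, sigma, weights

# Toy example: two uncorrelated measurements, 1.00 +- 0.10 and 0.90 +- 0.20.
toy_values = [1.00, 0.90]
toy_cov = [[0.10 ** 2, 0.0], [0.0, 0.20 ** 2]]
combined, combined_err, combined_weights = blue_combine(toy_values, toy_cov)
```

With uncorrelated inputs the weights reduce to inverse-variance weights, so the more precise measurement dominates. Off-diagonal covariance terms are where correlated systematic uncertainties between methods would enter.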

References

The paper cites 72 important references, covering:

  • CMS detector technical literature
  • Historical development of jet tagging algorithms
  • Deep learning applications in high-energy physics
  • Statistical methods and uncertainty treatment
  • Related physics analysis results

Overall Assessment: This is a high-quality experimental physics paper representing the state-of-the-art in jet tagging technology in particle physics experiments. The paper not only provides important technical tools but also establishes a solid foundation for future algorithm development and physics analyses. Its methodological innovations and systematic performance evaluation hold significant value for the entire high-energy physics community.