
Hybrid Multi-stage Decoding for Few-shot NER with Entity-aware Contrastive Learning

Liu, Wang, Liu et al.

Basic Information

  • Paper ID: 2404.06970
  • Title: Hybrid Multi-stage Decoding for Few-shot NER with Entity-aware Contrastive Learning
  • Authors: Congying Liu, Gaosheng Wang, Peipei Liu, Xingyuan Wei, Hongsong Zhu
  • Category: cs.CL
  • Publication Date: April 2024 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2404.06970

Abstract

Few-shot named entity recognition can identify new types of named entities based on a few labeled examples. Previous methods employing token-level or span-level metric learning suffer from the computational burden and a large number of negative sample spans. In this paper, we propose the Hybrid Multi-stage Decoding for Few-shot NER with Entity-aware Contrastive Learning (MsFNER), which splits the general NER into two stages: entity-span detection and entity classification. There are 3 processes for introducing MsFNER: training, finetuning, and inference. In the training process, we train and get the best entity-span detection model and the entity classification model separately on the source domain using meta-learning, where we create a contrastive learning module to enhance entity representations for entity classification. During finetuning, we finetune both models on the support dataset of target domain. In the inference process, for the unlabeled data, we first detect the entity-spans, then the entity-spans are jointly determined by the entity classification model and the KNN. We conduct experiments on the open FewNERD dataset and the results demonstrate the advance of MsFNER.

Research Background and Motivation

Problem Definition

Few-shot Named Entity Recognition (Few-shot NER) aims to rapidly identify new types of named entities based on a limited number of annotated examples. This task is crucial for adapting to dynamically changing real-world application scenarios, particularly when models need to quickly adapt to new data or environmental changes.

Limitations of Existing Methods

  1. Token-level Methods: Methods that compare each token against class prototypes or support-set tokens are intuitive, but they are computationally expensive and cannot preserve the semantic integrity of multi-token entities, which leaves them susceptible to interference from non-entity tokens.
  2. Span-level Methods: Evaluating entire spans alleviates some of these issues, but enumerating all possible spans in a sentence of N tokens yields O(N²) candidates and introduces substantial noise from negative (non-entity) spans.
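
To make the span-level cost concrete, the following is a minimal sketch of exhaustive span enumeration. The function name `enumerate_spans`, the example sentence, and the gold spans are hypothetical and only illustrate the quadratic growth and the negative-to-positive imbalance described above.

```python
def enumerate_spans(tokens, max_len=None):
    """Return every (start, end) span with an inclusive end index.

    max_len optionally caps the span length in tokens.
    """
    n = len(tokens)
    spans = []
    for start in range(n):
        last = n if max_len is None else min(n, start + max_len)
        for end in range(start, last):
            spans.append((start, end))
    return spans


tokens = "Barack Obama visited Paris last week".split()
spans = enumerate_spans(tokens)

# Hypothetical gold entities: "Barack Obama" and "Paris".
gold = {(0, 1), (3, 3)}
negatives = [s for s in spans if s not in gold]

# A 6-token sentence has 6 * 7 / 2 = 21 candidate spans, 19 of them negative.
print(len(spans), len(negatives))
```

Even for this short sentence, negative spans outnumber entities roughly ten to one, which is the imbalance a separate span-detection stage is meant to avoid.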

Research Motivation

The authors aim to address two core problems:

  1. How to improve the efficiency of few-shot NER by sharpening the semantic difference between entities and non-entities, so that valid entity spans can be determined
  2. How to improve entity span classification by controlling and coordinating semantic distances among different entity types, bringing same-type entities closer while pushing different-type entities farther apart

Core Contributions

  1. Proposed the MsFNER Framework: Decomposes traditional NER tasks into entity span detection and entity classification stages, effectively reducing computational complexity and mitigating negative sample effects
  2. Designed Entity-aware Contrastive Learning Module: Enhances entity representation learning, improving consistency of same-type entities while increasing distances between different-type entities
  3. Constructed Hybrid Inference Mechanism: Combines entity classification model and KNN method for joint prediction, improving classification accuracy
  4. Achieved State-of-the-Art Performance: Significantly outperforms existing methods on FewNERD and FewAPTER datasets, with comprehensive comparison against ChatGPT

Methodology Details

Task Definition

Few-shot NER is defined as follows: the model is first trained on a source-domain dataset $D_{source} = (S_{source}, Q_{source})$ and then transferred to a target-domain dataset $D_{target} = (S_{target}, Q_{target})$ for inference. Here $S_{target}$ is the support set, containing N entity types (N-way) with K annotated examples per type (K-shot), and $Q_{target}$ is the query set, containing the same entity types as the support set.

Model Architecture

MsFNER comprises three main processes:

1. Training Process

Entity Span Detection (ESD) Module:

  • Formulates entity span detection as a sequence labeling task using BIOES tagging scheme
  • For input sentence $x = (x_1, x_2, ..., x_n)$, obtains contextual representations $h = (h_1, h_2, ..., h_n)$ using BERT encoder
  • Performs entity span detection through CRF layer with training loss:

$$L_{ESD} = -\sum \log P(y|x)$$

where:

$$P(y|x) = \frac{\prod_{i=1}^{|x|} \phi_i(y_{i-1}, y_i, x)}{\sum_{y'} \prod_{i=1}^{|x|} \phi_i(y'_{i-1}, y'_i, x)}$$

  • Employs MAML meta-learning with inner and outer loop updates
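
As an illustration of the BIOES formulation, here is a small sketch that converts entity spans to tags and back. It assumes type-agnostic tags (B, I, O, E, S without an entity type), which fits a stage that only detects spans but is an assumption, not something the summary states; the helper names are hypothetical.

```python
def spans_to_bioes(n_tokens, spans):
    """Encode (start, end) spans, end inclusive, as type-agnostic BIOES tags."""
    tags = ["O"] * n_tokens
    for start, end in spans:
        if start == end:
            tags[start] = "S"  # single-token entity
        else:
            tags[start] = "B"  # begin
            for i in range(start + 1, end):
                tags[i] = "I"  # inside
            tags[end] = "E"  # end
    return tags


def bioes_to_spans(tags):
    """Decode BIOES tags into spans, dropping malformed (unclosed) spans."""
    spans = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "S":
            spans.append((i, i))
            start = None
        elif tag == "B":
            start = i
        elif tag == "E":
            if start is not None:
                spans.append((start, i))
            start = None
        elif tag == "O":
            start = None
        # "I" leaves the currently open span unchanged.
    return spans


tags = spans_to_bioes(6, [(0, 1), (3, 3)])
print(tags)
print(bioes_to_spans(tags))
```

In the described pipeline the tags would come from the CRF layer rather than from gold spans; only the decoding step would then be used at inference time.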

Entity Classification (EC) Module:

  • For entity $e_k = (x_f, ..., x_{f+l})$, obtains representation using max pooling: $\hat{e}_k = \max(h_f, ..., h_{f+l})$
  • Introduces entity-aware contrastive learning with loss function: $L_{CL} = \sum_j -\frac{1}{|P(j)|} \sum_{p \in P(j)} \log \frac{\exp(\text{sim}(z_j, z_p)/\tau)}{\sum_{a \in A(j)} \exp(\text{sim}(z_j, z_a)/\tau)}$
  • Constructs prototype representations and performs classification: $c_t(S) = \frac{1}{|S_t|} \sum_{e_m \in S_t} \hat{e}_m$

$$p_{soft}(e_k) = \frac{\exp(-d(c_t(S), \hat{e}_k))}{\sum_{i=1}^{|\phi|} \exp(-d(c_i(S), \hat{e}_k))}$$
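
The max pooling, prototype, and softmax steps above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions: the distance d is taken to be squared Euclidean (the summary does not specify it), the 2-dimensional vectors stand in for BERT hidden states, and all function names are hypothetical.

```python
import math


def max_pool(token_vectors):
    """Element-wise max over the token vectors of one entity span."""
    return [max(dim) for dim in zip(*token_vectors)]


def prototype(entity_vectors):
    """Mean of the support-set entity representations of one type."""
    n = len(entity_vectors)
    return [sum(dim) / n for dim in zip(*entity_vectors)]


def sq_euclidean(u, v):
    # Assumed distance function d; the summary leaves d unspecified.
    return sum((a - b) ** 2 for a, b in zip(u, v))


def soft_probs(entity_vec, prototypes):
    """Softmax over negative distances to each type prototype."""
    scores = {t: -sq_euclidean(c, entity_vec) for t, c in prototypes.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {t: math.exp(s - top) for t, s in scores.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}


# Toy support set: two pooled entity vectors per type.
support = {
    "PER": [[1.0, 0.0], [0.8, 0.2]],
    "LOC": [[0.0, 1.0], [0.2, 0.8]],
}
prototypes = {t: prototype(vecs) for t, vecs in support.items()}

# A detected two-token span, pooled into one entity representation.
query = max_pool([[0.7, 0.1], [0.9, 0.0]])
probs = soft_probs(query, prototypes)
print(max(probs, key=probs.get))
```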

2. Finetuning Process

Finetunes the trained entity detection and classification models on target domain support set $S_{target}$, following the same pattern as the training process.

3. Inference Process

Comprises four stages:

  1. Constructs key-value data store $D_{knn}$ with entity representations as keys and corresponding labels as values
  2. Obtains entity spans using the entity detection model
  3. Inputs detected entity representations to both classification model and KNN module
  4. Joint prediction: $p(y|e'_k) = \lambda p_{knn}(y|e'_k) + (1-\lambda) p_{soft}(y|e'_k)$
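
The four inference steps can be sketched as follows for a single detected entity. The interpolation in `joint_probs` follows the joint prediction formula above; how $p_{knn}$ is computed from the retrieved neighbors is not given in this summary, so the distance-weighted vote in `knn_probs` (including its `temperature` parameter) is an assumption, and all names and vectors are hypothetical.

```python
import math
from collections import defaultdict


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def knn_probs(query, datastore, label_set, k=10, temperature=1.0):
    """Distance-weighted vote over the k nearest datastore keys (assumed form)."""
    neighbors = sorted((sq_dist(key, query), label) for key, label in datastore)[:k]
    nearest = neighbors[0][0]
    weights = defaultdict(float)
    for dist, label in neighbors:
        # Shift by the smallest distance so the largest weight is exp(0) = 1.
        weights[label] += math.exp(-(dist - nearest) / temperature)
    total = sum(weights.values())
    return {y: weights[y] / total for y in label_set}


def joint_probs(p_knn, p_soft, lam=0.1):
    """p(y) = lam * p_knn(y) + (1 - lam) * p_soft(y)."""
    return {y: lam * p_knn[y] + (1 - lam) * p_soft[y] for y in p_soft}


# Step 1: datastore of (entity representation, label) pairs from the support set.
datastore = [([0.9, 0.1], "PER"), ([1.0, 0.0], "PER"), ([0.1, 0.9], "LOC")]
labels = ["PER", "LOC"]

# Steps 2-3: a detected entity representation and a hypothetical classifier output.
entity_vec = [0.9, 0.1]
p_soft = {"PER": 0.6, "LOC": 0.4}
p_knn = knn_probs(entity_vec, datastore, labels, k=2)

# Step 4: interpolate the two distributions.
p = joint_probs(p_knn, p_soft, lam=0.1)
print(max(p, key=p.get))
```

With a small λ such as the 0.1 reported under Implementation Details, the parametric classifier dominates and the KNN term acts as a correction from the support set.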

Technical Innovations

  1. Two-stage Decomposition Strategy: Decomposes NER into span detection and classification subtasks, avoiding the complexity of enumerating all possible spans in traditional methods
  2. Entity-aware Contrastive Learning: Specially designed contrastive learning module enhances entity representations, improving aggregation of same-type entities and discrimination between different-type entities
  3. Hybrid Inference Mechanism: Combines parametric models with non-parametric KNN methods, fully leveraging support set information
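
For the contrastive objective in item 2, the following is a minimal plain-Python sketch of a supervised contrastive loss of the form given under the Entity Classification module. It assumes cosine similarity for sim, takes P(j) as the other entities with the same type as entity j and A(j) as all entities except j, and uses an arbitrary temperature; these are assumptions in the style of standard supervised contrastive learning, not confirmed details of the paper.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(z, labels, tau=0.1):
    """Sum over anchors j of the mean negative log-probability of their positives."""
    total = 0.0
    for j in range(len(z)):
        others = [a for a in range(len(z)) if a != j]  # A(j), assumed
        positives = [p for p in others if labels[p] == labels[j]]  # P(j), assumed
        if not positives:
            continue  # an anchor without positives contributes nothing
        logits = {a: cosine(z[j], z[a]) / tau for a in others}
        top = max(logits.values())
        log_denominator = top + math.log(
            sum(math.exp(v - top) for v in logits.values())
        )
        total += -sum(logits[p] - log_denominator for p in positives) / len(positives)
    return total


# Toy entity representations forming two clusters.
z = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned = contrastive_loss(z, ["PER", "PER", "LOC", "LOC"])
misaligned = contrastive_loss(z, ["PER", "LOC", "PER", "LOC"])
print(aligned < misaligned)
```

The loss is low when same-type entities are already close and different-type entities are far apart, and high when the labels cut across the clusters, which is the behavior the module is meant to encourage.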

Experimental Setup

Datasets

FewNERD Dataset:

  • Contains 8 coarse-grained and 66 fine-grained entity types
  • Evaluates both FewNERD-INTRA and FewNERD-INTER settings
  • Employs N-way K~2K-shot sampling for task construction
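
N-way K~2K-shot sampling means each of the N target types appears at least K and at most 2K times in the sampled set, since sentences can contain several entities. The sketch below is a simplified greedy sampler written for illustration; it is not the FewNERD implementation, and the function name, data layout, and toy pool are hypothetical.

```python
import random
from collections import Counter


def sample_support(sentences, target_types, k, seed=0):
    """Greedily pick sentences until every target type has at least k entities.

    sentences: list of (sentence_id, [entity type of each entity in the sentence]).
    A sentence is skipped if it has no entities, contains a non-target type,
    or would push any type above 2 * k. If the pool is too small, the returned
    counts may stay below k.
    """
    rng = random.Random(seed)
    pool = list(sentences)
    rng.shuffle(pool)
    counts = Counter()
    chosen = []
    for sent_id, types in pool:
        if all(counts[t] >= k for t in target_types):
            break
        if not types or any(t not in target_types for t in types):
            continue
        updated = counts + Counter(types)
        if any(updated[t] > 2 * k for t in target_types):
            continue
        chosen.append(sent_id)
        counts = updated
    return chosen, counts


# Toy pool: single-type sentences, one mixed sentence, one off-target sentence.
sentences = (
    [(i, ["PER"]) for i in range(4)]
    + [(4 + i, ["LOC"]) for i in range(4)]
    + [(8, ["PER", "LOC", "LOC"]), (9, ["ORG"])]
)
chosen, counts = sample_support(sentences, {"PER", "LOC"}, k=2)
print(dict(counts))
```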

FewAPTER Dataset:

  • Constructed from network security threat intelligence dataset APTER
  • Consolidates original 37 entity types into 21 classes with 28,250 entities total
  • Splits the 21 classes across training/validation/test sets in a 7:7:7 ratio
  • Constructs four settings: 4-way 1-shot, 4-way 3-shot, 6-way 1-shot, 6-way 3-shot

Evaluation Metrics

Uses F1 score as primary evaluation metric with standard deviation reported.
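
As a reference for how such a score can be computed, here is a minimal sketch of exact-match span-level F1 and of reporting mean and standard deviation across runs. Whether the paper uses exactly this matching criterion and aggregation is an assumption; the function name and all values below are made up for illustration.

```python
from statistics import mean, stdev


def span_f1(gold, pred):
    """F1 over (start, end, type) triples; a prediction counts only on exact match."""
    gold, pred = set(gold), set(pred)
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = {(0, 1, "PER"), (3, 3, "LOC")}
pred = {(0, 1, "PER"), (3, 3, "ORG")}  # right span, wrong type for the second entity
print(span_f1(gold, pred))

# Hypothetical F1 scores from three runs with different random seeds.
run_scores = [0.5, 0.6, 0.7]
print(f"{mean(run_scores):.3f} +/- {stdev(run_scores):.3f}")
```

Because a span with the wrong type counts as both a false positive and a false negative, errors in either stage of a two-stage pipeline lower the score.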

Baseline Methods

  • ProtoBERT: Token-level method based on BERT hidden state similarity
  • CONTAINER: Token-level contrastive learning method
  • NNShot/StructShot: Nearest neighbor-based methods
  • ESD: Span-level matching method
  • MAML-ProtoNet: Meta-learning combining MAML and prototypical networks
  • BDCP: Boundary discrimination and correlation purification method
  • ChatGPT: Large language model baseline

Implementation Details

  • Encoder: BERT-base
  • Optimizer: AdamW with learning rate 3e-5
  • Batch size: 32, maximum sequence length: 128
  • K=10 in KNN, λ=0.1
  • Trains for 1000 steps, selects best model on validation set

Experimental Results

Main Results

FewNERD Dataset:

  • Average F1 improvement of 2.65% on FewNERD-INTRA
  • Average F1 improvement of 4.44% on FewNERD-INTER
  • Significant improvements over previous best method MAML-ProtoNet

FewAPTER Dataset:

  • Average F1 score improvement of 11.42%
  • Outperforms ChatGPT in most settings

Comparison with ChatGPT:

  • Overall outperforms ChatGPT on FewNERD
  • Slightly underperforms ChatGPT on FewAPTER but with significantly faster inference speed

Ablation Studies

  1. Removing Contrastive Learning Module:
    • Average decrease of 0.905% on FewNERD
    • Average decrease of 0.745% on FewAPTER
  2. Removing KNN Module:
    • Average decrease of 0.524% on FewNERD
    • Average decrease of 0.635% on FewAPTER

Results demonstrate positive contributions from both modules.

Efficiency Analysis

MsFNER's inference time is significantly faster than ChatGPT across various settings, demonstrating superior efficiency consistent with Occam's Razor principle.

Experimental Findings

  1. K-shot Impact: Increasing K-shot samples significantly improves performance
  2. N-way Impact: Increasing N-way decreases performance, as expected
  3. Domain Adaptability: Model performs well on cross-domain tasks
  4. LLM Stability: ChatGPT performance is relatively stable with minimal impact from data and domain variations

Main Directions in Few-shot NER

  1. Token-level Methods: Such as ProtoBERT and CONTAINER, making predictions based on token similarity
  2. Span-level Methods: Such as ESD, treating entities as whole spans
  3. Meta-learning Methods: Such as MAML-ProtoNet, employing meta-learning frameworks for rapid task adaptation

Advantages of This Work

Compared to existing work, MsFNER effectively addresses computational complexity and negative sample issues through two-stage decomposition while introducing contrastive learning to enhance representation learning.

Conclusions and Discussion

Main Conclusions

  1. Effectiveness: MsFNER achieves state-of-the-art performance on multiple datasets, validating the effectiveness of the two-stage decomposition strategy
  2. Efficiency: Significantly reduces computational complexity compared to traditional span-level methods
  3. Generalizability: Performs well across different domains and settings

Limitations

  1. Domain Adaptation Constraints: Generalization ability in certain specific domains (e.g., FewAPTER) still has room for improvement
  2. Hyperparameter Sensitivity: Hyperparameters such as λ require adjustment for different datasets
  3. Computational Resources: Still requires pre-trained BERT models as foundation

Future Directions

  1. Enhanced Domain Adaptation: Explore better cross-domain transfer methods
  2. End-to-end Optimization: Investigate joint optimization strategies for both stages
  3. Larger-scale Evaluation: Validate method effectiveness across more domains and languages

In-depth Evaluation

Strengths

  1. Strong Methodological Innovation: The two-stage decomposition strategy is novel and effectively addresses core issues of existing methods
  2. Reasonable Technical Design: Entity-aware contrastive learning and hybrid inference mechanism are ingeniously designed
  3. Comprehensive Experiments: Thorough evaluation on multiple datasets including comparison with LLMs
  4. In-depth Analysis: Provides detailed ablation studies and efficiency analysis

Weaknesses

  1. Insufficient Theoretical Analysis: Lacks theoretical explanation for method effectiveness
  2. Missing Computational Complexity Analysis: While claiming reduced complexity, lacks quantitative analysis
  3. Absent Error Analysis: No in-depth analysis of model failure cases

Impact

  1. Academic Contribution: Provides new perspectives for solving few-shot NER
  2. Practical Value: Simple and effective method, easy to implement and deploy
  3. Reproducibility: Provides detailed implementation details and hyperparameter settings

Applicable Scenarios

  1. Resource-constrained Environments: More suitable than large language models for scenarios with limited computational resources
  2. Rapid Deployment Requirements: Can quickly adapt to new entity types
  3. Domain-specific Applications: Shows promising application prospects in vertical domains such as cybersecurity

References

The paper cites important works in related fields, including:

  • Few-shot learning foundational methods (Prototypical Networks, MAML)
  • Classical named entity recognition methods (BERT-based approaches)
  • Contrastive learning related work (Supervised Contrastive Learning)
  • Few-shot NER specialized methods (ProtoBERT, ESD, MAML-ProtoNet, etc.)

Overall Assessment: This is an excellent paper with solid technical contributions and comprehensive experiments. The authors' proposed two-stage decomposition strategy effectively addresses key issues of existing methods, achieving significant performance improvements on multiple datasets. The method design is reasonable with high practical value, providing valuable contributions to the few-shot NER field.