Hybrid Multi-stage Decoding for Few-shot NER with Entity-aware Contrastive Learning

Liu, Wang, Liu et al.

Basic Information

Paper ID: 2404.06970
Title: Hybrid Multi-stage Decoding for Few-shot NER with Entity-aware Contrastive Learning
Authors: Congying Liu, Gaosheng Wang, Peipei Liu, Xingyuan Wei, Hongsong Zhu
Category: cs.CL
Publication Date: April 2024 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2404.06970

Abstract

Few-shot named entity recognition can identify new types of named entities based on a few labeled examples. Previous methods employing token-level or span-level metric learning suffer from the computational burden and a large number of negative sample spans. In this paper, we propose the Hybrid Multi-stage Decoding for Few-shot NER with Entity-aware Contrastive Learning (MsFNER), which splits the general NER into two stages: entity-span detection and entity classification. There are 3 processes for introducing MsFNER: training, finetuning, and inference. In the training process, we train and get the best entity-span detection model and the entity classification model separately on the source domain using meta-learning, where we create a contrastive learning module to enhance entity representations for entity classification. During finetuning, we finetune both models on the support dataset of target domain. In the inference process, for the unlabeled data, we first detect the entity-spans, then the entity-spans are jointly determined by the entity classification model and the KNN. We conduct experiments on the open FewNERD dataset and the results demonstrate the advance of MsFNER.

Research Background and Motivation

Problem Definition

Few-shot Named Entity Recognition (Few-shot NER) aims to rapidly identify new types of named entities based on a limited number of annotated examples. This task is crucial for adapting to dynamically changing real-world application scenarios, particularly when models need to quickly adapt to new data or environmental changes.

Limitations of Existing Methods

Token-level Methods: Methods based on token distances to prototypes or support set tokens are intuitive and straightforward, but they incur high computational costs and cannot preserve the semantic integrity of multi-token entities, which makes them susceptible to interference from non-entity tokens.
Span-level Methods: Evaluating entire spans alleviates some issues of token-level methods, but enumerating all possible spans yields O(n²) candidates for a sentence of n tokens and introduces substantial noise from negative samples.
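To make the span-level cost concrete, the following minimal sketch enumerates every candidate span of a sentence and counts how many are negatives. The function name `enumerate_spans`, the example sentence, and the gold spans are illustrative assumptions and are not taken from the paper.

```python
def enumerate_spans(tokens, max_len=None):
    """Return all (start, end) spans with inclusive end, optionally capped at max_len tokens."""
    n = len(tokens)
    spans = []
    for start in range(n):
        last = n if max_len is None else min(n, start + max_len)
        for end in range(start, last):
            spans.append((start, end))
    return spans


sentence = "Barack Obama visited Paris last week".split()
spans = enumerate_spans(sentence)

# Hypothetical gold entity spans: "Barack Obama" and "Paris".
gold = {(0, 1), (3, 3)}
negatives = [s for s in spans if s not in gold]

print(len(spans), len(negatives))  # prints: 21 19
```

A sentence of n tokens has n(n+1)/2 spans, so at a maximum length of 128 tokens an uncapped enumeration produces 8,256 candidates, almost all of them negatives.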

Research Motivation

The authors aim to address two core problems:

How to improve few-shot NER recognition efficiency by enhancing semantic differences between entities and non-entities to determine effective entity spans
How to improve entity span classification by controlling and coordinating semantic distances among different entity types, bringing same-type entities closer while pushing different-type entities farther apart

Core Contributions

Proposed the MsFNER Framework: Decomposes traditional NER tasks into entity span detection and entity classification stages, effectively reducing computational complexity and mitigating negative sample effects
Designed Entity-aware Contrastive Learning Module: Enhances entity representation learning, improving consistency of same-type entities while increasing distances between different-type entities
Constructed Hybrid Inference Mechanism: Combines entity classification model and KNN method for joint prediction, improving classification accuracy
Achieved State-of-the-Art Performance: Significantly outperforms existing methods on FewNERD and FewAPTER datasets, with comprehensive comparison against ChatGPT

Methodology Details

Task Definition

Few-shot NER is defined as follows: the model first trains on the source domain dataset $D_{source} = (S_{source}, Q_{source})$, then transfers to the target domain dataset $D_{target} = (S_{target}, Q_{target})$ for inference. Here $S_{target}$ is the support set containing N entity types (N-way) with K annotated examples per type (K-shot); $Q_{target}$ is the query set containing the same entity types as the support set.

Model Architecture

MsFNER comprises three main processes:

1. Training Process

Entity Span Detection (ESD) Module:

Formulates entity span detection as a sequence labeling task using BIOES tagging scheme
For input sentence $x = (x_1, x_2, ..., x_n)$, obtains contextual representations $h = (h_1, h_2, ..., h_n)$ using BERT encoder
Performs entity span detection through CRF layer with training loss:

$L_{ESD} = -\sum \log P(y|x)$

where: $P(y|x) = \frac{\prod_{i=1}^{|x|} \phi_i(y_{i-1}, y_i, x)}{\sum_{y'} \prod_{i=1}^{|x|} \phi_i(y'_{i-1}, y'_i, x)}$

Employs MAML meta-learning with inner and outer loop updates
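The BIOES formulation used by the ESD module can be illustrated with a small encoder and decoder. This is a sketch under the assumption that the detection stage uses untyped tags (B, I, O, E, S without entity types), since typing is deferred to the classification stage. The helper names are hypothetical, and the BERT encoder, CRF layer, and MAML updates are not shown.

```python
def spans_to_bioes(n_tokens, spans):
    """Encode untyped entity spans (start, end inclusive) as BIOES tags."""
    tags = ["O"] * n_tokens
    for start, end in spans:
        if start == end:
            tags[start] = "S"          # single-token entity
        else:
            tags[start] = "B"          # beginning
            for i in range(start + 1, end):
                tags[i] = "I"          # inside
            tags[end] = "E"            # end
    return tags


def bioes_to_spans(tags):
    """Decode BIOES tags back into spans, dropping malformed fragments."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "S":
            spans.append((i, i))
            start = None
        elif tag == "B":
            start = i
        elif tag == "E":
            if start is not None:
                spans.append((start, i))
            start = None
        elif tag == "O":
            start = None
        # An "I" tag leaves the currently open span unchanged.
    return spans


tags = spans_to_bioes(6, [(0, 1), (3, 3)])
decoded = bioes_to_spans(tags)
```

Only the decoded spans are passed on to the classification stage, which is what avoids scoring every candidate span.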

Entity Classification (EC) Module:

For entity $e_k = (x_f, ..., x_{f+l})$, obtains representation using max pooling: $\hat{e}_k = \max(h_f, ..., h_{f+l})$
Introduces entity-aware contrastive learning with loss function: $L_{CL} = \sum_j -\frac{1}{|P(j)|} \sum_{p \in P(j)} \log \frac{\exp(\text{sim}(z_j, z_p)/\tau)}{\sum_{a \in A(j)} \exp(\text{sim}(z_j, z_a)/\tau)}$
Constructs prototype representations and performs classification: $c_t(S) = \frac{1}{|S_t|} \sum_{e_m \in S_t} \hat{e}_m$

$p_{soft}(y = t \mid e_k) = \frac{\exp(-d(c_t(S), \hat{e}_k))}{\sum_{i=1}^{|\phi|} \exp(-d(c_i(S), \hat{e}_k))}$
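The pooling, prototype, and softmax steps above can be sketched in a few lines of plain Python. The distance function $d$ is not specified in this summary, so squared Euclidean distance is used here as an assumption. The function names and the toy two-dimensional vectors are illustrative only.

```python
import math


def max_pool(token_vectors):
    """Element-wise max over the token vectors of one span."""
    return [max(dim) for dim in zip(*token_vectors)]


def build_prototypes(support):
    """Mean span representation per type; support maps type -> list of vectors."""
    return {
        t: [sum(dim) / len(vecs) for dim in zip(*vecs)]
        for t, vecs in support.items()
    }


def sq_euclidean(a, b):
    # Assumption: the summary does not say which distance d is used.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def p_soft(entity, protos, dist=sq_euclidean):
    """Softmax over negative distances from the entity to every prototype."""
    logits = {t: -dist(c, entity) for t, c in protos.items()}
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}


# Toy contextual vectors for a two-token span.
entity_vec = max_pool([[0.1, 0.9], [0.8, 0.2]])
protos = build_prototypes({
    "person": [[1.0, 0.0], [0.8, 0.2]],
    "location": [[0.0, 1.0], [0.2, 0.8]],
})
probs = p_soft(entity_vec, protos)
```

In this toy case the pooled vector is closer to the "location" prototype, so that type receives the larger probability.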

2. Finetuning Process

Finetunes the trained entity detection and classification models on the target domain support set $S_{target}$, following the same pattern as the training process.

3. Inference Process

Comprises four stages:

Constructs key-value data store $D_{knn}$ with entity representations as keys and corresponding labels as values
Obtains entity spans using the entity detection model
Inputs detected entity representations to both classification model and KNN module
Joint prediction: $p(y|e'_k) = \lambda p_{knn}(y|e'_k) + (1-\lambda) p_{soft}(y|e'_k)$
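The joint prediction can be sketched as a datastore lookup followed by a linear interpolation. How the KNN distribution $p_{knn}$ is computed is not described in this summary. The sketch assumes neighbours are weighted by the exponential of their negative squared distance, which is one common choice. The function names, the datastore contents, and the `p_soft` values are hypothetical.

```python
import math
from collections import defaultdict


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_probs(query, datastore, labels, k=10, temperature=1.0):
    """Label distribution from the k nearest datastore keys.

    Assumption: neighbours are weighted by exp(-distance / temperature).
    The datastore must be non-empty and `labels` must cover all of its values.
    """
    nearest = sorted((sq_dist(query, key), label) for key, label in datastore)[:k]
    d_min = nearest[0][0]  # shift distances for numerical stability
    weights = defaultdict(float)
    for dist, label in nearest:
        weights[label] += math.exp(-(dist - d_min) / temperature)
    total = sum(weights.values())
    return {label: weights[label] / total for label in labels}


def joint_predict(p_knn, p_soft, lam=0.1):
    """p(y|e) = lam * p_knn(y|e) + (1 - lam) * p_soft(y|e)."""
    return {y: lam * p_knn[y] + (1 - lam) * p_soft[y] for y in p_soft}


# Keys are entity representations, values are their labels.
datastore = [([0.9, 0.1], "person"), ([1.0, 0.0], "person"), ([0.1, 0.9], "location")]
labels = ["person", "location"]

p_knn = knn_probs([0.2, 0.8], datastore, labels, k=2)
p_soft = {"person": 0.3, "location": 0.7}  # hypothetical classifier output
p_joint = joint_predict(p_knn, p_soft, lam=0.1)
prediction = max(p_joint, key=p_joint.get)
```

With a small λ the parametric classifier dominates, and the KNN term acts as a correction drawn directly from the support set.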

Technical Innovations

Two-stage Decomposition Strategy: Decomposes NER into span detection and classification subtasks, avoiding the complexity of enumerating all possible spans in traditional methods
Entity-aware Contrastive Learning: Specially designed contrastive learning module enhances entity representations, improving aggregation of same-type entities and discrimination between different-type entities
Hybrid Inference Mechanism: Combines parametric models with non-parametric KNN methods, fully leveraging support set information
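The contrastive objective $L_{CL}$ given for the entity classification module can be written out directly. The sketch below follows the usual supervised contrastive reading of the formula, which is an assumption: $P(j)$ is taken to be the other entities with the same type as anchor $j$, $A(j)$ all entities except $j$, and sim cosine similarity. The function names and the toy vectors are illustrative.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(z, labels, tau=0.1):
    """Sum over anchors j of the mean negative log-probability of their positives."""
    loss = 0.0
    for j in range(len(z)):
        others = [a for a in range(len(z)) if a != j]               # A(j)
        positives = [p for p in others if labels[p] == labels[j]]   # P(j)
        if not positives:
            continue  # an anchor without positives contributes nothing
        sims = {a: cosine(z[j], z[a]) / tau for a in others}
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        loss += -sum(sims[p] - log_denom for p in positives) / len(positives)
    return loss


z = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_clustered = contrastive_loss(z, ["per", "per", "loc", "loc"])
loss_mixed = contrastive_loss(z, ["per", "loc", "per", "loc"])
```

When same-type entities already lie close together the loss is small; when the labels cut across the clusters it is much larger, which is the pressure that pulls same-type entities together and pushes different types apart.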

Experimental Setup

Datasets

FewNERD Dataset:

Contains 8 coarse-grained and 66 fine-grained entity types
Evaluates both FewNERD-INTRA and FewNERD-INTER settings
Employs N-way K~2K-shot sampling for task construction
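N-way K~2K-shot sampling means each of the N types appears at least K and at most 2K times in the sampled set, because sentences usually contain several entities and an exact K per type is hard to hit. The following is a simplified greedy sketch of that idea, not the benchmark's reference sampler. The function name, the data layout, and the choice to fix the target types up front are assumptions.

```python
import random
from collections import Counter


def sample_support(sentences, target_types, k, seed=0):
    """Greedily pick sentences until every target type occurs between k and 2k times.

    `sentences` is a list of (tokens, entity_types) pairs, where entity_types
    lists the type of each entity mention in the sentence.
    """
    rng = random.Random(seed)
    pool = list(sentences)
    rng.shuffle(pool)
    targets = set(target_types)
    counts = Counter()
    support = []
    for tokens, types in pool:
        if all(counts[t] >= k for t in targets):
            break
        if not types or not set(types) <= targets:
            continue  # no entity, or a type outside the N-way episode
        added = Counter(types)
        if any(counts[t] + c > 2 * k for t, c in added.items()):
            continue  # would push some type above 2K
        counts.update(added)
        support.append((tokens, types))
    if not all(counts[t] >= k for t in targets):
        raise ValueError("could not reach K mentions for every target type")
    return support, dict(counts)
```

A call such as `sample_support(data, ["person", "location"], k=2)` returns the chosen sentences together with the per-type mention counts, each of which lies between 2 and 4.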

FewAPTER Dataset:

Constructed from network security threat intelligence dataset APTER
Consolidates original 37 entity types into 21 classes with 28,250 entities total
Splits training/validation/test sets in 7:7:7 ratio
Constructs four settings: 4-way 1-shot, 4-way 3-shot, 6-way 1-shot, 6-way 3-shot

Evaluation Metrics

Uses F1 score as primary evaluation metric with standard deviation reported.
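For reference, a span-level F1 can be computed as below. The summary only states that F1 is reported, so exact-match micro-averaged scoring over (sentence, start, end, type) tuples is an assumption here, and the function name and example tuples are illustrative.

```python
def span_f1(gold, pred):
    """Micro precision, recall, and F1 over (sentence_id, start, end, type) tuples."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)  # a prediction counts only if boundaries and type both match
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


gold = [(0, 0, 1, "person"), (0, 3, 3, "location"), (1, 2, 4, "organization")]
pred = [(0, 0, 1, "person"), (0, 3, 3, "person"), (1, 2, 4, "organization")]
precision, recall, f1 = span_f1(gold, pred)
```

In this example the second prediction has the right boundaries but the wrong type, so two of three predictions count as correct.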

Baseline Methods

ProtoBERT: Token-level method based on BERT hidden state similarity
CONTAINER: Token-level contrastive learning method
NNShot/StructShot: Nearest neighbor-based methods
ESD: Span-level matching method
MAML-ProtoNet: Meta-learning combining MAML and prototypical networks
BDCP: Boundary discrimination and correlation purification method
ChatGPT: Large language model baseline

Implementation Details

Encoder: BERT-base
Optimizer: AdamW with learning rate 3e-5
Batch size: 32, maximum sequence length: 128
K=10 in KNN, λ=0.1
Trains for 1000 steps, selects best model on validation set
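The reported settings can be collected into a single configuration object for reproduction attempts. The values below are the ones listed above. The class name and field names are hypothetical, and settings not given in this summary (for example the contrastive temperature τ or the MAML inner-loop learning rate) are deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MsFNERConfig:
    # Field names are illustrative; values come from the list above.
    encoder: str = "BERT-base"      # exact checkpoint name is not given
    optimizer: str = "AdamW"
    learning_rate: float = 3e-5
    batch_size: int = 32
    max_seq_len: int = 128
    knn_k: int = 10                 # number of neighbours in KNN
    knn_lambda: float = 0.1         # weight of p_knn in the joint prediction
    train_steps: int = 1000


config = MsFNERConfig()
```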

Experimental Results

Main Results

FewNERD Dataset:

Average F1 improvement of 2.65% on FewNERD-INTRA
Average F1 improvement of 4.44% on FewNERD-INTER
Significant improvements over previous best method MAML-ProtoNet

FewAPTER Dataset:

Average F1 score improvement of 11.42%
Outperforms ChatGPT in most settings

Comparison with ChatGPT:

Overall outperforms ChatGPT on FewNERD
Slightly underperforms ChatGPT on FewAPTER but with significantly faster inference speed

Ablation Studies

Removing Contrastive Learning Module:

Average decrease of 0.905% on FewNERD
Average decrease of 0.745% on FewAPTER

Removing KNN Module:

Average decrease of 0.524% on FewNERD
Average decrease of 0.635% on FewAPTER

Results demonstrate positive contributions from both modules.

Efficiency Analysis

MsFNER's inference time is significantly shorter than ChatGPT's across the evaluated settings. This is in keeping with Occam's razor: the smaller, simpler model is the more efficient choice for this task.

Experimental Findings

K-shot Impact: Increasing K-shot samples significantly improves performance
N-way Impact: Increasing N-way decreases performance, as expected
Domain Adaptability: Model performs well on cross-domain tasks
LLM Stability: ChatGPT performance is relatively stable with minimal impact from data and domain variations

Main Directions in Few-shot NER

Token-level Methods: Such as ProtoBERT and CONTAINER, making predictions based on token similarity
Span-level Methods: Such as ESD, treating entities as whole spans
Meta-learning Methods: Such as MAML-ProtoNet, employing meta-learning frameworks for rapid task adaptation

Advantages of This Work

Compared to existing work, MsFNER effectively addresses computational complexity and negative sample issues through two-stage decomposition while introducing contrastive learning to enhance representation learning.

Conclusions and Discussion

Main Conclusions

Effectiveness: MsFNER achieves state-of-the-art performance on multiple datasets, validating the effectiveness of the two-stage decomposition strategy
Efficiency: Significantly reduces computational complexity compared to traditional span-level methods
Generalizability: Performs well across different domains and settings

Limitations

Domain Adaptation Constraints: Generalization ability in certain specific domains (e.g., FewAPTER) still has room for improvement
Hyperparameter Sensitivity: Hyperparameters such as λ require adjustment for different datasets
Computational Resources: Still requires pre-trained BERT models as foundation

Future Directions

Enhanced Domain Adaptation: Explore better cross-domain transfer methods
End-to-end Optimization: Investigate joint optimization strategies for both stages
Larger-scale Evaluation: Validate method effectiveness across more domains and languages

In-depth Evaluation

Strengths

Strong Methodological Innovation: The two-stage decomposition strategy is novel and effectively addresses core issues of existing methods
Reasonable Technical Design: Entity-aware contrastive learning and hybrid inference mechanism are ingeniously designed
Comprehensive Experiments: Thorough evaluation on multiple datasets including comparison with LLMs
In-depth Analysis: Provides detailed ablation studies and efficiency analysis

Weaknesses

Insufficient Theoretical Analysis: Lacks theoretical explanation for method effectiveness
Missing Computational Complexity Analysis: While claiming reduced complexity, lacks quantitative analysis
Absent Error Analysis: No in-depth analysis of model failure cases

Impact

Academic Contribution: Provides new perspectives for solving few-shot NER
Practical Value: Simple and effective method, easy to implement and deploy
Reproducibility: Provides detailed implementation details and hyperparameter settings

Applicable Scenarios

Resource-constrained Environments: More suitable than large language models for scenarios with limited computational resources
Rapid Deployment Requirements: Can quickly adapt to new entity types
Domain-specific Applications: Shows promising application prospects in vertical domains such as cybersecurity

References

The paper cites important works in related fields, including:

Few-shot learning foundational methods (Prototypical Networks, MAML)
Classical named entity recognition methods (BERT-based approaches)
Contrastive learning related work (Supervised Contrastive Learning)
Few-shot NER specialized methods (ProtoBERT, ESD, MAML-ProtoNet, etc.)

Overall Assessment: This is an excellent paper with solid technical contributions and comprehensive experiments. The authors' proposed two-stage decomposition strategy effectively addresses key issues of existing methods, achieving significant performance improvements on multiple datasets. The method design is reasonable with high practical value, providing valuable contributions to the few-shot NER field.