Thunder-DeID: Accurate and Efficient De-identification Framework for Korean Court Judgments

Hahm, Kim, Lee et al.
To ensure a balance between open access to justice and personal data protection, the South Korean judiciary mandates the de-identification of court judgments before they can be publicly disclosed. However, the current de-identification process is inadequate for handling court judgments at scale while adhering to strict legal requirements. Additionally, the legal definitions and categorizations of personal identifiers are vague and not well-suited for technical solutions. To tackle these challenges, we propose a de-identification framework called Thunder-DeID, which aligns with relevant laws and practices. Specifically, we (i) construct and release the first Korean legal dataset containing annotated judgments along with corresponding lists of entity mentions, (ii) introduce a systematic categorization of Personally Identifiable Information (PII), and (iii) develop an end-to-end deep neural network (DNN)-based de-identification pipeline. Our experimental results demonstrate that our model achieves state-of-the-art performance in the de-identification of court judgments.

Basic Information

  • Paper ID: 2506.15266
  • Title: Thunder-DeID: Accurate and Efficient De-identification Framework for Korean Court Judgments
  • Authors: Sungeun Hahm, Heejin Kim, Gyuseong Lee, Hyunji M. Park, Jaejin Lee (Seoul National University)
  • Classification: cs.CL (Computational Linguistics)
  • Publication Date: October 16, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2506.15266v3

Abstract

To balance judicial transparency with personal data protection, the Korean judicial system requires de-identification of court judgments before public disclosure. However, current de-identification workflows fall short in handling large-scale court judgments while strictly adhering to legal requirements. Additionally, the legal definition and classification of personal identifiers are ambiguous and unsuitable for technical solutions. To address these challenges, this paper proposes the Thunder-DeID de-identification framework, which aligns with relevant laws, regulations, and practices. Specifically, the paper (i) constructs and releases the first Korean legal dataset containing annotated judgments and corresponding entity mention lists, (ii) introduces a systematic classification scheme for personally identifiable information (PII), and (iii) develops an end-to-end deep neural network (DNN) de-identification pipeline. Experimental results demonstrate that the model achieves state-of-the-art performance on the court judgment de-identification task.

Research Background and Motivation

Problem Definition

This research addresses three core challenges in Korean court judgment de-identification:

  1. Efficiency Bottleneck: Heavy reliance on manual de-identification creates administrative burden and delays publication, leaving public access to court judgments in Korea very limited
  2. Poor Technical Performance: Between 2019 and 2025, existing automated de-identification tools achieved only 8-15% overall accuracy
  3. Ambiguous Legal Definitions: Current legal definitions and classifications of personal identifiers are vague and poorly suited to automated technical solutions

Research Significance

Judicial transparency is an important democratic principle enshrined in the constitutions of many countries, including Korea. Korea requires a broader range of personal identifiers to be anonymized in court documents, and under stricter conditions. Effective de-identification technology is therefore crucial for balancing judicial transparency and privacy protection.

Limitations of Existing Approaches

  • Prompt-based LLM Methods: Alter original sentence structure, risking sentence and contextual distortion
  • API Restrictions: Korean government agencies restrict the use of services like ChatGPT due to privacy and information security concerns
  • Insufficient Large-scale Processing Capability: Existing methods cannot effectively handle large-scale court judgments

Core Contributions

  1. First Korean Legal Dataset: Creates a two-part dataset consisting of 6,700 annotated judgments (covering civil, criminal, and administrative cases) and the corresponding entity mention lists, with 48,306 named entities
  2. Three-tier PII Classification Framework: Based on inductive analysis of the 48,306 named entities, proposes a systematic classification scheme for personally identifiable information
  3. Specialized Tokenizer: Integrates the morphological analyzer Mecab-ko with Byte Pair Encoding (BPE), leveraging unique characteristics of Korean language
  4. End-to-end DNN Pipeline: Develops a complete de-identification framework achieving best performance on court judgment de-identification tasks

Methodology

Task Definition

  • Input: Original Korean court judgment text containing personally identifiable information
  • Output: De-identified judgment text where sensitive information is appropriately replaced or removed
  • Constraints: Must comply with relevant Korean laws and regulations (e.g., Article 59-3 of the Korean Criminal Procedure Act, Article 163-2 of the Civil Procedure Act, etc.)
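
The input/output contract above can be sketched as a token-classification step followed by in-place replacement. This is a minimal illustration, not the paper's implementation: it assumes the model emits one label per token, that non-PII tokens carry the label "O", and that the label names and the bracketed placeholder format are hypothetical.

```python
from typing import List

def deidentify(tokens: List[str], labels: List[str], placeholder: str = "[{}]") -> str:
    """Replace each run of identically labelled PII tokens with one placeholder.

    Tokens labelled "O" are copied unchanged, so the surrounding sentence
    structure is preserved. Adjacent entities that share a label collapse
    into a single placeholder in this simplified version.
    """
    out: List[str] = []
    prev = "O"
    for token, label in zip(tokens, labels):
        if label == "O":
            out.append(token)
        elif label != prev:
            out.append(placeholder.format(label))
        prev = label
    return " ".join(out)

# Hypothetical tokens and labels, for illustration only
tokens = ["Plaintiff", "Hong", "Gildong", "resides", "in", "Seoul"]
labels = ["O", "NAME", "NAME", "O", "O", "CITY"]
result = deidentify(tokens, labels)
print(result)  # Plaintiff [NAME] resides in [CITY]
```

Because only labelled tokens are touched, this style of de-identification avoids the sentence rewriting that the summary lists as a risk of prompt-based LLM methods.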

Model Architecture

1. Data Construction Pipeline

Anonymized Judgments → Placeholder Detection and Annotation → PII Classification Scheme → Replacement List Generation → Training Data Generation

2. Thunder-DeID Model Family

Based on DeBERTa-v3 architecture, comprising three model sizes:

  • Thunder-DeID-370M: 370 million parameters, hidden dimension 1024, 24 Transformer layers
  • Thunder-DeID-800M: 800 million parameters, hidden dimension 1280, 36 Transformer layers
  • Thunder-DeID-1.5B: 1.5 billion parameters, hidden dimension 2048, 24 Transformer layers
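
As a rough sanity check on the sizes listed above, the parameter count of the Transformer blocks can be approximated from the hidden dimension and layer count. The sketch below uses the common 12·L·h² rule of thumb, which assumes a feed-forward expansion factor of 4 (not stated in this summary) and ignores embeddings, biases, layer norms, and DeBERTa's relative-position parameters.

```python
def approx_block_params(hidden: int, layers: int) -> int:
    """Approximate parameters in the Transformer blocks only.

    Per layer: 4*h^2 for the attention projections (Q, K, V, output)
    plus 8*h^2 for a feed-forward block with an assumed 4x expansion.
    """
    return 12 * layers * hidden * hidden

# (hidden dimension, number of layers) as listed in the summary
configs = {
    "Thunder-DeID-370M": (1024, 24),
    "Thunder-DeID-800M": (1280, 36),
    "Thunder-DeID-1.5B": (2048, 24),
}

estimates = {name: approx_block_params(h, n) for name, (h, n) in configs.items()}
for name, count in estimates.items():
    print(f"{name}: ~{count / 1e6:.0f}M parameters in Transformer blocks")
```

The estimates (roughly 302M, 708M, and 1,208M) fall below the nominal sizes, which is what one would expect if the remainder comes from the embedding matrix and the components this approximation leaves out.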

3. Tokenization Strategy

Integrates Mecab-ko morphological analyzer with BPE:

  • Mecab-ko: Handles Korean agglutinative morphology, accurately separating word roots and particles
  • BPE: Addresses out-of-vocabulary (OOV) issues, representing unseen words as subword units
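
The two-stage idea (split into morphemes first, then apply BPE inside each morpheme) can be illustrated with a toy example. The suffix-stripping function below is only a stand-in for Mecab-ko, which uses a dictionary and a statistical model rather than a fixed particle list, and the merge table is hypothetical.

```python
from typing import List, Tuple

# Hypothetical particle list; a naive stand-in for Mecab-ko's analysis
PARTICLES = ("이", "가", "은", "는", "을", "를")

def toy_morpheme_split(word: str) -> List[str]:
    """Split a trailing particle off a word (toy rule, not Mecab-ko)."""
    for particle in PARTICLES:
        if len(word) > len(particle) and word.endswith(particle):
            return [word[: -len(particle)], particle]
    return [word]

def apply_bpe(morpheme: str, merges: List[Tuple[str, str]]) -> List[str]:
    """Apply learned BPE merges, in priority order, to one morpheme."""
    symbols = list(morpheme)
    for left, right in merges:
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

def tokenize(word: str, merges: List[Tuple[str, str]]) -> List[str]:
    """Morpheme split first, then BPE within each morpheme."""
    pieces: List[str] = []
    for morpheme in toy_morpheme_split(word):
        pieces.extend(apply_bpe(morpheme, merges))
    return pieces

# Hypothetical merge table in which "홍길동" was learned as a unit
MERGES = [("홍", "길"), ("홍길", "동")]
print(tokenize("홍길동이", MERGES))  # ['홍길동', '이']
print(tokenize("김철수가", MERGES))  # ['김', '철', '수', '가']
```

Because merges never cross a morpheme boundary, the particle always ends up in its own token, and an unseen name simply falls back to smaller subword units instead of becoming out-of-vocabulary.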

4. Training Data Generation Algorithm

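# Runnable sketch of the algorithm. The <TYPE>...</TYPE> marker format and the
# whitespace tokenization are illustrative assumptions, not the paper's format.
import random
import re

MARKER = re.compile(r"<(\w+)>.*?</\1>")

def generate_training_data(annotated_text, replacement_lists, rng=random):
    tokens, labels, pos = [], [], 0
    # 1. Identify special marker pairs, scanning the text left to right
    for match in MARKER.finditer(annotated_text):
        # Text outside the markers is kept as-is and labeled "O"
        outside = annotated_text[pos:match.start()].split()
        tokens += outside
        labels += ["O"] * len(outside)
        # 2. Replace the placeholder with a sampled mention of its entity type
        entity_type = match.group(1)
        replacement = rng.choice(replacement_lists[entity_type]).split()
        tokens += replacement
        # 3. Generate the label sequence for the replacement tokens
        labels += [entity_type] * len(replacement)
        pos = match.end()
    tail = annotated_text[pos:].split()
    return tokens + tail, labels + ["O"] * len(tail)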

Technical Innovations

  1. Three-tier PII Classification System:
    • First tier: Direct identifiers vs. quasi-identifiers
    • Second tier: 16 subcategories (e.g., person names, geographic information, organizations, etc.)
    • Third tier: 80 fine-grained categories, corresponding to 729 labels
  2. Korean-specific Tokenization:
    • Utilizes Mecab-ko to precisely separate "홍길동이" into "홍길동" + "이"
    • Ensures only target entities are de-identified while preserving particle integrity
  3. Data Augmentation Strategy:
    • Per-Epoch Replacement: Replaces different entity mentions each epoch, increasing data diversity
    • Single Replacement: Fixed replacement, serving as a baseline for comparison

Experimental Setup

Dataset

  • Scale: 6,700 judgments (3,000 civil, 3,000 criminal, 700 administrative)
  • Entity Count: 48,306 annotated entities
  • Data Sources: Korean Government Legislative Department, AI-hub, public datasets
  • Split Ratio: 80% training, 10% validation, 10% testing
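
For 6,700 judgments, an 80/10/10 split yields 5,360 training, 670 validation, and 670 test documents. A minimal sketch of such a split is shown below; the shuffling, the seed, and the function name are assumptions, and the summary does not say whether the actual split was stratified by case type.

```python
import random
from typing import List, Tuple

def split_indices(n_items: int, seed: int = 42) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle document indices and split them 80/10/10."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_train = n_items * 80 // 100  # integer arithmetic avoids float rounding
    n_val = n_items * 10 // 100
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test

train, val, test = split_indices(6700)
print(len(train), len(val), len(test))  # 5360 670 670
```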

Evaluation Metrics

  1. Binary Token-level: Measures the model's ability to identify tokens requiring de-identification
  2. Token-level: Measures the model's accuracy in classifying specific entity types
  3. Metrics: Precision, Recall, F1-score
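
The two evaluation levels can be made concrete with a small example. The sketch below assumes one label per token and "O" for non-PII tokens; the label names are hypothetical, and the paper's exact averaging details may differ from this micro-averaged version.

```python
from typing import List, Tuple

def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def binary_token_prf(gold: List[str], pred: List[str], outside: str = "O"):
    """Did the model flag the token as PII at all, regardless of type?"""
    pairs = list(zip(gold, pred))
    tp = sum(g != outside and p != outside for g, p in pairs)
    fp = sum(g == outside and p != outside for g, p in pairs)
    fn = sum(g != outside and p == outside for g, p in pairs)
    return prf(tp, fp, fn)

def micro_token_prf(gold: List[str], pred: List[str], outside: str = "O"):
    """Did the model assign the exact entity type? Counts pooled over all types."""
    pairs = list(zip(gold, pred))
    tp = sum(g == p and g != outside for g, p in pairs)
    fp = sum(p != outside and p != g for g, p in pairs)
    fn = sum(g != outside and p != g for g, p in pairs)
    return prf(tp, fp, fn)

gold = ["O", "NAME", "NAME", "O", "CITY", "O"]
pred = ["O", "NAME", "ORG", "O", "CITY", "CITY"]
binary = binary_token_prf(gold, pred)
micro = micro_token_prf(gold, pred)
print("binary:", binary)
print("micro: ", micro)
```

In this example the third token is flagged as PII but given the wrong type, so it counts as correct at the binary level and as an error at the token level. This is why token-level scores are generally the lower of the two, as in the results table.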

Baseline Methods

  • Polyglot-Ko (1.3B parameters): Korean-specific language model
  • EXAONE-3.5 (2.4B parameters): Korean-specific decoder model

Implementation Details

  • Pre-training Corpus: 76.7GB bilingual corpus (Korean + English)
  • Sequence Length: 512 → 2048 tokens
  • Optimizer: AdamW, β=(0.9, 0.999)
  • Learning Rate Schedule: 10% warmup steps + cosine decay
  • Hardware: 32×NVIDIA H100 80GB GPUs
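
The learning rate schedule (10% warmup followed by cosine decay) can be sketched as a pure function of the step index. The linear warmup shape, the peak learning rate, the total step count, and the decay floor of zero are assumptions, since the summary does not state them.

```python
import math

def lr_at(step: int, total_steps: int, peak_lr: float,
          warmup_frac: float = 0.1, min_lr: float = 0.0) -> float:
    """Linear warmup over the first warmup_frac of steps, then cosine decay."""
    warmup_steps = round(total_steps * warmup_frac)
    if step < warmup_steps:
        # Ramp linearly so the last warmup step reaches peak_lr
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Hypothetical values for illustration
TOTAL_STEPS = 1000
PEAK_LR = 1e-4
for step in (0, 99, 100, 550, 1000):
    print(step, f"{lr_at(step, TOTAL_STEPS, PEAK_LR):.2e}")
```

In a PyTorch training loop, a function like this would typically be wrapped in a scheduler and paired with AdamW using the β values listed above.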

Experimental Results

Main Results

Model               Parameters   Binary Token-level F1   Token-level Micro F1
Polyglot-ko         1.3B         0.9701                  0.8765
EXAONE              2.4B         0.9677                  0.8752
Thunder-DeID-370M   370M         0.9654                  0.8871
Thunder-DeID-800M   800M         0.9791                  0.9105
Thunder-DeID-1.5B   1.5B         0.9808                  0.9071

Key Findings

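  1. Significant Performance Improvement: On token-level micro F1, every Thunder-DeID model outperforms both baselines. On binary token-level F1, the 800M and 1.5B models outperform both baselines, while the 370M model (0.9654) falls slightly below them (0.9701 and 0.9677)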
  2. Per-Epoch Advantage: Per-Epoch replacement strategy significantly outperforms Single replacement across all models
  3. Scale Effect: Even the smallest Thunder-DeID-370M surpasses larger baseline models on token-level metrics
  4. Practical Breakthrough: Compared to the existing 8-15% accuracy of the Korean National Court Administration system, this represents a substantial improvement

Error Analysis

The model shows weaknesses in recognizing low-frequency labels:

  • Frequently misclassifies "뷔페(buffet restaurant)" as "기계설비회사(machinery equipment company)"
  • Exhibits confusion between "불특정제품명(unspecified product name)" and "불특정회사명(unspecified company name)"

Related Work

Medical De-identification

  • HIPAA Guidance: Safe Harbor method and expert determination
  • Technical Evolution: Rule systems → BiLSTM-CRF → BERT → LLM
  • Limitations: HIPAA regulations restrict practical LLM deployment

Court Judgment De-identification

Reported performance of prior court judgment de-identification work in other languages:

  • Arabic: F1=96.14%
  • German/French/Italian: F1=92.40%
  • Spanish: F1=91.90%
  • Hindi: F1=91.10%
  • Italian: F1=88.60%

This work fills the gap in Korean legal text de-identification.

Conclusions and Discussion

Main Conclusions

  1. Thunder-DeID successfully addresses technical challenges in Korean court judgment de-identification
  2. The three-tier PII classification scheme provides a systematic framework for legal text de-identification
  3. Korean-specific tokenization and data augmentation strategies significantly enhance model performance
  4. Achieves state-of-the-art performance on this task with practical deployment potential

Limitations

  1. Data Constraints: Due to legal restrictions, original unredacted judgments cannot be obtained for real-world evaluation
  2. Domain Limitations: The model is specifically trained on civil, criminal, and administrative law; generalization to other legal domains is unknown
  3. Context Sensitivity: Legal de-identification is highly context-dependent; model performance may degrade across different legal dispute types

Future Directions

  1. Synthetic Data Generation: Develop synthetic data augmentation methods more closely resembling real court judgments
  2. Cross-domain Adaptation: Evaluate and improve model performance across different legal domains
  3. Practical Deployment: Collaborate with Korean judicial institutions for real-world deployment testing

In-depth Evaluation

Strengths

  1. Significant Practical Value: Addresses real pain points in the Korean judicial system with direct social value
  2. Technical Innovation: Korean-specific tokenization, three-tier PII classification, and data augmentation strategies all demonstrate innovation
  3. Comprehensive Experiments: Thorough ablation studies, multiple baseline comparisons, and detailed error analysis
  4. Dataset Contribution: First Korean legal de-identification dataset, advancing the field
  5. Legal Compliance: Strictly adheres to relevant Korean laws and regulations, ensuring practical applicability

Shortcomings

  1. Evaluation Limitations: The model cannot be validated on original, unredacted judgments, so a domain gap between the training data and real-world inputs may remain
  2. Reproducibility: Some implementation details (e.g., specific replacement list construction) lack sufficient description
  3. Computational Cost: Requires large-scale GPU resources, potentially limiting practical application
  4. Generalization Ability: Applicability to languages other than Korean remains unknown

Impact

  1. Academic Contribution: Provides new benchmarks and methods for legal NLP and de-identification research
  2. Practical Value: Promises to significantly improve efficiency and transparency in the Korean judicial system
  3. International Reference: Provides a reference framework for legal text de-identification in other countries
  4. Technology Advancement: Important progress in Korean NLP technology

Application Scenarios

  1. Judicial Institutions: Automated de-identification of court judgments
  2. Legal Research: Large-scale legal text analysis and research
  3. Government Departments: Other public services requiring text de-identification
  4. Academic Research: Related research in legal NLP and privacy protection

References

This paper cites multiple important related works, including:

  • Classical medical de-identification work (Uzuner et al., 2007; Liu et al., 2017)
  • Legal text de-identification research across countries (Niklaus et al., 2023; Salierno et al., 2024)
  • Korean NLP foundational work (Park et al., 2020; Ko et al., 2023)
  • Relevant laws, regulations, and policy documents

Overall Assessment: This is a high-quality application-oriented research paper that not only demonstrates technical innovation but, more importantly, addresses real social problems. The paper balances engineering value and academic value, making significant contributions to the legal NLP field. Despite some limitations, the work is outstanding and deserves attention.