Thunder-DeID: Accurate and Efficient De-identification Framework for Korean Court Judgments
Hahm, Kim, Lee et al.
To ensure a balance between open access to justice and personal data protection, the South Korean judiciary mandates the de-identification of court judgments before they can be publicly disclosed. However, the current de-identification process is inadequate for handling court judgments at scale while adhering to strict legal requirements. Additionally, the legal definitions and categorizations of personal identifiers are vague and not well-suited for technical solutions. To tackle these challenges, we propose a de-identification framework called Thunder-DeID, which aligns with relevant laws and practices. Specifically, we (i) construct and release the first Korean legal dataset containing annotated judgments along with corresponding lists of entity mentions, (ii) introduce a systematic categorization of Personally Identifiable Information (PII), and (iii) develop an end-to-end deep neural network (DNN)-based de-identification pipeline. Our experimental results demonstrate that our model achieves state-of-the-art performance in the de-identification of court judgments.
To balance judicial transparency with personal data protection, the Korean judicial system requires de-identification of court judgments before public disclosure. However, current de-identification workflows fall short in handling large-scale court judgments while strictly adhering to legal requirements. Additionally, the legal definition and classification of personal identifiers are ambiguous and unsuitable for technical solutions. To address these challenges, this paper proposes the Thunder-DeID de-identification framework, which aligns with relevant laws, regulations, and practices. Specifically, the paper (i) constructs and releases the first Korean legal dataset containing annotated judgments and corresponding entity mention lists, (ii) introduces a systematic classification scheme for personally identifiable information (PII), and (iii) develops an end-to-end deep neural network (DNN) de-identification pipeline. Experimental results demonstrate that the model achieves state-of-the-art performance on the court judgment de-identification task.
This research addresses three core challenges in Korean court judgment de-identification:
Efficiency Bottleneck: Over-reliance on manual methods creates administrative burden and delays judgment publication, leaving court judgments in Korea with very limited public accessibility
Poor Technical Performance: Between 2019 and 2025, existing automated de-identification tools achieved only 8-15% overall accuracy
Ambiguous Legal Definitions: Current legal definitions and classifications of personal identifiers are vague and poorly suited to automated technical solutions
Judicial transparency is an important democratic principle enshrined in the constitutions of many countries, including Korea. In court contexts, Korea requires anonymization of a broader range of personal identifiers, under stricter conditions. Effective de-identification technology is therefore crucial for balancing judicial transparency and privacy protection.
First Korean Legal Dataset: Builds a two-part dataset consisting of 6,700 annotated judgments (covering civil, criminal, and administrative cases) and 48,306 named entities
Three-tier PII Classification Framework: Based on inductive analysis of the 48,306 named entities, proposes a systematic classification scheme for personally identifiable information
Specialized Tokenizer: Integrates the morphological analyzer Mecab-ko with Byte Pair Encoding (BPE), leveraging the unique characteristics of the Korean language
End-to-end DNN Pipeline: Develops a complete de-identification framework achieving best performance on court judgment de-identification tasks
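The tokenizer idea in the list above (morphological pre-segmentation followed by BPE) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's tokenizer: `split_morphemes` is a hypothetical stand-in for a real analyzer such as Mecab-ko (it only strips a handful of particles), and the BPE trainer is a toy version whose merges never cross a morpheme boundary.

```python
from collections import Counter

# Hypothetical stand-in for a morphological analyzer such as Mecab-ko.
# It only strips a few common particles from the end of each word;
# a real analyzer does far more than this.
PARTICLES = ("은", "는", "이", "가", "을", "를", "에게", "에서")


def split_morphemes(text):
    """Split whitespace-separated words into a stem and a trailing particle."""
    morphemes = []
    for word in text.split():
        # Try longer particles first so "에게" wins over shorter suffixes.
        for p in sorted(PARTICLES, key=len, reverse=True):
            if word.endswith(p) and len(word) > len(p):
                morphemes.extend([word[:-len(p)], p])
                break
        else:
            morphemes.append(word)
    return morphemes


def merge_pair(symbols, pair):
    """Merge every adjacent occurrence of `pair` inside one symbol sequence."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(morphemes, num_merges):
    """Learn BPE merges; pairs are counted within morphemes only."""
    vocab = Counter(tuple(m) for m in morphemes)  # symbol sequence -> count
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        new_vocab = Counter()
        for symbols, count in vocab.items():
            new_vocab[merge_pair(symbols, best)] += count
        vocab = new_vocab
    return merges


def encode(text, merges):
    """Tokenize text: morpheme split first, then apply merges in order."""
    tokens = []
    for morpheme in split_morphemes(text):
        symbols = tuple(morpheme)
        for pair in merges:
            symbols = merge_pair(symbols, pair)
        tokens.extend(symbols)
    return tokens
```

Because merges are learned and applied per morpheme, a stem such as 피고인 can become a single token while the attached particle stays separate, which is the property a morphology-aware tokenizer is meant to provide.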
Input: Original Korean court judgment text containing personally identifiable information
Output: De-identified judgment text where sensitive information is appropriately replaced or removed
Constraints: Must comply with relevant Korean laws and regulations (e.g., Article 59-3 of the Korean Criminal Procedure Act, Article 163-2 of the Civil Procedure Act, etc.)
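The input/output contract above can be illustrated with a small sketch. It assumes the PII spans have already been detected (for example by a token-classification model); the `Span` type, the label names, and the `[LABEL_n]` placeholder format are hypothetical and are not taken from the paper or from Korean court practice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str  # hypothetical PII category, e.g. "NAME"


def deidentify(text, spans):
    """Replace each detected span with a placeholder.

    The same surface string under the same label always maps to the same
    placeholder, so references stay consistent across the judgment.
    """
    placeholders = {}  # (label, surface) -> placeholder string
    counters = {}      # label -> number of distinct entities seen so far
    out, cursor = [], 0
    for span in sorted(spans, key=lambda s: s.start):
        if span.start < cursor:
            raise ValueError("overlapping spans are not supported")
        surface = text[span.start:span.end]
        key = (span.label, surface)
        if key not in placeholders:
            counters[span.label] = counters.get(span.label, 0) + 1
            placeholders[key] = f"[{span.label}_{counters[span.label]}]"
        out.append(text[cursor:span.start])  # untouched text before the span
        out.append(placeholders[key])
        cursor = span.end
    out.append(text[cursor:])  # untouched tail
    return "".join(out)


text = "Kim Minsu sued Lee Jiho. Kim Minsu lives in Seoul."
spans = [Span(0, 9, "NAME"), Span(15, 23, "NAME"), Span(25, 34, "NAME")]
print(deidentify(text, spans))
# prints: [NAME_1] sued [NAME_2]. [NAME_1] lives in Seoul.
```

Keeping detection and replacement separate means the replacement policy (placeholder, removal, or generalization) can be adjusted to the applicable legal requirement without retraining the detector.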
Significant Performance Improvement: Thunder-DeID outperforms baseline models across all scales
Per-Epoch Advantage: The per-epoch replacement strategy significantly outperforms single replacement across all models
Scale Effect: Even the smallest Thunder-DeID-370M surpasses larger baseline models on token-level metrics
Practical Breakthrough: Compared to the existing 8-15% accuracy of the Korean National Court Administration system, this represents a substantial improvement
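The results above refer to token-level metrics. As a rough illustration of what such a metric measures, the sketch below computes micro-averaged token-level precision, recall, and F1 over aligned label sequences. The label names and the `"O"` convention for non-PII tokens are assumptions, and the paper's exact evaluation protocol may differ.

```python
def token_level_prf(gold, pred, outside="O"):
    """Micro-averaged token-level precision, recall, and F1.

    A token counts as a true positive only when the predicted label equals
    the gold label and that label is not the `outside` (non-PII) label.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    true_pos = sum(1 for g, p in zip(gold, pred) if g == p and g != outside)
    pred_pos = sum(1 for p in pred if p != outside)
    gold_pos = sum(1 for g in gold if g != outside)
    precision = true_pos / pred_pos if pred_pos else 0.0
    recall = true_pos / gold_pos if gold_pos else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = ["NAME", "NAME", "O", "ADDRESS", "O"]
pred = ["NAME", "O", "O", "ADDRESS", "ADDRESS"]
precision, recall, f1 = token_level_prf(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

For de-identification, recall is usually the more safety-critical number, since a missed identifier leaks personal data while a false positive only over-redacts.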
This paper cites multiple important related works, including:
Classical medical de-identification work (Uzuner et al., 2007; Liu et al., 2017)
Legal text de-identification research across countries (Niklaus et al., 2023; Salierno et al., 2024)
Korean NLP foundational work (Park et al., 2020; Ko et al., 2023)
Relevant laws, regulations, and policy documents
Overall Assessment: This is a high-quality application-oriented research paper that not only demonstrates technical innovation but, more importantly, addresses real social problems. The paper balances engineering value and academic value, making significant contributions to the legal NLP field. Despite some limitations, the work is outstanding and deserves attention.