Confidence Calibration in Large Language Model-Based Entity Matching

Kamsteeg, Cardenas-Cartagena, van Beers et al.

This research aims to explore the intersection of Large Language Models and confidence calibration in Entity Matching. To this end, we perform an empirical study to compare baseline RoBERTa confidences for an Entity Matching task against confidences that are calibrated using Temperature Scaling, Monte Carlo Dropout and Ensembles. We use the Abt-Buy, DBLP-ACM, iTunes-Amazon and Company datasets. The findings indicate that the proposed modified RoBERTa model exhibits a slight overconfidence, with Expected Calibration Error scores ranging from 0.0043 to 0.0552 across datasets. We find that this overconfidence can be mitigated using Temperature Scaling, reducing Expected Calibration Error scores by up to 23.83%.

Basic Information

Paper ID: 2509.19557
Title: Confidence Calibration in Large Language Model-Based Entity Matching
Authors: Iris Kamsteeg, Juan Cardenas-Cartagena, Floris van Beers, Gineke ten Holt, Tsegaye Misikir Tashu, Matias Valdenegro-Toro
Classification: cs.CL cs.LG
Publication Date: October 15, 2025 (arXiv v2)
Institutions: Bernoulli Institute, University of Groningen, The Netherlands; Independent Researcher
Paper Link: https://arxiv.org/abs/2509.19557

Abstract

This study explores the intersection of large language models and confidence calibration in entity matching. In an empirical investigation, it compares baseline RoBERTa confidence scores on an entity matching task with confidence scores calibrated using temperature scaling, Monte Carlo Dropout, and ensemble methods. Experiments are conducted on the Abt-Buy, DBLP-ACM, iTunes-Amazon, and Company datasets. Results show that the modified RoBERTa model exhibits slight overconfidence, with Expected Calibration Error (ECE) ranging from 0.0043 to 0.0552 across datasets. The study finds that temperature scaling mitigates this overconfidence, reducing ECE scores by up to 23.83%.

Research Background and Motivation

Problem Definition

Entity Matching (EM) is a critical subtask of entity resolution that determines whether data entries from different sources refer to the same real-world entity. It is a binary classification problem: each entity pair is labeled either a "match" or a "non-match."

Significance

Multi-domain Application Value: Entity matching improves patient care in healthcare, links birth, marriage, and death records for historical population reconstruction, and supports investigation and crime prevention in law enforcement
Transparency Requirements: Models must provide reliable confidence scores alongside predictions to enable users to understand model reliability
Downstream Task Guidance: Accurate confidence scores can guide decision-making in subsequent tasks

Limitations of Existing Methods

Overconfidence Problem: Modern large language models exhibit overconfidence in other NLP tasks, struggling to accurately express prediction uncertainty
Research Gap: While LLMs have been studied for confidence calibration, their application in entity matching remains insufficiently explored
Lack of Systematic Evaluation: Absence of systematic comparative studies of confidence calibration methods for entity matching tasks

Research Motivation

Calibrated confidence scores make model predictions more transparent, help explain model behavior, expose model weaknesses, and point to ways of improving performance. When it is explicit where a model is uncertain, it becomes clearer where improvement is needed.

Core Contributions

First Systematic Study: First systematic investigation of confidence calibration for LLMs in entity matching
Multi-method Comparison: Comprehensive comparison of temperature scaling, Monte Carlo Dropout, and ensemble methods for confidence calibration in entity matching
Multi-dataset Validation: Validates method effectiveness and generalization across six datasets from different domains and structures
Practical Guidance: Provides best practice recommendations for confidence calibration in real applications, particularly highlighting advantages of temperature scaling

Methodology Details

Task Definition

Input: Entity pairs from different data sources
Output: Binary classification labels ("match"/"non-match") and corresponding confidence scores
Objective: Ensure confidence scores accurately reflect the true probability of correct predictions

Model Architecture

Base Architecture

Pre-trained RoBERTa: Uses HuggingFace's RoBERTa-base model as encoder
Fully Connected Layer: Single-layer fully connected network added after RoBERTa
Sigmoid Output Layer: Produces confidence scores between 0-1
Data Serialization: Converts structured data to text sequences using the method from Li et al. (2020)
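
The serialization step can be illustrated with a small sketch. It assumes the Ditto-style scheme described by Li et al. (2020), in which each attribute is written as "COL name VAL value" and the two records are joined by a separator token. The function names, the separator placeholder, the handling of missing values, and the sample records are all hypothetical; in practice the tokenizer would supply the model's own special tokens.

```python
def serialize_record(record):
    """Flatten one structured record into 'COL name VAL value ...' text."""
    parts = []
    for name, value in record.items():
        # Assumption: missing values are serialized as an empty value.
        text = "" if value is None else str(value)
        parts.append(f"COL {name} VAL {text}".rstrip())
    return " ".join(parts)


def serialize_pair(left, right, sep="[SEP]"):
    """Join two serialized records into one input sequence for the encoder."""
    return f"{serialize_record(left)} {sep} {serialize_record(right)}"


# Hypothetical product records, for illustration only.
left = {"title": "Sony Turntable PSLX350H", "price": 149.0}
right = {"title": "Sony PS-LX350H Belt-Drive Turntable", "price": None}
print(serialize_pair(left, right))
```

The resulting string is what the encoder sees, so the classifier head and every calibration method discussed here operate on a single text sequence per candidate pair.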

Confidence Calibration Methods

1. Temperature Scaling

Divides the logits by a temperature parameter T before the Sigmoid is applied
Optimizes temperature parameter through grid search on validation set: T ∈ {0.1, 0.2, ..., 10.0}
Selects temperature value minimizing ECE
Advantages: lightweight, easy to implement, does not alter F1 scores
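
The grid search can be sketched as follows. This is a minimal illustration, assuming the standard formulation in which the logit is divided by T before the sigmoid, and assuming an equal-width-bin ECE over the confidence of the predicted class. The helper names and the toy validation logits are hypothetical and are not the paper's code or data.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def ece(probs, labels, n_bins=None):
    """Expected Calibration Error for binary sigmoid outputs."""
    n = len(probs)
    n_bins = n_bins or max(1, round(math.sqrt(n)))  # sqrt(|D|) bins
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        pred = 1 if p >= 0.5 else 0
        conf = p if pred == 1 else 1.0 - p  # confidence in the predicted class
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if pred == y else 0.0))
    total = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(a for _, a in members) / len(members)
            total += len(members) / n * abs(accuracy - avg_conf)
    return total


def fit_temperature(val_logits, val_labels):
    """Pick T from {0.1, 0.2, ..., 10.0} that minimizes validation ECE."""
    grid = [round(0.1 * k, 1) for k in range(1, 101)]
    return min(grid, key=lambda t: ece([sigmoid(z / t) for z in val_logits], val_labels))


# Toy overconfident validation set: confidence about 0.98, accuracy 0.75.
val_logits = [4.0] * 8 + [-4.0] * 8
val_labels = [1] * 6 + [0] * 2 + [0] * 6 + [1] * 2
best_t = fit_temperature(val_logits, val_labels)
ece_before = ece([sigmoid(z) for z in val_logits], val_labels)
ece_after = ece([sigmoid(z / best_t) for z in val_logits], val_labels)
print(best_t, ece_before, ece_after)
```

Because T is positive, dividing by it never changes the sign of a logit, so predictions at the 0.5 threshold stay the same. This is why F1 is unaffected.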

2. Monte Carlo Dropout

Applies dropout (probability p) to fully connected layer during inference
Performs 10 forward passes and averages outputs
Grid searches optimal dropout probability: p ∈ {0.05, 0.10, ..., 0.95}
Selects p with minimum ECE while maintaining F1 scores
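
A minimal sketch of the inference procedure follows. It assumes dropout is applied to the inputs of the fully connected layer, with the encoder output treated as a fixed feature vector. The function name, feature values, and weights are hypothetical, and inverted-dropout rescaling is an assumption rather than a detail confirmed by the source.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def mc_dropout_predict(features, weights, bias, p, n_passes=10, rng=None):
    """Average the sigmoid output over several stochastic forward passes."""
    rng = rng or random.Random(0)
    keep = 1.0 - p  # requires p < 1
    outputs = []
    for _ in range(n_passes):
        # Inverted dropout: zero each input with probability p, rescale the rest.
        dropped = [x / keep if rng.random() < keep else 0.0 for x in features]
        logit = sum(w * x for w, x in zip(weights, dropped)) + bias
        outputs.append(sigmoid(logit))
    return sum(outputs) / n_passes, outputs


# Hypothetical encoder features and head parameters.
features = [0.8, -0.3, 1.2, 0.5]
weights = [1.5, -0.7, 0.9, 0.4]
mean_prob, samples = mc_dropout_predict(features, weights, bias=-0.2, p=0.5)
print(mean_prob, samples)
```

The spread of the individual samples is the variance information that the Future Directions section suggests exploiting. The grid search over p would wrap this function in the same way the temperature search wraps a single sigmoid call.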

3. Ensemble Method

Trains 5 fully connected layers with different random initializations
Averages outputs from 5 models as final prediction
Applies ensemble only to fully connected and Sigmoid layers to minimize computational cost
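
The head-only ensemble can be sketched as below. It assumes a shared, frozen encoder whose outputs are given as feature vectors, with five logistic heads that differ only in their random initialization. The toy features, the plain stochastic gradient descent loop, and all names are hypothetical stand-ins for the actual training setup.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_head(X, y, seed, epochs=200, lr=0.5):
    """Train one fully connected + sigmoid head from a seeded random init."""
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 0.1) for _ in X[0]]
    b = 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            grad = p - target  # gradient of binary cross-entropy w.r.t. the logit
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b


def ensemble_predict(x, heads):
    """Average the sigmoid outputs of all heads for one feature vector."""
    probs = [sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) for w, b in heads]
    return sum(probs) / len(probs)


# Toy frozen-encoder features (hypothetical) with match / non-match labels.
X = [[1.0, 0.2], [0.9, 0.1], [0.1, 0.8], [0.2, 1.0]]
y = [1, 1, 0, 0]
heads = [train_head(X, y, seed) for seed in range(5)]
print(ensemble_predict([0.95, 0.15], heads), ensemble_predict([0.1, 0.9], heads))
```

Only the small heads are trained five times, and the expensive encoder pass is shared. This is the cost saving the method description refers to.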

Technical Innovations

Lightweight Implementation: Monte Carlo Dropout and ensemble methods applied only to fully connected layers, minimizing computational overhead
Multi-metric Optimization: Can optimize ECE, MCE, or RMSCE depending on application requirements
Statistical Significance Verification: Uses paired t-tests (temperature scaling, Monte Carlo Dropout) and unpaired t-tests (ensemble method) to assess improvement significance
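
For the paired case, the test statistic can be computed by hand as in the sketch below. The per-run ECE values are invented for illustration and are not the paper's results. The 0.05 significance level is an assumption, and 2.776 is the standard two-sided critical value for 4 degrees of freedom (5 runs).

```python
import math
import statistics


def paired_t_statistic(before, after):
    """Return the paired t statistic and its degrees of freedom."""
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1


# Hypothetical ECE per run, before and after calibrating the same five models.
ece_baseline = [0.020, 0.018, 0.021, 0.019, 0.017]
ece_calibrated = [0.015, 0.014, 0.017, 0.013, 0.014]
t_stat, df = paired_t_statistic(ece_baseline, ece_calibrated)
CRITICAL_T_DF4 = 2.776  # two-sided, alpha = 0.05 (assumed level)
print(t_stat, df, abs(t_stat) > CRITICAL_T_DF4)
```

A paired test fits methods that recalibrate the same trained runs. An ensemble involves separately trained models, which is presumably why an unpaired test is used there.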

Experimental Setup

Datasets

Uses six entity matching datasets from different domains:

Dataset	Domain	Training	Validation	Test
Abt-Buy	Products	5,743 (10.72%)	1,916 (10.75%)	1,916 (10.75%)
DBLP-ACM-S/D	Citations	7,417 (17.96%)	2,473 (17.96%)	2,473 (17.96%)
iTunes-Amazon-S/D	Songs	321 (24.30%)	109 (27.78%)	109 (27.78%)
Company	Companies	67,596 (24.94%)	22,533 (25.30%)	22,503 (25.06%)

Note: S/D denotes structured/dirty data versions; percentages in parentheses indicate positive sample ratios

Evaluation Metrics

Expected Calibration Error (ECE): Primary metric measuring average difference between predicted and empirical probabilities
Maximum Calibration Error (MCE): Measures worst-case deviation, suitable for high-risk applications
Root Mean Square Calibration Error (RMSCE): Emphasizes larger errors more strongly
F1 Score: Ensures calibration improvements do not compromise classification performance
Visual Analysis: Confidence histograms and reliability diagrams
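
The three calibration metrics share the same binning and differ only in how per-bin gaps are aggregated. The sketch below assumes equal-width bins over the confidence of the predicted class and weights each bin by its share of samples. The paper's exact definitions may differ in detail, and the function name and toy predictions are hypothetical.

```python
import math


def calibration_errors(probs, labels, n_bins=None):
    """Return (ECE, MCE, RMSCE) for binary sigmoid outputs."""
    n = len(probs)
    n_bins = n_bins or max(1, round(math.sqrt(n)))  # default: sqrt(|D|) bins
    stats = [[0, 0.0, 0.0] for _ in range(n_bins)]  # count, sum of conf, sum of correct
    for p, y in zip(probs, labels):
        pred = 1 if p >= 0.5 else 0
        conf = p if pred == 1 else 1.0 - p
        idx = min(int(conf * n_bins), n_bins - 1)
        stats[idx][0] += 1
        stats[idx][1] += conf
        stats[idx][2] += 1.0 if pred == y else 0.0
    # (bin weight, |accuracy - mean confidence|) for every non-empty bin
    gaps = [(c / n, abs(correct / c - conf_sum / c)) for c, conf_sum, correct in stats if c]
    ece = sum(w * g for w, g in gaps)                    # weighted mean gap
    mce = max(g for _, g in gaps)                        # worst bin
    rmsce = math.sqrt(sum(w * g * g for w, g in gaps))   # penalizes large gaps more
    return ece, mce, rmsce


# Toy predictions: one bin at confidence 0.95 (75% correct), one at 0.65 (50% correct).
probs = [0.95] * 4 + [0.35] * 4
labels = [1, 1, 1, 0, 0, 0, 1, 1]
ece, mce, rmsce = calibration_errors(probs, labels, n_bins=10)
print(ece, mce, rmsce)
```

Since the bin weights sum to one, ECE is never larger than RMSCE, which in turn is never larger than MCE. The three metrics therefore grow increasingly sensitive to the worst-calibrated bins.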

Baseline Methods

Baseline: Uncalibrated RoBERTa Sigmoid output
Calibration Methods: Temperature scaling, Monte Carlo Dropout, ensemble method

Implementation Details

Training Epochs: 40 (following Li et al. 2020 settings)
Model Selection: Selects checkpoint with highest F1 on validation set
Experiment Repetition: Each experiment repeated 5 times with mean and standard deviation reported
Bin Count: √|D| (where |D| is the number of samples in the dataset being evaluated)

Experimental Results

Main Results

Baseline Performance Analysis

RoBERTa exhibits slight overconfidence across all datasets:

ECE Range: 0.0043-0.0552, lowest on DBLP-ACM, highest on Company dataset
Confidence Distribution: Model tends to produce extremely high or low prediction probabilities
F1 Performance: Exceeds 98% on DBLP-ACM, approximately 82% on Company dataset

Calibration Method Comparison

Dataset	Baseline ECE	Temp. Scaling ECE	MC Dropout ECE	Ensemble ECE
Abt-Buy	0.0193±0.0018	0.0147±0.0017	0.0193±0.0016	0.0173±0.0005
DBLP-ACM-S	0.0041±0.0010	0.0036±0.0011	0.0038±0.0010	0.0057±0.0023
Company	0.0552±0.0099	0.0424±0.0102	0.0543±0.0085	-

Temperature Scaling Performs Best:

Significantly reduces ECE by 23.83% on Abt-Buy dataset
Achieves significant improvements on 4 datasets
Does not affect F1 score performance
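
The reported percentage is a relative reduction, and it can be reproduced from the mean ECE values in the table above. The short check below assumes the figure was derived from those means; the paper may have computed it from per-run values instead.

```python
def relative_reduction(baseline, calibrated):
    """Relative ECE reduction in percent."""
    return (baseline - calibrated) / baseline * 100.0


# Mean ECE values (baseline, temperature scaling) taken from the table above.
table = {
    "Abt-Buy": (0.0193, 0.0147),
    "DBLP-ACM-S": (0.0041, 0.0036),
    "Company": (0.0552, 0.0424),
}
reductions = {name: relative_reduction(b, c) for name, (b, c) in table.items()}
for name, value in reductions.items():
    print(f"{name}: {value:.2f}%")
```

The Abt-Buy row yields the 23.83% figure, with Company close behind at roughly 23.2%.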

Ablation Studies

Temperature Parameter Analysis

Optimal Temperature Values: Typically greater than 1.0 (average 1.72±0.51), confirming baseline model overconfidence
Parameter Stability: Clear optimal temperature value exists for each dataset and run

Dropout Probability Analysis

Optimal Probability Range: Between 0.5 and 1.0, exceeding 0.8 for some datasets
Generalization Issues: Optimal dropout probabilities vary significantly across datasets, lacking consistency

Case Analysis

Confidence histograms reveal:

Correct Predictions: Primarily concentrated in high confidence intervals
Incorrect Predictions: More dispersed distribution, but substantial proportion still shows high confidence
Overlap Problem: Significant overlap between correct and incorrect prediction confidence distributions, indicating insufficient calibration

Experimental Findings

Widespread Overconfidence: RoBERTa exhibits varying degrees of overconfidence across all datasets
Temperature Scaling Most Effective: Outperforms other methods in improving ECE
Computational Efficiency Advantage: Temperature scaling has minimal computational overhead and easy deployment
Performance Preservation: Calibration methods essentially do not affect classification performance

Related Work

LLMs in Entity Matching

BERT Series Models: Brunner and Stockinger (2020) found BERT, RoBERTa and similar models achieve 35.9% F1 improvement over traditional methods
DITTO System: Li et al. (2020) combined LLMs with optimization techniques for entity matching
Decoder Models: Applications of GPT-3, ChatGPT, GPT-4 in entity matching research

LLM Confidence Calibration

Early Findings: Guo et al. (2017) discovered widespread calibration issues in modern neural networks
BERT/RoBERTa Research: Calibration studies by Desai and Durrett (2020), Xiao et al. (2022) across multiple NLP tasks
Calibration Methods: Development of temperature scaling, Monte Carlo Dropout, and ensemble methods

Novel Contributions of This Work

Domain First: First systematic application of confidence calibration to entity matching tasks
Method Comparison: Comprehensive comparison of multiple calibration methods
Practical Guidance: Provides best practice recommendations for real applications

Conclusions and Discussion

Main Conclusions

Overconfidence Confirmed: RoBERTa exhibits overconfidence in entity matching tasks with ECE scores of 0.0043-0.0552
Temperature Scaling Optimal: Temperature scaling is the most effective calibration method, reducing ECE by up to 23.83%
Performance Preservation: Confidence calibration does not compromise classification performance
Strong Practicality: Temperature scaling is simple to implement and suitable for real deployment

Limitations

Model Scale Constraints: Research focuses on relatively small RoBERTa models, not addressing larger modern LLMs
Evaluation Metric Limitations: ECE, MCE, RMSCE may not accurately reflect calibration quality in certain cases
Computational Constraints: Ensemble method experiments incomplete on Company dataset due to computational limitations
Methods Studied in Isolation: Does not explore combined use of multiple calibration methods

Future Directions

Large Model Extension: Extend research to larger-scale language models like GPT-4
Method Combination: Explore combinations of temperature scaling with other methods, such as Ensembles+Temperature Scaling
Variance Utilization: Leverage variance information from Monte Carlo Dropout and ensemble methods to improve calibration
New Evaluation Metrics: Develop evaluation metrics more accurately reflecting calibration quality

In-Depth Evaluation

Strengths

High Research Value: Fills research gap in confidence calibration for entity matching
Rigorous Experimental Design: Comprehensive comparison across multiple datasets, methods, and metrics
Statistical Rigor: Uses appropriate statistical tests to verify result significance
Strong Practicality: Provides directly applicable methods and parameter selection guidance
Clear Writing: Well-structured paper with accurate technical details

Weaknesses

Limited Model Coverage: Studies only RoBERTa architecture
Insufficient Theoretical Analysis: Lacks in-depth explanation of why temperature scaling performs best
Dataset Scale: Some datasets (e.g., iTunes-Amazon) are relatively small, potentially affecting generalization
Computational Resource Constraints: Affects completeness of certain experiments

Impact

Academic Contribution: Introduces important confidence calibration research direction to entity matching field
Practical Value: Temperature scaling method is simple, effective, and easily deployable in real systems
Reproducibility: Detailed experimental settings facilitate reproduction and extension
Foundation for Follow-up Work: Provides a starting point and direction for subsequent research

Applicable Scenarios

High-Risk Applications: Medical record matching and similar scenarios requiring reliable confidence estimates
Human-Machine Collaboration: Applications needing model uncertainty information to assist human decision-making
Quality Control: Using confidence scores to identify difficult samples requiring manual review
Model Optimization: Leveraging confidence information to improve model training and data collection strategies
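
The quality-control scenario can be sketched as a simple routing rule on calibrated confidences. The 0.9 threshold, the function name, and the sample scores are hypothetical. A real threshold would be chosen from the reliability diagram and the cost of a wrong automatic decision.

```python
def route_pairs(scored_pairs, threshold=0.9):
    """Split predictions into auto-accepted decisions and a manual-review queue."""
    auto, review = [], []
    for pair_id, p_match in scored_pairs:
        label = "match" if p_match >= 0.5 else "non-match"
        confidence = max(p_match, 1.0 - p_match)  # confidence in the predicted label
        target = auto if confidence >= threshold else review
        target.append((pair_id, label, confidence))
    return auto, review


# Hypothetical calibrated match probabilities for three candidate pairs.
scored = [("p1", 0.97), ("p2", 0.55), ("p3", 0.04)]
auto, review = route_pairs(scored)
print(auto)
print(review)
```

Such a rule is only as trustworthy as the confidences behind it, which is the practical argument for calibrating before deployment.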

References

Guo, C., et al. (2017). On Calibration of Modern Neural Networks. ICML.
Li, Y., et al. (2020). Deep Entity Matching with Pre-Trained Language Models. VLDB.
Desai, S., & Durrett, G. (2020). Calibration of Pre-trained Transformers. EMNLP.
Brunner, U., & Stockinger, K. (2020). Entity Matching with Transformer Architectures. EDBT.
Peeters, R., & Bizer, C. (2024). Entity Matching using Large Language Models. arXiv.

Summary: This paper makes important contributions to confidence calibration research in entity matching, providing systematic method comparison and practical solutions. The excellent performance of temperature scaling offers valuable guidance for practical applications. Despite certain limitations, this research establishes a solid foundation for subsequent work and possesses significant academic and practical value.