This research aims to explore the intersection of Large Language Models and confidence calibration in Entity Matching. To this end, we perform an empirical study to compare baseline RoBERTa confidences for an Entity Matching task against confidences that are calibrated using Temperature Scaling, Monte Carlo Dropout and Ensembles. We use the Abt-Buy, DBLP-ACM, iTunes-Amazon and Company datasets. The findings indicate that the proposed modified RoBERTa model exhibits a slight overconfidence, with Expected Calibration Error scores ranging from 0.0043 to 0.0552 across datasets. We find that this overconfidence can be mitigated using Temperature Scaling, reducing Expected Calibration Error scores by up to 23.83%.
- Paper ID: 2509.19557
- Title: Confidence Calibration in Large Language Model-Based Entity Matching
- Authors: Iris Kamsteeg, Juan Cardenas-Cartagena, Floris van Beers, Gineke ten Holt, Tsegaye Misikir Tashu, Matias Valdenegro-Toro
- Classification: cs.CL cs.LG
- Publication Date: October 15, 2025 (arXiv v2)
- Institutions: Bernoulli Institute, University of Groningen, The Netherlands; Independent Researcher
- Paper Link: https://arxiv.org/abs/2509.19557
This study explores the intersection of large language models and confidence calibration in entity matching. Through an empirical investigation, it compares baseline RoBERTa confidence scores on an entity matching task with confidence scores calibrated using temperature scaling, Monte Carlo Dropout, and ensembles. Experiments are conducted on the Abt-Buy, DBLP-ACM, iTunes-Amazon, and Company datasets. Results show that the modified RoBERTa model exhibits slight overconfidence, with Expected Calibration Error (ECE) ranging from 0.0043 to 0.0552 across datasets. The study finds that temperature scaling mitigates this overconfidence, reducing ECE scores by up to 23.83%.
Entity Matching (EM) is a critical subtask of entity resolution, aiming to determine whether data entries from different sources refer to the same real-world entity. This is a binary classification problem requiring judgment of whether entity pairs are "matches" or "non-matches."
- Multi-domain Application Value: Improves patient care in healthcare, connects birth, marriage, and death records in historical population reconstruction, and is critical for investigation and crime prevention in law enforcement
- Transparency Requirements: Models must provide reliable confidence scores alongside predictions to enable users to understand model reliability
- Downstream Task Guidance: Accurate confidence scores can guide decision-making in subsequent tasks
- Overconfidence Problem: Modern large language models exhibit overconfidence in other NLP tasks, struggling to accurately express prediction uncertainty
- Research Gap: While LLMs have been studied for confidence calibration, their application in entity matching remains insufficiently explored
- Lack of Systematic Evaluation: Absence of systematic comparative studies of confidence calibration methods for entity matching tasks
Calibrated confidence scores make model predictions more transparent, help in understanding the model's internal behavior, expose its weaknesses, and can guide performance improvements. Knowing explicitly where a model is uncertain makes the directions for improvement clearer.
- First Systematic Study: First systematic investigation of confidence calibration for LLMs in entity matching
- Multi-method Comparison: Comprehensive comparison of temperature scaling, Monte Carlo Dropout, and ensemble methods for confidence calibration in entity matching
- Multi-dataset Validation: Validates method effectiveness and generalization across six datasets from different domains and structures
- Practical Guidance: Provides best practice recommendations for confidence calibration in real applications, particularly highlighting advantages of temperature scaling
- Input: Entity pairs from different data sources
- Output: Binary classification labels ("match"/"non-match") and corresponding confidence scores
- Objective: Ensure confidence scores accurately reflect the true probability of correct predictions
- Pre-trained RoBERTa: Uses HuggingFace's RoBERTa-base model as encoder
- Fully Connected Layer: Single-layer fully connected network added after RoBERTa
- Sigmoid Output Layer: Produces confidence scores between 0-1
- Data Serialization: Converts structured data to text sequences using the method from Li et al. (2020)
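The serialization step above can be illustrated with a minimal sketch. It assumes the `[COL] attribute [VAL] value` scheme described by Li et al. (2020); the function names, the sample records, and the literal `[SEP]` separator are illustrative assumptions (in practice the tokenizer inserts its own separator tokens), not the paper's code.

```python
def serialize_entity(record: dict) -> str:
    """Flatten one record into '[COL] attr [VAL] value ...' text.

    Assumption: attribute order follows the dict's insertion order and
    missing values (None) are serialized as empty strings.
    """
    parts = []
    for attribute, value in record.items():
        text = "" if value is None else str(value)
        parts.append(f"[COL] {attribute} [VAL] {text}".strip())
    return " ".join(parts)


def serialize_pair(left: dict, right: dict) -> str:
    """Join two serialized records with a placeholder separator token."""
    return f"{serialize_entity(left)} [SEP] {serialize_entity(right)}"


# Hypothetical product records, for illustration only.
left = {"title": "sony bravia 40 inch tv", "price": "499.99"}
right = {"title": "sony 40in bravia television", "price": None}
print(serialize_pair(left, right))
```

The resulting string is what a sequence classifier such as the RoBERTa encoder plus fully connected head would receive as input for one candidate pair.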
1. Temperature Scaling
- Divides the logits by a temperature parameter T before the Sigmoid is applied
- Optimizes temperature parameter through grid search on validation set: T ∈ {0.1, 0.2, ..., 10.0}
- Selects temperature value minimizing ECE
- Advantages: lightweight, easy to implement, does not alter F1 scores
2. Monte Carlo Dropout
- Applies dropout (probability p) to fully connected layer during inference
- Performs 10 forward passes and averages outputs
- Grid searches optimal dropout probability: p ∈ {0.05, 0.10, ..., 0.95}
- Selects p with minimum ECE while maintaining F1 scores
3. Ensemble Method
- Trains 5 fully connected layers with different random initializations
- Averages outputs from 5 models as final prediction
- Applies ensemble only to fully connected and Sigmoid layers to minimize computational cost
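As a rough illustration of the temperature scaling procedure listed above, the sketch below grid-searches T over {0.1, 0.2, ..., 10.0} and keeps the value with the lowest validation ECE. The helper names, the 10-bin ECE, and the choice of max(p, 1 - p) as the confidence of a binary prediction are assumptions made for this example, not details confirmed by the paper.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def ece(probs, labels, n_bins=10):
    """Binned Expected Calibration Error for binary match probabilities."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        pred = 1 if p >= 0.5 else 0
        conf = p if pred == 1 else 1.0 - p  # confidence in the predicted class
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if pred == y else 0.0))
    total = len(probs)
    error = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(a for _, a in members) / len(members)
            error += len(members) / total * abs(accuracy - avg_conf)
    return error


def fit_temperature(val_logits, val_labels, n_bins=10):
    """Return (T, ECE) for the grid value of T that minimizes validation ECE."""
    grid = [round(0.1 * k, 1) for k in range(1, 101)]  # 0.1, 0.2, ..., 10.0
    best_t, best_ece = 1.0, float("inf")
    for t in grid:
        probs = [sigmoid(z / t) for z in val_logits]  # scale logits, then Sigmoid
        score = ece(probs, val_labels, n_bins)
        if score < best_ece:
            best_t, best_ece = t, score
    return best_t, best_ece


# Toy validation logits and labels (made up): confident outputs, two mistakes.
val_logits = [4.0, 3.5, -4.2, 3.8, -3.9, 4.1, -3.6, 3.7]
val_labels = [1, 0, 0, 1, 0, 1, 1, 1]
temperature, calibrated_ece = fit_temperature(val_logits, val_labels)
print(temperature, calibrated_ece)
```

Because T is positive, dividing by it never changes the sign of a logit, so the predicted labels (and therefore F1) stay the same; only the confidence values move.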
- Lightweight Implementation: Monte Carlo Dropout and ensemble methods applied only to fully connected layers, minimizing computational overhead
- Multi-metric Optimization: Can optimize ECE, MCE, or RMSCE depending on application requirements
- Statistical Significance Verification: Uses paired t-tests (temperature scaling, Monte Carlo Dropout) and unpaired t-tests (ensemble method) to assess improvement significance
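The head-only Monte Carlo Dropout and ensemble averaging described above might look roughly like the following sketch. It models the fully connected layer as a single linear unit over a fixed encoder embedding; applying (inverted) dropout to the head's inputs, the function names, and the toy weights are all assumptions for illustration, since the summary does not specify these details.

```python
import math
import random


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def head_forward(features, weights, bias, drop_p=0.0, rng=None):
    """One pass through a linear head plus Sigmoid.

    If drop_p > 0, each input is dropped with probability drop_p and the
    survivors are rescaled by 1 / (1 - drop_p) (inverted dropout).
    """
    total = bias
    for x, w in zip(features, weights):
        if drop_p > 0.0:
            if rng.random() < drop_p:
                continue
            x = x / (1.0 - drop_p)
        total += x * w
    return sigmoid(total)


def mc_dropout_predict(features, weights, bias, drop_p, n_passes=10, seed=0):
    """Average several stochastic passes with dropout kept active at inference."""
    rng = random.Random(seed)
    outputs = [head_forward(features, weights, bias, drop_p, rng)
               for _ in range(n_passes)]
    return sum(outputs) / len(outputs)


def ensemble_predict(features, heads):
    """Average the outputs of independently initialized (weights, bias) heads."""
    outputs = [head_forward(features, w, b) for w, b in heads]
    return sum(outputs) / len(outputs)


# Toy encoder embedding and head parameters (made up for illustration).
embedding = [0.8, -0.3, 1.2, 0.5]
weights, bias = [1.5, -0.7, 0.9, 0.4], -0.2
print(mc_dropout_predict(embedding, weights, bias, drop_p=0.5))

heads = [([1.5, -0.7, 0.9, 0.4], -0.2), ([1.2, -0.5, 1.1, 0.3], 0.0),
         ([1.7, -0.9, 0.8, 0.6], -0.1), ([1.4, -0.6, 1.0, 0.2], 0.1),
         ([1.6, -0.8, 0.7, 0.5], -0.3)]
print(ensemble_predict(embedding, heads))
```

Since the encoder output is computed once and only the small head is re-evaluated, the extra cost of 10 dropout passes or 5 ensemble members stays small, which is the motivation the summary gives for restricting both methods to the fully connected layer.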
Uses six entity matching datasets from different domains:
| Dataset | Domain | Training | Validation | Test |
|---|---|---|---|---|
| Abt-Buy | Products | 5,743 (10.72%) | 1,916 (10.75%) | 1,916 (10.75%) |
| DBLP-ACM-S/D | Citations | 7,417 (17.96%) | 2,473 (17.96%) | 2,473 (17.96%) |
| iTunes-Amazon-S/D | Songs | 321 (24.30%) | 109 (27.78%) | 109 (27.78%) |
| Company | Companies | 67,596 (24.94%) | 22,533 (25.30%) | 22,503 (25.06%) |
Note: S/D denotes structured/dirty data versions; percentages in parentheses indicate positive sample ratios
- Expected Calibration Error (ECE): Primary metric measuring average difference between predicted and empirical probabilities
- Maximum Calibration Error (MCE): Measures worst-case deviation, suitable for high-risk applications
- Root Mean Square Calibration Error (RMSCE): Emphasizes larger errors more strongly
- F1 Score: Ensures calibration improvements do not compromise classification performance
- Visual Analysis: Confidence histograms and reliability diagrams
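The three calibration metrics listed above can be computed from the same confidence bins. The sketch below uses the common equal-width binned definitions (weighted mean gap for ECE, maximum gap for MCE, weighted root-mean-square gap for RMSCE); these definitions, the function name, and the square-root default for the bin count are assumptions for illustration and may differ from the paper's exact formulation.

```python
import math


def calibration_errors(confidences, correct, n_bins=None):
    """Return binned ECE, MCE and RMSCE.

    confidences: confidence of each prediction, in [0, 1]
    correct:     truthy if the corresponding prediction was right
    n_bins:      number of equal-width bins; defaults to round(sqrt(n))
    """
    n = len(confidences)
    if n_bins is None:
        n_bins = max(1, round(math.sqrt(n)))
    # Per bin: [sample count, sum of confidences, number of correct predictions]
    stats = [[0, 0.0, 0.0] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        stats[idx][0] += 1
        stats[idx][1] += conf
        stats[idx][2] += 1.0 if ok else 0.0

    ece = mce = squared = 0.0
    for count, conf_sum, correct_sum in stats:
        if count == 0:
            continue
        gap = abs(correct_sum / count - conf_sum / count)  # |accuracy - confidence|
        weight = count / n
        ece += weight * gap
        mce = max(mce, gap)
        squared += weight * gap * gap
    return {"ece": ece, "mce": mce, "rmsce": math.sqrt(squared)}


# Toy example: four predictions at 0.9 confidence, three of them correct.
print(calibration_errors([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0], n_bins=10))
```

With these definitions ECE is never larger than RMSCE, and RMSCE is never larger than MCE, which matches the intuition that the latter two weight large deviations more heavily.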
- Baseline: Uncalibrated RoBERTa Sigmoid output
- Calibration Methods: Temperature scaling, Monte Carlo Dropout, ensemble method
- Training Epochs: 40 (following Li et al. 2020 settings)
- Model Selection: Selects checkpoint with highest F1 on validation set
- Experiment Repetition: Each experiment repeated 5 times with mean and standard deviation reported
- Bin Count: √|D| (where |D| is the number of samples in dataset D)
RoBERTa exhibits slight overconfidence across all datasets:
- ECE Range: 0.0043-0.0552, lowest on DBLP-ACM, highest on Company dataset
- Confidence Distribution: Model tends to produce extremely high or low prediction probabilities
- F1 Performance: Exceeds 98% on DBLP-ACM, approximately 82% on Company dataset
| Dataset | Baseline ECE | Temp. Scaling ECE | MC Dropout ECE | Ensemble ECE |
|---|---|---|---|---|
| Abt-Buy | 0.0193±0.0018 | 0.0147±0.0017 | 0.0193±0.0016 | 0.0173±0.0005 |
| DBLP-ACM-S | 0.0041±0.0010 | 0.0036±0.0011 | 0.0038±0.0010 | 0.0057±0.0023 |
| Company | 0.0552±0.0099 | 0.0424±0.0102 | 0.0543±0.0085 | - |
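As a quick check, the relative ECE reduction from temperature scaling can be recomputed from the mean values in the table. This is a worked calculation on the rounded means shown here, so it only approximates figures derived from unrounded per-run results.

$$
\text{Abt-Buy:}\quad
\frac{\mathrm{ECE}_{\text{baseline}} - \mathrm{ECE}_{\text{TS}}}{\mathrm{ECE}_{\text{baseline}}}
= \frac{0.0193 - 0.0147}{0.0193}
= \frac{0.0046}{0.0193}
\approx 23.83\%
$$

$$
\text{Company:}\quad
\frac{0.0552 - 0.0424}{0.0552}
= \frac{0.0128}{0.0552}
\approx 23.19\%
$$

The Abt-Buy value agrees with the maximum reduction of 23.83% quoted in the abstract.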
Temperature Scaling Performs Best:
- Significantly reduces ECE by 23.83% on Abt-Buy dataset
- Achieves statistically significant improvements on 4 of the 6 datasets
- Does not affect F1 score performance
- Optimal Temperature Values: Typically greater than 1.0 (average 1.72±0.51), confirming baseline model overconfidence
- Parameter Stability: Clear optimal temperature value exists for each dataset and run
- Optimal Dropout Probability Range (Monte Carlo Dropout): Between 0.5 and 1.0, exceeding 0.8 for some datasets
- Generalization Issues: Optimal dropout probabilities vary significantly across datasets, lacking consistency
Confidence histograms reveal:
- Correct Predictions: Primarily concentrated in high confidence intervals
- Incorrect Predictions: More dispersed distribution, but substantial proportion still shows high confidence
- Overlap Problem: Significant overlap between correct and incorrect prediction confidence distributions, indicating insufficient calibration
- Widespread Overconfidence: RoBERTa exhibits varying degrees of overconfidence across all datasets
- Temperature Scaling Most Effective: Outperforms other methods in improving ECE
- Computational Efficiency Advantage: Temperature scaling has minimal computational overhead and easy deployment
- Performance Preservation: Calibration methods essentially do not affect classification performance
- BERT Series Models: Brunner and Stockinger (2020) found BERT, RoBERTa and similar models achieve 35.9% F1 improvement over traditional methods
- DITTO System: Li et al. (2020) combined LLMs with optimization techniques for entity matching
- Decoder Models: Applications of GPT-3, ChatGPT, GPT-4 in entity matching research
- Early Findings: Guo et al. (2017) discovered widespread calibration issues in modern neural networks
- BERT/RoBERTa Research: Calibration studies by Desai and Durrett (2020), Xiao et al. (2022) across multiple NLP tasks
- Calibration Methods: Development of temperature scaling, Monte Carlo Dropout, and ensemble methods
- Domain First: First systematic application of confidence calibration to entity matching tasks
- Method Comparison: Comprehensive comparison of multiple calibration methods
- Practical Guidance: Provides best practice recommendations for real applications
- Overconfidence Confirmed: RoBERTa exhibits overconfidence in entity matching tasks with ECE scores of 0.0043-0.0552
- Temperature Scaling Optimal: Temperature scaling is the most effective calibration method, reducing ECE by up to 23.83%
- Performance Preservation: Confidence calibration does not compromise classification performance
- Strong Practicality: Temperature scaling is simple to implement and suitable for real deployment
- Model Scale Constraints: Research focuses on relatively small RoBERTa models, not addressing larger modern LLMs
- Evaluation Metric Limitations: ECE, MCE, RMSCE may not accurately reflect calibration quality in certain cases
- Computational Constraints: Ensemble method experiments incomplete on Company dataset due to computational limitations
- Single-Method Focus: Does not explore combined use of multiple calibration methods
- Large Model Extension: Extend research to larger-scale language models like GPT-4
- Method Combination: Explore combinations of temperature scaling with other methods, such as Ensembles+Temperature Scaling
- Variance Utilization: Leverage variance information from Monte Carlo Dropout and ensemble methods to improve calibration
- New Evaluation Metrics: Develop evaluation metrics more accurately reflecting calibration quality
- High Research Value: Fills research gap in confidence calibration for entity matching
- Rigorous Experimental Design: Comprehensive comparison across multiple datasets, methods, and metrics
- Statistical Rigor: Uses appropriate statistical tests to verify result significance
- Strong Practicality: Provides directly applicable methods and parameter selection guidance
- Clear Writing: Well-structured paper with accurate technical details
- Limited Model Coverage: Studies only RoBERTa architecture
- Insufficient Theoretical Analysis: Lacks in-depth explanation of why temperature scaling performs best
- Dataset Scale: Some datasets (e.g., iTunes-Amazon) are relatively small, potentially affecting generalization
- Computational Resource Constraints: Affects completeness of certain experiments
- Academic Contribution: Introduces important confidence calibration research direction to entity matching field
- Practical Value: Temperature scaling method is simple, effective, and easily deployable in real systems
- Reproducibility: Detailed experimental settings facilitate reproduction and extension
- Inspirational Value: Provides important foundation and direction guidance for subsequent research
- High-Risk Applications: Medical record matching and similar scenarios requiring reliable confidence estimates
- Human-Machine Collaboration: Applications needing model uncertainty information to assist human decision-making
- Quality Control: Using confidence scores to identify difficult samples requiring manual review
- Model Optimization: Leveraging confidence information to improve model training and data collection strategies
- Guo, C., et al. (2017). On Calibration of Modern Neural Networks. ICML.
- Li, Y., et al. (2020). Deep Entity Matching with Pre-Trained Language Models. VLDB.
- Desai, S., & Durrett, G. (2020). Calibration of Pre-trained Transformers. EMNLP.
- Brunner, U., & Stockinger, K. (2020). Entity Matching with Transformer Architectures. EDBT.
- Peeters, R., & Bizer, C. (2024). Entity Matching using Large Language Models. arXiv.
Summary: This paper makes important contributions to confidence calibration research in entity matching, providing systematic method comparison and practical solutions. The excellent performance of temperature scaling offers valuable guidance for practical applications. Despite certain limitations, this research establishes a solid foundation for subsequent work and possesses significant academic and practical value.