2025-11-14T11:40:11.153329

One Sentence, Two Embeddings: Contrastive Learning of Explicit and Implicit Semantic Representations

Oda, Chuang, Shirai et al.
Sentence embedding methods have made remarkable progress, yet they still struggle to capture the implicit semantics within sentences. This can be attributed to the inherent limitations of conventional sentence embedding methods that assign only a single vector per sentence. To overcome this limitation, we propose DualCSE, a sentence embedding method that assigns two embeddings to each sentence: one representing the explicit semantics and the other representing the implicit semantics. These embeddings coexist in the shared space, enabling the selection of the desired semantics for specific purposes such as information retrieval and text classification. Experimental results demonstrate that DualCSE can effectively encode both explicit and implicit meanings and improve the performance of the downstream task.
academic

Basic Information

  • Paper ID: 2510.09293
  • Title: One Sentence, Two Embeddings: Contrastive Learning of Explicit and Implicit Semantic Representations
  • Authors: Kohei Oda¹, Po-Min Chuang², Kiyoaki Shirai¹, Natthawut Kertkeidkachorn¹
  • Affiliations: ¹Japan Advanced Institute of Science and Technology, ²Toshiba Corporation
  • Classification: cs.CL (Computation and Language)
  • Publication Date: October 10, 2025
  • Paper Link: https://arxiv.org/abs/2510.09293v1

Abstract

Sentence embedding methods have achieved significant progress, yet still face difficulties in capturing implicit semantics within sentences. This can be attributed to the inherent limitation of traditional sentence embedding methods, which assign only a single vector to each sentence. To overcome this limitation, this paper proposes DualCSE, a method that assigns two embeddings to each sentence: one representing explicit semantics and another representing implicit semantics. These embeddings coexist in a shared space, enabling the selection of desired semantics for specific purposes such as information retrieval and text classification. Experimental results demonstrate that DualCSE effectively encodes both explicit and implicit meanings, improving performance on downstream tasks.

Research Background and Motivation

Problem Definition

Existing sentence embedding methods exhibit significant deficiencies in handling implicit semantics. Sun et al. (2025) point out that even state-of-the-art sentence embedding methods show approximately a 20% performance gap between explicit and implicit semantics on the MTEB classification benchmark.

Problem Significance

  1. Completeness of Semantic Understanding: Natural language contains both literal meanings (explicit semantics) and figurative or pragmatic meanings (implicit semantics)
  2. Practical Application Requirements: Tasks such as information retrieval and text classification require understanding different levels of semantics
  3. Model Limitations: Traditional methods represent sentences with only a single vector, overlooking the existence of multiple interpretations

Limitations of Existing Methods

  • Single-Vector Constraint: Each sentence is assigned only one embedding vector
  • Semantic Conflation: Inability to distinguish between explicit and implicit semantics
  • Insufficient Representational Capacity: Difficulty in capturing multi-layered meanings of sentences

Core Contributions

  1. Proposes DualCSE Framework: Generates two embedding vectors for each sentence, representing explicit and implicit semantics respectively
  2. Designs Novel Contrastive Loss Function: Simultaneously optimizes inter-sentence and intra-sentence relationships
  3. Constructs Dual-Semantic Shared Space: Enables comparison of explicit and implicit embeddings in the same space
  4. Validates Method Effectiveness: Demonstrates effectiveness on the recognizing textual entailment (RTE) and estimating implicitness score (EIS) tasks
  5. Provides Implicitness Assessment Capability: Enables estimation of the degree of implicitness in sentences

Methodology Details

Task Definition

Given a sentence s, DualCSE encodes it into two embeddings:

  • r: Embedding representing explicit semantics
  • u: Embedding representing implicit semantics

Model Architecture

Encoder Design

The paper proposes two encoder architectures:

  1. Cross-encoder:
    • Uses a single BERT/RoBERTa model
    • Input "[CLS] s [SEP] explicit" generates explicit embedding r
    • Input "[CLS] s [SEP] implicit" generates implicit embedding u
  2. Bi-encoder:
    • Uses two independent BERT/RoBERTa models
    • Separately trained to generate r and u
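
As a rough illustration of the difference between the two architectures, the sketch below treats an encoder as any function from text to a vector. The names `Encoder`, `embed_cross`, `embed_bi`, and `toy_encoder` are hypothetical, and the special tokens are written as literal strings only for readability; a real tokenizer would normally insert them itself.

```python
from typing import Callable, List, Tuple

Vector = List[float]
Encoder = Callable[[str], Vector]  # hypothetical: maps text to an embedding


def embed_cross(encoder: Encoder, sentence: str) -> Tuple[Vector, Vector]:
    """One shared encoder; the trailing prompt word selects the embedding type."""
    r = encoder(f"[CLS] {sentence} [SEP] explicit")
    u = encoder(f"[CLS] {sentence} [SEP] implicit")
    return r, u


def embed_bi(enc_explicit: Encoder, enc_implicit: Encoder,
             sentence: str) -> Tuple[Vector, Vector]:
    """Two independent encoders; each one sees the plain sentence."""
    return enc_explicit(sentence), enc_implicit(sentence)


def toy_encoder(text: str) -> Vector:
    """Stand-in for BERT/RoBERTa: NOT a real model, just a deterministic toy."""
    return [float(len(text)), float(text.endswith("implicit"))]


r, u = embed_cross(toy_encoder, "She conquered his heart.")
```

The cross-encoder keeps a single set of weights and doubles the number of forward passes, while the bi-encoder doubles the weights; either way each sentence yields one r and one u.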

Contrastive Loss Function

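The loss function is designed around the structure of the INLI dataset:

$$
v(h_1, h_2) = \exp\left(\mathrm{sim}(h_1, h_2) / \tau\right)
$$

$$
\begin{aligned}
l_i = &-\log \frac{v(r_i, r^{+}_{i1})}{\sum_j \left( v(r_i, r^{+}_{j1}) + v(r_i, r^{-}_{j}) + v(r_i, u_j) \right)} \\
      &-\log \frac{v(u_i, r^{+}_{i2})}{\sum_j \left( v(u_i, r^{+}_{j2}) + v(u_i, r^{-}_{j}) + v(u_i, r_j) \right)} \\
      &-\log \frac{v(r^{+}_{i1}, u^{+}_{i1})}{\sum_j v(r^{+}_{i1}, u^{+}_{j1})} \\
      &-\log \frac{v(r^{+}_{i2}, u^{+}_{i2})}{\sum_j v(r^{+}_{i2}, u^{+}_{j2})} \\
      &-\log \frac{v(r^{-}_{i}, u^{-}_{i})}{\sum_j v(r^{-}_{i}, u^{-}_{j})}
\end{aligned}
$$

Here sim is a similarity function, τ is the temperature, and r_i and u_i are the explicit and implicit embeddings of the i-th premise. The superscripts + and − mark embeddings of entailment and contradiction hypotheses, and the subscripts 1 and 2 distinguish the two entailment hypotheses of premise i (presumably explicit and implied entailment, respectively).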
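
The five terms of the loss can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: it assumes sim is cosine similarity, that index 1 refers to the explicit-entailment hypothesis and index 2 to the implied-entailment hypothesis, and the batch keys (`r`, `u`, `r_p1`, `u_p1`, `r_p2`, `u_p2`, `r_n`, `u_n`) are hypothetical names. The vectors are hand-made toys.

```python
import math
from typing import Dict, List

Vector = List[float]


def cos_sim(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def v(h1: Vector, h2: Vector, tau: float) -> float:
    return math.exp(cos_sim(h1, h2) / tau)


def nce(anchor: Vector, positive: Vector, candidates: List[Vector], tau: float) -> float:
    """-log( v(anchor, positive) / sum over candidates of v(anchor, c) )."""
    denom = sum(v(anchor, c, tau) for c in candidates)
    return -math.log(v(anchor, positive, tau) / denom)


def dualcse_loss(b: Dict[str, List[Vector]], i: int, tau: float = 0.05) -> float:
    # Inter-sentence terms: premise embeddings vs. hypothesis explicit embeddings.
    t1 = nce(b["r"][i], b["r_p1"][i], b["r_p1"] + b["r_n"] + b["u"], tau)
    t2 = nce(b["u"][i], b["r_p2"][i], b["r_p2"] + b["r_n"] + b["r"], tau)
    # Intra-sentence terms: r and u of each hypothesis are pulled together.
    t3 = nce(b["r_p1"][i], b["u_p1"][i], b["u_p1"], tau)
    t4 = nce(b["r_p2"][i], b["u_p2"][i], b["u_p2"], tau)
    t5 = nce(b["r_n"][i], b["u_n"][i], b["u_n"], tau)
    return t1 + t2 + t3 + t4 + t5


# Toy batch of size 1 whose geometry already satisfies the objective.
good = {
    "r": [[1.0, 0.0]], "u": [[0.0, 1.0]],
    "r_p1": [[1.0, 0.0]], "u_p1": [[1.0, 0.0]],
    "r_p2": [[0.0, 1.0]], "u_p2": [[0.0, 1.0]],
    "r_n": [[-1.0, 0.0]], "u_n": [[-1.0, 0.0]],
}
# Same batch with the entailment and contradiction hypotheses swapped.
bad = dict(good, r_p1=good["r_n"], r_n=good["r_p1"])

loss_good = dualcse_loss(good, 0)
loss_bad = dualcse_loss(bad, 0)
```

Because the premise's own implicit embedding u_j sits in the denominator of the first term (and r_j in the second), the explicit and implicit embeddings of a premise are pushed apart, while the last three terms pull the two embeddings of each hypothesis together.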

Technical Innovations

  1. Dual Semantic Representation: Moves beyond the single-vector limitation by giving each sentence two distinct representations
  2. Inter-sentence and Intra-sentence Relationship Modeling:
    • Inter-sentence: Premises are similar to entailing hypotheses and dissimilar to contradicting hypotheses
    • Intra-sentence: Explicit and implicit semantics of hypotheses are similar; explicit and implicit semantics of premises are dissimilar
  3. Shared Space Design: Enables comparison of different semantic types in the same space

Experimental Setup

Datasets

INLI Dataset

  • Scale: 32,000 training pairs, 4,000 development pairs, 4,000 test pairs
  • Characteristics: Provides four hypothesis labels for each premise
    • implied-entailment: Implicit entailment
    • explicit-entailment: Explicit entailment
    • neutral: Neutral
    • contradiction: Contradiction

Wang et al. Dataset

  • Scale: 101,320 training pairs, and 5,630 pairs each for development and test
  • Purpose: Implicitness scoring task

Evaluation Metrics

  • RTE Task: Accuracy
  • EIS Task: Accuracy
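
One way the EIS accuracy could be computed is sketched below. It rests on a loudly stated assumption: that a sentence's implicitness is scored by how far apart its explicit and implicit embeddings are, here `1 - cos(r, u)`. The summary does not give the scoring formula, and the function names and toy vectors are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]
Embedded = Tuple[Vector, Vector]  # (r, u) for one sentence


def cos_sim(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))


def implicitness(r: Vector, u: Vector) -> float:
    """ASSUMED score: grows as explicit and implicit embeddings diverge."""
    return 1.0 - cos_sim(r, u)


def pairwise_accuracy(pairs: List[Tuple[Embedded, Embedded]]) -> float:
    """Each pair is (more_implicit_sentence, more_explicit_sentence)."""
    correct = sum(
        1 for implicit, explicit in pairs
        if implicitness(*implicit) > implicitness(*explicit)
    )
    return 100.0 * correct / len(pairs)


# Toy data: the first pair is ranked correctly, the second is not.
diverging = ([1.0, 0.0], [0.0, 1.0])
identical = ([1.0, 0.0], [1.0, 0.0])
accuracy = pairwise_accuracy([(diverging, identical), (identical, diverging)])
```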

Baseline Methods

  1. SimCSE (SNLI+MNLI): Trained on standard NLI datasets
  2. SimCSE (INLI): SimCSE trained on INLI dataset
  3. ImpScore: Method specifically designed for implicitness scoring
  4. Large Language Models: GPT-4, Gemini-1.5-Pro as reference

Implementation Details

  • Base Models: BERT-base, RoBERTa-base
  • Batch Size: 64 for Cross-encoder, 32 for Bi-encoder
  • Learning Rate: 5e-5 for Cross-encoder, 3e-5 for Bi-encoder
  • Temperature Parameter τ: 0.05
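
For reference, the reported hyperparameters can be collected in a small configuration object. The class and field names below are hypothetical; only the numeric values come from the list above, and no optimizer, epoch count, or checkpoint name is implied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    encoder_type: str        # "cross" or "bi" (labels chosen for this sketch)
    batch_size: int
    learning_rate: float
    temperature: float = 0.05  # tau in the contrastive loss


CONFIGS = {
    "cross": TrainConfig(encoder_type="cross", batch_size=64, learning_rate=5e-5),
    "bi": TrainConfig(encoder_type="bi", batch_size=32, learning_rate=3e-5),
}
```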

Experimental Results

Main Results

RTE Task Results

Model               Explicit  Implicit  Neutral  Contradiction  Average
SimCSE (SNLI+MNLI)  79.80     49.00     74.30    67.60          67.68
SimCSE (INLI)       90.60     69.10     66.90    91.00          79.40
DualCSE-Cross       90.20     73.40     68.40    88.70          80.18
DualCSE-Bi          91.90     69.90     72.10    87.60          80.38
Gemini-1.5-Pro      97.90     80.30     92.00    95.40          91.40
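
The Average column appears to be the unweighted mean of the four per-label accuracies. The short check below recomputes it from the reported numbers; the variable names are illustrative, and the tolerance only absorbs rounding to two decimals.

```python
# (Explicit, Implicit, Neutral, Contradiction), reported Average
rte_rows = {
    "SimCSE (SNLI+MNLI)": ([79.80, 49.00, 74.30, 67.60], 67.68),
    "SimCSE (INLI)": ([90.60, 69.10, 66.90, 91.00], 79.40),
    "DualCSE-Cross": ([90.20, 73.40, 68.40, 88.70], 80.18),
    "DualCSE-Bi": ([91.90, 69.90, 72.10, 87.60], 80.38),
    "Gemini-1.5-Pro": ([97.90, 80.30, 92.00, 95.40], 91.40),
}


def macro_average(scores):
    return sum(scores) / len(scores)


# True for every row if the reported average matches the recomputed mean.
consistent = {
    model: abs(macro_average(scores) - reported) < 0.01
    for model, (scores, reported) in rte_rows.items()
}
```

All five rows pass this check, so the averages weight the four labels equally.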

EIS Task Results

Model                INLI   Wang et al. Dataset
LENGTH               99.90  73.37
ImpScore (original)  80.55  95.20
ImpScore (INLI)      99.97  81.56
DualCSE-Cross        99.97  79.31
DualCSE-Bi           100    77.48

Ablation Study

Ablation experiments validate the importance of each component in the loss function:

Loss Function Configuration                         RTE    EIS
Complete DualCSE                                    80.18  99.97
Without Contradiction Term                          64.57  99.88
Without Intra-sentence Relations                    80.10  92.25
Without Contradiction and Intra-sentence Relations  64.68  32.75

Findings:

  • Contradiction term is more important for RTE task
  • Intra-sentence relations are more important for EIS task
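
The size of each effect can be read off by subtracting every ablated configuration from the complete model. The snippet below does that arithmetic on the reported numbers; the dictionary layout is just a convenience for this sketch.

```python
# (RTE accuracy, EIS accuracy) per loss configuration
ablation = {
    "Complete DualCSE": (80.18, 99.97),
    "Without Contradiction Term": (64.57, 99.88),
    "Without Intra-sentence Relations": (80.10, 92.25),
    "Without Contradiction and Intra-sentence Relations": (64.68, 32.75),
}

full_rte, full_eis = ablation["Complete DualCSE"]

# Accuracy lost relative to the complete loss, in percentage points.
drops = {
    name: (round(full_rte - rte, 2), round(full_eis - eis, 2))
    for name, (rte, eis) in ablation.items()
    if name != "Complete DualCSE"
}
```

Removing the contradiction term costs 15.61 points on RTE but only 0.09 on EIS, whereas removing the intra-sentence relations costs 0.08 points on RTE and 7.72 on EIS. Removing both drops EIS by 67.22 points, far more than the two individual EIS drops combined.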

Case Analysis

Retrieval Experiment Example

Query Sentence: "She conquered his heart."

Explicit Semantic Retrieval Results:

  1. "She defeated his heart in battle." (Literal battle meaning)
  2. "She overcame his cardiac defenses."
  3. "She vanquished his emotional barriers."

Implicit Semantic Retrieval Results:

  1. "She won his affection and love." (Romantic meaning)
  2. "She captured his romantic interest."
  3. "She gained his deep emotional attachment."
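
Because both embeddings live in one shared space, switching between the two retrieval modes only requires swapping the query vector. The sketch below ranks candidates by cosine similarity. It assumes candidates are indexed by their explicit embeddings, and the 2-D vectors are hand-made toys rather than model outputs; `retrieve` is a hypothetical helper.

```python
import math
from typing import Dict, List

Vector = List[float]


def cos_sim(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))


def retrieve(query_vec: Vector, corpus: Dict[str, Vector], k: int = 1) -> List[str]:
    """Return the k corpus sentences most similar to the query vector."""
    ranked = sorted(corpus, key=lambda s: cos_sim(query_vec, corpus[s]), reverse=True)
    return ranked[:k]


# Toy explicit embeddings of two candidate sentences (NOT model outputs).
corpus_r = {
    "She defeated his heart in battle.": [1.0, 0.1],
    "She won his affection and love.": [0.1, 1.0],
}

# Toy explicit (r) and implicit (u) embeddings of "She conquered his heart."
query_r = [0.9, 0.2]
query_u = [0.2, 0.9]

explicit_hit = retrieve(query_r, corpus_r)
implicit_hit = retrieve(query_u, corpus_r)
```

With these toy vectors the explicit query surfaces the literal paraphrase and the implicit query surfaces the romantic one, mirroring the example above.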

Related Work

Sentence Embedding Methods

  • BERT-based Methods: Sentence-BERT, SimCSE, etc.
  • Contrastive Learning: Applications in sentence embeddings
  • Multi-semantic Representations: Limited work attempting to capture multiple meanings

Implicit Semantic Understanding

  • Pragmatics Research: Conversational implicature, indirect speech acts
  • NLI Extensions: From explicit reasoning to implicit reasoning
  • Implicitness Assessment: Quantifying the degree of implicitness in sentences

Advantages of This Work

  1. First Systematic Approach: Specifically addresses dual representation of explicit/implicit semantics
  2. End-to-End Training: Unified framework simultaneously learns both semantic types
  3. Strong Practicality: Directly applicable to multiple downstream tasks

Conclusions and Discussion

Main Conclusions

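  1. DualCSE Effectiveness: Outperforms the SimCSE baselines on RTE (80.18 and 80.38 average accuracy versus 79.40) and matches or exceeds ImpScore on the INLI test set for EIS, although it trails ImpScore on the Wang et al. dataset and Gemini-1.5-Pro on RTE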
  2. Value of Dual Representation: Separated representation of explicit and implicit semantics indeed facilitates understanding
  3. Reasonable Loss Function Design: Modeling both inter-sentence and intra-sentence relationships is important
  4. Architecture Flexibility: Both Cross-encoder and Bi-encoder work effectively

Limitations

  1. Dataset Dependency: Trained only on INLI dataset, limited domain diversity
  2. Limited Evaluation Tasks: Validated on only two tasks, lacking broader evaluation
  3. Computational Overhead: Requires generating two embeddings per sentence, increasing computational cost
  4. Cross-domain Generalization: Performance on Wang et al. dataset is inferior to specialized methods

Future Directions

  1. Dataset Expansion: Convert hate speech detection, sentiment analysis, and other datasets to INLI format
  2. Large Model Integration: Extend method to large language models
  3. Practical Applications: Validate in scenarios such as customer review analysis and search engines
  4. Theoretical Analysis: Investigate mathematical properties of explicit/implicit semantics

In-Depth Evaluation

Strengths

  1. Clear Problem Definition: Accurately identifies core issues with existing methods
  2. Strong Method Innovation: Dual semantic representation is a novel and reasonable approach
  3. Comprehensive Experimental Design: Includes main experiments, ablation studies, and qualitative analysis
  4. Feasible Technical Implementation: Provides two different architectural choices
  5. Open Source Code: Enhances reproducibility

Weaknesses

  1. Weak Theoretical Foundation: Lacks theoretical analysis of explicit/implicit semantic distinction
  2. Limited Evaluation Scope: Validation on only two tasks lacks sufficient persuasiveness
  3. Insufficient Baseline Comparisons: Lacks comparison with other multi-semantic representation methods
  4. Missing Efficiency Analysis: Does not analyze computational overhead of dual embeddings
  5. Unknown Cross-lingual Capability: Validated only on English

Impact

  1. Academic Value: Provides new perspective for sentence embedding research
  2. Practical Value: Applicable to NLP tasks requiring understanding of implicit meanings
  3. Inspirational Value: May inspire more research on multi-semantic representations
  4. Limited Scope: Impact may be constrained by method generalizability

Applicable Scenarios

  1. Information Retrieval: Search requiring consideration of both literal and implicit meanings
  2. Text Classification: Sentiment analysis, intent recognition, and similar tasks
  3. Dialogue Systems: Understanding user's implied meanings
  4. Content Moderation: Detecting subtle inappropriate content
  5. Language Education: Facilitating understanding of multi-layered language meanings

References

The paper cites key works on sentence embeddings, natural language inference, and contrastive learning, including:

  • Gao et al. (2021): SimCSE method
  • Havaldar et al. (2025): INLI dataset
  • Wang et al. (2025): Implicitness scoring method
  • Reimers and Gurevych (2019): Sentence-BERT

Overall Assessment: The paper is technically innovative, proposing an interesting and practical dual semantic representation method. Although its theoretical depth and evaluation breadth leave room for improvement, it opens a new direction for sentence embedding research and has both academic value and application potential.