One Sentence, Two Embeddings: Contrastive Learning of Explicit and Implicit Semantic Representations

Oda, Chuang, Shirai et al.

Sentence embedding methods have made remarkable progress, yet they still struggle to capture the implicit semantics within sentences. This can be attributed to the inherent limitations of conventional sentence embedding methods that assign only a single vector per sentence. To overcome this limitation, we propose DualCSE, a sentence embedding method that assigns two embeddings to each sentence: one representing the explicit semantics and the other representing the implicit semantics. These embeddings coexist in the shared space, enabling the selection of the desired semantics for specific purposes such as information retrieval and text classification. Experimental results demonstrate that DualCSE can effectively encode both explicit and implicit meanings and improve the performance of the downstream task.

Basic Information

Paper ID: 2510.09293
Title: One Sentence, Two Embeddings: Contrastive Learning of Explicit and Implicit Semantic Representations
Authors: Kohei Oda¹, Po-Min Chuang², Kiyoaki Shirai¹, Natthawut Kertkeidkachorn¹
Affiliations: ¹Japan Advanced Institute of Science and Technology, ²Toshiba Corporation
Category: cs.CL (Computation and Language)
Published: October 10, 2025
Paper link: https://arxiv.org/abs/2510.09293v1

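Completeness of semantic understanding: natural language carries both literal meaning (explicit semantics) and figurative or pragmatic meaning (implicit semantics)
Practical application needs: tasks such as information retrieval and text classification require understanding semantics at different levels
Model limitations: conventional methods represent a sentence with a single vector and so ignore the existence of multiple interpretations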

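Limitations of Existing Methods

Single-vector constraint: each sentence is assigned only one embedding vector
Mixed semantics: explicit and implicit semantics cannot be told apart
Limited representational capacity: the multiple layers of meaning in a sentence are hard to capture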

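Core Contributions

Proposes the DualCSE framework: generates two embedding vectors per sentence, one for explicit semantics and one for implicit semantics
Designs a new contrastive loss function: jointly optimizes inter-sentence and intra-sentence relations
Builds a shared space for both kinds of semantics: explicit and implicit embeddings can be compared within the same space
Validates the method: demonstrates its advantage on the RTE (Recognizing Textual Entailment) and EIS tasks
Provides implicitness estimation: can estimate how implicit a sentence is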

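Method Details

Task Definition

Given a sentence s, DualCSE encodes it into two embeddings:

r: the embedding representing the explicit semantics
u: the embedding representing the implicit semantics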

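Model Architecture

Encoder Design

The paper proposes two encoder architectures:

Cross-encoder:
- Uses a single BERT/RoBERTa model
- The input "[CLS] s [SEP] explicit" produces the explicit embedding r
- The input "[CLS] s [SEP] implicit" produces the implicit embedding u
Bi-encoder:
- Uses two separate BERT/RoBERTa models
- One model is trained to produce r and the other to produce u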
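
The cross-encoder input format can be illustrated with a small sketch. The helper name `build_cross_encoder_inputs` is hypothetical, and the `[CLS]`/`[SEP]` markers are written out as literal strings only for readability; a real BERT/RoBERTa tokenizer would insert its own special tokens.

```python
from typing import Dict


def build_cross_encoder_inputs(sentence: str) -> Dict[str, str]:
    """Build the two input strings that the single shared encoder receives.

    Hypothetical helper for illustration: special tokens are spelled out
    literally here, whereas a real tokenizer adds them itself.
    """
    return {
        "explicit": f"[CLS] {sentence} [SEP] explicit",  # encoded into r
        "implicit": f"[CLS] {sentence} [SEP] implicit",  # encoded into u
    }


inputs = build_cross_encoder_inputs("It is cold in here.")
print(inputs["explicit"])
print(inputs["implicit"])
```

The same sentence is encoded twice, and only the trailing prompt word tells the shared model which of the two embeddings to produce. The bi-encoder variant would instead pass the plain sentence to two separate models.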

Contrastive Loss Function

The loss function is designed around the INLI dataset:

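$$v(h_1, h_2) = e^{\mathrm{sim}(h_1, h_2)/\tau}$$

$$
\begin{aligned}
l_i = {} & -\log \frac{v(r_i, r^{+}_{i1})}{\sum_{j} \left( v(r_i, r^{+}_{j1}) + v(r_i, r^{-}_{j}) + v(r_i, u_j) \right)} \\
& -\log \frac{v(u_i, r^{+}_{i2})}{\sum_{j} \left( v(u_i, r^{+}_{j2}) + v(u_i, r^{-}_{j}) + v(u_i, r_j) \right)} \\
& -\log \frac{v(r^{+}_{i1}, u^{+}_{i1})}{\sum_{j} v(r^{+}_{i1}, u^{+}_{j1})} \\
& -\log \frac{v(r^{+}_{i2}, u^{+}_{i2})}{\sum_{j} v(r^{+}_{i2}, u^{+}_{j2})} \\
& -\log \frac{v(r^{-}_{i}, u^{-}_{i})}{\sum_{j} v(r^{-}_{i}, u^{-}_{j})}
\end{aligned}
$$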
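
To make the five terms of the loss concrete, here is a minimal pure-Python sketch of the per-example loss. Several things are assumptions rather than statements from the text: `sim` is taken to be cosine similarity, the index 1 is read as the explicit-entailment hypothesis and the index 2 as the implied-entailment hypothesis, and the names `info_nce`, `dualcse_loss_i`, and the batch dictionary keys are invented for illustration. Actual training would use a tensor library rather than Python lists.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, assumed here to play the role of sim(h1, h2)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def v(h1: Vector, h2: Vector, tau: float) -> float:
    """v(h1, h2) = exp(sim(h1, h2) / tau)."""
    return math.exp(cosine(h1, h2) / tau)


def info_nce(anchor: Vector, positive: Vector,
             candidates: List[Vector], tau: float) -> float:
    """-log( v(anchor, positive) / sum of v(anchor, c) over all candidates )."""
    denom = sum(v(anchor, c, tau) for c in candidates)
    return -math.log(v(anchor, positive, tau) / denom)


def dualcse_loss_i(i: int, batch: Dict[str, List[Vector]],
                   tau: float = 0.05) -> float:
    """Sum of the five terms of l_i for the i-th example in a batch."""
    r, u = batch["r"], batch["u"]                  # premise embeddings
    rp1, up1 = batch["r_pos1"], batch["u_pos1"]    # first entailed hypothesis
    rp2, up2 = batch["r_pos2"], batch["u_pos2"]    # second entailed hypothesis
    rn, un = batch["r_neg"], batch["u_neg"]        # contradiction hypothesis
    return (
        info_nce(r[i], rp1[i], rp1 + rn + u, tau)    # term 1: explicit premise
        + info_nce(u[i], rp2[i], rp2 + rn + r, tau)  # term 2: implicit premise
        + info_nce(rp1[i], up1[i], up1, tau)         # term 3: intra-sentence
        + info_nce(rp2[i], up2[i], up2, tau)         # term 4: intra-sentence
        + info_nce(rn[i], un[i], un, tau)            # term 5: intra-sentence
    )


# Toy batch of size one (terms 3 to 5 are zero when there is a single example).
good = {
    "r": [[1.0, 0.0]], "u": [[0.0, 1.0]],
    "r_pos1": [[1.0, 0.0]], "u_pos1": [[1.0, 0.0]],
    "r_pos2": [[0.0, 1.0]], "u_pos2": [[0.0, 1.0]],
    "r_neg": [[-1.0, 0.0]], "u_neg": [[-1.0, 0.0]],
}
# Same batch with the first positive and the negative swapped.
bad = dict(good, r_pos1=[[-1.0, 0.0]], r_neg=[[1.0, 0.0]])

print(dualcse_loss_i(0, good), dualcse_loss_i(0, bad))
```

Concatenating the candidate lists reproduces the summed denominators of the first two terms. The swapped batch yields a much larger loss because the premise's explicit embedding now points at the contradiction instead of the entailed hypothesis.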

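Technical Innovations

Dual semantic representation: moves past the single-vector constraint by giving each sentence two representations along different dimensions
Inter-sentence and intra-sentence relation modeling:
- Inter-sentence: a premise is similar to its entailed hypotheses and dissimilar to its contradicting hypothesis
- Intra-sentence: the explicit and implicit semantics of a hypothesis are close, while those of a premise are far apart
Shared-space design: different types of semantics can be compared within the same space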
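
Because both embeddings live in one space, the distance between a sentence's r and u can itself be measured. The sketch below assumes one plausible scoring rule, one minus the cosine similarity of r and u; this rule and the function name `implicitness_score` are assumptions for illustration and are not stated in the summary above. The toy vectors are invented.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def implicitness_score(r: Sequence[float], u: Sequence[float]) -> float:
    """Hypothetical score: grows as explicit and implicit embeddings diverge."""
    return 1.0 - cosine(r, u)


# Invented toy embeddings: a literal sentence (r and u nearly aligned)
# and a sentence with a strong implied reading (r and u far apart).
literal = implicitness_score([1.0, 0.0], [0.9, 0.1])
implied = implicitness_score([1.0, 0.0], [0.2, 0.9])
print(literal, implied)
```

Under this assumed rule, a sentence whose two embeddings coincide scores zero, and comparing two sentences reduces to comparing their scores.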

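Size: 32,000 training pairs, 4,000 development pairs, 4,000 test pairs
Characteristics: each premise comes with hypotheses under four labels
- implied-entailment: entailed implicitly
- explicit-entailment: entailed explicitly
- neutral: neutral
- contradiction: contradiction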
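
The four-label layout can be pictured as one record per premise that expands into labeled premise-hypothesis pairs. The class name `PremiseRecord`, its field names, and the example sentences are all invented for illustration and are not taken from the dataset.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PremiseRecord:
    """Hypothetical container: one premise with one hypothesis per label."""
    premise: str
    implied_entailment: str
    explicit_entailment: str
    neutral: str
    contradiction: str

    def to_pairs(self) -> List[Tuple[str, str, str]]:
        """Expand into (premise, hypothesis, label) triples."""
        return [
            (self.premise, self.implied_entailment, "implied-entailment"),
            (self.premise, self.explicit_entailment, "explicit-entailment"),
            (self.premise, self.neutral, "neutral"),
            (self.premise, self.contradiction, "contradiction"),
        ]


# Invented example sentences, not drawn from any dataset.
record = PremiseRecord(
    premise="Could you open a window? It is boiling in here.",
    implied_entailment="The speaker feels hot.",
    explicit_entailment="The speaker asks for a window to be opened.",
    neutral="The window faces the street.",
    contradiction="The speaker says the room is freezing.",
)
pairs = record.to_pairs()
for premise, hypothesis, label in pairs:
    print(label, "|", hypothesis)
```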

Wang et al. Dataset

Size: 101,320 training pairs; 5,630 pairs each for development and test
Use: the implicitness scoring task

Evaluation Metrics

RTE task: Accuracy
EIS task: Accuracy

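Baselines

SimCSE (SNLI+MNLI): trained on the standard NLI datasets
SimCSE (INLI): SimCSE trained on the INLI dataset
ImpScore: a method designed specifically for implicitness scoring
Large language models: GPT-4, Gemini-1.5-Pro, and others, included as reference points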

Implementation Details

Base models: BERT-base, RoBERTa-base
Batch size: 64 for the cross-encoder, 32 for the bi-encoder
Learning rate: 5e-5 for the cross-encoder, 3e-5 for the bi-encoder
Temperature τ: 0.05
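
The reported hyperparameters can be collected into a small configuration object. This is only a sketch: the class name `TrainConfig` and its field names are hypothetical, and settings not listed above (epochs, optimizer, sequence length) are deliberately left out rather than guessed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the hyperparameters listed above."""
    encoder_type: str
    batch_size: int
    learning_rate: float
    temperature: float = 0.05  # tau, the same value for both variants


CROSS_ENCODER = TrainConfig("cross-encoder", batch_size=64, learning_rate=5e-5)
BI_ENCODER = TrainConfig("bi-encoder", batch_size=32, learning_rate=3e-5)

for cfg in (CROSS_ENCODER, BI_ENCODER):
    print(cfg)
```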

Model	Explicit	Implicit	Neutral	Contradiction	Average
SimCSE (SNLI+MNLI)	79.80	49.00	74.30	67.60	67.68
SimCSE (INLI)	90.60	69.10	66.90	91.00	79.40
DualCSE-Cross	90.20	73.40	68.40	88.70	80.18
DualCSE-Bi	91.90	69.90	72.10	87.60	80.38
Gemini-1.5-Pro	97.90	80.30	92.00	95.40	91.40
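
The last column appears to be the unweighted mean of the four per-label accuracies. The following sketch recomputes it from the table values as a sanity check; the function name `recomputed_averages` is invented for illustration.

```python
from typing import Dict, List, Tuple

# Per-label accuracies and reported average, copied from the table above.
ROWS: Dict[str, Tuple[List[float], float]] = {
    "SimCSE (SNLI+MNLI)": ([79.80, 49.00, 74.30, 67.60], 67.68),
    "SimCSE (INLI)": ([90.60, 69.10, 66.90, 91.00], 79.40),
    "DualCSE-Cross": ([90.20, 73.40, 68.40, 88.70], 80.18),
    "DualCSE-Bi": ([91.90, 69.90, 72.10, 87.60], 80.38),
    "Gemini-1.5-Pro": ([97.90, 80.30, 92.00, 95.40], 91.40),
}


def recomputed_averages() -> Dict[str, float]:
    """Unweighted mean of the four per-label accuracies for each model."""
    return {name: sum(scores) / len(scores) for name, (scores, _) in ROWS.items()}


for name, mean in recomputed_averages().items():
    reported = ROWS[name][1]
    print(f"{name}: recomputed {mean:.3f}, reported {reported:.2f}")
```

Each recomputed mean agrees with the reported average to within rounding at the second decimal place.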