2025-11-20T21:55:15.461429

Diffusion Generative Recommendation with Continuous Tokens

Qu, Lin, Ding et al.

Recent advances in generative artificial intelligence, particularly large language models (LLMs), have opened new opportunities for enhancing recommender systems (RecSys). Most existing LLM-based RecSys approaches operate in a discrete space, using vector-quantized tokenizers to align with the inherent discrete nature of language models. However, these quantization methods often result in lossy tokenization and suboptimal learning, primarily due to inaccurate gradient propagation caused by the non-differentiable argmin operation in standard vector quantization. Inspired by the emerging trend of embracing continuous tokens in language models, we propose ContRec, a novel framework that seamlessly integrates continuous tokens into LLM-based RecSys. Specifically, ContRec consists of two key modules: a sigma-VAE Tokenizer, which encodes users/items with continuous tokens; and a Dispersive Diffusion module, which captures implicit user preference. The tokenizer is trained with a continuous Variational Auto-Encoder (VAE) objective, where three effective techniques are adopted to avoid representation collapse. By conditioning on the previously generated tokens of the LLM backbone during user modeling, the Dispersive Diffusion module performs a conditional diffusion process with a novel Dispersive Loss, enabling high-quality user preference generation through next-token diffusion. Finally, ContRec leverages both the textual reasoning output from the LLM and the latent representations produced by the diffusion model for Top-K item retrieval, thereby delivering comprehensive recommendation results. Extensive experiments on four datasets demonstrate that ContRec consistently outperforms both traditional and SOTA LLM-based recommender systems. Our results highlight the potential of continuous tokenization and generative modeling for advancing the next generation of recommender systems.

Basic Information

Paper ID: 2504.12007
Title: Diffusion Generative Recommendation with Continuous Tokens
Authors: Haohao Qu, Shanru Lin, Yujuan Ding, Yiqi Wang, Wenqi Fan
Categories: cs.IR cs.AI
Publication date/venue: arXiv preprint (revised version of October 10, 2025)
Paper link: https://arxiv.org/abs/2504.12007

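Summary

This paper addresses the limitations of discrete tokenization in LLM-based recommender systems and proposes ContRec, a framework that seamlessly integrates continuous tokens into LLM-based recommendation. ContRec has two core modules: a σ-VAE Tokenizer, which encodes users and items as continuous tokens, and a Dispersive Diffusion module, which captures implicit user preference. Top-K item retrieval combines the LLM's textual reasoning output with the latent representations generated by the diffusion model. Experiments on four datasets show that ContRec consistently outperforms both traditional and state-of-the-art LLM-based recommender systems.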

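Research Background and Motivation

Problem Definition

Existing LLM-based recommender systems face two key problems:

Lossy tokenization: vector quantization inevitably discards information during compression
Inaccurate gradient propagation: the non-differentiable argmin operation in standard vector quantization forces the use of the "straight-through" trick, which yields inaccurate gradients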
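
Both problems can be seen in a toy example. The sketch below quantizes one vector against a tiny codebook; the codebook values, the input, and the helper names `squared_distance` and `quantize` are illustrative assumptions, not taken from the paper.

```python
# Toy vector quantization: nearest-codeword lookup via argmin.
# Codebook and input values are made up for illustration.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(x, codebook):
    """Return (index, codeword) of the codeword closest to x."""
    index = min(range(len(codebook)),
                key=lambda i: squared_distance(x, codebook[i]))
    return index, codebook[index]

codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
x = [0.9, 0.7]

index, x_q = quantize(x, codebook)

# Lossy tokenization: x is replaced by x_q, so this error cannot be undone.
reconstruction_error = squared_distance(x, x_q)

# Non-differentiable argmin: nudging x slightly leaves the chosen index
# unchanged, so the index carries no gradient signal. The straight-through
# trick works around this by treating quantization as the identity map in
# the backward pass, which is only an approximation.
index_nudged, _ = quantize([0.91, 0.71], codebook)
```

Here the input maps to the second codeword both before and after the nudge, while the reconstruction error stays strictly positive.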

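Why the Problem Matters

LLMs show strong generalization and in-context learning abilities in recommender systems
User and item sets often reach the scale of millions, which makes traditional indexing methods inefficient
Quantization methods are practical but limited in reconstruction quality and generation performance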

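Limitations of Existing Methods

Discrete methods: approaches such as TIGER and UTGRec build a discrete vocabulary with VQ-VAE and lose information through compression
Continuous projection methods: approaches such as CoLLM and LLaRA use continuous tokens only on the input side; the output still relies on a discrete generator, which leaves a discrete-continuous mismatch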

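Motivation

Inspired by the trend of embracing continuous tokens in language models, the work explores the potential of continuous tokens and diffusion models in recommendation, aiming at higher-quality user preference modeling.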

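Core Contributions

Proposes the ContRec framework: the first framework to seamlessly integrate continuous tokens into LLM-based recommender systems, removing the limits imposed by quantization
Designs two key modules:
- σ-VAE Tokenizer: a robust continuous tokenizer that uses three techniques to prevent representation collapse
- Dispersive Diffusion module: generates implicit user preference representations through contrastive self-supervised learning
Introduces the Dispersive Loss: a contrastive learning mechanism that needs no explicit positive/negative sample pairs
Experimental validation: average gains of 11.76% in HR@10 and 10.11% in NDCG@10 over TIGER across four datasets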

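Method

Task Definition

Given a user set U = {u₁, u₂, ..., uₙ} and an item set V = {v₁, v₂, ..., vₘ}, the goal is to predict a user's future preferences from their historical interactions. Sequential recommendation is recast in the language-modeling paradigm: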

Yᵢ = LLM(P(Tᵢ, {Tⱼ|vⱼ ∈ V(uᵢ)}))

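Model Architecture

1. σ-VAE Tokenizer

Uses the VAE framework for quantization-free tokenization, with three key techniques:

Masking operation: an element-wise masking strategy based on a Bernoulli distribution

μₖ = Encₖ(Mask(x, ρ))

K-way encoder: parallel encoding channels that realize implicit encoding

zₖ = μₖ + σₖ ⊙ ε, where ε ~ N(0,1), σₖ ~ N(0,Σ)

Gaussian kernel: prevents variance collapse

x̂ = Dec(Concat({zₖ}ₖ₌₁ᴷ))

Loss function:

L_vae = ||x̂ - x||₂² + (β/K) ∑ₖ₌₁ᴷ ||μₖ||₂²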
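
The tokenizer equations can be traced end to end with a minimal sketch. Everything concrete here is an assumption for illustration: the encoders are plain scalings (`weights`), the decoder averages the K latent chunks, Σ is reduced to a scalar standard deviation `sigma_scale`, and one mask is shared by all channels. The real encoder and decoder are learned networks that the source does not specify.

```python
import random

def bernoulli_mask(x, rho, rng):
    """Zero out each element independently with probability rho."""
    return [0.0 if rng.random() < rho else v for v in x]

def encode_k_way(x, weights, rho, rng):
    """Toy K-way encoder: channel k scales the masked input by weights[k]."""
    masked = bernoulli_mask(x, rho, rng)
    return [[w * v for v in masked] for w in weights]

def sample_latents(mus, sigma_scale, rng):
    """z_k = mu_k + sigma_k * eps, sigma_k ~ N(0, sigma_scale^2), eps ~ N(0, 1)."""
    zs = []
    for mu in mus:
        sigma_k = rng.gauss(0.0, sigma_scale)
        zs.append([m + sigma_k * rng.gauss(0.0, 1.0) for m in mu])
    return zs

def decode(zs):
    """Toy decoder: average the K latent chunks element-wise."""
    k = len(zs)
    return [sum(chunk[d] for chunk in zs) / k for d in range(len(zs[0]))]

def vae_loss(x, x_hat, mus, beta):
    """L_vae = ||x_hat - x||^2 + (beta / K) * sum_k ||mu_k||^2."""
    recon = sum((a - b) ** 2 for a, b in zip(x_hat, x))
    reg = sum(sum(m * m for m in mu) for mu in mus) / len(mus)
    return recon + beta * reg

# Deterministic demo: masking and noise are switched off (rho = 0, sigma = 0),
# so the toy decoder reconstructs x exactly and only the mu penalty remains.
rng = random.Random(0)
x = [1.0, 2.0]
mus = encode_k_way(x, weights=[1.0, 1.0], rho=0.0, rng=rng)
zs = sample_latents(mus, sigma_scale=0.0, rng=rng)
x_hat = decode(zs)
loss = vae_loss(x, x_hat, mus, beta=0.1)
```

With a nonzero `rho` or `sigma_scale` the reconstruction term becomes positive, which is the pressure that pushes the learned encoder and decoder to be robust to masking and noise.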

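2. LLM User Modeling

Combines discrete semantic information with continuous collaborative knowledge:

Xᵢ := P(Tᵢ, {Tⱼ|vⱼ ∈ V(uᵢ)})

The special tokens ⟨z_start⟩ and ⟨z_end⟩ mark the start and end of a continuous-token sequence.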

3. Dispersive Diffusion Module

Conditional diffusion process:

L_diff = E_{(yᵢ, cᵢ, t)} ||ε - ε_θ(yᵢᵗ, cᵢ, t)||₂²
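
A minimal sketch of the noise-prediction objective for one sample follows. It assumes the standard DDPM forward process with a linear noise schedule; the schedule endpoints (`beta_start`, `beta_end`) and both stand-in predictors are hypothetical, since the source gives only the loss and the 1000 training steps. A real ε_θ would be a network conditioned on cᵢ.

```python
import math
import random

def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) for s = 1..t, linear schedule."""
    prod = 1.0
    for s in range(1, t + 1):
        beta = beta_start + (beta_end - beta_start) * (s - 1) / (num_steps - 1)
        prod *= 1.0 - beta
    return prod

def add_noise(y0, eps, t):
    """Forward process: y_t = sqrt(a_bar) * y_0 + sqrt(1 - a_bar) * eps."""
    a = alpha_bar(t)
    return [math.sqrt(a) * y + math.sqrt(1.0 - a) * e for y, e in zip(y0, eps)]

def diffusion_loss(eps, eps_pred):
    """Squared L2 distance between true and predicted noise."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))

def oracle_predictor(y_t, y0, t):
    """Hypothetical perfect denoiser that knows y_0 (valid for t >= 1)."""
    a = alpha_bar(t)
    return [(yt - math.sqrt(a) * y) / math.sqrt(1.0 - a)
            for yt, y in zip(y_t, y0)]

rng = random.Random(0)
y0 = [0.5, -1.0]                       # clean target token (made up)
eps = [rng.gauss(0.0, 1.0) for _ in y0]
t = 500
y_t = add_noise(y0, eps, t)

loss_oracle = diffusion_loss(eps, oracle_predictor(y_t, y0, t))  # near zero
loss_zero = diffusion_loss(eps, [0.0] * len(eps))                # equals ||eps||^2
```

Training averages this loss over samples, conditions, and timesteps, which is the expectation in L_diff.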

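Dispersive Loss:

L_disp = log E_{i,j}[exp(-D(hᵢ, hⱼ)/τ)]

This is a "contrastive loss without positive pairs" that encourages the representations within a batch to spread out.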
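
The behavior of this loss is easy to check numerically. The sketch below assumes D is the squared Euclidean distance and that the expectation runs over all ordered pairs in the batch, including i = j; the source fixes neither choice, and the batches are made up.

```python
import math

def dispersive_loss(batch, tau=1.0):
    """log of the batch mean of exp(-D(h_i, h_j) / tau), D = squared L2."""
    n = len(batch)
    total = 0.0
    for h_i in batch:
        for h_j in batch:
            d = sum((a - b) ** 2 for a, b in zip(h_i, h_j))
            total += math.exp(-d / tau)
    return math.log(total / (n * n))

collapsed = [[0.0, 0.0], [0.0, 0.0]]   # identical representations
spread = [[0.0, 0.0], [3.0, 4.0]]      # well-separated representations

loss_collapsed = dispersive_loss(collapsed)
loss_spread = dispersive_loss(spread)
```

A fully collapsed batch attains the maximum value of 0, and the loss decreases as the representations move apart, so minimizing it pushes the batch to disperse.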

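Technical Innovations

Continuous tokenization: avoids quantization entirely, preserving information integrity
Hybrid retrieval mechanism: combines the LLM's textual reasoning with the implicit representations generated by diffusion
End-to-end optimization: a unified objective that integrates three loss functions
Classifier-free guidance: controls the strength of personalization at inference time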
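
For the last point, the usual classifier-free guidance rule mixes a conditional and an unconditional noise prediction. The sketch below uses that common formulation; the source does not state ContRec's exact rule or guidance scale, so the function name `guided_noise`, the `scale` parameter, and the numbers are assumptions.

```python
def guided_noise(eps_cond, eps_uncond, scale):
    """Classifier-free guidance: eps_uncond + scale * (eps_cond - eps_uncond)."""
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]

eps_cond = [1.0, 2.0]     # prediction given the user condition (made up)
eps_uncond = [0.0, 1.0]   # prediction with the condition dropped (made up)

no_guidance = guided_noise(eps_cond, eps_uncond, scale=0.0)    # unconditional
plain_cond = guided_noise(eps_cond, eps_uncond, scale=1.0)     # conditional
strong = guided_noise(eps_cond, eps_uncond, scale=2.0)         # extrapolated
```

Scales above 1 extrapolate past the conditional prediction, which is how a single knob can strengthen the influence of the user condition at inference time.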

Experimental Setup

Datasets

Four benchmark datasets are used:

Dataset	Users	Items	Interactions	Avg. length	Density (%)
LastFM	1,091	3,685	52,670	48.3	1.31
ML1M	6,040	3,416	447,294	165.5	2.17
Beauty	22,363	12,101	278,641	8.9	0.07
Games	47,568	16,834	266,139	9.5	0.03

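Evaluation Metrics

HR@K (Hit Ratio): the fraction of test cases whose target item appears in the Top-K list
NDCG@K (Normalized Discounted Cumulative Gain): a rank-aware score that rewards hits placed higher in the Top-K list
K is set to 10 and 20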
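
Both metrics are straightforward to compute. The sketch below assumes a single held-out target item per user, the usual setup for next-item recommendation; the source does not spell out the evaluation protocol, and the item IDs are made up.

```python
import math

def hit_ratio_at_k(ranked_items, target, k):
    """1.0 if the target is among the first k ranked items, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0

def ndcg_at_k(ranked_items, target, k):
    """With one relevant item, NDCG@K reduces to 1 / log2(rank + 1)."""
    top_k = ranked_items[:k]
    if target not in top_k:
        return 0.0
    rank = top_k.index(target) + 1   # 1-indexed position
    return 1.0 / math.log2(rank + 1)

def evaluate(rankings, targets, k):
    """Average HR@K and NDCG@K over all test users."""
    n = len(targets)
    hr = sum(hit_ratio_at_k(r, t, k) for r, t in zip(rankings, targets)) / n
    ndcg = sum(ndcg_at_k(r, t, k) for r, t in zip(rankings, targets)) / n
    return hr, ndcg

rankings = [["a", "b", "c"], ["x", "y", "z"]]
targets = ["b", "q"]                 # second user's target is never retrieved
hr, ndcg = evaluate(rankings, targets, k=10)
```

In this toy case one of two users is a hit, and that hit sits at rank 2, so it contributes 1/log2(3) to NDCG before averaging.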

Baselines

Traditional sequential recommenders: GRU4Rec, SASRec, SSD4Rec, DreamRec
LLM-based recommender systems: P5, CoLLM, TIGER, TokenRec, LLaRA

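Implementation Details

Base model: Llama-3.2-1B-Instruct
Optimizer: AdamW (learning rate 1e-5/1e-4)
Batch size: 24
Maximum sequence length: 20
Diffusion steps: 1000 for training, 100 for inference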

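Experimental Results

Main Results

ContRec achieves the best performance on all datasets:

Dataset	Metric	Best baseline	ContRec	Improvement over TIGER
Beauty	HR@10	0.0442	0.0473±0.0017	7.74%
Games	HR@10	0.1018	0.1041±0.0036	8.66%
LastFM	HR@10	0.0525	0.0539±0.0034	15.42%
ML1M	HR@10	0.1076	0.1099±0.0066	15.20%

Compared with TIGER (a representative discrete method), ContRec improves HR@10 by 11.76% and NDCG@10 by 10.11% on average.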
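
The 11.76% figure is the mean of the four per-dataset HR@10 improvements in the last column of the table above. The short check below reproduces it and, for comparison, computes the margin over the best baseline on Beauty from the same table; the helper name `relative_gain` is an assumption.

```python
# Per-dataset HR@10 improvements (percent), copied from the table above.
improvements = {"Beauty": 7.74, "Games": 8.66, "LastFM": 15.42, "ML1M": 15.20}
average_improvement = sum(improvements.values()) / len(improvements)

def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0

# Margin of ContRec over the best baseline on Beauty (0.0473 vs 0.0442).
beauty_gain_over_best = relative_gain(0.0473, 0.0442)
```

The mean comes out at 11.755, which rounds to the reported 11.76%, while the Beauty margin over the best baseline is roughly 7.0%.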

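Ablation Study

Contribution of the key components:

Component	Beauty HR@10	ML1M HR@10	Impact
Full model	0.0473	0.1099	-
w/o diffusion	0.0431	0.1007	Significant drop
w/o Dispersive Loss	0.0448	0.1042	Clear drop
w/o σ	0.0457	0.1051	Performance drop
w/ VQ-VAE	0.0426	0.0974	Large drop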