What Makes LLMs Effective Sequential Recommenders? A Study on Preference Intensity and Temporal Context

Ouyang, Wen, Zhang et al.

Sequential recommendation systems aspire to profile users by interpreting their interaction histories, echoing how humans make decisions by weighing experience, relative preference strength, and situational relevance. Yet, existing large language model (LLM)-based recommenders often fall short of mimicking the flexible, context-aware decision strategies humans exhibit, neglecting the structured, dynamic, and context-aware mechanisms fundamental to human behaviors. To bridge this gap, we propose RecPO, a preference optimization framework that models structured feedback and contextual delay to emulate human-like prioritization in sequential recommendation. RecPO exploits adaptive reward margins based on inferred preference hierarchies and temporal signals, enabling the model to favor immediately relevant items and to distinguish between varying degrees of preference and aversion. Extensive experiments across five real-world datasets demonstrate that RecPO not only yields performance gains over state-of-the-art baselines, but also mirrors key characteristics of human decision-making: favoring timely satisfaction, maintaining coherent preferences, and exercising discernment under shifting contexts.

Basic Information

Paper ID: 2506.02261
Title: What Makes LLMs Effective Sequential Recommenders? A Study on Preference Intensity and Temporal Context
Authors: Zhongyu Ouyang, Qianlong Wen, Chunhui Zhang, Yanfang Ye, Soroush Vosoughi
Affiliations: Dartmouth College, University of Notre Dame
Categories: cs.IR, cs.LG
Published: October 10, 2025 (arXiv v2)
Paper link: https://arxiv.org/abs/2506.02261v2

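Summary

Research Background and Motivation

Problem Definition

Existing large language model (LLM)-based sequential recommenders have the following main problems:

Binary preference modeling: existing methods such as DPO and its variants treat every preference as a binary pairwise comparison, ignoring differences in preference intensity
Missing temporal context: they do not model time sensitivity and cannot distinguish immediate from delayed gratification
Neglect of human decision mechanisms: they fail to emulate how humans weigh experience, relative preference strength, and situational relevance when making decisions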

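Research Motivation

Human decision-making exhibits graded preferences (strong liking vs. mild liking) and time sensitivity (immediate vs. delayed gratification). These traits are well established in behavioral economics and cognitive science, yet they are largely overlooked in preference alignment for current LLM recommenders. Through a systematic empirical study, the paper finds that incorporating comprehensive feedback (including negative interactions) and structured preference signals (such as ratings) significantly improves performance.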

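Core Insights

Through proof-of-concept experiments, the authors identify two key factors:

Preference intensity: the graded strength of a user's affinity or aversion
Temporal context: the immediacy of gratification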

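Core Contributions

Theoretical contribution: systematically demonstrates that preference intensity and temporal context are key factors for fine-grained preference modeling in LLM recommenders, challenging the prevailing binary preference paradigm
Methodological contribution: proposes the RecPO framework, which integrates these factors through adaptive reward margins based on preference intensity and temporal context
Empirical contribution: experiments on five datasets show that RecPO not only improves accuracy but also exhibits behavior consistent with human preferences: prioritizing timely gratification and maintaining consistent preferences under changing contexts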

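Method Details

Task Definition

Given the interaction history $H_u^t$ of user $u$ at time $t$ and a candidate item set $C = \{i^{(j)}\}_{j=1}^K$, where $H_u^t \cap C = \emptyset$ and $i_p^{t+1} \in C$, the model $\pi_\theta$ must predict the item $i_p^{t+1}$ that the user is most likely to prefer.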
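
The task setup above can be captured in a small data structure. The following is a minimal sketch, not the authors' code: the names `RecInstance` and `hit_at_1` and the use of string item IDs are assumptions made for illustration. It only enforces the two constraints stated in the definition, $H_u^t \cap C = \emptyset$ and $i_p^{t+1} \in C$.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class RecInstance:
    """One prediction instance (hypothetical container, for illustration)."""

    history: Tuple[str, ...]     # H_u^t: items the user has interacted with
    candidates: Tuple[str, ...]  # C: the K candidate items
    target: str                  # i_p^{t+1}: the item the user prefers next

    def __post_init__(self) -> None:
        # The history and the candidate set must not overlap.
        if set(self.history) & set(self.candidates):
            raise ValueError("history and candidates must be disjoint")
        # The ground-truth item must be among the candidates.
        if self.target not in self.candidates:
            raise ValueError("target must be one of the candidates")


def hit_at_1(instance: RecInstance, predicted: str) -> bool:
    """Return True when the predicted item is the ground-truth next item."""
    return predicted == instance.target
```

Validating at construction time keeps malformed instances (for example, a candidate that already appears in the history) out of both training and evaluation.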

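Core Method: The RecPO Framework

1. Adaptive Reward Margin

RecPO's core innovation is an adaptive target reward margin $\gamma_r$, determined dynamically by structured preference and relative recency: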

$\gamma_r = \lambda \frac{\phi(s_p, \Delta t_p)}{\phi(s_d, \Delta t_d)}$

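where:

$s_p, s_d$ are the structured preference scores of the preferred and dispreferred items, respectively
$\Delta t_p = t_p^+ - t$ denotes the time delay of the interaction
$\phi(s, \Delta t) = s/(\Delta t)^{0.5}$ is the utility function
$\lambda$ controls the magnitude of the margin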
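
As a worked illustration of the margin formula, here is a minimal sketch in Python. The function names (`utility`, `adaptive_margin`) and the example numbers are hypothetical. The sketch also assumes strictly positive delays and a nonzero dispreferred utility so that the ratio is well defined, which these notes do not state explicitly.

```python
import math


def utility(score: float, delay: float) -> float:
    """phi(s, dt) = s / sqrt(dt). Assumes delay > 0."""
    if delay <= 0:
        raise ValueError("delay must be strictly positive")
    return score / math.sqrt(delay)


def adaptive_margin(
    s_p: float, dt_p: float, s_d: float, dt_d: float, lam: float = 1.0
) -> float:
    """gamma_r = lam * phi(s_p, dt_p) / phi(s_d, dt_d)."""
    phi_d = utility(s_d, dt_d)
    if phi_d == 0:
        raise ValueError("dispreferred utility must be nonzero")
    return lam * utility(s_p, dt_p) / phi_d


# Hypothetical numbers: preferred item rated 5 with delay 1,
# dispreferred item rated 2 with delay 4, lam = 0.5.
# phi_p = 5 / 1 = 5, phi_d = 2 / 2 = 1, so gamma_r = 0.5 * 5 / 1.
print(adaptive_margin(5.0, 1.0, 2.0, 4.0, lam=0.5))  # -> 2.5
```

The margin grows when the preferred item is rated higher or arrives sooner, and shrinks when the dispreferred item has the higher utility, which matches the stated goal of favoring strong and timely preferences.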

2. Preference Distribution Modeling

Based on the Bradley-Terry model, RecPO models the preference probability as:

$P^*(y_p \succ y_d | x_u) = \sigma(r(x_u, y_p) - r(x_u, y_d) - \gamma_r)$
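
To make the role of the margin concrete, the pairwise probability can be evaluated numerically. This is a minimal sketch with hypothetical function names and reward values; it only shows that a larger $\gamma_r$ requires a larger reward gap $r(x_u, y_p) - r(x_u, y_d)$ to reach the same probability.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def preference_probability(r_p: float, r_d: float, margin: float) -> float:
    """P(y_p > y_d | x_u) = sigmoid(r_p - r_d - margin)."""
    return sigmoid(r_p - r_d - margin)


# Hypothetical rewards: with a gap of 1.0 and a margin of 1.0 the model
# is exactly undecided; without a margin it already favors y_p.
print(preference_probability(2.0, 1.0, margin=1.0))  # -> 0.5
print(preference_probability(2.0, 1.0, margin=0.0) > 0.5)  # -> True
```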

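3. Objective Function

The Plackett-Luce model is used to generalize pairwise comparison to a list-wise ranking framework, giving the final objective: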

$\mathcal{L}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x_u,y_p,T_d)\sim D}\left[\log \sigma\left(-\log \sum_{y_d \in T_d} \exp\left(\beta \log \frac{\pi_\theta(y_d|x_u)}{\pi_{\text{ref}}(y_d|x_u)} - \beta \log \frac{\pi_\theta(y_p|x_u)}{\pi_{\text{ref}}(y_p|x_u)} + \lambda \frac{\phi(s_p,\Delta t_p)}{\phi(s_d,\Delta t_d)}\right)\right)\right]$
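
A per-sample version of this objective can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, sequence log-probabilities are assumed to be precomputed, and the margin is given the sign that makes the single-item case reduce to the pairwise form $\sigma(r(x_u, y_p) - r(x_u, y_d) - \gamma_r)$ from the Bradley-Terry step.

```python
import math
from typing import Sequence


def implicit_reward(logp_policy: float, logp_ref: float, beta: float) -> float:
    """r = beta * log(pi_theta(y|x) / pi_ref(y|x)), from log-probabilities."""
    return beta * (logp_policy - logp_ref)


def softplus(x: float) -> float:
    """Stable log(1 + exp(x)), which equals -log(sigmoid(-x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def listwise_margin_loss(
    r_p: float, r_d: Sequence[float], margins: Sequence[float]
) -> float:
    """-log sigmoid(-log sum_d exp(r_d - r_p + gamma_d)) for one sample.

    r_p: implicit reward of the preferred item.
    r_d: implicit rewards of the dispreferred items in T_d.
    margins: one adaptive margin gamma_d per dispreferred item.
    """
    if not r_d or len(r_d) != len(margins):
        raise ValueError("need one margin per dispreferred item")
    terms = [rd - r_p + g for rd, g in zip(r_d, margins)]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(terms)
    lse = m + math.log(sum(math.exp(t - m) for t in terms))
    return softplus(lse)


# Hypothetical values: with one dispreferred item the loss equals
# -log sigmoid(r_p - r_d - gamma) = -log sigmoid(0) = log 2.
print(listwise_margin_loss(2.0, [1.0], [1.0]))
```

In practice the expectation over $D$ would be a batch mean, and the rewards would come from token-level log-probabilities of the policy and reference models; both are outside the scope of this sketch.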