Sample-Efficient Online Learning in LM Agents via Hindsight Trajectory Rewriting
Hu, Van Durme, Andreas et al.
Language model (LM) agents deployed in novel environments often exhibit poor sample efficiency when learning from sequential interactions. This significantly hinders the usefulness of such agents in environments where interaction is costly (for example, when they interact with humans or reset physical systems). While a number of existing LM agent architectures incorporate various mechanisms for experience storage and reflection, they make limited use of LMs' abilities to directly generate or reason about full counterfactual trajectories. We introduce ECHO (Experience Consolidation via Hindsight Optimization), a prompting framework that adapts hindsight experience replay from reinforcement learning for language model agents. ECHO generates optimized trajectories for alternative goals that could have been achieved during failed attempts, effectively creating synthetic positive examples from unsuccessful interactions. Our approach consists of two components: a hindsight rule that uses the language model itself to identify relevant subgoals and generate optimized trajectories, and an update rule that maintains compressed trajectory representations in memory. We evaluate ECHO on stateful versions of XMiniGrid, a text-based navigation and planning benchmark, and PeopleJoinQA, a collaborative information-gathering enterprise simulation. Across both domains, ECHO outperforms vanilla language agent baselines by up to 80%; in XMiniGrid, it also outperforms a number of sophisticated agent architectures including Reflexion and AWM, demonstrating faster adaptation to novel environments through more effective utilization of past experiences.
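The two components named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `HindsightMemory`, `hindsight_rule`, `consolidate`, and `stub_lm` are hypothetical, the prompts are placeholders, and the update rule shown (keep the shorter description for each goal) is an assumption about what "compressed trajectory representations" could mean in practice. The language model is stood in for by a plain function from prompt to text, so the example runs without any external service.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical type: an LM is any function from a prompt string to a completion.
LM = Callable[[str], str]


@dataclass
class HindsightMemory:
    """Maps a goal description to one compressed trajectory description."""
    entries: Dict[str, str] = field(default_factory=dict)

    def update(self, goal: str, summary: str) -> None:
        # Assumed update rule: keep only the shorter description per goal.
        old = self.entries.get(goal)
        if old is None or len(summary) < len(old):
            self.entries[goal] = summary


def hindsight_rule(lm: LM, trajectory: List[str]) -> Dict[str, str]:
    """Ask the LM which goals the trajectory could have reached, then ask it
    for an optimized trajectory toward each of those goals."""
    log = "\n".join(trajectory)
    goals_text = lm(
        "List goals, one per line, that this trajectory could have achieved:\n" + log
    )
    goals = [g.strip() for g in goals_text.splitlines() if g.strip()]
    return {
        g: lm(f"Write an optimized trajectory for goal '{g}' given:\n{log}").strip()
        for g in goals
    }


def consolidate(lm: LM, trajectory: List[str], memory: HindsightMemory) -> None:
    """Turn one (possibly failed) episode into synthetic positive examples."""
    for goal, summary in hindsight_rule(lm, trajectory).items():
        memory.update(goal, summary)


def stub_lm(prompt: str) -> str:
    # Stand-in for a real LM call, used only to make the sketch runnable.
    if prompt.startswith("List goals"):
        return "pick up the key\nopen the door"
    return "go to target; act"


memory = HindsightMemory()
consolidate(stub_lm, ["move north", "see key", "episode failed"], memory)
```

After a failed episode, `memory.entries` holds one rewritten trajectory per goal the stub proposed. In a real agent these entries would be placed in the prompt for later episodes.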