Sample-Efficient Online Learning in LM Agents via Hindsight Trajectory Rewriting

Hu, Van Durme, Andreas et al.
Language model (LM) agents deployed in novel environments often exhibit poor sample efficiency when learning from sequential interactions. This significantly hinders the usefulness of such agents in environments where interaction is costly (for example, when they interact with humans or reset physical systems). While a number of existing LM agent architectures incorporate various mechanisms for experience storage and reflection, they make limited use of LMs' abilities to directly generate or reason about full counterfactual trajectories. We introduce ECHO (Experience Consolidation via Hindsight Optimization), a prompting framework that adapts hindsight experience replay from reinforcement learning for language model agents. ECHO generates optimized trajectories for alternative goals that could have been achieved during failed attempts, effectively creating synthetic positive examples from unsuccessful interactions. Our approach consists of two components: a hindsight rule that uses the language model itself to identify relevant subgoals and generate optimized trajectories, and an update rule that maintains compressed trajectory representations in memory. We evaluate ECHO on stateful versions of XMiniGrid, a text-based navigation and planning benchmark, and PeopleJoinQA, a collaborative information-gathering enterprise simulation. Across both domains, ECHO outperforms vanilla language agent baselines by up to 80%; in XMiniGrid, it also outperforms a number of sophisticated agent architectures including Reflexion and AWM, demonstrating faster adaptation to novel environments through more effective utilization of past experiences.

Basic Information

  • Paper ID: 2510.10304
  • Title: Sample-Efficient Online Learning in LM Agents via Hindsight Trajectory Rewriting
  • Authors: Michael Y. Hu (NYU), Benjamin Van Durme (Microsoft), Jacob Andreas (Microsoft), Harsh Jhamtani (Microsoft)
  • Classification: cs.LG cs.AI cs.CL
  • Publication Date: October 11, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.10304
  • Code Link: https://github.com/michahu/echo

Abstract

Language model (LM) agents often exhibit poor sample efficiency when learning through sequential interactions in new environments. This severely limits the practical deployment of such agents in high-cost interaction scenarios (e.g., human-agent interaction or physical system resets). While existing LM agent architectures incorporate various experience storage and reflection mechanisms, they underutilize the LM's capacity to directly generate or reason about complete counterfactual trajectories. This paper introduces ECHO (Experience Consolidation via Hindsight Optimization), a prompting framework that adapts hindsight experience replay from reinforcement learning to language model agents. ECHO generates optimized trajectories for alternative goals that could have been achieved in failed attempts, effectively creating synthetic positive examples from unsuccessful interactions. The method comprises two components: hindsight rules that use the language model itself to identify relevant subgoals and generate optimized trajectories, and update rules that maintain compressed trajectory representations in memory.

Research Background and Motivation

Core Problems

  1. Low Sample Efficiency: LM agents exhibit poor sample efficiency when learning in new environments, particularly in high-cost interaction scenarios
  2. Limited Counterfactual Reasoning: Existing methods primarily focus on storing or synthesizing experiences without fully leveraging the LM's ability to reason about counterfactual trajectories
  3. Sparse Reward Environments: Agents struggle to learn from failed experiences in reward-sparse settings

Problem Significance

  • Practical Application Needs: Improving sample efficiency is critical in high-cost scenarios such as human-agent interaction or physical system resets
  • Adaptation Requirements: Agents need to rapidly adapt to new environments, such as conversational assistants learning information retrieval and communication strategies in new organizations

Limitations of Existing Methods

  1. Reflexion: Provides high-level reflections, but feedback is often too generic to change model performance
  2. AWM (Agent Workflow Memory): Only stores workflows of successful trajectories, underutilizing failed experiences
  3. Traditional Experience Replay: Primarily focuses on numerical rewards and states, unable to perform flexible trajectory editing

Core Contributions

  1. Proposes ECHO Framework: The first prompting framework to adapt hindsight experience replay (HER) to language model agents
  2. Innovative Trajectory Rewriting Mechanism: Enables arbitrary rewriting of failed trajectories, including changing goals and intermediate steps
  3. Constructs Stateful Benchmarks: Creates XMiniGrid-Stateful and PeopleJoinQA-Stateful environments requiring exploration
  4. Significant Performance Gains: Up to 80% improvement over the ReAct baseline on XMiniGrid, and 42% improvement over the second-best baseline

Methodology Details

Task Definition

Consider an online setting where an LM agent sequentially processes a query sequence from time t=0 to T without access to true reward functions or demonstration data. The agent must learn through environment interaction and improve the efficiency of future decisions.
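
The online setting described above can be sketched as a loop in which memory is the only state carried from one query to the next. This is a minimal illustration, not the paper's code: `run_online`, `toy_act`, and `toy_consolidate` are hypothetical names, and the toy agent and environment are stand-ins so the loop runs end to end. Rewards are logged for evaluation only and are never shown to the agent.

```python
from typing import Callable, Dict, List, Tuple

Trajectory = List[str]          # hypothetical: a trajectory as a list of text steps
Memory = Dict[str, Trajectory]  # hypothetical: goal description -> stored trajectory


def run_online(
    queries: List[str],
    act: Callable[[str, Memory], Tuple[Trajectory, float]],
    consolidate: Callable[[Trajectory, Memory], Memory],
) -> List[float]:
    """Process queries in order; only `memory` persists between them."""
    memory: Memory = {}
    rewards: List[float] = []
    for query in queries:
        trajectory, reward = act(query, memory)   # reward is logged for evaluation only
        memory = consolidate(trajectory, memory)  # the update never reads `reward`
        rewards.append(reward)
    return rewards


def toy_act(query: str, memory: Memory) -> Tuple[Trajectory, float]:
    """Toy agent: succeeds only if memory already holds a route to the object."""
    if query in memory:
        return memory[query], 1.0
    return [f"saw {query}"], 0.0  # first visit: the object is seen but not picked up


def toy_consolidate(trajectory: Trajectory, memory: Memory) -> Memory:
    """Toy memory update: store a route for every object the trajectory observed."""
    updated = dict(memory)
    for step in trajectory:
        if step.startswith("saw "):
            obj = step[len("saw "):]
            updated[obj] = [f"go to {obj}", f"pick up {obj}"]
    return updated


rewards = run_online(["key", "key", "ball"], toy_act, toy_consolidate)
```

In this toy run the second query for "key" succeeds because the first, failed attempt left a usable entry in memory, which is the behavior the stateful setting is meant to reward.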

ECHO Architecture

Core Components

ECHO comprises two main components:

  1. Hindsight Rule:
    • Proposes achievable goals from a given trajectory
    • Generates optimized trajectories or descriptions for these goals
    • Takes no action if no goals can be proposed
  2. Update Rule:
    • Compares newly generated descriptions with previous descriptions
    • Saves the shorter workflow (based on minimum description length principle)
    • Maintains compressed trajectory representations

Algorithm Flow

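def ECHO(LM, trajectory, replay_buf=None):
    # Avoid a shared mutable default; start from an empty buffer if none is given
    if replay_buf is None:
        replay_buf = {}

    # Hindsight rule: summarize the trajectory and propose goals it could have achieved
    summary = LM.summarize(trajectory)
    goals = LM.identify_goals(trajectory)
    for goal in goals:
        # Generate an optimized trajectory (or description) for this goal
        new_traj = LM.infer_traj(goal, trajectory)

        # Update rule: keep the new entry if the goal is unseen or the new entry is shorter
        old_traj = replay_buf.get(goal)
        if old_traj is None or len(new_traj) < len(old_traj):
            replay_buf[goal] = new_traj
    return replay_buf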
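
To exercise the two rules without calling a real model, the sketch below swaps the LM for a hypothetical `StubLM` that returns canned outputs. Everything here is an assumption made for illustration: the "saw X" step format, the `echo_step` name, and the use of step count as the length measure (the paper's update rule compares descriptions, which may be measured differently).

```python
from typing import Dict, List, Optional

Trajectory = List[str]


class StubLM:
    """Hypothetical stand-in for an LM; returns canned outputs instead of calling a model."""

    def identify_goals(self, trajectory: Trajectory) -> List[str]:
        # Treat every object the agent observed as a goal it could have achieved.
        return [s[len("saw "):] for s in trajectory if s.startswith("saw ")]

    def infer_traj(self, goal: str, trajectory: Trajectory) -> Trajectory:
        # Keep the steps up to the first sighting of the goal, then pick it up.
        idx = trajectory.index(f"saw {goal}")
        return trajectory[: idx + 1] + [f"pick up {goal}"]


def echo_step(
    lm: StubLM,
    trajectory: Trajectory,
    replay_buf: Optional[Dict[str, Trajectory]] = None,
) -> Dict[str, Trajectory]:
    replay_buf = {} if replay_buf is None else replay_buf
    for goal in lm.identify_goals(trajectory):      # hindsight rule
        new_traj = lm.infer_traj(goal, trajectory)
        old_traj = replay_buf.get(goal)             # update rule
        if old_traj is None or len(new_traj) < len(old_traj):
            replay_buf[goal] = new_traj
    return replay_buf


# A failed search for a gray key that happened to pass a gray star.
failed = ["move forward", "turn left", "saw gray star", "move forward"]
buf = echo_step(StubLM(), failed)
first_entry = list(buf["gray star"])

# A later trajectory that reaches the same object in fewer steps replaces the stored one.
buf = echo_step(StubLM(), ["saw gray star", "turn right"], buf)
```

The first call stores a four-step route to the gray star even though the original goal was missed; the second call overwrites it with a two-step route because the shorter entry wins.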

Technical Innovations

  1. Enhanced Expressiveness: Unlike traditional HER which only relabels goals, ECHO can arbitrarily rewrite trajectory structure
  2. Leveraging Pretrained Knowledge: Uses the LM's world knowledge to fill information gaps and propose reasonable counterfactual information
  3. Compressed Representations: Motivated by Kolmogorov complexity and the minimum description length principle, keeps the shortest description found so far for achieving each goal
  4. Adaptive Mechanisms: The LM can choose abstraction levels to avoid adding invalid trajectories

Experimental Setup

Datasets

XMiniGrid-Stateful

  • Base Environment: Procedurally generated 2D GridWorld navigation and planning tasks
  • Stateful Modification: Agents execute randomly sampled goals in the same environment, learning locations of unseen objects
  • Scale: 10 unique environments, each with 4 rooms and 4 objects, 16 queries per environment
  • Task: Pick randomly sampled objects within 64 steps; partial observability increases challenge

PeopleJoinQA-Stateful

  • Base Environment: Multi-agent collaborative information gathering question-answering task
  • Stateful Modification: Fixed organizational structure; agent answers all questions about that organization
  • Scale: 5 organizations, 248 total queries, averaging 7.98 messages per query
  • Task: Contact simulated personnel via tool calls, synthesize information to answer questions

Evaluation Metrics

  1. Final Average Reward (Accuracy): Measures final performance
  2. Cumulative Average Reward: Measures sample efficiency
    Cumulative Average Reward at $\tau$ = $\frac{1}{\tau+1} \sum_{t=0}^{\tau} R_t$, where $R_t$ is the reward on query $t$
  3. Improvement Relative to ReAct Baseline: Normalizes problem difficulty
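
The metrics above are simple to compute from a list of per-query rewards. The sketch below is illustrative only: the function names are hypothetical, the reward values are made up and are not results from the paper, and the relative-improvement formula (gain divided by the baseline score) is an assumed definition.

```python
from typing import List, Sequence


def cumulative_average_reward(rewards: Sequence[float]) -> List[float]:
    """Return the running mean of R_0..R_tau for every tau = 0..T."""
    out: List[float] = []
    total = 0.0
    for tau, r in enumerate(rewards):
        total += r
        out.append(total / (tau + 1))
    return out


def relative_improvement(method: float, baseline: float) -> float:
    """Fractional gain of `method` over `baseline` (assumed definition)."""
    return (method - baseline) / baseline


toy_rewards = [0.0, 1.0, 1.0, 0.0]  # made-up per-query rewards
curve = cumulative_average_reward(toy_rewards)
gain = relative_improvement(0.75, 0.5)  # made-up final scores
```

The last entry of `curve` equals the plain average over all queries, while earlier entries show how quickly the agent starts collecting reward, which is why the running mean serves as a sample-efficiency measure.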

Comparison Methods

  1. ReAct: Reasoning-acting baseline agent
  2. Reflexion: Language-based reinforcement learning for language agents
  3. AWM: Agent Workflow Memory
  4. AWM++: AWM + ECHO's update rule

Implementation Details

  • Model: GPT-4o
  • Temperature Settings: 0 for ReAct, 0.7 for offline reasoning in PeopleJoinQA
  • Max Tokens: 3800-4000
  • Trajectory Validity: 85% of synthetic trajectories are executable in XMiniGrid

Experimental Results

Main Results

XMiniGrid-Stateful

  • vs. ReAct: 80% improvement in average reward
  • vs. Second-Best Baseline: 42% improvement
  • Sample Efficiency: Cumulative reward exceeds ReAct baseline after 3 interactions
  • Strictly Superior: Outperforms all comparison methods including Reflexion and AWM

PeopleJoinQA-Stateful

  • Accuracy: 4.6% below Reflexion, but still above ReAct
  • Efficiency: Average reduction of 1.6 messages, on par with AWM
  • Sample Efficiency: Exceeds ReAct baseline after first query

Trajectory Validity Analysis

Among 40 sampled examples in XMiniGrid:

  • 85% Success Rate: Agents successfully achieve synthesized goals
  • Failure Causes: 4 cases due to execution deviation, 2 cases due to infeasible steps
  • Conclusion: Counterfactual workflows generated by ECHO are mostly correct and effective
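
As a quick consistency check on the figures above, the failure counts and the sample size reproduce the reported success rate. The variable names are illustrative.

```python
sampled = 40
failures = {"execution deviation": 4, "infeasible steps": 2}

successes = sampled - sum(failures.values())
success_rate = successes / sampled
print(successes, success_rate)
```

Six failures out of 40 samples leave 34 successes, which matches the 85% figure.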

Case Analysis

Failed Trajectory Example: Agent fails to pick up gray key

  • Reflexion Output: Generic feedback lacking specific improvement suggestions
  • AWM Output: Correctly generates no workflow due to failure
  • ECHO Output: Identifies that agent observed gray star, generates optimized trajectory for picking up gray star

Cross-Organization Variability

In PeopleJoinQA, optimal methods vary across organizations:

  • No method strictly dominates across all organizations
  • ECHO becomes the most efficient method in certain organizations (e.g., department stores)
  • Indicates need for improved robustness of offline methods

Related Work

Language Model Agents

  • Current Status: Transition from static knowledge dependency to dynamic environment adaptation
  • Main Challenges: Insufficient exploration and adaptation capabilities in new environments
  • Application Domains: Web navigation, tool use, multi-agent collaboration, code generation

Memory System Classification

Following Sumers et al.'s taxonomy:

  1. Semantic Memory: Environmental facts (e.g., Reflexion's reflections)
  2. Episodic Memory: Past actions (e.g., AWM's workflows)

ECHO primarily improves how episodic memory is constructed and updated.

Experience Replay Techniques

  • Traditional HER: Relabels trajectory goals without modifying trajectory structure
  • Sparse Reward Advantages: Extracts maximum learning signal from few positive examples
  • ECHO Extension: Not only relabels goals but can edit arbitrary aspects of trajectories
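
For contrast, classic HER goal relabeling can be sketched in a few lines. This is a simplified illustration of the "final" relabeling strategy on a toy grid, not code from either paper: the `Transition` fields, the binary 0/1 reward, and the function name are assumptions. The point to notice is that states and actions are left untouched and only the goal and reward change, whereas ECHO lets the LM rewrite the steps themselves.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple

State = Tuple[int, int]


@dataclass(frozen=True)
class Transition:
    state: State
    action: str
    next_state: State
    goal: State
    reward: float


def her_relabel_final(episode: List[Transition]) -> List[Transition]:
    """Relabel every transition with the state actually reached at the end of the episode."""
    achieved = episode[-1].next_state
    return [
        replace(t, goal=achieved, reward=1.0 if t.next_state == achieved else 0.0)
        for t in episode
    ]


# A failed episode: the goal was (5, 5) but the agent stopped at (1, 1).
episode = [
    Transition((0, 0), "up", (0, 1), goal=(5, 5), reward=0.0),
    Transition((0, 1), "right", (1, 1), goal=(5, 5), reward=0.0),
]
relabeled = her_relabel_final(episode)
```

After relabeling, the same two steps count as a successful episode for the goal (1, 1), which is how HER extracts a positive example from a failure without editing the trajectory.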

Conclusions and Discussion

Main Conclusions

  1. Effectiveness Validation: ECHO significantly improves sample efficiency in two exploration-requiring environments
  2. Mechanism Advantages: Better utilizes past experience by converting failures into synthetic successes
  3. Applicable Scenarios: Particularly effective in sparse reward environments where baselines perform poorly

Limitations

  1. Representation Constraints: Primarily uses natural language representation; code-based representations may be more effective
  2. Simplified Update Rules: Length-based heuristic update rules may be overly simplistic
  3. Environment Dependency: Performance varies across different organizations/environments
  4. Incomplete World Models: LM may lack complete environmental models after single trajectories

Future Directions

  1. Programmatic Representations: Explore effectiveness of code-based trajectory representations
  2. Complex Update Rules: Design more precise information fusion mechanisms
  3. Retrieval Augmentation: Incorporate retrieval-based memory mechanisms
  4. Robustness Enhancement: Improve consistency across environments

In-Depth Evaluation

Strengths

  1. Strong Novelty: First adaptation of HER to LM agents with significant theoretical and practical value
  2. Comprehensive Experiments: Validation across two different environment types with detailed ablation analysis
  3. High Practical Value: Addresses critical challenges of LM agents in high-cost interaction scenarios
  4. General Framework: Well-designed architecture with good extensibility and adaptability

Weaknesses

  1. Benchmark Limitations: Testing only on two relatively simple environments; lacks validation on more complex realistic scenarios
  2. Insufficient Theoretical Analysis: Lacks in-depth analysis of convergence properties and theoretical guarantees
  3. Computational Overhead: Multiple LM calls may introduce additional computational costs
  4. Model Capability Dependency: Method effectiveness highly depends on underlying LM's reasoning and generation abilities

Impact

  1. Academic Contribution: Provides new research direction for experience learning in LM agents
  2. Practical Applications: Potential applications in human-agent interaction, robot control, and other high-cost scenarios
  3. Methodological Inspiration: Provides design insights for other LM-based learning algorithms

Applicable Scenarios

  1. High-Cost Interaction Environments: Human-agent dialogue, physical system control
  2. Sparse Reward Tasks: Exploration-oriented navigation and planning problems
  3. Partially Observable Environments: Scenarios requiring learning environmental structure through interaction
  4. Multi-Goal Tasks: Environments where multiple sub-skills can be learned from single experiences

References

  • Andrychowicz, M., et al. (2017). Hindsight experience replay. NeurIPS.
  • Shinn, N., et al. (2023). Reflexion: Language agents with verbal reinforcement learning. NeurIPS.
  • Wang, Z. Z., et al. (2025). Agent workflow memory. ICML.
  • Yao, S., et al. (2023). ReAct: Synergizing reasoning and acting in language models. ICLR.

Overall Assessment: The ECHO framework proposed in this paper achieves important progress in sample-efficient learning for LM agents. The method is novel and experimental results are convincing. While certain limitations exist, it establishes a solid foundation for future development in this field and demonstrates high academic value and practical application potential.