Sample-Efficient Online Learning in LM Agents via Hindsight Trajectory Rewriting

Hu, Van Durme, Andreas et al.

Language model (LM) agents deployed in novel environments often exhibit poor sample efficiency when learning from sequential interactions. This significantly hinders the usefulness of such agents in environments where interaction is costly (for example, when they interact with humans or reset physical systems). While a number of existing LM agent architectures incorporate various mechanisms for experience storage and reflection, they make limited use of LMs' abilities to directly generate or reason about full counterfactual trajectories. We introduce ECHO (Experience Consolidation via Hindsight Optimization), a prompting framework that adapts hindsight experience replay from reinforcement learning for language model agents. ECHO generates optimized trajectories for alternative goals that could have been achieved during failed attempts, effectively creating synthetic positive examples from unsuccessful interactions. Our approach consists of two components: a hindsight rule that uses the language model itself to identify relevant subgoals and generate optimized trajectories, and an update rule that maintains compressed trajectory representations in memory. We evaluate ECHO on stateful versions of XMiniGrid, a text-based navigation and planning benchmark, and PeopleJoinQA, a collaborative information-gathering enterprise simulation. Across both domains, ECHO outperforms vanilla language agent baselines by up to 80%; in XMiniGrid, it also outperforms a number of sophisticated agent architectures including Reflexion and AWM, demonstrating faster adaptation to novel environments through more effective utilization of past experiences.

Basic Information

Paper ID: 2510.10304
Title: Sample-Efficient Online Learning in LM Agents via Hindsight Trajectory Rewriting
Authors: Michael Y. Hu (NYU), Benjamin Van Durme (Microsoft), Jacob Andreas (Microsoft), Harsh Jhamtani (Microsoft)
Classification: cs.LG cs.AI cs.CL
Publication Date: October 11, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.10304
Code Link: https://github.com/michahu/echo

Abstract

Language model (LM) agents often exhibit poor sample efficiency when learning through sequential interactions in new environments. This severely limits the practical deployment of such agents in high-cost interaction scenarios (e.g., human-agent interaction or physical system resets). While existing LM agent architectures incorporate various experience storage and reflection mechanisms, they underutilize the LM's capacity to directly generate or reason about complete counterfactual trajectories. This paper introduces ECHO (Experience Consolidation via Hindsight Optimization), a prompting framework that adapts hindsight experience replay from reinforcement learning to language model agents. ECHO generates optimized trajectories for alternative goals that could have been achieved in failed attempts, effectively creating synthetic positive examples from unsuccessful interactions. The method comprises two components: hindsight rules that use the language model itself to identify relevant subgoals and generate optimized trajectories, and update rules that maintain compressed trajectory representations in memory.

Research Background and Motivation

Core Problems

Low Sample Efficiency: LM agents exhibit poor sample efficiency when learning in new environments, particularly in high-cost interaction scenarios
Limited Counterfactual Reasoning: Existing methods primarily focus on storing or synthesizing experiences without fully leveraging the LM's ability to reason about counterfactual trajectories
Sparse Reward Environments: Agents struggle to learn from failed experiences in reward-sparse settings

Problem Significance

Practical Application Needs: Improving sample efficiency is critical in high-cost scenarios such as human-agent interaction or physical system resets
Adaptation Requirements: Agents need to rapidly adapt to new environments, such as conversational assistants learning information retrieval and communication strategies in new organizations

Limitations of Existing Methods

Reflexion: Provides high-level reflections, but feedback is often too generic to change model performance
AWM (Agent Workflow Memory): Only stores workflows of successful trajectories, underutilizing failed experiences
Traditional Experience Replay: Primarily focuses on numerical rewards and states, unable to perform flexible trajectory editing

Core Contributions

Proposes ECHO Framework: The first prompting framework to adapt hindsight experience replay (HER) to language model agents
Innovative Trajectory Rewriting Mechanism: Enables arbitrary rewriting of failed trajectories, including changing goals and intermediate steps
Constructs Stateful Benchmarks: Creates XMiniGrid-Stateful and PeopleJoinQA-Stateful environments requiring exploration
Significant Performance Gains: 80% improvement over ReAct baseline on XMiniGrid, 42% improvement over the second-best baseline

Methodology Details

Task Definition

Consider an online setting where an LM agent sequentially processes a query sequence from time t=0 to T without access to true reward functions or demonstration data. The agent must learn through environment interaction and improve the efficiency of future decisions.
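
The online setting described above can be sketched as a simple loop. This is only an illustration under stated assumptions: `run_online`, `toy_act`, and `toy_consolidate` are hypothetical names, the toy agent stands in for an LM agent, and the reward is recorded for evaluation only and never passed to the memory update, mirroring the lack of access to a true reward function.

```python
from typing import Callable, List, Tuple

# Hypothetical type aliases for this sketch.
Memory = List[str]
ActFn = Callable[[str, Memory], Tuple[str, float]]
ConsolidateFn = Callable[[str, Memory], Memory]


def run_online(queries: List[str], act: ActFn, consolidate: ConsolidateFn) -> List[float]:
    """Process queries t = 0..T in order, updating memory after each one."""
    memory: Memory = []
    rewards: List[float] = []
    for query in queries:
        trajectory, reward = act(query, memory)
        rewards.append(reward)  # logged for evaluation; the agent never sees it
        memory = consolidate(trajectory, memory)  # learning uses the trajectory only
    return rewards


def toy_act(query: str, memory: Memory) -> Tuple[str, float]:
    # Toy agent: succeeds only if an earlier trajectory mentioned the queried object.
    success = any(query in past for past in memory)
    return f"looked for {query}; saw {query}", 1.0 if success else 0.0


def toy_consolidate(trajectory: str, memory: Memory) -> Memory:
    # Simplest possible update: append the raw trajectory.
    return memory + [trajectory]


rewards = run_online(["key", "star", "key"], toy_act, toy_consolidate)
print(rewards)
```

In this toy run the third query succeeds only because the first trajectory left something useful in memory, which is the kind of cross-query improvement the setting is meant to measure.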

ECHO Architecture

Core Components

ECHO comprises two main components:

Hindsight Rule:
- Proposes achievable goals from a given trajectory
- Generates optimized trajectories or descriptions for these goals
- Takes no action if no goals can be proposed
Update Rule:
- Compares newly generated descriptions with previous descriptions
- Saves the shorter workflow (based on minimum description length principle)
- Maintains compressed trajectory representations

Algorithm Flow

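def ECHO(LM, trajectory, replay_buf=None):
    # Start a fresh buffer if none is passed (avoids a shared mutable default)
    replay_buf = {} if replay_buf is None else replay_buf
    # Hindsight rule: summarize the trajectory and propose goals it could have achieved
    summary = LM.summarize(trajectory)  # not used further in this pseudocode
    goals = LM.identify_goals(trajectory)
    for goal in goals:
        new_traj = LM.infer_traj(goal, trajectory)

        # Update rule: store a new goal, or keep the shorter description of a known one
        old_traj = replay_buf.get(goal)
        if old_traj is None or len(new_traj) < len(old_traj):
            replay_buf[goal] = new_traj
    return replay_buf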
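
To make the algorithm flow above concrete, here is a toy run with a stub in place of the language model. Everything in it is an assumption for illustration: `StubLM`, `echo_step`, and the string-based trajectory format are hypothetical, and the real method prompts an LM rather than using string rules. The scenario loosely follows the gray star case discussed later in this summary.

```python
from typing import Dict, List


class StubLM:
    """Hypothetical stand-in for the LM; ECHO itself would prompt a real model."""

    def identify_goals(self, trajectory: List[str]) -> List[str]:
        # Treat every object the agent observed as a goal it could have achieved.
        return [s.split("saw ", 1)[1] for s in trajectory if s.startswith("saw ")]

    def infer_traj(self, goal: str, trajectory: List[str]) -> str:
        # Keep the moves made before the first sighting, then pick the object up.
        cutoff = trajectory.index(f"saw {goal}")
        moves = [s for s in trajectory[:cutoff] if s.startswith("move ")]
        return " -> ".join(moves + [f"pick up {goal}"])


def echo_step(lm: StubLM, trajectory: List[str], replay_buf: Dict[str, str]) -> Dict[str, str]:
    for goal in lm.identify_goals(trajectory):
        new_traj = lm.infer_traj(goal, trajectory)
        old_traj = replay_buf.get(goal)
        # Update rule: store new goals, otherwise keep the shorter description.
        if old_traj is None or len(new_traj) < len(old_traj):
            replay_buf[goal] = new_traj
    return replay_buf


buf: Dict[str, str] = {}
# Episode 1: the agent fails its real task but passes a gray star on the way.
echo_step(StubLM(), ["move north", "move north", "move east", "saw gray star", "move south"], buf)
# Episode 2: a shorter route to the same star is observed.
echo_step(StubLM(), ["move east", "saw gray star", "move west"], buf)
print(buf)
```

After the second episode the buffer holds only the shorter route to the gray star, which is the effect the length-based update rule is meant to have.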

Technical Innovations

Enhanced Expressiveness: Unlike traditional HER which only relabels goals, ECHO can arbitrarily rewrite trajectory structure
Leveraging Pretrained Knowledge: Uses the LM's world knowledge to fill information gaps and propose reasonable counterfactual information
Compressed Representations: Based on Kolmogorov complexity, maintains the shortest possible description of goal achievement
Adaptive Mechanisms: The LM can choose abstraction levels to avoid adding invalid trajectories

Experimental Setup

Datasets

XMiniGrid-Stateful

Base Environment: Procedurally generated 2D GridWorld navigation and planning tasks
Stateful Modification: Agents execute randomly sampled goals in the same environment, learning locations of unseen objects
Scale: 10 unique environments, each with 4 rooms and 4 objects, 16 queries per environment
Task: Pick up a randomly sampled object within 64 steps; partial observability makes this harder

PeopleJoinQA-Stateful

Base Environment: Multi-agent collaborative information gathering question-answering task
Stateful Modification: Fixed organizational structure; agent answers all questions about that organization
Scale: 5 organizations, 248 total queries, averaging 7.98 messages per query
Task: Contact simulated personnel via tool calls, synthesize information to answer questions

Evaluation Metrics

Final Average Reward (Accuracy): Measures final performance

Cumulative Average Reward: Measures sample efficiency

Cumulative Average Reward at $\tau$ = $\frac{1}{\tau + 1} \sum_{t=0}^{\tau} R_t$

Improvement Relative to ReAct Baseline: Normalizes problem difficulty
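
The metrics above can be computed in a few lines. This is a minimal sketch: the function names are hypothetical, the reward values are made up for illustration and are not results from the paper, and the relative-improvement formula assumes the improvement is reported as a fractional gain over the ReAct score.

```python
from typing import List


def cumulative_average_reward(rewards: List[float]) -> List[float]:
    """Return (1 / (tau + 1)) * sum of R_t for t = 0..tau, for every tau."""
    averages: List[float] = []
    running_total = 0.0
    for tau, reward in enumerate(rewards):
        running_total += reward
        averages.append(running_total / (tau + 1))
    return averages


def relative_improvement(method_score: float, baseline_score: float) -> float:
    """Fractional gain over a baseline (assumed definition, not taken from the paper)."""
    return (method_score - baseline_score) / baseline_score


# Made-up per-query rewards: one failure followed by three successes.
curve = cumulative_average_reward([0.0, 1.0, 1.0, 1.0])
print(curve)
print(relative_improvement(0.9, 0.5))
```

The final entry of the curve equals the plain average reward over all queries, while the earlier entries show how quickly the agent started succeeding.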

Comparison Methods

ReAct: Reasoning-acting baseline agent
Reflexion: Language-based reinforcement learning for language agents
AWM: Agent Workflow Memory
AWM++: AWM + ECHO's update rule

Implementation Details

Model: GPT-4o
Temperature Settings: 0 for ReAct, 0.7 for offline reasoning in PeopleJoin
Max Tokens: 3800-4000
Trajectory Validity: 85% of synthetic trajectories are executable in XMiniGrid

Experimental Results

Main Results

XMiniGrid-Stateful

vs. ReAct: 80% improvement in average reward
vs. Second-Best Baseline: 42% improvement
Sample Efficiency: Cumulative reward exceeds ReAct baseline after 3 interactions
Strictly Superior: Outperforms all comparison methods including Reflexion and AWM

PeopleJoinQA-Stateful

Accuracy: 4.6% below Reflexion, but still above ReAct
Efficiency: Average reduction of 1.6 messages, on par with AWM
Sample Efficiency: Exceeds ReAct baseline after first query

Trajectory Validity Analysis

Among 40 sampled examples in XMiniGrid:

85% Success Rate: Agents successfully achieve synthesized goals
Failure Causes: 4 cases due to execution deviation, 2 cases due to infeasible steps
Conclusion: Counterfactual workflows generated by ECHO are mostly correct and effective
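
As a quick consistency check, the failure counts above do reproduce the reported success rate. The variable names are illustrative only; the numbers are the ones stated in this section.

```python
sampled = 40
failures = {"execution deviation": 4, "infeasible steps": 2}

# Successes are the sampled trajectories that did not fail for either reason.
successes = sampled - sum(failures.values())
success_rate = successes / sampled
print(f"{successes}/{sampled} = {success_rate:.0%}")
```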

Case Analysis

Failed Trajectory Example: Agent fails to pick up gray key

Reflexion Output: Generic feedback lacking specific improvement suggestions
AWM Output: Correctly generates no workflow due to failure
ECHO Output: Identifies that agent observed gray star, generates optimized trajectory for picking up gray star

Cross-Organization Variability

In PeopleJoinQA, optimal methods vary across organizations:

No method strictly dominates across all organizations
ECHO becomes the most efficient method in certain organizations (e.g., department stores)
Indicates need for improved robustness of offline methods

Related Work

Language Model Agents

Current Status: Shifting from reliance on static knowledge to dynamic adaptation to the environment
Main Challenges: Insufficient exploration and adaptation capabilities in new environments
Application Domains: Web navigation, tool use, multi-agent collaboration, code generation

Memory System Classification

Following Sumers et al.'s taxonomy:

Semantic Memory: Environmental facts (e.g., Reflexion's reflections)
Episodic Memory: Past actions (e.g., AWM's workflows)

ECHO primarily improves the construction and update mechanisms of episodic memory

Experience Replay Techniques

Traditional HER: Relabels trajectory goals without modifying trajectory structure
Sparse Reward Advantages: Extracts maximum learning signal from few positive examples
ECHO Extension: Not only relabels goals but can edit arbitrary aspects of trajectories
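
For contrast with ECHO's free-form rewriting, goal relabeling in traditional HER can be sketched as follows. This assumes a toy grid world with a sparse reward and uses the "final" relabeling strategy, where the goal is replaced by the state the episode actually ended in; the `Transition` type and function names are hypothetical.

```python
from typing import List, NamedTuple, Tuple

State = Tuple[int, int]  # hypothetical grid position (row, col)


class Transition(NamedTuple):
    state: State
    action: str
    next_state: State
    goal: State
    reward: float


def sparse_reward(next_state: State, goal: State) -> float:
    # Reward only when the goal position is reached.
    return 1.0 if next_state == goal else 0.0


def her_relabel_final(episode: List[Transition]) -> List[Transition]:
    """Relabel every transition with the state the episode actually ended in."""
    achieved = episode[-1].next_state
    return [
        t._replace(goal=achieved, reward=sparse_reward(t.next_state, achieved))
        for t in episode
    ]


# A failed episode: the agent wanted (5, 5) but ended at (1, 1).
failed = [
    Transition((0, 0), "right", (0, 1), (5, 5), 0.0),
    Transition((0, 1), "down", (1, 1), (5, 5), 0.0),
]
relabeled = her_relabel_final(failed)
print(relabeled[-1])
```

Only the goal and reward fields change; the states and actions are copied as they were. ECHO, by contrast, lets the LM rewrite the steps themselves.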

Conclusions and Discussion

Main Conclusions

Effectiveness Validation: ECHO significantly improves sample efficiency in two exploration-requiring environments
Mechanism Advantages: Better utilizes past experience by converting failures into synthetic successes
Applicable Scenarios: Particularly effective in sparse reward environments where baselines perform poorly

Limitations

Representation Constraints: Primarily uses natural language representation; code-based representations may be more effective
Simplified Update Rules: Length-based heuristic update rules may be overly simplistic
Environment Dependency: Performance varies across different organizations/environments
Incomplete World Models: After a single trajectory, the LM may lack a complete model of the environment

Future Directions

Programmatic Representations: Explore effectiveness of code-based trajectory representations
Complex Update Rules: Design more precise information fusion mechanisms
Retrieval Augmentation: Incorporate retrieval-based memory mechanisms
Robustness Enhancement: Improve consistency across environments

In-Depth Evaluation

Strengths

Strong Novelty: First adaptation of HER to LM agents with significant theoretical and practical value
Comprehensive Experiments: Validation across two different environment types with detailed ablation analysis
High Practical Value: Addresses critical challenges of LM agents in high-cost interaction scenarios
General Framework: Well-designed architecture with good extensibility and adaptability

Weaknesses

Benchmark Limitations: Tested on only two relatively simple environments; lacks validation on more complex, realistic scenarios
Insufficient Theoretical Analysis: Lacks in-depth analysis of convergence properties and theoretical guarantees
Computational Overhead: Multiple LM calls may introduce additional computational costs
Model Capability Dependency: Method effectiveness highly depends on underlying LM's reasoning and generation abilities

Impact

Academic Contribution: Provides new research direction for experience learning in LM agents
Practical Applications: Potential applications in human-agent interaction, robot control, and other high-cost scenarios
Methodological Inspiration: Provides design insights for other LM-based learning algorithms

Applicable Scenarios

High-Cost Interaction Environments: Human-agent dialogue, physical system control
Sparse Reward Tasks: Exploration-oriented navigation and planning problems
Partially Observable Environments: Scenarios requiring learning environmental structure through interaction
Multi-Goal Tasks: Environments where multiple sub-skills can be learned from single experiences

References

Andrychowicz, M., et al. (2017). Hindsight experience replay. NeurIPS.
Shinn, N., et al. (2023). Reflexion: Language agents with verbal reinforcement learning. NeurIPS.
Wang, Z. Z., et al. (2025). Agent workflow memory. ICML.
Yao, S., et al. (2023). ReAct: Synergizing reasoning and acting in language models. ICLR.

Overall Assessment: The ECHO framework proposed in this paper achieves important progress in sample-efficient learning for LM agents. The method is novel and experimental results are convincing. While certain limitations exist, it establishes a solid foundation for future development in this field and demonstrates high academic value and practical application potential.