One Life to Learn: Inferring Symbolic World Models for Stochastic Environments from Unguided Exploration

Khan, Prasad, Stengel-Eskin et al.
Symbolic world modeling requires inferring and representing an environment's transitional dynamics as an executable program. Prior work has focused on largely deterministic environments with abundant interaction data, simple mechanics, and human guidance. We address a more realistic and challenging setting, learning in a complex, stochastic environment where the agent has only "one life" to explore a hostile environment without human guidance. We introduce OneLife, a framework that models world dynamics through conditionally-activated programmatic laws within a probabilistic programming framework. Each law operates through a precondition-effect structure, activating in relevant world states. This creates a dynamic computation graph that routes inference and optimization only through relevant laws, avoiding scaling challenges when all laws contribute to predictions about a complex, hierarchical state, and enabling the learning of stochastic dynamics even with sparse rule activation. To evaluate our approach under these demanding constraints, we introduce a new evaluation protocol that measures (a) state ranking, the ability to distinguish plausible future states from implausible ones, and (b) state fidelity, the ability to generate future states that closely resemble reality. We develop and evaluate our framework on Crafter-OO, our reimplementation of the Crafter environment that exposes a structured, object-oriented symbolic state and a pure transition function that operates on that state alone. OneLife can successfully learn key environment dynamics from minimal, unguided interaction, outperforming a strong baseline on 16 out of 23 scenarios tested. We also test OneLife's planning ability, with simulated rollouts successfully identifying superior strategies. Our work establishes a foundation for autonomously constructing programmatic world models of unknown, complex environments.
Basic Information

  • Paper ID: 2510.12088
  • Title: One Life to Learn: Inferring Symbolic World Models for Stochastic Environments from Unguided Exploration
  • Authors: Zaid Khan, Archiki Prasad, Elias Stengel-Eskin, Jaemin Cho, Mohit Bansal (UNC Chapel Hill)
  • Classification: cs.AI, cs.CL, cs.LG
  • Publication Date: October 14, 2025
  • Paper Link: https://arxiv.org/abs/2510.12088

Abstract

Symbolic world modeling requires inferring and representing environmental transition dynamics as executable programs. Prior work has primarily focused on deterministic environments with abundant interaction data, simple mechanisms, and human guidance. This paper addresses a more realistic and challenging setting: learning in complex stochastic environments where an agent has only "one life" to explore an adversarial environment without human guidance. We propose the OneLife framework, which models world dynamics through conditionally activated programmatic rules within a probabilistic programming framework. Each rule operates through a precondition-effect structure, activating in relevant world states. This creates a dynamic computational graph that routes inference and optimization only through relevant rules, avoiding scaling challenges when all rules must predict over complex hierarchical states, and enabling learning of stochastic dynamics even with sparse rule activations.

Research Background and Motivation

Problem Definition

Traditional symbolic world modeling approaches face the following key challenges:

  1. Data Limitations: Agents can often perform only a limited number of interactions, particularly in dangerous environments
  2. Stochasticity Handling: Real environments exhibit irreducible stochasticity, such as unpredictable NPC behavior
  3. Lack of External Guidance: Absence of environment-specific rewards or human-provided objectives
  4. Scaling Complexity: Existing methods struggle to scale when environments contain numerous interaction mechanisms

Research Significance

Symbolic world modeling is crucial for artificial intelligence because it enables:

  • Functional understanding of underlying environmental dynamics
  • Prediction of action outcomes without actual interaction
  • Construction of interpretable, editable, and verifiable representations

Limitations of Existing Approaches

Prior research primarily assumes:

  • Limited discoverable mechanisms with low stochasticity
  • Access to abundant interaction data
  • Human-provided environment-specific guidance (goals/rewards)

These assumptions often fail in complex open-world environments (e.g., Minecraft, RuneScape).

Research Motivation

The core research question is: How can an agent reverse-engineer the rules of complex, dangerous stochastic worlds with limited interaction budgets and without environment-specific human guidance?

Core Contributions

  1. OneLife Framework: Proposes a probabilistic symbolic world model capable of learning from stochastic adversarial environments with minimal interaction, without requiring access to human-defined rewards
  2. Crafter-OO Environment: Re-implements the Crafter environment, exposing structured object-oriented symbolic states and pure transition functions
  3. Evaluation Protocol: Introduces a new world modeling evaluation suite containing 30+ executable scenarios and state fidelity/state ranking metrics
  4. Performance Improvements: Outperforms strong baselines on 16/23 test scenarios and demonstrates planning capabilities

Methodology Details

Task Definition

Given a pure transition function T: S × A → Δ(S), where:

  • S: state space
  • A: action space
  • Δ(S): probability distribution over state space

The objective is to learn a symbolic world model from a single unguided exploration trajectory that can predict the probability distribution of state transitions.
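
As a rough illustration of the signature T: S × A → Δ(S), the sketch below writes a toy pure transition function that returns a finite distribution over next states. The state encoding, the `collect_sapling` action, and the 10% success chance are all invented for the example and are not taken from Crafter-OO.

```python
import random
from typing import Dict, Tuple

# Hypothetical immutable symbolic state: sorted (attribute, value) pairs.
State = Tuple[Tuple[str, int], ...]
Dist = Dict[State, float]  # finite stand-in for a distribution in Δ(S)


def freeze(attrs: Dict[str, int]) -> State:
    """Turn an attribute dict into a hashable, order-independent state."""
    return tuple(sorted(attrs.items()))


def toy_transition(state: State, action: str) -> Dist:
    """T(s, a): depends only on its arguments and never mutates them."""
    attrs = dict(state)
    if action == "collect_sapling":
        # Invented 10% success chance, purely for illustration.
        success = dict(attrs, sapling=attrs.get("sapling", 0) + 1)
        return {freeze(success): 0.1, freeze(attrs): 0.9}
    return {freeze(attrs): 1.0}  # every other action is a no-op in this toy


def sample_next(dist: Dist, rng: random.Random) -> State:
    """Draw one next state from the returned distribution."""
    states = list(dist)
    return rng.choices(states, weights=[dist[s] for s in states], k=1)[0]


s0 = freeze({"sapling": 0, "health": 9})
dist = toy_transition(s0, "collect_sapling")
s1 = sample_next(dist, random.Random(0))
```

Because the function is pure, calling it twice on the same (s, a) returns the same distribution, which is what lets a learned model be compared against it transition by transition.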

Model Architecture

1. World Model Representation

OneLife models the environment as a mixture of programmatic rules:

p(s'|s,a;θ) = ∏_{o∈O} p(o|s,a;θ)

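where the probability for each observable o is a weighted product over I_o(s,a), the active rules that make a prediction about o, with φ_i the predictive distribution of rule i and θ_i its learned weight: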

p(o=v|s,a;θ) ∝ ∏_{i∈I_o(s,a)} φ_i(o=v|s,a)^{θ_i}
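
The two formulas above can be sketched numerically as follows. This is a minimal illustration, assuming each rule's prediction φ_i is a small table over the values of one observable; the smoothing constant `EPS` and the example rules are assumptions added for the sketch.

```python
import math
from typing import Dict, Hashable, List

Dist = Dict[Hashable, float]
EPS = 1e-9  # assumed smoothing so that log(0) never occurs


def combine(experts: List[Dist], weights: List[float]) -> Dist:
    """p(o=v) ∝ ∏_i φ_i(o=v)^{θ_i}, evaluated in log space for stability."""
    support = set().union(*(d.keys() for d in experts))
    logits = {
        v: sum(w * math.log(d.get(v, 0.0) + EPS) for d, w in zip(experts, weights))
        for v in support
    }
    top = max(logits.values())
    z = sum(math.exp(x - top) for x in logits.values())
    return {v: math.exp(x - top) / z for v, x in logits.items()}


def joint_log_prob(per_observable: Dict[str, Dist],
                   next_state: Dict[str, Hashable]) -> float:
    """log p(s'|s,a) = Σ_o log p(o = s'[o] | s,a), the factorization over observables."""
    return sum(
        math.log(per_observable[o].get(next_state[o], 0.0) + EPS)
        for o in per_observable
    )


# Two hypothetical active rules voting on the player's next health value.
phi_damage = {8: 0.7, 9: 0.3}
phi_no_change = {9: 1.0}
health = combine([phi_damage, phi_no_change], weights=[2.0, 0.5])
```

A rule with weight 0 drops out of the product entirely, so `combine([phi_damage, phi_no_change], [1.0, 0.0])` recovers `phi_damage` up to the smoothing term.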

2. Rule Structure

Each rule L_i is defined by a precondition-effect pair (c_i, e_i):

  • Precondition c_i(s,a) → {true, false}: determines whether the rule applies
  • Effect e_i(s,a) → s': makes predictions through state copy modification

3. Dynamic Computational Graph

For a given transition, only the rules whose preconditions hold are activated, I(s,a) = {i | c_i(s,a) is true}, so each observed transition updates only the parameters of those rules.
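
A minimal sketch of the precondition-effect structure and the active set I(s,a) might look like the following. The `Rule` class, the two example rules, and the state fields are hypothetical, and the effects here are deterministic for brevity, whereas the rules described above sit inside a probabilistic program.

```python
from copy import deepcopy
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

State = Dict[str, Any]


@dataclass(frozen=True)
class Rule:
    name: str
    precondition: Callable[[State, str], bool]  # c_i(s, a)
    effect: Callable[[State, str], State]       # e_i(s, a) -> s'


def collect_wood(s: State, a: str) -> State:
    s2 = deepcopy(s)  # edit a copy; the input state is left untouched
    s2["inventory"]["wood"] += 1
    return s2


def zombie_hits(s: State, a: str) -> State:
    s2 = deepcopy(s)
    s2["health"] -= 2  # invented damage value
    return s2


RULES = [
    Rule("collect_wood",
         lambda s, a: a == "collect" and s["facing"] == "tree",
         collect_wood),
    Rule("zombie_hits", lambda s, a: s["zombie_adjacent"], zombie_hits),
]


def active_rules(rules: List[Rule], s: State, a: str) -> List[int]:
    """I(s, a) = {i | c_i(s, a) is true}."""
    return [i for i, r in enumerate(rules) if r.precondition(s, a)]


state = {"facing": "tree", "zombie_adjacent": False,
         "health": 9, "inventory": {"wood": 0}}
active = active_rules(RULES, state, "collect")
predictions = [RULES[i].effect(state, "collect") for i in active]
```

Here only `collect_wood` fires, so the zombie rule contributes nothing to the prediction and would receive no update from this transition.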

Core Components

1. Exploration Strategy

Uses a large language model-driven exploration strategy:

  • Objective: discover as many underlying mechanisms as possible
  • Strategy: treat exploration as a reverse-engineering task
  • Advantage: average survival time rises from roughly 100 steps under a random policy to roughly 400 steps

2. Rule Synthesizer

Employs a general approach rather than hand-crafted synthesizers:

  • Proposes numerous simple atomic rules to explain each observed transition
  • Atomic rules: describe minimal state attribute changes
  • Benefit: supports fine-grained credit assignment

3. Parameter Inference

Gradient-based optimization algorithm:

  • Maximizes log-likelihood of observed transitions
  • Updates only rule weights affecting observed variables
  • Uses L-BFGS for optimization
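
The sparse weight update can be sketched as below. For the weighted-product model, the gradient of the log-likelihood with respect to θ_i is log φ_i(observed) minus the model's expected log φ_i, and it is computed only for rules that were active. Note the swap: to stay dependency-free this sketch uses plain batch gradient ascent rather than L-BFGS, and the experts, data, learning rate, and `EPS` smoothing are invented for the example.

```python
import math
from typing import Dict, List, Tuple

EPS = 1e-9  # assumed smoothing so that log(0) never occurs
Expert = Dict[int, float]  # φ_i tabulated over the values of one observable


def predict(experts: List[Expert], theta: List[float],
            active: List[int]) -> Dict[int, float]:
    """Weighted product of the active experts, normalized over their support."""
    support = sorted(set().union(*(experts[i].keys() for i in active)))
    logits = [
        sum(theta[i] * math.log(experts[i].get(v, 0.0) + EPS) for i in active)
        for v in support
    ]
    top = max(logits)
    z = sum(math.exp(x - top) for x in logits)
    return {v: math.exp(x - top) / z for v, x in zip(support, logits)}


def grad_log_lik(experts: List[Expert], theta: List[float],
                 active: List[int], observed: int) -> Dict[int, float]:
    """d log p(observed) / d θ_i for active rules only (observed must be in support)."""
    p = predict(experts, theta, active)
    grad = {}
    for i in active:
        log_phi = {v: math.log(experts[i].get(v, 0.0) + EPS) for v in p}
        grad[i] = log_phi[observed] - sum(p[v] * log_phi[v] for v in p)
    return grad


def fit(experts: List[Expert], data: List[Tuple[List[int], int]],
        steps: int = 200, lr: float = 0.5) -> List[float]:
    theta = [1.0] * len(experts)
    for _ in range(steps):
        total: Dict[int, float] = {}
        for active, observed in data:
            for i, g in grad_log_lik(experts, theta, active, observed).items():
                total[i] = total.get(i, 0.0) + g
        for i, g in total.items():  # rules that never fired are never touched
            theta[i] += lr * g / len(data)
    return theta


# Rule 0 matches the data, rule 1 contradicts it, rule 2 never activates.
experts = [{1: 0.9, 0: 0.1}, {1: 0.1, 0: 0.9}, {5: 1.0}]
data = [([0, 1], 1)] * 9 + [([0, 1], 0)]
theta = fit(experts, data)
```

In this toy run the weight of the rule that agrees with the data grows relative to the contradicting rule, while the weight of the rule that never activated stays at its initial value.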

Technical Innovations

  1. Conditional Activation Mechanism: Implements selective rule activation through precondition structures, avoiding interference from irrelevant rules
  2. Sparse Parameter Updates: Performs gradient updates only on activated rules predicting observed changes, providing precise credit assignment
  3. Atomic Rule Decomposition: Decomposes complex events into multiple simple rules, improving learning accuracy
  4. Probabilistic Programming Framework: Supports modeling and inference of stochastic dynamics

Experimental Setup

Dataset

Crafter-OO Environment:

  • Re-implementation based on the Crafter environment
  • Exposes structured object-oriented state representation
  • Contains significant stochasticity and diverse mechanisms
  • Supports programmatic state modification

Evaluation Metrics

State Ranking Metrics

  • Rank@1: Whether the model assigns the true next state the highest probability among all candidate states
  • Mean Reciprocal Rank (MRR): Average reciprocal rank of true state
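
Both ranking metrics follow directly from the rank of the true state among the candidates. The sketch below assumes the model produces one score per candidate and resolves ties pessimistically; the tie rule is an assumption of this example, not something stated above.

```python
from typing import List


def rank_of_true(scores: List[float], true_index: int) -> int:
    """1-based rank of the true state; ties count against the model (assumed)."""
    s = scores[true_index]
    return 1 + sum(1 for j, x in enumerate(scores) if j != true_index and x >= s)


def rank_at_1(ranks: List[int]) -> float:
    """Fraction of transitions where the true state is ranked first."""
    return sum(r == 1 for r in ranks) / len(ranks)


def mean_reciprocal_rank(ranks: List[int]) -> float:
    """Average of 1 / rank over all evaluated transitions."""
    return sum(1.0 / r for r in ranks) / len(ranks)


# Three hypothetical transitions, each with four candidate next states.
ranks = [
    rank_of_true([-1.2, -3.0, -2.5, -4.1], true_index=0),  # true state is best
    rank_of_true([-2.0, -1.0, -3.0, -5.0], true_index=0),  # one distractor wins
    rank_of_true([-9.0, -1.0, -2.0, -3.0], true_index=0),  # true state is last
]
```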

State Fidelity Metrics

  • Raw Edit Distance: Number of JSON patch operations between predicted and true states
  • Normalized Edit Distance: Raw edit distance divided by total elements in state representation
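
A simplified version of the two fidelity metrics can be written by flattening each state to leaf paths and counting the paths that must be added, removed, or replaced. This is only an approximation of a JSON patch count (it ignores move and copy operations), and the example states are invented.

```python
from typing import Any, Dict


def flatten(obj: Any, prefix: str = "") -> Dict[str, Any]:
    """Map every leaf of a nested dict/list to a slash-separated path."""
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)
    else:
        return {prefix: obj}
    out: Dict[str, Any] = {}
    for key, value in items:
        out.update(flatten(value, f"{prefix}/{key}"))
    return out


def raw_edit_distance(predicted: Any, true: Any) -> int:
    """Count leaf paths that are missing on one side or hold different values."""
    p, t = flatten(predicted), flatten(true)
    missing = object()  # sentinel that never equals a real value
    return sum(1 for path in p.keys() | t.keys()
               if p.get(path, missing) != t.get(path, missing))


def normalized_edit_distance(predicted: Any, true: Any) -> float:
    """Raw distance divided by the number of leaves in the true state."""
    total = len(flatten(true))
    return raw_edit_distance(predicted, true) / total if total else 0.0


predicted = {"player": {"health": 9, "inventory": {"wood": 1}},
             "zombies": [{"x": 3}]}
true = {"player": {"health": 8, "inventory": {"wood": 1, "stone": 1}},
        "zombies": [{"x": 3}]}
raw = raw_edit_distance(predicted, true)
norm = normalized_edit_distance(predicted, true)
```

In this example one value is wrong (`health`) and one is missing (`stone`), out of four leaves in the true state.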

Comparison Methods

  • Random World Model: Assigns uniform probability to all candidate states
  • PoE-World: State-of-the-art symbolic world model, using this paper's exploration strategy and rule synthesizer for fair comparison

Implementation Details

  • Evaluation scenarios: 40+ scenarios covering all core game mechanics
  • Corrupted state generation: 8 mutators producing invalid state transitions
  • Optimization algorithm: L-BFGS
  • Exploration budget: single trajectory, averaging 400 steps
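
To make the role of the mutators concrete, the sketch below builds a candidate set from the true next state plus corrupted copies, which a world model would then be asked to rank. The two mutators and the state layout are hypothetical stand-ins and do not reproduce the eight mutators mentioned above.

```python
import random
from copy import deepcopy
from typing import Any, Callable, Dict, List, Tuple

State = Dict[str, Any]
Mutator = Callable[[State, random.Random], State]


def mutate_inventory(state: State, rng: random.Random) -> State:
    """Hypothetical mutator: grant items the transition could not have produced."""
    s = deepcopy(state)
    item = rng.choice(sorted(s["inventory"]))
    s["inventory"][item] += rng.choice([2, 5])
    return s


def mutate_position(state: State, rng: random.Random) -> State:
    """Hypothetical mutator: teleport the player several tiles in one step."""
    s = deepcopy(state)
    s["pos"] = [s["pos"][0] + 7, s["pos"][1]]
    return s


def build_candidates(true_next: State, mutators: List[Mutator],
                     rng: random.Random) -> Tuple[List[State], int]:
    """Return shuffled candidates and the index where the true state landed."""
    candidates = [true_next] + [m(true_next, rng) for m in mutators]
    order = list(range(len(candidates)))
    rng.shuffle(order)
    return [candidates[i] for i in order], order.index(0)


true_next = {"pos": [4, 4], "inventory": {"wood": 1, "stone": 0}}
candidates, true_index = build_candidates(
    true_next, [mutate_inventory, mutate_position], random.Random(0))
```

Shuffling keeps the true state's position from leaking to the model being evaluated.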

Experimental Results

Main Results

Method       Rank@1   MRR     Raw Edit Dist.   Norm. Edit Dist.
Random       8.5%     0.322   121.538          0.809
PoE-World    10.8%    0.351   10.634           0.071
OneLife      18.7%    0.479   8.764            0.058

OneLife outperforms both baselines on state ranking and state fidelity. Relative to PoE-World:

  • Rank@1 improves by 7.9 percentage points (10.8% to 18.7%)
  • MRR improves by 0.128 (0.351 to 0.479)
  • OneLife wins on 16 of the 23 scenarios

Fine-grained Evaluation

Performance analysis by game mechanics shows OneLife excels on most mechanisms:

  • Resource Collection: wood, stone, coal collection tasks
  • Tool Crafting: crafting various pickaxes and swords
  • Combat System: combat with zombies and skeletons
  • World Operations: item placement and environment modification

Planning Capability Verification

Forward simulation testing validates planning capability across 3 scenarios:

Scenario         Plan Description                                Avg Steps   Real Env Preference   OneLife Preference
Zombie Fighter   Craft sword then fight vs. fight immediately    33 vs 17    ✓ Craft sword         ✓ Craft sword
Stone Miner      Craft pickaxe then mine vs. mine directly       31 vs 13    ✓ Craft pickaxe       ✓ Craft pickaxe
Sword Smith      Reuse workbench vs. rebuild each time           5 vs 10     ✓ Reuse               ✓ Reuse

In all three scenarios, rollouts simulated in the world model learned by OneLife prefer the same plan as the real environment.
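
Comparing plans by forward simulation can be sketched as below: each plan is rolled out many times in a stochastic model and the plans are ranked by their average outcome. The toy model, its damage numbers, and the scoring rule (remaining health if the zombie is defeated) are invented for illustration and are not the learned OneLife model or the scoring used in the experiments above.

```python
import random
from typing import Dict, List

State = Dict[str, int]


def toy_model(state: State, action: str, rng: random.Random) -> State:
    """Hypothetical stochastic world model for a single zombie fight."""
    s = dict(state)
    if action == "craft_sword":
        s["sword"] = 1
    elif action == "attack":
        s["zombie_hp"] = max(0, s["zombie_hp"] - (3 if s["sword"] else 1))
        # A surviving zombie strikes back half of the time (invented odds).
        if s["zombie_hp"] > 0 and rng.random() < 0.5:
            s["health"] = max(0, s["health"] - 2)
    return s


def rollout(state: State, plan: List[str], rng: random.Random) -> State:
    """Apply the plan step by step, stopping once either side is defeated."""
    for action in plan:
        if state["health"] == 0 or state["zombie_hp"] == 0:
            break
        state = toy_model(state, action, rng)
    return state


def evaluate(start: State, plan: List[str], n: int = 200, seed: int = 0) -> float:
    """Mean remaining health over n simulated rollouts (0 if the zombie survives)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n):
        end = rollout(dict(start), plan, rng)
        total += end["health"] if end["zombie_hp"] == 0 else 0
    return total / n


start = {"health": 9, "zombie_hp": 5, "sword": 0}
craft_first = evaluate(start, ["craft_sword"] + ["attack"] * 10)
fight_now = evaluate(start, ["attack"] * 10)
preferred = "craft sword" if craft_first > fight_now else "fight immediately"
```

In this toy setup, crafting first costs one extra step but shortens the fight, so the simulated rollouts favor it.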

Ablation Study

Comparison of different inference methods:

  • OneLife (Complete): 18.7% Rank@1, 0.479 MRR
  • Without Parameter Inference: 13.0% Rank@1, 0.429 MRR
  • PoE-World Inference: 10.8% Rank@1, 0.351 MRR

Results demonstrate that OneLife's inference algorithm is crucial for performance improvements.

Related Work

Symbolic World Models

  • Monolithic Program Approaches: Tang et al. (2024), Dainese et al. (2024) use LLMs to synthesize single programs
  • Compositional Approaches: Piriyakulkij et al. (2025) propose product of experts models
  • Formal Planning Representations: Construct symbolic planning representations like PDDL

Programmatic Decision-Making Representations

  • Programmatic Policies: Provide better interpretability and generalization
  • Programmatic Rewards: Generate reward functions from natural language instructions
  • Skill Libraries: Construct composable temporally-extended skills

World Modeling for Open-Ended Exploration

  • Implicit World Models: Drive exploration through intrinsic motivation
  • Automated Scientific Discovery: Autonomously form hypotheses and conduct experiments
  • Fast Inductive Evaluation: Assess agent ability to rapidly induce world models in new environments

Conclusions and Discussion

Main Conclusions

  1. OneLife successfully addresses the challenge of learning symbolic world models from limited unguided interactions in complex stochastic environments
  2. Conditional activation of programmatic rules and sparse parameter updates are key innovations
  3. The learned world model supports effective planning and decision-making

Limitations

  1. Exploration Bottleneck: LLM-driven exploration strategies still struggle to fully discover complex technology trees
  2. Memory Issues: Exploration agents easily forget information they learned earlier in the episode
  3. Environment Specificity: The current implementation primarily targets the Crafter-OO environment
  4. Computational Complexity: Rule synthesis and parameter inference incur significant computational overhead

Future Directions

  1. Improved Exploration Strategies: Develop more effective unguided exploration methods
  2. Extension to Other Environments: Validate framework generalization across diverse complex environments
  3. Online Learning: Support continuous learning and adaptation
  4. Multimodal Integration: Incorporate visual and textual information for world modeling

In-Depth Evaluation

Strengths

  1. Problem Importance: Addresses a core challenge in symbolic world modeling: learning in complex stochastic environments with limited data
  2. Technical Innovation: The conditional activation mechanism and sparse update strategy are novel
  3. Comprehensive Experiments: A thorough evaluation protocol with ranking, fidelity, planning, and ablation results
  4. Practical Value: Shows that the learned model is useful for planning
  5. Environmental Contribution: Crafter-OO provides a valuable testing platform for symbolic world modeling

Weaknesses

  1. Exploration Dependency: Still relies on a relatively powerful LLM for exploration, potentially limiting the generality of the method
  2. Evaluation Scope: Validated primarily on a single environment type; generalization requires further verification
  3. Theoretical Analysis: Lacks theoretical guarantees on convergence and sample complexity
  4. Computational Efficiency: Provides insufficient analysis of the computational overhead of rule synthesis

Impact

  1. Academic Contribution: Provides a new research paradigm for symbolic world modeling
  2. Practical Prospects: Potential applications in game AI, robotics, and other domains
  3. Open-Source Value: The Crafter-OO environment and evaluation framework are available for community use
  4. Methodological Inspiration: The conditional activation and sparse update ideas are applicable to other learning tasks

Applicable Scenarios

  1. Game AI: Rule learning and strategy planning for complex strategy games
  2. Robotics: Dynamics modeling and task planning in unknown environments
  3. Scientific Discovery: Automated scientific hypothesis generation and verification
  4. Educational Applications: Learner modeling in intelligent tutoring systems

References

The paper cites work across symbolic world modeling, program synthesis, and reinforcement learning. Key references include the Crafter environment, the PoE-World method, and various works on programmatic representation learning.


Overall Assessment: This is a high-quality research paper making significant contributions to the important and challenging field of symbolic world modeling. The OneLife framework cleverly addresses practical problems through well-designed techniques, with comprehensive experimental validation and substantial academic and practical potential. Despite certain limitations, it provides clear directions for future research.