
One Life to Learn: Inferring Symbolic World Models for Stochastic Environments from Unguided Exploration

Khan, Prasad, Stengel-Eskin et al.

Symbolic world modeling requires inferring and representing an environment's transitional dynamics as an executable program. Prior work has focused on largely deterministic environments with abundant interaction data, simple mechanics, and human guidance. We address a more realistic and challenging setting, learning in a complex, stochastic environment where the agent has only "one life" to explore a hostile environment without human guidance. We introduce OneLife, a framework that models world dynamics through conditionally-activated programmatic laws within a probabilistic programming framework. Each law operates through a precondition-effect structure, activating in relevant world states. This creates a dynamic computation graph that routes inference and optimization only through relevant laws, avoiding scaling challenges when all laws contribute to predictions about a complex, hierarchical state, and enabling the learning of stochastic dynamics even with sparse rule activation. To evaluate our approach under these demanding constraints, we introduce a new evaluation protocol that measures (a) state ranking, the ability to distinguish plausible future states from implausible ones, and (b) state fidelity, the ability to generate future states that closely resemble reality. We develop and evaluate our framework on Crafter-OO, our reimplementation of the Crafter environment that exposes a structured, object-oriented symbolic state and a pure transition function that operates on that state alone. OneLife can successfully learn key environment dynamics from minimal, unguided interaction, outperforming a strong baseline on 16 out of 23 scenarios tested. We also test OneLife's planning ability, with simulated rollouts successfully identifying superior strategies. Our work establishes a foundation for autonomously constructing programmatic world models of unknown, complex environments.


Basic Information

Paper ID: 2510.12088
Title: One Life to Learn: Inferring Symbolic World Models for Stochastic Environments from Unguided Exploration
Authors: Zaid Khan, Archiki Prasad, Elias Stengel-Eskin, Jaemin Cho, Mohit Bansal (UNC Chapel Hill)
Classification: cs.AI, cs.CL, cs.LG
Publication Date: October 14, 2025
Paper Link: https://arxiv.org/abs/2510.12088

Abstract

Symbolic world modeling requires inferring and representing environmental transition dynamics as executable programs. Prior work has primarily focused on deterministic environments with abundant interaction data, simple mechanisms, and human guidance. This paper addresses a more realistic and challenging setting: learning in complex stochastic environments where an agent has only "one life" to explore an adversarial environment without human guidance. We propose the OneLife framework, which models world dynamics through conditionally activated programmatic rules within a probabilistic programming framework. Each rule operates through a precondition-effect structure, activating in relevant world states. This creates a dynamic computational graph that routes inference and optimization only through relevant rules, avoiding scaling challenges when all rules must predict over complex hierarchical states, and enabling learning of stochastic dynamics even with sparse rule activations.

Research Background and Motivation

Problem Definition

Traditional symbolic world modeling approaches face the following key challenges:

Data Limitations: In the real world, agents can often perform only a limited number of interactions, particularly in dangerous environments
Stochasticity Handling: Real environments exhibit irreducible stochasticity, such as unpredictable NPC behavior
Lack of External Guidance: Absence of environment-specific rewards or human-provided objectives
Scaling Complexity: Existing methods struggle to scale when environments contain numerous interaction mechanisms

Research Significance

Symbolic world modeling is crucial for artificial intelligence because it enables:

Functional understanding of underlying environmental dynamics
Prediction of action outcomes without actual interaction
Construction of interpretable, editable, and verifiable representations

Limitations of Existing Approaches

Prior research primarily assumes:

Limited discoverable mechanisms with low stochasticity
Access to abundant interaction data
Human-provided environment-specific guidance (goals/rewards)

These assumptions often fail in complex open-world environments (e.g., Minecraft, RuneScape).

Research Motivation

The core research question is: How can an agent reverse-engineer the rules of complex, dangerous stochastic worlds with limited interaction budgets and without environment-specific human guidance?

Core Contributions

OneLife Framework: Proposes a probabilistic symbolic world model capable of learning from stochastic adversarial environments with minimal interaction, without requiring access to human-defined rewards
Crafter-OO Environment: Re-implements the Crafter environment, exposing structured object-oriented symbolic states and pure transition functions
Evaluation Protocol: Introduces a new world modeling evaluation suite containing 30+ executable scenarios and state fidelity/state ranking metrics
Performance Improvements: Outperforms strong baselines on 16/23 test scenarios and demonstrates planning capabilities

Methodology Details

Task Definition

Given a pure transition function T: S × A → Δ(S), where:

S: state space
A: action space
Δ(S): probability distribution over state space

The objective is to learn a symbolic world model from a single unguided exploration trajectory that can predict the probability distribution of state transitions.
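To make the signature T: S × A → Δ(S) concrete, here is a minimal sketch of a pure stochastic transition function. The state fields (`health`, `wood`), the action names, and the 0.8 success probability are invented for illustration and are not Crafter-OO values.

```python
from typing import Dict, Tuple

# A state is an immutable tuple (health, wood); both fields are hypothetical.
State = Tuple[int, int]
Distribution = Dict[State, float]


def transition(state: State, action: str) -> Distribution:
    """Toy pure transition function T: S x A -> Delta(S).

    It reads only its arguments and returns an explicit distribution over
    next states, so the same inputs always give the same distribution.
    """
    health, wood = state
    if action == "collect_wood":
        # Assumed stochastic outcome: collection succeeds with probability 0.8.
        return {(health, wood + 1): 0.8, (health, wood): 0.2}
    # Any other action leaves the state unchanged with certainty.
    return {state: 1.0}


dist = transition((9, 0), "collect_wood")
```

A learned world model in this setting has the same shape: it takes a state and an action and returns probabilities over candidate next states, which is what the ranking metrics later in this summary score.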

Model Architecture

1. World Model Representation

OneLife models the environment as a weighted product of programmatic rules, factorized over the observables O of the next state:

p(s'|s,a;θ) = ∏_{o∈O} p(o|s,a;θ)

where the probability for each observable o is:

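p(o=v|s,a;θ) ∝ ∏_{i∈I_o(s,a)} φ_i(o=v|s,a)^{θ_i}

Here φ_i is the distribution over o predicted by rule i, θ_i is that rule's learned weight, and I_o(s,a) is the set of active rules that make a prediction about o.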
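The weighted product for a single observable can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: the function name `combine_laws`, the string-valued outcomes, and the small probability floor `eps` are all hypothetical choices, and at least one active rule is assumed.

```python
import math
from typing import Dict, List


def combine_laws(predictions: List[Dict[str, float]], weights: List[float]) -> Dict[str, float]:
    """Weighted product of per-rule distributions over one observable, renormalized."""
    values = set().union(*[p.keys() for p in predictions])
    eps = 1e-9  # assumed floor so that log(0) never occurs
    # Work in log space: sum_i theta_i * log phi_i(v).
    log_scores = {
        v: sum(w * math.log(max(p.get(v, 0.0), eps)) for p, w in zip(predictions, weights))
        for v in values
    }
    top = max(log_scores.values())
    unnormalized = {v: math.exp(s - top) for v, s in log_scores.items()}
    z = sum(unnormalized.values())
    return {v: u / z for v, u in unnormalized.items()}


# Two active rules with weights 1.0 and 2.0.
phi_1 = {"0": 0.5, "1": 0.5}
phi_2 = {"0": 0.2, "1": 0.8}
combined = combine_laws([phi_1, phi_2], [1.0, 2.0])
```

In this example the unnormalized scores are 0.5 × 0.2² = 0.02 for value "0" and 0.5 × 0.8² = 0.32 for value "1", so the combined probability of "1" is 0.32 / 0.34. A larger weight sharpens a rule's influence, and a weight of zero removes it from the product.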

2. Rule Structure

Each rule L_i is defined by a precondition-effect pair (c_i, e_i):

Precondition c_i(s,a) → {true, false}: determines whether the rule applies
Effect e_i(s,a) → s': produces a predicted next state by modifying a copy of the current state

3. Dynamic Computational Graph

For a given transition, only the rules whose preconditions hold, I(s,a) = {i | c_i(s,a) is true}, are activated, so each parameter update touches only a sparse subset of rule weights.
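The precondition-effect structure and the active set I(s,a) can be sketched as follows. The `Law` class, the two example rules, and the state fields `food` and `energy` are hypothetical and chosen only to show the pattern; they are not taken from the paper's learned rules.

```python
import copy
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

State = Dict[str, Any]


@dataclass
class Law:
    name: str
    precondition: Callable[[State, str], bool]  # c_i(s, a)
    effect: Callable[[State, str], State]       # e_i(s, a)


def eat_effect(state: State, action: str) -> State:
    nxt = copy.deepcopy(state)  # modify a copy, never the input state
    nxt["food"] = min(9, nxt["food"] + 1)
    return nxt


def sleep_effect(state: State, action: str) -> State:
    nxt = copy.deepcopy(state)
    nxt["energy"] = min(9, nxt["energy"] + 1)
    return nxt


laws = [
    Law("eat_restores_food", lambda s, a: a == "eat", eat_effect),
    Law("sleep_restores_energy", lambda s, a: a == "sleep", sleep_effect),
]


def active_laws(laws: List[Law], state: State, action: str) -> List[int]:
    """Indices i with c_i(s, a) true, i.e. the active set I(s, a)."""
    return [i for i, law in enumerate(laws) if law.precondition(state, action)]


state = {"food": 5, "energy": 3}
active = active_laws(laws, state, "eat")
predicted = laws[active[0]].effect(state, "eat")
```

Only the rule indexed in `active` is evaluated for this transition; the sleep rule contributes nothing to the prediction and would receive no weight update from it.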

Core Components

1. Exploration Strategy

Uses a large language model-driven exploration strategy:

Objective: discover as many underlying mechanisms as possible
Strategy: treat exploration as a reverse-engineering task
Advantage: survival time increases from 100 steps under a random strategy to 400 steps

2. Rule Synthesizer

Employs a general approach rather than hand-crafted synthesizers:

Proposes numerous simple atomic rules to explain each observed transition
Atomic rules: describe minimal state attribute changes
Supports fine-grained credit assignment
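The granularity of atomic rules can be illustrated by splitting one observed transition into its minimal attribute changes. This sketch only shows the decomposition step; it does not show how rules are synthesized, and the function name and flat state layout are assumptions made for the example.

```python
from typing import Any, Dict, List, Tuple

State = Dict[str, Any]


def atomic_changes(before: State, after: State) -> List[Tuple[str, Any, Any]]:
    """One (attribute, old, new) triple per attribute that changed."""
    keys = sorted(set(before) | set(after))
    return [
        (key, before.get(key), after.get(key))
        for key in keys
        if before.get(key) != after.get(key)
    ]


# Hypothetical transition: the agent collects wood and is hit in the same step.
before = {"wood": 0, "sapling": 1, "health": 9}
after = {"wood": 1, "sapling": 1, "health": 8}
changes = atomic_changes(before, after)
```

Each triple in `changes` would be explained by its own candidate rule, so the wood gain and the health loss can later be credited or discounted independently.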

3. Parameter Inference

Gradient-based optimization algorithm:

Maximizes log-likelihood of observed transitions
Updates only rule weights affecting observed variables
Uses L-BFGS for optimization
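The sparse update described above can be sketched with a small weight-fitting loop. For the weighted-product model, the gradient of the log-likelihood with respect to θ_i is log φ_i(observed) minus the model's expectation of log φ_i, and it is zero for inactive rules. Note that this sketch swaps in plain gradient ascent for L-BFGS to stay dependency-free; the function names, learning rate, step count, and toy data are all assumptions.

```python
import math
from typing import Dict, List, Tuple

Dist = Dict[str, float]
# One observation of a single observable: the distributions predicted by the
# rules that were active (keyed by rule index) and the value actually observed.
Observation = Tuple[Dict[int, Dist], str]


def log_phi(dist: Dist, value: str, eps: float = 1e-6) -> float:
    return math.log(max(dist.get(value, 0.0), eps))


def model_probs(active: Dict[int, Dist], theta: List[float]) -> Dist:
    """Normalized weighted product over the active rules."""
    values = set().union(*[d.keys() for d in active.values()])
    scores = {v: sum(theta[i] * log_phi(d, v) for i, d in active.items()) for v in values}
    top = max(scores.values())
    z = sum(math.exp(s - top) for s in scores.values())
    return {v: math.exp(s - top) / z for v, s in scores.items()}


def fit_weights(data: List[Observation], n_laws: int,
                lr: float = 0.05, steps: int = 200) -> List[float]:
    theta = [1.0] * n_laws
    for _ in range(steps):
        grad = [0.0] * n_laws
        for active, observed in data:
            probs = model_probs(active, theta)
            for i, d in active.items():  # inactive rules receive no gradient
                expected = sum(p * log_phi(d, v) for v, p in probs.items())
                grad[i] += log_phi(d, observed) - expected
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta


# Toy data: rules 0 and 1 are always active, rule 2 never is.
law0 = {"hit": 0.9, "miss": 0.1}
law1 = {"hit": 0.1, "miss": 0.9}
data = [({0: law0, 1: law1}, "hit")] * 8 + [({0: law0, 1: law1}, "miss")] * 2
theta = fit_weights(data, n_laws=3)
p_hit = model_probs({0: law0, 1: law1}, theta)["hit"]
```

After fitting, the rule that agrees with most observations ends up with the larger weight, the model's probability of "hit" matches the empirical frequency of 0.8, and the weight of the never-active rule stays at its initial value.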

Technical Innovations

Conditional Activation Mechanism: Implements selective rule activation through precondition structures, avoiding interference from irrelevant rules
Sparse Parameter Updates: Performs gradient updates only on activated rules predicting observed changes, providing precise credit assignment
Atomic Rule Decomposition: Decomposes complex events into multiple simple rules, improving learning accuracy
Probabilistic Programming Framework: Supports modeling and inference of stochastic dynamics

Experimental Setup

Dataset

Crafter-OO Environment:

Re-implementation based on the Crafter environment
Exposes structured object-oriented state representation
Contains significant stochasticity and diverse mechanisms
Supports programmatic state modification

Evaluation Metrics

State Ranking Metrics

Rank@1: Fraction of test transitions in which the model assigns the true next state the highest probability among the candidate states
Mean Reciprocal Rank (MRR): Average of 1/rank of the true next state across test transitions
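Both ranking metrics follow directly from the rank of the true state among the candidates. The sketch below assumes the model's scores are given as a list per test case; counting ties against the true state is an assumption of this example, not a rule stated in the source.

```python
from typing import List


def rank_of_true(scores: List[float], true_index: int) -> int:
    """1-based rank of the true state; ties are counted against it (assumed)."""
    true_score = scores[true_index]
    return 1 + sum(
        1 for j, s in enumerate(scores) if j != true_index and s >= true_score
    )


def rank_at_1(ranks: List[int]) -> float:
    return sum(1 for r in ranks if r == 1) / len(ranks)


def mean_reciprocal_rank(ranks: List[int]) -> float:
    return sum(1.0 / r for r in ranks) / len(ranks)


# Three hypothetical test cases: (model scores for candidates, index of true state).
cases = [
    ([0.7, 0.2, 0.1], 0),  # true state ranked 1st
    ([0.1, 0.6, 0.3], 2),  # true state ranked 2nd
    ([0.5, 0.3, 0.2], 2),  # true state ranked 3rd
]
ranks = [rank_of_true(scores, idx) for scores, idx in cases]
r1 = rank_at_1(ranks)
mrr = mean_reciprocal_rank(ranks)
```

Here the ranks are 1, 2, and 3, giving Rank@1 = 1/3 and MRR = (1 + 1/2 + 1/3) / 3 = 11/18. MRR gives partial credit when the true state is near the top, which Rank@1 does not.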

State Fidelity Metrics

Raw Edit Distance: Number of JSON patch operations between predicted and true states
Normalized Edit Distance: Raw edit distance divided by total elements in state representation
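A simplified version of the fidelity metrics can be computed by flattening both states into path-value pairs and counting paths that were added, removed, or changed. This is only an approximation of a JSON Patch operation count (a real patch generator may also use move or copy operations), and normalizing by the number of leaf elements in the true state is an assumption. Empty containers are ignored by this sketch.

```python
from typing import Any, Dict

_MISSING = object()


def flatten(obj: Any, prefix: str = "") -> Dict[str, Any]:
    """Map every leaf value to a slash-separated path, e.g. /player/health."""
    if isinstance(obj, dict):
        items = list(obj.items())
    elif isinstance(obj, list):
        items = list(enumerate(obj))
    else:
        return {prefix: obj}
    flat: Dict[str, Any] = {}
    for key, value in items:
        flat.update(flatten(value, f"{prefix}/{key}"))
    return flat


def raw_edit_distance(predicted: Any, true: Any) -> int:
    fp, ft = flatten(predicted), flatten(true)
    return sum(
        1 for path in set(fp) | set(ft)
        if fp.get(path, _MISSING) != ft.get(path, _MISSING)
    )


def normalized_edit_distance(predicted: Any, true: Any) -> float:
    return raw_edit_distance(predicted, true) / max(1, len(flatten(true)))


# Hypothetical states: the prediction misses a health loss and a spawned zombie.
true_state = {"player": {"health": 8, "inventory": {"wood": 1}},
              "zombies": [{"x": 3, "y": 4}]}
pred_state = {"player": {"health": 9, "inventory": {"wood": 1}},
              "zombies": []}
raw = raw_edit_distance(pred_state, true_state)
norm = normalized_edit_distance(pred_state, true_state)
```

The true state has four leaf values, and the prediction differs on three of them (one changed, two missing), so the raw distance is 3 and the normalized distance is 0.75.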

Comparison Methods

Random World Model: Assigns uniform probability to all candidate states
PoE-World: State-of-the-art symbolic world model, using this paper's exploration strategy and rule synthesizer for fair comparison

Implementation Details

Evaluation scenarios: 40+ scenarios covering all core game mechanics
Corrupted state generation: 8 mutators producing invalid state transitions
Optimization algorithm: L-BFGS
Exploration budget: single trajectory, averaging 400 steps

Experimental Results

Main Results

Method	Rank@1	MRR	Raw Edit Dist.	Norm. Edit Dist.
Random	8.5%	0.322	121.538	0.809
PoE-World	10.8%	0.351	10.634	0.071
OneLife	18.7%	0.479	8.764	0.058

OneLife outperforms both baselines on all four metrics, with the largest gains in state ranking:

Rank@1 improvement of 7.9 percentage points over PoE-World (18.7% vs. 10.8%)
MRR improvement of 0.128 over PoE-World (0.479 vs. 0.351)
Outperforms PoE-World baseline on 16/23 scenarios

Fine-grained Evaluation

Performance analysis by game mechanics shows OneLife excels on most mechanisms:

Resource Collection: wood, stone, coal collection tasks
Tool Crafting: crafting various pickaxes and swords
Combat System: combat with zombies and skeletons
World Operations: item placement and environment modification

Planning Capability Verification

Forward simulation testing validates planning capability across 3 scenarios:

Scenario	Plan Description	Avg Steps	Real Env Preference	OneLife Preference
Zombie Fighter	Craft sword then fight vs. fight immediately	33 vs 17	✓Craft sword	✓Craft sword
Stone Miner	Craft pickaxe then mine vs. mine directly	31 vs 13	✓Craft pickaxe	✓Craft pickaxe
Sword Smith	Reuse workbench vs. rebuild each time	5 vs 10	✓Reuse	✓Reuse

In all three scenarios, the world model learned by OneLife prefers the same plan as the real environment.
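Comparing plans by forward simulation can be sketched as repeated rollouts in a stochastic model, scoring each plan by an average outcome. Everything in this example is hypothetical: the toy dynamics stand in for a learned world model, and final health is an assumed scoring criterion, not the measure used in the table above.

```python
import random
from typing import Any, Dict, List

State = Dict[str, Any]


def sample_next(state: State, action: str, rng: random.Random) -> State:
    """Toy stochastic world model (hypothetical dynamics, not learned rules)."""
    nxt = dict(state)
    if action == "craft_sword":
        nxt["sword"] = True
    elif action == "attack":
        damage = 2 if state["sword"] else 1
        nxt["zombie_hp"] = max(0, state["zombie_hp"] - damage)
        # A surviving zombie hits back half of the time.
        if nxt["zombie_hp"] > 0 and rng.random() < 0.5:
            nxt["health"] = state["health"] - 1
    return nxt


def rollout(plan_prefix: List[str], rng: random.Random) -> int:
    state = {"health": 9, "zombie_hp": 5, "sword": False}
    for action in plan_prefix:
        state = sample_next(state, action, rng)
    while state["zombie_hp"] > 0 and state["health"] > 0:
        state = sample_next(state, "attack", rng)
    return state["health"]


def evaluate(plan_prefix: List[str], n: int = 2000, seed: int = 0) -> float:
    """Average final health over n simulated rollouts."""
    rng = random.Random(seed)
    return sum(rollout(plan_prefix, rng) for _ in range(n)) / n


with_sword = evaluate(["craft_sword"])
without_sword = evaluate([])
preferred = "craft sword first" if with_sword > without_sword else "fight immediately"
```

In this toy model the sword shortens the fight from five attacks to three, so fewer counterattacks land and the sword-first plan scores higher. A world model is useful for planning when this simulated preference matches the preference observed in the real environment.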

Ablation Study

Comparison of different inference methods:

OneLife (Complete): 18.7% Rank@1, 0.479 MRR
Without Parameter Inference: 13.0% Rank@1, 0.429 MRR
PoE-World Inference: 10.8% Rank@1, 0.351 MRR

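Removing parameter inference lowers Rank@1 from 18.7% to 13.0% and MRR from 0.479 to 0.429, and substituting PoE-World's inference lowers them further to 10.8% and 0.351, indicating that OneLife's inference algorithm accounts for much of the performance improvement.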

Related Work

Symbolic World Models

Monolithic Program Approaches: Tang et al. (2024) and Dainese et al. (2024) use LLMs to synthesize single programs
Compositional Approaches: Piriyakulkij et al. (2025) propose product of experts models
Formal Planning Representations: Construct symbolic planning representations like PDDL

Programmatic Decision-Making Representations

Programmatic Policies: Provide better interpretability and generalization
Programmatic Rewards: Generate reward functions from natural language instructions
Skill Libraries: Construct composable temporally-extended skills

World Modeling for Open-Ended Exploration

Implicit World Models: Drive exploration through intrinsic motivation
Automated Scientific Discovery: Autonomously form hypotheses and conduct experiments
Fast Inductive Evaluation: Assess agent ability to rapidly induce world models in new environments

Conclusions and Discussion

Main Conclusions

OneLife successfully addresses the challenge of learning symbolic world models from limited unguided interactions in complex stochastic environments
Conditional activation of programmatic rules and sparse parameter updates are key innovations
The learned world model supports effective planning and decision-making

Limitations

Exploration Bottleneck: LLM-driven exploration strategies still struggle to fully discover complex technology trees
Memory Issues: Exploration agents easily forget previously learned information
Environment Specificity: Current implementation primarily targets Crafter-OO environment
Computational Complexity: Rule synthesis and parameter inference incur significant computational overhead

Future Directions

Improved Exploration Strategies: Develop more effective unguided exploration methods
Extension to Other Environments: Validate framework generalization across diverse complex environments
Online Learning: Support continuous learning and adaptation
Multimodal Integration: Incorporate visual and textual information for world modeling

In-Depth Evaluation

Strengths

Problem Importance: Addresses a core challenge in symbolic world modeling, namely learning in complex stochastic environments with limited data
Technical Innovation: Conditional activation mechanisms and sparse update strategies demonstrate significant novelty
Comprehensive Experiments: Thorough evaluation protocol and multi-faceted experimental validation
Practical Value: Demonstrates real planning application effectiveness
Environmental Contribution: Crafter-OO provides valuable testing platform for symbolic world modeling

Weaknesses

Exploration Dependency: Still relies on relatively powerful LLM for exploration, potentially limiting method generality
Evaluation Scope: Primarily validated on single environment type; generalization capability requires further verification
Theoretical Analysis: Lacks theoretical guarantees on convergence and sample complexity
Computational Efficiency: Insufficient analysis of rule synthesis computational overhead

Impact

Academic Contribution: Provides new research paradigm for symbolic world modeling
Practical Prospects: Potential applications in game AI, robotics, and other domains
Open-Source Value: Crafter-OO environment and evaluation framework available for community use
Methodological Inspiration: Conditional activation and sparse update ideas applicable to other learning tasks

Applicable Scenarios

Game AI: Rule learning and strategy planning for complex strategy games
Robotics: Dynamics modeling and task planning in unknown environments
Scientific Discovery: Automated scientific hypothesis generation and verification
Educational Applications: Learner modeling in intelligent tutoring systems

References

The paper cites important works across symbolic world modeling, program synthesis, and reinforcement learning, providing comprehensive literature foundation for related research. Key references include the Crafter environment, PoE-World methodology, and various works on programmatic representation learning.

Overall Assessment: This is a high-quality research paper making significant contributions to the important and challenging field of symbolic world modeling. The OneLife framework cleverly addresses practical problems through well-designed techniques, with comprehensive experimental validation and substantial academic and practical potential. Despite certain limitations, it provides clear directions for future research.