Physical Reinforcement Learning

Dillavou, Mishra
Digital computers are power-hungry and largely intolerant of damaged components, making them potentially difficult tools for energy-limited autonomous agents in uncertain environments. Recently developed Contrastive Local Learning Networks (CLLNs) - analog networks of self-adjusting nonlinear resistors - are inherently low-power and robust to physical damage, but were constructed to perform supervised learning. In this work we demonstrate success on two simple RL problems using Q-learning adapted for simulated CLLNs. Doing so makes explicit the components (beyond the network being trained) required to enact various tools in the RL toolbox, some of which (policy function and value function) are more natural in this system than others (replay buffer). We discuss assumptions such as the physical safety that digital hardware requires, CLLNs can forgo, and biological systems cannot rely on, and highlight secondary goals that are important in biology and trainable in CLLNs, but make little sense in digital computers.

Basic Information

  • Paper ID: 2511.17789
  • Title: Physical Reinforcement Learning
  • Authors: Sam Dillavou (University of Pennsylvania), Shruti Mishra (University of Cambridge)
  • Classification: cs.LG (Machine Learning), cond-mat.dis-nn (Condensed Matter - Disordered Systems and Neural Networks)
  • Publication Date: November 21, 2025 (arXiv v1)
  • Paper Link: https://arxiv.org/abs/2511.17789

Abstract

Digital computers are powerful, but their high energy consumption and intolerance of component damage make them a poor fit for autonomous agents in energy-limited, uncertain environments. This paper investigates Contrastive Local Learning Networks (CLLNs), analog networks of self-adjusting nonlinear resistors, for reinforcement learning tasks. CLLNs are inherently low-power and robust to physical damage, but were previously constructed only for supervised learning. The authors adapt Q-learning to simulated CLLNs, solve two simple reinforcement learning problems, and make explicit which components beyond the trained network are needed to implement various tools in the RL toolkit. Policy and value functions are natural to implement in this system; experience replay buffers are less so.

Research Background and Motivation

1. Core Problem

Digital computers face two fundamental weaknesses in reinforcement learning applications:

  • Poor fault tolerance: Damage to a single transistor can cause system-wide failure, as the function of each component is inherently tied to its position in the system
  • High energy consumption: Laptop CPUs consume approximately 50W, stemming from the high energy cost of maintaining "perfect" operation and data transfer between processing and storage

2. Problem Significance

For autonomous agents in energy-limited environments, low power consumption and fault tolerance are critical. Biological systems excel in these aspects:

  • The human brain consumes only 20W total power while performing perception, cognition, motor control, and other tasks
  • The brain can withstand significant damage and continue operating, including single neuron destruction, traumatic brain injury, and even brain region removal
  • This robustness stems from distributed processing and emergent computation, rather than linear computation

3. Limitations of Existing Approaches

  • Few examples exist of artificial non-digital hardware applied to RL tasks
  • Many digitally-enhanced or simulated analog systems have been used for RL, but few hardware demonstrations combine distributed storage, distributed computation, and analog signals
  • Recently developed CLLNs, while possessing low power and fault-tolerant characteristics, have not yet been validated in RL scenarios

4. Research Motivation

  • Explore the potential of CLLNs in RL applications, opening pathways for energy-efficient and fault-tolerant autonomous agents
  • Clarify which RL tools are natural for self-learning networks and which require additional pre-programmed hardware
  • Understand the additional challenges faced when placing an agent's "brain" outside the digital domain

Core Contributions

  1. First application of CLLNs to reinforcement learning: Successfully adapted Q-learning to simulated CLLNs, enabling RL capabilities for physical learning networks
  2. Validation on two RL tasks:
    • Four-state, four-action Markov Decision Process (MDP)
    • Nine-state (3×3 grid) four-action navigation task
    • Reached an optimal policy in 8 of 10 trials on each task; the remaining 2 trials on the four-state MDP were near-optimal
  3. Clarification of design considerations for physical learning systems:
    • Identified RL components naturally implementable in CLLNs (policy functions, value functions)
    • Identified components requiring additional hardware support (experience replay buffers)
    • Revealed constraints unique to physical systems (bounded parameters, non-feedforward structure)
  4. Proposed unique advantages of physical learning systems:
    • Low-power operation can be further optimized through modified learning algorithms
    • Online recovery capability after damage
    • Ability to train secondary objectives (e.g., power consumption, robustness) that are meaningless in digital systems

Detailed Methodology

Task Definition

Task 1: Four-State, Four-Action MDP

  • State space: 4 discrete states S₁, S₂, S₃, S₄
  • Action space: 4 discrete actions A₀, A₁, A₂, A₃
  • State transitions: Simple deterministic transitions, action i leads to state Si
  • Rewards: A fixed reward for each state-action pair, R(St, At), sampled from N(0.1, 0.1), plus per-step noise N(0, 0.01)
  • Objective: Learn optimal policy to maximize cumulative rewards

Task 2: Nine-State Navigation Task

  • State space: 9 positions on a 3×3 grid
  • Action space: 4 directional movements (up, down, left, right)
  • Reward structure: Large reward at target position (top-left corner), small reward gradient at other positions (5000× smaller)
  • Objective: Learn to navigate to high-reward position

Model Architecture

CLLN Fundamentals

CLLNs are networks of self-adjusting resistive elements whose individual dynamics approximate gradient descent on a global loss function.

Network Structure:

  • Nodes divided into input nodes (yellow) and output nodes (blue)
  • Inputs: Data encoded by forcing node voltages V₁, ..., V₄
  • Outputs: Equilibrium voltage values O₁, ..., O₄ as network computation results
  • Network functions as: F(V₁, V₂, V₃, V₄) ≡ (O₁, O₂, O₃, O₄)

Conductance Model: Each conductive element is a MOSFET operating in the triode (passive) region:

Gi = S(VG,i - VT - V̄)

Where:

  • S = 1 (constant)
  • VT = 0.7 (threshold voltage)
  • VG,i: Adjustable gate voltage (acting as weight)
  • V̄: Average voltage across edge nodes (implementing nonlinear transformation)
  • Parameter range constraint: 1.0 < VG,i < 5.5
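
The conductance model above can be sketched in a few lines. This is a minimal illustration, not the authors' simulation code: the function name is hypothetical, and clipping negative conductances to zero (treating a below-threshold element as non-conducting) is an assumption not stated in the summary.

```python
import numpy as np

S = 1.0                     # conductance scale (from the text)
V_T = 0.7                   # threshold voltage (from the text)
VG_MIN, VG_MAX = 1.0, 5.5   # allowed gate-voltage range (from the text)

def edge_conductance(v_gate, v_a, v_b):
    """Conductance of one edge whose two nodes sit at voltages v_a and v_b."""
    v_gate = np.clip(v_gate, VG_MIN, VG_MAX)   # bounded learnable parameter
    v_bar = 0.5 * (v_a + v_b)                  # average voltage of the edge's nodes
    # Assumption: an element below threshold is treated as non-conducting.
    return np.maximum(S * (v_gate - V_T - v_bar), 0.0)
```

Because V̄ depends on the node voltages, the same gate voltage yields different conductances for different inputs, which is where the nonlinearity enters.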

Contrastive Learning Mechanism

The learning process requires comparing two different states:

  1. Free State:
    • Only inputs V₁, ..., V₄ applied
    • Each resistor experiences voltage drop ΔVᶠᵢ
    • Outputs are Oᶠₙ
  2. Clamped State:
    • Inputs and desired outputs (labels) applied
    • Voltage drop is ΔVᶜᵢ
    • Outputs pushed toward labels: Oᶜₙ = Oᶠₙ(1-η) + ηLₙ (η=0.1 in this work)

Local Learning Rule:

The system performs gradient descent on the contrastive function (difference in dissipated power between clamped and free states):

δGi = -α d/dGi[Pᶜ - Pᶠ]

Through chain rule derivation, a fully local learning rule is obtained:

δGi = α[(ΔVᶠᵢ)² - (ΔVᶜᵢ)²]

Key feature: Each element only needs to measure its own voltage drop in both states for updating, achieving decentralized training.
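
One contrastive update can be sketched as follows. This is a simplified sketch under stated assumptions: the resistors are linear (the V̄ nonlinearity is ignored), there is a single output node, conductances are updated directly rather than through gate voltages, and the function names, the learning rate `alpha`, and the lower bound on conductance are illustrative choices rather than values from the paper.

```python
import numpy as np

def solve_voltages(edges, G, fixed):
    """Equilibrium node voltages of a linear resistor network.

    edges: list of (i, j) node pairs; G: array of edge conductances;
    fixed: dict mapping node index to an imposed voltage.
    """
    n = 1 + max(max(e) for e in edges)
    lap = np.zeros((n, n))                     # weighted graph Laplacian
    for (i, j), g in zip(edges, G):
        lap[i, i] += g
        lap[j, j] += g
        lap[i, j] -= g
        lap[j, i] -= g
    fixed_idx = list(fixed)
    free_idx = [k for k in range(n) if k not in fixed]
    v = np.zeros(n)
    v[fixed_idx] = [fixed[k] for k in fixed_idx]
    if free_idx:                               # Kirchhoff's law at unforced nodes
        rhs = -lap[np.ix_(free_idx, fixed_idx)] @ v[fixed_idx]
        v[free_idx] = np.linalg.solve(lap[np.ix_(free_idx, free_idx)], rhs)
    return v

def contrastive_step(edges, G, inputs, out_node, label, eta=0.1, alpha=1.0):
    """Apply one local update; returns (new conductances, free-state output)."""
    v_free = solve_voltages(edges, G, inputs)
    # Clamped state: nudge the output a fraction eta toward the label.
    nudged = (1 - eta) * v_free[out_node] + eta * label
    v_clamped = solve_voltages(edges, G, {**inputs, out_node: nudged})
    i, j = np.array(edges).T
    dv_free = v_free[i] - v_free[j]
    dv_clamped = v_clamped[i] - v_clamped[j]
    # Each edge uses only its own two voltage drops.
    G_new = G + alpha * (dv_free**2 - dv_clamped**2)
    return np.clip(G_new, 1e-3, None), v_free[out_node]   # assumed lower bound
```

For instance, on a two-resistor voltage divider (1 V and 0 V at the ends, output in the middle, equal conductances), a call with a label of 0.8 V raises the upper conductance and lowers the other, moving the free output toward the label.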

Q-Learning Adaptation Scheme

State Encoding

  • States S₁...S₄ encoded as input voltage vectors:
    • S₁: 1, 0, 1, 0 V
    • S₂: 0, 1, 0, 1 V
    • S₃: 1, 1, 0, 0 V
    • S₄: 0, 0, 1, 1 V

Action Selection

  • ε-greedy policy: ε decays linearly from 0.05 to 0 over training
  • With probability 1-ε, the action with the largest of the four outputs is selected; otherwise a random action is taken
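
The state encoding and action selection above can be sketched as below. The voltage table comes from the text; the function names are hypothetical, and drawing the exploratory action uniformly at random is the standard ε-greedy convention, assumed here rather than confirmed by the summary.

```python
import numpy as np

# Input voltages (in volts) for states S1..S4, as listed in the text.
STATE_VOLTAGES = {
    1: (1, 0, 1, 0),
    2: (0, 1, 0, 1),
    3: (1, 1, 0, 0),
    4: (0, 0, 1, 1),
}

def epsilon(step, total_steps, eps_start=0.05):
    """Linear decay from eps_start at step 0 to 0 at total_steps."""
    return eps_start * max(0.0, 1.0 - step / total_steps)

def select_action(outputs, eps, rng):
    """Epsilon-greedy choice over the network's output voltages."""
    if rng.random() < eps:
        return int(rng.integers(len(outputs)))   # explore (assumed uniform)
    return int(np.argmax(outputs))               # exploit: largest output wins
```

Note that both the random draw and the argmax happen outside the network, which is why the summary later lists them as requiring external hardware.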

Q-Value Update

Future weighted score calculation:

Lt = R(St, At) + γ[max(F(St+1)) - mean(F(St+1))]

Where:

  • γ = 0.5 (discount factor)
  • Subtracting mean term improves performance, providing additional flexibility for small networks
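
As a small worked example of the label formula, the following sketch computes Lt from a reward and the network outputs at the next state. The function name is hypothetical; γ = 0.5 is taken from the text.

```python
import numpy as np

def q_target(reward, next_outputs, gamma=0.5):
    """Label L_t = R + gamma * (max(F(S_{t+1})) - mean(F(S_{t+1})))."""
    next_outputs = np.asarray(next_outputs, dtype=float)
    return reward + gamma * (next_outputs.max() - next_outputs.mean())
```

With a reward of 0.1 and next-state outputs of 0.2, 0.4, 0.6, and 0.8, the maximum is 0.8 and the mean is 0.5, so the label is 0.1 + 0.5 × 0.3 = 0.25. Subtracting the mean keeps the label small even when all outputs drift upward together, which matters when outputs are bounded.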

Training Procedure

  1. System in state St, select action At
  2. Environment returns reward Rt, transitions to St+1
  3. Calculate Lt
  4. Train network:
    • Free state: Apply St as input
    • Clamped state: Apply St as input, hold unselected action outputs at their free-state values Oᵢ, and nudge the selected action's output toward the label Lt (per the clamping rule, with η = 0.1)
  5. Batch update every 50 steps
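
The procedure above can be sketched as a loop. To keep the example self-contained, the CLLN is replaced by a hypothetical `TableNetwork` stand-in that stores one output per state-action pair; it mimics only the interface (read outputs, nudge the selected output toward a label, apply updates in batches), not the circuit physics. States are 0-indexed, averaging the accumulated updates over the batch is an assumption, and the noise value 0.01 is treated as a standard deviation.

```python
import numpy as np

class TableNetwork:
    """Stand-in for the CLLN: one stored output per (state, action)."""

    def __init__(self, n_states, n_actions):
        self.q = np.zeros((n_states, n_actions))
        self.pending = np.zeros_like(self.q)

    def outputs(self, state):
        return self.q[state]

    def accumulate(self, state, action, label, eta=0.1):
        # Only the selected output is nudged; unselected outputs stay fixed.
        self.pending[state, action] += eta * (label - self.q[state, action])

    def apply_batch(self, batch_size):
        self.q += self.pending / batch_size   # assumed: average over the batch
        self.pending[:] = 0.0

def train(reward_matrix, steps=20_000, gamma=0.5, eps_start=0.05, batch=50, seed=0):
    rng = np.random.default_rng(seed)
    n_states, n_actions = reward_matrix.shape
    net = TableNetwork(n_states, n_actions)
    state = 0
    for t in range(steps):
        eps = eps_start * (1 - t / steps)             # linear decay to 0
        if rng.random() < eps:
            action = int(rng.integers(n_actions))
        else:
            action = int(np.argmax(net.outputs(state)))
        reward = reward_matrix[state, action] + rng.normal(0.0, 0.01)
        next_state = action                           # action i leads to state i
        nxt = net.outputs(next_state)
        label = reward + gamma * (nxt.max() - nxt.mean())
        # The previous state must be kept and reapplied for the update.
        net.accumulate(state, action, label)
        if (t + 1) % batch == 0:
            net.apply_batch(batch)
        state = next_state
    return net
```

The call to `accumulate` uses `state` after the environment has already moved to `next_state`, which is the store-and-reapply step the summary calls temporal backtracking.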

Technical Innovations

  1. Q-learning adapted to physical constraints:
    • Handling bounded parameters and outputs
    • Designing rewards and discount factors for desired output generation
  2. Training strategy for non-feedforward networks:
    • In CLLNs, voltage or resistance changes anywhere can affect all outputs
    • Training keeps unselected outputs static to avoid interference
  3. Temporal backtracking mechanism:
    • After environment transitions to St+1, must store and reapply St for updates
    • This is the "non-natural" step for physical systems
  4. Architecture adaptation:
    • Task 1: Uses cyclically connected network shown in Figure 2
    • Task 2: Uses densely connected network with 44 edges (6-4-4-1 layer structure, but non-feedforward)

Experimental Setup

Datasets

Task 1: Four-State MDP

  • Reward matrix: Sampled from N(0.1, 0.1), fixed for all trials
  • Reward noise: N(0, 0.01)
  • Optimal policy: Cycle through all four states
  • Total possible policies: 4⁴ = 256
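
Because each state has four possible actions, the 4⁴ = 256 deterministic policies can be enumerated and scored by brute force. The sketch below ranks them by long-run average reward per step, which is an assumed criterion chosen for simplicity (the paper's training uses discounted labels); states are 0-indexed, noise is ignored, and the function names are hypothetical.

```python
import itertools
import numpy as np

def long_run_average_reward(policy, rewards, start=0):
    """Mean reward on the cycle that the policy eventually enters from `start`."""
    visited, state = [], start
    while state not in visited:
        visited.append(state)
        state = policy[state]            # action i leads to state i
    cycle = visited[visited.index(state):]
    return float(np.mean([rewards[s, policy[s]] for s in cycle]))

def rank_policies(rewards):
    """Return all deterministic policies and the best one under the criterion."""
    n = rewards.shape[0]
    policies = list(itertools.product(range(n), repeat=n))
    scores = [long_run_average_reward(p, rewards) for p in policies]
    return policies, policies[int(np.argmax(scores))]
```

A policy here is a tuple whose entry `s` is the action taken in state `s`, so `(1, 2, 3, 0)` cycles through all four states.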

Task 2: Nine-State Navigation

  • 3×3 grid world
  • Large reward at target position (top-left corner)
  • Reward gradient at other positions (5000× smaller, invisible in heatmaps)
  • Random position reset every 5 steps
  • No reward noise

Evaluation Metrics

  • Average reward: Computed over logarithmically-spaced intervals (minimum 10 steps)
  • Policy quality: Comparison with optimal/worst policies
  • Success rate: Proportion of trials reaching optimal or near-optimal policy
  • State visitation distribution: Time proportion agent spends in each state after training
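
The average-reward metric can be sketched as follows. The summary states only that intervals are logarithmically spaced with a minimum of 10 steps; the growth ratio of 1.5 and the function names are assumptions made for illustration, and the final interval may be shorter than the minimum because it is truncated at the end of training.

```python
import numpy as np

def log_spaced_edges(n_steps, ratio=1.5, min_width=10):
    """Interval boundaries whose widths grow geometrically, at least min_width wide."""
    edges = [0]
    while edges[-1] < n_steps:
        width = max(min_width, int(edges[-1] * (ratio - 1)))
        edges.append(min(n_steps, edges[-1] + width))
    return edges

def binned_average(rewards, edges):
    """Mean reward inside each [edges[k], edges[k+1]) interval."""
    rewards = np.asarray(rewards, dtype=float)
    return [float(rewards[a:b].mean()) for a, b in zip(edges[:-1], edges[1:])]
```

Log-spaced intervals give fine resolution early in training, when the reward changes quickly, and smoother averages later on.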

Implementation Details

General Settings:

  • Initialization: VG,i ~ N(1.5, 0.1)
  • Learning rate α: Not explicitly specified, implicitly determined by physical process
  • Batch updates: Every 50 steps
  • Parameter range: 1.0 < VG,i < 5.5

Task 1:

  • Training steps: 100,000
  • Number of trials: 10
  • ε decay: 0.05 → 0 (linear)
  • Discount factor: γ = 0.5
  • Clamping parameter: η = 0.1

Task 2:

  • Training steps: 300,000
  • Number of trials: 10
  • ε decay: 0.1 → 0 (linear)
  • State reset frequency: Every 5 steps
  • Input encoding: Row and column coordinates rescaled to 0, 0.5, 1, plus inverted values and two constant nodes
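
The six-node input encoding for Task 2 might look like the sketch below. The ordering of the nodes, the use of 1 - x as the "inverted" value, and the two constant voltages (0 V and 1 V) are assumptions; the summary specifies only the rescaling to 0, 0.5, 1 and the presence of inverted values and two constant nodes.

```python
def encode_grid_state(row, col, const_low=0.0, const_high=1.0):
    """Map a 3x3 grid position (row, col in {0, 1, 2}) to six input voltages."""
    r, c = row / 2.0, col / 2.0          # rescale coordinates to 0, 0.5, 1
    # Assumed layout: coordinates, their inverses, then two constant nodes.
    return [r, c, 1.0 - r, 1.0 - c, const_low, const_high]
```

Six inputs match the first layer of the 6-4-4-1 structure described for the Task 2 network.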

Experimental Results

Main Results

Task 1: Four-State MDP

  • Success rate: 8 out of 10 trials reached optimal policy, remaining 2 achieved near-optimal
  • Learning curves (Figure 3B):
    • All trials (purple lines) show stable reward growth
    • Average reward (black line) rapidly converges to optimal policy level
    • Final performance approaches theoretical optimum (black dashed line)
    • Significantly outperforms worst policy (lower dashed line)

Task 2: Nine-State Navigation

  • Success rate: 8 out of 10 trials found one of multiple equivalent optimal policies
  • Learning curves (Figure 4B):
    • Steady reward growth
    • Fully reaches optimal policy line only in late training (ε→0)
    • Average performance (black line) shows consistent learning progress

State Visitation Analysis (Figure 4C):

  • 10 agents tested over 10,000 steps with ε=0 after training
  • Spend most time in high-reward square (top-left corner)
  • Heatmap shows agents successfully learned to navigate to target position

Experimental Findings

  1. Learning Stability:
    • Both tasks show stable learning processes
    • Consistent results across multiple trials with random initialization
    • No catastrophic forgetting or training collapse observed
  2. Impact of Physical Constraints:
    • Bounded parameters require careful reward and discount factor design
    • Subtracting mean term (in Lt calculation) significantly improves small network performance
  3. Adaptation to Non-Feedforward Structure:
    • Strategy of keeping unselected action outputs static during training is effective
    • This constraint has limited impact on simple tasks, but effects on complex policies require further study
  4. Necessity of Temporal Backtracking:
    • Requires storing and reapplying previous state St
    • This is "non-natural" for physical systems, potentially avoidable through hybrid state construction in future work

Related Work

Analog and Neuromorphic RL Systems

  • Mak et al. (2007, 2010): CMOS current-mode dynamic programming circuits, early hardware RL attempts
  • Mikaitis et al. (2018): Neuromodulated synaptic plasticity on SpiNNaker neuromorphic system
  • Limitations: Mostly digitally-enhanced or simulated analog systems, lacking true distributed storage and analog signal computation

Physical Learning Systems

  • Coupled Learning framework (Stern et al., 2021): Theoretical foundation for CLLNs
  • Equilibrium Propagation (Scellier & Bengio, 2017): Bridge between energy-based models and backpropagation
  • Contrastive Hebbian Learning (Movellan, 1991): Early theory of contrastive learning
  • Dillavou et al. (2024): First experimental demonstration of CLLNs for supervised learning
  • Stern et al. (2024): Training CLLNs for low-power solutions
  • Dillavou et al. (2022): Demonstrating decentralized physics-driven learning and fault tolerance
  • Dillavou et al. (2025): Understanding and embracing imperfections in physical learning networks

Biological Learning Systems

  • Brain fault tolerance (Wang et al., 2014; Chua et al., 2007; Granovetter et al., 2022)
  • Low-power operation (Balasubramanian, 2021)
  • Natural primitives (Mead, 1990)

Advantages of This Work

  • First RL application: First work implementing RL on CLLNs
  • Analog, distributed learning: Weight updates are local and analog within the (simulated) network, although action selection, state storage, and batch coordination still rely on external control
  • Systematic analysis: Clarifies design considerations and constraints of physical learning systems

Conclusions and Discussion

Main Conclusions

  1. Feasibility validation: CLLNs can successfully execute reinforcement learning tasks, achieving near-optimal performance on simple MDPs and navigation problems
  2. Natural component identification:
    • Policy functions and value functions can be naturally implemented in a single network
    • Historical storage methods like experience replay buffers require substantial control hardware, deviating from the "wild network" vision
  3. Physical constraints clarified:
    • Bounded parameters and outputs
    • Non-feedforward structure
    • Temporal backtracking mechanism required
  4. Unique advantages:
    • Low power consumption can be further optimized through modified learning methods
    • Can be retrained after damage
    • Can train secondary objectives (power, robustness, transmission speed)

Limitations

  1. Limited task complexity:
    • Validated only on very simple tasks (4-state and 9-state)
    • Impact of non-feedforward structure on complex policies remains unclear
  2. Still requires external control:
    • Randomization and max function in ε-greedy algorithm require external hardware
    • Temporal backtracking requires state storage
    • Batch updates require coordination
  3. Simulation limitations:
    • Avoided component imperfections and bias issues in simulation
    • Physical implementation will face measurement noise and component variation
    • Energy consumption unrelated to actual resistances and currents (in simulation)
  4. Lack of historical memory:
    • Difficult to naturally implement eligibility traces or experience replay
    • Limits range of applicable RL algorithms
  5. Unknown scalability:
    • Performance on larger networks and complex tasks untested
    • Extensibility of state and action spaces unclear

Future Directions

  1. Avoiding temporal backtracking:
    • Explore hybrid state construction (involving St+1 and L)
    • Develop more natural physical learning procedures
  2. Online recovery architecture:
    • Design architectures and algorithms allowing immediate recovery after damage
    • Leverage CLLNs' retraining capability
  3. Secondary objective optimization:
    • Modify learning algorithms to favor low-power solutions
    • Train networks for improved physical damage robustness
    • Optimize input-output transmission speed
  4. Physical implementation:
    • Hardware demonstration to validate simulation results
    • Handle component imperfections and bias
    • Measure actual power consumption and fault tolerance
  5. Complex task extension:
    • Larger state and action spaces
    • Continuous control tasks
    • Multi-agent scenarios
  6. Learning the learning algorithm:
    • Train CLLNs to perform necessary control functions (randomization, max function)
    • Explore meta-learning approaches

In-Depth Evaluation

Strengths

  1. Pioneering work:
    • First application of CLLNs to RL, opening new research direction for physical reinforcement learning
    • Provides alternative paradigm beyond digital RL
  2. Theoretical clarity:
    • Detailed derivation of local learning rules (Equations 1-4)
    • Clear explanation of contrastive learning mechanism
    • Rigorous mathematical formulation
  3. Systematic analysis:
    • Clear distinction between natural components and those requiring external support
    • Discussion of constraints and unique advantages specific to physical systems
    • Insightful comparisons with digital and biological systems
  4. Reasonable experimental design:
    • Progressive task complexity from simple to moderately complex
    • Multiple trials (10) validate stability
    • Comparison with theoretical optimal/worst policies
  5. Honest limitation discussion:
    • Acknowledges differences between simulation and physical implementation
    • Clearly identifies parts requiring external control
    • Discusses unknown scalability
  6. Interdisciplinary perspective:
    • Combines physics, machine learning, and neuroscience
    • Proposes secondary objectives meaningful in physical/biological systems but meaningless in digital systems

Weaknesses

  1. Overly simple tasks:
    • 4-state MDP and 3×3 grid are toy problems
    • Lacks validation on more complex, realistic tasks
    • Scalability is critical open question
  2. Still dependent on external control:
    • ε-greedy, max function, batch updates all require external hardware
    • Still far from a "fully autonomous physical learning system"
    • Temporal backtracking mechanism is unnatural
  3. Simulation-only results:
    • No physical hardware implementation
    • Cannot verify key advantages like power consumption and fault tolerance
    • Impact of component imperfections unknown
  4. Limited methodological scope:
    • Only Q-learning attempted
    • Other RL methods (policy gradient, Actor-Critic) unexplored
    • No direct performance comparison with digital Q-learning
  5. Limited depth of analysis:
    • No ablation studies analyzing impact of design choices
    • Hyperparameter sensitivity not studied
    • Learning dynamics only lightly analyzed
  6. Single evaluation metric:
    • Primarily focuses on average reward
    • Lacks analysis of sample efficiency, convergence speed
    • No computational cost (simulation time) comparison

Impact

Contribution to field:

  • Opens new direction: Introduces RL capabilities to physical computing and neuromorphic computing fields
  • Theoretical value: Clarifies design space and constraints of physical learning systems
  • Inspirational: Proposes comparative framework for digital, physical, and biological learning systems

Practical value:

  • Long-term potential: Provides direction for energy-limited, high fault-tolerance autonomous agents
  • Short-term limitations: Currently validates only toy problems, far from practical application
  • Niche applications: May suit edge devices, extreme environments, embedded systems

Reproducibility:

  • Advantages: Detailed method description, complete mathematical derivations
  • Challenges: Requires specific circuit simulation capability, high barrier for physical implementation
  • Code: Paper does not mention code release

Applicable Scenarios

Ideal application scenarios:

  1. Extremely energy-limited environments:
    • Micro autonomous robots
    • Long-term unattended sensors
    • Wearable devices
  2. High fault-tolerance requirements:
    • Extreme environments (radiation, high temperature)
    • Military applications
    • Space exploration
  3. Embedded intelligence:
    • IoT edge devices
    • Simple control tasks
    • Real-time response requirements

Inapplicable scenarios:

  1. Complex tasks requiring extensive historical memory
  2. High-dimensional state/action spaces
  3. Tasks requiring precise computation
  4. Rapid prototyping (long hardware manufacturing cycle)

Complementarity with digital RL:

  • Supplement rather than replacement
  • Digital RL suits complex tasks and rapid iteration
  • Physical RL suits deployment under specific constraints

References

  1. Dillavou et al. (2024): Machine learning without a processor: Emergent learning in a nonlinear analog network. PNAS. (Original CLLN paper)
  2. Stern et al. (2021): Supervised Learning in Physical Networks: From Machine Learning to Learning Machines. Physical Review X. (Coupled Learning theoretical framework)
  3. Scellier & Bengio (2017): Equilibrium Propagation: Bridging the Gap between Energy-Based Models and Backpropagation. Frontiers in Computational Neuroscience. (Theoretical foundation)
  4. Mak et al. (2007, 2010): Early work on analog circuit RL
  5. Stern et al. (2024): Training self-learning circuits for power-efficient solutions. APL Machine Learning. (Power efficiency optimization)

Overall Assessment: This pioneering work is the first to apply physical learning networks to reinforcement learning, with significant theoretical and potential practical value. Although validated only on simple simulated tasks and still far from a fully autonomous physical learning system, it opens new research directions for energy-efficient and fault-tolerant autonomous agents. The paper's primary value lies in clarifying the design space, constraints, and unique advantages of physical learning systems, laying a foundation for subsequent research. Future work should advance hardware implementation, task complexity, and methodological refinement.