Physical Reinforcement Learning

Dillavou, Mishra

Digital computers are power-hungry and largely intolerant of damaged components, making them potentially difficult tools for energy-limited autonomous agents in uncertain environments. Recently developed Contrastive Local Learning Networks (CLLNs) - analog networks of self-adjusting nonlinear resistors - are inherently low-power and robust to physical damage, but were constructed to perform supervised learning. In this work we demonstrate success on two simple RL problems using Q-learning adapted for simulated CLLNs. Doing so makes explicit the components (beyond the network being trained) required to enact various tools in the RL toolbox, some of which (policy function and value function) are more natural in this system than others (replay buffer). We discuss assumptions such as the physical safety that digital hardware requires, CLLNs can forgo, and biological systems cannot rely on, and highlight secondary goals that are important in biology and trainable in CLLNs, but make little sense in digital computers.

Basic Information

Paper ID: 2511.17789
Title: Physical Reinforcement Learning
Authors: Sam Dillavou (University of Pennsylvania), Shruti Mishra (University of Cambridge)
Classification: cs.LG (Machine Learning), cond-mat.dis-nn (Condensed Matter - Disordered Systems and Neural Networks)
Publication Date: November 21, 2025 (arXiv v1)
Paper Link: https://arxiv.org/abs/2511.17789

Abstract

While digital computers are powerful, they consume a lot of energy and tolerate component damage poorly, which makes them hard to use as autonomous agents in energy-limited, uncertain environments. This paper investigates Contrastive Local Learning Networks (CLLNs), analog networks of self-adjusting nonlinear resistors, for reinforcement learning tasks. CLLNs are inherently low-power and robust to physical damage, but had previously been built for supervised learning. The authors adapt Q-learning to simulated CLLNs, solve two simple reinforcement learning problems, and make explicit which components beyond the trained network are needed to implement various tools in the RL toolkit. Policy and value functions are implemented naturally in this system, while experience replay buffers are not.

Research Background and Motivation

1. Core Problem

Digital computers face two fundamental weaknesses in reinforcement learning applications:

Poor fault tolerance: Damage to a single transistor can cause system-wide failure, as the function of each component is inherently tied to its position in the system
High energy consumption: Laptop CPUs consume approximately 50W, stemming from the high energy cost of maintaining "perfect" operation and data transfer between processing and storage

2. Problem Significance

For autonomous agents in energy-limited environments, low power consumption and fault tolerance are critical. Biological systems excel in these aspects:

The human brain consumes only about 20 W in total while performing perception, cognition, motor control, and other tasks
The brain can withstand significant damage and continue operating, including single neuron destruction, traumatic brain injury, and even brain region removal
This robustness stems from distributed processing and emergent computation, rather than linear computation

3. Limitations of Existing Approaches

Few examples exist of artificial non-digital hardware applied to RL tasks
Many digitally-enhanced or simulated analog systems have been used for RL, but few hardware demonstrations combine distributed storage, distributed computation, and analog signals
Recently developed CLLNs are low-power and fault-tolerant, but have not yet been validated in RL scenarios

4. Research Motivation

Explore the potential of CLLNs in RL applications, opening pathways for energy-efficient and fault-tolerant autonomous agents
Clarify which RL tools are natural for self-learning networks and which require additional pre-programmed hardware
Understand the additional challenges faced when placing an agent's "brain" outside the digital domain

Core Contributions

First application of CLLNs to reinforcement learning: Successfully adapted Q-learning to simulated CLLNs, enabling RL capabilities for physical learning networks
Validation on two RL tasks:
- Four-state, four-action Markov Decision Process (MDP)
- Nine-state (3×3 grid) four-action navigation task
- An optimal policy was reached in 8 of 10 trials on each task; the remaining 2 trials on the four-state task were near-optimal
Clarification of design considerations for physical learning systems:
- Identified RL components naturally implementable in CLLNs (policy functions, value functions)
- Identified components requiring additional hardware support (experience replay buffers)
- Revealed constraints unique to physical systems (bounded parameters, non-feedforward structure)
Proposed unique advantages of physical learning systems:
- Low-power operation can be further optimized through modified learning algorithms
- Online recovery capability after damage
- Ability to train secondary objectives (e.g., power consumption, robustness) that are meaningless in digital systems

Detailed Methodology

Task Definition

Task 1: Four-State, Four-Action MDP

State space: 4 discrete states S₁, S₂, S₃, S₄
Action space: 4 discrete actions A₀, A₁, A₂, A₃
State transitions: Deterministic; choosing the i-th action always leads to the i-th state, regardless of the current state
Rewards: A fixed reward matrix R(St, At), with entries drawn from N(0.1, 0.1), plus per-step noise drawn from N(0, 0.01)
Objective: Learn optimal policy to maximize cumulative rewards
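
The task definition above can be sketched as a small environment. This is a minimal illustration, not the authors' code: states and actions are 0-based indices here, the second parameter of N(0.1, 0.1) and N(0, 0.01) is assumed to be a standard deviation, and the names make_reward_matrix and step are hypothetical.

```python
import random

N_STATES = 4
N_ACTIONS = 4

def make_reward_matrix(rng):
    # One fixed reward per (state, action) pair, drawn once from N(0.1, 0.1).
    # Assumption: 0.1 is used as the standard deviation.
    return [[rng.gauss(0.1, 0.1) for _ in range(N_ACTIONS)]
            for _ in range(N_STATES)]

def step(state, action, rewards, rng):
    """Return (next_state, noisy_reward) for one deterministic transition."""
    next_state = action  # the i-th action leads to the i-th state
    reward = rewards[state][action] + rng.gauss(0.0, 0.01)  # per-step noise
    return next_state, reward

rng = random.Random(0)
rewards = make_reward_matrix(rng)
next_state, reward = step(0, 2, rewards, rng)
```

Because the transition ignores the current state, the only thing the agent has to learn is which action is worth taking from each state given the rewards that follow.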

Task 2: Nine-State Navigation Task

State space: 9 positions on a 3×3 grid
Action space: 4 directional movements (up, down, left, right)
Reward structure: Large reward at target position (top-left corner), small reward gradient at other positions (5000× smaller)
Objective: Learn to navigate to high-reward position

Model Architecture

CLLN Fundamentals

CLLNs are networks of self-adjusting resistive elements whose individual dynamics approximate gradient descent on a global loss function.

Network Structure:

Nodes divided into input nodes (yellow in the paper's network figure) and output nodes (blue)
Inputs: Data encoded by forcing node voltages V₁, ..., V₄
Outputs: Equilibrium voltage values O₁, ..., O₄ as network computation results
Network functions as: F(V₁, V₂, V₃, V₄) ≡ (O₁, O₂, O₃, O₄)

Conductance Model: Each conductive element is a MOSFET operating in the triode (ohmic) region, with conductance:

Gᵢ = S (V_{G,i} - V_T - V̄)

Where:

S = 1 (constant)
V_T = 0.7 (threshold voltage)
V_{G,i}: Adjustable gate voltage (acting as the learned weight)
V̄: Average of the voltages at the two nodes the edge connects (this voltage dependence makes the element nonlinear)
Parameter range constraint: 1.0 < V_{G,i} < 5.5
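
As a rough illustration of the conductance formula above, the sketch below evaluates it for a single edge. The function names are hypothetical, and two choices are assumptions rather than details from the paper: the gate voltage is clipped to the closed interval [1.0, 5.5], and the conductance is floored at zero when the expression goes negative.

```python
S = 1.0                      # conductance scale quoted above
V_T = 0.7                    # threshold voltage quoted above
VG_MIN, VG_MAX = 1.0, 5.5    # allowed gate-voltage range

def clip_gate_voltage(v_gate):
    # Assumption: enforce the parameter bounds by simple clipping.
    return min(max(v_gate, VG_MIN), VG_MAX)

def edge_conductance(v_gate, v_node_a, v_node_b):
    """Conductance of one edge given its gate voltage and its two node voltages."""
    v_bar = 0.5 * (v_node_a + v_node_b)          # average node voltage
    g = S * (clip_gate_voltage(v_gate) - V_T - v_bar)
    return max(g, 0.0)   # assumption: element is treated as off below threshold

g_example = edge_conductance(1.5, 1.0, 0.0)
```

Here g_example comes out at roughly 0.3. Raising either node voltage lowers the conductance, which is the source of the nonlinearity.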

Contrastive Learning Mechanism

The learning process requires comparing two different states:

Free State:
- Only inputs V₁, ..., V₄ applied
- Each resistor experiences voltage drop ΔVᶠᵢ
- Outputs are Oᶠₙ
Clamped State:
- Inputs and desired outputs (labels) applied
- Voltage drop is ΔVᶜᵢ
- Outputs pushed toward labels: Oᶜₙ = Oᶠₙ(1-η) + ηLₙ (η=0.1 in this work)

Local Learning Rule:

The system performs gradient descent on the contrastive function (difference in dissipated power between clamped and free states):

δGᵢ = -α ∂/∂Gᵢ [Pᶜ - Pᶠ]

Since element i dissipates power Gᵢ(ΔVᵢ)², evaluating this derivative gives a fully local learning rule:

δGᵢ = α[(ΔVᶠᵢ)² - (ΔVᶜᵢ)²]

Key feature: Each element only needs to measure its own voltage drop in both states for updating, achieving decentralized training.
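
To make the local rule concrete, here is a minimal sketch on the smallest possible network: a two-resistor voltage divider with one input, one output, and ground. It is an illustration under simplifying assumptions, not the paper's simulation: the conductances are plain linear values updated directly (no MOSFET model or gate voltages), and the names divider_output and train_divider, the value α = 1, and the small floor on the conductances are all hypothetical choices.

```python
def divider_output(g1, g2, v_in=1.0):
    # Series divider: v_in --g1-- output --g2-- ground.
    return v_in * g1 / (g1 + g2)

def train_divider(label, g1=1.0, g2=1.0, v_in=1.0,
                  eta=0.1, alpha=1.0, steps=500):
    """Apply the contrastive local rule repeatedly to both conductances."""
    for _ in range(steps):
        # Free state: only the input is applied.
        o_free = divider_output(g1, g2, v_in)
        # Clamped state: output nudged toward the label by a fraction eta.
        o_clamped = o_free * (1.0 - eta) + eta * label
        # Voltage drops across (g1, g2) in each state.
        dv_free = (v_in - o_free, o_free)
        dv_clamped = (v_in - o_clamped, o_clamped)
        # Each element uses only its own two voltage drops.
        g1 += alpha * (dv_free[0] ** 2 - dv_clamped[0] ** 2)
        g2 += alpha * (dv_free[1] ** 2 - dv_clamped[1] ** 2)
        # Assumption: keep conductances positive with a small floor.
        g1, g2 = max(g1, 1e-3), max(g2, 1e-3)
    return g1, g2

g1, g2 = train_divider(label=0.8)
final_output = divider_output(g1, g2)
```

Starting from an output of 0.5, the updates raise g1 and lower g2 until the free output settles close to the label of 0.8, at which point the free and clamped states coincide and the updates vanish.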

Q-Learning Adaptation Scheme

State Encoding

States S₁...S₄ encoded as input voltage vectors:
- S₁: 1, 0, 1, 0 V
- S₂: 0, 1, 0, 1 V
- S₃: 1, 1, 0, 0 V
- S₄: 0, 0, 1, 1 V

Action Selection

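ε-greedy policy: ε decays linearly from 0.05 to 0 over training (0.1 to 0 in Task 2)
With probability 1-ε, select the action whose output is the largest of the four; with probability ε, select a random action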
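
The action selection described above might look like the following sketch. The helper names epsilon_at and select_action are hypothetical, and drawing the exploratory action uniformly from all four actions is an assumption (it is the usual ε-greedy convention).

```python
import random

def epsilon_at(step, total_steps, eps_start=0.05, eps_end=0.0):
    """Linearly interpolate epsilon from eps_start to eps_end over training."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return eps_start + frac * (eps_end - eps_start)

def select_action(outputs, epsilon, rng):
    """Pick a random action with probability epsilon, else the largest output."""
    if rng.random() < epsilon:
        return rng.randrange(len(outputs))   # assumption: uniform exploration
    return max(range(len(outputs)), key=lambda i: outputs[i])

rng = random.Random(0)
greedy_action = select_action([0.1, 0.7, 0.3, 0.2], epsilon=0.0, rng=rng)
```

Note that both the random draw and the max operation happen outside the network, which is why they appear later among the components needing external hardware.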

Q-Value Update

Training label for the selected action (reward plus discounted, mean-subtracted future value):

Lₜ = R(Sₜ, Aₜ) + γ[max(F(Sₜ₊₁)) - mean(F(Sₜ₊₁))]

Where:

γ = 0.5 (discount factor)
Subtracting mean term improves performance, providing additional flexibility for small networks
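
A short sketch of the label computation above, with γ = 0.5. The function name q_target and the example numbers are hypothetical; next_outputs stands for the four network outputs obtained when the next state is applied as input.

```python
GAMMA = 0.5  # discount factor quoted above

def q_target(reward, next_outputs, gamma=GAMMA):
    """Reward plus discounted (max - mean) of the next-state outputs."""
    mean_next = sum(next_outputs) / len(next_outputs)
    return reward + gamma * (max(next_outputs) - mean_next)

label = q_target(0.2, [0.1, 0.4, 0.2, 0.3])
```

For these example numbers the max is 0.4 and the mean is 0.25, so the label is about 0.2 + 0.5 × 0.15 = 0.275. Subtracting the mean keeps the label small even when all outputs share a common offset, which helps when outputs are bounded.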

Training Procedure

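1. System is in state Sₜ and selects action Aₜ
2. Environment returns reward Rₜ and transitions to Sₜ₊₁
3. Calculate Lₜ from the reward and the network outputs for Sₜ₊₁
4. Train network:
- Free state: Apply Sₜ as input
- Clamped state: Apply Sₜ as input, hold the unselected action outputs at their free-state values Oᵢ, and use Lₜ as the label for the selected action's output
5. Updates are applied in batches, every 50 steps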
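
The procedure above can be sketched end to end if the CLLN is replaced by a stand-in. In the sketch below a plain lookup table plays the role of the network F, so it shows only the control flow (store the state, compute the label, nudge the selected output, apply updates in batches), not the physics. Everything here is an assumption for illustration: the table, the averaging of accumulated updates over the batch, the reward matrix, and the name run.

```python
import random

N, GAMMA, ETA, BATCH = 4, 0.5, 0.1, 50

def run(total_steps=2000, seed=0):
    rng = random.Random(seed)
    rewards = [[rng.gauss(0.1, 0.1) for _ in range(N)] for _ in range(N)]
    q = [[0.0] * N for _ in range(N)]        # stand-in for the network F
    pending = [[0.0] * N for _ in range(N)]  # updates accumulated in a batch
    state = 0
    for t in range(total_steps):
        eps = 0.05 * (1.0 - t / total_steps)     # linear decay toward 0
        outputs = q[state]                       # free state: apply S_t
        if rng.random() < eps:
            action = rng.randrange(N)
        else:
            action = max(range(N), key=lambda a: outputs[a])
        next_state = action                      # deterministic transition
        reward = rewards[state][action] + rng.gauss(0.0, 0.01)
        nxt = q[next_state]                      # outputs for S_{t+1}
        label = reward + GAMMA * (max(nxt) - sum(nxt) / N)
        # Clamped state: the stored S_t is re-applied, the selected output
        # is nudged toward the label, the other outputs are left unchanged.
        clamped = outputs[action] * (1.0 - ETA) + ETA * label
        pending[state][action] += clamped - outputs[action]
        if (t + 1) % BATCH == 0:
            for s in range(N):
                for a in range(N):
                    # Assumption: average the accumulated nudges over the batch.
                    q[s][a] += pending[s][a] / BATCH
                    pending[s][a] = 0.0
        state = next_state
    policy = [max(range(N), key=lambda a: q[s][a]) for s in range(N)]
    return q, policy

q_table, greedy_policy = run()
```

The line that reads q[state] after the environment has already moved on corresponds to the step of storing and re-applying Sₜ, which a physical network cannot do without extra hardware.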

Technical Innovations

Q-learning adapted to physical constraints:
- Handling bounded parameters and outputs
- Designing rewards and discount factors for desired output generation
Training strategy for non-feedforward networks:
- In CLLNs, voltage or resistance changes anywhere can affect all outputs
- Training keeps unselected outputs static to avoid interference
Temporal backtracking mechanism:
- After environment transitions to St+1, must store and reapply St for updates
- This is the "non-natural" step for physical systems
Architecture adaptation:
- Task 1: Uses cyclically connected network shown in Figure 2
- Task 2: Uses densely connected network with 44 edges (6-4-4-1 layer structure, but non-feedforward)

Experimental Setup

Datasets

Task 1: Four-State MDP

Reward matrix: Sampled from N(0.1, 0.1), fixed for all trials
Reward noise: N(0, 0.01)
Optimal policy: Cycle through all four states
Total possible policies: 4⁴ = 256
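
With only 256 deterministic policies, the optimal and worst policies used as baselines can be found by brute force. The sketch below is one way to do that under stated assumptions: policies are ranked by long-run mean reward per step starting from state 0, the zero-mean reward noise is ignored, and the reward matrix is a random illustrative draw, not the paper's fixed matrix (so the best policy found here need not be the four-state cycle). The name long_run_reward is hypothetical.

```python
import itertools
import random

N = 4

def long_run_reward(policy, rewards, start=0):
    """Mean reward per step on the cycle that `policy` eventually enters."""
    seen, path, state = {}, [], start
    while state not in seen:
        seen[state] = len(path)
        path.append(state)
        state = policy[state]        # the i-th action leads to the i-th state
    cycle = path[seen[state]:]       # states visited forever after
    return sum(rewards[s][policy[s]] for s in cycle) / len(cycle)

rng = random.Random(0)
rewards = [[rng.gauss(0.1, 0.1) for _ in range(N)] for _ in range(N)]

# A policy assigns one of 4 actions to each of 4 states: 4**4 = 256 in total.
policies = list(itertools.product(range(N), repeat=N))
best = max(policies, key=lambda p: long_run_reward(p, rewards))
worst = min(policies, key=lambda p: long_run_reward(p, rewards))
```

Because transitions are deterministic, every policy ends up on a fixed cycle of states, and its long-run reward is just the average reward around that cycle.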

Task 2: Nine-State Navigation

3×3 grid world
Large reward at target position (top-left corner)
Reward gradient at other positions (5000× smaller, invisible in heatmaps)
Random position reset every 5 steps
No reward noise

Evaluation Metrics

Average reward: Computed over logarithmically-spaced intervals (minimum 10 steps)
Policy quality: Comparison with optimal/worst policies
Success rate: Proportion of trials reaching optimal or near-optimal policy
State visitation distribution: Time proportion agent spends in each state after training
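
The average-reward metric above can be sketched as follows. This is one plausible reading of "logarithmically-spaced intervals (minimum 10 steps)", not the authors' exact binning: bin edges grow geometrically up to the total number of steps, every bin except possibly the last is at least 10 steps wide, and the names log_spaced_edges and binned_means are hypothetical.

```python
def log_spaced_edges(total_steps, n_bins, min_width=10):
    """Bin edges that grow geometrically from 0 up to total_steps."""
    edges = [0]
    for k in range(1, n_bins + 1):
        target = round(total_steps ** (k / n_bins))
        # Enforce the minimum bin width, but never run past the end.
        edge = min(max(target, edges[-1] + min_width), total_steps)
        if edge > edges[-1]:
            edges.append(edge)
    return edges

def binned_means(rewards, edges):
    """Average reward inside each [start, stop) interval."""
    return [sum(rewards[a:b]) / (b - a) for a, b in zip(edges, edges[1:])]

edges = log_spaced_edges(total_steps=1000, n_bins=6)
means = binned_means([1.0] * 1000, edges)
```

Log spacing gives fine resolution early in training, where reward changes quickly, and smooths over the long tail where the policy has mostly settled.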

Implementation Details

General Settings:

Initialization: VG,i ~ N(1.5, 0.1)
Learning rate α: Not explicitly specified, implicitly determined by physical process
Batch updates: Every 50 steps
Parameter range: 1.0 < VG,i < 5.5

Task 1:

Training steps: 100,000
Number of trials: 10
ε decay: 0.05 → 0 (linear)
Discount factor: γ = 0.5
Clamping parameter: η = 0.1

Task 2:

Training steps: 300,000
Number of trials: 10
ε decay: 0.1 → 0 (linear)
State reset frequency: Every 5 steps
Input encoding: Row and column coordinates rescaled to 0, 0.5, 1, plus inverted values and two constant nodes
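
The Task 2 input encoding above could be written as the sketch below. The ordering of the six voltages and the values of the two constant nodes (0 and 1 here) are assumptions, since the summary does not specify them; the name encode_position is hypothetical.

```python
def encode_position(row, col, grid_size=3, const_low=0.0, const_high=1.0):
    """Map a grid cell to six input voltages.

    Coordinates are rescaled to {0, 0.5, 1}, followed by their inverted
    values and two constant inputs (assumed values, assumed ordering).
    """
    scale = grid_size - 1
    r, c = row / scale, col / scale
    return [r, c, 1.0 - r, 1.0 - c, const_low, const_high]

top_left = encode_position(0, 0)
middle_right = encode_position(1, 2)
```

Under these assumptions the top-left cell maps to [0, 0, 1, 1, 0, 1] and the middle-right cell to [0.5, 1, 0.5, 0, 0, 1]. Supplying inverted copies means every coordinate value, including 0, drives at least one input with a nonzero voltage.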

Experimental Results

Main Results

Task 1: Four-State MDP

Success rate: 8 out of 10 trials reached optimal policy, remaining 2 achieved near-optimal
Learning curves (Figure 3B):
- All trials (purple lines) show stable reward growth
- Average reward (black line) rapidly converges to optimal policy level
- Final performance approaches theoretical optimum (black dashed line)
- Significantly outperforms worst policy (lower dashed line)

Task 2: Nine-State Navigation

Success rate: 8 out of 10 trials found one of multiple equivalent optimal policies
Learning curves (Figure 4B):
- Steady reward growth
- Fully reaches optimal policy line only in late training (ε→0)
- Average performance (black line) shows consistent learning progress

State Visitation Analysis (Figure 4C):

10 agents tested over 10,000 steps with ε=0 after training
Spend most time in high-reward square (top-left corner)
Heatmap shows agents successfully learned to navigate to target position

Experimental Findings

Learning Stability:
- Both tasks show stable learning processes
- Consistent results across multiple trials with random initialization
- No catastrophic forgetting or training collapse observed
Impact of Physical Constraints:
- Bounded parameters require careful reward and discount factor design
- Subtracting mean term (in Lt calculation) significantly improves small network performance
Adaptation to Non-Feedforward Structure:
- Strategy of keeping unselected action outputs static during training is effective
- This constraint has limited impact on simple tasks, but effects on complex policies require further study
Necessity of Temporal Backtracking:
- Requires storing and reapplying previous state St
- This is "non-natural" for physical systems, potentially avoidable through hybrid state construction in future work

Related Work

Analog and Neuromorphic RL Systems

Mak et al. (2007, 2010): CMOS current-mode dynamic programming circuits, early hardware RL attempts
Mikaitis et al. (2018): Neuromodulated synaptic plasticity on SpiNNaker neuromorphic system
Limitations: Mostly digitally-enhanced or simulated analog systems, lacking true distributed storage and analog signal computation

Physical Learning Systems

Coupled Learning framework (Stern et al., 2021): Theoretical foundation for CLLNs
Equilibrium Propagation (Scellier & Bengio, 2017): Bridge between energy-based models and backpropagation
Contrastive Hebbian Learning (Movellan, 1991): Early theory of contrastive learning

Dillavou et al. (2024): First experimental demonstration of CLLNs for supervised learning
Stern et al. (2024): Training CLLNs for low-power solutions
Dillavou et al. (2022): Demonstrating decentralized physics-driven learning and fault tolerance
Dillavou et al. (2025): Understanding and embracing imperfections in physical learning networks

Biological Learning Systems

Brain fault tolerance (Wang et al., 2014; Chua et al., 2007; Granovetter et al., 2022)
Low-power operation (Balasubramanian, 2021)
Natural primitives (Mead, 1990)

Advantages of This Work

First RL application: First work implementing RL on CLLNs
Analog learning: Weight updates occur in a distributed, analog manner inside the network, although action selection, state storage, and batching still rely on external control (see Limitations)
Systematic analysis: Clarifies design considerations and constraints of physical learning systems

Conclusions and Discussion

Main Conclusions

Feasibility validation: CLLNs can successfully execute reinforcement learning tasks, achieving near-optimal performance on simple MDPs and navigation problems
Natural component identification:
- Policy functions and value functions can be naturally implemented in a single network
- Historical storage methods like experience replay buffers require substantial control hardware, deviating from the "wild network" vision
Physical constraints clarified:
- Bounded parameters and outputs
- Non-feedforward structure
- Temporal backtracking mechanism required
Unique advantages:
- Low power consumption can be further optimized through modified learning methods
- Can be retrained after damage
- Can train secondary objectives (power, robustness, transmission speed)

Limitations

Limited task complexity:
- Validated only on very simple tasks (4-state and 9-state)
- Impact of non-feedforward structure on complex policies remains unclear
Still requires external control:
- Randomization and max function in ε-greedy algorithm require external hardware
- Temporal backtracking requires state storage
- Batch updates require coordination
Simulation limitations:
- Avoided component imperfections and bias issues in simulation
- Physical implementation will face measurement noise and component variation
- In simulation, energy consumption is not tied to actual resistances and currents, so the power advantage is not measured
Lack of historical memory:
- Difficult to naturally implement eligibility traces or experience replay
- Limits range of applicable RL algorithms
Unknown scalability:
- Performance on larger networks and complex tasks untested
- Extensibility of state and action spaces unclear

Future Directions

Avoiding temporal backtracking:
- Explore hybrid state construction (involving St+1 and L)
- Develop more natural physical learning procedures
Online recovery architecture:
- Design architectures and algorithms allowing immediate recovery after damage
- Leverage CLLNs' retraining capability
Secondary objective optimization:
- Modify learning algorithms to favor low-power solutions
- Train networks for improved physical damage robustness
- Optimize input-output transmission speed
Physical implementation:
- Hardware demonstration to validate simulation results
- Handle component imperfections and bias
- Measure actual power consumption and fault tolerance
Complex task extension:
- Larger state and action spaces
- Continuous control tasks
- Multi-agent scenarios
Learning the learning algorithm:
- Train CLLNs to perform necessary control functions (randomization, max function)
- Explore meta-learning approaches

In-Depth Evaluation

Strengths

Pioneering work:
- First application of CLLNs to RL, opening new research direction for physical reinforcement learning
- Provides alternative paradigm beyond digital RL
Theoretical clarity:
- Detailed derivation of local learning rules (Equations 1-4)
- Clear explanation of contrastive learning mechanism
- Rigorous mathematical formulation
Systematic analysis:
- Clear distinction between natural components and those requiring external support
- Discussion of constraints and unique advantages specific to physical systems
- Insightful comparisons with digital and biological systems
Reasonable experimental design:
- Progressive task complexity from simple to moderately complex
- Multiple trials (10) validate stability
- Comparison with theoretical optimal/worst policies
Honest limitation discussion:
- Acknowledges differences between simulation and physical implementation
- Clearly identifies parts requiring external control
- Discusses unknown scalability
Interdisciplinary perspective:
- Combines physics, machine learning, and neuroscience
- Proposes secondary objectives meaningful in physical/biological systems but meaningless in digital systems

Weaknesses

Overly simple tasks:
- 4-state MDP and 3×3 grid are toy problems
- Lacks validation on more complex, realistic tasks
- Scalability is critical open question
Still dependent on external control:
- ε-greedy, max function, batch updates all require external hardware
- The system remains some distance from a "fully autonomous physical learning system"
- Temporal backtracking mechanism is unnatural
Simulation-only results:
- No physical hardware implementation
- Cannot verify key advantages like power consumption and fault tolerance
- Impact of component imperfections unknown
Limited methodological scope:
- Only Q-learning attempted
- Other RL methods (policy gradient, Actor-Critic) unexplored
- No direct performance comparison with digital Q-learning
Limited depth of analysis:
- No ablation studies on the impact of design choices
- Hyperparameter sensitivity not studied
- Learning dynamics only lightly analyzed
Single evaluation metric:
- Primarily focuses on average reward
- Lacks analysis of sample efficiency, convergence speed
- No computational cost (simulation time) comparison

Impact

Contribution to field:

Opens new direction: Introduces RL capabilities to physical computing and neuromorphic computing fields
Theoretical value: Clarifies design space and constraints of physical learning systems
Inspirational: Proposes comparative framework for digital, physical, and biological learning systems

Practical value:

Long-term potential: Provides direction for energy-limited, high fault-tolerance autonomous agents
Short-term limitations: Currently validates only toy problems, far from practical application
Niche applications: May suit edge devices, extreme environments, embedded systems

Reproducibility:

Advantages: Detailed method description, complete mathematical derivations
Challenges: Requires specific circuit simulation capability, high barrier for physical implementation
Code: Paper does not mention code release

Applicable Scenarios

Ideal application scenarios:

Extremely energy-limited environments:
- Micro autonomous robots
- Long-term unattended sensors
- Wearable devices
High fault-tolerance requirements:
- Extreme environments (radiation, high temperature)
- Military applications
- Space exploration
Embedded intelligence:
- IoT edge devices
- Simple control tasks
- Real-time response requirements

Inapplicable scenarios:

Complex tasks requiring extensive historical memory
High-dimensional state/action spaces
Tasks requiring precise computation
Rapid prototyping (long hardware manufacturing cycle)

Complementarity with digital RL:

Supplement rather than replacement
Digital RL suits complex tasks and rapid iteration
Physical RL suits deployment under specific constraints

References

Dillavou et al. (2024): Machine learning without a processor: Emergent learning in a nonlinear analog network. PNAS. (Original CLLN paper)
Stern et al. (2021): Supervised Learning in Physical Networks: From Machine Learning to Learning Machines. Physical Review X. (Coupled Learning theoretical framework)
Scellier & Bengio (2017): Equilibrium Propagation: Bridging the Gap between Energy-Based Models and Backpropagation. Frontiers in Computational Neuroscience. (Theoretical foundation)
Mak et al. (2007, 2010): Early work on analog circuit RL
Stern et al. (2024): Training self-learning circuits for power-efficient solutions. APL Machine Learning. (Power efficiency optimization)

Overall Assessment: This pioneering work is the first to apply physical learning networks to reinforcement learning, with significant theoretical value and potential practical value. Although it is validated only on simple simulated tasks and remains distant from a fully autonomous physical learning system, it opens new research directions for energy-efficient and fault-tolerant autonomous agents. The paper's primary value lies in clarifying the design space, constraints, and unique advantages of physical learning systems, laying a foundation for subsequent research. Future work should advance hardware implementation, task complexity, and methodological refinement.