A growing body of research has demonstrated that the behavior of large language models can be effectively controlled at inference time by directly modifying their internal states, either through vector additions to their activations or through updates to their weight matrices. These techniques, while powerful, are often guided by empirical heuristics, such as deriving steering vectors from the average activations of contrastive prompts. This work provides a theoretical foundation for these interventions, explaining how they emerge from the fundamental computations of the transformer architecture. Building on the recent finding that a prompt's influence can be mathematically mapped to implicit weight updates (Dherin et al., 2025), we generalize this theory to deep, multi-block transformers. We show how the information contained in any chunk of a user prompt is represented and composed internally through weight vectors and weight matrices. We then derive a principled method for condensing this information into token-independent thought vectors and thought matrices. These constructs provide a theoretical explanation for existing vector- and matrix-based model editing techniques and offer a direct, computationally-grounded method for transmuting textual input into reusable weight updates.
This paper provides theoretical foundations for inference-time control techniques in large language models. Existing research shows that model behavior can be controlled by directly modifying internal states, either by adding vectors to activations or by updating weight matrices. However, these techniques typically rely on empirical heuristics without theoretical support. Building on the finding that a prompt's effect can be mathematically mapped to implicit weight updates (Dherin et al., 2025), the paper generalizes the theory to deep, multi-block transformers. It shows how the information in any chunk of a user prompt is represented and composed internally through weight vectors and weight matrices, and it derives a principled method for condensing this information into token-independent "thought vectors" and "thought matrices."
The core question addressed by this research is why existing model intervention techniques (such as activation steering and model editing) can effectively control complex model behavior, and what mathematical principles underlie them.
Theoretical Gap: Vector steering and matrix editing are highly effective in practice but lack a theoretical explanation grounded in the transformer architecture
Method Limitations: Existing approaches rely primarily on empirical heuristics, such as constructing steering vectors by averaging activations over contrastive prompts
Need for Unified Framework: A unified theoretical framework is needed to explain how textual instructions translate into concrete weight or activation changes
Given a prompt C = (I, x₁, ..., xₙ) consisting of an instruction block I followed by content tokens x₁, ..., xₙ, the goal is to find weight updates such that the updated model's outputs on x₁, ..., xₙ alone (with I removed) match the original model's outputs on the full prompt C.
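The problem statement above can be illustrated for a single weight matrix. The sketch below assumes the rank-one form of the implicit update associated with the single-block result of Dherin et al. (2025), ΔW = (W δ) aᵀ / ‖a‖², where a is the activation entering W when the instruction is removed and a + δ is the activation when it is present. The names W, a, delta, and implicit_update are illustrative, random vectors stand in for real activations, and this is not the paper's multi-block construction.

```python
import numpy as np

def implicit_update(W, a, delta):
    """Rank-one update dW such that (W + dW) @ a == W @ (a + delta).

    Assumed form: dW = (W @ delta) a^T / ||a||^2 (hypothetical helper).
    """
    return np.outer(W @ delta, a) / np.dot(a, a)

rng = np.random.default_rng(0)
d_in, d_out = 8, 5
W = rng.normal(size=(d_out, d_in))   # stand-in for a layer's weight matrix
a = rng.normal(size=d_in)            # activation for the content alone (I removed)
delta = rng.normal(size=d_in)        # activation shift caused by the instruction I

dW = implicit_update(W, a, delta)
with_instruction = W @ (a + delta)   # original weights, full prompt
patched = (W + dW) @ a               # patched weights, instruction removed

print(np.allclose(with_instruction, patched))
```

Note that dW depends on the activation a, so an update of this form is specific to one token position. This is the dependence that the token-independent thought vectors and thought matrices are meant to remove.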
Theoretical Unification: Successfully unifies empirical model control techniques within a theoretical framework grounded in transformer computation
Method Effectiveness: Experiments demonstrate the feasibility of thought patch methods on arithmetic and translation tasks
Theoretical Explanation: Provides mathematical foundations for existing heuristics; for example, averaging contrastive activations is shown to be the correct least-squares approximation
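The least-squares claim in the last point can be checked numerically in its simplest form: among all single vectors v, the mean of a set of per-token shifts δᵢ minimizes Σᵢ ‖v − δᵢ‖². The sketch below uses synthetic shifts rather than real model activations. The names deltas, thought_vector, and loss are illustrative, and the paper's exact objective may differ from this simplified one.

```python
import numpy as np

rng = np.random.default_rng(1)
n_tokens, d = 50, 6

# Synthetic per-token activation shifts: a shared direction plus noise.
shared = rng.normal(size=d)
deltas = shared + 0.1 * rng.normal(size=(n_tokens, d))

# Candidate token-independent vector: the plain average of the shifts.
thought_vector = deltas.mean(axis=0)

# Least-squares problem: find v minimizing sum_i ||v - deltas[i]||^2.
# Written as ones @ v ~= deltas and solved with np.linalg.lstsq.
ones = np.ones((n_tokens, 1))
solution, *_ = np.linalg.lstsq(ones, deltas, rcond=None)
v_lstsq = solution[0]

def loss(v):
    """Sum of squared distances from v to every per-token shift."""
    return np.sum((deltas - v) ** 2)

print(np.allclose(thought_vector, v_lstsq))
```

Because the objective is a convex quadratic in v, the average is its unique minimizer, so any other candidate vector gives a loss at least as large.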
Significant Theoretical Contribution: First to provide rigorous theoretical foundations for model control techniques, filling an important theoretical gap
Mathematical Rigor: Provides complete mathematical derivations and theorem proofs within a solid theoretical framework
Strong Unification: Successfully unifies seemingly different approaches (vector steering and matrix editing)
Practical Value: Methods are directly applicable, offering new perspectives for practical applications
Dherin et al. (2025) - Implicit dynamics of in-context learning, a theory for single-block transformers
Turner et al. (2025) - Activation engineering for steering language models
Meng et al. (2022) - Locating and editing factual associations in GPT
Todd et al. (2024) - Function vectors in large language models
Overall Assessment: This is a paper of significant theoretical value that successfully provides rigorous theoretical foundations for empirical model control techniques. While there is room for improvement in experimental validation, its theoretical contributions are important for understanding and developing transformer model control techniques.