
Transmuting prompts into weights

Mazzawi, Dherin, Munn et al.
A growing body of research has demonstrated that the behavior of large language models can be effectively controlled at inference time by directly modifying their internal states, either through vector additions to their activations or through updates to their weight matrices. These techniques, while powerful, are often guided by empirical heuristics, such as deriving steering vectors from the average activations of contrastive prompts. This work provides a theoretical foundation for these interventions, explaining how they emerge from the fundamental computations of the transformer architecture. Building on the recent finding that a prompt's influence can be mathematically mapped to implicit weight updates (Dherin et al., 2025), we generalize this theory to deep, multi-block transformers. We show how the information contained in any chunk of a user prompt is represented and composed internally through weight vectors and weight matrices. We then derive a principled method for condensing this information into token-independent thought vectors and thought matrices. These constructs provide a theoretical explanation for existing vector- and matrix-based model editing techniques and offer a direct, computationally-grounded method for transmuting textual input into reusable weight updates.

Transmuting Prompts into Weights

Basic Information

  • Paper ID: 2510.08734
  • Title: Transmuting prompts into weights
  • Authors: Hanna Mazzawi, Benoit Dherin, Michael Munn, Michael Wunder, Javier Gonzalvo (Google Research)
  • Classification: cs.LG (Machine Learning)
  • Publication Date: October 9, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.08734

Abstract

This paper provides theoretical foundations for inference-time control techniques in large language models. Existing research demonstrates that model behavior can be effectively controlled by directly modifying internal states—either by adding vectors to activations or updating weight matrices. However, these techniques typically rely on empirical heuristics lacking theoretical support. Building on the discovery that prompt effects can be mathematically mapped to implicit weight updates, this paper generalizes the theory to deep multi-block transformers. The paper demonstrates how any information block in a user prompt can be internally represented and combined through weight vectors and weight matrices, and derives principled methods for compressing this information into token-agnostic "thought vectors" and "thought matrices."

Research Background and Motivation

Problem Definition

The core problem addressed by this research is: Why do existing model intervention techniques (such as activation steering and model editing) effectively control complex model behavior? What are the underlying mathematical principles behind these techniques?

Significance

  1. Theoretical Gap: Vector steering and matrix editing are highly effective in practice, yet they lack an explanation grounded in the transformer architecture
  2. Method Limitations: Existing approaches primarily rely on empirical heuristics, such as constructing steering vectors through averaged activations from contrastive prompts
  3. Need for Unified Framework: A unified theoretical framework is needed to explain how textual instructions translate into concrete weight or activation changes

Limitations of Existing Methods

  1. Activation Steering Methods: Vector addition alone may not capture the full effect of an instruction
  2. Model Editing Methods: Lack strategies derived from first principles to compress general prompt information into reusable weight updates
  3. Insufficient Theoretical Explanation: The success of existing techniques lacks theoretical grounding in transformer computational mechanisms

Core Contributions

  1. Theoretical Extension: Extends token patching theory from single transformer blocks to deep multi-block transformer architectures
  2. Thought Patch Framework: Proposes methods for aggregating token-dependent transient patches into reusable weight updates
  3. Theoretical Unification: Provides unified theoretical explanations for existing vector steering and matrix editing techniques
  4. Practical Methods: Provides computational methods for directly converting text prompts into weight updates

Methodology Details

Task Definition

Given a prompt C = I, x₁, ..., xₙ consisting of an instruction block I followed by content tokens x₁, ..., xₙ, the goal is to find weight updates such that the model's output on the prompt with I removed (written C\I) matches its output on the complete prompt C.

Token Patching Theory

Single-Block Extension

Building on Dherin et al. (2025), the output of a single transformer block on the full prompt can be reproduced exactly from the prompt without I by applying the following token patch:

δₓ(I) = A(C, x) - A(C\I, x)                    (3)
∆ₓ(I) = δₓ(I)aₓᵀ / ||aₓ||²                    (4)

where A(C, x) denotes the attention output for token x given context C, and aₓ = A(C\I, x) is the attention output of token x without the instruction block I.
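The defining property of the patch in equations (3) and (4) can be checked numerically. The sketch below uses random vectors as stand-ins for the two attention outputs, and it assumes a hypothetical linear map `W` that consumes the attention output. `W` and the way the patch is folded into it are illustrative assumptions, not taken from the summary above.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
W = rng.normal(size=(d, d))    # hypothetical weight matrix applied to the attention output
a_full = rng.normal(size=d)    # stand-in for A(C, x), the full-context attention output
a_x = rng.normal(size=d)       # stand-in for A(C\I, x), the reduced-context attention output

delta_x = a_full - a_x                            # eq. (3): vector patch
Delta_x = np.outer(delta_x, a_x) / (a_x @ a_x)    # eq. (4): rank-1 matrix patch

# The matrix patch maps a_x exactly onto the shift delta_x.
assert np.allclose(Delta_x @ a_x, delta_x)

# Assumed consequence for a linear layer: folding the patch into W lets the
# reduced-context input reproduce the full-context result.
W_patched = W + W @ Delta_x
assert np.allclose(W_patched @ a_x, W @ a_full)
```

Because the patch is built from aₓ itself, it is exact only for the token x it was computed for, which is why the later sections look for token-independent replacements.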

Multi-Block Extension

For deep transformers, token patches must be recursively applied to each layer:

x⁽²⁾ = T⁽²⁾_patched ∘ T⁽¹⁾_patched (C⁽⁰⁾\I⁽⁰⁾, x⁽⁰⁾)

Each layer's patch is computed using transformed activations from the previous layer.
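A toy version of this recursion is sketched below. The block here is a deliberately simplified stand-in (single-head causal attention, a residual connection, and one linear map `W`), and all function names are hypothetical. It only illustrates that patching every layer on the reduced context reproduces the full-context activations, under those simplifying assumptions.

```python
import numpy as np

def attention(context, x):
    """Toy single-head attention: x attends to the context rows and to itself."""
    keys = np.vstack([context, x])
    scores = keys @ x / np.sqrt(x.size)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ keys

def block(W, context, x):
    """Toy block: residual connection plus a linear map of the attention output."""
    return x + W @ attention(context, x)

def patched_block(W, full_context, reduced_context, x):
    """Same block on the reduced context, with a rank-1 token patch added to W."""
    a = attention(reduced_context, x)
    delta = attention(full_context, x) - a
    W_patched = W + np.outer(W @ delta, a) / (a @ a)
    return x + W_patched @ a

def run_full(weights, seq):
    """Run all layers causally on the complete prompt."""
    for W in weights:
        seq = np.array([block(W, seq[:j], seq[j]) for j in range(len(seq))])
    return seq

def run_patched(weights, instr, rest):
    """Run all layers on the prompt without the instruction, patching each layer."""
    full = np.vstack([instr, rest])   # full-context activations, used only to build patches
    m = len(instr)
    reduced = rest.copy()
    for W in weights:
        # Patches at this layer use the activations produced by the previous layer.
        reduced = np.array([
            patched_block(W, full[:m + j], reduced[:j], reduced[j])
            for j in range(len(reduced))
        ])
        full = np.array([block(W, full[:j], full[j]) for j in range(len(full))])
    return reduced

rng = np.random.default_rng(0)
d, m, n = 4, 3, 5
weights = [0.1 * rng.normal(size=(d, d)) for _ in range(2)]   # two toy layers
instr = rng.normal(size=(m, d))   # instruction block I
rest = rng.normal(size=(n, d))    # remaining tokens x_1..x_n

full_out = run_full(weights, np.vstack([instr, rest]))
patched_out = run_patched(weights, instr, rest)
assert np.allclose(patched_out, full_out[m:])
```

Note that the sketch still needs the full-context activations to compute each patch, so it demonstrates equivalence rather than any saving in compute.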

Thought Patch Derivation

Thought Vector Approximation

Minimizing the summed squared error Σᵢ₌₁ⁿ ||δ - δᵢ||² over the n per-token patch vectors δᵢ shows that the optimal thought vector is their mean:

δ(I) = (1/n) Σᵢ₌₁ⁿ δᵢ
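As a quick numerical illustration of why the mean is the least-squares choice, the sketch below uses random vectors as hypothetical per-token patch vectors and checks that the gradient of the squared-error objective vanishes at the mean and that nearby candidates do worse.

```python
import numpy as np

rng = np.random.default_rng(0)
deltas = rng.normal(size=(10, 4))   # hypothetical per-token patch vectors delta_i

def squared_error(v):
    """Sum over tokens of ||v - delta_i||^2."""
    return np.sum((deltas - v) ** 2)

thought_vector = deltas.mean(axis=0)

# The gradient 2 * sum_i (v - delta_i) vanishes at the mean.
grad = 2 * np.sum(thought_vector - deltas, axis=0)
assert np.allclose(grad, 0)

# The objective is strictly convex, so every perturbed candidate has larger error.
for _ in range(100):
    candidate = thought_vector + 0.1 * rng.normal(size=4)
    assert squared_error(candidate) > squared_error(thought_vector)
```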

Thought Matrix Approximation

Theorem 3.1: Given n vectors a₁,...,aₙ with per-token patches ∆ᵢ = δᵢaᵢᵀ / ||aᵢ||² (so that ∆ᵢaᵢ = δᵢ), the minimization problem:

∆(I) = argmin_M Σᵢ₌₁ⁿ ||Maᵢ - ∆ᵢaᵢ||²        (7)

has a unique solution if and only if the operator Z = Σᵢ₌₁ⁿ aᵢaᵢᵀ is invertible, in which case the solution is:

∆(I) = (Σᵢ₌₁ⁿ δᵢaᵢᵀ) Z⁻¹                    (8)
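Equation (8) is the normal-equations solution of a linear least-squares problem, which can be cross-checked against a generic solver. In the sketch below the aᵢ and δᵢ are random stand-ins, and the array names `A`, `D`, `S` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5                     # n > d so that Z is invertible for generic data
A = rng.normal(size=(n, d))      # rows are the vectors a_i
D = rng.normal(size=(n, d))      # rows are the targets delta_i = Delta_i a_i

Z = A.T @ A                      # Z = sum_i a_i a_i^T
S = D.T @ A                      # S = sum_i delta_i a_i^T

# Eq. (8): thought matrix S Z^{-1}, computed with a linear solve (Z is symmetric).
thought_matrix = np.linalg.solve(Z, S.T).T

# Cross-check: lstsq minimizes ||A X - D||_F, and M = X^T solves eq. (7).
X, *_ = np.linalg.lstsq(A, D, rcond=None)
assert np.allclose(thought_matrix, X.T)
```

With fewer tokens than dimensions (n < d), Z is singular and the minimizer is no longer unique, which matches the invertibility condition in the theorem.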

Practical Approximation

If the vectors aᵢ are assumed to follow a spherical distribution, Z is approximately a scalar multiple of the identity matrix, so Z⁻¹ can be absorbed into a scalar λ. This gives the practical formula:

∆(I) = λ Σᵢ₌₁ⁿ δᵢaᵢᵀ
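The effect of the spherical assumption can be illustrated on synthetic data. The sketch below draws the aᵢ from an isotropic Gaussian with unit variance, in which case Z is close to n times the identity and λ = 1/n is a natural choice. That value of λ, the synthetic `M_true`, and the noise level are all assumptions of this example, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20000, 8
A = rng.normal(size=(n, d))                         # isotropic a_i with unit variance (assumed)
M_true = rng.normal(size=(d, d))                    # synthetic map generating the targets
D = A @ M_true.T + 0.01 * rng.normal(size=(n, d))   # rows are delta_i, with small noise

S = D.T @ A                                         # sum_i delta_i a_i^T
Z = A.T @ A                                         # sum_i a_i a_i^T, close to n * I here

exact = np.linalg.solve(Z, S.T).T                   # eq. (8)
approx = S / n                                      # practical formula with lambda = 1/n

rel_err = np.linalg.norm(approx - exact) / np.linalg.norm(exact)
assert rel_err < 0.2
```

When the aᵢ are far from isotropic, Z deviates from a multiple of the identity and the approximation can be much worse, so λ is better treated as a quantity to tune.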

Technical Innovations

  1. Theoretical Foundation: First to provide transformer architecture-based theoretical explanations for empirical model control techniques
  2. Unified Framework: Unifies vector steering and matrix editing within a single weight update mechanism
  3. Mathematical Rigor: Provides rigorous mathematical derivations and theorem proofs
  4. Practicality: Methods are directly applicable to real models without requiring backpropagation

Experimental Setup

Datasets

  1. Arithmetic Tasks: Synthetic datasets for three-digit addition and multiplication
  2. Machine Translation: English-French translation dataset ("mntn/en-fr")

Models

All experiments use the Gemma 3.0 1B model

Evaluation Metrics

  • Arithmetic Tasks: Accuracy (target ≥80%)
  • Machine Translation: Translation quality judged by Gemini 2.5 Flash-Lite

Implementation Details

  • Target Layers: Layers 10-20
  • Hyperparameters: c₁ and c₂ determined through tuning
  • Stability Improvements: Rank-1 updates stabilized through attention vector norm normalization

Experimental Results

Main Results

Arithmetic Tasks

  • Addition: Achieves 100% accuracy with fewer than 300 demonstration tokens
  • Multiplication: Achieves 80% accuracy, demonstrating method effectiveness on more complex tasks
  • Behavioral Observations: Patched models produce more detailed chain-of-thought reasoning

Machine Translation

  • Patched Model: Achieves 60% accuracy without instructions
  • Baseline Model: Achieves 72% accuracy with instructions
  • Performance Gap: The patched model trails the baseline by 12 percentage points, but the result demonstrates that the method is feasible

Key Findings

  1. Hyperparameter Sensitivity: Method is highly sensitive to hyperparameter c₁
    • Low c₁: Model simply repeats input
    • High c₁: Output becomes repetitive and unstable
  2. Baseline Outperformance: On some arithmetic problems, the patched model outperforms the baseline model that receives the instructions
  3. Language Confusion: In translation tasks, models sometimes default to incorrect target languages

Case Studies

Success Case (Addition):

  • Query: 2 9 2
  • Patched Model Output: "Okay, let's calculate the sum of 2 + 9 + 2: 2 + 9 + 2 = 13 So, the answer is 13."

Error Correction Case (Multiplication):

  • Baseline Model Error: 0 * 8 * 6 = 48
  • Patched Model Correct: 0 * 8 * 6 = 0

Related Work

Activation Steering Methods

  • Steering Vectors: Guide model behavior by adding carefully designed vectors to residual streams
  • Contrastive Methods: Construct vectors using activation differences between positive and negative sample prompts
  • Function Vectors: Vector representations that capture a specific task
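The contrastive construction listed above can be sketched in a few lines. The activations below are random stand-ins rather than values from a real model, and the `steer` helper and its `scale` parameter are hypothetical names introduced for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
pos_acts = rng.normal(loc=1.0, size=(32, d))    # stand-in activations for positive prompts
neg_acts = rng.normal(loc=-1.0, size=(32, d))   # stand-in activations for negative prompts

# Contrastive steering vector: difference of the mean activations.
steering_vector = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def steer(residual, vector, scale=1.0):
    """Add a scaled steering vector to residual-stream activations."""
    return residual + scale * vector

# Steering the negative activations moves their mean onto the positive mean.
steered = steer(neg_acts, steering_vector)
assert np.allclose(steered.mean(axis=0), pos_acts.mean(axis=0))
```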

Model Editing Methods

  • ROME: Modifies factual associations using rank-1 matrix edits
  • MEND: Learns low-rank updates to feedforward weight matrices
  • Safety Control: Removes unsafe activation directions through editing

Contributions of This Work

This work is presented as the first to provide a unified theoretical framework, derived from first principles, that explains why both classes of methods are effective.

Conclusions and Discussion

Main Conclusions

  1. Theoretical Unification: Successfully unifies empirical model control techniques within a transformer computation-based theoretical framework
  2. Method Effectiveness: Experiments demonstrate the feasibility of thought patch methods on arithmetic and translation tasks
  3. Theoretical Explanation: Provides mathematical foundations for existing heuristics; for example, averaging contrastive activations turns out to be the least-squares-optimal approximation

Limitations

  1. Performance Gap: The patched model loses accuracy relative to direct prompting on translation (60% vs. 72%)
  2. Hyperparameter Sensitivity: Method is highly sensitive to hyperparameter selection, requiring careful tuning
  3. Task Complexity: Performance on more complex tasks requires further verification
  4. Computational Complexity: Computing Z⁻¹ is challenging in the general case

Future Directions

  1. Analysis Tools: Use the framework as an analytical tool for better understanding task representations and reasoning in large language models
  2. Performance Improvement: Research methods to reduce performance gaps and lower hyperparameter sensitivity
  3. Extended Applications: Explore applications to more complex tasks
  4. Theory Refinement: Further develop the theoretical framework to handle more general cases

In-Depth Evaluation

Strengths

  1. Significant Theoretical Contribution: First to provide rigorous theoretical foundations for model control techniques, filling an important theoretical gap
  2. Mathematical Rigor: Provides complete mathematical derivations and theorem proofs with solid theoretical framework
  3. Strong Unification: Successfully unifies seemingly different approaches (vector steering and matrix editing)
  4. Practical Value: Methods are directly applicable, offering new perspectives for practical applications

Weaknesses

  1. Limited Experimental Scale: Verified only on 1B parameter models, lacking experiments on large-scale models
  2. Narrow Task Scope: Experimental tasks are relatively simple; performance on complex NLP tasks remains unknown
  3. Performance Loss: On translation, accuracy drops by 12 percentage points compared to direct prompting
  4. Engineering Challenges: Hyperparameter sensitivity may limit practical applications

Impact

  1. Academic Value: Provides important theoretical foundations for transformer mechanism understanding and model control research
  2. Practical Prospects: Offers new technical pathways for model deployment and control
  3. Research Inspiration: May catalyze more theory-based model control method research

Applicable Scenarios

  1. Model Analysis: Understanding internal representations and computational mechanisms
  2. Lightweight Deployment: Achieving model specialization in resource-constrained environments
  3. Safety Control: Providing theoretical guidance for model safety and alignment
  4. R&D Tools: Serving as analytical tools for model development and debugging

References

Key references include:

  1. Dherin et al. (2025) - Implicit dynamics learning theory for single-block transformers
  2. Turner et al. (2025) - Activation engineering for steering language models
  3. Meng et al. (2022) - Locating and editing factual associations in GPT
  4. Todd et al. (2024) - Function vectors in large language models

Overall Assessment: This is a paper of significant theoretical value that successfully provides rigorous theoretical foundations for empirical model control techniques. While there is room for improvement in experimental validation, its theoretical contributions are important for understanding and developing transformer model control techniques.