
Transmuting prompts into weights

Mazzawi, Dherin, Munn et al.

A growing body of research has demonstrated that the behavior of large language models can be effectively controlled at inference time by directly modifying their internal states, either through vector additions to their activations or through updates to their weight matrices. These techniques, while powerful, are often guided by empirical heuristics, such as deriving steering vectors from the average activations of contrastive prompts. This work provides a theoretical foundation for these interventions, explaining how they emerge from the fundamental computations of the transformer architecture. Building on the recent finding that a prompt's influence can be mathematically mapped to implicit weight updates (Dherin et al., 2025), we generalize this theory to deep, multi-block transformers. We show how the information contained in any chunk of a user prompt is represented and composed internally through weight vectors and weight matrices. We then derive a principled method for condensing this information into token-independent thought vectors and thought matrices. These constructs provide a theoretical explanation for existing vector- and matrix-based model editing techniques and offer a direct, computationally-grounded method for transmuting textual input into reusable weight updates.



Basic Information

Paper ID: 2510.08734
Title: Transmuting prompts into weights
Authors: Hanna Mazzawi, Benoit Dherin, Michael Munn, Michael Wunder, Javier Gonzalvo (Google Research)
Classification: cs.LG (Machine Learning)
Publication Date: October 9, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.08734

Abstract

This paper provides a theoretical foundation for inference-time control techniques in large language models. Prior work shows that model behavior can be controlled by directly modifying internal states, either by adding vectors to activations or by updating weight matrices. However, these techniques typically rely on empirical heuristics without theoretical support. Building on the finding that a prompt's effect can be mapped mathematically to implicit weight updates, the paper generalizes that theory to deep multi-block transformers. It shows how any chunk of a user prompt is represented and composed internally through weight vectors and weight matrices, and derives principled methods for condensing this information into token-independent "thought vectors" and "thought matrices."

Research Background and Motivation

Problem Definition

The core problem addressed by this research is: Why do existing model intervention techniques (such as activation steering and model editing) effectively control complex model behavior? What are the underlying mathematical principles behind these techniques?

Significance

Theoretical Gap: Vector steering and matrix editing are highly effective in practice, but they lack an explanation grounded in the transformer architecture
Method Limitations: Existing approaches primarily rely on empirical heuristics, such as constructing steering vectors through averaged activations from contrastive prompts
Need for Unified Framework: A unified theoretical framework is needed to explain how textual instructions translate into concrete weight or activation changes

Limitations of Existing Methods

Activation Steering Methods: Vector addition alone may not capture the full effect of an instruction
Model Editing Methods: Lack strategies derived from first principles to compress general prompt information into reusable weight updates
Insufficient Theoretical Explanation: The success of existing techniques lacks theoretical grounding in transformer computational mechanisms

Core Contributions

Theoretical Extension: Extends token patching theory from single transformer blocks to deep multi-block transformer architectures
Thought Patch Framework: Proposes methods for aggregating token-dependent transient patches into reusable weight updates
Theoretical Unification: Provides unified theoretical explanations for existing vector steering and matrix editing techniques
Practical Methods: Provides computational methods for directly converting text prompts into weight updates

Methodology Details

Task Definition

Given a prompt C = [I, x₁, ..., xₙ] consisting of an instruction block I followed by content tokens x₁, ..., xₙ, the goal is to find weight updates such that the model's outputs on the prompt with I removed match its outputs on the full prompt.

Token Patching Theory

Single-Block Extension

Building on Dherin et al. (2025), the output of a single transformer block on the full prompt C can be reproduced exactly from the reduced prompt C\I by applying the following token patch:

δₓ(I) = A(C, x) - A(C\I, x)                    (3)
∆ₓ(I) = δₓ(I)aₓᵀ / ||aₓ||²                    (4)

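where aₓ = A(C\I, x) is the attention output of token x without the instruction block I, and A(C, x) is its attention output with the full prompt. Since ∆ₓ(I)aₓ = δₓ(I), the rank-1 patch restores exactly the contribution that is lost when I is removed.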
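The identity behind Equations (3) and (4) can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: random vectors stand in for the attention outputs A(C, x) and A(C\I, x), and the function name token_patch is hypothetical.

```python
import numpy as np

def token_patch(a_full, a_reduced):
    # a_full stands in for A(C, x); a_reduced stands in for A(C\I, x).
    delta = a_full - a_reduced                                     # Eq. (3)
    Delta = np.outer(delta, a_reduced) / (a_reduced @ a_reduced)   # Eq. (4)
    return delta, Delta

rng = np.random.default_rng(0)
a_full = rng.normal(size=8)       # placeholder activations, not model outputs
a_reduced = rng.normal(size=8)

delta, Delta = token_patch(a_full, a_reduced)

# Applying the rank-1 patch to a_reduced returns delta,
# so a_reduced + Delta @ a_reduced recovers the full-prompt output.
restored = a_reduced + Delta @ a_reduced
print(np.allclose(restored, a_full))
```

The patch is rank-1 and depends on the specific token x through aₓ, which is why it is described as token-dependent.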

Multi-Block Extension

For deep transformers, token patches must be recursively applied to each layer:

x⁽²⁾ = T⁽²⁾_patched ∘ T⁽¹⁾_patched (C⁽⁰⁾\I⁽⁰⁾, x⁽⁰⁾)

Each layer's patch is computed using transformed activations from the previous layer.
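The layer-by-layer recursion can be illustrated on a toy model. This is a sketch under strong assumptions, not the paper's architecture: each "block" is a hypothetical map x ↦ W·A(C, x), the contextual layer A is replaced by "token plus context mean", and the patch is assumed to enter the block as W(I + ∆ₓ). The names attn, run_full, and run_patched are illustrative.

```python
import numpy as np

def attn(context, x):
    # Toy stand-in for the contextual layer A(C, x): token plus context mean.
    return x + context.mean(axis=0)

def run_full(weights, instr, toks):
    # Reference run on the full prompt C = [I, x_1, ..., x_n].
    h = np.vstack([instr, toks])
    for W in weights:
        h = np.stack([W @ attn(h, x) for x in h])
    return h[len(instr):]                       # outputs for x_1..x_n only

def run_patched(weights, instr, toks):
    # Run on the reduced prompt C\I, patching every block per token.
    full = np.vstack([instr, toks])             # tracked only to compute patches
    red = toks.copy()
    d = toks.shape[1]
    for W in weights:
        out = []
        for x in red:
            a = attn(red, x)                    # A(C\I, x) at this layer
            delta = attn(full, x) - a           # Eq. (3) at this layer
            Delta = np.outer(delta, a) / (a @ a)       # Eq. (4) at this layer
            out.append(W @ (np.eye(d) + Delta) @ a)    # patched block output
        full = np.stack([W @ attn(full, x) for x in full])
        red = np.stack(out)
    return red

rng = np.random.default_rng(1)
d = 6
weights = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3)]
instr = rng.normal(size=(2, d))                 # two "instruction" tokens
toks = rng.normal(size=(4, d))                  # four content tokens

print(np.allclose(run_full(weights, instr, toks),
                  run_patched(weights, instr, toks)))
```

Each layer's patch is built from that layer's own inputs, so the patches must be computed in order from the first block to the last.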

Thought Patch Derivation

Thought Vector Approximation

Minimizing the squared error Σᵢ ||δ − δᵢ||² over the per-token patch vectors δᵢ (the patch vector of token xᵢ) gives their mean as the optimal token-independent thought vector:

δ(I) = (1/n) Σᵢ₌₁ⁿ δᵢ
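That the mean is the least-squares optimum can be confirmed with a small numerical check. This is a minimal sketch with random placeholder vectors in place of real patch vectors; thought_vector and squared_error are hypothetical helper names.

```python
import numpy as np

def thought_vector(deltas):
    # deltas: array of shape (n, d), one patch vector delta_i per token.
    return deltas.mean(axis=0)

def squared_error(v, deltas):
    # Total squared distance from v to every delta_i.
    return float(((deltas - v) ** 2).sum())

rng = np.random.default_rng(2)
deltas = rng.normal(size=(10, 4))    # placeholder patch vectors
v = thought_vector(deltas)

# Any perturbation of the mean increases the total squared error.
base = squared_error(v, deltas)
worse = [squared_error(v + rng.normal(scale=0.1, size=4), deltas)
         for _ in range(5)]
print(all(w > base for w in worse))
```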

Thought Matrix Approximation

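Theorem 3.1: Given n vectors a₁,...,aₙ, the minimization problem

∆(I) = argmin_M Σᵢ₌₁ⁿ ||Maᵢ - ∆ᵢaᵢ||²        (7)

has a unique solution if and only if the operator Z = Σᵢ₌₁ⁿ aᵢaᵢᵀ is invertible, in which case the solution is

∆(I) = (Σᵢ₌₁ⁿ δᵢaᵢᵀ) Z⁻¹                    (8)

Here ∆ᵢ and δᵢ denote the token patch matrix and vector for token xᵢ, and aᵢ its attention output without I, so that ∆ᵢaᵢ = δᵢ by Equation (4).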
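The closed form in Equation (8) can be cross-checked against a generic least-squares solver. The sketch below uses random placeholder data with n > d so that Z is invertible; thought_matrix is a hypothetical name, and the targets use ∆ᵢaᵢ = δᵢ.

```python
import numpy as np

def thought_matrix(deltas, acts):
    # deltas: (n, p) with rows delta_i; acts: (n, d) with rows a_i.
    Z = acts.T @ acts             # sum_i a_i a_i^T, shape (d, d)
    S = deltas.T @ acts           # sum_i delta_i a_i^T, shape (p, d)
    # S Z^{-1}, computed with a linear solve (valid because Z is symmetric).
    return np.linalg.solve(Z, S.T).T

rng = np.random.default_rng(3)
n, d = 50, 6
acts = rng.normal(size=(n, d))
deltas = rng.normal(size=(n, d))

M = thought_matrix(deltas, acts)

# Reference: minimize ||acts @ M.T - deltas|| with a generic solver.
M_ref = np.linalg.lstsq(acts, deltas, rcond=None)[0].T
print(np.allclose(M, M_ref))
```

Using a linear solve instead of forming Z⁻¹ explicitly is a numerical-stability choice made for this sketch, not something the source specifies.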

Practical Approximation

Assuming the vectors aᵢ are spherically distributed, so that Z is approximately a scalar multiple of the identity matrix, Z⁻¹ can be replaced by a scalar λ, which gives the practical formula:

∆(I) = λ Σᵢ₌₁ⁿ δᵢaᵢᵀ
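How close the practical formula comes to Equation (8) can be gauged on synthetic isotropic data. This is a sketch under assumptions that are not in the source: the aᵢ are drawn from a standard normal distribution, and λ is set by trace matching (Z ≈ (tr Z / d)·I, so λ = d / tr Z). The summary does not say how λ is chosen in the paper.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 20000, 4
acts = rng.normal(size=(n, d))                    # isotropic placeholder a_i
M_true = rng.normal(size=(d, d))
deltas = acts @ M_true.T + 0.01 * rng.normal(size=(n, d))

Z = acts.T @ acts                                 # sum_i a_i a_i^T
S = deltas.T @ acts                               # sum_i delta_i a_i^T

exact = S @ np.linalg.inv(Z)                      # Eq. (8)
lam = d / np.trace(Z)                             # assumed: Z ~ (tr Z / d) * I
approx = lam * S                                  # practical formula

rel_err = np.linalg.norm(approx - exact) / np.linalg.norm(exact)
print(rel_err)
```

On real activations, which are typically far from isotropic, the approximation error may be much larger than in this synthetic setting.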

Technical Innovations

Theoretical Foundation: First to provide transformer architecture-based theoretical explanations for empirical model control techniques
Unified Framework: Unifies vector steering and matrix editing within a single weight update mechanism
Mathematical Rigor: Provides rigorous mathematical derivations and theorem proofs
Practicality: Methods are directly applicable to real models without requiring backpropagation

Experimental Setup

Datasets

Arithmetic Tasks: Synthetic datasets for three-digit addition and multiplication
Machine Translation: English-French translation dataset ("mntn/en-fr")

Models

All experiments use the Gemma 3.0 1B model

Evaluation Metrics

Arithmetic Tasks: Accuracy (target ≥80%)
Machine Translation: Translation quality judged by Gemini 2.5 Flash-Lite

Implementation Details

Target Layers: Layers 10-20
Hyperparameters: c₁ and c₂ determined through tuning
Stability Improvements: Rank-1 updates stabilized through attention vector norm normalization

Experimental Results

Main Results

Arithmetic Tasks

Addition: Achieves 100% accuracy with fewer than 300 demonstration tokens
Multiplication: Achieves 80% accuracy, demonstrating method effectiveness on more complex tasks
Behavioral Observations: Patched models produce more detailed chain-of-thought reasoning

Machine Translation

Patched Model: Achieves 60% accuracy without instructions
Baseline Model: Achieves 72% accuracy with instructions
Performance Gap: The patched model trails the baseline by 12 percentage points, but the result shows that the method is feasible

Key Findings

Hyperparameter Sensitivity: Method is highly sensitive to hyperparameter c₁
- Low c₁: Model simply repeats input
- High c₁: Output becomes repetitive and unstable
Baseline Outperformance: On some arithmetic problems, the patched model outperforms the baseline model that receives the instructions
Language Confusion: In translation tasks, models sometimes default to incorrect target languages

Case Studies

Success Case (Addition):

Query: 2 9 2
Patched Model Output: "Okay, let's calculate the sum of 2 + 9 + 2: 2 + 9 + 2 = 13 So, the answer is 13."

Error Correction Case (Multiplication):

Baseline Model Error: 0 * 8 * 6 = 48
Patched Model Correct: 0 * 8 * 6 = 0

Related Work

Activation Steering Methods

Steering Vectors: Guide model behavior by adding carefully designed vectors to residual streams
Contrastive Methods: Construct vectors using activation differences between positive and negative sample prompts
Function Vectors: Capture task-specific vector representations
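The contrastive construction listed above can be sketched in a few lines. This is an illustrative example with random placeholder activations rather than real model states; steering_vector and the strength parameter alpha are hypothetical names.

```python
import numpy as np

def steering_vector(pos_acts, neg_acts):
    # Difference of mean activations at one layer; inputs have shape (n, d).
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

rng = np.random.default_rng(5)
neg = rng.normal(size=(16, 8))                   # placeholder activations
pos = neg + rng.normal(loc=0.5, size=(16, 8))    # paired "positive" prompts

v = steering_vector(pos, neg)

# With paired prompts, this equals the mean of the per-pair differences.
print(np.allclose(v, (pos - neg).mean(axis=0)))

# Steering adds the scaled vector to a hidden state.
alpha = 1.0
h = rng.normal(size=8)
h_steered = h + alpha * v
```

The mean of per-pair differences has the same form as the thought vector δ(I) = (1/n) Σᵢ δᵢ, which is the link the paper draws between the heuristic and its least-squares derivation.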

Model Editing Methods

ROME: Modifies factual associations using rank-1 matrix edits
MEND: Learns low-rank updates to feedforward weight matrices
Safety Control: Removes unsafe activation directions through editing

Contributions of This Work

First to provide a unified theoretical framework derived from first principles, explaining why both classes of methods are effective.

Conclusions and Discussion

Main Conclusions

Theoretical Unification: Successfully unifies empirical model control techniques within a transformer computation-based theoretical framework
Method Effectiveness: Experiments demonstrate the feasibility of thought patch methods on arithmetic and translation tasks
Theoretical Explanation: Provides mathematical foundations for existing heuristics, for example by showing that averaging contrastive activations is the least-squares-optimal vector approximation

Limitations

Performance Gap: Performance loss exists compared to direct prompting
Hyperparameter Sensitivity: Method is highly sensitive to hyperparameter selection, requiring careful tuning
Task Complexity: Performance on more complex tasks requires further verification
Computational Complexity: Computing Z⁻¹ is challenging in the general case
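One reason Z⁻¹ is problematic is that Z is singular whenever there are fewer tokens than dimensions. The sketch below illustrates this and substitutes a Moore-Penrose pseudo-inverse for Z⁻¹. The pseudo-inverse is a swapped-in technique chosen for illustration and is not described in the source; the data are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 3, 8                                # fewer tokens than dimensions
acts = rng.normal(size=(n, d))             # rows a_i
deltas = rng.normal(size=(n, d))           # rows delta_i

Z = acts.T @ acts                          # (d, d) but rank at most n
rank = np.linalg.matrix_rank(Z, tol=1e-8)
print(rank, "<", d)                        # Z is singular here

S = deltas.T @ acts                        # sum_i delta_i a_i^T
M = S @ np.linalg.pinv(Z, rcond=1e-10)     # pseudo-inverse in place of Z^{-1}

# The fit is still exact on the observed tokens: M a_i = delta_i.
print(np.allclose(acts @ M.T, deltas))
```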

Future Directions

Analysis Tools: Use the framework as an analytical tool for better understanding task representations and reasoning in large language models
Performance Improvement: Research methods to reduce performance gaps and lower hyperparameter sensitivity
Extended Applications: Explore applications to more complex tasks
Theory Refinement: Further develop the theoretical framework to handle more general cases

In-Depth Evaluation

Strengths

Significant Theoretical Contribution: First to provide rigorous theoretical foundations for model control techniques, filling an important theoretical gap
Mathematical Rigor: Provides complete mathematical derivations and theorem proofs with solid theoretical framework
Strong Unification: Successfully unifies seemingly different approaches (vector steering and matrix editing)
Practical Value: Methods are directly applicable, offering new perspectives for practical applications

Weaknesses

Limited Experimental Scale: Verified only on a single 1B-parameter model (Gemma 3.0 1B), with no experiments on larger models
Narrow Task Scope: Experimental tasks are relatively simple; performance on complex NLP tasks remains unknown
Performance Loss: Significant performance degradation compared to direct prompting
Engineering Challenges: Hyperparameter sensitivity may limit practical applications

Impact

Academic Value: Provides important theoretical foundations for transformer mechanism understanding and model control research
Practical Prospects: Offers new technical pathways for model deployment and control
Research Inspiration: May catalyze more theory-based model control method research

Applicable Scenarios

Model Analysis: Understanding internal representations and computational mechanisms
Lightweight Deployment: Achieving model specialization in resource-constrained environments
Safety Control: Providing theoretical guidance for model safety and alignment
R&D Tools: Serving as analytical tools for model development and debugging

References

Key references include:

Dherin et al. (2025) - Implicit dynamics learning theory for single-block transformers
Turner et al. (2025) - Activation engineering for steering language models
Meng et al. (2022) - Locating and editing factual associations in GPT
Todd et al. (2024) - Function vectors in large language models

Overall Assessment: This is a paper of significant theoretical value that successfully provides rigorous theoretical foundations for empirical model control techniques. While there is room for improvement in experimental validation, its theoretical contributions are important for understanding and developing transformer model control techniques.