The extraordinary success of recent Large Language Models (LLMs) on a diverse array of tasks has led to an explosion of scientific and philosophical theorizing aimed at explaining how they do what they do. Unfortunately, disagreement over fundamental theoretical issues has led to stalemate, with entrenched camps of LLM optimists and pessimists often committed to very different views of how these systems work. Overcoming stalemate requires agreement on fundamental questions, and the goal of this paper is to address one such question, namely: is LLM behavior driven partly by representation-based information processing of the sort implicated in biological cognition, or is it driven entirely by processes of memorization and stochastic table look-up? This is a question about what kind of algorithm LLMs implement, and the answer carries serious implications for higher level questions about whether these systems have beliefs, intentions, concepts, knowledge, and understanding. I argue that LLM behavior is partially driven by representation-based information processing, and then I describe and defend a series of practical techniques for investigating these representations and developing explanations on their basis. The resulting account provides a groundwork for future theorizing about language models and their successors.
- Paper ID: 2501.00885
- Title: Representation in large language models
- Author: Cameron C. Yetman (University of Toronto)
- Classification: cs.CL cs.AI cs.LG
- Publication Date: January 1, 2025 (draft version)
- Paper Link: https://arxiv.org/abs/2501.00885
The remarkable success of large language models (LLMs) across diverse tasks has prompted extensive scientific and philosophical theorization aimed at explaining their mechanisms. However, disagreement on fundamental theoretical questions has resulted in a stalemate, with opposing camps of LLM optimists and pessimists holding fundamentally different views on how these systems operate. Overcoming this impasse requires consensus on basic questions. This paper aims to address one such fundamental issue: Is LLM behavior partially driven by representation-based information processing similar to that found in biological cognition, or is it entirely driven by memorization and stochastic lookup table processes? This is a question about what algorithms LLMs implement, and the answer has significant implications for higher-level questions, such as whether these systems possess beliefs, intentions, concepts, knowledge, and understanding. The author argues that LLM behavior is partially driven by representation-based information processing and describes and defends a series of practical techniques for studying these representations and developing explanations based on them.
The fundamental question this research addresses is: Is LLM behavior driven by representation-based information processing, or does it depend entirely on memorization and stochastic lookup table processes?
- Reconciling theoretical disagreements: The LLM research field currently exhibits severe theoretical divisions, with optimists arguing that LLMs possess cognition-like capabilities and pessimists contending they are merely sophisticated pattern-matching systems
- Cognitive science foundations: This question directly relates to whether LLMs can serve as cognitive models and whether they themselves constitute cognitive systems
- Foundation for higher-level capabilities: The answer will influence our assessment of whether LLMs possess higher-level cognitive abilities such as beliefs, intentions, concepts, knowledge, and understanding
- Terminological overuse: The term "representation" in machine learning practice is used too broadly, losing theoretical value
- Behavioral inference limitations: Determining the existence of representations solely from behavioral performance involves fundamental uncertainty
- Lack of systematic methodology: There is an absence of systematic approaches to identify and verify representations in LLMs
The author contends that resolving this foundational question is crucial for breaking the current theoretical stalemate and providing a solid foundation for future LLM theorization.
- Proposed a four-condition characterization of representation: Provides a substantive, operational definition of "representation" including four conditions: INFORMATION, EXPLOITABILITY, BEHAVIOR, and ROLE
- Refuted lookup table explanations: Through analysis of cases such as Othello-GPT and color space models, demonstrated that LLMs cannot be entirely explained by finite state automata or lookup tables
- Established a mechanistic interpretability framework: Systematically described how to use probing and intervention techniques to test for the presence of representations
- Provided practical research methods: Offered concrete technical tools and methodological guidance for studying LLM representations
The author proposes an operational definition, labeled REPRESENTATION: System S has a representation R of feature z if and only if the following four conditions are satisfied:
- INFORMATION: R carries information about z
- EXPLOITABILITY: The information R carries about z is exploitable by S
- BEHAVIOR: S's exploitation of the information R carries about z enables S to produce robust z-related behavior
- ROLE: R plays a mechanistic role in S's robust z-related behavior
- Information Condition (INFORMATION)
  - Defined using mutual information: I(X, Y) = H(X) − H(X | Y)
  - Condition is satisfied when I(R, z) > 0
  - Information relationships can be established through causally generated correlations or structural correspondences
- Exploitability Condition (EXPLOITABILITY)
  - S must be able to modulate its z-related behavior in content-relevant ways based on R's activation
  - Verified through testing and intervention on R
- Behavior Condition (BEHAVIOR)
  - "Robust" refers to insensitivity to minor perturbations in surrounding conditions
  - Representations enable robust behavior but must be embedded in appropriate algorithms
- Role Condition (ROLE)
  - R must play a causal role in the mechanism driving behavior
  - Avoids panrepresentationalism problems
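To make the INFORMATION condition concrete, the following is a minimal sketch that evaluates the mutual information formula above on a joint probability table. The two tables are hypothetical illustrations (a binary activation R and a binary feature z), not data from the paper, and the function names are assumptions introduced here.

```python
import math

def entropy(dist):
    """Shannon entropy H(X) in bits of a probability distribution."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def conditional_entropy(joint):
    """H(X | Y) in bits, where joint[x][y] = P(X = x, Y = y)."""
    p_y = [sum(col) for col in zip(*joint)]
    h = 0.0
    for row in joint:
        for y, p_xy in enumerate(row):
            if p_xy > 0:
                h -= p_xy * math.log2(p_xy / p_y[y])
    return h

def mutual_information(joint):
    """I(X, Y) = H(X) - H(X | Y), matching the definition in the text."""
    p_x = [sum(row) for row in joint]
    return entropy(p_x) - conditional_entropy(joint)

# Hypothetical joint distributions over (R, z); rows index R, columns index z.
correlated = [[0.4, 0.1], [0.1, 0.4]]        # R tends to track z
independent = [[0.25, 0.25], [0.25, 0.25]]   # R says nothing about z

mi_correlated = mutual_information(correlated)
mi_independent = mutual_information(independent)
```

Under these assumed tables, the correlated case yields positive mutual information, so INFORMATION would be satisfied, while the independent case yields zero. A positive value alone says nothing about the other three conditions.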
The author analyzes the view of LLMs as lookup tables:
- Finite state automaton perspective: LLMs are viewed as finite state automata encoding large-scale lookup tables
- Non-productive characteristic: Lookup table systems are characteristically non-productive: they "can only return what has already been input"
- Counterevidence:
  - Othello-GPT: Trained on data with 25% of the game tree missing, yet achieves a 99.98% legal move rate when evaluated on the complete dataset
  - Color space model: Performs about as well on rotated color encodings as on the original data (36% vs. 34% Top-3 accuracy)
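The non-productivity point can be illustrated with a toy sketch. A pure lookup table returns only what was stored, so it has nothing to say about held-out inputs. The addition task and the 75/25 split below are hypothetical stand-ins chosen for brevity; they are not the Othello setup itself.

```python
import itertools
import random

def build_lookup_table(examples):
    """Store input/output pairs verbatim, with no generalization."""
    return dict(examples)

def table_answer(table, query):
    """Return the stored answer, or None if the query was never stored."""
    return table.get(query)

# Hypothetical task: map a pair of digits to their sum.
pairs = list(itertools.product(range(10), repeat=2))
rng = random.Random(0)
rng.shuffle(pairs)

cut = int(0.75 * len(pairs))          # hold out 25% of the input space
seen, held_out = pairs[:cut], pairs[cut:]

table = build_lookup_table(((a, b), a + b) for a, b in seen)

seen_correct = sum(table_answer(table, q) == q[0] + q[1] for q in seen)
held_out_answered = sum(table_answer(table, q) is not None for q in held_out)
```

Here the table is perfect on stored inputs and answers none of the held-out ones. A system that stays near ceiling on withheld inputs, as reported for Othello-GPT, is therefore doing something a verbatim table cannot.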
Experimental Design:
- GPT model trained on millions of Othello game records
- Records contain only move sequences, no game rules or board attribute information
- Control group: trained on complete dataset
- Experimental group: trained on skewed dataset with 25% of game tree missing
Results:
- Control group: 99.99% legal move success rate
- Experimental group: 99.98% legal move success rate
- Key finding: The model succeeds on unseen board configurations, indicating it is not a simple lookup table
Experimental Design:
- Pre-trained GPT tested on structural property reasoning in color and spatial domains
- In-context learning paradigm: 60 training examples
- Control group: limited spectrum portion with RGB codes paired to color names
- Experimental group: systematically arranged "rotated" condition maintaining structural relationships
Results:
- Control group: 34% Top-3 accuracy
- Rotated group: 36% Top-3 accuracy
- Key finding: Comparable performance when structural relationships are maintained but specific pairings are entirely novel
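One way to picture the rotated condition is as a rigid rotation of the color space: every color name is paired with a new code, but distances between codes are unchanged. The sketch below assumes a 90-degree rotation about one axis of RGB space; the summary does not state which transformation the study used, so the axis, angle, and color values are illustrative assumptions.

```python
import math

def rotate_about_z(point, degrees):
    """Rotate a 3D point about the third axis; a rigid, distance-preserving map."""
    t = math.radians(degrees)
    x, y, z = point
    return (x * math.cos(t) - y * math.sin(t),
            x * math.sin(t) + y * math.cos(t),
            z)

def pairwise_distances(table):
    """Euclidean distance between every ordered pair of named points."""
    return {(a, b): math.dist(table[a], table[b]) for a in table for b in table}

# Hypothetical name-to-RGB pairings for the original condition.
colors = {"red": (255, 0, 0), "orange": (255, 165, 0), "yellow": (255, 255, 0)}

# Rotated condition: same names, new codes. Rotated values may fall outside
# the usual 0-255 range; this sketch does not clip or renormalize them.
rotated = {name: rotate_about_z(rgb, 90) for name, rgb in colors.items()}

original_structure = pairwise_distances(colors)
rotated_structure = pairwise_distances(rotated)
```

Every individual pairing changes, yet the relational structure is identical, which is the property the rotated condition is meant to isolate.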
- Small classifiers (linear models or small MLPs) used as probes
- Decode specific information from target network hidden layer activations
- Verify INFORMATION and EXPLOITABILITY conditions
- Activation patching: Modify specific activation values and observe behavioral changes
- Feature steering: Clamp specific features to anomalously high/low values
- Verify BEHAVIOR and ROLE conditions
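The probing step can be sketched as a small logistic-regression classifier trained to read a feature off activation vectors. Everything below is synthetic: `make_activations` fabricates stand-in "hidden activations" in which a binary feature z shifts the vector along a fixed direction, so the example shows only the mechanics of probing, not results from any real model.

```python
import math
import random

def make_activations(n, rng):
    """Synthetic stand-in for hidden activations: z shifts h along one direction."""
    direction = [1.0, -1.0, 0.5, 0.0]
    data = []
    for _ in range(n):
        z = rng.randint(0, 1)
        sign = 1.0 if z else -1.0
        h = [sign * d + rng.gauss(0, 0.5) for d in direction]
        data.append((h, z))
    return data

def score(w, b, h):
    """Linear readout w . h + b."""
    return sum(wi * hi for wi, hi in zip(w, h)) + b

def train_linear_probe(data, epochs=50, lr=0.1):
    """Fit logistic regression with plain stochastic gradient descent."""
    w, b = [0.0] * len(data[0][0]), 0.0
    for _ in range(epochs):
        for h, z in data:
            p = 1.0 / (1.0 + math.exp(-score(w, b, h)))
            grad = p - z                      # gradient of log loss w.r.t. the score
            w = [wi - lr * grad * hi for wi, hi in zip(w, h)]
            b -= lr * grad
    return w, b

def probe_accuracy(w, b, data):
    """Fraction of examples where the probe's sign matches the label."""
    return sum((score(w, b, h) > 0) == bool(z) for h, z in data) / len(data)

rng = random.Random(0)
train_set, test_set = make_activations(400, rng), make_activations(200, rng)
w, b = train_linear_probe(train_set)
accuracy = probe_accuracy(w, b, test_set)
```

High held-out accuracy indicates that the feature is linearly decodable from the activations. Whether the model itself uses that information is a separate question, which is why probing is paired with intervention.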
Othello-GPT Verification Results:
- Linear probes successfully classify board states ("mine"/"yours"/"empty")
- Activation intervention (flipping piece state) causes model predictions to align with modified board state
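The logic of the intervention test can be shown on a deliberately tiny toy: overwrite one internal value, leave the input untouched, and check whether the output matches what the edited internal state would imply. The one-dimensional capture rule, the `CODE` encoding, and the `run` helper below are all hypothetical simplifications; they are not Othello's rules or Othello-GPT's architecture.

```python
CODE = {"mine": 1.0, "yours": -1.0, "empty": 0.0}

def encode(board):
    """Toy 'hidden layer': one activation per cell."""
    return [CODE[cell] for cell in board]

def legal_moves(hidden):
    """Toy 'output head': an empty cell is legal if it flanks a run of
    opponent pieces that ends in one of my pieces (1D capture rule)."""
    state = [round(a) for a in hidden]
    n, moves = len(state), []
    for i in range(n):
        if state[i] != 0:
            continue
        for step in (-1, 1):
            j, captured = i + step, 0
            while 0 <= j < n and state[j] == -1:
                j += step
                captured += 1
            if captured > 0 and 0 <= j < n and state[j] == 1:
                moves.append(i)
                break
    return moves

def run(board, patch=None):
    """Forward pass with an optional intervention on the hidden activations."""
    hidden = encode(board)
    for index, value in (patch or {}).items():
        hidden[index] = value          # overwrite the activation, not the input
    return legal_moves(hidden)

board = ["empty", "yours", "mine", "yours", "empty"]
baseline = run(board)
patched = run(board, patch={1: CODE["mine"]})           # flip cell 1 internally
counterfactual = run(["empty", "mine", "mine", "yours", "empty"])
```

In this toy the patched output equals the output for the counterfactual board, which is the pattern the Othello-GPT intervention result describes. In a real network the match is an empirical finding rather than something guaranteed by construction.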
Claude 3 Sonnet Verification Results:
- Sparse autoencoders identify interpretable features (e.g., Golden Gate Bridge, brain science)
- Feature steering experiments: 10x activation of Golden Gate Bridge feature leads to model mentioning the bridge
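Feature steering can be sketched as clamping one coefficient in a dictionary decomposition of an activation vector and then reconstructing the vector. The two-feature dictionary below is hypothetical, and the sketch assumes the multiplier is applied to the feature's maximum observed activation, which the summary does not specify.

```python
def decode(features, dictionary):
    """Reconstruct an activation vector as a weighted sum of feature directions."""
    dim = len(dictionary[0])
    return [sum(f * d[k] for f, d in zip(features, dictionary)) for k in range(dim)]

def steer(features, index, max_activation, multiplier=10.0):
    """Clamp one feature to a multiple of its assumed maximum activation."""
    clamped = list(features)
    clamped[index] = multiplier * max_activation
    return clamped

# Hypothetical dictionary: each row is one feature's direction in activation space.
dictionary = [[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]]
features = [0.2, 0.5]                  # feature activations on some input

steered = steer(features, index=0, max_activation=1.0)
original_vector = decode(features, dictionary)
steered_vector = decode(steered, dictionary)
```

The steered vector differs from the original only along the clamped feature's direction. In the reported experiment, the interesting part is the downstream effect on generated text, which this sketch does not model.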
- Cognitive science tradition: Theoretical foundations established by Fodor (1975), Sterelny (1991), Shea (2018)
- Computational levels: Based on the algorithmic level of Marr's (1982) levels-of-analysis framework
- Representation learning: Bengio et al. (2014) representation learning framework
- Terminological overextension problem: Ramsey (2017) on the overly broad use of the "representation" concept
- Circuit analysis: Elhage et al. (2021), Dunefsky et al. (2024) computational pathway analysis
- Causal abstraction: Geiger et al. (2021) causal model alignment methods
- Mechanistic interpretability: MI research tradition established by Olah et al. (2018, 2020)
- LLMs possess substantive representations: In certain contexts, LLM behavior is driven by representations satisfying the four-condition definition
- Lookup table explanations are insufficient: Pure memorization and lookup tables cannot explain LLMs' generalization capabilities
- Mechanistic interpretability methods are effective: Probing and intervention techniques provide viable pathways for studying LLM representations
- Context-dependence of condition application: Robustness assessment of representations depends on specific tasks and environments
- Content determination problem unresolved: The question of how representation content is determined has not been systematically addressed
- Higher-level cognitive abilities remain open: The question of whether LLMs possess beliefs, knowledge, understanding, etc. has not been directly addressed
- Systematic representation mapping: Establish systematic accounts of when LLMs are expected to rely on representations versus other mechanisms
- Content determination theory: Develop theoretical frameworks for determining LLM representation content
- Cognitive ability assessment: Assess LLMs' higher-level cognitive abilities based on representation analysis
- Outstanding theoretical contribution: Provides a rigorous definition of representation, filling an important theoretical gap
- Methodological innovation: Integrates representation theory from cognitive science with interpretability techniques from machine learning
- Sufficient empirical evidence: Supports core arguments through multiple case studies and technical verification
- Clear and rigorous writing: Logical argumentation is clear and technical details are accurately described
- Limited case studies: Primarily based on a small number of cases, requiring broader validation
- Ambiguous robustness standards: The definition of "robust behavior" remains relatively subjective
- Practical challenges: Application of proposed methods to large-scale LLMs still faces technical challenges
- Theoretical impact: Provides an important theoretical foundation for research on LLM cognitive abilities
- Methodological impact: Advances the application of mechanistic interpretability in LLM research
- Practical value: Provides new tools for AI safety and interpretability research
- LLM capability assessment: Evaluate whether specific LLMs possess genuine cognitive abilities
- Model improvement: Improve model architecture and training methods based on representation analysis
- AI safety research: Understand LLM internal mechanisms to enhance system safety
The paper cites rich interdisciplinary literature, primarily including:
- Cognitive science foundational literature: Fodor (1975), Marr (1982), Shea (2018)
- Machine learning interpretability: Olah et al. (2018), Elhage et al. (2021)
- Critical LLM research: Bender & Koller (2020), Marcus & Davis (2020)
- Technical methodology literature: Li et al. (2023), Templeton et al. (2024)
Summary: This paper makes important theoretical and methodological contributions to the field of LLM representation research. Through rigorous conceptual analysis, empirical research, and technical innovation, it provides new perspectives for understanding LLM internal mechanisms. While certain limitations remain, it establishes a solid foundation for future research on LLM cognitive abilities.