
Representation in large language models

Yetman

The extraordinary success of recent Large Language Models (LLMs) on a diverse array of tasks has led to an explosion of scientific and philosophical theorizing aimed at explaining how they do what they do. Unfortunately, disagreement over fundamental theoretical issues has led to stalemate, with entrenched camps of LLM optimists and pessimists often committed to very different views of how these systems work. Overcoming stalemate requires agreement on fundamental questions, and the goal of this paper is to address one such question, namely: is LLM behavior driven partly by representation-based information processing of the sort implicated in biological cognition, or is it driven entirely by processes of memorization and stochastic table look-up? This is a question about what kind of algorithm LLMs implement, and the answer carries serious implications for higher level questions about whether these systems have beliefs, intentions, concepts, knowledge, and understanding. I argue that LLM behavior is partially driven by representation-based information processing, and then I describe and defend a series of practical techniques for investigating these representations and developing explanations on their basis. The resulting account provides a groundwork for future theorizing about language models and their successors.


Representation in Large Language Models

Basic Information

Paper ID: 2501.00885
Title: Representation in large language models
Author: Cameron C. Yetman (University of Toronto)
Classification: cs.CL cs.AI cs.LG
Publication Date: January 1, 2025 (draft version)
Paper Link: https://arxiv.org/abs/2501.00885

Abstract

The remarkable success of large language models (LLMs) across diverse tasks has prompted extensive scientific and philosophical theorization aimed at explaining their mechanisms. However, disagreement on fundamental theoretical questions has resulted in a stalemate, with opposing camps of LLM optimists and pessimists holding fundamentally different views on how these systems operate. Overcoming this impasse requires consensus on basic questions. This paper aims to address one such fundamental issue: Is LLM behavior partially driven by representation-based information processing similar to that found in biological cognition, or is it entirely driven by memorization and stochastic lookup table processes? This is a question about what algorithms LLMs implement, and the answer has significant implications for higher-level questions, such as whether these systems possess beliefs, intentions, concepts, knowledge, and understanding. The author argues that LLM behavior is partially driven by representation-based information processing and describes and defends a series of practical techniques for studying these representations and developing explanations based on them.

Research Background and Motivation

Core Research Question

The fundamental question this research addresses is: Is LLM behavior driven at least in part by representation-based information processing, or does it depend entirely on memorization and stochastic lookup table processes?

Significance of the Question

Reconciling theoretical disagreements: The LLM research field currently exhibits severe theoretical divisions, with optimists arguing that LLMs possess cognition-like capabilities and pessimists contending they are merely sophisticated pattern-matching systems
Cognitive science foundations: This question directly relates to whether LLMs can serve as cognitive models and whether they themselves constitute cognitive systems
Foundation for higher-level capabilities: The answer will influence our assessment of whether LLMs possess higher-level cognitive abilities such as beliefs, intentions, concepts, knowledge, and understanding

Limitations of Existing Approaches

Terminological overuse: The term "representation" is used so broadly in machine learning practice that it loses its theoretical value
Behavioral inference limitations: Determining the existence of representations solely from behavioral performance involves fundamental uncertainty
Lack of systematic methodology: There is an absence of systematic approaches to identify and verify representations in LLMs

Research Motivation

The author contends that resolving this foundational question is crucial for breaking the current theoretical stalemate and providing a solid foundation for future LLM theorization.

Core Contributions

Proposed a four-condition characterization of representation: Provided a substantive, operational definition of "representation" consisting of four conditions: INFORMATION, EXPLOITABILITY, BEHAVIOR, and ROLE
Refuted lookup table explanations: Through analysis of cases such as Othello-GPT and color space models, demonstrated that LLMs cannot be entirely explained by finite state automata or lookup tables
Established a mechanistic interpretability framework: Systematically described how to use probing and intervention techniques to test for the presence of representations
Provided practical research methods: Offered concrete technical tools and methodological guidance for studying LLM representations

Detailed Methodology

Four-Condition Definition of Representation

The author proposes an operational definition: System S has a representation R of feature z if and only if the following four conditions are satisfied:

REPRESENTATION

INFORMATION: R carries information about z
EXPLOITABILITY: The information R carries about z is exploitable by S
BEHAVIOR: S's exploitation of the information R carries about z enables S to produce robust z-related behavior
ROLE: R plays a mechanistic role in S's robust z-related behavior

Technical Details

Information Condition (INFORMATION)
- Defined using mutual information: $I(X;Y) = H(X) - H(X \mid Y)$
- Condition is satisfied when $I(R;z) > 0$
- Information relationships can be established through causally generated correlations or structural correspondences
Exploitability Condition (EXPLOITABILITY)
- S must be able to modulate its z-related behavior in content-relevant ways based on R's activation
- Verified through testing and intervention on R
Behavior Condition (BEHAVIOR)
- "Robust" refers to insensitivity to minor perturbations in surrounding conditions
- Representations enable robust behavior but must be embedded in appropriate algorithms
Role Condition (ROLE)
- R must play a causal role in the mechanism driving behavior
- Guards against panrepresentationalism, the worry that almost anything carrying information would count as a representation
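
As an illustration of the INFORMATION condition, the sketch below computes a plug-in estimate of the mutual information between a discretized internal state R and a feature z from paired samples. The function name and the toy samples are assumptions made for this example and do not come from the paper. Plug-in estimates are biased upward on finite samples, so in practice a value slightly above zero would need to be compared against a shuffled-label baseline before being read as evidence.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Plug-in estimate of I(R; z) in bits from a list of (r, z) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    r_counts = Counter(r for r, _ in pairs)
    z_counts = Counter(z for _, z in pairs)
    mi = 0.0
    for (r, z), count in joint.items():
        p_rz = count / n
        # p(r, z) / (p(r) * p(z)), written with raw counts
        ratio = p_rz * n * n / (r_counts[r] * z_counts[z])
        mi += p_rz * math.log2(ratio)
    return mi

# Toy samples (hypothetical): R perfectly tracks z in the first list,
# and is statistically independent of z in the second.
informative = [(0, "empty"), (1, "mine"), (0, "empty"), (1, "mine")]
uninformative = [(0, "empty"), (0, "mine"), (1, "empty"), (1, "mine")]

print(mutual_information(informative))    # 1.0 bit
print(mutual_information(uninformative))  # 0.0 bits
```

Only the first toy state would satisfy INFORMATION; the other three conditions still have to be checked separately.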

Critique of the Lookup Table Hypothesis

The author analyzes the view of LLMs as lookup tables:

Finite state automaton perspective: LLMs are viewed as finite state automata encoding large-scale lookup tables
Non-productive characteristic: Lookup table systems are characteristically non-productive: they "can only return what has already been input"
Counterevidence:
- Othello-GPT: Trained on a skewed dataset with 25% of the game tree missing, yet achieves a 99.98% legal move rate, against 99.99% for the model trained on the complete dataset
- Color space model: Performs comparably on rotated color encodings as on original data (36% vs 34% Top-3 accuracy)
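
The non-productivity point can be made concrete with a toy contrast between a pure lookup table and a system that has captured an underlying rule. This is a minimal sketch under assumed names and an assumed toy task (parity of a bit tuple); it is not the paper's experiment.

```python
def build_lookup_table(examples):
    """Memorize input -> output pairs exactly as seen."""
    return dict(examples)

def lookup_predict(table, x):
    # A pure lookup table has nothing to return for an unseen key.
    return table.get(x)

def rule_predict(x):
    # Stand-in for a system that has captured the underlying rule
    # (here, parity of a bit tuple).
    return sum(x) % 2

train = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1)]
table = build_lookup_table(train)
unseen = (1, 1)

print(lookup_predict(table, unseen))  # None: no entry for this input
print(rule_predict(unseen))           # 0: the rule extends to new inputs
```

Success on inputs that were never paired with an output during training is what the two counterexamples above are taken to show.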

Experimental Setup and Results

Case Study 1: Othello-GPT

Experimental Design:

GPT model trained on millions of Othello game records
Records contain only move sequences, no game rules or board attribute information
Control group: trained on complete dataset
Experimental group: trained on skewed dataset with 25% of game tree missing

Results:

Control group: 99.99% legal move success rate
Experimental group: 99.98% legal move success rate
Key finding: The model succeeds on unseen board configurations, indicating it is not a simple lookup table
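
The evaluation protocol can be sketched as follows: hold out one branch of the game tree, then measure how often predicted moves are legal on the held-out games. Holding out by opening move is one way to remove a quarter of the tree, since Othello has four legal opening moves, and it is assumed here for illustration. The move vocabulary, the legality rule, and `toy_model` are hypothetical stand-ins, not real Othello logic or the trained network.

```python
def split_by_opening(games, held_out_opening):
    """Separate games that start with held_out_opening from the rest."""
    train = [g for g in games if g[0] != held_out_opening]
    held_out = [g for g in games if g[0] == held_out_opening]
    return train, held_out

def legal_move_rate(predict_move, legal_moves, prefixes):
    """Fraction of game prefixes for which the predicted next move is legal."""
    hits = sum(1 for p in prefixes if predict_move(p) in legal_moves(p))
    return hits / len(prefixes)

CELLS = ["c4", "d3", "e6", "f5", "c3", "e3"]  # toy vocabulary, not a real board

def toy_legal_moves(prefix):
    # Toy rule (assumption): any cell not yet played is legal.
    return {c for c in CELLS if c not in prefix}

def toy_model(prefix):
    # Hypothetical stand-in for a trained model's top-1 prediction.
    return next(c for c in CELLS if c not in prefix)

games = [["c4", "c3", "d3"], ["d3", "c4", "e3"],
         ["e6", "f5", "c4"], ["f5", "e6", "d3"]]
train, held_out = split_by_opening(games, "c4")
prefixes = [g[:k] for g in held_out for k in range(1, len(g) + 1)]
rate = legal_move_rate(toy_model, toy_legal_moves, prefixes)
print(len(train), len(held_out), rate)
```

In the real setting, `legal_moves` would be an Othello rules engine and `predict_move` the model trained only on `train`.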

Case Study 2: Color Space Model

Experimental Design:

Pre-trained GPT tested on structural property reasoning in color and spatial domains
In-context learning paradigm: 60 training examples
Control group: limited spectrum portion with RGB codes paired to color names
Experimental group: systematically arranged "rotated" condition maintaining structural relationships

Results:

Control group: 34% Top-3 accuracy
Rotated group: 36% Top-3 accuracy
Key finding: Comparable performance when structural relationships are maintained but specific pairings are entirely novel
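
A rough sketch of the two ingredients of this setup: a structure-preserving remapping of color codes, and the Top-3 accuracy metric. Rotating hue with the standard-library `colorsys` module is an assumption chosen for simplicity and may differ from the rotation used in the original experiment. The palette, the 120-degree angle, and the ranked predictions are made up for illustration.

```python
import colorsys

def rotate_hue(rgb, degrees):
    """Rotate an 8-bit RGB color around the hue circle.

    Every color shifts by the same angle, so relative hue differences
    between colors are preserved.
    """
    r, g, b = (c / 255 for c in rgb)
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    h = (h + degrees / 360) % 1.0
    return tuple(round(c * 255) for c in colorsys.hsv_to_rgb(h, s, v))

def top_k_accuracy(ranked_predictions, gold_labels, k=3):
    """Share of items whose gold label is among the top-k ranked predictions."""
    hits = sum(1 for ranked, gold in zip(ranked_predictions, gold_labels)
               if gold in ranked[:k])
    return hits / len(gold_labels)

palette = {"red": (255, 0, 0), "green": (0, 255, 0), "blue": (0, 0, 255)}
rotated = {name: rotate_hue(rgb, 120) for name, rgb in palette.items()}

# Hypothetical ranked model outputs for three queries.
ranked = [["red", "orange", "pink"],
          ["blue", "teal", "green"],
          ["yellow", "gold", "tan"]]
gold = ["orange", "navy", "yellow"]
accuracy = top_k_accuracy(ranked, gold, k=3)
```

In the rotated condition each name is paired with a code it never co-occurs with in ordinary text, so memorized name-to-code pairs alone cannot support correct answers.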

Mechanistic Interpretability Verification

Probing Techniques

Small classifiers (linear models or shallow MLPs) used as probes
Decode specific information from target network hidden layer activations
Verify INFORMATION and EXPLOITABILITY conditions
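
A minimal linear probe can be written as logistic regression trained on hidden activations to predict a feature label. In the sketch below the activations are synthetic, with dimension 2 carrying the label by construction. The data generator, hyperparameters, and function names are assumptions for illustration; real probes are trained on activations recorded from the target network.

```python
import math
import random

def train_linear_probe(activations, labels, epochs=200, lr=0.5):
    """Fit a logistic-regression probe with stochastic gradient descent."""
    dim = len(activations[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(activations, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            z = max(-30.0, min(30.0, z))  # clip to keep exp() well-behaved
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def probe_accuracy(w, b, activations, labels):
    """Accuracy of the probe's thresholded linear readout."""
    correct = 0
    for x, y in zip(activations, labels):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        correct += int((z > 0) == bool(y))
    return correct / len(labels)

rng = random.Random(0)

def fake_activation(label):
    # Hypothetical 4-d hidden state: dimension 2 encodes the label,
    # the remaining dimensions are noise.
    x = [rng.gauss(0, 1) for _ in range(4)]
    x[2] = (2.0 if label else -2.0) + rng.gauss(0, 0.3)
    return x

labels = [i % 2 for i in range(200)]
acts = [fake_activation(y) for y in labels]
w, b = train_linear_probe(acts[:150], labels[:150])
acc = probe_accuracy(w, b, acts[150:], labels[150:])
print(acc)
```

High held-out accuracy shows that the label is linearly decodable from the activations; whether the network itself uses that direction is a separate question.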

Intervention Techniques

Activation patching: Modify specific activation values and observe behavioral changes
Feature steering: Clamp specific features to anomalously high/low values
Verify BEHAVIOR and ROLE conditions
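
Activation patching can be illustrated on a toy two-layer network: run a "clean" and a "corrupted" input, copy hidden activations from the clean run into the corrupted run, and check whether the output moves toward the clean behavior. The weights and inputs below are made up for the example, and the `patch` argument is a hypothetical interface rather than any particular library's API.

```python
def hidden(x, W1):
    """ReLU hidden layer."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]

def run(x, W1, W2, patch=None):
    """Forward pass; `patch` maps hidden-unit index -> value to overwrite."""
    h = hidden(x, W1)
    for idx, value in (patch or {}).items():
        h[idx] = value
    return [sum(w * hi for w, hi in zip(row, h)) for row in W2]

W1 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [[1.0, -1.0], [-1.0, 1.0]]
clean, corrupt = [1.0, 0.0], [0.0, 1.0]

clean_hidden = hidden(clean, W1)   # [1.0, 0.0]
baseline = run(corrupt, W1, W2)    # [-1.0, 1.0]: prefers output 1
# Overwrite both hidden units with their values from the clean run.
patched = run(corrupt, W1, W2, patch=dict(enumerate(clean_hidden)))
print(baseline, patched)           # patched == [1.0, -1.0]: prefers output 0
```

If patching a component restores the clean behavior, that component plausibly plays a causal role in it, which is the kind of evidence the ROLE condition asks for.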

Othello-GPT Verification Results:

Linear probes successfully classify board states ("mine"/"yours"/"empty")
Activation intervention (flipping piece state) causes model predictions to align with modified board state

Claude 3 Sonnet Verification Results:

Sparse autoencoders identify interpretable features (e.g., Golden Gate Bridge, brain science)
Feature steering experiments: Clamping the Golden Gate Bridge feature to 10x its maximum activation value leads the model to bring up the bridge in its responses
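
Feature steering with a sparse autoencoder (SAE) can be sketched as follows: encode an activation vector into feature activations, then shift the vector along one feature's decoder direction until that feature sits at a chosen target value. The 2-d weights, the observed maximum, and the function names are toy assumptions. A real SAE has far more features than dimensions and learned weights.

```python
def sae_encode(x, W_enc, b_enc):
    """Feature activations f = ReLU(W_enc x + b_enc)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(W_enc, b_enc)]

def steer(x, W_enc, b_enc, W_dec, feature, target):
    """Move one feature's contribution in x to the `target` activation.

    W_dec[feature] is that feature's decoder direction. Adding only the
    difference leaves the other features' contributions and the SAE's
    reconstruction error untouched.
    """
    f = sae_encode(x, W_enc, b_enc)
    delta = target - f[feature]
    return [xi + delta * di for xi, di in zip(x, W_dec[feature])]

# Toy SAE (assumption): identity encoder and decoder over 2 dimensions.
W_enc = [[1.0, 0.0], [0.0, 1.0]]
b_enc = [0.0, 0.0]
W_dec = [[1.0, 0.0], [0.0, 1.0]]

x = [0.5, 2.0]
observed_max = 1.0  # hypothetical maximum activation of feature 0
steered = steer(x, W_enc, b_enc, W_dec, feature=0, target=10 * observed_max)
print(steered)      # [10.0, 2.0]
```

The steered vector would then replace the original activation in the model's forward pass, and the resulting change in output is what the steering experiments observe.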

Representation Theory Foundations

Cognitive science tradition: Theoretical foundations established by Fodor (1975), Sterelny (1991), Shea (2018)
Levels of analysis: Based on Marr's (1982) levels-of-analysis framework, with the question posed at the algorithmic level

Representation in Machine Learning

Representation learning: Bengio et al. (2014) representation learning framework
Terminological generalization problem: Ramsey (2017) on the generalization of the "representation" concept

LLM Interpretation Methods

Circuit analysis: Elhage et al. (2021), Dunefsky et al. (2024) computational pathway analysis
Causal abstraction: Geiger et al. (2021) causal model alignment methods
Mechanistic interpretability: MI research tradition established by Olah et al. (2018, 2020)

Conclusions and Discussion

Main Conclusions

LLMs possess substantive representations: In certain contexts, LLM behavior is driven by representations satisfying the four-condition definition
Lookup table explanations are insufficient: Pure memorization and lookup tables cannot explain LLMs' generalization capabilities
Mechanistic interpretability methods are effective: Probing and intervention techniques provide viable pathways for studying LLM representations

Limitations

Context-dependence of condition application: Robustness assessment of representations depends on specific tasks and environments
Content determination problem unresolved: The question of how representation content is determined has not been systematically addressed
Higher-level cognitive abilities remain open: The question of whether LLMs possess beliefs, knowledge, understanding, etc. has not been directly addressed

Future Directions

Systematic representation mapping: Establish systematic accounts of when LLMs are expected to rely on representations versus other mechanisms
Content determination theory: Develop theoretical frameworks for determining LLM representation content
Cognitive ability assessment: Assess LLMs' higher-level cognitive abilities based on representation analysis

In-Depth Evaluation

Strengths

Outstanding theoretical contribution: Provides a rigorous definition of representation, filling an important theoretical gap
Methodological innovation: Organically combines representation theory from cognitive science with interpretability techniques from machine learning
Sufficient empirical evidence: Supports core arguments through multiple case studies and technical verification
Clear and rigorous writing: Logical argumentation is clear and technical details are accurately described

Weaknesses

Limited case studies: Primarily based on a small number of cases, requiring broader validation
Ambiguous robustness standards: The definition of "robust behavior" remains relatively subjective
Practical challenges: Application of proposed methods to large-scale LLMs still faces technical challenges

Impact

Theoretical impact: Provides important theoretical foundation for research on LLM cognitive abilities
Methodological impact: Advances application of mechanistic interpretability in LLM research
Practical value: Provides new tools for AI safety and interpretability research

Applicable Scenarios

LLM capability assessment: Evaluate whether specific LLMs possess genuine cognitive abilities
Model improvement: Improve model architecture and training methods based on representation analysis
AI safety research: Understand LLM internal mechanisms to enhance system safety

References

The paper cites a broad interdisciplinary literature, including:

Cognitive science foundational literature: Fodor (1975), Marr (1982), Shea (2018)
Machine learning interpretability: Olah et al. (2018), Elhage et al. (2021)
Critical LLM research: Bender & Koller (2020), Marcus & Davis (2020)
Technical methodology literature: Li et al. (2023), Templeton et al. (2024)

Summary: This paper makes important theoretical and methodological contributions to the field of LLM representation research. Through rigorous conceptual analysis, empirical research, and technical innovation, it provides new perspectives for understanding LLM internal mechanisms. While certain limitations remain, it establishes a solid foundation for future research on LLM cognitive abilities.