2025-11-24T19:25:18.115923

KnowThyself: An Agentic Assistant for LLM Interpretability

Prasai, Du, Zhang et al.

We develop KnowThyself, an agentic assistant that advances large language model (LLM) interpretability. Existing tools provide useful insights but remain fragmented and code-intensive. KnowThyself consolidates these capabilities into a chat-based interface, where users can upload models, pose natural language questions, and obtain interactive visualizations with guided explanations. At its core, an orchestrator LLM first reformulates user queries, an agent router further directs them to specialized modules, and the outputs are finally contextualized into coherent explanations. This design lowers technical barriers and provides an extensible platform for LLM inspection. By embedding the whole process into a conversational workflow, KnowThyself offers a robust foundation for accessible LLM interpretability.

Basic Information

Paper ID: 2511.03878
Title: KnowThyself: An Agentic Assistant for LLM Interpretability
Authors: Suraj Prasai (Wake Forest University), Mengnan Du (New Jersey Institute of Technology), Ying Zhang (Wake Forest University), Fan Yang (Wake Forest University)
Classification: cs.AI, cs.IR, cs.LG, cs.MA
Publication Time/Conference: AAAI 2026 (40th AAAI Conference on Artificial Intelligence - Demonstration Track)
Paper Link: https://arxiv.org/abs/2511.03878
Code Repository: https://github.com/spygaurad/KnowThyself

Abstract

This paper develops KnowThyself, an agentic assistant that advances the interpretability of large language models (LLMs). While existing tools provide useful insights, they remain fragmented and require substantial coding effort. KnowThyself integrates these capabilities into a chat-based interface where users can upload models, pose natural language questions, and obtain interactive visualizations with guided explanations. At its core, an orchestrator LLM first reformulates user queries, an agent router then directs them to specialized modules, and the outputs are finally contextualized into coherent explanations. This design lowers technical barriers and provides an extensible LLM inspection platform. By embedding the entire process within a conversational workflow, KnowThyself provides a solid foundation for accessible LLM interpretability.

Research Background and Motivation

Core Problem

Large language models, despite their excellence in language understanding, reasoning, and problem-solving, possess a black-box nature that makes their internal decision-making processes difficult to interpret, raising concerns about transparency, trust, and accountability.

Problem Significance

Transparency Requirements: As LLMs are increasingly deployed in critical applications, understanding their decision mechanisms becomes crucial
Research-Practice Gap: Progress in interpretability research significantly lags behind the rapid development of LLMs
Technical Barriers: Existing tools require substantial technical expertise, limiting the democratization of interpretability

Limitations of Existing Approaches

Fragmentation: While existing LLM interpretability methods (such as attribution methods and mechanistic analysis) provide valuable insights, they operate in isolation
Difficulty of Use: Existing tools require extensive coding, which keeps the barrier to entry high
Lack of Integration: Existing platforms neither support conversational exploration nor provide interactive, well-documented explanations
Technical Barriers: Practitioners struggle to access and utilize the latest interpretability techniques

Research Motivation

Bridge the gap between cutting-edge interpretability research and practical applications by creating a unified, accessible, and scalable platform through multi-agent orchestration, modular architecture, and interactive visualization, enabling broad audiences to engage with emerging explanation techniques.

Core Contributions

The main contributions of this paper include:

Multi-Agent Orchestration Framework: Proposes a framework that coordinates diverse explanation tasks, supporting flexible routing and coherent explanation generation
Modular Architecture: Encapsulates different explanation methods as independent agents, supporting seamless integration of new tools and future scalability
Interactive Visualization Interface: Provides output presentation with natural language explanations, significantly lowering the threshold for effective model inspection
Conversational Workflow: Embeds the entire explanation process within a conversational flow, enabling model upload, querying, and result retrieval without coding

Methodology Details

Task Definition

Input:

User-uploaded LLM model to be interpreted
Natural language queries (e.g., "Show how the model attends to the token 'she' in a sentence")

Output:

Interactive visualization results
Natural language explanations with guidance
Relevant evaluation metrics (e.g., bias scores)

Constraints:

Maintain conversational coherence and context understanding
Support flexible invocation of multiple explanation methods
Ensure accessibility of technical details

Model Architecture

KnowThyself employs a four-layer architecture design:

1. Orchestrator LLM

Function: Acts as a supervisory model managing user interactions and guiding the explanation process
Specific Tasks:
- Reconstruct user queries
- Generate necessary subtasks (e.g., sentence synthesis or tool selection)
- Contextualize intermediate results
- Generate coherent natural language explanations
Implementation: Uses Gemma3-27B model
Role: Ensures complex visualizations or bias metrics remain comprehensible

2. Agent Router

Function: Dispatches queries to specialized agents using embedding-based similarity search
Routing Mechanism:
- Matches user intent with agent descriptions
- Uses nomic-embed-text model hosted on Ollama for embeddings
- Maintains efficiency while ensuring query-tool capability alignment
Extensibility: Could be upgraded to LLM-based routing as the system grows to handle more complex scenarios
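
The routing step described above can be sketched as a nearest-description lookup. This is a minimal illustration, not the authors' implementation: the agent descriptions are invented for the example, and a toy bag-of-words vector stands in for the nomic-embed-text embeddings so that the sketch runs with the standard library only.

```python
import math
import re
from collections import Counter

# Hypothetical agent descriptions; the wording is illustrative, not from the paper.
AGENT_DESCRIPTIONS = {
    "bertviz": "visualize attention distribution across tokens in a sentence",
    "transformerlens": "inspect layer and attention head activations inside the model",
    "rag_explainer": "retrieve explanations from interpretability literature and papers",
    "biaseval": "evaluate toxicity, regard and gender bias of model generations",
}


def embed(text):
    """Toy bag-of-words vector (assumption: stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def route(query, descriptions=AGENT_DESCRIPTIONS):
    """Return the best-matching agent name and the score of every agent."""
    q = embed(query)
    scores = {name: cosine(q, embed(desc)) for name, desc in descriptions.items()}
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    agent, _ = route("Does my model show gender bias?")
    print(agent)  # biaseval
```

Returning the full score table alongside the winner makes it easy to spot near-ties, which is where overlapping agent descriptions would be expected to cause misrouting.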

3. Specialized Agents

The current system integrates four agents:

a) BertViz Agent

Function: Attention visualization
Purpose: Display attention distribution across tokens
Dependency: HuggingFace Transformers

b) TransformerLens Agent

Function: Analyze fine-grained layer and head-level activations
Purpose: Deep inspection of specific layer and attention head behavior
Dependency: HookedTransformer

c) RAG Explainer Agent

Function: Retrieve relevant information from domain literature
Purpose: Provide literature-supported explanations
Technology: Uses FAISS for similarity search, indexes relevant documents

d) BiasEval Agent

Function: Assess safety and demographic disparities
Evaluation Metrics:
- Toxicity: Using the RealToxicityPrompts dataset
- Regard: Using BOLD dataset to assess sentiment tendencies toward different groups
- HONEST: Evaluate harmful sentence completions
Workflow: Prompt model, sample dataset, compute scores
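
The three-step workflow (sample the dataset, prompt the model, compute scores) might look roughly like the sketch below. The `generate` and `score` callables are hypothetical placeholders for the user model and a metric such as toxicity; the stub versions exist only so the example runs without a model or dataset.

```python
import random
from statistics import mean


def run_bias_eval(prompts, generate, score, sample_size=3, seed=0):
    """Sample prompts, query the model, and aggregate a per-text score.

    `generate` and `score` are hypothetical callables supplied by the caller:
    generate(prompt) -> continuation string, score(text) -> float in [0, 1].
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sampled = rng.sample(prompts, min(sample_size, len(prompts)))
    continuations = [generate(p) for p in sampled]
    scores = [score(c) for c in continuations]
    return {"n": len(scores), "mean_score": mean(scores), "max_score": max(scores)}


# Stub model and scorer (assumptions, for illustration only).
def stub_generate(prompt):
    return prompt + " ..."


def stub_score(text):
    return 1.0 if "awful" in text.lower() else 0.0


prompts = [
    "The weather is",
    "That movie was awful and",
    "My neighbor is",
    "The team played",
]
report = run_bias_eval(prompts, stub_generate, stub_score, sample_size=4)
print(report)
```

Keeping the model and the metric behind plain callables means the same loop can serve toxicity, regard, or HONEST by swapping the scorer.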

4. Conversational Interface

Function: Provides chat interface supporting model upload, natural language questioning, and result inspection
Features:
- Interactive visualizations
- No technical expertise required
- Supports conversational exploration

Technical Innovations

1. Unified Orchestration Mechanism

Innovation: Uses LLM as orchestrator to uniformly manage the entire explanation process
Advantages: Integrates fragmented tools into a single conversational flow
Implementation: Modeled as directed graph using LangGraph, with agents sharing state
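
To make the "directed graph with shared state" idea concrete, here is a small standard-library sketch. It deliberately does not use the LangGraph API; the node names, the state keys, and the keyword rule in the router are all assumptions chosen for illustration.

```python
def orchestrate(state):
    # Reformulate the raw query; here this is only whitespace and case cleanup.
    state["query"] = " ".join(state["raw_query"].split()).lower()
    return "router"


def router(state):
    # Trivial keyword rule standing in for embedding-based routing.
    state["agent"] = "biaseval" if "bias" in state["query"] else "bertviz"
    return state["agent"]


def bertviz(state):
    state["result"] = "attention visualization placeholder"
    return "explain"


def biaseval(state):
    state["result"] = "bias scores placeholder"
    return "explain"


def explain(state):
    # Contextualize the agent output into the final answer.
    state["answer"] = f"[{state['agent']}] {state['result']}"
    return None  # terminal node


NODES = {
    "orchestrate": orchestrate,
    "router": router,
    "bertviz": bertviz,
    "biaseval": biaseval,
    "explain": explain,
}


def run_graph(raw_query, entry="orchestrate", max_steps=10):
    """Walk the graph; every node reads and writes the same state dict."""
    state = {"raw_query": raw_query, "trace": []}
    node = entry
    for _ in range(max_steps):
        if node is None:
            break
        state["trace"].append(node)
        node = NODES[node](state)  # each node returns the name of the next node
    return state


final = run_graph("  Does my model show gender BIAS? ")
print(final["trace"], final["answer"])
```

Each node returns the name of its successor, so adding an agent amounts to adding one function and one entry in `NODES`.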

2. Intelligent Routing System

Innovation: Implements query-tool matching through embedding-based similarity search
Rationale:
- Efficient: Avoids complex rule systems
- Semantic matching: Routes by semantic similarity rather than keywords, although routing accuracy is not quantified
- Scalable: Can be upgraded to LLM routing for complex scenarios

3. Modular Plugin Architecture

Innovation: Each agent encapsulates an independent explanation method
Advantages:
- Dependency isolation: Different tools' dependencies don't interfere
- Easy extension: New tools integrate seamlessly
- Independent development: Modules can be maintained and upgraded independently
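
One possible shape for such a plugin architecture is a registry that each agent joins through a decorator. The names `AgentSpec`, `REGISTRY`, and `register_agent`, and both demo agents, are hypothetical and not taken from the KnowThyself code base.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class AgentSpec:
    name: str
    description: str  # text a router could embed and match against
    run: Callable[[str], str]


REGISTRY: Dict[str, AgentSpec] = {}


def register_agent(name, description):
    """Decorator that adds a function to the registry under a unique name."""
    def decorator(fn):
        if name in REGISTRY:
            raise ValueError(f"agent {name!r} already registered")
        REGISTRY[name] = AgentSpec(name, description, fn)
        return fn
    return decorator


@register_agent("echo", "hypothetical demo agent that repeats the query")
def echo_agent(query):
    return f"echo: {query}"


@register_agent("shout", "hypothetical demo agent that upper-cases the query")
def shout_agent(query):
    # A real agent could import its heavy dependencies here, inside the
    # function, so they load only when this agent is actually invoked.
    return query.upper()


print(sorted(REGISTRY))
print(REGISTRY["shout"].run("hi"))
```

Because the registry stores a description next to each callable, a new tool becomes routable as soon as it is registered, without changes to the orchestration code.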

4. Context-Aware Explanation Generation

Innovation: Orchestrator automatically synthesizes necessary inputs (e.g., example sentences) and generates contextualized explanations
Value: Reduces user burden and provides more understandable outputs

Experimental Setup

Model Configuration

Pre-included User Models:
- GPT-2
- BERT
- LLaMA2-13B
Model Hosting: Large models hosted via Ollama for improved efficiency
Deployment Method: Supports local execution (when resources permit), requires no third-party APIs, ensuring secure analysis

Evaluation Metrics

Bias Assessment Metrics

Toxicity:
- Dataset: RealToxicityPrompts
- Evaluation: Toxicity level of model-generated content
Regard:
- Dataset: BOLD (Bias in Open-ended Language Generation Dataset)
- Evaluation: Sentiment tendency differences toward different demographic groups
- Output: Difference scores across positive, negative, neutral, and other categories
HONEST:
- Evaluation: Extent of harmful sentence completions in language models
- Purpose: Measure potential harmfulness in model continuations

Implementation Details

Framework: LangGraph, modeled as agent directed graph
Embedding Model: nomic-embed-text hosted on Ollama
Orchestration Model: Gemma3-27B
Dependency Management: Each agent independently encapsulates dependencies
Retrieval Technology: RAG agent uses FAISS for document indexing and similarity search

Experimental Results

Use Case Demonstrations

The paper showcases the system's workflow through two typical cases:

Case 1: Token Attention Visualization

User Query: "Show me how the model attends across tokens for the word 'she' in a sentence."

System Workflow:

Routing: Agent Router selects TransformerLens agent
Input Synthesis: Orchestrator automatically synthesizes sentence: "Maria went to the library because she needed a book."
Analysis: TransformerLens computes attention graphs
Visualization: Generates interactive attention visualization
Explanation: Orchestrator provides contextualized explanation:
- "Maria" receives attention from itself, <|endoftext|>, and "went"
- Indicates the model identifies "Maria" as sentence subject
- Model attends to mutually relevant words, a key feature of attention mechanisms

Result Display: Provides intuitive attention heatmap clearly showing attention weight distribution across tokens
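
As a rough illustration of how a per-token readout like the one above can be derived from an attention pattern, the sketch below applies a causal softmax to a score matrix and ranks the tokens that a chosen query token attends to. The scores are invented toy values, not output from any model, and the function names are hypothetical.

```python
import math


def causal_softmax(scores):
    """Row-wise softmax where token i may only attend to tokens 0..i."""
    pattern = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]
        m = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        pattern.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return pattern


def top_attended(tokens, pattern, query_token, k=3):
    """Return the k (token, weight) pairs the query token attends to most."""
    i = tokens.index(query_token)  # first occurrence only
    ranked = sorted(zip(tokens, pattern[i]), key=lambda tw: tw[1], reverse=True)
    return ranked[:k]


tokens = ["Maria", "went", "to", "the", "library", "because", "she"]
n = len(tokens)
# Toy scores (assumption): all zeros except a boost from "she" to "Maria".
scores = [[0.0] * n for _ in range(n)]
scores[tokens.index("she")][tokens.index("Maria")] = 2.0

pattern = causal_softmax(scores)
for token, weight in top_attended(tokens, pattern, "she", k=3):
    print(f"{token}: {weight:.3f}")
```

With a real model the pattern would come from one layer and head of the cached attention, and a heatmap is simply this matrix rendered for every query token at once.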

Case 2: Gender Bias Assessment

User Query: "Does my model show gender bias in how it answers questions?"

System Workflow:

Task Identification: Orchestrator identifies the query as a new task rather than a follow-up
Routing: Agent Router selects BiasEval agent
Submodule Selection: Orchestrator selects regard evaluation
Data Sampling: Samples prompts from BOLD dataset
Evaluation: Runs on user model and computes scores
Result Summary: Orchestrator summarizes and presents results

Evaluation Results:

"Regard_Difference": {
   "Neutral": 0.177,
   "Negative": 0.120,
   "Other": 0.057,
   "Positive": -0.354
}

Explanation:

Model generates notably less positive sentiment when continuing male-related text: the positive-regard score differs by 0.354 (35.4 percentage points) from female-related text
The system reads this gap as evidence of gender bias between the two groups
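
The reported numbers can be sanity-checked with a few lines of arithmetic. The sketch assumes each value is a difference between the two groups' average per-category regard scores, in which case the four differences should cancel out; the summary does not state which group is subtracted from which, so only magnitudes are interpreted here. The helper name `summarize` is hypothetical.

```python
# Values copied from the Regard_Difference result reported above.
regard_difference = {
    "Neutral": 0.177,
    "Negative": 0.120,
    "Other": 0.057,
    "Positive": -0.354,
}


def summarize(diff):
    """Report the total of all shifts and the category with the largest shift."""
    total = sum(diff.values())
    label, value = max(diff.items(), key=lambda kv: abs(kv[1]))
    return {
        "sum": round(total, 6),
        "largest_shift": label,
        "percentage_points": round(abs(value) * 100, 1),
    }


summary = summarize(regard_difference)
print(summary)
```

The differences sum to zero, which is consistent with the assumption that they compare two distributions over the same four categories, and the largest shift is the 35.4-point gap in the positive category.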

Experimental Findings

Seamless Task Switching: Users can seamlessly switch from attention analysis to bias assessment within the same session
High Automation: System automatically handles input synthesis, tool selection, and result interpretation
Strong Interpretability: Technical outputs (e.g., attention weights, bias scores) are converted to understandable natural language
Good Interactivity: Visualization results support interactive exploration

LLM Interpretability Research Directions

1. Attribution Methods

Research Content: Assign importance scores to tokens, samples, or hidden states
Representative Works:
- LLM Attribution survey (Li et al., 2023)
- LLM Attributor (Lee et al., 2025)
Limitations: Typically require technical expertise, lack unified interface

2. Mechanistic Analysis

Research Content: Analyze internal mechanisms of attention heads, neurons, or circuits
Representative Works:
- Transcoders (Dunefsky et al., 2024)
- Mechanistic Interpretability exploration (Gantla, 2025)
Limitations: Fragmented tools, difficult to integrate

3. Interpretability Tools

BertViz: Multi-scale attention visualization
TransformerLens: Fine-grained activation analysis
Limitations: Each operates independently, requiring separate learning and use

4. Trustworthy AI Research

TrustLLM: Trustworthiness framework for large language models
Usable XAI: Usable explainability strategies for the LLM era
This Paper's Position: Offers a practical realization of these frameworks

Advantages of This Work

Unified Platform: Integrates multiple interpretability methods into a single conversational interface, which the summary presents as a first
Lower Barriers: Use advanced explanation tools without coding
Modular Design: Supports independent development and seamless tool integration
Practice-Oriented: Transforms research tools into practical assistants

Conclusions and Discussion

Main Conclusions

System Value: KnowThyself successfully integrates LLM interpretability tools into a conversational workflow
Technical Innovation: Multi-agent orchestration and modular architecture effectively lower technical barriers
Practicality: Through interactive visualization and literature-supported explanations, enables practitioners to more effectively engage in model interpretability work
Scalability: Architecture design supports easy integration of new methods

Limitations

The paper explicitly identifies the following constraints:

Limited Tool Coverage: Currently integrates only four agents, covering limited explanation methods
Engineering Requirements: Requires additional engineering to adapt non-modular libraries
Unimodal Limitation: Supports only text input, not multimodal models
Routing Accuracy: Routing accuracy may need improvement for overlapping tasks
Dependency Management: Dependency isolation across tools requires additional engineering

Future Directions

The paper proposes the following research directions:

Expand Tool Coverage: Integrate more interpretability methods and techniques
Multimodal Support: Extend to interpretation of image, audio, and other multimodal models
Improve Routing: Enhance routing precision for overlapping task scenarios
Enhanced Visualization: Introduce richer visualization capabilities for deeper insights
Performance Optimization: Improve processing efficiency for large-scale models

In-Depth Evaluation

Strengths

1. Methodological Innovation

Architectural Innovation: Applies multi-agent systems to an LLM interpretability platform, presented as a first in this area
Interactive Paradigm: Early use of a conversational interface for model explanation
Orchestration Mechanism: Cleverly leverages LLMs themselves to orchestrate explanation processes

2. Practical Value

Lower Barriers: Significantly reduces technical barriers to using interpretability tools
Improved Efficiency: Unified interface avoids switching between multiple tools
Immediate Feedback: Conversational interaction provides immediate, understandable feedback

3. System Design

Modularity: Well-designed modularity supports independent development and maintenance
Scalability: Plugin architecture facilitates integration of new tools
Flexibility: Supports local deployment, protecting data privacy

4. Writing Quality

High Clarity: System architecture clearly described with intuitive diagrams
Rich Examples: Demonstrates system capabilities through concrete cases
Honest Transparency: Explicitly identifies limitations and future directions

Weaknesses

1. Insufficient Experimental Evaluation

Lack of Quantitative Assessment: No user studies or efficiency comparison experiments provided
No Performance Benchmarks: No systematic comparison with other interpretability platforms
Unvalidated Usability: Lacks user experience evaluation

2. Insufficient Technical Details

Routing Mechanism: Embedding-based routing accuracy not quantified
Error Handling: No discussion of handling mechanisms when query understanding fails
Scalability Limitations: No analysis of performance bottlenecks in large-scale scenarios

3. Method Limitations

Orchestrator Dependency: System performance highly depends on orchestrator LLM capabilities
Limited Tools: Only four agents, limited coverage
Unimodal: Does not support multimodal model interpretation needs

4. Reproducibility Issues

Dataset Details: Insufficient detail on evaluation dataset selection and processing
Hyperparameters: Missing critical hyperparameter settings
Deployment Requirements: Hardware requirements for local deployment not clearly specified

Impact

Contribution to the Field

Paradigm Shift: From tool collection to unified platform, potentially leading interpretability tool development
Democratization: Significantly lowers participation barriers for interpretability research
Standardization: Provides reference architecture for interpretability tool integration

Practical Value

Industrial Application: Can be directly used for enterprise model auditing and debugging
Educational Use: Suitable for teaching and training scenarios
Research Tool: Provides convenient model analysis platform for researchers

Reproducibility

Open Source Code: GitHub repository publicly available for community contributions
Complete Documentation: System architecture clearly described
Clear Dependencies: Component dependencies explicitly listed
Gaps: Detailed deployment documentation and usage tutorials are lacking

Applicable Scenarios

Ideal Application Scenarios

Model Auditing: Enterprises need rapid assessment of model bias and safety
Education and Training: Teaching LLM interpretability concepts and methods
Research Exploration: Rapid testing and comparison of different explanation methods
Prototype Development: Quick model behavior checking during development phase

Limited Scenarios

Production Environments: May require higher performance and stability guarantees
Ultra-Large Models: Current implementation may face performance bottlenecks
Customized Requirements: Highly specialized explanation needs may require extensions
Real-Time Applications: Conversational interaction may not suit real-time monitoring scenarios

References

Key Citations

Interpretability Surveys:
- Zhao et al. (2024): "Explainability for large language models: A survey"
- Provides comprehensive survey of LLM interpretability
Interpretability Tools:
- Vig (2019): BertViz - attention visualization
- Nanda & Bloom (2022): TransformerLens - mechanistic analysis
Bias Assessment:
- Gehman et al. (2020): RealToxicityPrompts
- Dhamala et al. (2021): BOLD dataset
- Nozza et al. (2021): HONEST evaluation method
Trustworthy AI:
- Huang et al. (2024): TrustLLM framework
- Wu et al. (2024): Usable XAI strategies
Technical Frameworks:
- LangGraph: Multi-agent orchestration framework
- FAISS: Efficient similarity search

Overall Evaluation

KnowThyself is a pioneering work that successfully integrates fragmented LLM interpretability tools into a unified conversational platform. Its multi-agent architecture and modular design demonstrate good engineering practices, and the conversational interface significantly lowers technical barriers.

Its primary value lies in its practice-oriented approach and extensibility, providing a practical solution for democratizing interpretability tools. As an AAAI demonstration paper, it successfully showcases the system's feasibility and potential.

The main shortcoming is the lack of quantitative evaluation and user studies, which prevents comprehensive validation of the system's effectiveness in real-world scenarios. Future work that supplies these evaluations would make the paper considerably more convincing.

Overall, this is a high-quality systems paper that provides valuable tools and insights for LLM interpretability research and application, deserving attention and further development.