KnowThyself: An Agentic Assistant for LLM Interpretability

Prasai, Du, Zhang et al.
We develop KnowThyself, an agentic assistant that advances large language model (LLM) interpretability. Existing tools provide useful insights but remain fragmented and code-intensive. KnowThyself consolidates these capabilities into a chat-based interface, where users can upload models, pose natural language questions, and obtain interactive visualizations with guided explanations. At its core, an orchestrator LLM first reformulates user queries, an agent router further directs them to specialized modules, and the outputs are finally contextualized into coherent explanations. This design lowers technical barriers and provides an extensible platform for LLM inspection. By embedding the whole process into a conversational workflow, KnowThyself offers a robust foundation for accessible LLM interpretability.

Basic Information

  • Paper ID: 2511.03878
  • Title: KnowThyself: An Agentic Assistant for LLM Interpretability
  • Authors: Suraj Prasai (Wake Forest University), Mengnan Du (New Jersey Institute of Technology), Ying Zhang (Wake Forest University), Fan Yang (Wake Forest University)
  • Classification: cs.AI, cs.IR, cs.LG, cs.MA
  • Publication Time/Conference: AAAI 2026 (40th AAAI Conference on Artificial Intelligence - Demonstration Track)
  • Paper Link: https://arxiv.org/abs/2511.03878
  • Code Repository: https://github.com/spygaurad/KnowThyself

Abstract

This paper develops KnowThyself, an agentic assistant that advances the interpretability of large language models (LLMs). While existing tools provide useful insights, they remain fragmented and require substantial coding effort. KnowThyself integrates these capabilities into a chat-based interface where users can upload models, pose natural language questions, and obtain interactive visualizations with guided explanations. At its core, an orchestrator LLM first reformulates user queries, an agent router then directs them to specialized modules, and the outputs are finally contextualized into coherent explanations. This design lowers technical barriers and provides an extensible platform for LLM inspection. By embedding the entire process within a conversational workflow, KnowThyself provides a solid foundation for accessible LLM interpretability.

Research Background and Motivation

Core Problem

Large language models, despite their excellence in language understanding, reasoning, and problem-solving, possess a black-box nature that makes their internal decision-making processes difficult to interpret, raising concerns about transparency, trust, and accountability.

Problem Significance

  1. Transparency Requirements: As LLMs are increasingly deployed in critical applications, understanding their decision mechanisms becomes crucial
  2. Research-Practice Gap: Progress in interpretability research significantly lags behind the rapid development of LLMs
  3. Technical Barriers: Existing tools require substantial technical expertise, limiting the democratization of interpretability

Limitations of Existing Approaches

  1. Fragmentation: While existing LLM interpretability methods (such as attribution methods and mechanistic analysis) provide valuable insights, they operate in isolation
  2. Difficulty of Use: Existing tools require extensive coding, which raises the technical barrier to entry
  3. Lack of Integration: Existing platforms neither support conversational exploration nor provide interactive, well-documented explanations
  4. Technical Barriers: Practitioners struggle to access and utilize the latest interpretability techniques

Research Motivation

The work aims to bridge the gap between cutting-edge interpretability research and practical applications. It does so by creating a unified, accessible, and scalable platform built on multi-agent orchestration, a modular architecture, and interactive visualization, so that broad audiences can engage with emerging explanation techniques.

Core Contributions

The main contributions of this paper include:

  1. Multi-Agent Orchestration Framework: Proposes a framework that coordinates diverse explanation tasks, supporting flexible routing and coherent explanation generation
  2. Modular Architecture: Encapsulates different explanation methods as independent agents, supporting seamless integration of new tools and future scalability
  3. Interactive Visualization Interface: Presents outputs alongside natural language explanations, significantly lowering the barrier to effective model inspection
  4. Conversational Workflow: Embeds the entire explanation process within a conversational flow, enabling model upload, querying, and result retrieval without coding

Methodology Details

Task Definition

Input:

  • User-uploaded LLM model to be interpreted
  • Natural language queries (e.g., "Show how the model attends to the token 'she' in a sentence")

Output:

  • Interactive visualization results
  • Natural language explanations with guidance
  • Relevant evaluation metrics (e.g., bias scores)

Constraints:

  • Maintain conversational coherence and context understanding
  • Support flexible invocation of multiple explanation methods
  • Ensure accessibility of technical details

Model Architecture

KnowThyself employs a four-layer architecture design:

1. Orchestrator LLM

  • Function: Acts as a supervisory model managing user interactions and guiding the explanation process
  • Specific Tasks:
    • Reformulate user queries
    • Generate necessary subtasks (e.g., sentence synthesis or tool selection)
    • Contextualize intermediate results
    • Generate coherent natural language explanations
  • Implementation: Uses Gemma3-27B model
  • Role: Ensures complex visualizations or bias metrics remain comprehensible

2. Agent Router

  • Function: Dispatches queries to specialized agents using embedding-based similarity search
  • Routing Mechanism:
    • Matches user intent with agent descriptions
    • Uses nomic-embed-text model hosted on Ollama for embeddings
    • Maintains efficiency while ensuring query-tool capability alignment
  • Scalability: Can be enhanced to LLM-based routing as system scales to handle complex scenarios
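
The routing step described above can be sketched as a nearest-neighbour search over agent descriptions. This is a minimal illustration under stated assumptions, not the KnowThyself implementation: `embed` is a toy bag-of-words stand-in for the nomic-embed-text embeddings, and the agent names and description strings are hypothetical.

```python
import math
from collections import Counter

# Hypothetical agent descriptions; the wording in the real system may differ.
AGENT_DESCRIPTIONS = {
    "bertviz": "visualize attention distribution across tokens",
    "transformerlens": "inspect layer and head level activations",
    "rag_explainer": "retrieve explanations from interpretability literature",
    "biaseval": "evaluate toxicity regard and gender bias of a model",
}


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def route(query: str) -> str:
    """Return the agent whose description is most similar to the query."""
    q = embed(query)
    return max(
        AGENT_DESCRIPTIONS,
        key=lambda name: cosine(q, embed(AGENT_DESCRIPTIONS[name])),
    )
```

With a real embedding model, paraphrased queries that share no words with a description could still match; the word-count stand-in only captures literal overlap.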

3. Specialized Agents

The current system integrates four agents:

a) BertViz Agent

  • Function: Attention visualization
  • Purpose: Display attention distribution across tokens
  • Dependency: HuggingFace Transformers

b) TransformerLens Agent

  • Function: Analyze fine-grained layer and head-level activations
  • Purpose: Deep inspection of specific layer and attention head behavior
  • Dependency: HookedTransformer

c) RAG Explainer Agent

  • Function: Retrieve relevant information from domain literature
  • Purpose: Provide literature-supported explanations
  • Technology: Uses FAISS for similarity search, indexes relevant documents

d) BiasEval Agent

  • Function: Assess safety and demographic disparities
  • Evaluation Metrics:
    • Toxicity: Using the RealToxicityPrompts dataset
    • Regard: Using BOLD dataset to assess sentiment tendencies toward different groups
    • HONEST: Evaluate harmful sentence completions
  • Workflow: Prompt model, sample dataset, compute scores

4. Conversational Interface

  • Function: Provides chat interface supporting model upload, natural language questioning, and result inspection
  • Features:
    • Interactive visualizations
    • No technical expertise required
    • Supports conversational exploration

Technical Innovations

1. Unified Orchestration Mechanism

  • Innovation: Uses LLM as orchestrator to uniformly manage the entire explanation process
  • Advantages: Integrates fragmented tools into a single conversational flow
  • Implementation: Modeled as directed graph using LangGraph, with agents sharing state
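
The idea of a directed graph whose nodes share state can be sketched in plain Python. The sketch deliberately does not use the LangGraph API; the node names, the state keys, and the keyword check in `route` are all hypothetical placeholders chosen for illustration.

```python
from typing import Callable, Dict, Optional

State = Dict[str, object]
Node = Callable[[State], State]


def reformulate(state: State) -> State:
    # Orchestrator step: normalise the raw user query.
    return {**state, "query": str(state["raw_query"]).strip().lower()}


def route(state: State) -> State:
    # Router step: a keyword check stands in for embedding similarity.
    agent = "biaseval" if "bias" in str(state["query"]) else "bertviz"
    return {**state, "agent": agent}


def run_agent(state: State) -> State:
    # Agent step: a placeholder string instead of a real tool call.
    return {**state, "result": f"output of {state['agent']}"}


def explain(state: State) -> State:
    # Orchestrator step: wrap the tool output in a short explanation.
    return {**state, "answer": f"For '{state['query']}': {state['result']}"}


NODES: Dict[str, Node] = {
    "reformulate": reformulate,
    "route": route,
    "run_agent": run_agent,
    "explain": explain,
}
# Directed edges: each node names its single successor.
EDGES: Dict[str, str] = {
    "reformulate": "route",
    "route": "run_agent",
    "run_agent": "explain",
}


def run_graph(raw_query: str, start: str = "reformulate") -> State:
    """Walk the graph from `start`, threading one shared state dict through every node."""
    state: State = {"raw_query": raw_query}
    node: Optional[str] = start
    while node is not None:
        state = NODES[node](state)
        node = EDGES.get(node)
    return state
```

Because every node reads from and writes to the same state, a later step such as `explain` can use both the reformulated query and the agent output without any direct coupling between the nodes.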

2. Intelligent Routing System

  • Innovation: Implements query-tool matching through embedding-based similarity search
  • Rationale:
    • Efficient: Avoids complex rule systems
    • Accurate: Ensures correct routing through semantic similarity
    • Scalable: Can be upgraded to LLM routing for complex scenarios

3. Modular Plugin Architecture

  • Innovation: Each agent encapsulates an independent explanation method
  • Advantages:
    • Dependency isolation: Different tools' dependencies don't interfere
    • Easy extension: New tools integrate seamlessly
    • Independent development: Modules can be maintained and upgraded independently
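
One common way to realize such a plugin design is a registry that maps agent names to a description and an entry point. The sketch below assumes this pattern for illustration only; `AgentSpec`, `REGISTRY`, `register`, and the two toy agents are hypothetical names, and the sketch does not address dependency isolation.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class AgentSpec:
    name: str
    description: str           # text a router could match queries against
    run: Callable[[str], str]  # takes a query, returns a result summary


REGISTRY: Dict[str, AgentSpec] = {}


def register(name: str, description: str):
    """Decorator that adds a function to the registry as an agent."""
    def wrap(fn: Callable[[str], str]) -> Callable[[str], str]:
        REGISTRY[name] = AgentSpec(name, description, fn)
        return fn
    return wrap


@register("bertviz", "attention visualization across tokens")
def bertviz_agent(query: str) -> str:
    # Placeholder output; a real agent would call the visualization tool.
    return f"[attention view for: {query}]"


@register("biaseval", "toxicity, regard and HONEST bias evaluation")
def biaseval_agent(query: str) -> str:
    # Placeholder output; a real agent would run the bias evaluation.
    return f"[bias scores for: {query}]"
```

Under this pattern, adding a tool means writing one decorated function; the router and orchestrator only ever see the registry.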

4. Context-Aware Explanation Generation

  • Innovation: Orchestrator automatically synthesizes necessary inputs (e.g., example sentences) and generates contextualized explanations
  • Value: Reduces user burden and provides more understandable outputs

Experimental Setup

Model Configuration

  1. Pre-included User Models:
    • GPT-2
    • BERT
    • LLaMA2-13B
  2. Model Hosting: Large models hosted via Ollama for improved efficiency
  3. Deployment Method: Supports local execution (when resources permit), requires no third-party APIs, ensuring secure analysis

Evaluation Metrics

Bias Assessment Metrics

  1. Toxicity:
    • Dataset: RealToxicityPrompts
    • Evaluation: Toxicity level of model-generated content
  2. Regard:
    • Dataset: BOLD (Bias in Open-ended Language Generation Dataset)
    • Evaluation: Sentiment tendency differences toward different demographic groups
    • Output: Difference scores across positive, negative, neutral, and other categories
  3. HONEST:
    • Evaluation: Extent of harmful sentence completions in language models
    • Purpose: Measure potential harmfulness in model continuations

Implementation Details

  1. Framework: LangGraph, modeled as agent directed graph
  2. Embedding Model: nomic-embed-text hosted on Ollama
  3. Orchestration Model: Gemma3-27B
  4. Dependency Management: Each agent independently encapsulates dependencies
  5. Retrieval Technology: RAG agent uses FAISS for document indexing and similarity search

Experimental Results

Use Case Demonstrations

The paper showcases the system's workflow through two typical cases:

Case 1: Token Attention Visualization

User Query: "Show me how the model attends across tokens for the word 'she' in a sentence."

System Workflow:

  1. Routing: Agent Router selects TransformerLens agent
  2. Input Synthesis: Orchestrator automatically synthesizes sentence: "Maria went to the library because she needed a book."
  3. Analysis: TransformerLens computes attention graphs
  4. Visualization: Generates interactive attention visualization
  5. Explanation: Orchestrator provides contextualized explanation:
    • "Maria" receives attention from itself, <endoftext>, and "went"
    • Indicates the model identifies "Maria" as sentence subject
    • Model attends to mutually relevant words, a key feature of attention mechanisms

Result Display: Provides intuitive attention heatmap clearly showing attention weight distribution across tokens
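
For readers unfamiliar with what such a heatmap plots, the following sketch computes causal (left-to-right) softmax attention weights for a single head. It is a generic illustration of scaled dot-product attention, not code from KnowThyself or TransformerLens, and the two-dimensional token vectors are made up rather than taken from a real model.

```python
import math
from typing import List


def causal_attention_weights(
    queries: List[List[float]], keys: List[List[float]]
) -> List[List[float]]:
    """Row i holds the attention token i pays to tokens 0..i; later tokens are masked to 0."""
    d = len(keys[0])
    weights = []
    for i, q in enumerate(queries):
        # Scaled dot-product scores against the tokens visible to position i.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys[: i + 1]
        ]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        row = [e / total for e in exps] + [0.0] * (len(keys) - i - 1)
        weights.append(row)
    return weights


tokens = ["<endoftext>", "Maria", "went"]
vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy vectors, not real activations
W = causal_attention_weights(vecs, vecs)
```

Each row of `W` sums to 1, so one heatmap row shows how a token distributes its attention over itself and the tokens before it.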

Case 2: Gender Bias Assessment

User Query: "Does my model show gender bias in how it answers questions?"

System Workflow:

  1. Task Identification: Orchestrator identifies the query as a new task (not a follow-up)
  2. Routing: Agent Router selects BiasEval agent
  3. Submodule Selection: Orchestrator selects regard evaluation
  4. Data Sampling: Samples prompts from BOLD dataset
  5. Evaluation: Runs on user model and computes scores
  6. Result Summary: Orchestrator summarizes and presents results

Evaluation Results:

"Regard_Difference": {
   "Neutral": 0.177,
   "Negative": 0.120,
   "Other": 0.057,
   "Positive": -0.354
}

Explanation:

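  • The Positive regard difference is -0.354: the model generates markedly less positive sentiment when continuing male-related text than female-related text (reported as a 35.4% difference)
  • The system reads this gap as clear evidence of gender bias in the evaluated model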
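
A regard difference of this shape can be obtained by averaging each regard label over the continuations for each group and subtracting the averages. The sketch below assumes that computation; the per-continuation scores are invented for illustration and which group is subtracted from which is an assumption, not something the summary specifies.

```python
from statistics import mean
from typing import Dict, List

LABELS = ("positive", "negative", "neutral", "other")


def mean_regard(scores: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each regard label over one group's continuations."""
    return {label: mean(s[label] for s in scores) for label in LABELS}


def regard_difference(
    group_a: List[Dict[str, float]], group_b: List[Dict[str, float]]
) -> Dict[str, float]:
    """Per-label difference of group averages (group_a minus group_b)."""
    a, b = mean_regard(group_a), mean_regard(group_b)
    return {label: round(a[label] - b[label], 3) for label in LABELS}


# Made-up classifier outputs for two continuations per group (illustrative only).
group_a = [
    {"positive": 0.25, "negative": 0.25, "neutral": 0.25, "other": 0.25},
    {"positive": 0.25, "negative": 0.25, "neutral": 0.5, "other": 0.0},
]
group_b = [
    {"positive": 0.5, "negative": 0.25, "neutral": 0.25, "other": 0.0},
    {"positive": 0.5, "negative": 0.0, "neutral": 0.25, "other": 0.25},
]
diff = regard_difference(group_a, group_b)
```

When each continuation's label scores sum to 1, the four differences sum to zero. The reported values (0.177 + 0.120 + 0.057 - 0.354 = 0) are consistent with that property.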

Experimental Findings

  1. Seamless Task Switching: Users can seamlessly switch from attention analysis to bias assessment within the same session
  2. High Automation: System automatically handles input synthesis, tool selection, and result interpretation
  3. Strong Interpretability: Technical outputs (e.g., attention weights, bias scores) are converted to understandable natural language
  4. Good Interactivity: Visualization results support interactive exploration

LLM Interpretability Research Directions

1. Attribution Methods

  • Research Content: Assign importance scores to tokens, samples, or hidden states
  • Representative Works:
    • LLM Attribution survey (Li et al., 2023)
    • LLM Attributor (Lee et al., 2025)
  • Limitations: Typically require technical expertise, lack unified interface

2. Mechanistic Analysis

  • Research Content: Analyze internal mechanisms of attention heads, neurons, or circuits
  • Representative Works:
    • Transcoders (Dunefsky et al., 2024)
    • Mechanistic Interpretability exploration (Gantla, 2025)
  • Limitations: Fragmented tools, difficult to integrate

3. Interpretability Tools

  • BertViz: Multi-scale attention visualization
  • TransformerLens: Fine-grained activation analysis
  • Limitations: Each operates independently, requiring separate learning and use

4. Trustworthy AI Research

  • TRUSTLLM: Trustworthiness framework for large language models
  • Usable XAI: Usable explainability strategies for the LLM era
  • This Paper's Position: Offers a practical implementation of ideas from these frameworks

Advantages of This Work

  1. Unified Platform: Integrates multiple interpretability methods into a single conversational interface, which the paper says existing platforms do not offer
  2. Lower Barriers: Use advanced explanation tools without coding
  3. Modular Design: Supports independent development and seamless tool integration
  4. Practice-Oriented: Transforms research tools into practical assistants

Conclusions and Discussion

Main Conclusions

  1. System Value: KnowThyself successfully integrates LLM interpretability tools into a conversational workflow
  2. Technical Innovation: Multi-agent orchestration and modular architecture effectively lower technical barriers
  3. Practicality: Through interactive visualization and literature-supported explanations, enables practitioners to more effectively engage in model interpretability work
  4. Scalability: Architecture design supports easy integration of new methods

Limitations

The paper explicitly identifies the following constraints:

  1. Limited Tool Coverage: Currently integrates only four agents, covering limited explanation methods
  2. Engineering Requirements: Requires additional engineering to adapt non-modular libraries
  3. Unimodal Limitation: Supports only text input, not multimodal models
  4. Routing Accuracy: Routing accuracy may need improvement for overlapping tasks
  5. Dependency Management: Dependency isolation across tools requires additional engineering

Future Directions

The paper proposes the following research directions:

  1. Expand Tool Coverage: Integrate more interpretability methods and techniques
  2. Multimodal Support: Extend to interpretation of image, audio, and other multimodal models
  3. Improve Routing: Enhance routing precision for overlapping task scenarios
  4. Enhanced Visualization: Introduce richer visualization capabilities for deeper insights
  5. Performance Optimization: Improve processing efficiency for large-scale models

In-Depth Evaluation

Strengths

1. Methodological Innovation

  • Architectural Innovation: Applies a multi-agent system to an LLM interpretability platform, which the paper positions as new
  • Interactive Paradigm: Uses a conversational interface as the main entry point for model explanation
  • Orchestration Mechanism: Uses an LLM itself to orchestrate the explanation process

2. Practical Value

  • Lower Barriers: Significantly reduces technical barriers to using interpretability tools
  • Improved Efficiency: Unified interface avoids switching between multiple tools
  • Immediate Feedback: Conversational interaction provides immediate, understandable feedback

3. System Design

  • Modularity: Well-designed modularity supports independent development and maintenance
  • Scalability: Plugin architecture facilitates integration of new tools
  • Flexibility: Supports local deployment, protecting data privacy

4. Writing Quality

  • High Clarity: System architecture clearly described with intuitive diagrams
  • Rich Examples: Demonstrates system capabilities through concrete cases
  • Honest Transparency: Explicitly identifies limitations and future directions

Weaknesses

1. Insufficient Experimental Evaluation

  • Lack of Quantitative Assessment: No user studies or efficiency comparison experiments provided
  • No Performance Benchmarks: No systematic comparison with other interpretability platforms
  • Unvalidated Usability: Lacks user experience evaluation

2. Insufficient Technical Details

  • Routing Mechanism: Embedding-based routing accuracy not quantified
  • Error Handling: No discussion of handling mechanisms when query understanding fails
  • Scalability Limitations: No analysis of performance bottlenecks in large-scale scenarios

3. Method Limitations

  • Orchestrator Dependency: System performance highly depends on orchestrator LLM capabilities
  • Limited Tools: Only four agents, limited coverage
  • Unimodal: Does not support multimodal model interpretation needs

4. Reproducibility Issues

  • Dataset Details: Insufficient detail on evaluation dataset selection and processing
  • Hyperparameters: Missing critical hyperparameter settings
  • Deployment Requirements: Hardware requirements for local deployment not clearly specified

Impact

Contribution to the Field

  1. Paradigm Shift: Moves from a collection of separate tools to a unified platform, which may influence how future interpretability tools are built
  2. Democratization: Significantly lowers participation barriers for interpretability research
  3. Standardization: Provides reference architecture for interpretability tool integration

Practical Value

  1. Industrial Application: Can be directly used for enterprise model auditing and debugging
  2. Educational Use: Suitable for teaching and training scenarios
  3. Research Tool: Provides convenient model analysis platform for researchers

Reproducibility

  • Open Source Code: GitHub repository publicly available for community contributions
  • Complete Documentation: System architecture clearly described
  • Clear Dependencies: Component dependencies explicitly listed
  • Gaps: Detailed deployment documentation and usage tutorials are not provided

Applicable Scenarios

Ideal Application Scenarios

  1. Model Auditing: Enterprises need rapid assessment of model bias and safety
  2. Education and Training: Teaching LLM interpretability concepts and methods
  3. Research Exploration: Rapid testing and comparison of different explanation methods
  4. Prototype Development: Quick model behavior checking during development phase

Limited Scenarios

  1. Production Environments: May require higher performance and stability guarantees
  2. Ultra-Large Models: Current implementation may face performance bottlenecks
  3. Customized Requirements: Highly specialized explanation needs may require extensions
  4. Real-Time Applications: Conversational interaction may not suit real-time monitoring scenarios

References

Key Citations

  1. Interpretability Surveys:
    • Zhao et al. (2024): "Explainability for large language models: A survey"
    • Provides comprehensive survey of LLM interpretability
  2. Interpretability Tools:
    • Vig (2019): BertViz - attention visualization
    • Nanda & Bloom (2022): TransformerLens - mechanistic analysis
  3. Bias Assessment:
    • Gehman et al. (2020): RealToxicityPrompts
    • Dhamala et al. (2021): BOLD dataset
    • Nozza et al. (2021): HONEST evaluation method
  4. Trustworthy AI:
    • Huang et al. (2024): TRUSTLLM framework
    • Wu et al. (2024): Usable XAI strategies
  5. Technical Frameworks:
    • LangGraph: Multi-agent orchestration framework
    • FAISS: Efficient similarity search

Overall Evaluation

KnowThyself is a pioneering work that successfully integrates fragmented LLM interpretability tools into a unified conversational platform. Its multi-agent architecture and modular design demonstrate good engineering practices, and the conversational interface significantly lowers technical barriers.

Its primary value lies in its practice-oriented approach and scalability, providing a practical solution for democratizing interpretability tools. As an AAAI demonstration paper, it successfully showcases system feasibility and potential.

The main shortcoming is the lack of quantitative evaluation and user studies, which prevents comprehensive validation of the system's effectiveness in real-world scenarios. Future work that adds these evaluations would make the paper considerably more persuasive.

Overall, this is a high-quality systems paper that provides valuable tools and insights for LLM interpretability research and application, deserving attention and further development.