VIDEE: Visual and Interactive Decomposition, Execution, and Evaluation of Text Analytics with Intelligent Agents

Lee, Ji, Wen et al.

Text analytics has traditionally required specialized knowledge in Natural Language Processing (NLP) or text analysis, which presents a barrier for entry-level analysts. Recent advances in large language models (LLMs) have changed the landscape of NLP by enabling more accessible and automated text analysis (e.g., topic detection, summarization, information extraction). We introduce VIDEE, a system that supports entry-level data analysts in conducting advanced text analytics with intelligent agents. VIDEE instantiates a human-agent collaboration workflow consisting of three stages: (1) Decomposition, which incorporates a human-in-the-loop Monte-Carlo Tree Search algorithm to support generative reasoning with human feedback, (2) Execution, which generates an executable text analytics pipeline, and (3) Evaluation, which integrates LLM-based evaluation and visualizations to support user validation of execution results. We conduct two quantitative experiments to evaluate VIDEE's effectiveness and analyze common agent errors. A user study involving participants with varying levels of NLP and text analytics experience -- from none to expert -- demonstrates the system's usability and reveals distinct user behavior patterns. The findings identify design implications for human-agent collaboration, validate the practical utility of VIDEE for non-expert users, and inform future improvements to intelligent text analytics systems.

Basic Information

Paper ID: 2506.21582
Title: VIDEE: Visual and Interactive Decomposition, Execution, and Evaluation of Text Analytics with Intelligent Agents
Authors: Sam Yu-Te Lee, Chenyang Ji, Shicheng Wen, Lifu Huang, Dongyu Liu, Kwan-Liu Ma
Classification: cs.CL cs.AI cs.HC
Publication Date: October 13, 2025 (arXiv v4)
Paper Link: https://arxiv.org/abs/2506.21582

Abstract

Text analytics traditionally requires expertise in natural language processing (NLP) or text analysis, presenting technical barriers for entry-level analysts. Recent advances in large language models (LLMs) have transformed the NLP landscape by enabling more accessible and automated text analytics (such as topic detection, summarization, information extraction, etc.). This paper introduces the VIDEE system, which enables entry-level data analysts to collaborate with intelligent agents for advanced text analytics. VIDEE instantiates a three-stage human-in-the-loop workflow: (1) Decomposition stage, combining human-in-the-loop Monte Carlo Tree Search (MCTS) to support generative reasoning with human feedback; (2) Execution stage, generating executable text analytics pipelines; (3) Evaluation stage, integrating LLM-based evaluation and visualization to support user verification of execution results.

Research Background and Motivation

Problem Definition

Traditional text analytics faces four major challenges:

Large Decomposition Space Problem: The flexibility of prompting allows multiple decomposition approaches through different subtask combinations to achieve objectives. Analysts must balance subtask difficulty against overall pipeline robustness.
Technical Knowledge Barrier: Analysts possess varying levels of technical knowledge, particularly regarding LLMs. The LLM-related field is rapidly evolving, and analysts may struggle to keep pace with the latest technologies.
Implementation and Experimentation Difficulty: Constructing and implementing text analytics pipelines requires substantial engineering effort, including handling input/output formats, intermediate data transformations, and parameter analysis.
Evaluation Challenges: Evaluating LLM-based text analytics pipelines requires unique evaluation methodologies that are not yet widely established.

Research Motivation

These challenges motivate the need for an agent system to support text analysts. Given user objectives and datasets, an agent with sufficient technical knowledge can automatically decompose objectives, search the large decomposition space, and generate text analytics plans, then implement and execute pipelines, and finally evaluate results.

Core Contributions

Proposed Three-Stage Human-in-the-Loop Workflow: Designed a complete workflow encompassing Decomposition, Execution, and Evaluation to achieve complex text analytics objectives.
Developed VIDEE System: Implemented an agent system with a visual interface that enables data analysts to perform text analytics in a code-free environment.
Technical Innovations:
- Human-in-the-loop decomposition algorithm based on Monte Carlo Tree Search (MCTS)
- Conceptual framework of analysis units to handle data structure variations
- Evaluation mechanism integrating LLM judges with visualization
Empirical Research Findings: Through systematic evaluation and user studies, provided new insights into agent systems and human-AI collaboration.

Methodology Details

Task Definition

Input: User objectives (natural language description) and text datasets
Output: Complete text analytics pipeline and its execution results
Constraints: Support code-free environment, accommodate users with varying technical levels

Three-Stage Workflow Architecture

1. Decomposition Stage

Objective: Decompose user objectives into sequences of semantic tasks
Core Algorithm: Enhanced Monte Carlo Tree Search (MCTS)
Human-AI Collaboration: Humans monitor the search process while agents explore possible pipeline options

MCTS Algorithm Enhancements:

Utilize LLM judges as reward functions
Define three evaluation criteria: complexity, coherence, and importance
Support human feedback to adjust search direction
Replace random rollout with comprehensive reward calculation
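
The enhancements listed above can be illustrated with a minimal sketch. It is not VIDEE's implementation: the `Node` class, the `propose` and `judge` callables, the `bias` field used to model human feedback, and the equal weighting of the three criteria are all assumptions made for illustration. The sketch shows the general shape of one search iteration in which a judge score replaces the random rollout and is backpropagated to the root.

```python
import math
from typing import Callable, Dict, List, Optional


class Node:
    """One semantic task in the decomposition tree (hypothetical structure)."""

    def __init__(self, task: str, parent: Optional["Node"] = None) -> None:
        self.task = task
        self.parent = parent
        self.children: List["Node"] = []
        self.visits = 0
        self.value = 0.0  # sum of rewards backpropagated through this node
        self.bias = 0.0   # assumed human-feedback term: raise or lower to steer search

    def path(self) -> List[str]:
        """Tasks from just below the root down to this node."""
        tasks, node = [], self
        while node.parent is not None:
            tasks.append(node.task)
            node = node.parent
        return tasks[::-1]


def uct(child: Node, c: float = 1.4) -> float:
    """Standard UCT score plus the assumed human bias."""
    if child.visits == 0:
        return float("inf")
    exploit = child.value / child.visits
    explore = c * math.sqrt(math.log(child.parent.visits) / child.visits)
    return exploit + explore + child.bias


def judge_reward(path: List[str], judge: Callable[[List[str]], Dict[str, float]]) -> float:
    """Average the three criteria; assumes each is in [0, 1] and higher is better."""
    scores = judge(path)
    return sum(scores[k] for k in ("complexity", "coherence", "importance")) / 3.0


def backpropagate(node: Optional[Node], reward: float) -> None:
    while node is not None:
        node.visits += 1
        node.value += reward
        node = node.parent


def mcts_step(root: Node, propose, judge) -> None:
    """Select a leaf by UCT, expand it, score new nodes with the judge, backpropagate."""
    node = root
    while node.children:
        node = max(node.children, key=uct)
    new_tasks = propose(node.path())  # stand-in for LLM-generated next tasks
    if not new_tasks:  # terminal leaf: re-score it so its statistics still update
        backpropagate(node, judge_reward(node.path(), judge))
        return
    for task in new_tasks:
        child = Node(task, parent=node)
        node.children.append(child)
        backpropagate(child, judge_reward(child.path(), judge))
```

In this sketch a user could steer the search by setting `bias` on a node, which shifts its UCT score without changing the judge's rewards. How VIDEE actually incorporates feedback is not specified here.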

2. Execution Stage

Transformation Process: Semantic tasks → Primitive tasks → Executable pipelines
Compilation Process: Generate input/output patterns, algorithm selection, and hyperparameters
Technical Support: Execution graph construction based on LangGraph

Analysis Unit Conceptual Framework:

Define input units for each primitive task
Adopt MapReduce paradigm to handle data structure variations
Automatically create new analysis units
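
As a rough illustration of the analysis unit idea, the sketch below applies a map-style primitive task to document units and then a reduce-style task that groups them into new topic units. The unit representation (plain dictionaries), the function names, and the keyword-based labeler standing in for an LLM call are all assumptions, not details taken from the paper.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

Unit = Dict[str, Any]  # hypothetical representation of one analysis unit


def map_task(units: List[Unit], fn: Callable[[Unit], Any], out_key: str) -> List[Unit]:
    """Map: run a primitive task on every unit and attach its output."""
    return [{**u, out_key: fn(u)} for u in units]


def reduce_task(
    units: List[Unit],
    group_key: str,
    fn: Callable[[List[Unit]], Any],
    unit_name: str,
) -> List[Unit]:
    """Reduce: group units by a key and emit one new analysis unit per group."""
    groups: Dict[Any, List[Unit]] = defaultdict(list)
    for u in units:
        groups[u[group_key]].append(u)
    return [
        {"unit": unit_name, group_key: key, "members": fn(members)}
        for key, members in groups.items()
    ]


def label_topic(doc: Unit) -> str:
    """Keyword stub standing in for an LLM labeling call."""
    return "LLM" if "llm" in doc["text"].lower() else "Other"


docs = [
    {"id": 1, "text": "LLM agents plan tasks"},
    {"id": 2, "text": "Bar charts for dashboards"},
    {"id": 3, "text": "Prompting an LLM judge"},
]
labeled = map_task(docs, label_topic, "topic")
topics = reduce_task(labeled, "topic", lambda ms: [m["id"] for m in ms], "topic")
```

Here the input unit is a document and the reduce step creates a new "topic" unit, so a later task can operate on topics rather than documents.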

3. Evaluation Stage

Evaluation Method: LLM judge-based evaluation without ground truth labels
Visualization: Bar charts and extended topic radial graphs
Automatic Recommendation: System recommends three evaluation criteria for each task
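
A label-free evaluation of this kind could be sketched as follows. The judge functions are simple stubs standing in for LLM calls, and the 1-5 scale, the criterion name, and the text rendering of the bar chart are assumptions made for illustration only.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence

Judge = Callable[[str], int]  # stand-in for an LLM judge returning a 1-5 score


def evaluate_outputs(outputs: List[str], judges: Dict[str, Judge]) -> Dict[str, Counter]:
    """Score every output on every criterion and return score distributions."""
    return {name: Counter(judge(o) for o in outputs) for name, judge in judges.items()}


def text_bar_chart(dist: Counter, scale: Sequence[int] = (1, 2, 3, 4, 5)) -> str:
    """Render one criterion's score distribution as a plain-text bar chart."""
    return "\n".join(f"{score} | {'#' * dist[score]}" for score in scale)


def brevity(output: str) -> int:
    """Hypothetical criterion: short outputs score high."""
    return 5 if len(output.split()) <= 3 else 2


outputs = ["short", "a much longer summary of the text", ""]
distributions = evaluate_outputs(outputs, {"brevity": brevity})
chart = text_bar_chart(distributions["brevity"])
```

Because no ground truth is involved, the distributions only show how the judge rated the outputs. The user still decides whether those ratings are acceptable.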

Technical Innovations

Combining Generative Reasoning with MCTS: Compared to the greedy strategy of beam search, MCTS's backpropagation provides backward feedback, making it more suitable for text analytics pipeline planning.
Analysis Unit Framework: Automatically handles data structure variations through MapReduce paradigm, supporting diverse combinations of primitive tasks.
Human-AI Collaboration Dynamics: Users serve as managers, LLM judges as advisors, reducing the necessity for LLM alignment.

Experimental Setup

Datasets

Decomposer Evaluation:
- LLooM scenario: HCI paper abstracts dataset
- TnT-LLM scenario: Microsoft Bing Copilot user conversation dataset
Executor Evaluation:
- Wikipedia dataset (n=210) with ground truth labels as topics
User Study:
- HCI paper abstracts dataset (100 papers)
- Concept induction task

Evaluation Metrics

Decomposer Evaluation: Arena method using o3-mini model to compare generated pipelines with human-crafted pipelines
Executor Evaluation: Concept coverage
User Study: Task completion, user behavior patterns, usability feedback
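
The arena-style comparison can be pictured as a pairwise tally. The sketch below is an assumption about its general shape, not the paper's protocol: `run_arena` and the `shorter_wins` stub are hypothetical, and in the paper the judge role is played by the o3-mini model rather than a length heuristic.

```python
from typing import Callable, Dict, List, Tuple

Pipeline = List[str]  # a pipeline as an ordered list of task names


def run_arena(
    pairs: List[Tuple[Pipeline, Pipeline]],
    judge: Callable[[Pipeline, Pipeline], str],
) -> Dict[str, int]:
    """Ask the judge to pick a winner for each (generated, human) pair and tally wins."""
    tally = {"generated": 0, "human": 0}
    for generated, human in pairs:
        winner = judge(generated, human)
        if winner not in tally:
            raise ValueError(f"unexpected verdict: {winner!r}")
        tally[winner] += 1
    return tally


def shorter_wins(generated: Pipeline, human: Pipeline) -> str:
    """Placeholder judge: prefers the pipeline with fewer steps."""
    return "generated" if len(generated) <= len(human) else "human"


pairs = [
    (["embed", "cluster"], ["summarize", "embed", "cluster"]),
    (["clean", "split", "embed", "cluster"], ["cluster"]),
]
result = run_arena(pairs, shorter_wins)
```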

Baseline Methods

Decomposer: Human-crafted pipelines (LLooM and TnT-LLM)
Executor: BERTopic and GPT-4o baseline methods

Implementation Details

Models: GPT-4o, Claude-3.5-Sonnet, Gemini-2.0
Framework: AutoGen + LangGraph
Cost: Average $0.005 per expansion, approximately 7 minutes for complete tree

Experimental Results

Main Results

Decomposer Evaluation

Performance: In 10 comparisons, 6 generated pipelines were rated as better (2 for LLooM, 4 for TnT-LLM)
Advantages: Generated pipelines are more direct and concise
Limitations: Failed to consider context window constraints for long data processing

Executor Evaluation

Concept Coverage: VIDEE reached 83%, compared with 52.6% for BERTopic and 53% for GPT-4o
Performance Improvement: About 30 percentage points over both baselines
Reliability: Achieves comparable results to LLooM human-crafted pipelines
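
The summary does not define how concept coverage is computed. One plausible reading, assumed here, is the fraction of ground-truth topics matched by at least one generated concept. The sketch below uses case-insensitive exact matching as a stand-in for whatever matching procedure the paper actually uses, and the example topics are invented.

```python
from typing import Iterable


def concept_coverage(generated: Iterable[str], ground_truth: Iterable[str]) -> float:
    """Fraction of ground-truth topics that appear among the generated concepts.

    Matching is case-insensitive exact string equality (an assumption).
    """
    truth = {t.strip().lower() for t in ground_truth}
    if not truth:
        raise ValueError("ground_truth must contain at least one topic")
    found = {g.strip().lower() for g in generated}
    return len(truth & found) / len(truth)


coverage = concept_coverage(
    generated=["Physics", "History", "Music", "Sports"],
    ground_truth=["physics", "biology", "history", "music"],
)
```

In this toy example three of the four ground-truth topics are matched, so `coverage` is 0.75.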

User Study Findings

Positive Feedback:

Clear and Intuitive Workflow: All participants completed tasks within reasonable timeframes
Importance of Automation: Even expert-level participants found it more efficient than coding
Trust in Programmatic Generation: Users trust explicit processes more than black-box systems like ChatGPT

User Behavior Patterns:

Search Strategy Preference: "Exploit-first then explore" rather than balanced strategies
Alignment vs. Recommendations: Users view LLM judges as advisors rather than ground truth
Understanding Role of Analysis Units: Explicit analysis units aid pipeline understanding and error debugging

System Limitations

Execution Errors: Incorrect analysis unit selection may occur during compilation
Learning Curve: Requires 30 minutes of training for proficient use
Technical Dependency: Heavily relies on parallelized cloud-based LLM queries

Related Work

LLM-Based Text Analytics

Individual Analytics: LLMs excel at text classification, information extraction, and other tasks
End-to-End Pipelines: TnT-LLM, LLooM, topic analysis frameworks, etc.

LLM-Assisted Data Analysis

Data cleaning and transformation tools (Data Wrangler)
Visual data exploration systems (LightVA, InterChat)
Text analytics presents unique challenges compared to traditional data analysis

Human-AI Collaboration Design Research

Prompt engineering challenges and solutions
User control and evaluation requirements in agent systems
Multi-level abstraction and interactive system design

Conclusions and Discussion

Main Conclusions

Feasibility Validation: The three-stage workflow effectively reduces technical barriers to text analytics
User Acceptance: Users with varying technical levels can successfully use the system
Technical Effectiveness: Generated pipeline quality is comparable to expert-crafted pipelines

Limitations

User Study Scale: Only 6 participants with sample bias toward graduate students
Technical Constraints: Dependent on cloud-based LLMs, lacking self-correction mechanisms
Functional Limitations: Does not support time series analysis, network analysis, or external knowledge bases

Future Directions

Conversational Agents: Integrate natural language command conversion
Feedback Loops: Feed execution and evaluation results back to the decomposition stage
Evaluation Method Extension: Support evaluation for non-text tasks such as clustering analysis
Open-Source Ecosystem Integration: Integration with tools like LangSmith

In-Depth Evaluation

Strengths

Systematic Innovation: First to propose a complete human-AI collaboration workflow for text analytics
Technical Depth: MCTS algorithm enhancements and analysis unit framework provide theoretical contributions
Practical Value: Genuinely reduces technical barriers to text analytics
Comprehensive Evaluation: Combines quantitative experiments with qualitative user studies

Weaknesses

Scalability: Heavily dependent on cloud APIs with cost and latency concerns
Error Handling: Lacks robust error detection and recovery mechanisms
Applicable Scope: Primarily suitable for standard text analytics tasks with limited support for specialized domains

Impact

Academic Contribution: Provides new paradigms for human-AI collaboration and agent system design
Practical Value: Likely to advance democratization of text analytics
Reproducibility: Built on open-source frameworks, facilitating reproduction and extension

Applicable Scenarios

Target Users: Entry-level data analysts, social science researchers, journalists
Application Domains: Customer feedback analysis, academic literature mining, social media analysis
Usage Conditions: Requires basic data analysis knowledge and 30 minutes of training time

References

This paper cites 63 related references, primarily including:

LLM text analytics applications (TnT-LLM, LLooM, etc.)
Human-AI collaboration interface design (AutoGen, LangGraph, etc.)
Visualization and interactive system design
Monte Carlo Tree Search algorithms

Overall Assessment: This is a high-quality systems paper that makes important contributions to the field of human-AI collaborative text analytics. The technical innovations are solid, experimental evaluation is comprehensive, and it has significant implications for advancing the democratization of text analytics tools. Despite some technical limitations, it provides clear directions for future research.