Comparative Analysis of Large Language Models for the Machine-Assisted Resolution of User Intentions

Flerlage, Acker, Kao

Large Language Models (LLMs) have emerged as transformative tools for natural language understanding and user intent resolution, enabling tasks such as translation, summarization, and, increasingly, the orchestration of complex workflows. This development signifies a paradigm shift from conventional, GUI-driven user interfaces toward intuitive, language-first interaction paradigms. Rather than manually navigating applications, users can articulate their objectives in natural language, enabling LLMs to orchestrate actions across multiple applications in a dynamic and contextual manner. However, extant implementations frequently rely on cloud-based proprietary models, which introduce limitations in terms of privacy, autonomy, and scalability. For language-first interaction to become a truly robust and trusted interface paradigm, local deployment is not merely a convenience; it is an imperative. This limitation underscores the importance of evaluating the feasibility of locally deployable, open-source, and open-access LLMs as foundational components for future intent-based operating systems. In this study, we examine the capabilities of several open-source and open-access models in facilitating user intention resolution through machine assistance. A comparative analysis is conducted against OpenAI's proprietary GPT-4-based systems to assess performance in generating workflows for various user intentions. The present study offers empirical insights into the practical viability, performance trade-offs, and potential of open LLMs as autonomous, locally operable components in next-generation operating systems. The results of this study inform the broader discussion on the decentralization and democratization of AI infrastructure and point toward a future where user-device interaction becomes more seamless, adaptive, and privacy-conscious through locally embedded intelligence.

Basic Information

Paper ID: 2510.08576
Title: Comparative Analysis of Large Language Models for the Machine-Assisted Resolution of User Intentions
Authors: Justus Flerlage (Technische Universität Berlin), Alexander Acker (logsight.ai GmbH), Odej Kao (Technische Universität Berlin)
Classification: cs.SE cs.AI cs.CL cs.HC
Conference: HAIC 2025: First International Workshop on Human-AI Collaborative Systems
Paper Link: https://arxiv.org/abs/2510.08576

Abstract

This research explores the transformative role of Large Language Models (LLMs) in natural language understanding and user intent resolution, particularly their capabilities in orchestrating complex workflows. The study focuses on the transition from traditional GUI-driven interfaces toward intuitive language-first interaction paradigms. However, existing implementations often rely on cloud-based proprietary models, which present limitations in privacy, autonomy, and scalability. This paper evaluates the feasibility of locally deployed open-source LLMs as foundational components for future intent-based operating systems through comparative analysis of open-source and open-access models against OpenAI's proprietary GPT-4 system.

Research Background and Motivation

Core Problems

Need for Interaction Paradigm Shift: Traditional operating systems based on GUI, hierarchical file management, and shell mechanisms require users to manually coordinate multiple applications, a process that is cumbersome and time-consuming
Privacy and Autonomy Challenges: Existing cloud-based proprietary models present limitations in privacy, autonomy, and scalability
Necessity of Local Deployment: To achieve truly robust and trustworthy language-first interaction paradigms, local deployment is not merely convenient but essential

Research Significance

Advancing the transition from GUI-driven to language-first interaction paradigms
Evaluating the feasibility of open-source LLMs in future intent-driven operating systems
Promoting decentralization and democratization of AI infrastructure

Limitations of Existing Approaches

Dependence on external cloud infrastructure, lacking autonomy
Privacy and data security concerns
Network dependency limiting application scenarios

Core Contributions

First Systematic Comparison: Comprehensive comparative analysis of open-source/open-access LLMs versus proprietary GPT-4 models on user intent resolution tasks
Practical System Architecture: Design and implementation of a Controller-based system architecture supporting dynamic execution of LLM-generated workflows
Multi-dimensional Evaluation Framework: Establishment of an evaluation framework covering functional correctness, response time, time-to-first-token, and code quality
Feasibility Validation of Open-Source LLMs: Demonstration that open-source models achieve performance levels approaching proprietary models on user intent resolution tasks

Methodology Details

Task Definition

Converting user natural language intent into executable workflows, specifically:

Input: User's natural language intent description
Output: Executable workflow in Python code form
Constraints: Code must invoke a predefined set of API functions
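
To make the task concrete, here is a minimal sketch of one input/output pair for the baseline intent "Please sleep for 5 seconds". The function name sleep_seconds and the recording list are hypothetical stand-ins, since this summary does not list the paper's actual API functions.

```python
# Hypothetical stand-in for one predefined API function; the paper's real
# function names and signatures are not given in this summary.
calls = []

def sleep_seconds(seconds: float) -> None:
    """Record the request instead of actually sleeping, to keep the demo fast."""
    calls.append(("sleep_seconds", seconds))

# Input: the user's natural language intent.
intent = "Please sleep for 5 seconds"

# Output: the kind of Python workflow an LLM might return for that intent.
generated_workflow = "sleep_seconds(5)"

# Constraint: the workflow may only call functions from the predefined set,
# so it is executed with exactly those names in scope.
exec(generated_workflow, {"__builtins__": {}, "sleep_seconds": sleep_seconds})
print(calls)
```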

System Architecture

Core Components

Controller: Central coordination unit managing communication with LLMs and workflow execution
Function Table: Catalog containing available functions and their specifications, providing function signatures and implementation callbacks
Prompt Formatter: Generates LLM prompts based on user intent and Function Table
Executor: Executes LLM-generated code in a controlled environment
LLM Service: Externally hosted LLM interface
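
The interplay between the Function Table and the Prompt Formatter can be sketched as follows. This is an illustrative reconstruction under assumptions, not the authors' implementation: the class names, the prompt wording, and the get_temperature function are all hypothetical.

```python
import inspect
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class FunctionEntry:
    signature: str      # shown to the LLM
    description: str    # shown to the LLM
    callback: Callable  # invoked when generated code calls the function

class FunctionTable:
    """Catalog of callable functions (hypothetical layout)."""

    def __init__(self) -> None:
        self._entries: Dict[str, FunctionEntry] = {}

    def register(self, func: Callable) -> Callable:
        # Derive the signature text from the Python function itself.
        signature = f"{func.__name__}{inspect.signature(func)}"
        description = inspect.getdoc(func) or ""
        self._entries[func.__name__] = FunctionEntry(signature, description, func)
        return func

    def specifications(self) -> str:
        return "\n".join(
            f"{entry.signature}: {entry.description}"
            for entry in self._entries.values()
        )

    def callbacks(self) -> Dict[str, Callable]:
        return {name: entry.callback for name, entry in self._entries.items()}

def format_prompt(intent: str, table: FunctionTable) -> str:
    """Combine the user intent with the function specifications (assumed wording)."""
    return (
        "Write Python code that fulfils the user's intent.\n"
        "Use only these functions:\n"
        f"{table.specifications()}\n"
        f"Intent: {intent}\n"
    )

table = FunctionTable()

@table.register
def get_temperature(city: str) -> float:
    """Return the current temperature in the given city."""
    return 21.5  # fixed placeholder value for the demo

prompt = format_prompt("How warm is it in Berlin?", table)
print(prompt)
```

Keeping signatures and callbacks in one table means the prompt shown to the model and the names available at execution time cannot drift apart.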

Workflow Modeling

Conceptualizing workflows as deterministic state machines
Modeling using imperative programming language (Python)
Supporting sequential steps and complex control flow structures (loops, branches)
Enabling step interruption, preemption, and asynchronous task management
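
The claim that a workflow can be viewed either as a state machine or as imperative code can be illustrated with a small example. The retry workflow, the state names, and the helper make_fetch below are hypothetical; the sketch only shows that an explicit transition loop and ordinary Python control flow visit the same steps.

```python
def make_fetch(outcomes):
    """Return a fake fetch step that yields the given outcomes in order."""
    iterator = iter(outcomes)
    return lambda: next(iterator)

def run_as_state_machine(fetch):
    """Explicit states and transitions: FETCH -> REPORT_OK / REPORT_FAIL -> DONE."""
    state, attempts, trace = "FETCH", 0, []
    while state != "DONE":
        trace.append(state)
        if state == "FETCH":
            attempts += 1
            if fetch():
                state = "REPORT_OK"
            elif attempts < 3:
                state = "FETCH"        # loop edge: retry
            else:
                state = "REPORT_FAIL"  # branch edge: give up
        else:
            state = "DONE"
    return trace

def run_as_code(fetch):
    """The same workflow written as plain imperative Python."""
    trace = []
    for _ in range(3):
        trace.append("FETCH")
        if fetch():
            trace.append("REPORT_OK")
            return trace
    trace.append("REPORT_FAIL")
    return trace

success_case = [False, False, True]
failure_case = [False, False, False]
same_on_success = (
    run_as_state_machine(make_fetch(success_case)) == run_as_code(make_fetch(success_case))
)
same_on_failure = (
    run_as_state_machine(make_fetch(failure_case)) == run_as_code(make_fetch(failure_case))
)
print(same_on_success, same_on_failure)
```

Interruption, preemption, and asynchronous tasks are not covered by this sketch.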

Technical Innovations

State Machine and Code Equivalence: Models workflows as state machines whose state transitions are implemented through Python code execution
Controlled Execution Environment: Restricts executable functions through Function Table to ensure security
Unified Multi-Model Interface: Designs a unified evaluation framework supporting multiple LLMs
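
One way to restrict generated code to the functions in the Function Table is to execute it in a namespace that contains nothing else. The sketch below assumes this approach; the summary does not say how the paper's Executor is implemented, and play_song is a hypothetical function. Emptying the builtins is not a real sandbox, which is in line with the security limitation noted later in this summary.

```python
def execute_workflow(code: str, allowed: dict) -> dict:
    """Run generated code with only the allowed callables in scope."""
    namespace = {"__builtins__": {}}  # hides open(), __import__(), and so on
    namespace.update(allowed)
    try:
        exec(code, namespace)
        return {"ok": True, "error": None}
    except Exception as exc:  # report failures instead of crashing the controller
        return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}

played = []
allowed = {"play_song": played.append}  # hypothetical API function

good = execute_workflow("play_song('track-1')", allowed)
bad = execute_workflow("open('/etc/passwd')", allowed)
print(good, bad)
```

The second call fails with a NameError because open is not part of the allowed set.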

Experimental Setup

Test Models

Open-Source/Open-Access Models:

falcon-3-10b-instruct
qwen-2.5-14b-instruct
phi-4

Proprietary Models:

gpt-4o
gpt-4o-mini
gpt-4-turbo
gpt-4.5-preview-2025-02-27

Test Intent Set

Nine user intents of varying complexity:

Simple baseline functionality (e.g., "Please sleep for 5 seconds")
External information requests (e.g., query temperature, Wikipedia summaries)
System-oriented tasks (e.g., file listing, remote installation)
Media interaction (e.g., play random song)
Composite tasks (e.g., send file to insurance company)

Evaluation Metrics

Functional Correctness: Intent resolution success rate
Response Time: Total time to receive complete output
Time-to-First-Token: Time to receive initial output
Code Quality: Presence of preamble, postscript, and code comments
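
The two timing metrics can both be taken from a single streamed response. The following is a minimal sketch assuming the model output arrives as an iterable of text chunks; fake_stream is a hypothetical stand-in for a real streaming client.

```python
import time
from typing import Iterable, Iterator, Optional, Tuple

def measure_stream(chunks: Iterable[str]) -> Tuple[str, Optional[float], float]:
    """Return (full_text, time_to_first_token, response_time) in seconds."""
    start = time.perf_counter()
    first_token_time = None
    parts = []
    for chunk in chunks:
        if first_token_time is None:
            # Time-to-first-token: delay until the first chunk arrives.
            first_token_time = time.perf_counter() - start
        parts.append(chunk)
    # Response time: delay until the complete output has been received.
    response_time = time.perf_counter() - start
    return "".join(parts), first_token_time, response_time

def fake_stream() -> Iterator[str]:
    """Hypothetical model stream that emits three chunks with a small delay."""
    for piece in ["sleep", "_seconds", "(5)"]:
        time.sleep(0.01)
        yield piece

text, ttft, total = measure_stream(fake_stream())
print(text, ttft, total)
```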

Implementation Details

Controller implemented in Python 3
Execution on Android device using Termux environment
Model temperature set to 0.0 for deterministic results
Each intent tested once per LLM
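
The evaluation procedure described above (every model queried once per intent at temperature 0.0) could be driven by a loop like the one below. The client signature, the stub models, and the check function are assumptions made for illustration; the paper's actual harness is not reproduced in this summary.

```python
from typing import Callable, Dict, List

# Hypothetical client signature: (prompt, temperature) -> generated code.
ModelClient = Callable[[str, float], str]

def run_evaluation(
    models: Dict[str, ModelClient],
    intents: List[str],
    check: Callable[[str, str], bool],
    temperature: float = 0.0,
) -> Dict[str, int]:
    """Query every model once per intent and count successful resolutions."""
    successes = {}
    for name, client in models.items():
        successes[name] = sum(
            1 for intent in intents if check(intent, client(intent, temperature))
        )
    return successes

# Stub models standing in for real LLM endpoints.
def echo_model(prompt: str, temperature: float) -> str:
    return "sleep_seconds(5)"

def empty_model(prompt: str, temperature: float) -> str:
    return ""

intents = ["Please sleep for 5 seconds"]
scores = run_evaluation(
    {"echo": echo_model, "empty": empty_model},
    intents,
    check=lambda intent, code: code == "sleep_seconds(5)",
)
print(scores)
```

Hiding each model behind the same callable signature is what lets open-source and proprietary models share one evaluation loop.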

Experimental Results

Main Results

Intent Resolution Success Rate

Model Category	Successful Resolutions	Overall Performance
Open-Source Models	7/9	Comparable to gpt-4-turbo
Proprietary Models (Top-tier)	8/9	Slightly superior to open-source

Specific Performance:

falcon-3-10b-instruct: 7/9 successful
phi-4: 7/9 successful
qwen-2.5-14b-instruct: 7/9 successful
gpt-4o, gpt-4o-mini, gpt-4.5-preview: 8/9 successful
gpt-4-turbo: 7/9 successful

Performance Metrics Comparison

Average Response Time:

Fastest: gpt-4o (1.75s)
Fastest open-source: qwen-2.5-14b-instruct (3.42s)
Slowest: gpt-4.5-preview-2025-02-27 (7.24s)

Average Time-to-First-Token:

Fastest: falcon-3-10b-instruct (353.4ms)
Slowest: gpt-4.5-preview-2025-02-27 (900.1ms)

Detailed Analysis

Failure Case Analysis

Intent 8 (Wikipedia Summary): Nearly all models failed because the content exceeded the context window
Format Issues: falcon-3-10b-instruct used incorrect code block markers in Intent 7
Function Selection Errors: Some models selected inappropriate API functions for complex intents
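
The format failure shows how sensitive a code-extraction step can be to the markers a model emits. The sketch below assumes the controller expects a fenced python block; the summary does not say which markers falcon-3-10b-instruct actually produced, so the [CODE] tags are purely hypothetical.

```python
import re
from typing import Optional

FENCE = "`" * 3  # three backticks, built this way to keep the listing readable

def extract_code(response: str) -> Optional[str]:
    """Return the body of the first fenced python block, or None if absent."""
    pattern = re.escape(FENCE) + r"python\s*\n(.*?)" + re.escape(FENCE)
    match = re.search(pattern, response, flags=re.DOTALL)
    return match.group(1).strip() if match else None

well_formed = f"Sure!\n{FENCE}python\nsleep_seconds(5)\n{FENCE}\nDone."
wrong_marker = "[CODE]\nsleep_seconds(5)\n[/CODE]"  # hypothetical bad markers

print(extract_code(well_formed))
print(extract_code(wrong_marker))
```

A strict extractor like this one treats the second response as a failed resolution even though the code inside it is valid.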

Code Quality Characteristics

Preamble/Postscript: Open-source models generally lack these; proprietary models show mixed results
Code Comments: phi-4 and most proprietary models tend to include comments
Code Correctness: Most generated code is syntactically and logically correct
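
The three code quality characteristics can be detected mechanically. This is a rough sketch that assumes responses wrap code in triple-backtick fences and that only full-line comments count; the paper's own classification procedure is not described in this summary.

```python
FENCE = "`" * 3  # three backticks

def describe_response(response: str) -> dict:
    """Flag text before/after the first fenced block and '#' comment lines in it."""
    before, _, rest = response.partition(FENCE)
    body, _, after = rest.partition(FENCE)
    code_lines = body.splitlines()[1:]  # drop the language tag line
    return {
        "preamble": bool(before.strip()),
        "postscript": bool(after.strip()),
        "comments": any(line.lstrip().startswith("#") for line in code_lines),
    }

response = f"Here is the code:\n{FENCE}python\n# wait briefly\nsleep_seconds(5)\n{FENCE}\n"
flags = describe_response(response)
print(flags)
```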

Related Work

Core Technical Foundations

Transformer Architecture: Foundation of all modern LLMs, enabling parallelized training and high-quality NLP
Code Generation: Applications in code assistance tools such as GitHub Copilot
Intent Recognition: Related research on user intent identification in conversational systems

Application Domain Extensions

Personal Assistants: Existing solutions such as Siri, Cortana, Alexa
Operating System Integration: Research on LLM-agent-oriented operating systems such as AIOS
GUI Automation: Research on AI directly operating existing GUI applications

Security and Privacy

Data Privacy: Privacy concerns in training data and user information handling
AI Risks: Systematic analysis of issues including hallucinations and erroneous code generation

Conclusions and Discussion

Main Conclusions

Performance Proximity: Open-source LLMs come close to proprietary models on user intent resolution tasks, achieving a 77.8% success rate (7/9) compared with 88.9% (8/9) for the best proprietary models
Acceptable Response Time: Proprietary models hold the advantage in response time (gpt-4o averaged 1.75s versus 3.42s for the fastest open-source model, qwen-2.5-14b-instruct), but open-source response times remain within an acceptable range
Local Deployment Feasibility: Validates the feasibility of constructing intent-driven systems using self-hosted open-source models

Limitations

Single-Test Constraint: Each intent was tested only once per model, so the results carry no measure of statistical significance
Computational Resource Requirements: Current models still require substantial computational resources, limiting truly local deployment
Security Risks: Direct execution of generated code presents security vulnerabilities requiring more robust sandboxing mechanisms
API Coverage Scope: Current API set is relatively limited, struggling with more complex user intents

Future Directions

Model Optimization: Reducing model size and computational requirements through pruning, distillation, and quantization techniques
Security Mechanisms: Developing more robust isolation and sandboxing mechanisms
API Expansion: Constructing more comprehensive APIs to handle diverse user intents
Alignment Issues: Addressing alignment concerns such as the AI shutdown problem and deceptive alignment

In-Depth Evaluation

Strengths

Significant Research Value: First systematic evaluation of open-source LLMs' application potential in intent-driven operating systems
Reasonable Experimental Design: Encompasses test cases of varying complexity with comprehensive evaluation dimensions
Technical Innovation: Modeling workflows as state machines whose transitions are realized by executing Python code is a novel framing
High Practical Value: Provides important reference for future operating system design

Weaknesses

Limited Test Scale: Only nine test cases with relatively small sample size
Missing Statistical Analysis: Lacks confidence intervals and significance testing
Insufficient Security Consideration: Relatively superficial discussion of code execution security risks
Unverified Long-term Reliability: Does not consider model stability during extended use
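
To illustrate why the missing confidence intervals matter, the reported success counts can be fed into a Wilson score interval. This calculation is not from the paper; it is a sketch that treats the nine intents as independent trials, which is itself a simplifying assumption.

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    margin = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - margin, center + margin

low_open, high_open = wilson_interval(7, 9)  # open-source models
low_prop, high_prop = wilson_interval(8, 9)  # best proprietary models
print(f"7/9: {low_open:.2f} to {high_open:.2f}")
print(f"8/9: {low_prop:.2f} to {high_prop:.2f}")
```

With only nine trials the two intervals overlap substantially, so the one-intent gap between 7/9 and 8/9 cannot be read as a significant difference.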

Impact

Academic Contribution: Provides important benchmark for LLM integration in operating systems
Practical Value: Demonstrates feasibility of open-source solutions, promoting technology democratization
Future-Oriented: Indicates direction for next-generation human-computer interaction interface design

Applicable Scenarios

Privacy-Sensitive Environments: Enterprise and personal applications requiring local processing
Resource-Constrained Devices: Mobile devices and edge computing scenarios
Customization Requirements: Professional domains requiring specific functional optimization
Research Prototypes: Academic research and proof-of-concept systems

References

The paper cites 38 references covering Transformer architecture, LLM applications, code generation, human-computer interaction, AI safety, and related research domains.

Overall Assessment: This is a forward-looking and practically valuable research paper that systematically evaluates open-source LLMs' application potential in future operating systems for the first time. While presenting certain limitations in experimental scale and security analysis, its research conclusions hold significant importance for promoting AI technology democratization and advancing next-generation human-computer interaction interface development.