
Where to Search: Measure the Prior-Structured Search Space of LLM Agents

Song

The generate-filter-refine (iterative paradigm) based on large language models (LLMs) has achieved progress in reasoning, programming, and program discovery in AI+Science. However, the effectiveness of search depends on where to search, namely, how to encode the domain prior into an operationally structured hypothesis space. To this end, this paper proposes a compact formal theory that describes and measures LLM-assisted iterative search guided by domain priors. We represent an agent as a fuzzy relation operator on inputs and outputs to capture feasible transitions; the agent is thereby constrained by a fixed safety envelope. To describe multi-step reasoning/search, we weight all reachable paths by a single continuation parameter and sum them to obtain a coverage generating function; this induces a measure of reachability difficulty; and it provides a geometric interpretation of search on the graph induced by the safety envelope. We further provide the simplest testable inferences and validate them via a majority-vote instantiation. This theory offers a workable language and operational tools to measure agents and their search spaces, proposing a systematic formal description of iterative search constructed by LLMs.



Basic Information

Paper ID: 2510.14846
Title: Where to Search: Measure the Prior-Structured Search Space of LLM Agents
Author: Zhuo-Yang Song
Categories: cs.AI cs.CL cs.LO
Published: October 16, 2025 (arXiv preprint)
Paper link: https://arxiv.org/abs/2510.14846

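Abstract

The LLM-based generate-filter-refine iterative paradigm has made progress in reasoning, programming, and program discovery in AI+Science. However, the effectiveness of search depends on where to search, that is, on how the domain prior is encoded into an operationally structured hypothesis space. To this end, the paper proposes a compact formal theory for describing and measuring LLM-assisted iterative search guided by domain priors. The author represents an agent as a fuzzy relation operator on inputs and outputs to capture feasible transitions; the agent is thereby constrained by a fixed safety envelope. To describe multi-step reasoning and search, all reachable paths are weighted by a single continuation parameter and summed to obtain a coverage generating function. This induces a measure of reachability difficulty and provides a geometric interpretation of search on the graph induced by the safety envelope.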

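Long-horizon task requirements: long-horizon tasks place higher demands on safety and controllability, so agents must operate within verifiable and controllable boundaries
Complexity challenges: long-horizon problems often involve combinatorial explosion and sparse rewards, so pure heuristics or 0/1 scoring cannot quantify reachability difficulty
Theory gap: current practice relies mainly on engineering heuristics (prompt design, filters, scoring functions, and so on) and lacks a unified language and quantitative tools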

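Limitations of Existing Methods

No unified language for measuring agents, spaces, and search together
Difficulty measuring the trade-off between reachability and safety in a way that is comparable across agents
No clear characterization or explanation of the long-horizon behavior of agents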

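Research Motivation

Build a concise, computable, model-agnostic formal theory that unifies the measurement of safety and reachability and yields testable predictions and design principles usable in engineering practice.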

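Core Contributions

A compact formal theory: the agent is formalized as a fuzzy relation operator, and the iterative search process is described in a unified way through the coverage generating function
A unified measurement framework: the continuation parameter and the coverage index give a single quantification of safety and reachability
A geometric interpretation: geometric quantities are defined on the directed graph induced by the safety envelope, giving a geometric reading of the search process
Validation of theoretical predictions: the theory's testable inferences are checked with a majority-vote instantiation, which provides external validation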

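Method Details

Task Definition

Input space: $C_1$ (the agent's input space)
Output space: $C_2$ (the agent's output space, with $C_2 \subseteq C_1$ so that outputs can be fed back as inputs for iteration)
Goal: measure and describe the iterative search process under safety constraints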

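Shortest distance: $d_0(f,g) := \inf\{n \in \mathbb{N}: N_n(f,g) \geq 1\}$, the smallest number of steps $n$ for which at least one path from $f$ to $g$ exists
Number of shortest paths: $N_{d_0}(f,g)$
Critical parameter: $p_c(f,g) := \inf\{p \in [0,1]: P_{f,g}^{\mathrm{ideal}}(p) \geq 1\}$
Coverage index: $R_c(f,g) := 1 - p_c(f,g)$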
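The four quantities above can be computed on a small example. The sketch below is not the paper's code: it assumes the ideal coverage generating function has the form $P(p) = \sum_n N_n p^n$, with $N_n$ the number of length-$n$ paths from $f$ to $g$ (an interpretation of the abstract's "weight all reachable paths by a single continuation parameter and sum them"), and it truncates the sum at a finite horizon. The toy graph and all function names are hypothetical.

```python
from collections import defaultdict


def walk_counts(adj, f, g, max_len):
    """Return [N_0, ..., N_max_len], where N_n counts length-n walks from f to g."""
    counts = []
    frontier = {f: 1}  # number of length-n walks from f ending at each node
    for _ in range(max_len + 1):
        counts.append(frontier.get(g, 0))
        nxt = defaultdict(int)
        for node, ways in frontier.items():
            for succ in adj.get(node, ()):
                nxt[succ] += ways
        frontier = dict(nxt)
    return counts


def shortest_distance(counts):
    """d_0: smallest n with N_n >= 1, or None if g is not reached within the horizon."""
    for n, c in enumerate(counts):
        if c >= 1:
            return n
    return None


def coverage(counts, p):
    """Truncated generating function P(p) = sum_n N_n * p**n (assumed form)."""
    return sum(c * p**n for n, c in enumerate(counts))


def critical_parameter(counts, tol=1e-9):
    """p_c: smallest p in [0, 1] with P(p) >= 1, by bisection (P is nondecreasing in p)."""
    if coverage(counts, 1.0) < 1.0:
        return None  # the threshold is never reached on [0, 1]
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if coverage(counts, mid) >= 1.0:
            hi = mid
        else:
            lo = mid
    return hi


# Hypothetical envelope graph: two distinct two-step routes from "a" to "d".
adj = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
counts = walk_counts(adj, "a", "d", max_len=4)
d0 = shortest_distance(counts)
n_shortest = counts[d0]
p_c = critical_parameter(counts)
r_c = 1 - p_c
```

On this toy graph the only paths have length 2, so $P(p) = 2p^2$ and the bisection converges to $p_c = 1/\sqrt{2}$. For graphs with cycles, the truncation horizon `max_len` affects the result and would need to be chosen with care.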

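Assumption 1: approximately unidirectional search (closed-loop paths are rare)
Assumption 2: low-order terms dominate (overly long trajectories are relatively rare)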
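Under these two assumptions, a rough closed form for the critical parameter can be sketched. The derivation assumes that the ideal coverage generating function is $P_{f,g}^{\mathrm{ideal}}(p) = \sum_n N_n(f,g)\,p^n$, which is an interpretation of the abstract rather than a formula stated in this summary. Keeping only the leading term:

$$P_{f,g}^{\mathrm{ideal}}(p) = \sum_{n \ge d_0} N_n(f,g)\,p^n \approx N_{d_0}(f,g)\,p^{d_0}$$

$$N_{d_0}\,p_c^{d_0} \approx 1 \;\Longrightarrow\; p_c \approx N_{d_0}^{-1/d_0}, \qquad R_c \approx 1 - N_{d_0}^{-1/d_0}$$

Because the dropped terms are nonnegative, the leading-order value can only overestimate $p_c$, so under the assumed form $1 - N_{d_0}^{-1/d_0}$ is a lower bound on $R_c$. For instance, with $d_0 = 2$ and $N_{d_0} = 2$, this gives $p_c \approx 0.707$ and $R_c \approx 0.293$.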

Experimental Setup

Experimental Environment

Search space: two-dimensional grid $G_N := \{0,\ldots,N-1\}^2$
Grid sizes: $N = 3, 5, 8$
Target points: $(1,2)$, $(3,4)$, and $(6,7)$, respectively

Agent Construction

LLM model set: gpt-4-mini, gpt-4, qwen3, qwen-plus, gemini-2.5-flash, deepseek-v3, grok-4, doubao
Majority-vote mechanism: for each position $f$, sample $m=5$ times independently and take the mode as the decision
Ideal agent: $\mu_f^{(t)}(g) := \frac{1}{n}\sum_L \mu_f^{(L,t)}(g)$
Safety envelope: $\mu_f^{0,(t)}(g) := \mathbf{1}\{\mu_f^{(t)}(g) > 0\}$
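The construction above can be illustrated with a minimal sketch for a single position $f$. It assumes that each model's membership $\mu_f^{(L,t)}(g)$ is 1 for that model's majority-vote decision and 0 otherwise, which this summary does not state explicitly; the superscript $t$ is left uninterpreted and omitted. The model names, move labels, and sampled outputs are hypothetical placeholders, not data from the paper.

```python
from collections import Counter


def majority_vote(samples):
    """Return the mode of the samples (ties go to the value seen first)."""
    return Counter(samples).most_common(1)[0][0]


def ideal_membership(decision_by_model):
    """Average crisp per-model decisions: mu_f(g) = (1/n) * sum over models L."""
    n = len(decision_by_model)
    votes = Counter(decision_by_model.values())
    return {g: count / n for g, count in votes.items()}


def safety_envelope(mu):
    """Crisp support of the membership: every g with mu_f(g) > 0."""
    return {g for g, value in mu.items() if value > 0}


# Hypothetical m = 5 samples per model at one grid position f.
samples_by_model = {
    "model_a": ["right", "right", "up", "right", "right"],
    "model_b": ["up", "up", "right", "up", "up"],
    "model_c": ["right", "up", "right", "right", "down"],
    "model_d": ["right", "right", "right", "right", "right"],
}
decision_by_model = {
    name: majority_vote(samples) for name, samples in samples_by_model.items()
}
mu = ideal_membership(decision_by_model)
envelope = safety_envelope(mu)
```

Here three of the four placeholder models settle on "right" and one on "up", so the averaged membership is 0.75 and 0.25, and the envelope contains both moves. The move "down" appears in one sample but never wins a vote, so it stays outside the envelope.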