iTool: Reinforced Fine-Tuning with Dynamic Deficiency Calibration for Advanced Tool Use
Zeng, Ding, Wang et al.
Augmenting large language models (LLMs) with external tools is a promising approach to enhance their capabilities, especially for complex tasks. Synthesizing tool-use data through real-world simulations is an effective way to achieve this. However, our investigation reveals that training gains significantly decay as synthetic data increases. The model struggles to benefit from additional synthetic data, which fails to endow it with advanced tool-use capabilities in complex scenarios. Moreover, we discovered that the above limitation usually manifests as a fragment deficiency (i.e., parameter errors) in response. To this end, we propose an iterative reinforced fine-tuning strategy designed to alleviate this limitation. This strategy involves: (1) enhancing the diversity of response for synthetic data through path exploration of Monte Carlo Tree Search; (2) iteratively pinpointing the model's deficiency by constructing fine-grained preference pairs, and then improving it by preference optimization algorithms for targeted improvement. The experiments show that our method achieves 13.11% better performance than the same-size base model. It achieves an improvement of 6.5% in complex scenarios compared to the baseline, and it also outperforms larger open-source and closed-source models.
academic
Institutions: Harbin Institute of Technology Social Computing and Information Retrieval Research Center, Huawei Technologies Co., Ltd., Shanghai Jiao Tong University, University of Science and Technology of China
Integrating Large Language Models (LLMs) with external tools is a promising way to enhance their capabilities, particularly on complex tasks, and generating synthetic tool-use data through real-world simulations is an effective means to that end. However, the authors find that training gains diminish significantly as the amount of synthetic data increases: models struggle to benefit from additional synthetic data and fail to acquire advanced tool-use capabilities in complex scenarios. They further observe that this limitation typically manifests as a fragment deficiency in the response (i.e., parameter errors). To address this, they propose an iterative reinforced fine-tuning strategy comprising: (1) increasing response diversity through Monte Carlo Tree Search (MCTS) path exploration; (2) iteratively pinpointing the model's deficiencies by constructing fine-grained preference pairs, then correcting them with preference optimization algorithms. Experiments show that the method performs 13.11% better than the same-size base model, improves on the baseline by 6.5% in complex scenarios, and outperforms larger open-source and closed-source models.
Core Problem: Existing tool-use training methods yield diminishing gains as the volume of synthetic training data grows, so models cannot learn effectively from additional synthetic data
Significance: Tool-use capability is a critical ability for LLMs in practical applications, involving information retrieval, precise computation, and hallucination reduction
Limitations of Existing Methods:
Traditional supervised fine-tuning (SFT) performs poorly in complex tool-use scenarios
Performance gains show diminishing marginal returns as the scale of synthetic data increases
Models exhibit systematic deficiencies in parameter extraction and reasoning
Main Contributions:
Identified and analyzed the diminishing-returns problem in training on synthetic tool-use data, finding that errors concentrate in parameter-related fragment deficiencies
Proposed the iTool framework, incorporating progressive warm-up training and iterative reinforcement learning as core components
Designed an MCTS-based fine-grained preference data generation method capable of effectively identifying and correcting erroneous fragments in responses
Achieved significant improvements across multiple benchmarks, with 8B parameter models surpassing larger-scale open-source and closed-source models
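To make the idea of a fragment-level preference pair concrete, the following is a minimal sketch, not the paper's implementation. It assumes a response is a single tool call written as `name(arg=value, ...)`, treats the function name and each argument as one fragment, and locates the first fragment where a preferred and a dispreferred response diverge. The names `PreferencePair`, `split_fragments`, and `build_pair` are hypothetical, and the MCTS step that would produce the two candidate responses is not shown.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class PreferencePair:
    shared_prefix: list[str]  # fragments on which both responses agree
    chosen: list[str]         # remaining fragments of the preferred response
    rejected: list[str]       # remaining fragments of the dispreferred response


def split_fragments(call: str) -> list[str]:
    """Split 'f(a=1, b=2)' into ['f', 'a=1', 'b=2'].

    Assumption: argument values contain no commas or parentheses;
    a real parser would be needed for nested or quoted values.
    """
    name, _, rest = call.partition("(")
    args = [a.strip() for a in rest.rstrip(")").split(",") if a.strip()]
    return [name.strip()] + args


def build_pair(chosen_call: str, rejected_call: str) -> PreferencePair:
    """Keep the common prefix once and isolate the first divergent fragment."""
    chosen = split_fragments(chosen_call)
    rejected = split_fragments(rejected_call)
    i = 0
    while i < min(len(chosen), len(rejected)) and chosen[i] == rejected[i]:
        i += 1
    return PreferencePair(chosen[:i], chosen[i:], rejected[i:])


pair = build_pair(
    'get_weather(city="Paris", unit="celsius")',
    'get_weather(city="Paris", unit="kelvin")',
)
print(pair.shared_prefix, pair.chosen, pair.rejected)
```

In this toy case the two responses agree on the tool name and the `city` argument, so the preference signal is confined to the `unit` parameter, which is the kind of parameter error the summary describes.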
In tool-use tasks, the LLM receives a user query q and a candidate tool set T = {t₀, t₁, ..., t_{|T|}}, with the objective of satisfying user intent by executing a specific sequence of tools. The decision process can be described as y ~ π(y | s₀, q, T), where π(·) denotes the policy model, s₀ represents the initial task state, and y represents the action taken by the model.
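The formulation above can be illustrated with a toy stand-in for the policy. This is a sketch under loud assumptions: a real policy π is an LLM, whereas here `action_probs` is a hypothetical keyword-overlap heuristic, the action y is reduced to a tool name, and the initial state s₀ is omitted. It only shows the shape of the interface: a query and a tool set go in, a distribution over actions comes out, and an action is sampled from it.

```python
from __future__ import annotations

import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    description: str


def action_probs(query: str, tools: list[Tool]) -> dict[str, float]:
    """Toy stand-in for pi(y | q, T): weight each tool by keyword overlap."""
    words = set(query.lower().split())
    weights = [1 + len(words & set(t.description.lower().split())) for t in tools]
    total = sum(weights)
    return {t.name: w / total for t, w in zip(tools, weights)}


def sample_action(query: str, tools: list[Tool], rng: random.Random) -> str:
    """Draw y ~ pi(y | q, T) from the toy distribution."""
    probs = action_probs(query, tools)
    names = list(probs)
    return rng.choices(names, weights=[probs[n] for n in names], k=1)[0]


tools = [
    Tool("get_weather", "return the weather forecast for a city"),
    Tool("calculator", "evaluate an arithmetic expression"),
]
print(action_probs("what is the weather in Paris", tools))
print(sample_action("what is the weather in Paris", tools, random.Random(0)))
```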
Fragment-level Error Identification: Through MCTS-generated fine-grained preference pairs, precisely locates erroneous fragments in responses
Dynamic Complexity Calibration: Dynamically selects complex samples based on generation perplexity, improving training efficiency
Iterative Optimization Strategy: Combines curriculum learning and reinforcement learning to progressively enhance model performance in complex scenarios
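The perplexity-based selection mentioned above can be sketched as follows. Perplexity itself is the standard quantity (the exponential of the negative mean token log-probability); everything else is an assumption made for illustration: that higher perplexity marks a sample as harder, that a fixed top fraction is kept, and the names `perplexity` and `select_hard_samples`. The summary does not specify the paper's actual threshold or selection rule.

```python
from __future__ import annotations

import math


def perplexity(token_logprobs: list[float]) -> float:
    """exp of the negative mean log-probability of the generated tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def select_hard_samples(
    samples: dict[str, list[float]], top_fraction: float = 0.5
) -> list[str]:
    """Return the ids of the highest-perplexity samples.

    Assumption: `samples` maps a sample id to the log-probabilities the
    current model assigned to its own generated tokens for that sample.
    """
    ranked = sorted(samples, key=lambda sid: perplexity(samples[sid]), reverse=True)
    keep = max(1, math.ceil(len(ranked) * top_fraction))
    return ranked[:keep]


samples = {
    "easy": [-0.1, -0.1],
    "mid": [-0.5, -0.5],
    "hard": [-2.0, -1.0],
}
print(select_hard_samples(samples, top_fraction=0.5))
```

In an iterative setup, the log-probabilities would be recomputed with the updated model before each round so that the selected subset tracks what the model currently finds difficult.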
The paper cites important works in tool-use, reinforcement learning, and preference optimization, including:
Toolformer (Schick et al., 2023)
DPO (Rafailov et al., 2024)
SimPO (Meng et al., 2024)
ToolLLaMA (Qin et al., 2023)
MCTS-related work (Coulom, 2006; Grill et al., 2020)
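For readers unfamiliar with the two preference-optimization objectives cited above, the sketch below computes the per-pair DPO and SimPO losses from sequence log-probabilities. The formulas follow the published definitions of those objectives; the function names and the default values of `beta` and `gamma` are illustrative assumptions, and this summary does not state which objective or hyperparameters iTool uses.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(pol_w: float, pol_l: float, ref_w: float, ref_l: float,
             beta: float = 0.1) -> float:
    """DPO: -log sigmoid(beta * [(pol_w - ref_w) - (pol_l - ref_l)]).

    pol_* and ref_* are summed log-probabilities of the chosen (w) and
    rejected (l) responses under the policy and the frozen reference model.
    """
    margin = beta * ((pol_w - ref_w) - (pol_l - ref_l))
    return -log_sigmoid(margin)


def simpo_loss(pol_w: float, pol_l: float, len_w: int, len_l: int,
               beta: float = 2.0, gamma: float = 1.0) -> float:
    """SimPO: length-normalized log-probabilities, a target margin gamma,
    and no reference model."""
    margin = beta * (pol_w / len_w - pol_l / len_l) - gamma
    return -log_sigmoid(margin)


# The loss falls below log(2) once the policy prefers the chosen response
# more strongly than the reference does.
print(dpo_loss(pol_w=-4.0, pol_l=-6.0, ref_w=-5.0, ref_l=-5.0))
print(simpo_loss(pol_w=-10.0, pol_l=-30.0, len_w=10, len_l=10))
```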
Overall Assessment: This is a high-quality research paper that accurately identifies key problems in tool-use training, proposes innovative and effective solutions, and validates method effectiveness through comprehensive experiments. Despite computational cost limitations, it demonstrates significant academic contributions and practical value.