iTool: Reinforced Fine-Tuning with Dynamic Deficiency Calibration for Advanced Tool Use

Zeng, Ding, Wang et al.

Augmenting large language models (LLMs) with external tools is a promising approach to enhance their capabilities, especially for complex tasks. Synthesizing tool-use data through real-world simulations is an effective way to achieve this. However, our investigation reveals that training gains significantly decay as synthetic data increases. The model struggles to benefit from additional synthetic data, which fails to endow it with advanced tool-use capabilities in complex scenarios. Moreover, we discovered that the above limitation usually manifests as a fragment deficiency (i.e., parameter errors) in the response. To this end, we propose an iterative reinforced fine-tuning strategy designed to alleviate this limitation. This strategy involves: (1) enhancing the diversity of responses for synthetic data through path exploration with Monte Carlo Tree Search; (2) iteratively pinpointing the model's deficiency by constructing fine-grained preference pairs, and then correcting it through targeted preference optimization. The experiments show that our method achieves 13.11% better performance than the same-size base model. It achieves an improvement of 6.5% in complex scenarios compared to the baseline, and it also outperforms larger open-source and closed-source models.

Basic Information

Paper ID: 2501.09766
Title: iTool: Reinforced Fine-Tuning with Dynamic Deficiency Calibration for Advanced Tool Use
Authors: Yirong Zeng, Xiao Ding, Yuxian Wang, Weiwen Liu, Wu Ning, Yutai Hou, Xu Huang, Duyu Tang, Dandan Tu, Bing Qin, Ting Liu
Institutions: Harbin Institute of Technology Social Computing and Information Retrieval Research Center, Huawei Technologies Co., Ltd., Shanghai Jiao Tong University, University of Science and Technology of China
Classification: cs.CL cs.AI cs.LG
Publication Date: January 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2501.09766

Abstract

Integrating large language models (LLMs) with external tools is an effective way to enhance their capabilities, particularly on complex tasks, and synthesizing tool-use data through real-world simulation is an effective means to that end. However, the authors find that training gains decay significantly as the amount of synthetic data grows: models struggle to benefit from additional synthetic data and fail to acquire advanced tool-use capabilities in complex scenarios. They observe that this limitation typically manifests as a fragment deficiency in the response (i.e., parameter errors). To address it, they propose an iterative reinforced fine-tuning strategy that (1) increases response diversity on the synthetic data through Monte Carlo Tree Search (MCTS) path exploration, and (2) iteratively pinpoints the model's deficiencies by constructing fine-grained preference pairs, then corrects them with preference optimization. Experiments show 13.11% better performance than the same-size base model, a 6.5% improvement over the baseline in complex scenarios, and better results than larger open-source and closed-source models.

Research Background and Motivation

Problem Definition

Core Problem: Existing tool-use training methods suffer from diminishing training gains when processing synthetic data, with models unable to effectively learn from increased synthetic data volumes
Significance: Tool-use capability is a critical ability for LLMs in practical applications, involving information retrieval, precise computation, and hallucination reduction
Limitations of Existing Methods:
- Traditional supervised fine-tuning (SFT) performs poorly in complex tool-use scenarios
- Performance gains diminish as the scale of synthetic data increases
- Models exhibit systematic deficiencies in parameter extraction and reasoning

Research Findings

Preliminary investigation reveals:

In the BFCL evaluation, 51% of errors are parameter value errors and 26% are parameter name errors
Errors typically affect only small fragments of a response; most of the content matches the ground truth
With traditional SFT, performance improvement slows markedly once 30% of the data has been used
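
To make the fragment-level error categories concrete, the following is a minimal sketch of how one predicted tool call could be compared against its ground truth. The dict layout (`name`, `arguments`), the label strings, and the `get_weather` example are all hypothetical and are not taken from the paper or from BFCL's evaluation code.

```python
def classify_call_error(predicted: dict, expected: dict) -> str:
    """Return a coarse error label for one tool call.

    Both arguments are hypothetical dicts of the form
    {"name": <tool name>, "arguments": {<param>: <value>, ...}}.
    """
    if predicted["name"] != expected["name"]:
        return "tool_name_error"
    pred_args, gold_args = predicted["arguments"], expected["arguments"]
    # A missing, extra, or misspelled parameter counts as a name error.
    if set(pred_args) != set(gold_args):
        return "parameter_name_error"
    # Same parameter names, but at least one value differs.
    if any(pred_args[k] != gold_args[k] for k in gold_args):
        return "parameter_value_error"
    return "correct"


gold = {"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}
pred = {"name": "get_weather", "arguments": {"city": "Paris", "unit": "kelvin"}}
print(classify_call_error(pred, gold))  # parameter_value_error
```

In this example only the `unit` value is wrong while the tool name and the other argument match the ground truth, which is the kind of small, localized deficiency the findings describe.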

Core Contributions

Identified and analyzed the diminishing returns problem in synthetic tool-use data training, discovering that errors concentrate on parameter-related fragmented defects
Proposed the iTool framework, incorporating progressive warm-up training and iterative reinforcement learning as core components
Designed an MCTS-based fine-grained preference data generation method capable of effectively identifying and correcting erroneous fragments in responses
Achieved significant improvements across multiple benchmarks, with 8B parameter models surpassing larger-scale open-source and closed-source models

Methodology Details

Task Definition

In tool-use tasks, the LLM receives a user query q and a candidate tool set T = {t_0, t_1, ..., t_{|T|}}, with the objective of satisfying user intent by executing a specific sequence of tools. The decision process can be described as y ~ π(y | s_0, q, T), where π(·) denotes the policy model, s_0 represents the initial task state, and y represents the action taken by the model.

Model Architecture

1. Progressive Warm-up Training

Employs a curriculum learning strategy progressing from easy to difficult:

Data Classification Criteria:

Simple: tool count ≤ 1, tool set string length < 1000, required tool calls ≤ 1
Medium: 1 < tool count < 4, string length < 2000, tool calls < 4
Difficult: tool count ≥ 4, string length > 2000, tool calls ≥ 4
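
The criteria above can be sketched as a small bucketing function. The summary does not say how the three conditions within a level combine, so this sketch assumes that a sample is simple only if all simple conditions hold, difficult if any difficult condition holds, and medium otherwise. The function and parameter names are hypothetical.

```python
def difficulty(tool_count: int, toolset_chars: int, tool_calls: int) -> str:
    """Assign a curriculum level to one training sample.

    Assumption: "simple" needs every simple condition, "difficult" needs
    any difficult condition, and everything else falls into "medium".
    """
    if tool_count <= 1 and toolset_chars < 1000 and tool_calls <= 1:
        return "simple"
    if tool_count >= 4 or toolset_chars > 2000 or tool_calls >= 4:
        return "difficult"
    return "medium"


# Hypothetical samples: (tool count, tool-set string length, required calls).
samples = [(1, 500, 1), (2, 1500, 2), (5, 2500, 4)]
stages = {"simple": [], "medium": [], "difficult": []}
for sample in samples:
    stages[difficulty(*sample)].append(sample)
print({level: len(items) for level, items in stages.items()})
```

Training would then visit the `simple`, `medium`, and `difficult` buckets in that order, which is the easy-to-difficult progression described above.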

Training Loss:

L_{warm-up} = Σ_{i=1}^{3} L_i
where L_i = -E_{(q,y)~D_i} [log P_M(y | q, T)], and D_1, D_2, D_3 are the simple, medium, and difficult subsets

2. MCTS-based Iterative Reinforcement Learning

Complex Data Sampling: Uses generation perplexity to measure sample complexity:

h = (1 / P_M(y | q, T))^{1/n}, where n is the number of tokens in y

Each iteration selects the top 10% most complex data for subsequent processing.
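
As an illustration of this sampling step, the sketch below computes perplexity from per-token log-probabilities and keeps the highest-perplexity tenth of the samples. It assumes the token log-probabilities of each reference response under the current model are already available; the function names and sample ids are hypothetical.

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """h = (1 / P(y | q, T)) ** (1 / n), computed in log space for stability."""
    n = len(token_logprobs)
    return math.exp(-sum(token_logprobs) / n)


def select_most_complex(samples: dict[str, list[float]], fraction: float = 0.10) -> list[str]:
    """Return the ids of the `fraction` of samples with the highest perplexity."""
    ranked = sorted(samples, key=lambda sid: perplexity(samples[sid]), reverse=True)
    k = max(1, round(fraction * len(ranked)))
    return ranked[:k]


# Ten hypothetical samples; "s9" has the lowest per-token log-probability.
samples = {f"s{i}": [-0.1 * (i + 1)] * 5 for i in range(10)}
print(select_most_complex(samples))  # ['s9']
```

Working with summed log-probabilities avoids underflow that would occur if the sequence probability were multiplied out directly for long responses.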

MCTS Step-level Preference Generation:

Selection Phase: Employs PUCT algorithm balancing exploration and exploitation
```
s_{t+1} = argmax_a [Q(s_t, a) + c·p(a|s_t)√(N(s_t))/(1+N(n(s_t,a)))]
```
Expansion Phase: Adds new child nodes at a leaf node and evaluates their rewards
```
R(s_t) = O(s_t) + C(s_t)
```
Backpropagation Phase: Updates visit counts and state values bottom-up
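
The selection rule can be sketched in a few lines. This is a generic PUCT scorer written to match the formula above, not the authors' implementation; the `Edge` class, the action names, and the default `c = 1.0` are assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Edge:
    prior: float          # p(a | s_t), probability assigned by the policy model
    q_value: float = 0.0  # Q(s_t, a), current value estimate
    visits: int = 0       # N(n(s_t, a)), visit count of the child node


def puct_select(parent_visits: int, edges: dict[str, Edge], c: float = 1.0) -> str:
    """Pick the action maximizing Q + c * p * sqrt(N(s_t)) / (1 + N(child))."""
    def score(edge: Edge) -> float:
        explore = c * edge.prior * math.sqrt(parent_visits) / (1 + edge.visits)
        return edge.q_value + explore
    return max(edges, key=lambda action: score(edges[action]))


edges = {
    "call_search": Edge(prior=0.6, q_value=0.2, visits=10),
    "call_lookup": Edge(prior=0.4, q_value=0.1, visits=0),
}
print(puct_select(parent_visits=10, edges=edges))  # call_lookup
```

Although `call_search` has the higher value estimate, the unvisited `call_lookup` wins because its exploration bonus is not yet damped by a visit count. Setting `c = 0` removes the bonus and reduces the rule to greedy selection on Q.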

Iterative Preference Optimization: Employs SimPO algorithm for preference optimization:

ℓ_i(π_θ) = -E_{(x,y^w,y^l)~D_i} [log σ(h^{y^w}_{π_θ} - h^{y^l}_{π_θ} - γ)]
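
For a single preference pair, the loss above can be evaluated as follows. The sketch assumes, following the SimPO formulation, that h^{y}_{π_θ} is the length-normalized log-likelihood of response y scaled by a factor β; the summary does not spell this out, and the default values of `beta` and `gamma` here are placeholders rather than the paper's hyperparameters.

```python
import math


def simpo_loss(logp_chosen: float, len_chosen: int,
               logp_rejected: float, len_rejected: int,
               beta: float = 2.0, gamma: float = 0.5) -> float:
    """-log sigmoid(h_w - h_l - gamma) for one (chosen, rejected) pair.

    logp_* are summed token log-probabilities under the policy and len_*
    are token counts. Assumes h = beta * logp / length (SimPO-style reward).
    """
    h_w = beta * logp_chosen / len_chosen
    h_l = beta * logp_rejected / len_rejected
    x = -(h_w - h_l - gamma)
    # -log(sigmoid(-x)) equals softplus(x); this form avoids overflow in exp.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


good = simpo_loss(logp_chosen=-2.0, len_chosen=4, logp_rejected=-8.0, len_rejected=4)
bad = simpo_loss(logp_chosen=-8.0, len_chosen=4, logp_rejected=-2.0, len_rejected=4)
print(good < bad)  # True
```

The loss is small when the policy already assigns the chosen response a higher normalized likelihood than the rejected one by more than the margin γ, and grows roughly linearly when the ordering is reversed.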

Technical Innovations

Fragment-level Error Identification: Through MCTS-generated fine-grained preference pairs, precisely locates erroneous fragments in responses
Dynamic Complexity Calibration: Dynamically selects complex samples based on generation perplexity, improving training efficiency
Iterative Optimization Strategy: Combines curriculum learning and reinforcement learning to progressively enhance model performance in complex scenarios
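
One plausible way to turn a search tree into fragment-level preference pairs is sketched below: among candidate next steps that share the same prefix, the step with the highest estimated value is treated as chosen and the one with the lowest as rejected. This pairing rule is an assumption for illustration; the summary does not give the exact construction, and the function name, `min_gap` parameter, and example strings are hypothetical.

```python
def step_preference_pair(prefix: str, candidates: dict[str, float], min_gap: float = 0.0):
    """Build one step-level preference pair from sibling steps.

    `candidates` maps each candidate next-step text to its estimated Q value.
    Returns None when the siblings cannot be told apart by more than `min_gap`.
    """
    if len(candidates) < 2:
        return None
    chosen = max(candidates, key=candidates.get)
    rejected = min(candidates, key=candidates.get)
    if candidates[chosen] - candidates[rejected] <= min_gap:
        return None
    return {"prompt": prefix, "chosen": chosen, "rejected": rejected}


pair = step_preference_pair(
    prefix='get_weather(city="Paris", unit=',
    candidates={'"celsius")': 0.9, '"kelvin")': 0.2},
)
print(pair["chosen"], pair["rejected"])  # "celsius") "kelvin")
```

Because both continuations share the prefix, the pair differs only in the erroneous fragment, so the preference signal is concentrated on the parameter value rather than spread over the whole response.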

Experimental Setup

Datasets

Training Data: ToolACE dataset containing 100K samples of general tool-use data
Evaluation Datasets:
- Berkeley Function-Calling Leaderboard (BFCL): 4K+ instances including Non-live (simple), Live (complex), Multi-turn, and Hallucination detection
- API-Bank: 314 tool-use conversations, 753 API calls

Evaluation Metrics

Accuracy: Performance accuracy across various subtasks
Overall Performance: Weighted average scores across multiple dimensions

Baseline Methods

Closed-source Models: GPT-4 series, Gemini series, o1-mini, etc.
Open-source Base Models: LLaMA-3.1 series, Qwen2.5 series, etc.
Fine-tuned Models: ToolACE-8B, xLAM series, Hammer series, etc.

Implementation Details

Base Model: LLaMA3.1-8B-Instruct
Training Strategy: LoRA for warm-up phase, QLoRA for reinforcement learning phase
Hardware Configuration: 8×32GB V100 GPUs, total training time 28 hours

Experimental Results

Main Results

BFCL Benchmark Results:

iTool-8B achieves 63.26% overall accuracy, ranking first
Achieves 78.29% in Live (complex scenarios), surpassing GPT-4o-2024-08-06's 75.43%
Achieves 23.84% in Multi-turn tasks, significantly outperforming other comparable-scale models

API-Bank Results:

L1 task: 78.89% (vs ToolACE-8B's 75.94%)
L2 task: 52.87% (vs ToolACE-8B's 47.41%)

Ablation Study

Component Contribution Analysis (accuracy in %; rows marked "+" report gains in points over the base model):

Component	Non-live	Live	Multi-turn
Base Model	81.15	57.93	11.38
+ SFT	+7.8	+17.0	+6.0
+ Warm-up	+7.2	+17.9	+8.3
+ IRL (iTool)	+9.5	+21.2	+12.5

Key Findings:

On Multi-turn, warm-up training adds 2.3 points over plain SFT (+8.3 vs. +6.0), and iterative reinforcement learning adds a further 4.2 points (+12.5 vs. +8.3)
Improvements are most significant in complex scenarios (Live and Multi-turn)
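
The per-stage increments can be recovered from the ablation table by subtracting consecutive rows. The short calculation below assumes each "+" row reports the gain over the base model, which is how the table reads; the variable and function names are illustrative.

```python
# Gains over the base model, copied from the ablation table (points).
gains = {
    "SFT":     {"Non-live": 7.8, "Live": 17.0, "Multi-turn": 6.0},
    "Warm-up": {"Non-live": 7.2, "Live": 17.9, "Multi-turn": 8.3},
    "IRL":     {"Non-live": 9.5, "Live": 21.2, "Multi-turn": 12.5},
}


def increment(later: str, earlier: str) -> dict[str, float]:
    """Difference between two rows, rounded to the table's precision."""
    return {k: round(gains[later][k] - gains[earlier][k], 1) for k in gains[later]}


print(increment("Warm-up", "SFT"))  # {'Non-live': -0.6, 'Live': 0.9, 'Multi-turn': 2.3}
print(increment("IRL", "Warm-up"))  # {'Non-live': 2.3, 'Live': 3.3, 'Multi-turn': 4.2}
```

Read this way, warm-up training mainly helps Multi-turn relative to plain SFT, while the iterative reinforcement learning stage adds gains on all three categories, with the largest on Multi-turn.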

Training Gains Analysis

Compared to traditional SFT, iTool exhibits better gain curves as data scale increases:

SFT methods plateau after 30% of data
iTool maintains steeper improvement curves on Live metrics

Generalization Verification

Performance across different datasets and model architectures:

Synthetic datasets (ToolACE, xLAM): +4.42 to +6.49 improvement
Non-synthetic datasets (BFCL-half): +2.17 to +3.65 improvement
Consistent improvements across models of different scales, from 3B to 8B parameters

Related Work

Tool-Use Research

Early Work: Toolformer, ToolAlpaca and others explored LLMs' tool-use potential
Tuning-free Methods: Unlocking inherent capabilities through prompt engineering (ReAct, RestGPT)
Tuning-based Methods: ToolLLaMA extends tool sets and investigates data scale effects

Reinforcement Learning Methods

Traditional Methods: Online RL algorithms like PPO are complex and difficult to optimize
Direct Preference Optimization: DPO and variants (SimPO, IPO, ORPO) provide simpler offline algorithms
Iterative Training: Continuous improvement through reference model updates and new preference pair generation

Conclusions and Discussion

Main Conclusions

Identified a critical problem in synthetic tool-use data training: diminishing training gains stem primarily from parameter-related fragment errors
Proposed effective solutions: Enhanced data diversity through MCTS and iterative reinforcement learning to correct erroneous fragments
Achieved significant performance improvements: 8B parameter models surpass larger-scale models across multiple benchmarks

Limitations

Computational Resource Requirements: MCTS process demands substantial computational resources (7 hours per iteration on 8 V100 GPUs)
Scale Constraints: Due to resource limitations, validation on larger models (30B or 70B) remains incomplete
Dataset Coverage: In-depth analysis conducted only on a single synthetic dataset

Future Directions

Efficiency Optimization: Develop more efficient preference data generation methods
Scale Extension: Validate method effectiveness on larger-scale models
Data Diversity: Test method generalization across more public datasets

In-depth Evaluation

Strengths

Accurate Problem Identification: Through detailed error type analysis, precisely identifies the root cause of diminishing training gains
Reasonable Method Design: The strategy combining curriculum learning and reinforcement learning aligns with human learning principles
Comprehensive Experiments: Includes thorough ablation studies, generalization verification, and cost-benefit analysis
Significant Results: Achieves consistent and substantial improvements across multiple benchmarks

Weaknesses

High Computational Cost: MCTS process overhead may limit practical applicability
Insufficient Theoretical Analysis: Lacks theoretical explanation for why MCTS effectively resolves fragment error problems
Incomplete Comparisons: Limited comparison with other methods that address diminishing training gains

Impact

Academic Contribution: Provides a novel solution to the problem of diminishing training gains in tool-use training
Practical Value: Achieves significant improvements while maintaining computational feasibility
Reproducibility: Provides implementation details and open-source code

Applicable Scenarios

Complex Tool-Use Scenarios: Particularly suitable for tasks requiring multi-tool coordination and complex parameter reasoning
Synthetic Data Training: Provides effective solutions for leveraging synthetic data to enhance model capabilities
Resource-Rich Research Environments: Requires adequate computational resources to support MCTS processes

References

The paper cites important works in tool-use, reinforcement learning, and preference optimization, including:

Toolformer (Schick et al., 2023)
DPO (Rafailov et al., 2024)
SimPO (Meng et al., 2024)
ToolLLaMA (Qin et al., 2023)
MCTS-related work (Coulom, 2006; Grill et al., 2020)

Overall Assessment: This is a high-quality research paper that accurately identifies key problems in tool-use training, proposes innovative and effective solutions, and validates method effectiveness through comprehensive experiments. Despite computational cost limitations, it demonstrates significant academic contributions and practical value.