iTool: Reinforced Fine-Tuning with Dynamic Deficiency Calibration for Advanced Tool Use

Zeng, Ding, Wang et al.
Augmenting large language models (LLMs) with external tools is a promising approach to enhance their capabilities, especially for complex tasks. Synthesizing tool-use data through real-world simulations is an effective way to achieve this. However, our investigation reveals that training gains significantly decay as synthetic data increases. The model struggles to benefit from additional synthetic data, which fails to endow it with advanced tool-use capabilities in complex scenarios. Moreover, we discovered that the above limitation usually manifests as a fragment deficiency (i.e., parameter errors) in response. To this end, we propose an iterative reinforced fine-tuning strategy designed to alleviate this limitation. This strategy involves: (1) enhancing the diversity of response for synthetic data through path exploration of Monte Carlo Tree Search. (2) iteratively pinpointing the model's deficiency by constructing fine-grained preference pairs, and then improving it by preference optimization algorithms for targeted improvement. The experiments show that our method achieves 13.11% better performance than the same-size base model. It achieves an improvement of 6.5% in complex scenarios compared to the baseline, and it also outperforms larger open-source and closed-source models.

Basic Information

  • Paper ID: 2501.09766
  • Title: iTool: Reinforced Fine-Tuning with Dynamic Deficiency Calibration for Advanced Tool Use
  • Authors: Yirong Zeng, Xiao Ding, Yuxian Wang, Weiwen Liu, Wu Ning, Yutai Hou, Xu Huang, Duyu Tang, Dandan Tu, Bing Qin, Ting Liu
  • Institutions: Harbin Institute of Technology Social Computing and Information Retrieval Research Center, Huawei Technologies Co., Ltd., Shanghai Jiao Tong University, University of Science and Technology of China
  • Classification: cs.CL cs.AI cs.LG
  • Publication Date: January 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2501.09766

Abstract

Integrating large language models (LLMs) with external tools is an effective way to extend their capabilities, particularly on complex tasks, and synthesizing tool-use data through real-world simulations is an effective way to train for it. However, the authors find that training gains decay significantly as the volume of synthetic data grows: models struggle to benefit from additional synthetic data and fail to acquire advanced tool-use capabilities in complex scenarios. They observe that this limitation typically manifests as a fragment deficiency in the response (i.e., parameter errors). To address this, they propose an iterative reinforced fine-tuning strategy comprising: (1) enhancing response diversity for synthetic data through Monte Carlo Tree Search (MCTS) path exploration; (2) iteratively pinpointing the model's deficiencies by constructing fine-grained preference pairs, then correcting them with preference optimization algorithms. Experiments show that the method performs 13.11% better than the same-size base model, improves on the baseline by 6.5% in complex scenarios, and outperforms larger open-source and closed-source models.

Research Background and Motivation

Problem Definition

  1. Core Problem: Training on synthetic tool-use data yields diminishing gains; models fail to learn effectively from larger volumes of synthetic data
  2. Significance: Tool-use capability is a critical ability for LLMs in practical applications, involving information retrieval, precise computation, and hallucination reduction
  3. Limitations of Existing Methods:
    • Traditional supervised fine-tuning (SFT) performs poorly in complex tool-use scenarios
    • Performance gains show diminishing returns as the scale of synthetic data increases
    • Models exhibit systematic deficiencies in parameter extraction and reasoning

Research Findings

Preliminary investigation reveals:

  • In the BFCL evaluation, 51% of errors stem from wrong parameter values and 26% from wrong parameter names
  • Errors typically affect only small fragments of a response, while most of the content matches the ground truth
  • With traditional SFT, performance improvement slows markedly once 30% of the data has been used

Core Contributions

  1. Identified and analyzed the diminishing returns problem in synthetic tool-use data training, discovering that errors concentrate on parameter-related fragmented defects
  2. Proposed the iTool framework, incorporating progressive warm-up training and iterative reinforcement learning as core components
  3. Designed an MCTS-based fine-grained preference data generation method capable of effectively identifying and correcting erroneous fragments in responses
  4. Achieved significant improvements across multiple benchmarks, with 8B parameter models surpassing larger-scale open-source and closed-source models

Methodology Details

Task Definition

In tool-use tasks, the LLM receives a user query q and a candidate tool set T = {t₀, t₁, ..., t_{|T|}}, with the objective of satisfying user intent by executing a specific sequence of tools. The decision process can be described as y ~ π(y | s₀, q, T), where π(·) denotes the policy model, s₀ represents the initial task state, and y represents the action taken by the model.

Model Architecture

1. Progressive Warm-up Training

Employs a curriculum learning strategy progressing from easy to difficult:

Data Classification Criteria:

  • Simple: tool count ≤ 1, tool set string length < 1000, required tool calls ≤ 1
  • Medium: 1 < tool count < 4, string length < 2000, tool calls < 4
  • Difficult: tool count ≥ 4, string length > 2000, tool calls ≥ 4
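The bucketing above can be sketched as a small classifier. This is only an illustration: the summary does not say how the three thresholds combine, so the sketch assumes that "simple" requires all three simple thresholds, that any single difficult threshold makes a sample "difficult", and that everything else is "medium". The `Sample` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    tool_count: int       # number of candidate tools offered with the query
    toolset_chars: int    # string length of the serialized tool set
    required_calls: int   # number of tool calls in the reference answer


def difficulty(sample: Sample) -> str:
    """Bucket a sample as 'simple', 'medium', or 'difficult'.

    Assumption (not stated in the summary): 'simple' needs all three simple
    thresholds, 'difficult' is triggered by any one difficult threshold,
    and the remaining samples are 'medium'.
    """
    if (sample.tool_count <= 1 and sample.toolset_chars < 1000
            and sample.required_calls <= 1):
        return "simple"
    if (sample.tool_count >= 4 or sample.toolset_chars > 2000
            or sample.required_calls >= 4):
        return "difficult"
    return "medium"


print(difficulty(Sample(tool_count=1, toolset_chars=500, required_calls=1)))
print(difficulty(Sample(tool_count=2, toolset_chars=1500, required_calls=2)))
print(difficulty(Sample(tool_count=5, toolset_chars=800, required_calls=1)))
```

A different combination rule (for example, requiring all difficult thresholds at once) would shift borderline samples between buckets, so the rule should be checked against the paper before reuse.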

Training Loss:

L_{\text{warm-up}} = \sum_{i=1}^{3} L_i
where L_i = -\mathbb{E}_{(q, y) \sim D_i} [\log P_M(y \mid q, T)] and D_1, D_2, D_3 are the simple, medium, and difficult subsets.
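As a rough numerical illustration of the warm-up loss, the sketch below estimates each stage term as the mean negative log-likelihood over that stage's samples and sums the three terms. The inputs are assumed to be sequence-level log-probabilities log P_M(y | q, T) already computed by some model; the function names and the toy numbers are hypothetical.

```python
import math


def stage_loss(sequence_log_probs):
    """Monte Carlo estimate of L_i: mean negative log-likelihood of a stage."""
    return -sum(sequence_log_probs) / len(sequence_log_probs)


def warmup_loss(stages):
    """Sum the per-stage losses, one term per difficulty bucket."""
    return sum(stage_loss(log_probs) for log_probs in stages)


# Toy log P_M(y | q, T) values for the simple, medium, and difficult subsets.
stages = [
    [math.log(0.9), math.log(0.8)],   # simple
    [math.log(0.5)],                  # medium
    [math.log(0.25)],                 # difficult
]
print(round(warmup_loss(stages), 4))
```

The sum only describes the objective; the order in which the stages are visited (easy to difficult) is a scheduling choice that this sketch does not model.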

2. MCTS-based Iterative Reinforcement Learning

Complex Data Sampling: Uses generation perplexity to measure sample complexity:

h = \sqrt[n]{\frac{1}{P_M(y \mid q, T)}}

where n is the number of tokens in the response y.

Each iteration selects the top 10% most complex data for subsequent processing.
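The selection step can be sketched as follows. The n-th root of 1/P is computed in log space as exp(-(sum of token log-probabilities) / n), which is the same quantity. The sketch assumes per-token log-probabilities are available for each reference response; the function names and the `(sample_id, token_log_probs)` layout are hypothetical.

```python
import math


def perplexity(token_log_probs):
    """h = (1 / P)^(1/n), computed stably from per-token log-probabilities."""
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)


def select_most_complex(samples, fraction=0.10):
    """Return the ids of the highest-perplexity samples.

    samples: list of (sample_id, token_log_probs) pairs.
    """
    ranked = sorted(samples, key=lambda s: perplexity(s[1]), reverse=True)
    k = max(1, math.ceil(len(ranked) * fraction))
    return [sample_id for sample_id, _ in ranked[:k]]


# Ten toy samples; sample 9 has the lowest token probabilities.
samples = [(i, [math.log(0.9 - 0.05 * i)] * 4) for i in range(10)]
print(select_most_complex(samples))  # -> [9]
```

Working in log space avoids underflow when P_M(y | q, T) is the product of many small token probabilities.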

MCTS Step-level Preference Generation:

  • Selection Phase: Employs PUCT algorithm balancing exploration and exploitation
    s_{t+1} = \arg\max_a \left[ Q(s_t, a) + c \cdot p(a \mid s_t) \frac{\sqrt{N(s_t)}}{1 + N(n(s_t, a))} \right]
    
  • Expansion Phase: Integrates new nodes at leaf nodes and evaluates rewards
    R(s_t) = O(s_t) + C(s_t)
    
  • Backpropagation Phase: Updates visit counts and state values bottom-up
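A minimal sketch of the selection and backpropagation phases follows. It scores each child with the PUCT expression above and, as an assumption, keeps Q as a running mean of backed-up rewards; the summary only says that visit counts and state values are updated bottom-up. The `Node` class, the constant `c=1.0`, and the toy tree are hypothetical, and expansion and reward evaluation are left out.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    prior: float                  # p(a | s_t) from the policy model
    visits: int = 0               # N(.)
    value: float = 0.0            # Q(.), assumed to be a running mean of rewards
    children: dict = field(default_factory=dict)   # action -> Node
    parent: "Node | None" = None


def puct_score(parent: Node, child: Node, c: float = 1.0) -> float:
    """Q + c * prior * sqrt(N(parent)) / (1 + N(child))."""
    return child.value + c * child.prior * math.sqrt(parent.visits) / (1 + child.visits)


def select_child(parent: Node, c: float = 1.0):
    """Return the (action, node) pair with the highest PUCT score."""
    return max(parent.children.items(), key=lambda kv: puct_score(parent, kv[1], c))


def backpropagate(node: Node, reward: float) -> None:
    """Walk up to the root, updating visit counts and running-mean values."""
    while node is not None:
        node.visits += 1
        node.value += (reward - node.value) / node.visits
        node = node.parent


root = Node(prior=1.0, visits=4)
root.children = {
    "a": Node(prior=0.5, visits=3, value=0.5, parent=root),   # score 0.75
    "b": Node(prior=0.5, visits=0, value=0.0, parent=root),   # score 1.0
}
action, child = select_child(root)
backpropagate(child, reward=1.0)
print(action, child.visits, root.visits)  # -> b 1 5
```

The unvisited child wins here because the exploration term dominates when N(child) is zero, which is the balance between exploration and exploitation that PUCT is meant to provide.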

Iterative Preference Optimization: Employs SimPO algorithm for preference optimization:

\ell_i(\pi_\theta) = -\mathbb{E}_{(x, y^w, y^l) \sim D_i} \left[ \log \sigma \left( h_{\pi_\theta}^{y^w} - h_{\pi_\theta}^{y^l} - \gamma \right) \right]
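The loss above can be illustrated with plain Python. The sketch assumes that h is the SimPO implicit reward, that is, the length-normalized log-likelihood of a response scaled by a factor β, and that γ is the target margin. The values of `beta` and `gamma` and the toy log-probabilities are illustrative, not the paper's settings.

```python
import math


def simpo_reward(token_log_probs, beta=2.0):
    """Assumed form of h: beta times the average token log-probability."""
    return beta * sum(token_log_probs) / len(token_log_probs)


def simpo_loss(pairs, beta=2.0, gamma=0.5):
    """Mean of -log sigmoid(h_chosen - h_rejected - gamma) over preference pairs.

    pairs: list of (chosen_token_log_probs, rejected_token_log_probs).
    """
    losses = []
    for chosen, rejected in pairs:
        margin = simpo_reward(chosen, beta) - simpo_reward(rejected, beta) - gamma
        # -log sigmoid(m) = log(1 + exp(-m)), written in a numerically stable form
        losses.append(max(-margin, 0.0) + math.log1p(math.exp(-abs(margin))))
    return sum(losses) / len(losses)


# Chosen reward -1.0, rejected reward -1.5, gamma 0.5: the margin is exactly 0.
pair = ([-0.5, -0.5], [-0.75, -0.75])
print(round(simpo_loss([pair]), 4))  # -> 0.6931
```

Because the reward is length-normalized and needs no reference model, the loss depends only on the current policy's token log-probabilities for the chosen and rejected responses.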

Technical Innovations

  1. Fragment-level Error Identification: Through MCTS-generated fine-grained preference pairs, precisely locates erroneous fragments in responses
  2. Dynamic Complexity Calibration: Dynamically selects complex samples based on generation perplexity, improving training efficiency
  3. Iterative Optimization Strategy: Combines curriculum learning and reinforcement learning to progressively enhance model performance in complex scenarios

Experimental Setup

Datasets

  • Training Data: ToolACE dataset containing 100K samples of general tool-use data
  • Evaluation Datasets:
    • Berkeley Function-Calling Leaderboard (BFCL): 4K+ instances including Non-live (simple), Live (complex), Multi-turn, and Hallucination detection
    • API-Bank: 314 tool-use conversations, 753 API calls

Evaluation Metrics

  • Accuracy: Performance accuracy across various subtasks
  • Overall Performance: Weighted average scores across multiple dimensions

Baseline Methods

  • Closed-source Models: GPT-4 series, Gemini series, o1-mini, etc.
  • Open-source Base Models: LLaMA-3.1 series, Qwen2.5 series, etc.
  • Fine-tuned Models: ToolACE-8B, xLAM series, Hammer series, etc.

Implementation Details

  • Base Model: LLaMA3.1-8B-Instruct
  • Training Strategy: LoRA for warm-up phase, QLoRA for reinforcement learning phase
  • Hardware Configuration: 8×32GB V100 GPUs, total training time 28 hours

Experimental Results

Main Results

BFCL Benchmark Results:

  • iTool-8B achieves 63.26% overall accuracy, ranking first
  • Achieves 78.29% in Live (complex scenarios), surpassing GPT-4o-2024-08-06's 75.43%
  • Achieves 23.84% in Multi-turn tasks, significantly outperforming other comparable-scale models

API-Bank Results:

  • L1 task: 78.89% (vs ToolACE-8B's 75.94%)
  • L2 task: 52.87% (vs ToolACE-8B's 47.41%)

Ablation Study

Component Contribution Analysis:

Component        Non-live   Live     Multi-turn
Base Model       81.15      57.93    11.38
+ SFT            +7.8       +17.0    +6.0
+ Warm-up        +7.2       +17.9    +8.3
+ IRL (iTool)    +9.5       +21.2    +12.5

Key Findings:

  • Warm-up training and iterative reinforcement learning contribute 2.3 and 4.2 points respectively; these figures match the Multi-turn column (+8.3 vs. +6.0 for SFT, and +12.5 vs. +8.3 for warm-up)
  • Improvements are most significant in complex scenarios (Live and Multi-turn)
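The step-by-step contributions can be recomputed from the ablation rows. The sketch assumes that each "+" entry is a gain over the base model in the same column, so the contribution of a component is the difference between consecutive rows; the dictionary layout and function name are hypothetical.

```python
# Gains over the base model, copied from the ablation rows above.
gains = {
    "SFT":     {"Non-live": 7.8, "Live": 17.0, "Multi-turn": 6.0},
    "Warm-up": {"Non-live": 7.2, "Live": 17.9, "Multi-turn": 8.3},
    "IRL":     {"Non-live": 9.5, "Live": 21.2, "Multi-turn": 12.5},
}


def step_gain(previous, current, column):
    """Extra points that `current` adds on top of `previous` in one column."""
    return round(gains[current][column] - gains[previous][column], 1)


for column in ("Non-live", "Live", "Multi-turn"):
    print(column,
          step_gain("SFT", "Warm-up", column),
          step_gain("Warm-up", "IRL", column))
# Non-live -0.6 2.3
# Live 0.9 3.3
# Multi-turn 2.3 4.2
```

Under this reading, the largest incremental gains from both components appear in the Multi-turn column, while warm-up is slightly below plain SFT on Non-live.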

Training Gains Analysis

Compared to traditional SFT, iTool exhibits better gain curves as data scale increases:

  • SFT methods plateau after 30% of data
  • iTool maintains steeper improvement curves on Live metrics

Generalization Verification

Performance across different datasets and model architectures:

  • Synthetic datasets (ToolACE, xLAM): +4.42 to +6.49 improvement
  • Non-synthetic datasets (BFCL-half): +2.17 to +3.65 improvement
  • Consistent improvements across model scales from 3B to 8B

Related Work

Tool-Use Research

  • Early Work: Toolformer, ToolAlpaca and others explored LLMs' tool-use potential
  • Tuning-free Methods: Unlocking inherent capabilities through prompt engineering (ReAct, RestGPT)
  • Tuning-based Methods: ToolLLaMA extends tool sets and investigates data scale effects

Reinforcement Learning Methods

  • Traditional Methods: Online RL algorithms like PPO are complex and difficult to optimize
  • Direct Preference Optimization: DPO and variants (SimPO, IPO, ORPO) provide simpler offline algorithms
  • Iterative Training: Continuous improvement through reference model updates and new preference pair generation

Conclusions and Discussion

Main Conclusions

  1. Identified a critical problem in synthetic tool-use data training: diminishing training gains stem primarily from parameter-related fragment errors
  2. Proposed effective solutions: Enhanced data diversity through MCTS and iterative reinforcement learning to correct erroneous fragments
  3. Achieved significant performance improvements: 8B parameter models surpass larger-scale models across multiple benchmarks

Limitations

  1. Computational Resource Requirements: MCTS process demands substantial computational resources (7 hours per iteration on 8 V100 GPUs)
  2. Scale Constraints: Due to resource limitations, validation on larger models (30B or 70B) remains incomplete
  3. Dataset Coverage: In-depth analysis conducted only on a single synthetic dataset

Future Directions

  1. Efficiency Optimization: Develop more efficient preference data generation methods
  2. Scale Extension: Validate method effectiveness on larger-scale models
  3. Data Diversity: Test method generalization across more public datasets

In-depth Evaluation

Strengths

  1. Accurate Problem Identification: Through detailed error type analysis, precisely identifies the root cause of diminishing training gains
  2. Reasonable Method Design: The strategy combining curriculum learning and reinforcement learning aligns with human learning principles
  3. Comprehensive Experiments: Includes thorough ablation studies, generalization verification, and cost-benefit analysis
  4. Significant Results: Achieves consistent and substantial improvements across multiple benchmarks

Weaknesses

  1. High Computational Cost: MCTS process overhead may limit practical applicability
  2. Insufficient Theoretical Analysis: Lacks theoretical explanation for why MCTS effectively resolves fragment error problems
  3. Incomplete Comparisons: Limited comparison with other methods that address diminishing training gains

Impact

  1. Academic Contribution: Provides a novel solution to the problem of diminishing training gains in tool-use training
  2. Practical Value: Achieves significant improvements while maintaining computational feasibility
  3. Reproducibility: Provides detailed implementation details and open-source code

Applicable Scenarios

  • Complex Tool-Use Scenarios: Particularly suitable for tasks requiring multi-tool coordination and complex parameter reasoning
  • Synthetic Data Training: Provides effective solutions for leveraging synthetic data to enhance model capabilities
  • Resource-Rich Research Environments: Requires adequate computational resources to support MCTS processes

References

The paper cites important works in tool-use, reinforcement learning, and preference optimization, including:

  • Toolformer (Schick et al., 2023)
  • DPO (Rafailov et al., 2024)
  • SimPO (Meng et al., 2024)
  • ToolLLaMA (Qin et al., 2023)
  • MCTS-related work (Coulom, 2006; Grill et al., 2020)

Overall Assessment: This is a high-quality research paper that accurately identifies key problems in tool-use training, proposes innovative and effective solutions, and validates method effectiveness through comprehensive experiments. Despite computational cost limitations, it demonstrates significant academic contributions and practical value.