2025-11-20T14:40:15.388685

Efficient Compositional Multi-tasking for On-device Large Language Models

Bohdal, Ozay, Moon et al.

Adapter parameters provide a mechanism to modify the behavior of machine learning models and have gained significant popularity in the context of large language models (LLMs) and generative AI. These parameters can be merged to support multiple tasks via a process known as task merging. However, prior work on merging in LLMs, particularly in natural language processing, has been limited to scenarios where each test example addresses only a single task. In this paper, we focus on on-device settings and study the problem of text-based compositional multi-tasking, where each test example involves the simultaneous execution of multiple tasks. For instance, generating a translated summary of a long text requires solving both translation and summarization tasks concurrently. To facilitate research in this setting, we propose a benchmark comprising four practically relevant compositional tasks. We also present an efficient method (Learnable Calibration) tailored for on-device applications, where computational resources are limited, emphasizing the need for solutions that are both resource-efficient and high-performing. Our contributions lay the groundwork for advancing the capabilities of LLMs in real-world multi-tasking scenarios, expanding their applicability to complex, resource-constrained use cases.



Basic Information

Paper ID: 2507.16083
Title: Efficient Compositional Multi-tasking for On-device Large Language Models
Authors: Ondrej Bohdal¹, Mete Ozay¹, Jijoong Moon², Kyeng-Hun Lee², Hyeonmok Ko², Umberto Michieli¹
Affiliations: ¹Samsung R&D Institute UK, ²Samsung Research, South Korea
Categories: cs.CL cs.AI cs.LG
Published: October 11, 2025 (arXiv v2)
Paper link: https://arxiv.org/abs/2507.16083

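Abstract

Adapter parameters provide a mechanism for modifying the behavior of machine learning models and have attracted wide attention in large language models (LLMs) and generative AI. These parameters can be merged to support multiple tasks through task merging. However, prior work on merging in LLMs, particularly in natural language processing, has been limited to scenarios where each test example addresses only a single task. This paper focuses on on-device settings and studies text-based compositional multi-tasking, where each test example requires several tasks to be executed simultaneously. For example, generating a translated summary of a long text requires solving translation and summarization together. To support research in this area, the authors propose a benchmark of four practically relevant compositional tasks. They also present an efficient method (Learnable Calibration) tailored for on-device applications, where compute is limited and solutions must be both resource-efficient and high-performing.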

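Practical value: Compositional multi-tasking is widely needed in practice, for example smart reply in cross-lingual settings or summarization in a specific tone.
Efficiency requirement: On-device LLMs are resource-constrained, so multiple tasks should be completed in a single inference pass to avoid the cost of repeated inference.
Storage constraint: Mobile devices have limited storage, so a separate adapter cannot be trained and stored for every compositional task.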

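Limitations of Existing Methods

Conventional merging strategies: Methods such as TIES and DARE perform poorly in the compositional multi-task setting.
Multi-step approaches: Effective, but they require multiple inference passes and are therefore inefficient.
Independent training: Training a dedicated adapter for each compositional task incurs a large storage overhead.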

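Core Contributions

First formulation of compositional multi-tasking: Defines the challenge of compositional multi-task processing for on-device LLMs.
Practical benchmark: Builds a comprehensive benchmark of 14 subtasks spanning four categories: summarization + translation, summarization + tone adjustment, reply + translation, and reply + tone adjustment.
Learnable Calibration method: Designs an efficient solution in two variants that minimizes storage and compute overhead while maintaining high performance.
Comprehensive experimental validation: Verifies the effectiveness and generality of the method on several on-device LLMs.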

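Method Details

Task Definition

A compositional multi-task is defined as: $T_C^{[N]}(x) = T_N(\ldots T_2(T_1(x)))$

where the input $x$ is processed by $N$ tasks in sequence. The paper mainly studies the case $N=2$, with:

Main task $T_1$: summarization or reply generation
Auxiliary task $T_2$: translation or tone adjustment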
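The definition above can be read as plain function composition. The following is a minimal sketch of that reading, not the paper's implementation: `toy_summarize` and `toy_translate` are hypothetical stand-ins for real models, and each call to one of them would correspond to one inference pass in the multi-step baseline.

```python
from functools import reduce
from typing import Callable, Sequence

Task = Callable[[str], str]


def compose(tasks: Sequence[Task]) -> Task:
    """Build T_N(...T_2(T_1(x))): tasks are applied in list order."""
    def composed(x: str) -> str:
        return reduce(lambda value, task: task(value), tasks, x)
    return composed


# Hypothetical stand-ins for real models (assumptions for illustration only).
def toy_summarize(text: str) -> str:
    # Main task T_1: keep only the first sentence.
    return text.split(".")[0].strip() + "."


def toy_translate(text: str) -> str:
    # Auxiliary task T_2: tag the text instead of actually translating it.
    return f"[es] {text}"


summarize_then_translate = compose([toy_summarize, toy_translate])
result = summarize_then_translate("The meeting moved to Friday. Lunch is provided.")
```

With $N=2$ this sequential form needs two model calls per example, which is the inefficiency the single-pass approach is meant to avoid.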

Variant 1 - Learnable Calibration: calibrates with a vector of column-wise biases $p \in \mathbb{R}^d$: $\Delta W^c = p \oplus B'A' = \sum_{i=1}^d p_i \Delta W'_i$

Variant 2 - Learnable Calibration++: introduces calibration LoRA matrices $P_2P_1$: $\Delta W^c = P_2P_1 + \Delta W'$
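As a rough illustration of the Learnable Calibration++ update $\Delta W^c = P_2P_1 + \Delta W'$, the sketch below uses plain Python lists. It assumes that $\Delta W' = B'A'$ is the update obtained from the already-merged single-task LoRAs; the toy shapes, values, and helper names (`matmul`, `add`, `calibrated_update`) are illustrative assumptions, not taken from the paper.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two dense matrices stored as lists of rows."""
    inner = len(b)
    assert all(len(row) == inner for row in a), "inner dimensions must match"
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(len(b[0]))]
        for row in a
    ]


def add(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise sum of two matrices with identical shapes."""
    assert len(a) == len(b) and all(len(ra) == len(rb) for ra, rb in zip(a, b))
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def calibrated_update(p2: Matrix, p1: Matrix, delta_w_prime: Matrix) -> Matrix:
    """Delta W^c = P_2 P_1 + Delta W' (Learnable Calibration++ form)."""
    return add(matmul(p2, p1), delta_w_prime)


# Toy shapes (assumed): 2x3 weight update, merged rank 2, calibration rank 1.
b_prime = [[1.0, 0.0], [0.0, 1.0]]
a_prime = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
delta_w_prime = matmul(b_prime, a_prime)

p2 = [[0.5], [-1.0]]     # trainable calibration factor, 2x1
p1 = [[1.0, 0.0, 2.0]]   # trainable calibration factor, 1x3
delta_w_c = calibrated_update(p2, p1, delta_w_prime)
```

In this form only `p2` and `p1` would be trained and stored per compositional task, while the merged update is reused. A calibration factor of all zeros leaves the merged update unchanged, which is a convenient sanity check.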

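Technical Innovations

Lightweight calibration: Requires only 0.08-0.56% additional parameters, with a storage overhead below 0.5 MB.
Task specificity: Learns dedicated calibration parameters for each compositional task.
Broad compatibility: Compatible with existing frameworks (Android AI Core, Apple Intelligence).
Parameter sharing: Supports sharing parameters across tasks to reduce storage requirements further.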

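Experimental Setup

Datasets

Benchmark dataset construction:

Summarization: DialogSum dataset (12,460/500/1,500 train/validation/test)
Reply: Synthetic Persona Chat dataset (225,061/1,000/1,000)
Translation: TED Talks dataset, English to Spanish/French/German
Tone adjustment: Sound Natural dataset, four tones (professional/casual/humorous/paraphrase)

Compositional task generation:

Translation is performed with OpusMT models
Tone adjustment is performed with the RedPajama-INCITE-Base 3B model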

Evaluation Metrics

Summarization tasks: ROUGE-L (R-L)
Reply tasks: weighted ROUGE (W-R) = $\frac{\text{ROUGE-1}}{6} + \frac{\text{ROUGE-2}}{3} + \frac{\text{ROUGE-3}}{2}$
LLM judge: binary evaluation with Llama 3.1 70B
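The weighted ROUGE formula can be sketched as follows. This is a simplified illustration: it assumes whitespace tokenization, lowercasing, and an F1-style n-gram overlap score, none of which the summary specifies, so the numbers will not necessarily match an official ROUGE package. The helper names are hypothetical.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of a token list (empty if the list is shorter than n)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(candidate: str, reference: str, n: int) -> float:
    """Simplified ROUGE-N: F1 over clipped n-gram overlap (an assumption)."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # clipped counts of shared n-grams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def weighted_rouge(candidate: str, reference: str) -> float:
    """W-R = ROUGE-1 / 6 + ROUGE-2 / 3 + ROUGE-3 / 2."""
    return (
        rouge_n_f1(candidate, reference, 1) / 6
        + rouge_n_f1(candidate, reference, 2) / 3
        + rouge_n_f1(candidate, reference, 3) / 2
    )


score = weighted_rouge("the cat sat", "the cat ran")
```

The three weights sum to 1, so a perfect match scores 1, and longer n-gram matches count more heavily than single-word overlap.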

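Compared Methods

Baselines:

Zero-shot, main-task LoRA, auxiliary-task LoRA
In-context learning, multi-step LoRA usage
Merging strategies: Linear, TIES, DARE, Slerp, LoraHub, and others

Reference methods:

Multi-step LoRA usage (inefficient but high-performing)
Joint-expert LoRA (trained specifically for each compositional task)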

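Implementation Details

Models: LLaMA 3.2 1B, Qwen2.5 1.5B, StableLM2 1.6B
LoRA configuration: rank=32, α=16, dropout=0.05
Training: Adam optimizer, learning rate 5×10⁻⁵ (LoRA), 5×10⁻⁴ (calibration parameters)
Calibration training: 10,000 randomly selected compositional-task samples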
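The hyperparameters listed above can be collected in a small configuration object. This is only a sketch: the class and field names are hypothetical, and the `scaling` property assumes the common LoRA convention of multiplying the low-rank update by α / rank, which the summary does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoRAHyperparams:
    """Single-task LoRA settings as listed in the summary (names are hypothetical)."""
    rank: int = 32
    alpha: float = 16.0
    dropout: float = 0.05
    learning_rate: float = 5e-5

    @property
    def scaling(self) -> float:
        # Assumed convention: the low-rank update is scaled by alpha / rank.
        return self.alpha / self.rank


@dataclass(frozen=True)
class CalibrationHyperparams:
    """Settings for training the calibration parameters (names are hypothetical)."""
    learning_rate: float = 5e-4
    num_train_samples: int = 10_000


lora = LoRAHyperparams()
calibration = CalibrationHyperparams()
lr_ratio = calibration.learning_rate / lora.learning_rate
```

Under that convention the scaling factor would be 0.5, and the calibration parameters are trained with a learning rate ten times larger than the one used for the LoRAs.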

Results

Main Results

Method	Summary + translation	Summary + tone	Reply + translation	Reply + tone	Efficient
Efficient baselines
Zero-shot	0.44%	6.52%	4.11%	33.66%	✓
Main-task LoRA	3.49%	4.18%	7.17%	36.25%	✓
Linear merge	0.33%	2.74%	12.81%	41.93%	✓
TIES merge	0.81%	6.06%	8.30%	47.87%	✓
Inefficient baselines
Multi-step LoRA	72.92%	34.32%	69.83%	45.78%	✗
Joint-expert LoRA	49.85%	16.14%	65.73%	47.06%	✗
Proposed methods
Learnable Calibration	59.23%	28.89%	57.46%	44.99%	✓
Learnable Calibration++	65.15%	34.34%	63.81%	45.40%	✓
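To compare rows at a glance, the snippet below computes a simple unweighted mean over the four task categories for a few rows of the table. The mean is an illustrative aggregate that the summary itself does not report, and the dictionary layout is an assumption made for this example.

```python
from typing import Dict, List

# Scores (%) copied from the table, in column order:
# summary + translation, summary + tone, reply + translation, reply + tone.
results: Dict[str, List[float]] = {
    "TIES merge": [0.81, 6.06, 8.30, 47.87],
    "Multi-step LoRA": [72.92, 34.32, 69.83, 45.78],
    "Joint-expert LoRA": [49.85, 16.14, 65.73, 47.06],
    "Learnable Calibration": [59.23, 28.89, 57.46, 44.99],
    "Learnable Calibration++": [65.15, 34.34, 63.81, 45.40],
}


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


averages = {name: mean(scores) for name, scores in results.items()}

# Rank methods from highest to lowest unweighted mean.
ranking = sorted(averages, key=averages.get, reverse=True)
```

On these rows, Learnable Calibration++ averages roughly 52.2% against roughly 55.7% for multi-step LoRA, while being marked as efficient; the merging baseline is far behind on the translation-based tasks.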