Efficient Compositional Multi-tasking for On-device Large Language Models

Bohdal, Ozay, Moon et al.

Adapter parameters provide a mechanism to modify the behavior of machine learning models and have gained significant popularity in the context of large language models (LLMs) and generative AI. These parameters can be merged to support multiple tasks via a process known as task merging. However, prior work on merging in LLMs, particularly in natural language processing, has been limited to scenarios where each test example addresses only a single task. In this paper, we focus on on-device settings and study the problem of text-based compositional multi-tasking, where each test example involves the simultaneous execution of multiple tasks. For instance, generating a translated summary of a long text requires solving both translation and summarization tasks concurrently. To facilitate research in this setting, we propose a benchmark comprising four practically relevant compositional tasks. We also present an efficient method (Learnable Calibration) tailored for on-device applications, where computational resources are limited, emphasizing the need for solutions that are both resource-efficient and high-performing. Our contributions lay the groundwork for advancing the capabilities of LLMs in real-world multi-tasking scenarios, expanding their applicability to complex, resource-constrained use cases.

Basic Information

Paper ID: 2507.16083
Title: Efficient Compositional Multi-tasking for On-device Large Language Models
Authors: Ondrej Bohdal¹, Mete Ozay¹, Jijoong Moon², Kyeng-Hun Lee², Hyeonmok Ko², Umberto Michieli¹
Institutions: ¹Samsung R&D Institute UK, ²Samsung Research, South Korea
Classification: cs.CL cs.AI cs.LG
Publication Date: October 11, 2025 (arXiv v2)
Paper Link: https://arxiv.org/abs/2507.16083

Abstract

Adapter parameters provide a mechanism for modifying the behavior of machine learning models and have gained widespread attention in large language models (LLMs) and generative AI. Adapters for different tasks can be combined through task merging to support multiple tasks. However, prior work on merging in LLMs, particularly in natural language processing, has been limited to scenarios where each test sample addresses only a single task. This paper focuses on on-device settings and investigates text-based compositional multi-tasking, where each test sample requires executing multiple tasks simultaneously. For example, generating a translated summary of a long text requires solving translation and summarization together. To promote research in this area, we propose a benchmark containing four practical compositional tasks. We also propose an efficient method for on-device applications (Learnable Calibration), where limited computational resources call for solutions that are both resource-efficient and high-performing.

Research Background and Motivation

Problem Definition

Prior work on multi-task LLMs mostly covers single-task scenarios, where each test sample involves only one task (e.g., translation only or summarization only). Practical applications, however, frequently require compositional multi-tasking, where several tasks are carried out in a single inference pass, such as generating a translated summary or a reply in a specific tone.

Importance Analysis

Practical Value: Compositional multi-tasking has widespread demand in real-world scenarios, such as intelligent replies in cross-lingual contexts, summary generation with specific tones, etc.
Efficiency Requirements: On-device LLMs have limited resources and need to complete multiple tasks in a single inference pass, avoiding efficiency losses from multiple inference rounds.
Storage Constraints: Mobile devices have limited storage and cannot hold a separately trained adapter for each compositional task.

Limitations of Existing Methods

Traditional Merging Strategies: Methods such as TIES and DARE perform poorly in compositional multi-task scenarios.
Multi-step Approaches: While effective, they require multiple inference passes, resulting in low efficiency.
Independent Training: Training specialized adapters for each compositional task incurs significant storage overhead.

Core Contributions

First Formulation of Compositional Multi-tasking: Defines the compositional multi-task processing challenge for on-device LLMs.
Practical Benchmark Construction: Develops a comprehensive benchmark with 14 sub-tasks, covering four major categories: summarization+translation, summarization+tone adjustment, reply+translation, and reply+tone adjustment.
Learnable Calibration Method: Designs two variants of efficient solutions that minimize storage and computational overhead while maintaining high performance.
Comprehensive Experimental Validation: Verifies the effectiveness and generalizability of the method across multiple on-device LLMs.

Method Details

Task Definition

Compositional multi-tasking is defined as: $T_C^{[N]}(x) = T_N(\ldots T_2(T_1(x)))$

where input $x$ is processed sequentially through $N$ tasks. This paper primarily investigates the case $N=2$, with:

Primary task $T_1$: Summarization or reply generation
Secondary task $T_2$: Translation or tone adjustment
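
The definition above is ordinary function composition. A minimal sketch, assuming each task is a plain Python callable from string to string; `compose_tasks`, `summarize`, and `to_spanish` are hypothetical toy stand-ins, not the paper's models:

```python
from functools import reduce
from typing import Callable, Sequence

Task = Callable[[str], str]


def compose_tasks(tasks: Sequence[Task]) -> Task:
    """Build T_C(x) = T_N(...T_2(T_1(x))) from tasks given in order [T_1, ..., T_N]."""
    def composed(x: str) -> str:
        # Feed the output of each task into the next one.
        return reduce(lambda acc, task: task(acc), tasks, x)
    return composed


# Hypothetical toy stand-ins for a primary and a secondary task.
def summarize(text: str) -> str:
    return text.split(".")[0] + "."  # keep only the first sentence


def to_spanish(text: str) -> str:
    return "[es] " + text  # placeholder marker instead of a real translation


translated_summary = compose_tasks([summarize, to_spanish])
result = translated_summary("The meeting moved to Friday. Everyone agreed.")
print(result)
```

Running the tasks one after another like this corresponds to the multi-step baseline, which needs one inference pass per task; the method summarized here aims to produce the composed output in a single pass.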

Model Architecture

LoRA Foundation

Based on the LoRA adapter mechanism, the adjusted forward pass is: $h = W_0x + \Delta Wx = W_0x + BAx$

where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d,k)$.
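
The forward pass can be illustrated with a small pure-Python sketch. It follows the formula exactly as written above, so the $\alpha/r$ scaling factor that LoRA implementations commonly apply is left out; the helper names and the toy matrices are illustrative assumptions.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(w0: Matrix, b: Matrix, a: Matrix, x: Vector) -> Vector:
    """Compute h = W0 x + B (A x) without materializing the d x k product BA."""
    base = matvec(w0, x)                 # frozen weights, shape d
    low_rank = matvec(b, matvec(a, x))   # adapter path, through rank r
    return [u + v for u, v in zip(base, low_rank)]


# Toy shapes: d = 2, k = 3, r = 1.
w0 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b = [[1.0], [2.0]]
a = [[0.5, 0.5, 0.0]]
x = [2.0, 4.0, 6.0]
h = lora_forward(w0, b, a, x)
print(h)
```

Applying $A$ first and then $B$ keeps the adapter path at roughly $r(d+k)$ multiplications per input, which is the reason a small rank is cheap.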

Learnable Calibration Method

Core Idea: Starting from linearly merged single-task LoRAs, calibration is performed using a small number of additional parameters.

Initial Merging: $B' = \frac{1}{N}\sum_{i=1}^N B_i, \quad A' = \frac{1}{N}\sum_{i=1}^N A_i$
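
A minimal sketch of this initial merge, assuming each single-task adapter is given as a pair of nested lists `(B_i, A_i)` with matching shapes; the function names and toy values are hypothetical:

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def average_matrices(mats: Sequence[Matrix]) -> Matrix:
    """Element-wise mean of equally shaped matrices."""
    n = len(mats)
    rows, cols = len(mats[0]), len(mats[0][0])
    return [[sum(m[i][j] for m in mats) / n for j in range(cols)]
            for i in range(rows)]


def linear_merge(adapters: Sequence[Tuple[Matrix, Matrix]]) -> Tuple[Matrix, Matrix]:
    """Return (B', A'), the factor-wise averages over all adapters (B_i, A_i)."""
    b_merged = average_matrices([b for b, _ in adapters])
    a_merged = average_matrices([a for _, a in adapters])
    return b_merged, a_merged


# Two toy adapters with d = 2, r = 1, k = 2.
summarization = ([[2.0], [0.0]], [[1.0, 3.0]])
translation = ([[0.0], [4.0]], [[3.0, 1.0]])
b_merged, a_merged = linear_merge([summarization, translation])
print(b_merged, a_merged)
```

Note that averaging the factors is not the same as averaging the updates: in general $B'A' \neq \frac{1}{N}\sum_i B_iA_i$, so the merged update contains cross terms between tasks.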

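Variant 1 - Learnable Calibration: Uses a column-wise bias vector $p \in \mathbb{R}^d$ for calibration: $\Delta W^c = p \oplus B'A' = [\,p + \Delta W'_i\,]_{i=1}^{k}$, where $\Delta W' = B'A' \in \mathbb{R}^{d \times k}$, $\Delta W'_i$ is its $i$-th column, and $\oplus$ adds $p$ to every column.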
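
The sketch below reads $\oplus$ as adding the bias vector $p$ to every column of the merged update $B'A'$. That reading is an interpretation of "column-wise bias", not a confirmed detail of the paper's implementation, and the function names and toy values are hypothetical.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matmul(x: Matrix, y: Matrix) -> Matrix:
    """Plain matrix product of x (d x r) and y (r x k)."""
    inner, cols = len(y), len(y[0])
    return [[sum(row[t] * y[t][j] for t in range(inner)) for j in range(cols)]
            for row in x]


def calibrate_with_bias(p: Vector, b_merged: Matrix, a_merged: Matrix) -> Matrix:
    """Assumed reading of p (+) B'A': entry (i, j) becomes p[i] + (B'A')[i][j]."""
    delta_w = matmul(b_merged, a_merged)  # merged update, d x k
    return [[p[i] + w for w in row] for i, row in enumerate(delta_w)]


# Toy shapes: d = 2, r = 1, k = 3, so only d = 2 values are trainable here.
b_merged = [[1.0], [2.0]]
a_merged = [[2.0, 2.0, 0.0]]
p = [0.5, -1.0]
delta_w_c = calibrate_with_bias(p, b_merged, a_merged)
print(delta_w_c)
```

Under this reading, a zero vector $p$ reproduces the linearly merged update unchanged, and only $d$ values per adapted layer need to be trained and stored.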

Variant 2 - Learnable Calibration++: Introduces a calibration LoRA matrix $P_2P_1$ that is added to the merged update: $\Delta W^c = P_2P_1 + \Delta W'$, where $\Delta W' = B'A'$.
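
A minimal sketch of this variant, assuming $P_2 \in \mathbb{R}^{d \times r_c}$ and $P_1 \in \mathbb{R}^{r_c \times k}$ for some small calibration rank $r_c$ (the symbol $r_c$, the helper names, and the toy values are assumptions made for illustration):

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, y: Matrix) -> Matrix:
    """Plain matrix product."""
    inner, cols = len(y), len(y[0])
    return [[sum(row[t] * y[t][j] for t in range(inner)) for j in range(cols)]
            for row in x]


def add(x: Matrix, y: Matrix) -> Matrix:
    """Element-wise sum of two equally shaped matrices."""
    return [[u + v for u, v in zip(rx, ry)] for rx, ry in zip(x, y)]


def calibrate_plus_plus(p2: Matrix, p1: Matrix,
                        b_merged: Matrix, a_merged: Matrix) -> Matrix:
    """Delta W^c = P2 P1 + B'A': a low-rank correction on top of the merged update."""
    return add(matmul(p2, p1), matmul(b_merged, a_merged))


def calibration_param_count(d: int, k: int, r_c: int) -> int:
    """Trainable values in P2 (d x r_c) and P1 (r_c x k) for one layer."""
    return r_c * (d + k)


# Toy shapes: d = 2, k = 3, merged rank 1, calibration rank r_c = 1.
b_merged = [[1.0], [2.0]]
a_merged = [[2.0, 2.0, 0.0]]
p2 = [[1.0], [0.0]]
p1 = [[0.0, 0.0, 1.0]]
delta_w_c = calibrate_plus_plus(p2, p1, b_merged, a_merged)
print(delta_w_c, calibration_param_count(d=2, k=3, r_c=1))
```

Compared with a single bias vector, the low-rank correction can change each entry of the update independently, at the cost of $r_c(d+k)$ instead of $d$ trainable values per layer.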

Technical Innovations

Lightweight Calibration: Requires only 0.08-0.56% additional parameters with storage overhead less than 0.5MB.
Task Specificity: Learns specialized calibration parameters for different compositional tasks.
Strong Compatibility: Compatible with existing frameworks (Android AI Core, Apple Intelligence).
Parameter Sharing: Supports cross-task parameter sharing to further reduce storage requirements.

Experimental Setup

Datasets

Benchmark Dataset Construction:

Summarization Task: DialogSum dataset (12,460/500/1,500 train/validation/test)
Reply Task: Synthetic Persona Chat dataset (225,061/1,000/1,000)
Translation Task: TED Talks dataset, English to Spanish/French/German
Tone Adjustment: Sound Natural dataset, four tones (professional/casual/humorous/narrative)

Compositional Task Generation:

Uses OpusMT model for translation
Uses RedPajama-INCITE-Base 3B model for tone adjustment

Evaluation Metrics

Summarization Tasks: ROUGE-L (R-L)
Reply Tasks: Weighted ROUGE (W-R) = $\frac{\text{ROUGE-1}}{6} + \frac{\text{ROUGE-2}}{3} + \frac{\text{ROUGE-3}}{2}$
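
The weighted ROUGE combination can be sketched as a one-line function. It assumes the three ROUGE-n scores have already been computed by some ROUGE implementation and are passed in as floats in [0, 1]; the function name and example scores are hypothetical.

```python
def weighted_rouge(rouge_1: float, rouge_2: float, rouge_3: float) -> float:
    """W-R = ROUGE-1 / 6 + ROUGE-2 / 3 + ROUGE-3 / 2, as defined above."""
    return rouge_1 / 6 + rouge_2 / 3 + rouge_3 / 2


# Made-up example scores, not taken from the paper.
score = weighted_rouge(rouge_1=0.6, rouge_2=0.3, rouge_3=0.2)
print(round(score, 4))
```

The weights 1/6, 1/3, and 1/2 sum to 1, so the result stays in [0, 1], and longer n-gram matches count more heavily than unigram overlap.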
LLM Judge: Binary evaluation using Llama 3.1 70B

Comparison Methods

Baseline Methods:

Zero-shot, primary task LoRA, secondary task LoRA
In-context learning, multi-step LoRA usage
Various merging strategies: Linear, TIES, DARE, Slerp, LoraHub, etc.

Reference Methods:

Multi-step LoRA usage (low efficiency but good performance)
Joint expert LoRA (specialized training for each compositional task)

Implementation Details

Models: LLaMA 3.2 1B, Qwen2.5 1.5B, StableLM2 1.6B
LoRA Configuration: rank=32, α=16, dropout=0.05
Training: Adam optimizer, learning rate 5×10⁻⁵ (LoRA), 5×10⁻⁴ (calibration parameters)
Calibration Training: Randomly selected 10,000 compositional task samples

Experimental Results

Main Results

Method Category	Sum.+Trans.	Sum.+Tone	Reply+Trans.	Reply+Tone	Efficiency
Efficient Baselines
Zero-shot	0.44%	6.52%	4.11%	33.66%	✓
Primary Task LoRA	3.49%	4.18%	7.17%	36.25%	✓
Linear merge	0.33%	2.74%	12.81%	41.93%	✓
TIES merge	0.81%	6.06%	8.30%	47.87%	✓
Inefficient Baselines
Multi-step LoRA	72.92%	34.32%	69.83%	45.78%	✗
Joint Expert LoRA	49.85%	16.14%	65.73%	47.06%	✗
Proposed Method
Learnable Calibration	59.23%	28.89%	57.46%	44.99%	✓
Learnable Calibration++	65.15%	34.34%	63.81%	45.40%	✓

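Values are LLM Judge scores (%). In the Efficiency column, ✓ marks methods that need a single inference pass and little extra storage; ✗ marks methods that need multiple inference passes (Multi-step LoRA) or a separately stored adapter per compositional task (Joint Expert LoRA).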
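
To make the comparison easier to read, the short script below recomputes two convenience figures from the table: the unweighted mean over the four task categories and the per-category gap between Learnable Calibration++ and Multi-step LoRA. Neither figure is reported in the source; they are plain arithmetic on the values listed above.

```python
# LLM Judge scores (%) copied from the table, in the order:
# Sum.+Trans., Sum.+Tone, Reply+Trans., Reply+Tone
scores = {
    "Linear merge": [0.33, 2.74, 12.81, 41.93],
    "TIES merge": [0.81, 6.06, 8.30, 47.87],
    "Multi-step LoRA": [72.92, 34.32, 69.83, 45.78],
    "Joint Expert LoRA": [49.85, 16.14, 65.73, 47.06],
    "Learnable Calibration": [59.23, 28.89, 57.46, 44.99],
    "Learnable Calibration++": [65.15, 34.34, 63.81, 45.40],
}

# Unweighted mean across the four categories (a convenience figure only).
means = {name: sum(vals) / len(vals) for name, vals in scores.items()}

# Per-category difference: Learnable Calibration++ minus Multi-step LoRA.
gaps = [round(a - b, 2) for a, b in zip(scores["Learnable Calibration++"],
                                        scores["Multi-step LoRA"])]

for name, mean in means.items():
    print(f"{name}: {mean:.2f}")
print("LC++ minus multi-step:", gaps)
```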

Key Findings

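Traditional Merging Strategies Fail: Existing merging methods perform poorly in compositional multi-task scenarios. Linear and TIES merging score below 13% on three of the four task categories; only on Reply+Tone, where zero-shot already reaches 33.66%, are they competitive.
Efficiency-Performance Trade-off: Under the single-inference constraint, Learnable Calibration++ comes within about 8 points of the multi-step baseline on the translation compositions and matches it on the tone compositions (34.34% vs. 34.32% and 45.40% vs. 45.78%).
Consistent Performance: Learnable Calibration++ outperforms Learnable Calibration in every category and is the best efficient method listed in three of the four; on Reply+Tone, TIES merging scores higher (47.87% vs. 45.40%).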

Ablation Studies

Storage Efficiency Analysis:

Multi-step LoRA: 0 additional parameters, but requires 2 inference passes
Joint Expert LoRA: 30M parameters, 57.10MB storage
Learnable Calibration: 23K parameters, 0.05MB storage
Learnable Calibration++: 166K parameters, 0.32MB storage
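
The storage figures can be roughly reproduced from the parameter counts. The sketch assumes 2 bytes per parameter (16-bit weights), megabytes counted as 2^20 bytes, and overhead measured relative to the joint expert LoRA; none of these is stated in the source, and the listed counts are rounded, so the results only approximately match the reported sizes.

```python
BYTES_PER_PARAM = 2   # assumption: 16-bit storage
BYTES_PER_MIB = 2 ** 20  # assumption: "MB" means 2^20 bytes


def storage_mib(num_params: int) -> float:
    """Approximate storage in MiB under the assumptions above."""
    return num_params * BYTES_PER_PARAM / BYTES_PER_MIB


def relative_overhead_percent(extra_params: int, reference_params: int) -> float:
    """Extra parameters as a percentage of a reference adapter's parameters."""
    return 100.0 * extra_params / reference_params


# Rounded parameter counts from the list above.
joint_expert = 30_000_000
calibration = 23_000
calibration_pp = 166_000

for name, n in [("Joint Expert LoRA", joint_expert),
                ("Learnable Calibration", calibration),
                ("Learnable Calibration++", calibration_pp)]:
    print(f"{name}: {storage_mib(n):.2f} MiB, "
          f"{relative_overhead_percent(n, joint_expert):.2f}% of joint expert")
```

With these assumptions the joint expert comes out near 57 MiB and the two calibration variants near 0.04 MiB and 0.32 MiB, in line with the reported 57.10MB, 0.05MB, and 0.32MB.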

Pre-trained Adapter Contribution: Removing pre-trained LoRAs results in slight performance degradation but still outperforms most baselines, demonstrating the value of leveraging existing adapters.

Extended Analysis

Model Scale Adaptability: Performs well on models with 0.5B-3B parameters.
Out-of-Domain Generalization: Maintains stable performance across different dialogue datasets.
Three-task Extension: Supports three-way compositional tasks (summarization+tone+translation).

Related Work

Parameter-Efficient Fine-Tuning (PEFT)

LoRA and Variants: DoRA, AdaLoRA, Delta-LoRA and other extension methods
Other PEFT Methods: BitFit and other bias parameter training methods

Model Merging

Early Work: Model Soup and other linear merging methods
Advanced Techniques: TIES, DARE, Slerp and other conflict resolution strategies
Adaptive Methods: LoraHub, LM-Cocktail, DAM and other learning-based merging approaches

On-device LLMs

Compression Techniques: Model quantization, knowledge distillation, etc.
Representative Models: LLaMA 3.2, Qwen2.5, StableLM2 and other 1-3B parameter models
Deployment Challenges: Storage limitations, computational constraints, privacy requirements

Conclusions and Discussion

Main Conclusions

Problem Importance: Compositional multi-tasking is an important requirement for on-device LLMs, and traditional methods cannot effectively address it.
Method Effectiveness: Learnable Calibration achieves performance comparable to inefficient baselines while maintaining efficiency.
Practical Value: Minimal storage overhead (<0.5MB) makes the method suitable for practical deployment.

Limitations

Evaluation Scope: Primarily focuses on 1-3B parameter on-device models; validation on larger models is lacking.
Task Quantity: Primarily investigates combinations of 2-3 tasks; scalability to more tasks requires further verification.
Data Dependency: Requires compositional task data for training calibration parameters, unlike completely data-free merging methods.

Future Directions

Safety Research: Explore the impact of compositional multi-tasking on model safety mechanisms.
Scalability Optimization: Investigate methods for handling more task combinations.
Zero-shot Merging: Develop compositional multi-task methods without requiring additional data.

In-depth Evaluation

Strengths

Problem Innovation: First systematic investigation of compositional multi-tasking, filling an important research gap.
Method Practicality: Minimal storage and computational overhead suitable for practical deployment.
Experimental Comprehensiveness: Thorough baseline comparisons, ablation studies, and extended analyses.
Benchmark Contribution: The constructed 14 sub-task benchmark provides a standard evaluation platform for subsequent research.

Weaknesses

Insufficient Theoretical Analysis: Lacks in-depth theoretical explanation for why calibration parameters are effective.
Limited Task Selection: Primarily focuses on NLP tasks; applicability to other modalities is unknown.
Limited Evaluation Metrics: Relies on automatic metrics (ROUGE and LLM Judge); lacks human evaluation.

Impact

Academic Value: Opens a new research direction with expected follow-up work.
Industrial Application: Directly applicable to AI application development on mobile devices.
Reproducibility: Provides detailed implementation details and benchmark data.

Applicable Scenarios

Mobile Applications: Smartphones, tablets, and other resource-constrained devices.
Edge Computing: IoT devices, embedded systems.
Privacy-sensitive Scenarios: Applications requiring local processing to avoid data uploads.

References

The paper cites extensive related work, primarily including:

Hu et al. (2022): Original LoRA paper
Wortsman et al. (2022): Model Soup merging method
Yadav et al. (2024): TIES merging strategy
Gunter et al. (2024): Apple Intelligence on-device deployment experience

Overall Assessment: This is a high-quality research paper that addresses a practically important problem, proposes an effective solution, and conducts comprehensive experimental validation. This work provides new insights for multi-task processing in on-device LLMs and possesses significant academic and practical value.