Does Model Size Matter? A Comparison of Small and Large Language Models for Requirements Classification

Zadenoori, De Martino, Dabrowski et al.

[Context and motivation] Large language models (LLMs) show notable results in natural language processing (NLP) tasks for requirements engineering (RE). However, their use is compromised by high computational cost, data sharing risks, and dependence on external services. In contrast, small language models (SLMs) offer a lightweight, locally deployable alternative. [Question/problem] It remains unclear how well SLMs perform compared to LLMs in RE tasks in terms of accuracy. [Results] Our preliminary study compares eight models, three LLMs and five SLMs, on requirements classification tasks using the PROMISE, PROMISE Reclass, and SecReq datasets. Our results show that although LLMs achieve an average F1 score 2% higher than SLMs, this difference is not statistically significant. SLMs almost reach LLM performance across all datasets and even outperform LLMs in recall on the PROMISE Reclass dataset, despite being up to 300 times smaller. We also found that dataset characteristics play a more significant role in performance than model size. [Contribution] Our study contributes evidence that SLMs are a valid alternative to LLMs for requirements classification, offering advantages in privacy, cost, and local deployability.

Basic Information

Paper ID: 2510.21443
Title: Does Model Size Matter? A Comparison of Small and Large Language Models for Requirements Classification
Authors: Mohammad Amin Zadenoori, Vincenzo De Martino, Jacek Dąbrowski, Xavier Franch, Alessio Ferrari
Categories: cs.SE (Software Engineering), cs.AI (Artificial Intelligence), cs.CL (Computational Linguistics)
Published: October 24, 2025 (arXiv preprint)
Paper link: https://arxiv.org/abs/2510.21443

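Summary

This study compares the performance of large language models (LLMs) and small language models (SLMs) on requirements classification tasks in requirements engineering. Although LLMs perform well on natural language processing tasks, they come with high computational cost, data-sharing risks, and dependence on external services. SLMs offer a lightweight, locally deployable alternative. Using the PROMISE, PROMISE Reclass, and SecReq datasets, the study compares three LLMs with five SLMs. The results show that although the LLMs' average F1 score is 2% higher than that of the SLMs, the difference is not statistically significant. The SLMs nearly match LLM performance and even surpass the LLMs in recall on the PROMISE Reclass dataset, despite being up to 300 times smaller. The study also finds that dataset characteristics affect performance more than model size does.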

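Need for automation: large projects contain a great many requirements, and automated classification can substantially improve efficiency
Support for other RE activities: requirements classification supports other RE activities such as requirements management and traceability
Practical demand: industry urgently needs solutions that are both accurate and practical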

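Limitations of Existing Approaches

Problems with LLMs:

High computational cost
Data privacy and security risks (cloud deployment)
Dependence on external services
Proprietary nature limits customization
Reproducibility issues

Research gap:

The performance of SLMs versus LLMs on RE tasks has not been studied systematically
The relationship between model size and classification accuracy is not well understood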

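Core Contributions

First systematic comparison: the first systematic comparison of SLM and LLM performance on requirements classification
Statistical significance analysis: uses statistical methods such as the Scheirer-Ray-Hare test to check whether performance differences are significant
Multi-dataset validation: a comprehensive evaluation on three public datasets (PROMISE, PROMISE Reclass, SecReq)
Evidence of practicality: empirical evidence that SLMs are a viable alternative to LLMs
Dataset effect analysis: shows that dataset characteristics influence performance more than model size does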

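Method Details

Task Definition

Input: natural-language requirement text
Output: requirement class label (binary classification)

PROMISE: functional requirements (FR) vs non-functional requirements (NFR)
PROMISE Reclass: FR vs NFR and quality requirements (QR) vs non-QR (dual labels)
SecReq: security-related requirements vs non-security requirements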

Model Selection

SLMs (7-8B parameters):

Qwen2-7B-Instruct
Falcon-7B-Instruct
Granite-3.2-8B-Instruct
Ministral-8B-Instruct-2410
Meta-Llama-3-8B-Instruct

LLMs (1-2 trillion parameters):

GPT-5
xAI Grok-4
Claude-4
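Technical Approach

Prompting strategy:

Chain-of-Thought (CoT) prompting combined with few-shot learning
Four examples provided per class
Class definitions supplied, based on expert-defined RE definitions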
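
The prompting strategy above could be assembled roughly as follows. This is a minimal sketch, not the authors' prompt: the `Example` class, the `build_prompt` function, the instruction wording, and the prompt layout are all assumptions made for illustration. Only the use of class definitions, the chain-of-thought style, and the count of four examples per class come from the summary.

```python
from collections import Counter
from dataclasses import dataclass

EXAMPLES_PER_CLASS = 4  # from the summary: four examples per class


@dataclass
class Example:
    text: str       # requirement sentence (hypothetical field name)
    reasoning: str  # short chain-of-thought rationale
    label: str      # gold class label, e.g. "FR" or "NFR"


def build_prompt(definitions: dict[str, str],
                 examples: list[Example],
                 requirement: str) -> str:
    """Assemble a few-shot chain-of-thought prompt for one requirement."""
    # Check that every class has exactly the expected number of examples.
    counts = Counter(ex.label for ex in examples)
    for label in definitions:
        if counts[label] != EXAMPLES_PER_CLASS:
            raise ValueError(f"class {label!r} needs {EXAMPLES_PER_CLASS} examples")

    # Instruction wording below is illustrative, not taken from the paper.
    lines = ["Classify the requirement into one of the classes defined below.", ""]
    lines.append("Class definitions:")
    for label, definition in definitions.items():
        lines.append(f"- {label}: {definition}")
    lines.append("")

    # Each worked example shows the reasoning before the label (CoT style).
    for ex in examples:
        lines += [f"Requirement: {ex.text}",
                  f"Reasoning: {ex.reasoning}",
                  f"Label: {ex.label}",
                  ""]

    # The prompt ends at "Reasoning:" so the model reasons first, then labels.
    lines += [f"Requirement: {requirement}", "Reasoning:"]
    return "\n".join(lines)
```

Ending the prompt at "Reasoning:" is one common way to elicit a rationale before the label; the paper may structure its prompt differently.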

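Experimental settings:

Temperature set to 0 to make outputs deterministic
Each task is run 3 times, and the final label is decided by majority vote (2 of 3)
Metrics are computed with macro-averaging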
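
The majority-vote step can be sketched in a few lines. The function name `majority_label` and the error handling for runs without a clear winner are assumptions; the summary only states that three runs are made and the label with at least two votes is kept.

```python
from collections import Counter


def majority_label(run_labels: list[str]) -> str:
    """Return the label chosen by a strict majority of runs.

    Raises ValueError when no label wins more than half of the runs,
    for example when a run produced an unparseable answer.
    """
    if not run_labels:
        raise ValueError("at least one run is required")
    label, count = Counter(run_labels).most_common(1)[0]
    if count * 2 <= len(run_labels):
        raise ValueError(f"no strict majority in {run_labels}")
    return label


# Three runs on the same (hypothetical) requirement: two votes for "NFR" win.
final = majority_label(["NFR", "FR", "NFR"])
```

With binary labels and three runs a strict majority always exists, so the error branch only matters if a run returns something outside the two expected classes.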

Experimental Setup

Dataset Details

Dataset	Task type	Samples	Class distribution
PROMISE	FR vs NFR	625	FR:255, NFR:370
PROMISE Reclass	FR vs NFR & QR vs Non-QR	625	FR:310, QR:382
SecReq	Security vs Non-Security	510	Sec:187, NSec:323

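Evaluation Metrics

Precision (P): the share of predicted positives that are correct
Recall (R): the share of actual positives that are correctly predicted
F1 score: the harmonic mean of precision and recall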
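
As a worked illustration of these metrics with macro-averaging, here is a minimal sketch. The function names and the toy labels are hypothetical, and macro F1 is taken here as the mean of the per-class F1 scores, which is one common convention; the paper may compute it differently.

```python
def per_class_prf(gold: list[str], pred: list[str], positive: str) -> tuple[float, float, float]:
    """Precision, recall, and F1 when `positive` is treated as the positive class."""
    pairs = list(zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def macro_prf(gold: list[str], pred: list[str]) -> tuple[float, float, float]:
    """Unweighted mean of per-class precision, recall, and F1."""
    labels = sorted(set(gold) | set(pred))
    scores = [per_class_prf(gold, pred, label) for label in labels]
    n = len(labels)
    return (sum(s[0] for s in scores) / n,
            sum(s[1] for s in scores) / n,
            sum(s[2] for s in scores) / n)


# Toy data, not from the paper: one FR is misclassified as NFR.
gold = ["FR", "FR", "NFR", "NFR"]
pred = ["FR", "NFR", "NFR", "NFR"]
macro_p, macro_r, macro_f1 = macro_prf(gold, pred)
```

Macro-averaging weights each class equally regardless of its size, which matters for imbalanced datasets such as SecReq (187 security vs 323 non-security requirements).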

Hardware Environment

SLMs: Linux 6.14 server, Intel i9-13900K CPU, 128 GB RAM, NVIDIA RTX 4090 GPU
LLMs: accessed through commercial APIs

Model	PROMISE			PROMISE Reclass			SecReq
	P	R	F1	P	R	F1	P	R	F1
SLM average	0.85	0.79	0.82	0.62	0.91	0.73	0.83	0.90	0.86
LLM average	0.86	0.81	0.83	0.67	0.87	0.75	0.85	0.90	0.88
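
The reported 2% average F1 gap can be checked against the table with simple arithmetic. The sketch below assumes the gap is the difference between the unweighted means of the three per-dataset F1 averages; the paper may aggregate differently. The dictionary layout and function name are illustrative.

```python
from statistics import mean

# Average F1 per dataset, copied from the table above.
F1 = {
    "SLM": {"PROMISE": 0.82, "PROMISE Reclass": 0.73, "SecReq": 0.86},
    "LLM": {"PROMISE": 0.83, "PROMISE Reclass": 0.75, "SecReq": 0.88},
}


def mean_f1(group: str) -> float:
    """Unweighted mean F1 across the three datasets for one model group."""
    return mean(F1[group].values())


gap = mean_f1("LLM") - mean_f1("SLM")  # about 0.017
largest_gap = max(F1["LLM"][d] - F1["SLM"][d] for d in F1["LLM"])
```

Under this assumption the gap is about 0.017, which rounds to the 2 percentage points quoted in the summary, and no single dataset shows a gap larger than about 0.02.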

Best-performing models:

Claude-4 (LLM): PROMISE (F1=0.82), PROMISE Reclass (F1=0.80), SecReq (F1=0.89)
Llama-3-8B (SLM): PROMISE (F1=0.80), PROMISE Reclass (F1=0.78), SecReq (F1=0.88)