2025-11-11T17:07:09.499066

Does Model Size Matter? A Comparison of Small and Large Language Models for Requirements Classification

Zadenoori, De Martino, Dabrowski et al.

[Context and motivation] Large language models (LLMs) show notable results in natural language processing (NLP) tasks for requirements engineering (RE). However, their use is compromised by high computational cost, data sharing risks, and dependence on external services. In contrast, small language models (SLMs) offer a lightweight, locally deployable alternative. [Question/problem] It remains unclear how well SLMs perform compared to LLMs in RE tasks in terms of accuracy. [Results] Our preliminary study compares eight models, including three LLMs and five SLMs, on requirements classification tasks using the PROMISE, PROMISE Reclass, and SecReq datasets. Our results show that although LLMs achieve an average F1 score 2% higher than SLMs, this difference is not statistically significant. SLMs almost reach LLM performance across all datasets and even outperform them in recall on the PROMISE Reclass dataset, despite being up to 300 times smaller. We also found that dataset characteristics play a more significant role in performance than model size. [Contribution] Our study contributes evidence that SLMs are a valid alternative to LLMs for requirements classification, offering advantages in privacy, cost, and local deployability.

Basic Information

Paper ID: 2510.21443
Title: Does Model Size Matter? A Comparison of Small and Large Language Models for Requirements Classification
Authors: Mohammad Amin Zadenoori, Vincenzo De Martino, Jacek Dąbrowski, Xavier Franch, Alessio Ferrari
Categories: cs.SE (Software Engineering), cs.AI (Artificial Intelligence), cs.CL (Computational Linguistics)
Publication Date: October 24, 2025 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2510.21443

Abstract

This study compares the performance of Large Language Models (LLMs) and Small Language Models (SLMs) on requirements engineering classification tasks. While LLMs demonstrate excellent performance in natural language processing tasks, they present challenges including high computational costs, data sharing risks, and dependency on external services. SLMs offer lightweight, locally deployable alternatives. Using the PROMISE, PROMISE Reclass, and SecReq datasets, the research compares the performance of 3 LLMs and 5 SLMs. Results show that although LLMs achieve an average F1 score 2% higher than SLMs, this difference is not statistically significant. SLMs nearly match LLM performance levels, even surpassing LLMs in recall on the PROMISE Reclass dataset, despite being up to 300 times smaller. The study also reveals that dataset characteristics have a more significant impact on performance than model size.

Research Background and Motivation

Problem Definition

Requirements classification is a critical task in requirements engineering (RE), involving the categorization of requirements into different types, such as functional/non-functional requirements, or more fine-grained categories (e.g., security, performance). As project scale grows, requirements can number in the thousands, making manual classification labor-intensive and error-prone.

Research Significance

Automation of Requirements: Large projects contain vast numbers of requirements; automated classification can significantly improve efficiency
Support for Other RE Activities: Requirements classification supports other RE activities such as requirements management and traceability
Practical Industry Needs: Industry urgently requires solutions that are both accurate and practical

Limitations of Existing Approaches

LLM Challenges:

High computational costs
Data privacy and security risks (cloud deployment)
Dependency on external services
Proprietary nature limiting customization
Reproducibility issues

Research Gaps:

Systematic comparison of SLMs and LLMs on RE tasks remains unexplored
Lack of in-depth understanding of the relationship between model size and classification accuracy

Core Contributions

First Systematic Comparison: First systematic comparison of SLM and LLM performance on requirements classification tasks
Statistical Significance Analysis: Uses the Scheirer-Ray-Hare test to check whether the observed performance differences are statistically significant
Multi-Dataset Validation: Comprehensive evaluation across three public datasets (PROMISE, PROMISE Reclass, SecReq)
Practical Evidence: Provides empirical evidence that SLMs are viable alternatives to LLMs
Dataset Impact Analysis: Reveals the important finding that dataset characteristics have greater impact on performance than model size

Methodology Details

Task Definition

Input: Natural language requirement text
Output: Requirement category labels (binary classification)

PROMISE: Functional Requirements (FR) vs Non-Functional Requirements (NFR)
PROMISE Reclass: FR vs NFR and Quality Requirements (QR) vs Non-QR (dual-label)
SecReq: Security-related requirements vs non-security requirements

Model Selection

SLMs (7-8B parameters):

Qwen2-7B-Instruct
Falcon-7B-Instruct
Granite-3.2-8B-Instruct
Ministral-8B-Instruct-2410
Meta-Llama-3-8B-Instruct

LLMs (estimated at roughly 1-2 trillion parameters, i.e., up to about 300 times larger than the SLMs):

GPT-5
xAI Grok-4
Claude-4

Technical Approach

Prompting Strategy:

Employs Chain-of-Thought (CoT) combined with Few-Shot learning
Provides 4 examples per category
Supplies category definitions based on expert-defined RE definitions
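
As a rough illustration of the prompting strategy above, the following sketch assembles a few-shot Chain-of-Thought prompt from category definitions and four examples per category. The function name `build_prompt`, the prompt wording, and all definitions and examples are hypothetical placeholders; the paper's actual prompt text is not reproduced in this summary.

```python
from typing import Dict, List, Tuple

Shot = Tuple[str, str]  # (requirement text, short reasoning)


def build_prompt(
    definitions: Dict[str, str],
    examples: Dict[str, List[Shot]],
    requirement: str,
    shots_per_category: int = 4,
) -> str:
    """Assemble a few-shot CoT prompt (hypothetical layout, not the paper's)."""
    lines = [
        "Classify the requirement into exactly one category.",
        "Reason step by step, then end with a final line naming the category.",
        "",
        "Category definitions:",
    ]
    for label, definition in definitions.items():
        lines.append(f"- {label}: {definition}")
    lines += ["", "Examples:"]
    for label, shots in examples.items():
        # Each shot shows the requirement, a worked reasoning, and the label.
        for text, reasoning in shots[:shots_per_category]:
            lines += [
                f"Requirement: {text}",
                f"Reasoning: {reasoning}",
                f"Label: {label}",
                "",
            ]
    # The prompt ends open so the model writes its own reasoning and label.
    lines += [f"Requirement: {requirement}", "Reasoning:"]
    return "\n".join(lines)


# Placeholder definitions and examples, invented purely for illustration.
definitions = {
    "FR": "placeholder definition of a functional requirement",
    "NFR": "placeholder definition of a non-functional requirement",
}
examples = {
    label: [
        (f"placeholder {label} example {i}", f"placeholder reasoning {i}")
        for i in range(1, 5)
    ]
    for label in definitions
}
prompt = build_prompt(
    definitions,
    examples,
    "The system shall respond to user queries within 2 seconds.",
)
print(prompt)
```

With two categories and four shots each, the resulting prompt contains eight worked examples followed by the requirement to classify.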

Experimental Setup:

Temperature parameter set to 0 for deterministic output
Each task executed 3 times with majority voting (2/3) determining final labels
Macro-averaging used for metric calculation
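
The majority-voting step described above can be sketched as follows. This is a minimal illustration, assuming each run yields one label per requirement; the function name `majority_label` is hypothetical.

```python
from collections import Counter
from typing import Sequence


def majority_label(run_labels: Sequence[str]) -> str:
    """Return the label that a strict majority of runs agree on."""
    label, count = Counter(run_labels).most_common(1)[0]
    if count * 2 <= len(run_labels):
        # Cannot happen with 3 runs and 2 classes, but guards other setups.
        raise ValueError("no strict majority among runs")
    return label


# Three runs for one requirement; two of three agree on "NFR".
final = majority_label(["NFR", "FR", "NFR"])
print(final)
```

With three runs and a binary task, at least two runs always agree, so a final label is always defined.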

Experimental Setup

Dataset Details

Dataset	Task Type	Sample Size	Class Distribution
PROMISE	FR vs NFR	625	FR:255, NFR:370
PROMISE Reclass	FR vs NFR & QR vs Non-QR	625	FR:310, QR:382
SecReq	Security vs Non-Security	510	Sec:187, NSec:323

Evaluation Metrics

Precision (P): Ratio of correctly predicted positive instances to all predicted positive instances
Recall (R): Ratio of correctly predicted positive instances to all actual positive instances
F1 Score: Harmonic mean of precision and recall
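
The metrics above, together with macro-averaging, can be computed as in the sketch below. It assumes macro-averaging means the unweighted mean of the per-class scores, which is the common convention; the summary does not spell out the paper's exact computation. The labels in the example are invented.

```python
from typing import List, Sequence, Tuple


def prf(y_true: Sequence[str], y_pred: Sequence[str], positive: str) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_prf(y_true: Sequence[str], y_pred: Sequence[str]) -> Tuple[float, float, float]:
    """Unweighted mean of per-class precision, recall, and F1."""
    labels: List[str] = sorted(set(y_true) | set(y_pred))
    per_class = [prf(y_true, y_pred, label) for label in labels]
    p, r, f1 = (sum(m[i] for m in per_class) / len(labels) for i in range(3))
    return p, r, f1


# Invented toy labels for illustration only.
y_true = ["FR", "FR", "NFR", "NFR", "NFR"]
y_pred = ["FR", "NFR", "NFR", "NFR", "FR"]
macro = macro_prf(y_true, y_pred)
print(macro)
```

In this toy case the FR class scores 1/2 on all three metrics and the NFR class scores 2/3, so each macro-averaged value is 7/12.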

Hardware Environment

SLMs: Linux 6.14 server, Intel i9-13900K CPU, 128GB RAM, NVIDIA RTX 4090 GPU
LLMs: Accessed via commercial APIs

Statistical Testing

The Scheirer-Ray-Hare test, a rank-based non-parametric alternative to two-way ANOVA, is used to analyze the effects of model type, dataset, and their interaction on performance.
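
To make the test concrete, here is a minimal pure-Python sketch of the Scheirer-Ray-Hare procedure: rank all observations, compute two-way sums of squares on the ranks, and divide each by the total mean square to obtain an H statistic that is compared against a chi-square distribution. The sketch assumes a balanced design (equal observations per cell), whereas the study itself has 3 LLMs and 5 SLMs, and it supports only 1 or 2 degrees of freedom, which is enough for a 2 x 3 layout. The F1 values are invented and are not the paper's per-model scores.

```python
import math
from itertools import product
from typing import Dict, List, Tuple


def average_ranks(values: List[float]) -> List[float]:
    """Rank values from 1..N, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def chi2_sf(x: float, df: int) -> float:
    """Chi-square upper tail; closed forms for df 1 and 2 only (sketch limit)."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df == 2:
        return math.exp(-x / 2)
    raise ValueError("this sketch supports only df = 1 or df = 2")


def mean(xs: List[float]) -> float:
    return sum(xs) / len(xs)


def scheirer_ray_hare(cells: Dict[Tuple[str, str], List[float]]) -> Dict[str, Tuple[float, float]]:
    """Return {effect: (H, p)} for factors A, B, and their interaction."""
    a_levels = sorted({a for a, _ in cells})
    b_levels = sorted({b for _, b in cells})
    keys = list(product(a_levels, b_levels))
    n = len(cells[keys[0]])
    if any(len(cells[k]) != n for k in keys):
        raise ValueError("balanced design required in this sketch")
    ranks = average_ranks([v for k in keys for v in cells[k]])
    ranked = {k: ranks[i * n:(i + 1) * n] for i, k in enumerate(keys)}
    grand = mean(ranks)
    ms_total = sum((r - grand) ** 2 for r in ranks) / (len(ranks) - 1)
    ss_a = sum(
        len(b_levels) * n * (mean([r for b in b_levels for r in ranked[(a, b)]]) - grand) ** 2
        for a in a_levels
    )
    ss_b = sum(
        len(a_levels) * n * (mean([r for a in a_levels for r in ranked[(a, b)]]) - grand) ** 2
        for b in b_levels
    )
    ss_cells = sum(n * (mean(ranked[k]) - grand) ** 2 for k in keys)
    ss_ab = max(0.0, ss_cells - ss_a - ss_b)  # clamp float round-off
    df_a, df_b = len(a_levels) - 1, len(b_levels) - 1
    effects = (("A", ss_a, df_a), ("B", ss_b, df_b), ("AxB", ss_ab, df_a * df_b))
    return {name: (ss / ms_total, chi2_sf(ss / ms_total, df)) for name, ss, df in effects}


# Invented F1 scores: factor A = model type, factor B = dataset.
cells = {
    ("LLM", "PROMISE"): [0.83, 0.81], ("SLM", "PROMISE"): [0.82, 0.80],
    ("LLM", "PROMISE Reclass"): [0.75, 0.73], ("SLM", "PROMISE Reclass"): [0.74, 0.72],
    ("LLM", "SecReq"): [0.88, 0.86], ("SLM", "SecReq"): [0.87, 0.85],
}
results = scheirer_ray_hare(cells)
for effect, (h, p) in results.items():
    print(f"{effect}: H={h:.3f}, p={p:.4f}")
```

On this toy data the dataset factor yields a large H and a small p-value, while the model-type factor does not, which mirrors the qualitative pattern reported in the study. For real analyses, an established statistics package that handles unbalanced designs is the safer choice.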

Experimental Results

Main Results

Model	PROMISE			PROMISE Reclass			SecReq
	P	R	F1	P	R	F1	P	R	F1
SLMs Average	0.85	0.79	0.82	0.62	0.91	0.73	0.83	0.90	0.86
LLMs Average	0.86	0.81	0.83	0.67	0.87	0.75	0.85	0.90	0.88
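
As a quick arithmetic check, the average F1 gap can be recomputed from the group averages in the table above. This is only a sketch over the rounded table values; the paper's 2% figure may have been derived from per-model scores instead.

```python
# Group-average F1 scores copied from the table above.
slm_f1 = {"PROMISE": 0.82, "PROMISE Reclass": 0.73, "SecReq": 0.86}
llm_f1 = {"PROMISE": 0.83, "PROMISE Reclass": 0.75, "SecReq": 0.88}

# Per-dataset gap (LLM minus SLM) and its mean across datasets.
gaps = {name: llm_f1[name] - slm_f1[name] for name in slm_f1}
mean_gap = sum(gaps.values()) / len(gaps)

for name, gap in gaps.items():
    print(f"{name}: {gap:+.2f}")
print(f"mean gap: {mean_gap:.4f}")
```

The mean gap is about 0.017, which rounds to the roughly 2-point F1 advantage reported for LLMs.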

Best Performing Models:

Claude-4 (LLM): PROMISE (F1=0.82), PROMISE Reclass (F1=0.80), SecReq (F1=0.89)
Llama-3-8B (SLM): PROMISE (F1=0.80), PROMISE Reclass (F1=0.78), SecReq (F1=0.88)

Statistical Significance Analysis

Hypothesis	Variable	Effect Size (η²H)	p-value	Conclusion
H0A	Model Type	0.04	0.296	No Significant Difference
H0B	Dataset	0.63	<0.001	Significant Difference
H0C	Interaction Effect	0.001	0.790	No Significant Interaction

Key Findings

Comparable Performance: LLMs exceed SLMs by only 2% average F1 score, with no statistically significant difference
SLM Advantages: On the PROMISE Reclass dataset, SLMs outperform LLMs in recall (0.96 vs. a maximum of 0.90 among LLMs), although this difference was not tested for statistical significance separately
Dataset Dominance: Dataset characteristics have far greater impact on performance than model size (effect size 0.63 vs 0.04)
Performance Hierarchy: SecReq (median F1=0.865) > PROMISE (0.805) > PROMISE Reclass (0.730)

Execution Time Analysis

LLMs: 138-300 seconds (cloud-based high-performance infrastructure)
SLMs: Average 400 seconds (single local server)

Related Work

NLP in Requirements Engineering

Traditional approaches primarily use classical machine learning techniques for requirements classification, with deep learning methods gradually emerging in recent years.

Large Language Models in RE

LLMs demonstrate strong capabilities in RE tasks including requirements classification, traceability, and model generation, though practical deployment faces challenges.

Small Language Model Research

SLMs receive attention as lightweight alternatives, but systematic research in the RE domain remains limited.

Conclusions and Discussion

Main Conclusions

Addressing the Research Question: LLMs slightly outperform SLMs with a 2% F1 score advantage, but this difference is not statistically significant. In recall on the PROMISE Reclass dataset, SLMs even surpass LLMs.

Practical Implications

Cost-Effectiveness: SLMs provide comparable performance to LLMs at lower cost
Data Privacy: SLMs enable local deployment, avoiding data leakage risks
Resource Efficiency: SLMs require significantly reduced computational resources
Customization: Open-source SLMs are more amenable to fine-tuning for specific requirements

Limitations

Sample Size: Only 8 models evaluated, potentially introducing Type II errors
Task Scope: Only binary classification tasks considered; results may not generalize to other RE tasks
Prompt Dependency: Single prompting strategy employed; results may lack generalizability
Data Leakage Risk: LLMs may have encountered evaluation datasets during pretraining

In-Depth Evaluation

Strengths

Significant Research Value: Fills the gap in SLM-LLM comparisons in the RE domain
Methodologically Rigorous: Employs appropriate statistical testing methods to validate conclusions
Well-Designed Experiments: Multi-dataset validation enhances result credibility
High Practical Value: Provides empirical guidance for industry model selection
Good Transparency: Provides complete reproducibility package

Weaknesses

Limited Model Selection: SLMs restricted to 7-8B parameter range; larger open-source models not included
Single Task Type: Only classification tasks evaluated; generative RE tasks not covered
Insufficient Statistical Power: Small sample size may result in low statistical test power
Lacking Cost Analysis: No detailed computational cost and energy consumption comparison provided

Impact

Academic Impact:

Provides important reference for model selection in RE
Inspires deeper consideration of the relationship between model size and performance

Practical Value:

Provides basis for enterprises to balance privacy, cost, and performance
Promotes adoption of localized AI solutions in RE

Applicable Scenarios

Privacy-Sensitive Environments: Industries with stringent data privacy requirements (finance, healthcare)
Resource-Constrained Settings: Small-to-medium enterprises or environments with limited computational resources
Offline Deployment Requirements: Scenarios requiring operation in network-disconnected environments
Cost Control: Applications sensitive to API call costs

Future Research Directions

Directions Proposed by Authors

Interpretability: Develop models capable of generating classification explanations to enhance decision transparency
Multi-Task Evaluation: Extend to other RE tasks such as requirements traceability and model generation
Hybrid Pipelines: Design RE workflows where SLMs and LLMs work synergistically
Energy Consumption Research: Quantify environmental impact of different models
Tool Support: Develop practical tools supporting flexible model selection

Suggested Extended Research

Larger-Scale Studies: Include more models and larger datasets
Fine-Grained Analysis: Investigate classification difficulty variations across different requirement types
Domain Adaptation: Evaluate model generalization across different application domains
Human-AI Collaboration: Study collaboration patterns between human experts and AI models

References

The paper cites 17 relevant references covering important works in requirements engineering, natural language processing, and language models, providing a solid theoretical foundation for the research.

Overall Assessment: This is a high-quality empirical research paper providing valuable insights on an important and practical question. Despite certain limitations, its findings hold significant importance for both academia and industry, particularly regarding current AI model selection and deployment strategy formulation.