[Context and motivation] Large language models (LLMs) show notable results in natural language processing (NLP) tasks for requirements engineering (RE). However, their use is compromised by high computational cost, data sharing risks, and dependence on external services. In contrast, small language models (SLMs) offer a lightweight, locally deployable alternative. [Question/problem] It remains unclear how well SLMs perform compared to LLMs in RE tasks in terms of accuracy. [Results] Our preliminary study compares eight models, including three LLMs and five SLMs, on requirements classification tasks using the PROMISE, PROMISE Reclass, and SecReq datasets. Our results show that although LLMs achieve an average F1 score 2% higher than SLMs, this difference is not statistically significant. SLMs almost reach LLM performance across all datasets and even outperform LLMs in recall on the PROMISE Reclass dataset, despite being up to 300 times smaller. We also found that dataset characteristics play a more significant role in performance than model size. [Contribution] Our study provides evidence that SLMs are a valid alternative to LLMs for requirements classification, offering advantages in privacy, cost, and local deployability.
Does Model Size Matter? A Comparison of Small and Large Language Models for Requirements Classification
- Paper ID: 2510.21443
- Title: Does Model Size Matter? A Comparison of Small and Large Language Models for Requirements Classification
- Authors: Mohammad Amin Zadenoori, Vincenzo De Martino, Jacek Dąbrowski, Xavier Franch, Alessio Ferrari
- Categories: cs.SE (Software Engineering), cs.AI (Artificial Intelligence), cs.CL (Computation and Language)
- Publication Date: October 24, 2025 (arXiv preprint)
- Paper Link: https://arxiv.org/abs/2510.21443
This study compares the performance of Large Language Models (LLMs) and Small Language Models (SLMs) on requirements engineering classification tasks. While LLMs demonstrate excellent performance in natural language processing tasks, they present challenges including high computational costs, data sharing risks, and dependency on external services. SLMs offer lightweight, locally deployable alternatives. Using the PROMISE, PROMISE Reclass, and SecReq datasets, the research compares the performance of 3 LLMs and 5 SLMs. Results show that although LLMs achieve an average F1 score 2% higher than SLMs, this difference is not statistically significant. SLMs nearly match LLM performance levels, even surpassing LLMs in recall on the PROMISE Reclass dataset, despite having up to 300 times fewer parameters. The study also reveals that dataset characteristics have a more significant impact on performance than model size.
Requirements classification is a critical task in requirements engineering (RE), involving the categorization of requirements into different types, such as functional/non-functional requirements, or more fine-grained categories (e.g., security, performance). As project scale grows, requirements can number in the thousands, making manual classification labor-intensive and error-prone.
- Automation of Requirements: Large projects contain vast numbers of requirements; automated classification can significantly improve efficiency
- Support for Other RE Activities: Requirements classification supports other RE activities such as requirements management and traceability
- Practical Industry Needs: Industry urgently requires solutions that are both accurate and practical
LLM Challenges:
- High computational costs
- Data privacy and security risks (cloud deployment)
- Dependency on external services
- Proprietary nature limiting customization
- Reproducibility issues
Research Gaps:
- Systematic comparison of SLMs and LLMs on RE tasks remains unexplored
- Lack of in-depth understanding of the relationship between model size and classification accuracy
- First Systematic Comparison: First systematic comparison of SLM and LLM performance on requirements classification tasks
- Statistical Significance Analysis: Uses the Scheirer-Ray-Hare test to check whether the observed performance differences are statistically significant
- Multi-Dataset Validation: Comprehensive evaluation across three public datasets (PROMISE, PROMISE Reclass, SecReq)
- Practical Evidence: Provides empirical evidence that SLMs are viable alternatives to LLMs
- Dataset Impact Analysis: Reveals the important finding that dataset characteristics have greater impact on performance than model size
Input: Natural language requirement text
Output: Requirement category labels (binary classification)
- PROMISE: Functional Requirements (FR) vs Non-Functional Requirements (NFR)
- PROMISE Reclass: FR vs NFR and Quality Requirements (QR) vs Non-QR (dual-label)
- SecReq: Security-related requirements vs non-security requirements
SLMs (7-8B parameters):
- Qwen2-7B-Instruct
- Falcon-7B-Instruct
- Granite-3.2-8B-Instruct
- Ministral-8B-Instruct-2410
- Meta-Llama-3-8B-Instruct
LLMs (on the order of 1-2 trillion parameters, i.e., up to 300 times larger than the SLMs):
Prompting Strategy:
- Employs Chain-of-Thought (CoT) combined with Few-Shot learning
- Provides 4 examples per category
- Supplies category definitions based on expert-defined RE definitions
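The prompting strategy above can be sketched as a small prompt builder. This is a minimal illustration, not the authors' actual prompt: the function name `build_prompt`, the wording of the instructions, and the sample definitions and requirements are all hypothetical. Only the overall shape (category definitions, 4 worked examples per category with reasoning, then the target requirement) follows the description.

```python
from typing import Dict, List, Tuple

def build_prompt(
    definitions: Dict[str, str],
    examples: Dict[str, List[Tuple[str, str]]],
    requirement: str,
    shots_per_category: int = 4,
) -> str:
    """Assemble a few-shot, chain-of-thought classification prompt (hypothetical layout)."""
    lines = ["Classify the requirement into exactly one category.", "", "Category definitions:"]
    for label, definition in definitions.items():
        lines.append(f"- {label}: {definition}")
    lines += ["", "Examples:"]
    for label, shots in examples.items():
        if len(shots) < shots_per_category:
            raise ValueError(f"need {shots_per_category} examples for {label}")
        # Each example shows the reasoning before the label (chain of thought).
        for text, reasoning in shots[:shots_per_category]:
            lines += [f"Requirement: {text}", f"Reasoning: {reasoning}", f"Label: {label}", ""]
    # Leave the reasoning open so the model explains itself before answering.
    lines += [f"Requirement: {requirement}", "Reasoning:"]
    return "\n".join(lines)

# Hypothetical definitions and examples, for illustration only.
definitions = {
    "FR": "describes a behaviour or function the system must provide",
    "NFR": "describes a quality or constraint on how the system operates",
}
examples = {
    "FR": [
        ("The system shall let users reset their password.", "It names a function offered to users."),
        ("The system shall send an email after each order.", "It describes an action the system performs."),
        ("The admin shall be able to delete accounts.", "It describes a capability of the system."),
        ("The system shall display the order history.", "It describes what the system does."),
    ],
    "NFR": [
        ("Pages shall load within 2 seconds.", "It constrains performance, not behaviour."),
        ("The system shall be available 99.9% of the time.", "It constrains availability."),
        ("All traffic shall be encrypted.", "It constrains security."),
        ("The UI shall follow the corporate style guide.", "It constrains look and feel."),
    ],
}
prompt = build_prompt(definitions, examples, "The system shall export reports as PDF.")
```

The same builder would serve each binary task by swapping in the relevant pair of labels and definitions.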
Experimental Setup:
- Temperature parameter set to 0 for deterministic output
- Each task executed 3 times with majority voting (2/3) determining final labels
- Macro-averaging used for metric calculation
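The three-run majority vote described above can be sketched as follows. The helper names `majority_label` and `aggregate_runs` and the sample labels are assumptions made for illustration. With binary labels and three runs a strict majority always exists, so the error branch only guards against misuse.

```python
from collections import Counter
from typing import List, Sequence

def majority_label(run_labels: Sequence[str]) -> str:
    """Return the label predicted by a strict majority of runs (e.g. 2 of 3)."""
    label, votes = Counter(run_labels).most_common(1)[0]
    if votes * 2 <= len(run_labels):
        raise ValueError("no strict majority among runs")
    return label

def aggregate_runs(runs: List[List[str]]) -> List[str]:
    """Combine per-run predictions (one list per run) into one final label per requirement."""
    return [majority_label(labels) for labels in zip(*runs)]

# Hypothetical predictions from three runs over three requirements.
runs = [
    ["FR", "NFR", "FR"],
    ["FR", "FR", "NFR"],
    ["NFR", "NFR", "NFR"],
]
final_labels = aggregate_runs(runs)
```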
| Dataset | Task Type | Sample Size | Class Distribution |
|---|---|---|---|
| PROMISE | FR vs NFR | 625 | FR:255, NFR:370 |
| PROMISE Reclass | FR vs NFR & QR vs Non-QR | 625 | FR:310, QR:382 |
| SecReq | Security vs Non-Security | 510 | Sec:187, NSec:323 |
- Precision (P): Ratio of correctly predicted positive instances to all predicted positive instances
- Recall (R): Ratio of correctly predicted positive instances to all actual positive instances
- F1 Score: Harmonic mean of precision and recall
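As a worked illustration of these metrics with macro-averaging, the sketch below computes per-class precision, recall, and F1 and then averages them over classes. It assumes macro F1 is the mean of the per-class F1 scores, which is one common convention; the function names and sample labels are hypothetical.

```python
from typing import List, Sequence, Tuple

def per_class_prf(y_true: Sequence[str], y_pred: Sequence[str], positive: str) -> Tuple[float, float, float]:
    """Precision, recall, and F1 when `positive` is treated as the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def macro_prf(y_true: Sequence[str], y_pred: Sequence[str]) -> Tuple[float, float, float]:
    """Unweighted mean of per-class precision, recall, and F1."""
    labels: List[str] = sorted(set(y_true) | set(y_pred))
    scores = [per_class_prf(y_true, y_pred, label) for label in labels]
    p, r, f = (sum(s[i] for s in scores) / len(labels) for i in range(3))
    return p, r, f

# Hypothetical gold labels and predictions.
y_true = ["FR", "FR", "NFR", "NFR"]
y_pred = ["FR", "NFR", "NFR", "NFR"]
macro_p, macro_r, macro_f1 = macro_prf(y_true, y_pred)
```

Macro-averaging weights both classes equally, which matters here because the datasets are imbalanced (for example 187 security vs. 323 non-security requirements in SecReq).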
- SLMs: Linux 6.14 server, Intel i9-13900K CPU, 128GB RAM, NVIDIA RTX 4090 GPU
- LLMs: Accessed via commercial APIs
The Scheirer-Ray-Hare test (a non-parametric, rank-based analogue of two-way ANOVA) was used to analyze the effects of model type and dataset on performance.
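To make the test concrete, here is a minimal pure-Python sketch of the Scheirer-Ray-Hare procedure: rank all observations, compute two-way sums of squares on the ranks, divide each by the total mean square to get H, and compare H against a chi-square distribution. It rests on several assumptions: a balanced design (the study itself has 3 LLMs and 5 SLMs, which is unbalanced and needs more careful sums of squares), closed-form chi-square tails for 1 and 2 degrees of freedom only, and entirely hypothetical F1 scores. It is not the authors' analysis script.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Obs = Tuple[str, str, float]  # (model_type, dataset, f1_score)

def average_ranks(values: List[float]) -> List[float]:
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = (start + end) / 2 + 1
        start = end + 1
    return ranks

def chi2_sf(h: float, df: int) -> float:
    """Chi-square upper tail; closed forms exist for df = 1 and df = 2."""
    if df == 1:
        return math.erfc(math.sqrt(h / 2))
    if df == 2:
        return math.exp(-h / 2)
    raise ValueError("this sketch only supports df of 1 or 2")

def group_ss(ranks: List[float], keys: list, grand_mean: float) -> float:
    """Between-group sum of squares of the ranks for the given grouping."""
    groups = defaultdict(list)
    for rank, key in zip(ranks, keys):
        groups[key].append(rank)
    return sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values())

def scheirer_ray_hare(observations: List[Obs]) -> Dict[str, Tuple[float, float]]:
    """Return {effect: (H, p_value)} for a balanced two-factor design."""
    ranks = average_ranks([score for _, _, score in observations])
    n = len(ranks)
    grand = (n + 1) / 2
    ms_total = sum((r - grand) ** 2 for r in ranks) / (n - 1)
    a_keys = [a for a, _, _ in observations]
    b_keys = [b for _, b, _ in observations]
    ss_a = group_ss(ranks, a_keys, grand)
    ss_b = group_ss(ranks, b_keys, grand)
    # Interaction = cell effect minus both main effects (valid when balanced).
    ss_ab = max(0.0, group_ss(ranks, list(zip(a_keys, b_keys)), grand) - ss_a - ss_b)
    df_a, df_b = len(set(a_keys)) - 1, len(set(b_keys)) - 1
    effects = [("model_type", ss_a, df_a), ("dataset", ss_b, df_b), ("interaction", ss_ab, df_a * df_b)]
    return {name: (ss / ms_total, chi2_sf(ss / ms_total, df)) for name, ss, df in effects}

# Hypothetical F1 scores: 2 model types x 3 datasets x 2 models per cell.
demo: List[Obs] = [
    ("SLM", "SecReq", 0.86), ("SLM", "SecReq", 0.88), ("LLM", "SecReq", 0.87), ("LLM", "SecReq", 0.89),
    ("SLM", "PROMISE", 0.80), ("SLM", "PROMISE", 0.82), ("LLM", "PROMISE", 0.81), ("LLM", "PROMISE", 0.83),
    ("SLM", "Reclass", 0.72), ("SLM", "Reclass", 0.74), ("LLM", "Reclass", 0.73), ("LLM", "Reclass", 0.75),
]
srh = scheirer_ray_hare(demo)
```

On this toy data the dataset factor dominates the ranking while model type barely shifts it, which mirrors the qualitative pattern the study reports.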
| Model | PROMISE P | PROMISE R | PROMISE F1 | PROMISE Reclass P | PROMISE Reclass R | PROMISE Reclass F1 | SecReq P | SecReq R | SecReq F1 |
|---|---|---|---|---|---|---|---|---|---|
| SLMs Average | 0.85 | 0.79 | 0.82 | 0.62 | 0.91 | 0.73 | 0.83 | 0.90 | 0.86 |
| LLMs Average | 0.86 | 0.81 | 0.83 | 0.67 | 0.87 | 0.75 | 0.85 | 0.90 | 0.88 |
Best Performing Models:
- Claude-4 (LLM): PROMISE (F1=0.82), PROMISE Reclass (F1=0.80), SecReq (F1=0.89)
- Llama-3-8B (SLM): PROMISE (F1=0.80), PROMISE Reclass (F1=0.78), SecReq (F1=0.88)
| Hypothesis | Variable | Effect Size (η²H) | p-value | Conclusion |
|---|---|---|---|---|
| H0A | Model Type | 0.04 | 0.296 | No Significant Difference |
| H0B | Dataset | 0.63 | <0.001 | Significant Difference |
| H0C | Interaction Effect | 0.001 | 0.790 | No Significant Interaction |
- Comparable Performance: LLMs exceed SLMs by only 2% average F1 score, with no statistically significant difference
- SLM Advantages: On the PROMISE Reclass dataset, SLMs outperform LLMs in recall (0.96 vs. a maximum of 0.90 among LLMs; the group averages are 0.91 vs. 0.87)
- Dataset Dominance: Dataset characteristics have far greater impact on performance than model size (effect size 0.63 vs 0.04)
- Performance Hierarchy: SecReq (median F1=0.865) > PROMISE (0.805) > PROMISE Reclass (0.730)
- LLM runtime: 138-300 seconds (cloud-based high-performance infrastructure)
- SLM runtime: 400 seconds on average (single local server)
Traditional approaches primarily use classical machine learning techniques for requirements classification, with deep learning methods gradually emerging in recent years.
LLMs demonstrate strong capabilities in RE tasks including requirements classification, traceability, and model generation, though practical deployment faces challenges.
SLMs receive attention as lightweight alternatives, but systematic research in the RE domain remains limited.
Addressing the Research Question: LLMs slightly outperform SLMs with a 2% F1 score advantage, but this difference is not statistically significant. In recall on the PROMISE Reclass dataset, SLMs even surpass LLMs.
- Cost-Effectiveness: SLMs provide comparable performance to LLMs at lower cost
- Data Privacy: SLMs enable local deployment, avoiding data leakage risks
- Resource Efficiency: SLMs require significantly reduced computational resources
- Customization: Open-source SLMs are more amenable to fine-tuning for specific requirements
- Sample Size: Only 8 models evaluated, which limits statistical power and raises the risk of Type II errors (missing a real difference)
- Task Scope: Only binary classification tasks considered; results may not generalize to other RE tasks
- Prompt Dependency: Single prompting strategy employed; results may lack generalizability
- Data Leakage Risk: LLMs may have encountered evaluation datasets during pretraining
- Significant Research Value: Fills the gap in SLM-LLM comparisons in the RE domain
- Methodologically Rigorous: Employs appropriate statistical testing methods to validate conclusions
- Well-Designed Experiments: Multi-dataset validation enhances result credibility
- High Practical Value: Provides empirical guidance for industry model selection
- Good Transparency: Provides complete reproducibility package
- Limited Model Selection: SLMs restricted to 7-8B parameter range; larger open-source models not included
- Single Task Type: Only classification tasks evaluated; generative RE tasks not covered
- Insufficient Statistical Power: Small sample size may result in low statistical test power
- Lacking Cost Analysis: No detailed computational cost and energy consumption comparison provided
Academic Impact:
- Provides important reference for model selection in RE
- Inspires deeper consideration of the relationship between model size and performance
Practical Value:
- Provides basis for enterprises to balance privacy, cost, and performance
- Promotes adoption of localized AI solutions in RE
- Privacy-Sensitive Environments: Industries with stringent data privacy requirements (finance, healthcare)
- Resource-Constrained Settings: Small-to-medium enterprises or environments with limited computational resources
- Offline Deployment Requirements: Scenarios requiring operation in network-disconnected environments
- Cost Control: Applications sensitive to API call costs
- Interpretability: Develop models capable of generating classification explanations to enhance decision transparency
- Multi-Task Evaluation: Extend to other RE tasks such as requirements traceability and model generation
- Hybrid Pipelines: Design RE workflows where SLMs and LLMs work synergistically
- Energy Consumption Research: Quantify environmental impact of different models
- Tool Support: Develop practical tools supporting flexible model selection
- Larger-Scale Studies: Include more models and larger datasets
- Fine-Grained Analysis: Investigate classification difficulty variations across different requirement types
- Domain Adaptation: Evaluate model generalization across different application domains
- Human-AI Collaboration: Study collaboration patterns between human experts and AI models
The paper cites 17 relevant references covering important works in requirements engineering, natural language processing, and language models, providing a solid theoretical foundation for the research.
Overall Assessment: This is a high-quality empirical research paper providing valuable insights on an important and practical question. Despite certain limitations, its findings hold significant importance for both academia and industry, particularly regarding current AI model selection and deployment strategy formulation.