Does Model Size Matter? A Comparison of Small and Large Language Models for Requirements Classification

Zadenoori, De Martino, Dabrowski et al.
[Context and motivation] Large language models (LLMs) show notable results in natural language processing (NLP) tasks for requirements engineering (RE). However, their use is compromised by high computational cost, data sharing risks, and dependence on external services. In contrast, small language models (SLMs) offer a lightweight, locally deployable alternative. [Question/problem] It remains unclear how well SLMs perform compared to LLMs in RE tasks in terms of accuracy. [Results] Our preliminary study compares eight models, including three LLMs and five SLMs, on requirements classification tasks using the PROMISE, PROMISE Reclass, and SecReq datasets. Our results show that although LLMs achieve an average F1 score of 2% higher than SLMs, this difference is not statistically significant. SLMs almost reach LLMs performance across all datasets and even outperform them in recall on the PROMISE Reclass dataset, despite being up to 300 times smaller. We also found that dataset characteristics play a more significant role in performance than model size. [Contribution] Our study contributes with evidence that SLMs are a valid alternative to LLMs for requirements classification, offering advantages in privacy, cost, and local deployability.

Basic Information

  • Paper ID: 2510.21443
  • Title: Does Model Size Matter? A Comparison of Small and Large Language Models for Requirements Classification
  • Authors: Mohammad Amin Zadenoori, Vincenzo De Martino, Jacek Dąbrowski, Xavier Franch, Alessio Ferrari
  • Categories: cs.SE (Software Engineering), cs.AI (Artificial Intelligence), cs.CL (Computational Linguistics)
  • Publication Date: October 24, 2025 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2510.21443

Abstract

This study compares the performance of Large Language Models (LLMs) and Small Language Models (SLMs) on requirements engineering classification tasks. While LLMs demonstrate excellent performance in natural language processing tasks, they present challenges including high computational costs, data sharing risks, and dependency on external services. SLMs offer lightweight, locally deployable alternatives. Using the PROMISE, PROMISE Reclass, and SecReq datasets, the research compares the performance of 3 LLMs and 5 SLMs. Results show that although LLMs achieve an average F1 score 2% higher than SLMs, this difference is not statistically significant. SLMs nearly match LLM performance levels, even surpassing LLMs in recall on the PROMISE Reclass dataset, despite being up to 300 times smaller. The study also reveals that dataset characteristics have a more significant impact on performance than model size.

Research Background and Motivation

Problem Definition

Requirements classification is a critical task in requirements engineering (RE), involving the categorization of requirements into different types, such as functional/non-functional requirements, or more fine-grained categories (e.g., security, performance). As project scale grows, requirements can number in the thousands, making manual classification labor-intensive and error-prone.

Research Significance

  1. Automation of Requirements Classification: Large projects contain vast numbers of requirements; automated classification can significantly improve efficiency
  2. Support for Other RE Activities: Requirements classification supports other RE activities such as requirements management and traceability
  3. Practical Industry Needs: Industry urgently requires solutions that are both accurate and practical

Limitations of Existing Approaches

LLM Challenges:

  • High computational costs
  • Data privacy and security risks (cloud deployment)
  • Dependency on external services
  • Proprietary nature limiting customization
  • Reproducibility issues

Research Gaps:

  • Systematic comparison of SLMs and LLMs on RE tasks remains unexplored
  • Lack of in-depth understanding of the relationship between model size and classification accuracy

Core Contributions

  1. First Systematic Comparison: Systematically compares SLM and LLM performance on requirements classification tasks, a comparison not previously reported for RE
  2. Statistical Significance Analysis: Uses statistical methods such as Scheirer-Ray-Hare test to verify the significance of performance differences
  3. Multi-Dataset Validation: Comprehensive evaluation across three public datasets (PROMISE, PROMISE Reclass, SecReq)
  4. Practical Evidence: Provides empirical evidence that SLMs are viable alternatives to LLMs
  5. Dataset Impact Analysis: Reveals the important finding that dataset characteristics have greater impact on performance than model size

Methodology Details

Task Definition

Input: Natural language requirement text
Output: Requirement category labels (binary classification)

  • PROMISE: Functional Requirements (FR) vs Non-Functional Requirements (NFR)
  • PROMISE Reclass: FR vs NFR and Quality Requirements (QR) vs Non-QR (dual-label)
  • SecReq: Security-related requirements vs non-security requirements

Model Selection

SLMs (7-8B parameters):

  • Qwen2-7B-Instruct
  • Falcon-7B-Instruct
  • Granite-3.2-8B-Instruct
  • Ministral-8B-Instruct-2410
  • Meta-Llama-3-8B-Instruct

LLMs (10-20 trillion parameters):

  • GPT-5
  • xAI Grok-4
  • Claude-4

Technical Approach

Prompting Strategy:

  • Employs Chain-of-Thought (CoT) combined with Few-Shot learning
  • Provides 4 examples per category
  • Supplies category definitions based on expert-defined RE definitions
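
The prompting strategy above could be assembled roughly as follows. This is a minimal sketch, not the authors' prompt: the function name `build_prompt`, the instruction wording, the category definitions, and the example requirements are all hypothetical. Only the overall structure (category definitions, four worked examples per category, a request for step-by-step reasoning) comes from the description above.

```python
from typing import Dict, List, Tuple


def build_prompt(
    definitions: Dict[str, str],
    examples: Dict[str, List[Tuple[str, str]]],
    requirement: str,
    shots_per_category: int = 4,
) -> str:
    """Assemble a few-shot, chain-of-thought classification prompt.

    definitions: label -> definition text
    examples:    label -> list of (requirement text, short rationale)
    """
    lines = ["Classify the requirement into exactly one category.", "", "Definitions:"]
    for label, definition in definitions.items():
        lines.append(f"- {label}: {definition}")

    lines += ["", "Examples:"]
    for label, shots in examples.items():
        # Keep at most `shots_per_category` worked examples for each label.
        for text, rationale in shots[:shots_per_category]:
            lines += [f"Requirement: {text}", f"Reasoning: {rationale}", f"Label: {label}", ""]

    lines += [
        "Now classify the following requirement.",
        "Reason step by step, then end with 'Label: <category>'.",
        f"Requirement: {requirement}",
    ]
    return "\n".join(lines)


# Hypothetical definitions and examples, for illustration only.
definitions = {
    "FR": "describes a function or behavior the system must provide",
    "NFR": "describes a quality or constraint on how the system operates",
}
examples = {
    "FR": [("The system shall export reports as PDF.", "It states a behavior the system performs.")] * 5,
    "NFR": [("Pages shall load within two seconds.", "It constrains performance, not behavior.")] * 4,
}
prompt = build_prompt(definitions, examples, "The system shall encrypt stored passwords.")
print(prompt)
```

Keeping the definitions and examples as plain dictionaries makes it easy to reuse the same builder for the three datasets by swapping in a different label set.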

Experimental Setup:

  • Temperature parameter set to 0 for deterministic output
  • Each task executed 3 times with majority voting (2/3) determining final labels
  • Macro-averaging used for metric calculation

Experimental Setup

Dataset Details

Dataset         | Task Type                | Sample Size | Class Distribution
PROMISE         | FR vs NFR                | 625         | FR: 255, NFR: 370
PROMISE Reclass | FR vs NFR & QR vs Non-QR | 625         | FR: 310, QR: 382
SecReq          | Security vs Non-Security | 510         | Sec: 187, NSec: 323

Evaluation Metrics

  • Precision (P): Ratio of correctly predicted positive instances to all predicted positive instances
  • Recall (R): Ratio of correctly predicted positive instances to all actual positive instances
  • F1 Score: Harmonic mean of precision and recall
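
As a worked illustration of these definitions, the sketch below computes per-class precision, recall, and F1 and then macro-averages them. It assumes that macro F1 is the unweighted mean of the per-class F1 scores (the convention used by common toolkits); the summary does not state which variant the authors used. The function names and the toy labels are hypothetical.

```python
from typing import List, Tuple


def per_class_prf(y_true: List[str], y_pred: List[str], label: str) -> Tuple[float, float, float]:
    """Precision, recall, and F1 when `label` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_prf(y_true: List[str], y_pred: List[str]) -> Tuple[float, float, float]:
    """Unweighted mean of per-class precision, recall, and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = [per_class_prf(y_true, y_pred, label) for label in labels]
    n = len(labels)
    return (
        sum(s[0] for s in scores) / n,
        sum(s[1] for s in scores) / n,
        sum(s[2] for s in scores) / n,
    )


# Toy example: one FR requirement is misclassified as NFR.
y_true = ["FR", "FR", "NFR", "NFR"]
y_pred = ["FR", "NFR", "NFR", "NFR"]
macro_p, macro_r, macro_f1 = macro_prf(y_true, y_pred)
print(f"P={macro_p:.3f} R={macro_r:.3f} F1={macro_f1:.3f}")  # P=0.833 R=0.750 F1=0.733
```

Macro-averaging weights each class equally, so a minority class such as security requirements in SecReq counts as much as the majority class.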

Hardware Environment

  • SLMs: Linux 6.14 server, Intel i9-13900K CPU, 128GB RAM, NVIDIA RTX 4090 GPU
  • LLMs: Accessed via commercial APIs

Statistical Testing

Scheirer-Ray-Hare test (non-parametric two-way ANOVA) used to analyze the effects of model type and dataset on performance.
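
The mechanics of the test can be sketched as follows: rank all observations, run a two-way sum-of-squares decomposition on the ranks, and divide each effect's sum of squares by the total mean square to obtain an H statistic. This sketch assumes a balanced, fully crossed design, which is simpler than the study's design (3 LLMs vs 5 SLMs), so it illustrates the technique and is not a reproduction of the paper's analysis. The function names and the F1 values are hypothetical. A p-value would come from a chi-square distribution with the listed degrees of freedom (for example via `scipy.stats.chi2.sf`, which is not used here so that the sketch has no dependencies).

```python
from typing import Dict, List, Tuple


def average_ranks(values: List[float]) -> List[float]:
    """Rank values from 1..N, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mean(xs: List[float]) -> float:
    return sum(xs) / len(xs)


def scheirer_ray_hare(cells: Dict[Tuple[str, str], List[float]]) -> Dict[str, Tuple[float, int]]:
    """Return {effect: (H statistic, degrees of freedom)} for a balanced design.

    cells[(a_level, b_level)] holds the observations of one cell.
    """
    keys = list(cells)
    a_levels = sorted({a for a, _ in keys})
    b_levels = sorted({b for _, b in keys})
    n = len(cells[keys[0]])
    if len(keys) != len(a_levels) * len(b_levels) or any(len(v) != n for v in cells.values()):
        raise ValueError("balanced, fully crossed design required")

    flat = [x for k in keys for x in cells[k]]
    ranks = average_ranks(flat)
    ranked = {k: ranks[i * n:(i + 1) * n] for i, k in enumerate(keys)}
    total_n = len(flat)
    grand = mean(ranks)
    ms_total = sum((r - grand) ** 2 for r in ranks) / (total_n - 1)

    def ss_factor(levels: List[str], pos: int) -> float:
        total = 0.0
        for level in levels:
            group = [r for k in keys if k[pos] == level for r in ranked[k]]
            total += len(group) * (mean(group) - grand) ** 2
        return total

    ss_a = ss_factor(a_levels, 0)
    ss_b = ss_factor(b_levels, 1)
    ss_cells = sum(n * (mean(ranked[k]) - grand) ** 2 for k in keys)
    ss_ab = ss_cells - ss_a - ss_b
    return {
        "A": (ss_a / ms_total, len(a_levels) - 1),
        "B": (ss_b / ms_total, len(b_levels) - 1),
        "AxB": (ss_ab / ms_total, (len(a_levels) - 1) * (len(b_levels) - 1)),
    }


# Hypothetical F1 scores: factor A is model type, factor B is dataset.
cells = {
    ("SLM", "D1"): [0.71, 0.72],
    ("SLM", "D2"): [0.85, 0.86],
    ("LLM", "D1"): [0.73, 0.74],
    ("LLM", "D2"): [0.87, 0.88],
}
result = scheirer_ray_hare(cells)
for effect, (h, df) in result.items():
    print(f"{effect}: H={h:.3f}, df={df}")
```

In this toy data the dataset factor yields a much larger H than the model-type factor, which mirrors the qualitative pattern the summary reports.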

Experimental Results

Main Results

Model        | PROMISE            | PROMISE Reclass    | SecReq
             | P    | R    | F1   | P    | R    | F1   | P    | R    | F1
SLMs Average | 0.85 | 0.79 | 0.82 | 0.62 | 0.91 | 0.73 | 0.83 | 0.90 | 0.86
LLMs Average | 0.86 | 0.81 | 0.83 | 0.67 | 0.87 | 0.75 | 0.85 | 0.90 | 0.88

Best Performing Models:

  • Claude-4 (LLM): PROMISE (F1=0.82), PROMISE Reclass (F1=0.80), SecReq (F1=0.89)
  • Llama-3-8B (SLM): PROMISE (F1=0.80), PROMISE Reclass (F1=0.78), SecReq (F1=0.88)

Statistical Significance Analysis

Hypothesis | Variable           | Effect Size (η²_H) | p-value | Conclusion
H0A        | Model Type         | 0.04               | 0.296   | No Significant Difference
H0B        | Dataset            | 0.63               | <0.001  | Significant Difference
H0C        | Interaction Effect | 0.001              | 0.790   | No Significant Interaction

Key Findings

  1. Comparable Performance: LLMs exceed SLMs by only 2% average F1 score, with no statistically significant difference
  2. SLM Advantages: On the PROMISE Reclass dataset, SLMs outperform LLMs in recall (0.96 vs. a maximum of 0.90 for LLMs; average recall 0.91 vs. 0.87)
  3. Dataset Dominance: Dataset characteristics have far greater impact on performance than model size (effect size 0.63 vs 0.04)
  4. Performance Hierarchy: SecReq (median F1=0.865) > PROMISE (0.805) > PROMISE Reclass (0.730)
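
The 2% figure in the first finding can be checked against the group averages in the Main Results table. The sketch below assumes the gap is the difference between unweighted means of the per-dataset F1 averages; the summary does not spell out how the authors aggregated, so this is only a plausibility check.

```python
# Average F1 per dataset, copied from the Main Results table.
slm_f1 = {"PROMISE": 0.82, "PROMISE Reclass": 0.73, "SecReq": 0.86}
llm_f1 = {"PROMISE": 0.83, "PROMISE Reclass": 0.75, "SecReq": 0.88}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


slm_mean = mean(slm_f1.values())
llm_mean = mean(llm_f1.values())
gap = llm_mean - slm_mean

print(f"SLM mean F1: {slm_mean:.3f}")  # 0.803
print(f"LLM mean F1: {llm_mean:.3f}")  # 0.820
print(f"Gap: {gap:.3f}")               # 0.017, i.e. about 2 points
```

Under this assumption the gap is roughly 0.017, which rounds to the reported 2%.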

Execution Time Analysis

  • LLMs: 138-300 seconds (cloud-based high-performance infrastructure)
  • SLMs: Average 400 seconds (single local server)

Related Work

NLP in Requirements Engineering

Traditional approaches primarily use classical machine learning techniques for requirements classification, with deep learning methods gradually emerging in recent years.

Large Language Models in RE

LLMs demonstrate strong capabilities in RE tasks including requirements classification, traceability, and model generation, though practical deployment faces challenges.

Small Language Model Research

SLMs receive attention as lightweight alternatives, but systematic research in the RE domain remains limited.

Conclusions and Discussion

Main Conclusions

Addressing the Research Question: LLMs slightly outperform SLMs with a 2% F1 score advantage, but this difference is not statistically significant. In recall on the PROMISE Reclass dataset, SLMs even surpass LLMs.

Practical Implications

  1. Cost-Effectiveness: SLMs provide comparable performance to LLMs at lower cost
  2. Data Privacy: SLMs enable local deployment, avoiding data leakage risks
  3. Resource Efficiency: SLMs require significantly reduced computational resources
  4. Customization: Open-source SLMs are more amenable to fine-tuning for specific requirements

Limitations

  1. Sample Size: Only 8 models evaluated, which limits statistical power and raises the risk of Type II errors (failing to detect a real difference)
  2. Task Scope: Only binary classification tasks considered; results may not generalize to other RE tasks
  3. Prompt Dependency: Single prompting strategy employed; results may lack generalizability
  4. Data Leakage Risk: LLMs may have encountered evaluation datasets during pretraining

In-Depth Evaluation

Strengths

  1. Significant Research Value: Fills the gap in SLM-LLM comparisons in the RE domain
  2. Methodologically Rigorous: Employs appropriate statistical testing methods to validate conclusions
  3. Well-Designed Experiments: Multi-dataset validation enhances result credibility
  4. High Practical Value: Provides empirical guidance for industry model selection
  5. Good Transparency: Provides complete reproducibility package

Weaknesses

  1. Limited Model Selection: SLMs restricted to 7-8B parameter range; larger open-source models not included
  2. Single Task Type: Only classification tasks evaluated; generative RE tasks not covered
  3. Insufficient Statistical Power: Small sample size may result in low statistical test power
  4. Lacking Cost Analysis: No detailed computational cost and energy consumption comparison provided

Impact

Academic Impact:

  • Provides important reference for model selection in RE
  • Inspires deeper consideration of the relationship between model size and performance

Practical Value:

  • Provides basis for enterprises to balance privacy, cost, and performance
  • Promotes adoption of localized AI solutions in RE

Applicable Scenarios

  1. Privacy-Sensitive Environments: Industries with stringent data privacy requirements (finance, healthcare)
  2. Resource-Constrained Settings: Small-to-medium enterprises or environments with limited computational resources
  3. Offline Deployment Requirements: Scenarios requiring operation in network-disconnected environments
  4. Cost Control: Applications sensitive to API call costs

Future Research Directions

Directions Proposed by Authors

  1. Interpretability: Develop models capable of generating classification explanations to enhance decision transparency
  2. Multi-Task Evaluation: Extend to other RE tasks such as requirements traceability and model generation
  3. Hybrid Pipelines: Design RE workflows where SLMs and LLMs work synergistically
  4. Energy Consumption Research: Quantify environmental impact of different models
  5. Tool Support: Develop practical tools supporting flexible model selection

Suggested Extended Research

  1. Larger-Scale Studies: Include more models and larger datasets
  2. Fine-Grained Analysis: Investigate classification difficulty variations across different requirement types
  3. Domain Adaptation: Evaluate model generalization across different application domains
  4. Human-AI Collaboration: Study collaboration patterns between human experts and AI models

References

The paper cites 17 relevant references covering important works in requirements engineering, natural language processing, and language models, providing a solid theoretical foundation for the research.


Overall Assessment: This is a high-quality empirical research paper providing valuable insights on an important and practical question. Despite certain limitations, its findings hold significant importance for both academia and industry, particularly regarding current AI model selection and deployment strategy formulation.