Can Large Language Models Improve SE Active Learning via Warm-Starts?

Senthilkumar, Menzies

When SE data is scarce, "active learners" use models learned from tiny samples of the data to find the next most informative example to label. In this way, effective models can be generated using very little data. For multi-objective software engineering (SE) tasks, active learning can benefit from an effective set of initial guesses (also known as "warm starts"). This paper explores the use of Large Language Models (LLMs) for creating warm-starts. Those results are compared against Gaussian Process Models and Tree of Parzen Estimators. For 49 SE tasks, LLM-generated warm starts significantly improved the performance of low- and medium-dimensional tasks. However, LLM effectiveness diminishes in high-dimensional problems, where Bayesian methods like Gaussian Process Models perform best.

Basic Information

Paper ID: 2501.00125
Title: Can Large Language Models Improve SE Active Learning via Warm-Starts?
Authors: Lohith Senthilkumar, Tim Menzies (NC State University)
Category: cs.SE (Software Engineering)
Publication Date: December 30, 2024 (arXiv preprint)
Paper Link: https://arxiv.org/abs/2501.00125

Abstract

When software engineering (SE) data is scarce, "active learners" use models learned from small data samples to identify the next most informative example to label. In this way, effective models can be built from very little data. For multi-objective SE tasks, active learning can benefit from an effective set of initial guesses, also known as "warm-starts." This paper explores using large language models (LLMs) to create warm-starts and compares the results against Gaussian Process Models and Tree of Parzen Estimators. Across 49 SE tasks, LLM-generated warm-starts significantly improve performance on low- and medium-dimensional tasks. However, LLM effectiveness diminishes on high-dimensional problems, where Bayesian methods such as Gaussian Process Models perform best.

Research Background and Motivation

Problem Definition

Software engineering involves numerous multi-objective optimization problems requiring trade-offs between competing constraints, such as:

How to deliver more code at lower cost?
How to answer database queries faster while consuming less energy?

Core Challenges

Data Scarcity: Three categories of data collection problems exist in SE:
- Naive or Erroneous Data Collection: For example, annotation errors with more than 90% false positives in defect prediction
- Data Collection Specificity: Independent variables x are easily obtainable, but dependent variable y annotation is costly
- Slow Expert Annotation: Subject matter experts (SMEs) can only annotate 10-20 high-quality samples per hour
Limitations of Existing Methods:
- Traditional optimization algorithms require large amounts of annotated data
- Random sampling is inefficient
- Lack of effective initialization strategies

Research Motivation

This paper proposes using LLMs' background knowledge to generate better initial guesses (warm-starts) to improve active learning performance on SE multi-objective optimization tasks.

Core Contributions

Proposes a novel method leveraging LLMs to warm-start active learning for SE optimization tasks
Conducts empirical comparison of LLM methods with alternative approaches on 49 datasets
Reveals advantages and limitations of LLMs in solving multi-objective SE problems
Provides reproducible data and script packages for benchmarking active learning strategies

Methodology Details

Task Definition

Given tabular data where:

x columns: Independent input variables (observable/controllable)
y columns: Dependent variables (requiring expensive annotation processes)
Objective: Find optimal y values under limited annotation budget (≤30 samples)

Core Method Architecture

1. LLM Warm-Start Workflow

E0 (Initial random annotation) → Ranking (best to worst) → LLM few-shot learning → 
Generate E1 (synthetic samples) → Nearest neighbor mapping to E2 → Warm-start active learning
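
The nearest-neighbor step in this workflow can be illustrated with a minimal sketch. It assumes that the x columns are numeric and already normalized, and that plain Euclidean distance is used; the function name `nearest_rows` and the toy data are hypothetical and not taken from the paper, which may use a different distance measure.

```python
import math

def nearest_rows(synthetic, pool):
    """Map each synthetic x-vector (E1) to the index of its closest real row (E2)."""
    chosen = []
    for s in synthetic:
        # Rank real rows by Euclidean distance to the synthetic sample.
        order = sorted(range(len(pool)), key=lambda i: math.dist(s, pool[i]))
        # Take the closest row not already chosen, so the warm-start has no duplicates.
        for i in order:
            if i not in chosen:
                chosen.append(i)
                break
    return chosen

# Toy data (assumption): four real rows and two LLM-proposed synthetic rows.
pool = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [9.0, 9.0]]
synthetic = [[0.9, 1.2], [8.0, 8.5]]
warm_start = nearest_rows(synthetic, pool)
```

Mapping back to real rows matters because only rows that exist in the data can actually be labeled; the synthetic samples serve only as pointers into the real data space.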

2. Active Learning Framework

Gaussian Process Model (GPM):

Computes mean μ and standard deviation σ by fitting numerous possible functions
Uses acquisition functions to determine next sampling point
Supports three acquisition functions: UCB, PI, EI
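
For reference, the three acquisition functions can be sketched in their common textbook forms, given a predicted mean and standard deviation at a candidate point. This is an illustrative sketch, not the paper's implementation: it assumes a maximization setting (a minimized metric such as a distance would be negated first), and the parameters `kappa` and `xi` are conventional exploration knobs whose values here are arbitrary.

```python
import math

def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ucb(mu, sigma, kappa=2.0):
    """Upper Confidence Bound: optimistic estimate of the candidate's value."""
    return mu + kappa * sigma

def prob_improvement(mu, sigma, best, xi=0.0):
    """Probability of Improvement over the best value seen so far."""
    if sigma == 0.0:
        return 1.0 if mu > best + xi else 0.0
    return norm_cdf((mu - best - xi) / sigma)

def expected_improvement(mu, sigma, best, xi=0.0):
    """Expected Improvement over the best value seen so far."""
    gain = mu - best - xi
    if sigma == 0.0:
        return max(gain, 0.0)
    z = gain / sigma
    return gain * norm_cdf(z) + sigma * norm_pdf(z)
```

In each case the active learner would score every unlabeled candidate and label the one with the highest score next.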

Tree of Parzen Estimators (TPE):

Partitions observed data into "best" and "rest" distributions
Models p(x|y) rather than p(y|x)
Supports explore and exploit acquisition strategies
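
The best/rest idea can be sketched for a single numeric input. This is a simplified illustration under stated assumptions, not the paper's code: smaller y is taken to be better, densities are Gaussian-kernel (Parzen window) estimates with a fixed bandwidth, and the names `parzen`, `tpe_pick`, and `gamma` (the fraction of observations treated as "best") are hypothetical.

```python
import math

def parzen(x, centers, bandwidth=1.0):
    """Gaussian-kernel (Parzen window) density estimate at x from 1-D centers."""
    if not centers:
        return 0.0
    norm = bandwidth * math.sqrt(2.0 * math.pi) * len(centers)
    return sum(math.exp(-0.5 * ((x - c) / bandwidth) ** 2) for c in centers) / norm

def tpe_pick(observed, candidates, gamma=0.25, bandwidth=1.0):
    """Return the candidate x with the highest p(x|best) / p(x|rest) ratio."""
    ranked = sorted(observed, key=lambda xy: xy[1])   # smaller y first
    n_best = max(1, int(gamma * len(ranked)))
    best = [x for x, _ in ranked[:n_best]]
    rest = [x for x, _ in ranked[n_best:]]
    eps = 1e-12                                       # avoids division by zero
    return max(
        candidates,
        key=lambda x: parzen(x, best, bandwidth) / (parzen(x, rest, bandwidth) + eps),
    )

# Toy (x, y) observations (assumption): low x values have the lowest y.
observed = [(0.0, 0.1), (1.0, 0.2), (5.0, 0.6), (6.0, 0.85),
            (7.0, 0.7), (8.0, 0.9), (9.0, 0.8), (10.0, 0.95)]
next_x = tpe_pick(observed, [0.5, 4.0, 9.5])
```

Because the densities are over x conditioned on the best/rest label, scoring a candidate needs only two cheap density evaluations, which is consistent with TPE being fast.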

3. LLM Prompt Engineering

Uses Gemini 1.5 Pro with prompt templates containing:

System Message: Defines LLM role and dataset metadata
Few-Shot Examples: Randomly sampled examples labeled as "best"/"rest"
Task Description: Requests generation of 2 better and 2 worse samples
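
A prompt with these three parts could be assembled roughly as follows. The wording of the template, the function name `build_prompt`, and the example attribute names are all hypothetical; the paper's actual prompt text is not reproduced here, and no LLM API call is shown.

```python
def build_prompt(meta, best_rows, rest_rows, n_each=2):
    """Assemble a few-shot prompt: system message, labeled examples, task request."""
    lines = [f"System: You are helping optimize a dataset. Metadata: {meta}"]
    # Few-shot examples, labeled by the group they were ranked into.
    for row in best_rows:
        lines.append(f"Example (best): {row}")
    for row in rest_rows:
        lines.append(f"Example (rest): {row}")
    # Task description: ask for new samples on both sides of the ranking.
    lines.append(
        f"Task: Generate {n_each} samples that are better and "
        f"{n_each} samples that are worse than the examples above."
    )
    return "\n".join(lines)

# Hypothetical attribute names and values, for illustration only.
prompt = build_prompt(
    meta={"columns": ["threads", "cache_mb"], "goals": ["minimize latency"]},
    best_rows=[{"threads": 8, "cache_mb": 256}],
    rest_rows=[{"threads": 1, "cache_mb": 16}],
)
```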

Technical Innovations

Multi-Dimensional Geometric Analysis Capability: LLMs can perform PCA-like multi-dimensional analysis, identifying critical dimensions and extrapolating
Background Knowledge Utilization: Activates LLM's relevant domain knowledge through attribute names
Nearest Neighbor Mapping Strategy: Maps LLM-generated synthetic samples to real data space

Experimental Setup

Datasets

Uses 49 SE optimization tasks from the MOOT (Multi Objective Optimization Testing) repository:

Scale: 93 to 86,000 rows
Dimensions: 3 to 38 independent variables, 1 to 5 dependent variables
Classification:
- Low-dimensional (<6 features): 12 datasets
- Medium-dimensional (6-11 features): 14 datasets
- High-dimensional (>11 features): 19 datasets

Evaluation Metrics

Uses Chebyshev distance to evaluate multi-objective optimization performance:

d_Chebyshev(y, o) = max_{i=1,...,n} |y_i - o_i|

where y_i is the value of the i-th objective and o_i is its ideal value; smaller Chebyshev distances indicate better performance.
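
As a small worked example, the distance can be computed as below. The sketch assumes each objective has already been normalized to a common scale (for example 0 to 1) so that no single objective dominates the maximum; the function name and the sample values are illustrative.

```python
def chebyshev(y, ideal):
    """Largest per-objective gap between a row's y values and the ideal point."""
    if len(y) != len(ideal):
        raise ValueError("y and ideal must have the same number of objectives")
    return max(abs(a - b) for a, b in zip(y, ideal))

# Two normalized objectives: the first is minimized (ideal 0.0),
# the second is maximized (ideal 1.0).
distance = chebyshev([0.2, 0.9], [0.0, 1.0])
```

Here the gaps are 0.2 and roughly 0.1, so the row is scored by its worst objective, 0.2.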

Comparison Methods

GPM Methods: UCB_GPM, PI_GPM, EI_GPM
TPE Methods: explore, exploit
Baseline: Random sampling
Warm-Start Strategies: LLM vs. random initialization

Implementation Details

Warm-start sample count: B0 = 4
Total evaluation budget: B1 ∈ {10, 15, 20, 25, 30}
Repetitions: 20 times (statistical validity)
Statistical Methods: Scott-Knott ranking + Cliff's Delta effect size
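
For readers unfamiliar with the effect-size test, Cliff's Delta compares two samples pairwise and reports how often values in one exceed values in the other. The sketch below shows the standard definition only; the thresholds for calling an effect small, medium, or large, and how the paper combines this with Scott-Knott ranking, are not shown.

```python
def cliffs_delta(xs, ys):
    """Cliff's Delta: (#pairs with x > y minus #pairs with x < y) / (all pairs)."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))

# Toy distances from two treatments over repeated runs (made-up numbers).
delta = cliffs_delta([0.07, 0.08, 0.07], [0.18, 0.17, 0.19])
```

The result ranges from -1 (every value in the first sample is smaller) to +1 (every value is larger), with 0 meaning the samples overlap completely.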

Experimental Results

Main Results

RQ1: Is Active Learning Useful for SE Tasks?

Conclusion: Active learning outperforms random methods
Evidence: Most optimization gains are achieved within 30 annotations; pure random methods never achieve top ranking in any dimensional category

RQ2: Are Warm-Starts Useful for Active Learning?

Low-Dimensional Data: LLM/Exploit ranks top on 100% of datasets vs. 27% for random/Exploit
Medium-Dimensional Data: LLM/Exploit ranks top on 50% of datasets vs. 21% for random/Exploit

RQ3: Are LLMs the Best Method for Generating Warm-Starts?

Ranking Frequency by Dimensionality:

Method	Low-Dim (rank 0)	Medium-Dim (rank 0)	High-Dim (rank 0)
LLM Exploit	100%	50%	33%
random UCB_GPM	45%	36%	50%
random EI_GPM	45%	36%	44%
random PI_GPM	9%	36%	39%

Key Findings

Dimensionality Effect: LLMs excel on low- and medium-dimensional problems but show diminishing effectiveness on high-dimensional problems
Acquisition Function Sensitivity: LLM pairs best with exploit strategy but performs poorly with explore strategy
Computational Efficiency: TPE methods run significantly faster than GPM or LLM methods

Case Study

Using the SS-A dataset as an example, LLM/exploit achieves top ranking (rank 0) across different budgets, with median Chebyshev distance of 0.07-0.08, significantly outperforming the baseline of 0.18.

Literature Review Findings

Analysis of 1000 related papers on Google Scholar reveals limitations in existing research:

Most studies use <6 test sets
Focus primarily on single-objective tasks
Rarely use background knowledge for warm-starts
Annotation budgets typically >1000 samples

Paper Positioning

This paper fills a research gap: SE optimization that is multi-objective, works on tabular data, and operates under a small annotation budget.

Conclusions and Discussion

Main Conclusions

LLM Warm-Starts Are Effective: Significantly improve active learning performance on low- and medium-dimensional SE tasks
Dimensionality Limitations: LLMs face challenges on high-dimensional problems; Bayesian methods remain superior
Practical Value: Reduces dependence on large amounts of annotated data

Limitations

High-Dimensional Performance Degradation: Possibly because the LLM's training data contains few solutions to complex problems
Model Dependency: Only uses Gemini 1.5 Pro; lacks comparison with other LLMs
Domain Specificity: Primarily targets SE optimization tasks; generalization capability remains to be verified

Future Directions

Dimensionality Extension: Explore dimensionality reduction techniques to mitigate high-dimensional problems
Hybrid Methods: Combine strengths of LLM and Bayesian approaches
Cost Efficiency: Study trade-offs between computational cost and performance

In-Depth Evaluation

Strengths

Large Experimental Scale: Evaluation across 49 datasets is rare in this field
Novel Methodology: First systematic exploration of LLMs in SE active learning
Statistical Rigor: Employs strict statistical methods such as Scott-Knott ranking
Strong Reproducibility: Provides complete code and data

Weaknesses

Insufficient Theoretical Analysis: Lacks theoretical explanation for why LLMs are effective on low-dimensional problems
Single LLM Selection: Tests only one LLM; lacks inter-model comparison
Simple Prompt Engineering: More effective prompt strategies may exist

Impact

Academic Value: Provides new insights for the intersection of SE optimization and active learning
Practical Value: Direct application potential in data-scarce SE scenarios
Methodological Contribution: Demonstrates novel applications of LLMs in traditional machine learning tasks

Applicable Scenarios

Software configuration optimization
Cloud service parameter tuning
Software process modeling
Trade-off decision-making in requirements engineering

References

The paper cites 87 related references covering multiple domains including active learning, multi-objective optimization, software engineering, and large language models, providing a solid theoretical foundation for the research.

Summary: This is an innovative research contribution in the software engineering optimization domain, systematically exploring LLM applications in active learning warm-starts for the first time. Despite certain limitations, its large-scale experimental validation and practical value make it an important contribution to the field.