
Can Large Language Models Improve SE Active Learning via Warm-Starts?

Senthilkumar, Menzies
When SE data is scarce, "active learners" use models learned from tiny samples of the data to find the next most informative example to label. In this way, effective models can be generated using very little data. For multi-objective software engineering (SE) tasks, active learning can benefit from an effective set of initial guesses (also known as "warm starts"). This paper explores the use of Large Language Models (LLMs) for creating warm-starts. Those results are compared against Gaussian Process Models and Tree of Parzen Estimators. For 49 SE tasks, LLM-generated warm starts significantly improved the performance of low- and medium-dimensional tasks. However, LLM effectiveness diminishes in high-dimensional problems, where Bayesian methods like Gaussian Process Models perform best.

Basic Information

  • Paper ID: 2501.00125
  • Title: Can Large Language Models Improve SE Active Learning via Warm-Starts?
  • Authors: Lohith Senthilkumar, Tim Menzies (NC State University)
  • Category: cs.SE (Software Engineering)
  • Publication Date: December 30, 2024 (arXiv preprint)
  • Paper Link: https://arxiv.org/abs/2501.00125

Abstract

When software engineering (SE) data is scarce, "active learners" use models learned from small data samples to identify the next most informative example to label. In this way, effective models can be built from very little data. For multi-objective SE tasks, active learning can benefit from an effective set of initial guesses, also known as "warm-starts." This paper explores using large language models (LLMs) to create warm-starts and compares the results with Gaussian Process Models and Tree of Parzen Estimators. Across 49 SE tasks, LLM-generated warm-starts significantly improve performance on low- and medium-dimensional tasks. However, LLM effectiveness diminishes on high-dimensional problems, where Bayesian methods such as Gaussian Process Models perform best.

Research Background and Motivation

Problem Definition

Software engineering involves numerous multi-objective optimization problems requiring trade-offs between competing constraints, such as:

  • How to deliver more code at lower cost?
  • How to answer database queries faster while consuming less energy?

Core Challenges

  1. Data Scarcity: Three categories of data collection problems exist in SE:
    • Naive or Erroneous Data Collection: For example, false-positive rates above 90% in defect-prediction annotations
    • Data Collection Specificity: Independent variables x are easily obtainable, but dependent variable y annotation is costly
    • Slow Expert Annotation: Subject matter experts (SMEs) can only annotate 10-20 high-quality samples per hour
  2. Limitations of Existing Methods:
    • Traditional optimization algorithms require large amounts of annotated data
    • Random sampling is inefficient
    • Lack of effective initialization strategies

Research Motivation

This paper proposes using LLMs' background knowledge to generate better initial guesses (warm-starts) to improve active learning performance on SE multi-objective optimization tasks.

Core Contributions

  1. Proposes a novel method leveraging LLMs to warm-start active learning for SE optimization tasks
  2. Conducts empirical comparison of LLM methods with alternative approaches on 49 datasets
  3. Reveals advantages and limitations of LLMs in solving multi-objective SE problems
  4. Provides reproducible data and script packages for benchmarking active learning strategies

Methodology Details

Task Definition

Given tabular data where:

  • x columns: Independent input variables (observable/controllable)
  • y columns: Dependent variables (requiring expensive annotation processes)
  • Objective: Find optimal y values under limited annotation budget (≤30 samples)
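
The task above can be pictured as a budgeted labeling loop. The following is a minimal sketch under stated assumptions, not the paper's implementation: `active_learn`, `label`, `acquire`, and the toy data are hypothetical names and values introduced for illustration, and lower y is assumed to be better.

```python
def active_learn(rows, label, acquire, warm_start, budget=30):
    """Label rows one at a time until the annotation budget is spent.

    rows: list of x-vectors; label: expensive function x -> y (lower is better);
    acquire: score(row, labeled, rows), higher means "label this next";
    warm_start: indices of the rows labeled before the loop begins.
    """
    labeled = {i: label(rows[i]) for i in warm_start}
    while len(labeled) < budget and len(labeled) < len(rows):
        todo = [i for i in range(len(rows)) if i not in labeled]
        pick = max(todo, key=lambda i: acquire(rows[i], labeled, rows))
        labeled[pick] = label(rows[pick])
    best = min(labeled, key=labeled.get)  # index of the best row seen so far
    return best, labeled

# Toy problem (hypothetical): one feature, optimum at x = 70.
rows = [[x] for x in range(100)]
label = lambda row: abs(row[0] - 70) / 100

def stay_near_best(row, labeled, rows):
    """A purely exploitative score: prefer rows close to the current best."""
    best_i = min(labeled, key=labeled.get)
    return -abs(row[0] - rows[best_i][0])

best, labeled = active_learn(rows, label, stay_near_best, warm_start=[0, 50], budget=10)
print(best, labeled[best], len(labeled))
```

The quality of `warm_start` determines where such a loop begins its search, which is the part the paper proposes to improve with an LLM.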

Core Method Architecture

1. LLM Warm-Start Workflow

E0 (Initial random annotation) → Ranking (best to worst) → LLM few-shot learning → 
Generate E1 (synthetic samples) → Nearest neighbor mapping to E2 → Warm-start active learning
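
The nearest-neighbor step in this workflow (E1 to E2) could look roughly like the sketch below. It assumes numeric, already-normalized x values and Euclidean distance; this summary does not state which distance measure the paper uses, and `nearest_rows` is a hypothetical helper name.

```python
import math

def nearest_rows(synthetic, pool):
    """Map each synthetic x-vector to the index of its closest real row.

    A row is used at most once, so the warm-start set has no duplicates.
    """
    chosen = []
    for s in synthetic:
        candidates = [i for i in range(len(pool)) if i not in chosen]
        closest = min(candidates, key=lambda i: math.dist(s, pool[i]))
        chosen.append(closest)
    return chosen

# Hypothetical data: four real rows and two LLM-proposed samples.
pool = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [9.0, 9.0]]
synthetic = [[0.9, 1.2], [8.0, 8.5]]
print(nearest_rows(synthetic, pool))  # [1, 3]
```

Mapping back to real rows means only examples that actually exist in the dataset are ever sent for labeling.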

2. Active Learning Framework

Gaussian Process Model (GPM):

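  • Fits a distribution over many candidate functions, yielding a predicted mean μ and standard deviation σ for each unlabeled example
  • Uses an acquisition function to choose the next example to label
  • Supports three acquisition functions: upper confidence bound (UCB), probability of improvement (PI), and expected improvement (EI)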
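
For reference, the textbook forms of these three acquisition functions can be written from μ and σ alone. This sketch uses the maximization convention (negate the objective when minimizing); the parameter names `kappa` and `xi` and their defaults are assumptions, not values taken from the paper.

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

def norm_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def ucb(mu, sigma, kappa=2.0):
    """Upper confidence bound: optimistic estimate of the value."""
    return mu + kappa * sigma

def prob_improvement(mu, sigma, best, xi=0.0):
    """Probability that the value exceeds the best seen so far."""
    if sigma == 0:
        return 0.0
    return norm_cdf((mu - best - xi) / sigma)

def expected_improvement(mu, sigma, best, xi=0.0):
    """Expected amount by which the value exceeds the best seen so far."""
    if sigma == 0:
        return 0.0
    z = (mu - best - xi) / sigma
    return (mu - best - xi) * norm_cdf(z) + sigma * norm_pdf(z)

print(ucb(1.0, 0.5), prob_improvement(0.0, 1.0, 0.0))  # 2.0 0.5
```

UCB rewards uncertainty directly, while PI and EI weigh the chance and size of beating the current best.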

Tree of Parzen Estimators (TPE):

  • Partitions observed data into "best" and "rest" distributions
  • Models p(x|y) rather than p(y|x)
  • Supports explore and exploit acquisition strategies
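
The best/rest idea can be illustrated with a one-dimensional sketch: estimate a density over the x values of the "best" examples and another over the "rest", then prefer candidates where the ratio is high. This is the generic TPE scoring idea under assumed settings (`gamma`, `bandwidth`, Gaussian kernels, lower y is better); it is not the paper's explore or exploit variant, which this summary does not spell out.

```python
import math

def parzen_density(x, samples, bandwidth=1.0):
    """Average of Gaussian kernels centred on each sample (1-D Parzen window)."""
    total = sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
    return total / (len(samples) * bandwidth * math.sqrt(2 * math.pi))

def tpe_score(x, labeled, gamma=0.25, bandwidth=1.0):
    """Ratio p(x | best) / p(x | rest); labeled is a list of (x, y) pairs."""
    if len(labeled) < 2:
        raise ValueError("need at least two labeled examples")
    ranked = sorted(labeled, key=lambda pair: pair[1])  # lower y is better
    n_best = max(1, min(len(ranked) - 1, int(gamma * len(ranked))))
    best = [xv for xv, _ in ranked[:n_best]]
    rest = [xv for xv, _ in ranked[n_best:]]
    # Small constant avoids division by zero far from every "rest" sample.
    return parzen_density(x, best, bandwidth) / (parzen_density(x, rest, bandwidth) + 1e-12)

# Hypothetical labeled data: low x values have the lowest (best) y.
labeled = [(0.0, 0.10), (0.5, 0.15), (1.0, 0.20), (7.0, 0.70),
           (8.0, 0.90), (9.0, 0.80), (9.5, 0.85), (10.0, 0.95)]
print(tpe_score(0.2, labeled) > tpe_score(9.0, labeled))  # True
```

Because only two simple densities are fitted, this style of model is cheap to update after each new label.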

3. LLM Prompt Engineering

Uses Gemini 1.5 Pro with prompt templates containing:

  • System Message: Defines LLM role and dataset metadata
  • Few-Shot Examples: Randomly sampled examples labeled as "best"/"rest"
  • Task Description: Requests generation of 2 better and 2 worse samples
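
A prompt with these three parts might be assembled as in the sketch below. The wording, the function name `build_warm_start_prompt`, and the example feature names are all hypothetical; the paper's actual template is not reproduced in this summary.

```python
def build_warm_start_prompt(x_names, best_rows, rest_rows, n_each=2):
    """Assemble a few-shot prompt as one string (illustrative wording only)."""
    system = (
        "You are assisting with a software engineering optimization task. "
        f"Each example has these features: {', '.join(x_names)}."
    )
    shots = [f"best: {dict(zip(x_names, row))}" for row in best_rows]
    shots += [f"rest: {dict(zip(x_names, row))}" for row in rest_rows]
    task = (
        f"Generate {n_each} new examples that are likely better and "
        f"{n_each} new examples that are likely worse than those shown."
    )
    return "\n\n".join([system, "\n".join(shots), task])

# Hypothetical feature names and rows.
prompt = build_warm_start_prompt(
    ["threads", "cache_mb"],
    best_rows=[[8, 512]],
    rest_rows=[[1, 64], [2, 128]],
)
print(prompt)
```

Passing real attribute names matters here, since the summary notes that they are how the LLM's background knowledge is activated.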

Technical Innovations

  1. Multi-Dimensional Geometric Analysis Capability: LLMs can perform PCA-like multi-dimensional analysis, identifying critical dimensions and extrapolating
  2. Background Knowledge Utilization: Activates LLM's relevant domain knowledge through attribute names
  3. Nearest Neighbor Mapping Strategy: Maps LLM-generated synthetic samples to real data space

Experimental Setup

Datasets

Uses 49 SE optimization tasks from the MOOT (Multi Objective Optimization Testing) repository:

  • Scale: 93 to 86,000 rows
  • Dimensions: 3 to 38 independent variables, 1 to 5 dependent variables
  • Classification:
    • Low-dimensional (<6 features): 12 datasets
    • Medium-dimensional (6-11 features): 14 datasets
    • High-dimensional (>11 features): 19 datasets

Evaluation Metrics

Uses Chebyshev distance to evaluate multi-objective optimization performance:

d_Chebyshev(y, o) = max_{i=1,...,n} |y_i - o_i|

where y_i is the value achieved on objective i and o_i is the ideal value for that objective; smaller Chebyshev distances indicate better performance.
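
As a small worked example, the distance is the largest per-objective gap to the ideal point. The sketch assumes each objective has already been normalized to the range 0..1, with an ideal of 0 for objectives to minimize and 1 for objectives to maximize; the normalization step and the sample values are assumptions for illustration.

```python
def chebyshev(y, ideal):
    """Largest absolute gap between achieved and ideal objective values."""
    return max(abs(yi - oi) for yi, oi in zip(y, ideal))

# Two objectives: the first is minimized (ideal 0), the second maximized (ideal 1).
ideal = [0.0, 1.0]
print(chebyshev([0.2, 0.9], ideal))  # 0.2
print(chebyshev([0.0, 1.0], ideal))  # 0.0
```

Because only the worst objective counts, a solution cannot score well by excelling on one goal while neglecting another.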

Comparison Methods

  • GPM Methods: UCB_GPM, PI_GPM, EI_GPM
  • TPE Methods: explore, exploit
  • Baseline: Random sampling
  • Warm-Start Strategies: LLM vs. random initialization

Implementation Details

  • Warm-start sample count: B0 = 4
  • Total evaluation budget: B1 ∈ {10, 15, 20, 25, 30}
  • Repetitions: 20 times (statistical validity)
  • Statistical Methods: Scott-Knott ranking + Cliff's Delta effect size
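
Cliff's Delta, the effect-size measure named above, compares two samples pair by pair. A minimal sketch of the standard definition follows; how the paper combines it with Scott-Knott ranking is not detailed in this summary, and the sample values are made up.

```python
def cliffs_delta(a, b):
    """(#pairs with a > b minus #pairs with a < b) / total pairs, in [-1, 1]."""
    greater = sum(1 for x in a for y in b if x > y)
    less = sum(1 for x in a for y in b if x < y)
    return (greater - less) / (len(a) * len(b))

# Hypothetical Chebyshev distances from repeated runs of two methods.
method_a = [0.07, 0.08, 0.08]
method_b = [0.18, 0.17, 0.19]
print(cliffs_delta(method_a, method_b))  # -1.0
```

A value near 0 means the two samples overlap heavily, so a difference in medians should not be treated as meaningful.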

Experimental Results

Main Results

RQ1: Is Active Learning Useful for SE Tasks?

  • Conclusion: Active learning outperforms random methods
  • Evidence: Most optimization gains are achieved within 30 annotations; pure random methods never achieve top ranking in any dimensional category

RQ2: Are Warm-Starts Useful for Active Learning?

  • Low-Dimensional Data: LLM/exploit ranks top on 100% of datasets vs. 27% for random/exploit
  • Medium-Dimensional Data: LLM/exploit ranks top on 50% of datasets vs. 21% for random/exploit

RQ3: Are LLMs the Best Method for Generating Warm-Starts?

Ranking Frequency by Dimensionality:

Method            Low-Dim (rank 0)   Medium-Dim (rank 0)   High-Dim (rank 0)
LLM Exploit       100%               50%                   33%
random UCB_GPM    45%                36%                   50%
random EI_GPM     45%                36%                   44%
random PI_GPM     9%                 36%                   39%

Key Findings

  1. Dimensionality Effect: LLMs excel on low- and medium-dimensional problems but show diminishing effectiveness on high-dimensional problems
  2. Acquisition Function Sensitivity: LLM pairs best with exploit strategy but performs poorly with explore strategy
  3. Computational Efficiency: TPE methods run significantly faster than GPM or LLM methods

Case Study

Using the SS-A dataset as an example, LLM/exploit achieves top ranking (rank 0) across different budgets, with median Chebyshev distance of 0.07-0.08, significantly outperforming the baseline of 0.18.

Literature Review Findings

Analysis of 1000 related papers on Google Scholar reveals limitations in existing research:

  • Most studies use <6 test sets
  • Focus primarily on single-objective tasks
  • Rarely use background knowledge for warm-starts
  • Annotation budgets typically >1000 samples

Paper Positioning

This paper addresses a research gap: multi-objective SE optimization on tabular data under small annotation budgets.

Conclusions and Discussion

Main Conclusions

  1. LLM Warm-Starts Are Effective: Significantly improve active learning performance on low- and medium-dimensional SE tasks
  2. Dimensionality Limitations: LLMs face challenges on high-dimensional problems; Bayesian methods remain superior
  3. Practical Value: Reduces dependence on large amounts of annotated data

Limitations

  1. High-Dimensional Performance Degradation: Possibly due to lack of complex problem solutions in training data
  2. Model Dependency: Only uses Gemini 1.5 Pro; lacks comparison with other LLMs
  3. Domain Specificity: Primarily targets SE optimization tasks; generalization capability remains to be verified

Future Directions

  1. Dimensionality Extension: Explore dimensionality reduction techniques to mitigate high-dimensional problems
  2. Hybrid Methods: Combine strengths of LLM and Bayesian approaches
  3. Cost Efficiency: Study trade-offs between computational cost and performance

In-Depth Evaluation

Strengths

  1. Large Experimental Scale: Evaluation across 49 datasets is rare in this field
  2. Novel Methodology: First systematic exploration of LLMs in SE active learning
  3. Statistical Rigor: Employs strict statistical methods such as Scott-Knott ranking
  4. Strong Reproducibility: Provides complete code and data

Weaknesses

  1. Insufficient Theoretical Analysis: Lacks theoretical explanation for why LLMs are effective on low-dimensional problems
  2. Single LLM Selection: Tests only one LLM; lacks inter-model comparison
  3. Simple Prompt Engineering: Better prompt strategies may exist

Impact

  1. Academic Value: Provides new insights for the intersection of SE optimization and active learning
  2. Practical Value: Direct application potential in data-scarce SE scenarios
  3. Methodological Contribution: Demonstrates novel applications of LLMs in traditional machine learning tasks

Applicable Scenarios

  • Software configuration optimization
  • Cloud service parameter tuning
  • Software process modeling
  • Trade-off decision-making in requirements engineering

References

The paper cites 87 related references covering multiple domains including active learning, multi-objective optimization, software engineering, and large language models, providing a solid theoretical foundation for the research.


Summary: This is an innovative research contribution in the software engineering optimization domain, systematically exploring LLM applications in active learning warm-starts for the first time. Despite certain limitations, its large-scale experimental validation and practical value make it an important contribution to the field.