Prompt engineering and its implications on the energy consumption of Large Language Models

Rubei, Moussaid, di Sipio et al.
Reducing the environmental impact of AI-based software systems has become critical. The intensive use of large language models (LLMs) in software engineering poses severe challenges regarding computational resources, data centers, and carbon emissions. In this paper, we investigate how prompt engineering techniques (PETs) can impact the carbon emission of the Llama 3 model for the code generation task. We experimented with the CodeXGLUE benchmark to evaluate both energy consumption and the accuracy of the generated code using an isolated testing environment. Our initial results show that the energy consumption of LLMs can be reduced by using specific tags that distinguish different prompt parts. Even though a more in-depth evaluation is needed to confirm our findings, this work suggests that prompt engineering can reduce LLMs' energy consumption during the inference phase without compromising performance, paving the way for further investigations.

Basic Information

  • Paper ID: 2501.05899
  • Title: Prompt Engineering and Its Implications on the Energy Consumption of Large Language Models
  • Authors: Riccardo Rubei, Aicha Moussaid, Claudio Di Sipio, Davide Di Ruscio (University of L'Aquila)
  • Classification: cs.SE (Software Engineering)
  • Publication Date: January 10, 2025
  • Paper Link: https://arxiv.org/abs/2501.05899

Abstract

As the environmental impact of AI systems receives increasing attention, the intensive use of Large Language Models (LLMs) in software engineering presents significant challenges in computational resources, data center operations, and carbon emissions. This paper investigates how prompt engineering techniques (PETs) affect carbon emissions of the Llama 3 model in code generation tasks. The study employs the CodeXGLUE benchmark to evaluate energy consumption and accuracy of generated code in an isolated testing environment. Preliminary results demonstrate that using specific tags to distinguish different prompt sections can reduce LLM energy consumption. Although deeper evaluation is needed to confirm the findings, this work indicates that prompt engineering can reduce energy consumption during the LLM inference phase without compromising performance.

Research Background and Motivation

Problem Definition

The core research question addressed is: How can prompt engineering techniques reduce energy consumption of Large Language Models during the inference phase while maintaining performance in code generation tasks?

Importance Analysis

  1. Environmental Impact: LLM training and inference consume substantial computational resources and leave a significant carbon footprint. For instance, the carbon emissions attributed to certain models are comparable to the lifetime emissions of five automobiles.
  2. Resource Challenges: LLMs require high-performance computing clusters, with training processes potentially lasting weeks or months.
  3. Measurement Difficulties: Measuring energy consumption in HPC environments is particularly challenging due to parallel tasks and non-exclusive cluster usage.
  4. Lack of Standards: Even well-maintained LLM leaderboards fail to report energy consumption, focusing solely on accuracy metrics.

Limitations of Existing Approaches

  1. Existing research primarily focuses on hardware-level impact measurement, lacking systematic studies on the energy-saving effects of prompt engineering techniques.
  2. Absence of standardized carbon emission measurement guidelines and information.
  3. LLM energy consumption is hard to evaluate because of the high variability of code generation output.

Research Motivation

Motivated by the goals of Green Software Engineering (GSE), this paper investigates prompt engineering techniques as a way to reduce LLM energy consumption during the inference phase, offering a new option for sustainable AI system development.

Core Contributions

  1. First Systematic Investigation: Examines how multiple prompt engineering techniques and custom tags affect LLM energy consumption during code completion tasks.
  2. Trade-off Analysis: Investigates the trade-offs between carbon emissions, execution time, and generated code accuracy, exploring the balance between energy efficiency and model accuracy.
  3. Experimental Findings: Shows that custom tags can substantially reduce energy consumption. The reported figures are 99% for one-shot and 83% for few-shots; measured against the baseline kWh values, the drops are roughly 50% and 45%.
  4. Open-Source Contribution: Provides a complete reproducibility package to facilitate further research in this field.

Methodology Details

Task Definition

Task: Code Completion

  • Input: Incomplete Java code snippets
  • Output: A single line of code completing the snippet
  • Constraint: Minimize energy consumption while maintaining accuracy

Experimental Architecture

The study designed a comprehensive experimental workflow:

  1. Data Source: CodeXGLUE dataset
  2. Prompt Creator: Converts input into Llama 3-compatible format
  3. Prompt Enhancer: Augments prompts using custom tags
  4. Locally Deployed Llama 3: Executes code completion tasks
  5. Energy Monitoring: Uses CodeCarbon tool to monitor each execution
  6. Result Storage: Saves problems, answers, and measurement results
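
The six steps above can be sketched as a single loop. This is a minimal illustration under stated assumptions, not the authors' code: `run_pipeline`, `RunRecord`, and the callables passed in are hypothetical names, and `read_energy_kwh` is a stand-in for whatever cumulative reading the energy monitor (CodeCarbon in the study) provides.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class RunRecord:
    """One stored result: the problem, the answer, and the measurements."""
    snippet: str
    answer: str
    seconds: float
    energy_kwh: float


def run_pipeline(
    snippets: Iterable[str],
    build_prompt: Callable[[str], str],     # step 2: prompt creator
    enhance: Callable[[str], str],          # step 3: prompt enhancer (custom tags)
    model: Callable[[str], str],            # step 4: locally deployed model (stubbed here)
    read_energy_kwh: Callable[[], float],   # step 5: hypothetical cumulative energy reading
) -> List[RunRecord]:
    records = []
    for snippet in snippets:                # step 1: iterate over the dataset
        prompt = enhance(build_prompt(snippet))
        energy_before = read_energy_kwh()
        start = time.perf_counter()
        answer = model(prompt)              # only inference is timed, not model loading
        seconds = time.perf_counter() - start
        energy = read_energy_kwh() - energy_before
        records.append(RunRecord(snippet, answer, seconds, energy))  # step 6
    return records
```

Passing the model and the meter as callables keeps the loop testable with stubs and makes it explicit that time and energy are measured around the inference call only.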

Prompt Configuration Design

The study defines five distinct prompt configurations:

C0 - Default Configuration:

  • Defines model role, provides incomplete code snippet, no customization
  • Zero-shot without examples, one-shot with one example, few-shots with five examples

C1 - Custom Tags Without Explanation:

{
  "role": "user",
  "content": "<code>package com.lmax.disruptor.support;</code><incomplete>public final</incomplete>"
}

C2 - Custom Tags With Explanation: Embeds custom tag meaning explanations within the prompt

C3 - Custom Prompt in System Role: Places tag explanations in the system role section

C4 - Without System Definition: Omits the system role definition entirely and places the task instructions directly in the user prompt
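
To make the differences between the five configurations concrete, the sketch below builds a chat message list for each one. It rests on assumptions: the three text constants are placeholder wording (the exact role, explanation, and instruction texts are not given here), the way C0 passes the raw snippet is a guess, and C4 is assumed to keep the tagged input. Only the tagged user content for C1 follows the example shown above.

```python
# Hypothetical wording: placeholders, not the texts used in the study.
SYSTEM_ROLE = "You are a Java code completion assistant."
TAG_EXPLANATION = (
    "<code> wraps the existing code and <incomplete> wraps the line to complete."
)
TASK_INSTRUCTION = "Complete the incomplete Java line."


def tagged(context: str, incomplete: str) -> str:
    """Wrap the two prompt parts in the custom tags, as in the C1 example."""
    return f"<code>{context}</code><incomplete>{incomplete}</incomplete>"


def build_messages(config: str, context: str, incomplete: str) -> list:
    """Return the chat messages for one configuration (C0 to C4)."""
    system = {"role": "system", "content": SYSTEM_ROLE}
    body = tagged(context, incomplete)
    if config == "C0":  # default: role defined, plain snippet, no tags (assumed layout)
        return [system, {"role": "user", "content": f"{context}\n{incomplete}"}]
    if config == "C1":  # tags only, no explanation
        return [system, {"role": "user", "content": body}]
    if config == "C2":  # tag explanation inside the user prompt
        return [system, {"role": "user", "content": f"{TAG_EXPLANATION}\n{body}"}]
    if config == "C3":  # tag explanation moved into the system role
        return [
            {"role": "system", "content": f"{SYSTEM_ROLE} {TAG_EXPLANATION}"},
            {"role": "user", "content": body},
        ]
    if config == "C4":  # no system message; instructions go in the user prompt
        return [{"role": "user", "content": f"{TASK_INSTRUCTION}\n{body}"}]
    raise ValueError(f"unknown configuration: {config}")
```

For one-shot and few-shots prompting, example question and answer pairs would be added to these messages; that part is left out because the example format is not described here.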

Technical Innovations

  1. Custom Tag System: Introduces <code> and <incomplete> tags to explicitly distinguish input code from parts requiring completion.
  2. Multi-dimensional Evaluation: Simultaneously considers energy consumption, execution time, and accuracy metrics.
  3. Quantization Technique Integration: Uses 16-bit floating-point numbers instead of default 32-bit, reducing computational cost.
  4. Isolated Testing Environment: Ensures measurement accuracy and reproducibility.

Experimental Setup

Dataset

  • Dataset: CodeXGLUE code completion task
  • Scale: 1,000 randomly selected incomplete Java code snippets
  • Selection Rationale: Specifically designed for LLM code-related tasks, supporting direct comparison with ground truth.

Evaluation Metrics

Energy Efficiency Metrics:

  • Energy Consumption: GPU energy consumption (kWh), calculated by CodeCarbon
  • Execution Time: Inference phase duration (seconds), excluding model loading time

Accuracy Metrics:

  • Edit Distance: Uses Levenshtein Distance to calculate similarity with ground truth
  • Exact Match: Edit distance ≤2 considered as exact match (accounting for random characters in LLM output)
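
A minimal sketch of the two accuracy metrics, assuming a standard dynamic-programming Levenshtein distance and the tolerance of 2 described above. The function names are illustrative, and the study's own implementation may differ, for example in how whitespace is normalised.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions, or substitutions
    needed to turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if char_a == char_b else 1)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def is_exact_match(prediction: str, ground_truth: str, tolerance: int = 2) -> bool:
    """Count a prediction as an exact match when it is within `tolerance` edits."""
    return levenshtein(prediction, ground_truth) <= tolerance
```

With this tolerance, a completion that differs from the ground truth only by a stray trailing character still counts as a match, while `"int x = 1;"` against `"int y = 22;"` (three edits) does not.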

Comparison Methods

  • Baseline Methods: Three standard prompt engineering techniques (zero-shot, one-shot, few-shots)
  • Enhanced Methods: The custom tag configurations (C1 to C4), each compared against the default configuration C0

Implementation Details

  • Model: Llama 3 8B-Instruct (quantized version)
  • Hardware: AMD Ryzen 7 5800X CPU + NVIDIA RTX 4060 Ti (8 GB)
  • Operating System: Xubuntu 23.04
  • Repetitions: Each test repeated 5 times with 10-second intervals
  • Total Execution Time: Over 250 hours
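
The figures above allow a rough budget check. The calculation below is a back-of-the-envelope sketch, not taken from the paper: it assumes every snippet is run under all three techniques and all five configurations, five times each, and it treats "over 250 hours" as exactly 250.

```python
TOTAL_HOURS = 250      # "over 250 hours", used here as a lower bound
SNIPPETS = 1000
TECHNIQUES = 3         # zero-shot, one-shot, few-shots
CONFIGURATIONS = 5     # C0 to C4
REPETITIONS = 5

seconds_per_snippet = TOTAL_HOURS * 3600 / SNIPPETS
# Assumption: each snippet goes through every technique/configuration pair.
runs_per_snippet = TECHNIQUES * CONFIGURATIONS * REPETITIONS
seconds_per_run = seconds_per_snippet / runs_per_snippet

print(seconds_per_snippet)  # 900.0
print(runs_per_snippet)     # 75
print(seconds_per_run)      # 12.0
```

Under these assumptions each run takes about 12 seconds of wall-clock time, which would be consistent with a 10-second pause plus an inference of one to two seconds.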

Experimental Results

Main Results

RQ1: Impact of Custom Tags on Energy Efficiency

Energy consumption results show significant improvements:

  • Zero-shot: Reduced from 0.0000157 kWh to 0.0000146 kWh in C2 configuration (-7%)
  • One-shot: Reduced from 0.0000347 kWh to 0.0000174 kWh in C2 configuration (reported as -99%; about -50% relative to the baseline)
  • Few-shots: Reduced from 0.0000537 kWh to 0.0000293 kWh in C2 configuration (reported as -83%; about -45% relative to the baseline)

Execution time improvements:

  • One-shot: Reduced from 1.54 seconds to 0.74 seconds (-52%)
  • Few-shots: Reduced from 2.1 seconds to 1.09 seconds (-48%)
  • Zero-shot: Reduced from 0.74 seconds to 0.63 seconds in C1 configuration (-14.8%)
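
The energy percentages above are easier to read once the denominator is explicit. The following sketch recomputes them from the listed kWh values in two ways; the function names are illustrative and not from the paper.

```python
def reduction_vs_baseline(baseline: float, new: float) -> float:
    """Percentage saved, measured against the baseline value."""
    return (baseline - new) / baseline * 100


def excess_vs_new(baseline: float, new: float) -> float:
    """Percentage by which the baseline exceeds the new value."""
    return (baseline - new) / new * 100


# (baseline kWh, C2 kWh) as listed in the energy results above
energy_kwh = {
    "zero-shot": (0.0000157, 0.0000146),
    "one-shot": (0.0000347, 0.0000174),
    "few-shots": (0.0000537, 0.0000293),
}

for technique, (baseline, new) in energy_kwh.items():
    print(
        f"{technique}: "
        f"{reduction_vs_baseline(baseline, new):.1f}% vs baseline, "
        f"{excess_vs_new(baseline, new):.1f}% vs new value"
    )
# zero-shot: 7.0% vs baseline, 7.5% vs new value
# one-shot: 49.9% vs baseline, 99.4% vs new value
# few-shots: 45.4% vs baseline, 83.3% vs new value
```

Under the first convention the one-shot and few-shots drops come out near 50% and 45%, in line with the execution-time reductions of 52% and 48%; the second convention reproduces the 99% and 83% figures.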

RQ2: Impact of Custom Tags on Accuracy

Exact match improvements:

  • Zero-shot: Increased from 63 to 82 in C1 configuration (reported as +23%; about +30% relative to the baseline of 63)
  • One-shot and Few-shots: Approximately 44% improvement in C3 configuration

Edit distance reduction:

  • Zero-shot: 24% reduction in C2 configuration
  • One-shot: 64% reduction in C2 configuration
  • Few-shots: 70% reduction in C2 configuration

Key Findings

  1. C2 Configuration Optimal: The configuration that includes tag explanations in the prompt performs best in most cases.
  2. C4 Configuration Issues: Omitting the system role definition entirely results in uncontrolled model responses.
  3. Few-shots Robustness: The few-shots technique is the least affected by the lack of an explicit role definition.
  4. Energy Efficiency and Accuracy Improve Together: Custom tags improve both energy efficiency and accuracy at the same time.

Statistical Significance

Each test is repeated 5 times with 10-second intervals between runs, which reduces measurement bias and the influence of outliers.

Related Work

LLM Energy Consumption Assessment Research

  1. Time-Shifting Techniques: Jagannadharao et al. studied reducing carbon emissions through training pause and resume.
  2. Model Comparison: Liu and Yin compared carbon emissions across BERT, DistilBERT, and T5 models.
  3. Hardware Impact: Samsi et al. compared energy consumption across different Llama model sizes and GPU configurations.
  4. Code Generation Efficiency: Cursaro et al. studied energy efficiency comparisons between CodeLlama-generated and human-written code.

Prompt Customization Research

  1. Feature Impact: Fagadau et al. analyzed eight prompt features' effects on Copilot code output.
  2. Structure Optimization: Reynolds and McDonell explored prompt engineering with zero-shot strategies.
  3. Metamorphic Testing: Li et al. used metamorphic testing to study prompt modifications.
  4. Soft Prompts: Wang et al. proposed prompt tuning techniques using virtual tokens.

Conclusions and Discussion

Main Conclusions

  1. Energy Efficiency Improvement: Custom tags can significantly reduce LLM energy consumption in code completion tasks.
  2. Performance Maintenance: Model accuracy improves while energy consumption decreases.
  3. Configuration Dependency: LLM energy consumption is highly dependent on employed prompt engineering techniques.
  4. Dual Optimization: Prompt engineering can simultaneously optimize both energy efficiency and performance.

Limitations

  1. Dataset Constraints: Only tested 1,000 code snippets, limited by time costs (approximately 900 seconds per snippet).
  2. Single Task: Focuses solely on code completion; other tasks may require different energy resources.
  3. Single Model: Only tested Llama 3; generalizability of results requires verification.
  4. Hardware Dependency: Experiments conducted on specific hardware configuration; different environments may yield different results.

Future Directions

  1. Extended Research: Expand studies to more LLMs and code-related tasks.
  2. Advanced Techniques: Investigate effects of RAG or fine-tuning on carbon emissions.
  3. Multi-task Evaluation: Investigate custom prompt effectiveness across different software engineering tasks.
  4. Standardization: Establish standardized methodologies for LLM energy consumption measurement.

In-Depth Evaluation

Strengths

Methodological Innovation:

  1. First systematic investigation of prompt engineering's impact on LLM energy consumption.
  2. Designed multi-dimensional custom tag configuration schemes.
  3. Established trade-off analysis framework between energy efficiency and accuracy.

Experimental Sufficiency:

  1. Utilized standardized CodeXGLUE benchmark.
  2. Employed isolated testing environment ensuring measurement accuracy.
  3. Multiple experimental repetitions enhancing result reliability.
  4. Provided complete reproducibility package.

Result Convincingness:

  1. Significant energy reduction (reported as up to 99%; roughly 50% when measured against the baseline kWh values).
  2. Simultaneous accuracy improvement.
  3. Detailed ablation study analysis.

Weaknesses

Methodological Limitations:

  1. Quantization technique usage may affect result generalizability.
  2. Custom tag design relatively simple, lacking more complex semantic structures.
  3. Only considers GPU energy consumption, ignoring CPU and memory contributions.

Experimental Setup Defects:

  1. Limited sample size (1,000 snippets).
  2. Single programming language (Java).
  3. Fixed few-shots example count (5).
  4. Lack of comparison with other energy-saving techniques.

Analysis Insufficiencies:

  1. Lacks analysis across different code complexity levels.
  2. Insufficient exploration of theoretical foundations for tag mechanisms.
  3. Inadequate analysis of anomalous results (e.g., C4 configuration).

Impact

Academic Contributions:

  1. Pioneered new research direction in LLM green computing.
  2. Established connection between prompt engineering and energy efficiency optimization.
  3. Provided practical methods for sustainable AI development.

Practical Value:

  1. Directly applicable to existing code generation systems.
  2. Low implementation cost, easy deployment.
  3. Significantly reduces energy consumption while maintaining performance.

Reproducibility: Provides detailed experimental setup and open-source reproducibility package supporting result verification and extension.

Applicable Scenarios

  1. Code Generation Services: Online code completion and generation platforms.
  2. Development Environment Integration: Intelligent code assistants in IDEs.
  3. Large-Scale Deployment: Enterprise systems processing high volumes of code generation requests.
  4. Resource-Constrained Environments: Code generation applications on edge computing or mobile devices.
  5. Green Computing Initiatives: AI system development prioritizing environmental impact.

References

This paper cites 42 relevant references covering important works across multiple research domains including green software engineering, LLM energy consumption assessment, and prompt engineering, providing solid theoretical foundation and comparative references for the research.


Overall Assessment: This is a research work with significant practical value, systematically exploring prompt engineering's impact on LLM energy consumption for the first time. Despite certain limitations, the encouraging results provide new insights and methods for sustainable AI development. This work is expected to promote further research on green AI and energy efficiency optimization.