Efficient and green LLMs for software engineering

Large language models (LLMs) can help software engineers with common tasks such as writing and summarising code, and finding and repairing bugs. However, LLMs are computationally intensive and energy-demanding, so training and running them usually requires deep pockets. Unless we find ways to drastically reduce their computational costs and energy use, this is unlikely to improve.
Techniques to make large language models for software engineering more green and efficient can be categorised from four perspectives: data-centric, model-centric, system-centric, and program-centric.
Data-centric techniques reduce or optimise the data required to train LLMs (minimal sketches of each idea follow the list):
- Parameter-efficient fine-tuning (PEFT), in which only a small subset of a model's parameters is updated while the rest remain frozen, can be highly effective. One study reported competitive performance on code-clone detection using just 1,000 labelled examples.
- Another successful approach involves curriculum learning, which presents examples in a structured order from simple to complex. Models trained this way have been shown to outperform state-of-the-art LLMs while using only 10% of the full dataset.
- Some researchers have achieved similar results by refining and filtering the training data using a smaller LLM, retaining no more than 30% of the original data and achieving up to 20 times lower computational cost.
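The PEFT idea in the first bullet can be illustrated with a LoRA-style adapter: the pretrained weights stay frozen and only two small low-rank matrices are trained. This is a minimal sketch with illustrative dimensions, not the setup used in any of the studies above.

```python
# Minimal LoRA-style adapter: freeze the base weights, train only the
# low-rank factors A and B. Dimensions and rank are illustrative.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update W + B @ A."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze the pretrained weights
            p.requires_grad = False
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable params: {trainable} / {total}")   # only the low-rank factors train
```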
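Curriculum learning, from the second bullet, mostly comes down to how the training data is ordered. A minimal sketch, assuming token count as a crude difficulty proxy (the difficulty measures used in the literature are more sophisticated):

```python
# Order training examples from "simple" to "complex" before fine-tuning.
# Token count is used here as a crude difficulty proxy purely for illustration.
def curriculum_order(examples, tokenizer):
    return sorted(examples, key=lambda ex: len(tokenizer(ex["code"])))

snippets = [
    {"code": "def add(a, b):\n    return a + b"},
    {"code": "x = 1"},
    {"code": "class Stack:\n    def __init__(self):\n        self.items = []"},
]
ordered = curriculum_order(snippets, tokenizer=str.split)  # whitespace "tokenizer" stand-in
for ex in ordered:
    print(len(ex["code"].split()), ex["code"].splitlines()[0])
```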
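The third bullet, filtering the training data with a smaller model, can be sketched as scoring every example with a cheap scorer and keeping only the top fraction. `score_with_small_model` is a hypothetical stand-in for whatever small LLM or quality metric is actually used.

```python
# Keep only the highest-scoring ~30% of training examples, as judged by a cheap model.
# `score_with_small_model` is a hypothetical stand-in for a small LLM scorer
# (e.g. negative perplexity or a learned quality score).
def filter_with_small_model(examples, score_with_small_model, keep_fraction=0.3):
    scored = sorted(examples, key=score_with_small_model, reverse=True)
    keep = max(1, int(len(scored) * keep_fraction))
    return scored[:keep]

data = ["a reasonably long, well-formed snippet", "???", "ok code", "noise;;;"]
filtered = filter_with_small_model(data, score_with_small_model=len)  # toy scorer
print(filtered)
```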
Model-centric techniques optimise the LLMs themselves. There are three main approaches here (the latter two are sketched after the list):
- The aforementioned PEFT is one such approach.
- Model compression aims to reduce the size of models and thus their inference latency, memory usage, and energy consumption. This is often done using knowledge distillation; distilled models have been produced at sizes as small as 3 MB. Other effective methods include quantisation and low-rank decomposition.
- Architectural improvements can also help: for example, pairing a slow but accurate LLM with a fast but less accurate one so the slower model only needs to refine outputs, or using hash-based techniques to accelerate lookups.
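Knowledge distillation, the most common compression route mentioned above, trains a small student to match a large teacher's output distribution. A minimal sketch of one training step, with dummy models standing in for a real teacher/student pair:

```python
# Minimal knowledge-distillation step: the student matches the teacher's
# softened output distribution via a KL-divergence loss. Both models are dummies.
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Linear(128, 1000)   # stand-in for a large, frozen model
student = nn.Linear(128, 1000)   # much smaller than the teacher in practice
teacher.eval()

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)
temperature = 2.0

x = torch.randn(32, 128)                       # a batch of input features
with torch.no_grad():
    teacher_logits = teacher(x)

student_logits = student(x)
loss = F.kl_div(
    F.log_softmax(student_logits / temperature, dim=-1),
    F.softmax(teacher_logits / temperature, dim=-1),
    reduction="batchmean",
) * temperature ** 2                           # standard temperature scaling

optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"distillation loss: {loss.item():.4f}")
```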
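The fast-plus-slow pairing can be sketched as a simple cascade: the cheap model answers first, and the expensive model is consulted only when the cheap model is not confident. This is a simplified illustration of the idea, not speculative decoding or any specific system from the paper; both "models" and the threshold are placeholders.

```python
# Toy model cascade: use the cheap model's answer when it is confident,
# and have the expensive model refine the draft otherwise.
def cascade(prompt, fast_model, slow_model, confidence_threshold=0.8):
    answer, confidence = fast_model(prompt)
    if confidence >= confidence_threshold:
        return answer                              # cheap path: no large-model call
    return slow_model(prompt, draft=answer)        # slow model refines the draft

def fast_model(prompt):
    return f"draft answer to: {prompt}", 0.6       # (answer, confidence)

def slow_model(prompt, draft):
    return f"refined answer to: {prompt} (started from: {draft})"

print(cascade("summarise this function", fast_model, slow_model))
```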
System-centric techniques optimise parts of the system or pipeline, such as the inference process or decoding strategy (both ideas are sketched after the list):
- Dynamic inference can reduce the time spent on inference by rejecting invalid prompts early or by using fewer model layers when possible. Grishina et al. showed that, for vulnerability detection, only 3 of CodeBERT’s 12 layers were sufficient using their EarlyBIRD approach.
- Other methods, such as CodeFast, improve inference speeds by using a lightweight GenGuard model to predict whether to halt inference at each step; CodeFast reports up to a 452% speed-up.
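A toy version of the early-exit idea behind EarlyBIRD: run only the first few layers of an encoder and classify from that intermediate representation. The tiny encoder below is a stand-in, not CodeBERT, and three layers is chosen simply to mirror the 3-of-12 figure reported above.

```python
# Early-exit sketch: stop after the first `exit_layer` transformer layers and
# classify from the intermediate representation. The encoder is a toy stand-in.
import torch
import torch.nn as nn

class EarlyExitEncoder(nn.Module):
    def __init__(self, d_model=256, n_layers=12, n_classes=2):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            for _ in range(n_layers)
        )
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, x, exit_layer=3):
        for layer in self.layers[:exit_layer]:    # only the first few layers run
            x = layer(x)
        return self.classifier(x[:, 0])           # classify from the first token

model = EarlyExitEncoder()
tokens = torch.randn(8, 64, 256)                  # (batch, sequence, hidden)
logits = model(tokens, exit_layer=3)              # 4x fewer layers than the full stack
print(logits.shape)                               # torch.Size([8, 2])
```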
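And a toy sketch of the CodeFast idea: after every generated token, a lightweight predictor decides whether the useful part of the output is already complete, so generation can stop early. `generate_one_token` and `should_stop` are hypothetical stand-ins, not the actual GenGuard model.

```python
# Toy early-termination loop in the spirit of CodeFast: a cheap predictor is
# consulted after each token to decide whether to halt generation.
def generate_with_early_stop(prompt, generate_one_token, should_stop, max_tokens=256):
    output = []
    for _ in range(max_tokens):
        token = generate_one_token(prompt, output)
        output.append(token)
        if should_stop(output):        # lightweight check, far cheaper than the LLM
            break
    return "".join(output)

tokens = iter("def square(x):\n    return x * x\n\n# unrelated trailing text")
result = generate_with_early_stop(
    "write a square function",
    generate_one_token=lambda prompt, out: next(tokens),
    should_stop=lambda out: "".join(out).endswith("x * x\n"),
)
print(result)   # stops before the unrelated trailing text is ever generated
```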
Program-centric techniques optimise the input programs that are fed into LLMs (sketches follow the list):
- DietCode manages to reduce CodeBERT’s inference cost by 40% simply by removing tokens that are unlikely to be needed to produce valid responses.
- Token counts can also be reduced by representing inputs as abstract syntax trees or program dependence graphs. SlimCode reports being up to 133 times faster than DietCode using such representations.
- SynCode integrates a context-free grammar into the decoding process of LLMs so they generate syntactically valid code more efficiently, accelerating the decoding process by up to 19%.
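A toy version of the DietCode idea: drop input tokens that carry little information before the prompt ever reaches the model. DietCode itself uses attention-based statistics to decide which tokens matter; the stop-token heuristic below is only an illustration.

```python
# Toy input pruning in the spirit of DietCode: remove low-value tokens so the
# prompt is shorter and inference is cheaper. The token list is illustrative;
# DietCode selects tokens using attention statistics, not a fixed list.
LOW_VALUE_TOKENS = {"{", "}", "(", ")", ";", ":", ",", "pass"}

def prune_code(code: str) -> str:
    kept = [tok for tok in code.split() if tok not in LOW_VALUE_TOKENS]
    return " ".join(kept)

snippet = "def get_name ( self ) : return self.name"
print(prune_code(snippet))   # fewer tokens -> cheaper inference
```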
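The structural-representation idea in the second bullet can be sketched with Python's standard `ast` module: keep only the structural facts (function names, arguments, calls) instead of the full source text. Real AST/PDG-based approaches such as SlimCode are considerably more sophisticated than this.

```python
# Sketch of a compact, AST-based input representation: extract function
# signatures and the calls they make, instead of feeding the raw source.
import ast

source = """
def total_price(items, tax_rate):
    subtotal = sum(item.price for item in items)
    return round(subtotal * (1 + tax_rate), 2)
"""

tree = ast.parse(source)
for node in ast.walk(tree):
    if isinstance(node, ast.FunctionDef):
        args = ", ".join(a.arg for a in node.args.args)
        calls = {c.func.id for c in ast.walk(node)
                 if isinstance(c, ast.Call) and isinstance(c.func, ast.Name)}
        print(f"def {node.name}({args}) calls {sorted(calls)}")
# -> def total_price(items, tax_rate) calls ['round', 'sum']
```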
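SynCode's grammar-constrained decoding can be illustrated with a much simpler constraint: before sampling each token, mask out any candidate the grammar does not allow, so the model never wastes steps on outputs that would have to be rejected. The balanced-parentheses "grammar" and the random "model" below are placeholders; SynCode works with a full context-free grammar of the target language.

```python
# Toy constrained decoding: at each step, restrict sampling to tokens that a
# (very simple) grammar allows. The rule and the random "model" are placeholders.
import random

VOCAB = ["x", "+", "(", ")", "<end>"]

def allowed(prefix, token):
    depth = prefix.count("(") - prefix.count(")")
    if token == ")":
        return depth > 0               # never close an unopened parenthesis
    if token == "<end>":
        return depth == 0              # only finish when parentheses are balanced
    return True

def constrained_decode(max_steps=20):
    out = []
    for _ in range(max_steps):
        candidates = [t for t in VOCAB if allowed(out, t)]   # grammar mask
        token = random.choice(candidates)                    # stand-in for LLM sampling
        if token == "<end>":
            break
        out.append(token)
    return "".join(out)

print(constrained_decode())
```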
The main audience for this paper is the research community working on large language models for software engineering. Many of its proposals for the future – more efficient training, improved inference acceleration, and program optimisation – will be of limited immediate use to practising software engineers.
There is, however, one additional technique worth calling out for readers who may not be familiar with it: retrieval-augmented generation (RAG). RAG retrieves texts that semantically match a query from an external knowledge base and passes them to an LLM, which then generates an appropriate answer. This allows LLMs to generate factually accurate responses without the need for extensive retraining.
RAG is of course more of a trade-off than a pure efficiency technique: it adds latency and operational cost compared with simple prompting, but it can be a practical way to teach an LLM new facts when you lack the resources or expertise to retrain models.
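A minimal RAG sketch, assuming a generic embedding function and LLM call rather than any specific stack: embed the knowledge base once, retrieve the closest entries for each query, and prepend them to the prompt. `embed` and `call_llm` are hypothetical stand-ins.

```python
# Minimal RAG sketch: retrieve the most relevant snippets from a small knowledge
# base and prepend them to the prompt. `embed` and `call_llm` are stand-ins for
# whatever embedding model and LLM endpoint you actually use.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Toy bag-of-characters embedding; replace with a real embedding model.
    vec = np.zeros(128)
    for ch in text.lower():
        vec[ord(ch) % 128] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

knowledge_base = [
    "Our build uses Gradle 8 and Java 21.",
    "Deployments go through the staging cluster first.",
    "The payments service owns the invoices table.",
]
kb_vectors = np.stack([embed(doc) for doc in knowledge_base])

def answer(query: str, call_llm, top_k: int = 2) -> str:
    scores = kb_vectors @ embed(query)               # cosine similarity (unit vectors)
    context = [knowledge_base[i] for i in np.argsort(scores)[::-1][:top_k]]
    prompt = "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}\nAnswer:"
    return call_llm(prompt)

print(answer("Which Java version do we build with?", call_llm=lambda p: p))
```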
- The common thread across all of these techniques: LLMs can be made more efficient by reducing the amount of data they consume or the amount of processing they have to do.

