Potential threats to the validity of LLM-based software engineering research

Large language models (LLMs) have gained a lot of traction within the software engineering community. Researchers have shown that LLMs can be used for a wide variety of tasks, ranging from code generation to bug detection and natural-language interaction with codebases.
Although research results obtained using LLMs appear promising, the authors of a 2024 paper argue that software engineering researchers must be cautious when making claims about the effectiveness of their approaches.
The paper focuses on three key threats to validity: the use of closed-source LLMs, the blurred separation between training, validation, and test sets, and the limited reproducibility of published research outcomes over time.
Many researchers use ChatGPT because it is effective, readily available, and comparatively cheap. To put that last point into perspective: deploying a comparable open-source model such as Falcon 180B on AWS would cost almost $30,000 per month!
However, using closed-source models also poses significant threats to validity.
The main issue is that closed-source models are not always versioned. They may change during or after research has been conducted, potentially rendering results obsolete. For longitudinal studies, it can be difficult to determine whether claimed improvements are due to researchers’ contributions or to changes in the LLM.
Another major concern is privacy: it’s often unclear what the privacy implications of using a closed-source model are, or whether doing so may lead to copyright infringement.
LLMs are pre-trained on vast amounts of textual data, which may include anything from Wikipedia articles to source code on GitHub and academic datasets. This helps LLMs learn the meaning of words and general knowledge about the world.
Code encountered during the pre-training phase may “leak” into the usage phase.
For example, several studies have highlighted vulnerabilities in code generated by Codex that originate from its training data. In another study, researchers used three LLMs to generate unit tests for code from two datasets – one hosted on GitHub, the other on SourceForge – and found that the LLMs performed well on the GitHub dataset but poorly on the SourceForge dataset.
For supervised learning, data should be separated into training, validation, and test sets. This is hard to do, especially for source code, as code from different projects may share dependencies and use the same predefined APIs. Data leakage may occur if an LLM is trained on a project that uses a specific API and is then asked to fix a usage of that same API in another project in the test set.
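One partial mitigation is to split data at the project level rather than the file level, so that all code from a given project ends up on the same side of the split. The sketch below uses scikit-learn's GroupShuffleSplit for this; the code snippets and project labels are purely illustrative.

```python
# Sketch: project-level splitting to reduce cross-project leakage.
# The sample data and the "project" labels are illustrative placeholders.
from sklearn.model_selection import GroupShuffleSplit

samples = [
    {"code": "client.get('/users')",  "project": "proj-a"},
    {"code": "client.post('/users')", "project": "proj-a"},
    {"code": "db.session.commit()",   "project": "proj-b"},
    {"code": "db.session.rollback()", "project": "proj-b"},
    {"code": "logger.warning('x')",   "project": "proj-c"},
]
groups = [s["project"] for s in samples]

# Keep every sample from a given project on the same side of the split,
# so an API usage seen during training cannot reappear in the test set
# via another file of the same project.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(splitter.split(samples, groups=groups))

train_set = [samples[i] for i in train_idx]
test_set = [samples[i] for i in test_idx]
print([s["project"] for s in train_set], [s["project"] for s in test_set])
```

This does not remove leakage through shared third-party dependencies, but it does prevent the most direct form of cross-split contamination within a single project.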
A promising method to evaluate the performance of LLMs is metamorphic testing, which involves making small changes to test code that do not alter its meaning or behaviour, so that an LLM cannot simply reproduce code it has seen during pre-training.
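As a concrete illustration, one such semantics-preserving transformation is variable renaming. The minimal sketch below rewrites identifiers using Python's ast module; a real implementation would need to leave builtins, imports, and attribute accesses untouched, and the code it rewrites is made up.

```python
# Sketch: a simple semantics-preserving transformation (variable renaming)
# that could be used to build metamorphic variants of evaluation code.
import ast

class RenameLocals(ast.NodeTransformer):
    """Rename every plain variable name to an opaque identifier.
    (A real implementation would skip builtins, imports, etc.)"""
    def __init__(self):
        self.mapping = {}

    def visit_Name(self, node):
        new_name = self.mapping.setdefault(node.id, f"var_{len(self.mapping)}")
        return ast.copy_location(ast.Name(id=new_name, ctx=node.ctx), node)

def metamorphic_variant(source: str) -> str:
    tree = RenameLocals().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # requires Python 3.9+

original = "total = price * quantity\ndiscounted = total * 0.9"
print(metamorphic_variant(original))
# An LLM that has genuinely learned the task should perform comparably on
# the original and the renamed variant; a large gap suggests memorisation.
```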
Reproducibility of results is important in science, but results obtained using LLMs are often not reproducible, even with identical inputs. Without the ability to set a fixed random seed, executing the same prompt multiple times will usually produce different responses.
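For locally hosted open models, determinism can at least be approximated by fixing the random seed (or by using greedy decoding). The sketch below assumes the Hugging Face transformers library and the small GPT-2 model purely for illustration; repeated runs match on the same hardware and library versions, which is exactly what closed APIs without a seed parameter cannot guarantee.

```python
# Sketch: reproducible generation with a locally hosted open model.
# Model choice ("gpt2") and prompt are illustrative; the point is that a
# fixed seed makes repeated runs comparable.
from transformers import pipeline, set_seed

generator = pipeline("text-generation", model="gpt2")

set_seed(42)  # fixes Python, NumPy, and PyTorch RNG state
first = generator("def fizzbuzz(n):", max_new_tokens=40, do_sample=True)

set_seed(42)  # re-seeding before the second run reproduces the first
second = generator("def fizzbuzz(n):", max_new_tokens=40, do_sample=True)

assert first[0]["generated_text"] == second[0]["generated_text"]
```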
When closed-source LLMs are used, there is also no guarantee that results remain stable over time. This is especially problematic given that regression testing is rarely performed to account for output variability, even though regression is a very real problem: one study by Chen et al. showed that accuracy can drop dramatically between different versions of GPT.
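A lightweight way to account for this is to re-run the study's benchmark whenever the model or its version changes and compare an aggregate metric against a stored baseline, rather than comparing raw outputs, which vary between runs. In the sketch below, evaluate_pass_rate(), the baseline file, and the tolerance are hypothetical placeholders.

```python
# Sketch: a minimal regression check that tracks an aggregate metric
# across model versions instead of comparing raw outputs.
import json
from pathlib import Path

BASELINE_FILE = Path("llm_regression_baseline.json")
TOLERANCE = 0.05  # maximum acceptable drop in the tracked metric

def evaluate_pass_rate(model_id: str) -> float:
    """Run the study's benchmark against `model_id` and return a score."""
    raise NotImplementedError("hook up the actual evaluation here")

def check_regression(model_id: str) -> None:
    score = evaluate_pass_rate(model_id)
    if not BASELINE_FILE.exists():
        # First run: record the baseline for future comparisons.
        BASELINE_FILE.write_text(json.dumps({"model": model_id, "score": score}))
        return
    baseline = json.loads(BASELINE_FILE.read_text())
    if score < baseline["score"] - TOLERANCE:
        print(f"Regression: {model_id} scored {score:.2f}, "
              f"baseline {baseline['model']} scored {baseline['score']:.2f}")
```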
Another problem with the widespread adoption of LLMs is the lack of traceability. Studies typically describe prompts and resulting outputs, but information about the model version used, the date when prompts were run, and other execution details is often missing.
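A simple remedy is to log this metadata alongside every prompt/response pair. The sketch below assumes the current OpenAI Python client; the field names and log format are illustrative, and the same idea applies to any other model or client.

```python
# Sketch: recording execution metadata with each prompt/response pair so
# that results can be traced to a specific model version and date.
import json
from datetime import datetime, timezone
from openai import OpenAI

client = OpenAI()

def run_and_log(prompt: str, model: str, log_path: str = "llm_trace.jsonl") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "requested_model": model,
        "served_model": response.model,  # exact version reported by the API
        "prompt": prompt,
        "output": response.choices[0].message.content,
        "temperature": 0,
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record["output"]
```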
The use of LLMs for software engineering research poses three threats to validity:
- Closed-source models may change at any time, and it’s not clear where their output originates
- Data leakage may occur between the data used to pre-train LLMs and the datasets used to evaluate an LLM’s performance
- Findings from LLM-based research are often not reproducible

