Potential threats to the validity of LLM-based software engineering research

Large language models (LLMs) have gained a lot of traction within the software engineering community. Researchers have shown that LLMs can be used for a wide variety of tasks, ranging from code generation to bug detection and natural-language interaction with codebases.
Although research results obtained using LLMs appear promising, the authors of a 2024 paper argue that software engineering researchers must be cautious when making claims about the effectiveness of their approaches.
The paper focuses on three key threats to validity: the use of closed-source LLMs, the blurred separation between training, validation, and test sets, and the limited reproducibility of published research outcomes over time.
Many researchers use ChatGPT because it is effective, readily available, and comparatively cheap. To put that last point into perspective: deploying a comparable open-source model such as Falcon 180B on AWS would cost almost $30,000 per month!
However, using closed-source models also poses significant threats to validity.
The main issue is that closed-source models are not always versioned. They may change during or after research has been conducted, potentially rendering results obsolete. For longitudinal studies, it can be difficult to determine whether claimed improvements are due to researchers’ contributions or to changes in the LLM.
Another major concern is privacy: it’s often unclear what the privacy implications of using a closed-source model are, or whether doing so may lead to copyright infringement.
LLMs are pre-trained on vast amounts of textual data, which may include anything from Wikipedia articles to source code on GitHub and academic datasets. This helps LLMs learn the meaning of words and general knowledge about the world.
Code encountered during the pre-training phase may “leak” into the usage phase.
For example, several studies have highlighted vulnerabilities in code generated by Codex that originate from its training data. In another study, researchers used three LLMs to generate unit tests for code from two datasets – one hosted on GitHub, the other on SourceForge – and found that the LLMs performed well on the GitHub dataset but poorly on the SourceForge dataset.
For supervised learning, data should be separated into training, validation, and test sets. This is hard to do, especially for source code, as code from different projects may share dependencies and use the same predefined APIs. Data leakage may occur if an LLM is trained on a project that uses a specific API and is then asked to fix a usage of that same API in another project in the test set.
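One partial mitigation is to split data at the project level rather than the file level, so that all code from a given project ends up on the same side of the split. The sketch below uses scikit-learn's GroupShuffleSplit for this; the code snippets and project labels are purely illustrative.

```python
# Sketch: project-level splitting to reduce cross-project leakage.
# The sample data and the "project" labels are illustrative placeholders.
from sklearn.model_selection import GroupShuffleSplit

samples = [
    {"code": "client.get('/users')",  "project": "proj-a"},
    {"code": "client.post('/users')", "project": "proj-a"},
    {"code": "db.session.commit()",   "project": "proj-b"},
    {"code": "db.session.rollback()", "project": "proj-b"},
    {"code": "logger.warning('x')",   "project": "proj-c"},
]
groups = [s["project"] for s in samples]

# Keep every sample from a given project on the same side of the split,
# so an API usage seen during training cannot reappear in the test set
# via another file of the same project.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(splitter.split(samples, groups=groups))

train_set = [samples[i] for i in train_idx]
test_set = [samples[i] for i in test_idx]
print([s["project"] for s in train_set], [s["project"] for s in test_set])
```

This does not remove leakage through shared third-party dependencies, but it does prevent the most direct form of cross-split contamination within a single project.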
A promising method to evaluate the performance of LLMs is metamorphic testing, which involves making small changes to test code that do not alter its meaning or behaviour, so that an LLM cannot simply reproduce code it has seen during pre-training.
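As a concrete illustration, one such semantics-preserving transformation is variable renaming. The minimal sketch below rewrites identifiers using Python's ast module; a real implementation would need to leave builtins, imports, and attribute accesses untouched, and the code it rewrites is made up.

```python
# Sketch: a simple semantics-preserving transformation (variable renaming)
# that could be used to build metamorphic variants of evaluation code.
import ast

class RenameLocals(ast.NodeTransformer):
    """Rename every plain variable name to an opaque identifier.
    (A real implementation would skip builtins, imports, etc.)"""
    def __init__(self):
        self.mapping = {}

    def visit_Name(self, node):
        new_name = self.mapping.setdefault(node.id, f"var_{len(self.mapping)}")
        return ast.copy_location(ast.Name(id=new_name, ctx=node.ctx), node)

def metamorphic_variant(source: str) -> str:
    tree = RenameLocals().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # requires Python 3.9+

original = "total = price * quantity\ndiscounted = total * 0.9"
print(metamorphic_variant(original))
# An LLM that has genuinely learned the task should perform comparably on
# the original and the renamed variant; a large gap suggests memorisation.
```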
Reproducibility of results is important in science, but results obtained using LLMs are often not reproducible, even with identical inputs. Without the ability to set a fixed random seed, executing the same prompt multiple times will usually produce different responses.
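For locally hosted open models, determinism can at least be approximated by fixing the random seed (or by using greedy decoding). The sketch below assumes the Hugging Face transformers library and the small GPT-2 model purely for illustration; repeated runs match on the same hardware and library versions, which is exactly what closed APIs without a seed parameter cannot guarantee.

```python
# Sketch: reproducible generation with a locally hosted open model.
# Model choice ("gpt2") and prompt are illustrative; the point is that a
# fixed seed makes repeated runs comparable.
from transformers import pipeline, set_seed

generator = pipeline("text-generation", model="gpt2")

set_seed(42)  # fixes Python, NumPy, and PyTorch RNG state
first = generator("def fizzbuzz(n):", max_new_tokens=40, do_sample=True)

set_seed(42)  # re-seeding before the second run reproduces the first
second = generator("def fizzbuzz(n):", max_new_tokens=40, do_sample=True)

assert first[0]["generated_text"] == second[0]["generated_text"]
```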
When closed-source LLMs are used, there is also no guarantee that results remain stable over time. This is especially problematic given that regression testing is rarely performed to account for output variability, even though regression is a very real problem: one study by Chen et al. showed that accuracy can drop dramatically between different versions of GPT.
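A lightweight way to account for this is to re-run the study's benchmark whenever the model or its version changes and compare an aggregate metric against a stored baseline, rather than comparing raw outputs, which vary between runs. In the sketch below, evaluate_pass_rate(), the baseline file, and the tolerance are hypothetical placeholders.

```python
# Sketch: a minimal regression check that tracks an aggregate metric
# across model versions instead of comparing raw outputs.
import json
from pathlib import Path

BASELINE_FILE = Path("llm_regression_baseline.json")
TOLERANCE = 0.05  # maximum acceptable drop in the tracked metric

def evaluate_pass_rate(model_id: str) -> float:
    """Run the study's benchmark against `model_id` and return a score."""
    raise NotImplementedError("hook up the actual evaluation here")

def check_regression(model_id: str) -> None:
    score = evaluate_pass_rate(model_id)
    if not BASELINE_FILE.exists():
        # First run: record the baseline for future comparisons.
        BASELINE_FILE.write_text(json.dumps({"model": model_id, "score": score}))
        return
    baseline = json.loads(BASELINE_FILE.read_text())
    if score < baseline["score"] - TOLERANCE:
        print(f"Regression: {model_id} scored {score:.2f}, "
              f"baseline {baseline['model']} scored {baseline['score']:.2f}")
```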
Another problem with the widespread adoption of LLMs is the lack of traceability. Studies typically describe prompts and resulting outputs, but information about the model version used, the date when prompts were run, and other execution details is often missing.
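A simple remedy is to log this metadata alongside every prompt/response pair. The sketch below assumes the current OpenAI Python client; the field names and log format are illustrative, and the same idea applies to any other model or client.

```python
# Sketch: recording execution metadata with each prompt/response pair so
# that results can be traced to a specific model version and date.
import json
from datetime import datetime, timezone
from openai import OpenAI

client = OpenAI()

def run_and_log(prompt: str, model: str, log_path: str = "llm_trace.jsonl") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "requested_model": model,
        "served_model": response.model,  # exact version reported by the API
        "prompt": prompt,
        "output": response.choices[0].message.content,
        "temperature": 0,
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record["output"]
```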
The use of LLMs for software engineering research poses three threats to validity:
- Closed-source models may change at any time, and it’s not clear where their output originates
- Data leakage may occur between the data used to pre-train LLMs and the datasets used to evaluate an LLM’s performance
- Findings from LLM-based research are often not reproducible

