How batch size affects LLMs’ classification of requirements

Large language models (LLMs) are being crammed into just about everything these days. And why wouldn’t you? They’re powerful, flexible, and – when used well – can save you a lot of time. Requirements engineering (RE) is no exception.
This week’s paper looks at the use of LLMs for a seemingly trivial RE task: classifying requirements as either functional or quality-related.
Many studies keep things simple by classifying a single requirement per prompt. That works, but it means one LLM call per requirement, which drives up cost and energy consumption.
An obvious alternative is to provide multiple requirements in a single prompt. This reduces the number of LLM calls, and the extra context might even help models produce better results. But there’s a tradeoff: excessively large batch sizes can lead to worse results.
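To make the idea concrete, here is a minimal sketch in Python of how requirements could be grouped into batched prompts. The prompt wording and function names are illustrative, not taken from the paper.

```python
def make_batches(requirements: list[str], batch_size: int) -> list[list[str]]:
    """Split the requirement list into chunks of at most `batch_size` items."""
    return [requirements[i:i + batch_size]
            for i in range(0, len(requirements), batch_size)]

def build_prompt(batch: list[str]) -> str:
    """Pack several requirements into one numbered prompt instead of one prompt each."""
    numbered = "\n".join(f"{i + 1}. {req}" for i, req in enumerate(batch))
    return ("Classify each requirement below as functional (F) or quality (Q).\n"
            "Answer with one label per line.\n\n" + numbered)

requirements = [
    "The system shall export reports as PDF.",
    "The UI shall respond within 200 ms.",
    "Users shall be able to reset their password.",
]

# With batch_size=1 this would mean three LLM calls; with batch_size=4 only one.
for batch in make_batches(requirements, batch_size=4):
    prompt = build_prompt(batch)  # send `prompt` to whichever model you are evaluating
```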
As we’ll see below, there isn’t necessarily a “best” batch size, although a worst one arguably does exist.
The paper reports an experiment in which three popular open-source LLMs – DeepSeek R1 Distill Qwen 14B, Llama 3.1 8B-instruct, and Gemma 3 12B – were evaluated on four pre-existing datasets containing software requirements. Each model was tested using seven different batch sizes (1, 2, 4, 8, 16, 32, and 64). To ensure reproducibility, the researchers used a temperature of 0.01 and a fixed seed.
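A rough sketch of that experimental grid might look like the following. The `run_classification` helper, dataset names, and seed value are placeholders, not the authors’ actual code; only the model list, batch sizes, and temperature come from the paper.

```python
import itertools

# Models and batch sizes evaluated in the paper.
MODELS = ["deepseek-r1-distill-qwen-14b", "llama-3.1-8b-instruct", "gemma-3-12b"]
BATCH_SIZES = [1, 2, 4, 8, 16, 32, 64]
DATASETS = ["dataset_a", "dataset_b", "dataset_c", "dataset_d"]  # placeholder names

def run_classification(model: str, dataset: str, batch_size: int,
                       temperature: float = 0.01, seed: int = 0) -> dict:
    """Hypothetical helper: build batched prompts, call the model with a fixed
    temperature and seed for reproducibility, and return accuracy metrics."""
    return {"accuracy": None}  # stub

# One run per (model, dataset, batch size) combination.
results = {
    (model, dataset, size): run_classification(model, dataset, size)
    for model, dataset, size in itertools.product(MODELS, DATASETS, BATCH_SIZES)
}
```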
Classification of requirements actually consists of two distinct tasks: determining whether a requirement describes a functionality, and determining whether it addresses a quality aspect.
Because prompt design heavily influences performance, the authors followed three well-established prompting patterns (combined in the sketch after this list):
- The persona pattern prompts the LLM to adopt a specific point of view so it knows what details to focus on.
- The template pattern forces the LLM to produce output in a specific, consistent format.
- The few-shot pattern involves including a number of examples in the prompt so that the model can see what kind of output is expected.
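Here is a minimal sketch of how the three patterns could be combined in a single classification prompt. The persona wording, output format, and examples are illustrative assumptions, not the paper’s exact prompts.

```python
def build_classification_prompt(requirements: list[str]) -> str:
    """Combine the persona, template, and few-shot patterns in one prompt."""
    persona = "You are an experienced requirements engineer."  # persona pattern
    template = (  # template pattern: one label per line, fixed format
        "For each numbered requirement, answer on its own line in the format "
        "'<number>: F' (functional) or '<number>: Q' (quality)."
    )
    few_shot = (  # few-shot pattern: worked examples of the expected output
        "Examples:\n"
        "Requirement: The system shall send a confirmation e-mail after registration.\n"
        "Answer: F\n"
        "Requirement: The system shall be available 99.9% of the time.\n"
        "Answer: Q"
    )
    numbered = "\n".join(f"{i + 1}. {r}" for i, r in enumerate(requirements))
    return "\n\n".join([persona, template, few_shot, "Requirements:\n" + numbered])
```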
Although requirement classification seems like a narrow, well‑defined task, the results show surprising variability – similar to what happens when you ask different models to generate code.
DeepSeek’s performance on functional classification is relatively stable across datasets and batch sizes, but its performance on quality classification varies depending on batch size. A size of 1 performs the worst, while a size of 32 yields the best results.
Llama’s performance on both tasks gradually decreases as batch size increases, though its sensitivity to batch size is less pronounced than DeepSeek’s.
Gemma shows stable performance on functional classification as well, but also has varying performance on quality classification. Accuracy improves steadily up to a batch size of 8, then drops again at larger sizes.
Finally, an ensemble approach that combines the three models’ predictions using majority voting yields consistently stable performance across all datasets and batch sizes. For quality classification, the average performance increases up to size 8, after which it decreases only slightly.
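A majority-voting ensemble over three binary classifiers is straightforward to sketch; the model names and labels below are illustrative.

```python
from collections import Counter

def majority_vote(predictions: dict[str, list[str]]) -> list[str]:
    """Combine per-model label lists ('F' / 'Q') by majority voting.

    `predictions` maps a model name to its labels for the same requirements.
    With three models and a binary label there is always a strict majority.
    """
    per_model = list(predictions.values())
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*per_model)]

# Example: three (hypothetical) model outputs for four requirements.
votes = {
    "deepseek": ["F", "Q", "F", "Q"],
    "llama":    ["F", "F", "F", "Q"],
    "gemma":    ["Q", "Q", "F", "Q"],
}
print(majority_vote(votes))  # ['F', 'Q', 'F', 'Q']
```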
What this experiment shows is that each large language model responds differently to different batch sizes. This means batch size should be chosen based on the specific characteristics of both the dataset and the model. Interestingly, a batch size of 1 leads to the largest standard deviation for all models, suggesting that, despite its popularity among researchers, it’s probably not a great default choice.
The ensemble model’s performance is more balanced in terms of precision, recall, and specificity, but it still varies noticeably between batch sizes. So even when using an ensemble, batch size must be selected carefully.
All models also struggle more with correctly classifying quality requirements than functional ones. A qualitative analysis suggests this happens because many requirements mix functional and quality aspects, and the models tend to latch onto the functional part while overlooking the quality aspect. This issue can potentially be mitigated by including more few-shot examples that feature requirements containing both functional and quality aspects.
- Large language models can be used to classify requirements as functional or quality-related.
- Batching multiple requirements into one prompt can improve performance in terms of both accuracy and computational efficiency.
- Batch size must be chosen carefully based on the characteristics of the model and the requirements being classified.

