How batch size affects LLMs’ classification of requirements

Large language models (LLMs) are being crammed into just about everything these days. And why wouldn’t you? They’re powerful, flexible, and – when used well – can save you a lot of time. Requirements engineering (RE) is no exception.
This week’s paper looks at the use of LLMs for a seemingly trivial RE task: classifying requirements as either functional or quality-related.
Many studies keep things simple by classifying a single requirement per prompt. That works, but it means one LLM call per requirement, which drives up cost and energy consumption.
An obvious alternative is to provide multiple requirements in a single prompt. This reduces the number of LLM calls, and the extra context might even help models produce better results. But there’s a tradeoff: excessively large batch sizes can lead to worse results.
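To make the idea concrete, here is a minimal sketch in Python of how requirements could be grouped into batched prompts. The prompt wording and function names are illustrative, not taken from the paper.

```python
def make_batches(requirements: list[str], batch_size: int) -> list[list[str]]:
    """Split the requirement list into chunks of at most `batch_size` items."""
    return [requirements[i:i + batch_size]
            for i in range(0, len(requirements), batch_size)]

def build_prompt(batch: list[str]) -> str:
    """Pack several requirements into one numbered prompt instead of one prompt each."""
    numbered = "\n".join(f"{i + 1}. {req}" for i, req in enumerate(batch))
    return ("Classify each requirement below as functional (F) or quality (Q).\n"
            "Answer with one label per line.\n\n" + numbered)

requirements = [
    "The system shall export reports as PDF.",
    "The UI shall respond within 200 ms.",
    "Users shall be able to reset their password.",
]

# With batch_size=1 this would mean three LLM calls; with batch_size=4 only one.
for batch in make_batches(requirements, batch_size=4):
    prompt = build_prompt(batch)  # send `prompt` to whichever model you are evaluating
```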
As we’ll see below, there isn’t necessarily a “best” batch size, although a worst one arguably does exist.
The paper reports an experiment in which three popular open-source LLMs – DeepSeek R1 Distill Qwen 14B, Llama 3.1 8B-instruct, and Gemma 3 12B – were evaluated on four pre-existing datasets containing software requirements. Each model was tested using seven different batch sizes (1, 2, 4, 8, 16, 32, and 64). To ensure reproducibility, the researchers used a temperature of 0.01 and a fixed seed.
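A rough sketch of that experimental grid might look like the following. The `run_classification` helper, dataset names, and seed value are placeholders, not the authors’ actual code; only the model list, batch sizes, and temperature come from the paper.

```python
import itertools

# Models and batch sizes evaluated in the paper.
MODELS = ["deepseek-r1-distill-qwen-14b", "llama-3.1-8b-instruct", "gemma-3-12b"]
BATCH_SIZES = [1, 2, 4, 8, 16, 32, 64]
DATASETS = ["dataset_a", "dataset_b", "dataset_c", "dataset_d"]  # placeholder names

def run_classification(model: str, dataset: str, batch_size: int,
                       temperature: float = 0.01, seed: int = 0) -> dict:
    """Hypothetical helper: build batched prompts, call the model with a fixed
    temperature and seed for reproducibility, and return accuracy metrics."""
    return {"accuracy": None}  # stub

# One run per (model, dataset, batch size) combination.
results = {
    (model, dataset, size): run_classification(model, dataset, size)
    for model, dataset, size in itertools.product(MODELS, DATASETS, BATCH_SIZES)
}
```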
Classification of requirements actually consists of two distinct tasks: determining whether a requirement describes a functionality, and determining whether it addresses a quality aspect.
Because prompt design heavily influences performance, the authors followed three well-established prompting patterns (combined in the sketch after this list):
- The persona pattern prompts the LLM to adopt a specific point of view so it knows what details to focus on.
- The template pattern forces the LLM to produce output in a specific, consistent format.
- The few-shot pattern involves including a number of examples in the prompt so that the model can see what kind of output is expected.
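Here is a minimal sketch of how the three patterns could be combined in a single classification prompt. The persona wording, output format, and examples are illustrative assumptions, not the paper’s exact prompts.

```python
def build_classification_prompt(requirements: list[str]) -> str:
    """Combine the persona, template, and few-shot patterns in one prompt."""
    persona = "You are an experienced requirements engineer."  # persona pattern
    template = (  # template pattern: one label per line, fixed format
        "For each numbered requirement, answer on its own line in the format "
        "'<number>: F' (functional) or '<number>: Q' (quality)."
    )
    few_shot = (  # few-shot pattern: worked examples of the expected output
        "Examples:\n"
        "Requirement: The system shall send a confirmation e-mail after registration.\n"
        "Answer: F\n"
        "Requirement: The system shall be available 99.9% of the time.\n"
        "Answer: Q"
    )
    numbered = "\n".join(f"{i + 1}. {r}" for i, r in enumerate(requirements))
    return "\n\n".join([persona, template, few_shot, "Requirements:\n" + numbered])
```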
Although requirement classification seems like a narrow, well‑defined task, the results show surprising variability – similar to what happens when you ask different models to generate code.
DeepSeek’s performance on functional classification is relatively stable across datasets and batch sizes, but its performance on quality classification varies depending on batch size. A size of 1 performs the worst, while a size of 32 yields the best results.
Llama’s performance on both tasks gradually decreases as batch size increases, though its sensitivity to batch size is less pronounced than DeepSeek’s.
Gemma shows stable performance on functional classification as well, but also has varying performance on quality classification. Accuracy improves steadily up to a batch size of 8, then drops again at larger sizes.
Finally, an ensemble approach that combines the three models’ predictions using majority voting yields consistently stable performance across all datasets and batch sizes. For quality classification, the average performance increases up to size 8, after which it decreases only slightly.
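A majority-voting ensemble over three binary classifiers is straightforward to sketch; the model names and labels below are illustrative.

```python
from collections import Counter

def majority_vote(predictions: dict[str, list[str]]) -> list[str]:
    """Combine per-model label lists ('F' / 'Q') by majority voting.

    `predictions` maps a model name to its labels for the same requirements.
    With three models and a binary label there is always a strict majority.
    """
    per_model = list(predictions.values())
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*per_model)]

# Example: three (hypothetical) model outputs for four requirements.
votes = {
    "deepseek": ["F", "Q", "F", "Q"],
    "llama":    ["F", "F", "F", "Q"],
    "gemma":    ["Q", "Q", "F", "Q"],
}
print(majority_vote(votes))  # ['F', 'Q', 'F', 'Q']
```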
What this experiment shows is that each large language model responds differently to different batch sizes. This means batch size should be chosen based on the specific characteristics of both the dataset and the model. Interestingly, a batch size of 1 leads to the largest standard deviation for all models, suggesting that, despite its popularity among researchers, it’s probably not a great default choice.
The ensemble model’s performance is more balanced in terms of precision, recall, and specificity, but it still varies noticeably between batch sizes. So even when using an ensemble, batch size must be selected carefully.
All models also struggle more with correctly classifying quality requirements than functional ones. A qualitative analysis suggests this happens because many requirements mix functional and quality aspects, and the models tend to latch onto the functional part while overlooking the quality aspect. This issue can potentially be mitigated by including more few-shot examples that feature requirements containing both functional and quality aspects.
- Large language models can be used to classify requirements as functional or quality-related.
- Batching multiple requirements into one prompt can improve performance in terms of both accuracy and computational efficiency.
- Batch size must be chosen carefully based on the characteristics of the model and the requirements being classified.

