Chuniversiteit.nl
The Toilet Paper

The current(-ish) state of LLM-based multi-agent systems for software engineering

At this point, we may just be reinventing real-world software engineering practices using LLMs.

Two agents high-fiving each other
It’s a high vibe

When ChatGPT first came out in 2022, it blew everyone’s minds. For the first time, we had a somewhat believable “AI” that could answer questions about pretty much anything – from whether it’s okay in DDD to use a repository class to save entities, to when and how long you should hold eye contact when passing a colleague in an uncomfortably long hallway. It’s only been a few years since LLMs became a “normal” part of our lives, but it honestly feels like an eternity.

By contrast, this week’s paper on LLM-based multi-agent systems was published just over half a year ago and still feels fresh even though quite a lot has already changed in that relatively short period.

Nevertheless, we’ll look at what LLM-based multi-agent systems actually are, what the paper considers to be the state of the art and how well such systems work in practice. Many things will sound familiar to people who orchestrate LLMs in their daily work, but the paper also includes a few concepts that you might not have heard of yet, or that are likely a better version of what you are currently doing.

LMAs in a nutshell

Singular LLM-based agents typically act as jacks of all trades, or are prompted to take on a specific role. However, real-world problems often span multiple domains and require expertise from various fields, which limits the quality of a single agent’s work or responses.

LLM-based multi-agent (LMA) systems get around this problem by having multiple specialised agents, each with their own unique skills and abilities, work together towards a common goal, using collaborative activities like debate and discussion. This mechanism has proven to encourage divergent thinking, enhance factuality and reasoning, and ensure thorough validation.

Integrating LMA systems will likely speed up software development and transform the way we work. Some of the expected benefits include:

  • Automation of certain software engineering tasks: high-level requirements can be broken down into sub-tasks and detailed implementations, mirroring how agile methodologies break tasks down and assign them to teams or individuals. This frees developers to focus on strategic planning, design thinking, and innovation.

  • LMA systems make it easier to detect and correct faults early in the development process. On their own, LLMs are prone to hallucination. However, by debating, examining, and validating responses from multiple agents, LMA systems ensure convergence on a single, more accurate, and robust solution.

  • Software systems increasingly grow in complexity. LMA systems offer an effective scaling solution by incorporating additional agents for new technologies and reallocating tasks among agents based on evolving project needs.

An LMA system consists of two primary components: an orchestration platform and LLM-based agents.

The orchestration platform manages interactions and interaction flow among agents. It defines several key characteristics, including how agents interact (cooperatively, competitively, hierarchically, or in a mixed fashion), how interaction flows (centralised, decentralised, or hierarchical), and how planning and learning happen (centralised or decentralised).

Each LLM-based agent may have unique abilities and specialised roles, which enhances the system’s ability to handle diverse tasks effectively. Agents can be explicitly predefined or dynamically generated by LLMs, and may be homogeneous (do the same things) or heterogeneous (have diverse functions and expertise).
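To make these moving parts concrete, here is a minimal sketch of a centralised orchestrator that routes tasks to heterogeneous, role-specialised agents. All names are illustrative, and `stub_llm` stands in for a real LLM call:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Agent:
    """An LLM-backed agent with a specialised role."""
    name: str
    role: str                   # describes the agent's specialism
    ask: Callable[[str], str]   # wraps a call to the underlying LLM

def stub_llm(role: str) -> Callable[[str], str]:
    # Placeholder for a real LLM call; just echoes the role.
    return lambda task: f"[{role}] response to: {task}"

class Orchestrator:
    """Centralised interaction flow: routes each task to one agent by role."""
    def __init__(self, agents: list[Agent]):
        self.agents = {a.role: a for a in agents}

    def dispatch(self, role: str, task: str) -> str:
        return self.agents[role].ask(task)

team = [Agent("Ann", "reviewer", stub_llm("reviewer")),
        Agent("Bob", "tester", stub_llm("tester"))]
hub = Orchestrator(team)
print(hub.dispatch("tester", "write tests for parse()"))
```

A decentralised or hierarchical flow would replace the single `Orchestrator` with agents that message each other directly, or with orchestrators nested inside orchestrators.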

Literature review on LMA systems

The authors conducted a systematic literature review on recent studies of LMA systems in software engineering, which yields a fairly long list of projects that use LMA systems in some specific way. I won’t go through the entire list, but I will describe a few areas in which they are used, in the hope that at least one is directly useful to you or provides inspiration for you to develop something of your own.

Requirements engineering

Elicitron uses LLM-based agents to represent a diverse array of simulated users who engage in product interactions, and provide insights into user needs.

MARE uses a combination of agents (stakeholder, collector, modeller, checker and documenter) to help generate high-quality requirements and specifications.

Sami et al. proposed a framework for generating, evaluating, and prioritising user stories through a collaborative process involving four agents: a product owner, developer, QA, and manager.
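As a rough illustration of such a pipeline (not Sami et al.’s actual implementation), each of the four roles can be stubbed out as a plain function, with the manager prioritising what the other roles produce and assess:

```python
# Hypothetical four-role user-story pipeline; each function is a
# stand-in for an LLM-backed agent.
def product_owner() -> list[str]:
    """Drafts candidate user stories."""
    return ["As a user, I can log in",
            "As a user, I can export data"]

def developer(story: str) -> int:
    """Estimates effort (stubbed as word count)."""
    return len(story.split())

def qa(story: str) -> bool:
    """Flags whether a story is testable (stubbed heuristic)."""
    return "can" in story

def manager(stories: list[str]) -> list[str]:
    """Prioritises: keep testable stories, cheapest first."""
    return sorted((s for s in stories if qa(s)), key=developer)

backlog = manager(product_owner())
```

In a real system each function would be a separately prompted agent, and the manager’s prioritisation would come from a collaborative discussion rather than a sort key.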

Code generation

Frameworks for code generation typically rely on role specialisation and iterative feedback loops to optimise collaboration among agents. Such systems usually involve several roles.

An Orchestrator manages high-level planning. A Programmer is responsible for writing the initial version of the code. Once that’s done, a Reviewer and Tester step in to provide constructive feedback on quality, functionality, and adherence to requirements. This feedback creates an iterative cycle, where the Programmer improves the code or a Debugger resolves identified issues. Some frameworks use an Information Retriever to find information that may help with certain tasks, for example to find examples of similar problems, or to interact with databases.

Agent Forest is a different approach, where multiple agents independently generate candidate outputs. Outputs are then evaluated against each other based on similarity, and the output with the highest score – indicating the greatest consensus among agents – is selected as the final solution.
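A minimal sketch of this style of selection, using plain string similarity from Python’s `difflib` as a stand-in for whatever similarity measure a real system would use:

```python
from difflib import SequenceMatcher

def select_by_consensus(candidates: list[str]) -> int:
    """Return the index of the candidate most similar to all others."""
    def score(i: int) -> float:
        return sum(SequenceMatcher(None, candidates[i], c).ratio()
                   for j, c in enumerate(candidates) if j != i)
    return max(range(len(candidates)), key=score)

# Three independently generated candidate outputs; two agree in substance.
outputs = ["return a + b",
           "return a + b  # sum",
           "return a * b"]
best = outputs[select_by_consensus(outputs)]
```

The idea is simply majority-by-similarity: the outlier (`a * b`) scores lower against the group than the candidates that agree with each other.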

Software QA

Fuzz4All generates testing input for systems across multiple programming languages. AXNav automates accessibility testing. It interprets natural language instructions and executes accessibility tests, such as VoiceOver, on iOS devices. LMA systems are also used for other traditional QA tasks such as penetration testing, user acceptance testing, and GUI testing.

LMA systems can also be used after things go wrong. For example, RCAgent performs root cause analysis in cloud environments by using LLM-based agents to collect system data, analyse logs, and diagnose issues.

Software maintenance

Quite a number of LMA systems have been proposed for debugging, using different agents for bug reproduction, fault localisation, patch generation, and validation. Many appear to be domain-agnostic, although some domain-specific examples exist, for example for access control vulnerabilities in smart contracts.

Code review is also an area where LMA systems are used, with agents specialised in topics such as bug detection, code smells, and optimisation. Multi-agent architectures have also been proposed to predict which test cases need maintenance after source code changes.

Software process models

End-to-end software development covers the entire process of creating software products. Human developers typically adopt established software process models such as agile and waterfall, and the design of LMA systems often draws inspiration from these existing models, emulating parts or all of them.

Think-on-process (ToP), on the other hand, uses a dynamic process generation framework, assuming that there is no such thing as a one-size-fits-all process and instead uses LLMs to create tailored process instances suitable for the given requirements.

Finally, some papers describe using experiences from past software projects to enhance new development efforts. Co-learning does this by using insights gathered from historical communications. When done iteratively, agents can continually adapt by learning from experiences from previous tasks.

LMAs in practice

To study the effectiveness of LMA systems, the authors conducted two case studies. They used ChatDev to autonomously develop two classic games using GPT-3.5 Turbo: Snake and Tetris.

The first attempt at recreating Snake was unsuccessful. After resubmitting the same prompt, the second attempt produced a playable version; attempts took an average of 76 seconds and cost $0.019 each.

ChatDev had more difficulty recreating Tetris’ gameplay across the first nine attempts. Only on the tenth attempt did ChatDev produce a Tetris game that met most of the prompt requirements, though it still lacked core functionality: completed rows were not removed. Nevertheless, the development process remained efficient, with an average time of 70 seconds and a cost of $0.020 per attempt.

These case studies show that LMA systems perform well on reasonably complex tasks, but limitations remain that prevent them from handling more demanding tasks requiring deeper logical reasoning and abstraction.

Summary

  1. LLM-based multi-agent (LMA) systems achieve better outcomes by having specialised agents work together towards a common goal