How laypeople try (and fail) to design LLM prompts
Modern pre-trained large language models (LLMs) like ChatGPT can engage in fluent conversations out-of-the-box. Their outputs can be tweaked by prepending prompts – textual instructions and examples of desired interactions – to LLM inputs.
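To make “prepending a prompt” concrete, here is a minimal sketch in Python. Everything in it is hypothetical – `complete` stands in for whichever LLM completion API is being used, and the tutor prompt is purely illustrative:

```python
# Minimal sketch of what "prepending a prompt" means in practice. The
# `complete` function is a stand-in for an arbitrary LLM completion API,
# not a real library call.
def complete(text: str) -> str:
    raise NotImplementedError("call your LLM provider here")

# A prompt: an instruction plus one example of a desired interaction.
PROMPT = (
    "You are a patient maths tutor. Answer in one or two sentences.\n"
    "User: What is a prime number?\n"
    "Tutor: A whole number greater than 1 that only 1 and itself divide evenly.\n"
)

def reply_to(user_message: str) -> str:
    # The designer's prompt is simply prepended to the user's input.
    return complete(PROMPT + "User: " + user_message + "\nTutor:")
```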
While prompting LLMs may seem easy – after all, prompts are simply written in plain English – designing effective prompting strategies requires a lot of work: you’ll need to understand in what situations an LLM might make mistakes, devise strategies to overcome potential mistakes, and systematically evaluate the effectiveness of those strategies. These tasks are typically done by prompt engineers, and are challenging even for LLM experts.
How do non-experts fare in comparison?
Even for experts, prompt engineering requires a lot of trial and error. Having said that, ongoing NLP research does offer some hints towards effective prompt design strategies:
- Give examples of desired interactions in prompts: Examples substantially improve the performance of LLMs on a wide variety of tasks.
- Write prompts that look (somewhat) like code: For example, clearly demarcating prompt text and input data using a templating language like Jinja results in prompts that are more robust (see the sketch after this list).
- Repeat yourself: Repetition in prompts works really well for some reason.
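As a rough illustration of what these three strategies might look like when combined, here is a hedged sketch. The Jinja template, the persona, and the wording are all invented for illustration; only the general shape – a template that demarcates instructions and input, a worked example, and a repeated key instruction – reflects the strategies above:

```python
# Sketch combining the three strategies: a few-shot example, code-like
# structure via a Jinja template, and deliberate repetition of the key
# instruction. The wording is illustrative, not taken from the paper.
from jinja2 import Template  # pip install jinja2

PROMPT_TEMPLATE = Template("""\
### Instructions
You are a customer-support assistant. Keep every reply under three sentences
and always end by asking whether the issue is resolved.

### Example interaction
User: My export keeps failing.
Assistant: Sorry about that! Try exporting as CSV instead of XLSX. Did that fix it?

### Reminder (repetition often helps)
Keep every reply under three sentences and always end by asking whether the
issue is resolved.

### Conversation
User: {{ user_message }}
Assistant:""")

prompt = PROMPT_TEMPLATE.render(user_message="I can't reset my password.")
print(prompt)  # this string is what would be sent to the LLM
```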
Non-experts are unlikely to be familiar with these strategies. Previous research on the use of machine learning programming tools by end-users (non-ML experts) suggests that laypeople may face barriers and challenges related to:
- Design: “I don’t even know what I want the computer to do”
- Selection: “I know what I want the computer to do, but I don’t know what to use”
- Coordination: “I know what things to use, but I don’t know how to make them work together”
- Use: “I know what to use, but I don’t know how to use it”
- Understanding: “I thought I knew how to use this, but it didn’t do what I expected”
- Information: “I know why it didn’t do what I expected, but I don’t know how to check”
These previously identified challenges may help us better understand why non-experts struggle with prompt engineering.
The authors of this paper developed BotDesigner, a no-code prompt design tool that assists its users in creating their own chatbots. Ten non-expert prompt designers were asked to use this tool to improve a simple chatbot via prompt changes while thinking aloud.
The baseline version of this bot can list out ingredients and steps. The prompt designers were asked to prompt the chatbot such that it acts like a professional chef who walks an amateur (the user) through the various steps of cooking a recipe, while engaging in humour and social conversation, making analogies, simplifying complicated concepts, and asking for confirmation that specific steps have been completed.
All participants managed to iterate on chatbot prompts, but did so in an ad hoc, opportunistic way. Participants’ struggles with generating good prompts were largely caused by over-generalisation from single observations of success or failure, and by erroneously treating interactions with the chatbot as interactions with a real human being.
Participants started conversations as the baseline chatbot’s “user”, continuing until the bot did something wrong, e.g. giving a terse, humourless response, providing overly complicated instructions, or moving on too quickly.
When this happened, most participants immediately stopped the conversation. Only a few proceeded past the error or tried to “repair” the chat.
Participants then altered the chatbot’s “instructions” to explicitly request some other behaviour (“Make some jokes”). When stuck, most participants asked the interviewer for advice; a few searched the internet for possible solutions.
When attempting to fix an “error” in the bot’s responses, participants typically iterated until they observed a successful outcome. Most declared success after just a single correct instance, and gave up when they did not see the desired behaviour on a first or second attempt.
The specific behaviours and effects that were observed can be organised by subtask.
The first subtask is related to generating prompts:
- Confusion getting started: About half of participants began unsure of what kinds of behaviours they could expect from the bot and how to make modifications, which echoes the design and selection barriers to end-user programming of machine learning tools. Invalid assumptions about how such chatbots work also created learning barriers.
- Choosing the right instructions: Almost all participants struggled to find the right instructions. This was often the result of over-generalising from a single failure, which caused them to abandon prompts that would have led to success if phrased a little differently.
- Expecting human capabilities: A number of participants mixed behavioural commands directed at the bot with recipe commands directed at the bot’s users, with the expectation that the bot would know which is which (it doesn’t). Similarly, two participants seemed to think that the chatbot would remember prior conversations with users. Non-programmers in particular found the idea that a chatbot “resets” between conversations unintuitive.
- Socially appropriate ways of prompting: Participants tended to phrase their prompts politely – even when they were visibly frustrated – seemingly mimicking human-human interactions. Other signs were the rare use of repetition and participants’ bias towards direct instruction over examples, despite evidence of their effectiveness. In the case of examples, participants expressed concern that they’re not generalisable – now inferring too little rather than too much capability on the part of the LLM.
- Seeking help: Two participants who looked for solutions on the internet had a hard time making use of the examples they found, because they could not see how specific prompt designs translated across domains.
Despite their struggles with generating prompts, all participants ultimately succeeded in writing prompts that produced some, if not all, of the desired effects. Struggles with evaluating prompts (especially their robustness) were much more severe:
- Collecting data: Participants were exclusively opportunistic in their approaches to prompt design, and tried to fix errors without fully understanding problems. This behaviour echoes previous findings that non-experts debug programs opportunistically rather than systematically.
- Systematic testing: None of the participants made use of the systematic prompt testing interface in the BotDesigner tool (the idea is sketched below). This echoes behaviour seen in end-user programmers, who tend to be overconfident rather than cautious about testing and verification. However, some of the participants with programming experience did express concern about their process and the generalisability of their prompts.
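BotDesigner’s actual testing interface isn’t described here, but the underlying idea can be sketched: keep a fixed suite of test inputs with simple checks, and score every prompt variant against all of them instead of judging it on a single conversation. The function names, checks, and canned model reply below are all hypothetical:

```python
# Hypothetical sketch of systematic prompt testing (not BotDesigner's actual
# interface): every prompt variant is run against the same fixed set of test
# inputs and simple checks, rather than being judged on one lucky conversation.
def complete(text: str) -> str:
    # Stand-in for a real LLM call; returns a canned reply so the sketch runs.
    return "Sure! Crush the clove with the flat of your knife first. Ready to go on?"

TEST_CASES = [
    # (user message, check applied to the bot's reply)
    ("I've never minced garlic before.", lambda reply: len(reply.split()) < 60),  # keeps it simple
    ("The water is boiling, what now?", lambda reply: "?" in reply),              # asks for confirmation
]

def evaluate(prompt: str) -> float:
    """Return the fraction of test cases a prompt variant passes."""
    passed = 0
    for user_message, check in TEST_CASES:
        reply = complete(prompt + "\nUser: " + user_message + "\nChef:")
        if check(reply):
            passed += 1
    return passed / len(TEST_CASES)

for variant in [
    "You are a helpful chef.",
    "You are a helpful chef. Ask the user to confirm each step before moving on.",
]:
    print(variant, "->", evaluate(variant))
```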
Given the highly probabilistic nature of LLM output, participants had a hard time explaining a prompt’s effects:
- Generalising across contexts: Participants often found themselves in situations where a prompt design that worked in one context did not work in another, similar context. This was often due to incorrect assumptions about how the bot should work, based on experiences giving instructions to other humans.
- Incapability as explanation: Participants were quick to make assumptions about what the chatbot could and could not do, and rarely attempted to rephrase prompts. They expected that semantically equivalent instructions would have semantically equivalent results, but in reality small changes in wording can lead to very different behaviour (a simple way to check this is sketched after this list).
- But why…?: Every participant at some point asked their interviewer “why did it do that?” In nearly every case, the answer is sadly “I don’t know”: LLMs are black boxes, so one can only speculate about why they behave the way they do.
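One practical answer to the paraphrase-sensitivity problem above is to probe it directly: send several rewordings of the same instruction through the model and compare the replies. The sketch below is hypothetical and uses a canned stand-in for the model call:

```python
# Sketch: probe how sensitive a prompt is to rewording by sending several
# semantically equivalent instructions and comparing the replies by eye.
# `complete` is a stand-in for an LLM call; it returns a canned string so
# the sketch runs without an API key.
def complete(text: str) -> str:
    return "(model reply would appear here)"

PARAPHRASES = [
    "Keep your answers short.",
    "Be brief in your answers.",
    "Answer concisely.",
]

USER_TURN = "User: How long should I knead the dough?\nChef:"

for instruction in PARAPHRASES:
    reply = complete(instruction + "\n" + USER_TURN)
    print(f"--- {instruction}\n{reply}\n")
```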
In short:
- Non-experts design and evaluate LLM chatbot prompts in an opportunistic way.
- Non-experts make incorrect assumptions about the capabilities of LLMs due to experiences with human-human interactions.