Modern pre-trained large language models (LLMs) like ChatGPT can engage in fluent conversations out-of-the-box. Their outputs can be tweaked by prepending prompts – textual instructions and examples of desired interactions – to LLM inputs.
While prompting LLMs may seem easy – after all, prompts are simply written in plain English – designing effective prompting strategies requires a lot of work: you’ll need to understand in what situations an LLM might make mistakes, devise strategies to overcome potential mistakes, and systematically evaluate the effectiveness of those strategies. These tasks are typically done by prompt engineers, and are challenging even for LLM experts.
How do non-experts fare in comparison?
Even for experts, prompt engineering requires a lot of trial and error. Having said that, ongoing NLP research does offer some hints towards effective prompt design strategies:
Give examples of desired interactions in prompts: Examples substantially improve the performance of LLMs on a wide variety of tasks.
Write prompts that look (somewhat) like code: For example, clearly demarcating prompt text and input data using a templating language like Jinja results in prompts that are more robust.
Repeat yourself: Repetition in prompts works really well for some reason.
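The three tips above can be combined in a single prompt. Below is a minimal, dependency-free sketch; all wording and names (`PROMPT_TEMPLATE`, `build_prompt`) are illustrative, and the `{{ }}` placeholder merely mimics Jinja’s syntax without requiring the library. The duplicated instruction is deliberate, per the “repeat yourself” tip.

```python
# Illustrative prompt combining the three tips: a worked example of the
# desired style, a clearly demarcated slot for user input, and repetition.
PROMPT_TEMPLATE = """\
You are a patient chef walking an amateur through a recipe.
Explain one step at a time. Use simple language. Use simple language.

Example of the desired interaction:
User: What's a roux?
Bot: Flour cooked in butter - think of it as glue for your sauce!

User: {{ user_message }}
Bot:"""

def build_prompt(user_message: str) -> str:
    # Clear demarcation: instructions live above; user data is substituted
    # only at this one marked slot (what Jinja's Template.render() would do).
    return PROMPT_TEMPLATE.replace("{{ user_message }}", user_message)

prompt = build_prompt("How do I chop an onion?")
```

With a real templating engine, the same separation of fixed instructions from interpolated data is what makes prompts more robust to odd user input.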
Non-experts are unlikely to be familiar with these strategies. Previous research on the use of machine learning programming tools by end-users (non-ML experts) suggests that laypeople may face barriers and challenges related to:
I don’t even know what I want the computer to do
I know what I want the computer to do, but I don’t know what to use
I know what things to use, but I don’t know how to make them work together
I know what to use, but I don’t know how to use it
I thought I knew how to use this, but it didn’t do what I expected
I know why it didn’t do what I expected, but I don’t know how to check
These previously identified challenges may help us better understand why non-experts struggle with prompt engineering.
The authors of this paper developed BotDesigner, a no-code prompt design tool that assists its users in creating their own chatbots. Ten non-expert prompt designers were asked to use this tool to improve a simple chatbot while thinking aloud.
The baseline version of this bot can list out ingredients and steps. The prompt designers were asked to prompt the chatbot such that it acts like a professional chef who walks an amateur (the user) through the various steps of cooking a recipe, while engaging in humour and social conversation, making analogies, simplifying complicated concepts, and asking for confirmation that specific steps have been completed.
All participants managed to iterate on chatbot prompts, but did so in an ad hoc, opportunistic way. Participants’ struggles with generating good prompts were largely caused by over-generalisation from single observations of success or failure, and by erroneously treating interactions with the chatbot as interactions with a real human being.
Participants start conversations as the baseline chatbot’s “user”, continuing until the bot does something wrong, e.g. giving a terse, humourless response, providing overly complicated instructions, or moving on too quickly.
When this happens, most participants immediately stop the conversation. Only a few participants proceed past this error or try to “repair” the chat.
Participants then alter the chatbot’s “instructions” to explicitly request some other behaviour (“Make some jokes”). When stuck, most participants asked the interviewer for advice. A few searched the internet for possible solutions.
When attempting to fix an “error” in the bot’s responses, participants typically iterate until they observe a successful outcome. Most participants declare success after just a single correct instance, and give up when they do not see the desired behaviour on a first or second attempt.
The specific behaviours and effects that were observed can be organised by subtask.
The first subtask is related to generating prompts:
Confusions getting started: About half of participants began unsure of what kinds of behaviours they could expect from the bot and how to make modifications, which echoes the design and selection barriers to end-user programming machine learning tools. Invalid assumptions about how such chatbots work also create learning barriers.
Choosing the right instructions: Almost all participants struggled with finding the right instructions. This is often the result of over-generalising from a single failure, which causes them to abandon prompts that would have led to success if phrased a little bit differently.
Expecting human capabilities: A number of participants mixed behavioural commands directed at the bot with recipe commands directed at the bot’s users, with the expectation that the bot would know which is which (it doesn’t).
Similarly, two participants seemed to think that the chatbot would remember prior conversations with users. Non-programmers in particular found the idea that a chatbot “resets” between conversations to be unintuitive.
Socially appropriate ways of prompting: Participants phrased their prompts in polite, socially appropriate ways – even when they were visibly frustrated – seemingly mimicking human-human interactions.
Other signs are the rare use of repetition and participants’ bias towards direct instruction over examples, despite evidence of their effectiveness. In the case of examples, participants expressed concern that they’re not generalisable – now inferring too little rather than too much capability on the part of the LLM.
Seeking help: Two participants who looked for solutions on the internet had a hard time making use of the examples they found, because they could not see how specific prompt designs translated across domains.
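The “reset” that non-programmers found unintuitive follows directly from how chat models are typically called: the model call itself is stateless, and any memory lives only in the transcript the client resends each turn. A sketch, with `call_llm` as a stub standing in for a real chat-completion API:

```python
# Illustrates why a chatbot "resets" between conversations: each call is
# stateless, so memory exists only in the client-side transcript.
def call_llm(history: list[dict]) -> str:
    # A real model sees only what is inside `history` for this single call.
    return f"(reply given {len(history)} prior messages)"

history = []  # the bot's entire "memory" is this client-side list

for user_turn in ["Hi!", "What was my first message?"]:
    history.append({"role": "user", "content": user_turn})
    reply = call_llm(history)  # the full transcript is sent every time
    history.append({"role": "assistant", "content": reply})

# A new conversation starts from an empty transcript - nothing carries over.
new_conversation = call_llm([{"role": "user", "content": "Remember me?"}])
```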
Despite struggles with generating prompts, all participants ultimately managed to write prompts that produced some, if not all, of the desired effects. Struggles with evaluating prompts (especially their robustness) were much more severe:
Collecting data: Participants were exclusively opportunistic in their approaches to prompt design, and tried to fix errors without fully understanding problems. This behaviour echoes previous findings that non-experts debug programs opportunistically rather than systematically.
Systematic testing: None of the participants made use of the systematic prompt testing interface in the BotDesigner tool. This behaviour echoes behaviour seen in end-user programmers who are overconfident in testing and verification, rather than cautious. However, some of the participants with programming experience did express concern about their process and the generalisability of their prompts.
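The kind of systematic testing participants skipped can be sketched as follows. This is a toy harness, not BotDesigner’s actual interface: `chatbot`, `passes`, and `evaluate` are hypothetical names, and the stub model fakes stochastic output. The point is to score a prompt over a fixed suite of inputs and many trials, rather than declaring victory after one lucky success.

```python
# Toy systematic-testing harness: measure a prompt's pass rate across a
# suite of test inputs and repeated trials, since LLM output is stochastic.
def chatbot(prompt: str, user_message: str, seed: int) -> str:
    # Stub model: "funny" on 7 of every 10 trials, mimicking randomness.
    return "Sure thing! *joke*" if seed % 10 < 7 else "Step 2: dice the onion."

def passes(reply: str) -> bool:
    return "*joke*" in reply  # toy success criterion: the reply has humour

def evaluate(prompt: str, test_inputs: list[str], trials: int = 10) -> float:
    """Return the pass rate across all inputs and trials."""
    results = [
        passes(chatbot(prompt, msg, seed))
        for msg in test_inputs
        for seed in range(trials)
    ]
    return sum(results) / len(results)

rate = evaluate("Be funny.", ["How do I dice an onion?", "What's next?"])
```

A pass rate over a whole suite is exactly the kind of evidence that protects against over-generalising from a single observation of success or failure.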
Given the highly probabilistic nature of LLM output, participants had a hard time explaining a prompt’s effects:
Generalising across contexts: Participants often found themselves in situations where a prompt design that worked in one context did not work in another, similar context. This is often due to incorrect assumptions about how the bot should work, based on experiences giving instructions to other humans.
Incapability as explanation: Participants were quick to make assumptions about what the chatbot could and could not do. Attempts to rephrase prompts were rarely made. The participants expected that semantically equivalent instructions would have semantically equivalent results, but in reality even small changes in phrasing can produce very different outputs.
But why…?: Every participant at one point asked their interviewer “why did it do that?” In nearly every case, the answer is sadly “I don’t know”: LLMs are black boxes, so one can only speculate about why they show certain behaviour.
Non-experts design and evaluate LLM chatbot prompts in an opportunistic way
Non-experts make incorrect assumptions about capabilities of LLMs due to experiences with human-human interactions