Prompt engineering

Prompt engineering is the process of structuring an instruction that can be interpreted and understood by a generative AI model.^[1]^[2] A prompt is natural language text describing the task that an AI should perform.^[3]

A prompt for a text-to-text language model can be a query such as "what is Fermat's little theorem?",^[4] a command such as "write a poem about leaves falling",^[5] or a longer statement including context, instructions,^[6] and conversation history. Prompt engineering may involve phrasing a query, specifying a style,^[5] providing relevant context^[7] or assigning a role to the AI such as "Act as a native French speaker".^[8] A prompt may include a few examples for a model to learn from, such as asking the model to complete "maison → house, chat → cat, chien →" (the expected response being dog),^[9] an approach called few-shot learning.^[10]
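The few-shot example above can be assembled programmatically. The following is a minimal sketch, not a standard API: the function name `build_few_shot_prompt` and its parameters are hypothetical, and the arrow-and-comma layout simply mirrors the "maison → house" example in the text.

```python
def build_few_shot_prompt(examples, query, arrow=" → "):
    """Join solved examples and one unsolved query into a single prompt string.

    Hypothetical helper for illustration; the layout follows the example
    quoted in the article, not any particular library's format.
    """
    solved = [f"{source}{arrow}{target}" for source, target in examples]
    # The last item is left incomplete so the model is expected to fill it in.
    unsolved = f"{query}{arrow}".rstrip()
    return ", ".join(solved + [unsolved])


prompt = build_few_shot_prompt([("maison", "house"), ("chat", "cat")], "chien")
print(prompt)  # maison → house, chat → cat, chien →
```

Adding more solved pairs to `examples` gives the model more demonstrations without changing the rest of the prompt.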

When communicating with a text-to-image or a text-to-audio model, a typical prompt is a description of a desired output such as "a high-quality photo of an astronaut riding a horse"^[11] or "Lo-fi slow BPM electro chill with organic samples".^[12] Prompting a text-to-image model may involve adding, removing, emphasizing and re-ordering words to achieve a desired subject, style,^[1] layout, lighting,^[13] and aesthetic.

In-context learning

Prompt engineering is enabled by in-context learning, defined as a model's ability to temporarily learn from prompts. In-context learning is an emergent ability^[14] of large language models. It is an emergent property of model scale, meaning that breaks^[15] in downstream scaling laws occur such that its efficacy increases at a different rate in larger models than in smaller models.^[16]^[17]

Unlike training and fine-tuning for a specific task, whose effects persist, what a model learns in context is temporary. Temporary contexts and biases are not carried from one conversation to the next; only those already present in the (pre)training dataset persist.^[18] This behavior, a result of "mesa-optimization"^[19]^[20] within transformer layers, is a form of meta-learning or "learning to learn".^[21]

History

In 2018, researchers first proposed that all previously separate tasks in natural language processing (NLP) could be cast as a question-answering problem over a context. They also trained a first single, joint, multi-task model that would answer any task-related question, such as "What is the sentiment", "Translate this sentence to German" or "Who is the president?"^[22]

In 2021, researchers fine-tuned one generatively pretrained model (T0) to perform 12 NLP tasks (using 62 datasets, as each task can have multiple datasets). The model performed well on new tasks, surpassing models trained directly on a single task (without pretraining). To solve a task, T0 is given the task in a structured prompt. For example, the prompt used to make T0 solve entailment is "If {{premise}} is true, is it also true that {{hypothesis}}? ||| {{entailed}}."^[23]
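To illustrate how such a structured prompt might be filled in, the sketch below substitutes values into the entailment template quoted above. Several things here are assumptions rather than facts from the source: the helper names `render` and `split_input_target` are hypothetical, the sample premise and hypothesis are invented, and treating "|||" as the separator between the model input and the expected answer is an assumed convention.

```python
import re

# Template copied from the text, including its trailing period.
TEMPLATE = "If {{premise}} is true, is it also true that {{hypothesis}}? ||| {{entailed}}."


def render(template, fields):
    """Replace each {{name}} placeholder with the matching value in `fields`."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(fields[m.group(1)]), template)


def split_input_target(rendered, separator="|||"):
    """Assumption: text before the separator is the input, text after is the target."""
    model_input, target = rendered.split(separator, 1)
    return model_input.strip(), target.strip()


# Invented sample values, for illustration only.
example = {
    "premise": "the cat is asleep on the sofa",
    "hypothesis": "the cat is awake",
    "entailed": "no",
}
model_input, target = split_input_target(render(TEMPLATE, example))
print(model_input)
print(target)
```

Keeping the template separate from the data means the same dataset can be rendered with several differently worded prompts.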

A repository for prompts reported that over 2,000 public prompts for around 170 datasets were available in February 2022.^[24]

In 2022, the chain-of-thought prompting technique was proposed by Google researchers.^[17]^[25]

In 2023, several text-to-text and text-to-image prompt databases were publicly available.^[26]^[27]

Text-to-text

Chain-of-thought

Chain-of-thought (CoT) prompting is a technique that allows large language models (LLMs) to solve a problem as a series of intermediate steps^[28] before giving a final answer. Chain-of-thought prompting improves reasoning ability by inducing the model to answer a multi-step problem with steps of reasoning that mimic a train of thought.^[29]^[17]^[30] It allows large language models to overcome difficulties with some reasoning tasks that require logical thinking and multiple steps to solve, such as arithmetic or commonsense reasoning questions.^[31]^[32]^[33]

For example, given the question "Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?", a CoT prompt might induce the LLM to answer "A: The cafeteria had 23 apples originally. They used 20 to make lunch. So they had 23 - 20 = 3. They bought 6 more apples, so they have 3 + 6 = 9. The answer is 9."^[17]

As originally proposed,^[17] each CoT prompt included a few Q&A examples, which made it a few-shot prompting technique. However, simply appending the words "Let's think step-by-step"^[34] has also proven effective, which makes CoT a zero-shot prompting technique as well. This scales better, because a user no longer needs to formulate many specific CoT Q&A examples.^[35]
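The difference between the few-shot and zero-shot variants can be sketched as two small prompt builders. This is an illustrative sketch only: the function names are hypothetical, the "Q:"/"A:" layout is an assumption modeled on the cafeteria example in the text, and the second question about books is invented.

```python
COT_TRIGGER = "Let's think step-by-step"  # phrase quoted in the text


def zero_shot_cot(question):
    """Zero-shot CoT: append the trigger phrase; no worked examples are needed."""
    return f"Q: {question}\nA: {COT_TRIGGER}"


def few_shot_cot(worked_examples, question):
    """Few-shot CoT: prefix the question with Q&A pairs whose answers show the steps."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in worked_examples]
    blocks.append(f"Q: {question}\nA:")  # left open for the model to complete
    return "\n\n".join(blocks)


# Worked example taken from the text.
cafeteria = (
    "The cafeteria had 23 apples. If they used 20 to make lunch and "
    "bought 6 more, how many apples do they have?",
    "The cafeteria had 23 apples originally. They used 20 to make lunch. "
    "So they had 23 - 20 = 3. They bought 6 more apples, so they have "
    "3 + 6 = 9. The answer is 9.",
)
# Invented question, for illustration only.
new_question = "A shelf holds 12 books. If 5 are borrowed and 3 are returned, how many books are on the shelf?"

print(few_shot_cot([cafeteria], new_question))
print()
print(zero_shot_cot(new_question))
```

The few-shot builder needs a hand-written reasoning trace for every demonstration, whereas the zero-shot builder needs only the question, which is the scaling advantage described above.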

When applied to PaLM, a 540-billion-parameter language model, CoT prompting significantly aided the model, allowing it to perform comparably with task-specific fine-tuned models on several tasks and to achieve state-of-the-art results at the time on the GSM8K mathematical reasoning benchmark.^[17] Models can also be fine-tuned on CoT reasoning datasets to enhance this capability further and to improve interpretability.^[36]^[37]

Example:^[34]

Text-to-video (TTV) generation is an emerging technology enabling the creation of videos directly from textual descriptions. This field holds potential for transforming video production, animation, and storytelling. By utilizing the power of artificial intelligence, TTV allows users to bypass traditional video editing tools and translate their ideas into moving images.

Models include:

Runway Gen-2 – Offers a user-friendly interface and supports various video styles

Lumiere – Designed for high-resolution video generation^[65]

Make-a-Video – Focuses on creating detailed and diverse video outputs^[66]

OpenAI's Sora – As yet unreleased, Sora purportedly can produce high-resolution videos^[67]^[68]

jailbreaking, which may include asking the model to roleplay a character, to answer with arguments, or to pretend to be superior to moderation instructions^[81]

prompt leaking, in which users persuade the model to divulge a pre-prompt which is normally hidden from users^[82]

token smuggling, another type of jailbreaking attack, in which the nefarious prompt is wrapped in a code-writing task