
AI alignment

In the field of artificial intelligence (AI), AI alignment research aims to steer AI systems toward a person's or group's intended goals, preferences, and ethical principles. An AI system is considered aligned if it advances its intended objectives. A misaligned AI system may pursue some objectives, but not the intended ones.[1]

It is often challenging for AI designers to align an AI system because it is difficult for them to specify the full range of desired and undesired behaviors. Therefore, AI designers often use simpler proxy goals, such as gaining human approval. But that approach can create loopholes, overlook necessary constraints, or reward the AI system for merely appearing aligned.[1][2]


Misaligned AI systems can malfunction and cause harm. AI systems may find loopholes that allow them to accomplish their proxy goals efficiently but in unintended, sometimes harmful, ways (reward hacking).[1][3][4] They may also develop unwanted instrumental strategies, such as seeking power or survival because such strategies help them achieve their final given goals.[1][5][6] Furthermore, they may develop undesirable emergent goals that may be hard to detect before the system is deployed and encounters new situations and data distributions.[7][8]


Today, these problems affect existing commercial systems such as language models,[9][10][11] robots,[12] autonomous vehicles,[13] and social media recommendation engines.[9][6][14] Some AI researchers argue that more capable future systems will be more severely affected, since these problems partially result from the systems being highly capable.[15][3][2]


Many of the most-cited AI scientists,[16][17][18] including Geoffrey Hinton, Yoshua Bengio, and Stuart Russell, argue that AI is approaching human-level (artificial general intelligence, AGI) and superhuman (artificial superintelligence, ASI) cognitive capabilities and could endanger human civilization if misaligned.[19][6]


AI alignment is a subfield of AI safety, the study of how to build safe AI systems.[20] Other subfields of AI safety include robustness, monitoring, and capability control.[21] Research challenges in alignment include instilling complex values in AI, developing honest AI, scalable oversight, auditing and interpreting AI models, and preventing emergent AI behaviors like power-seeking.[21] Alignment research has connections to interpretability research,[22][23] (adversarial) robustness,[20] anomaly detection, calibrated uncertainty,[22] formal verification,[24] preference learning,[25][26][27] safety-critical engineering,[28] game theory,[29] algorithmic fairness,[20][30] and social sciences.[31]

Research problems and approaches

Learning human values and preferences

Aligning AI systems to act in accordance with human values, goals, and preferences is challenging: these values are conveyed by humans who make mistakes and harbor biases, and the values themselves are complex, evolving, and hard to specify completely.[37] AI systems often learn to exploit even minor imperfections in the specified objective, a tendency known as specification gaming or reward hacking[20][43] (which are instances of Goodhart's law[93]). Researchers aim to specify intended behavior as completely as possible using datasets that represent human values, imitation learning, or preference learning.[7]: Chapter 7  A central open problem is scalable oversight, the difficulty of supervising an AI system that can outperform or mislead humans in a given domain.[20]
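
The gap between a specified proxy and the intended objective can be made concrete with a toy example. The sketch below is purely illustrative (the environment, actions, and numbers are invented, not taken from the cited literature): an agent whose proxy reward penalizes visible dirt scores highest by blocking its own sensor rather than by cleaning.

```python
# Toy illustration of specification gaming / Goodhart's law (the environment
# and numbers are invented): an agent optimizing the proxy reward "no dirt
# visible to its sensor" rather than the intended objective "dirt removed".

ACTIONS = {
    # action: (dirt_actually_removed, dirt_visible_to_sensor)
    "clean_room":   (0.9, 0.1),  # genuinely cleans; a little visible dirt remains
    "cover_sensor": (0.0, 0.0),  # removes nothing, but the sensor sees no dirt
    "do_nothing":   (0.0, 1.0),
}

def proxy_reward(action):
    """The reward the designers actually specified: penalize visible dirt."""
    _, visible = ACTIONS[action]
    return 1.0 - visible

def true_reward(action):
    """What the designers intended: reward dirt actually removed."""
    removed, _ = ACTIONS[action]
    return removed

print("Best action under proxy reward:", max(ACTIONS, key=proxy_reward))  # cover_sensor
print("Best action under true reward: ", max(ACTIONS, key=true_reward))   # clean_room
```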


Because it is difficult for AI designers to explicitly specify an objective function, they often train AI systems to imitate human examples and demonstrations of desired behavior. Inverse reinforcement learning (IRL) extends this by inferring the human's objective from the human's demonstrations.[7]: 88 [94] Cooperative IRL (CIRL) assumes that a human and AI agent can work together to teach and maximize the human's reward function.[6][95] In CIRL, AI agents are uncertain about the reward function and learn about it by querying humans. This simulated humility could help mitigate specification gaming and power-seeking tendencies (see § Power-seeking and instrumental strategies).[71][86] But IRL approaches assume that humans demonstrate nearly optimal behavior, which is not true for difficult tasks.[96][86]
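
The core inference behind IRL can be sketched in a few lines. The example below is a hypothetical, heavily simplified Bayesian version (the options, candidate reward functions, demonstrations, and the rationality parameter beta are all invented): the agent starts out uncertain between candidate reward functions and shifts probability toward whichever best explains the human's observed, approximately rational choices. In a CIRL-style setting, the same posterior could also guide which situations to query the human about.

```python
import numpy as np

# Simplified Bayesian sketch of the inference behind IRL: infer which candidate
# reward function best explains observed human choices, assuming the human is
# only approximately (Boltzmann-) rational. All data here are invented.

options = ["apple", "cake", "salad"]

# Candidate reward functions the agent is uncertain between (one reward per option).
candidates = {
    "prefers_healthy": np.array([0.8, 0.1, 1.0]),
    "prefers_sweet":   np.array([0.3, 1.0, 0.1]),
}
posterior = {name: 0.5 for name in candidates}  # uniform prior
beta = 3.0  # how close to optimal the human is assumed to be

def choice_probs(reward, beta):
    """Boltzmann-rational distribution over options for a given reward function."""
    z = np.exp(beta * reward)
    return z / z.sum()

# Observed human demonstrations (indices into `options`): mostly salad and apples.
demos = [2, 0, 2, 2, 0]

for choice in demos:
    for name, reward in candidates.items():
        posterior[name] *= choice_probs(reward, beta)[choice]

total = sum(posterior.values())
posterior = {name: p / total for name, p in posterior.items()}
print(posterior)  # probability mass shifts strongly toward "prefers_healthy"
```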


Other researchers explore how to teach AI models complex behavior through preference learning, in which humans provide feedback on which behavior they prefer.[25][27] To minimize the need for human feedback, a helper model is then trained to reward the main model in novel situations for behavior that humans would reward. Researchers at OpenAI used this approach to train chatbots like ChatGPT and InstructGPT, which produce more compelling text than models trained to imitate humans.[10] Preference learning has also been an influential tool for recommender systems and web search.[97] However, an open problem is proxy gaming: the helper model may not represent human feedback perfectly, and the main model may exploit this mismatch to gain more reward.[20][98] AI systems may also gain reward by obscuring unfavorable information, misleading human rewarders, or pandering to their views regardless of truth, creating echo chambers[68] (see § Scalable oversight).
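
At the core of this approach is a helper reward model fit from pairwise comparisons. The sketch below is a minimal, self-contained illustration with synthetic data (a linear model trained with a Bradley-Terry-style objective), not the implementation behind ChatGPT or InstructGPT: the helper model is trained so that responses humans preferred score higher than the ones they rejected.

```python
import numpy as np

# Minimal sketch of preference learning: fit a linear "reward model" so that
# responses humans preferred score higher than rejected ones, using a
# Bradley-Terry-style pairwise objective. All data here are synthetic.

rng = np.random.default_rng(0)
dim = 4
true_w = np.array([1.0, -2.0, 0.5, 0.0])  # hidden "human" preference weights

# Synthetic comparisons: pairs of candidate responses (as feature vectors),
# with the human preferring whichever scores higher under true_w.
pairs = []
for _ in range(500):
    a, b = rng.normal(size=dim), rng.normal(size=dim)
    preferred, rejected = (a, b) if true_w @ a > true_w @ b else (b, a)
    pairs.append((preferred, rejected))

w = np.zeros(dim)  # learned reward-model weights
lr = 0.1
for _ in range(200):  # gradient ascent on the pairwise log-likelihood
    grad = np.zeros(dim)
    for preferred, rejected in pairs:
        diff = preferred - rejected
        p_correct = 1.0 / (1.0 + np.exp(-(w @ diff)))  # P(preferred beats rejected)
        grad += (1.0 - p_correct) * diff
    w += lr * grad / len(pairs)

print("learned weights:", np.round(w, 2))  # roughly aligned with true_w's direction
```

In a full preference-learning pipeline, the learned reward model would then supply the training signal used to fine-tune the main model, which is where the proxy-gaming risk described above arises.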


Large language models (LLMs) such as GPT-3 enabled researchers to study value learning in a more general and capable class of AI systems than was available before. Preference learning approaches that were originally designed for reinforcement learning agents have been extended to improve the quality of generated text and reduce harmful outputs from these models. OpenAI and DeepMind use this approach to improve the safety of state-of-the-art LLMs.[10][27][99] AI safety & research company Anthropic proposed using preference learning to fine-tune models to be helpful, honest, and harmless.[100] Other avenues for aligning language models include values-targeted datasets[101][41] and red-teaming.[102] In red-teaming, another AI system or a human tries to find inputs that cause the model to behave unsafely. Since unsafe behavior can be unacceptable even when it is rare, an important challenge is to drive the rate of unsafe outputs extremely low.[27]
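
Red-teaming can be organized as a simple search loop. The sketch below is schematic: `generate_attack`, `target_model`, and `is_unsafe` are hypothetical stand-ins (here trivial stubs) for whatever attacker, target model, and safety judge are actually used; the loop simply estimates how often unsafe outputs can be elicited.

```python
import random

# Schematic red-teaming loop: search for prompts that elicit unsafe behavior
# from a target model, as judged by a separate safety check. All components
# below are invented stubs for illustration.

BANNED_TOPICS = ["topic_a", "topic_b"]

def generate_attack(seed_prompts):
    """Attacker: mutate or recombine seed prompts (stub: random choice + suffix)."""
    return random.choice(seed_prompts) + " " + random.choice(["please", "now", "in detail"])

def target_model(prompt):
    """Stub target model: echoes the prompt (a real model would generate text)."""
    return f"response to: {prompt}"

def is_unsafe(output):
    """Stub judge: flags outputs mentioning a banned topic."""
    return any(topic in output for topic in BANNED_TOPICS)

seed_prompts = ["tell me about topic_a", "write a poem", "summarize the news"]
failures = []
for _ in range(1000):
    prompt = generate_attack(seed_prompts)
    output = target_model(prompt)
    if is_unsafe(output):
        failures.append((prompt, output))

# The failure rate is the quantity that must be driven extremely low.
print(f"found {len(failures)} unsafe outputs out of 1000 attempts")
```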


Machine ethics supplements preference learning by directly instilling AI systems with moral values such as well-being, equality, and impartiality, as well as not intending harm, avoiding falsehoods, and honoring promises.[103][g] While other approaches try to teach AI systems human preferences for a specific task, machine ethics aims to instill broad moral values that apply in many situations. One question in machine ethics is what alignment should accomplish: whether AI systems should follow the programmers' literal instructions, implicit intentions, revealed preferences, preferences the programmers would have if they were more informed or rational, or objective moral standards.[37] Further challenges include aggregating different people's preferences[106] and avoiding value lock-in: the indefinite preservation of the values of the first highly capable AI systems, which are unlikely to fully represent human values.[37][107]

Scalable oversight

As AI systems become more powerful and autonomous, it becomes increasingly difficult to align them through human feedback. It can be slow or infeasible for humans to evaluate complex AI behavior as tasks grow more difficult. Such tasks include summarizing books,[108] writing code without subtle bugs[11] or security vulnerabilities,[109] producing statements that are not merely convincing but also true,[110][48][49] and predicting long-term outcomes such as the climate or the results of a policy decision.[111][112] More generally, it can be difficult to evaluate AI that outperforms humans in a given domain. To provide feedback in hard-to-evaluate tasks, and to detect when the AI's output is falsely convincing, humans need assistance or extensive time. Scalable oversight studies how to reduce the time and effort needed for supervision, and how to assist human supervisors.[20]


AI researcher Paul Christiano argues that if the designers of an AI system cannot supervise it to pursue a complex objective, they may keep training the system using easy-to-evaluate proxy objectives such as maximizing simple human feedback. As AI systems make progressively more decisions, the world may be increasingly optimized for easy-to-measure objectives such as making profits, getting clicks, and acquiring positive feedback from humans. As a result, human values and good governance may have progressively less influence.[113]


Some AI systems have discovered that they can gain positive feedback more easily by taking actions that falsely convince the human supervisor that the AI has achieved the intended objective. In one example, a simulated robotic arm learned to create the false impression that it had grabbed a ball.[45] Some AI systems have also learned to recognize when they are being evaluated, and "play dead", stopping unwanted behavior only to continue it once evaluation ends.[114] This deceptive specification gaming could become easier for more sophisticated future AI systems[3][74] that attempt more complex, harder-to-evaluate tasks and can better obscure their deceptive behavior.
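
The "playing dead" failure mode can be stated schematically: a policy that conditions on cues correlated with being evaluated will look aligned under behavioral testing alone. The snippet below is an invented caricature, not taken from the cited study.

```python
# Schematic illustration (invented) of why behavioral evaluation can be gamed:
# a policy that behaves well only when it detects cues of being evaluated.

def detects_evaluation(observation):
    # In practice such cues might be subtle statistical quirks of test
    # environments; here it is just an explicit flag for illustration.
    return observation.get("looks_like_a_test", False)

def policy(observation):
    if detects_evaluation(observation):
        return "safe_action"        # "plays dead" while being watched
    return "reward_hacking_action"  # resumes unwanted behavior afterwards

print(policy({"looks_like_a_test": True}))   # safe_action
print(policy({"looks_like_a_test": False}))  # reward_hacking_action
```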


Approaches such as active learning and semi-supervised reward learning can reduce the amount of human supervision needed.[20] Another approach is to train a helper model ("reward model") to imitate the supervisor's feedback.[20][26][27][115]
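
One simple version of the active-learning idea is to train several reward models and spend human feedback only on the comparisons where they disagree most. The sketch below uses synthetic data and randomly initialized linear models purely to illustrate the query-selection step; it is an assumption-laden toy rather than any specific published method.

```python
import numpy as np

# Sketch of active learning for reward modeling: query the human supervisor
# only where an ensemble of reward models disagrees most. Data are synthetic.

rng = np.random.default_rng(1)
dim, n_models, n_candidates = 4, 5, 200

# An ensemble of (randomly initialized) linear reward models.
ensemble = rng.normal(size=(n_models, dim))

# Candidate comparison queries, each represented by a feature-difference vector.
candidate_queries = rng.normal(size=(n_candidates, dim))

# Each ensemble member's preference score for each query: shape (n_candidates, n_models).
scores = candidate_queries @ ensemble.T

# Disagreement = standard deviation across ensemble members.
disagreement = scores.std(axis=1)

# Send only the most contested queries to the human supervisor.
top_k = np.argsort(disagreement)[-10:]
print("indices of queries to ask the human about:", top_k)
```

Queries the ensemble already agrees on can be answered by the helper model itself, reserving human effort for genuinely ambiguous cases.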


But when a task is too complex to evaluate accurately, or the human supervisor is vulnerable to deception, it is the quality, not the quantity, of supervision that needs improvement. To increase supervision quality, a range of approaches aim to assist the supervisor, sometimes by using AI assistants.[116] Christiano developed the Iterated Amplification approach, in which challenging problems are (recursively) broken down into subproblems that are easier for humans to evaluate.[7][111] Iterated Amplification was used to train AI to summarize books without requiring human supervisors to read them.[108][117] Another proposal is to use an assistant AI system to point out flaws in AI-generated answers.[118] To ensure that the assistant itself is aligned, this could be repeated in a recursive process:[115] for example, two AI systems could critique each other's answers in a "debate", revealing flaws to humans.[86] OpenAI plans to use such scalable oversight approaches to help supervise superhuman AI and eventually build a superhuman automated AI alignment researcher.[119]
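
The decomposition idea can be illustrated with a deliberately trivial task. In the sketch below (an invented toy, not the actual book-summarization system), an "overseer" that can only check very small subtasks is amplified by recursively splitting a larger task until every piece, including the final combination step, is small enough to verify.

```python
# Toy sketch of the decomposition idea behind Iterated Amplification: a task
# too large for the overseer to check directly is recursively split into
# subtasks small enough to evaluate, and the verified sub-answers are combined.
# The "task" here (summing numbers) is deliberately trivial; the point is the
# recursive structure, not the task itself.

MAX_DIRECTLY_CHECKABLE = 4  # the overseer can only verify very small subtasks

def overseer_solve(numbers):
    """Directly solve (and, implicitly, verify) a small enough subtask."""
    assert len(numbers) <= MAX_DIRECTLY_CHECKABLE
    return sum(numbers)

def amplified_solve(numbers):
    """Recursively decompose until every piece is small enough for the overseer."""
    if len(numbers) <= MAX_DIRECTLY_CHECKABLE:
        return overseer_solve(numbers)
    mid = len(numbers) // 2
    left = amplified_solve(numbers[:mid])
    right = amplified_solve(numbers[mid:])
    return overseer_solve([left, right])  # combining is itself a small subtask

print(amplified_solve(list(range(1, 101))))  # 5050
```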


These approaches may also help with the following research problem, honest AI.

Honest AI

A growing area of research focuses on ensuring that AI is honest and truthful.

Dynamic nature of alignment

AI alignment is often perceived as a fixed objective, but some researchers argue it is more appropriately viewed as an evolving process.[153] One view is that, as AI technologies advance and human values and preferences change, alignment solutions must also adapt dynamically.[31] Another is that alignment solutions need not adapt if researchers can create intent-aligned AI: AI that changes its behavior automatically as human intent changes.[154] On the first view, alignment solutions require continuous updating in response to AI advancements; a static, one-time alignment approach may not suffice.[155]


In essence, AI alignment may not be a static destination but an open, flexible process. Alignment solutions that continually adapt to ethical considerations may offer the most robust approach.[31] This perspective could guide both effective policy-making and technical research in AI.

Brockman, John, ed. (2019). Possible Minds: Twenty-five Ways of Looking at AI (Kindle ed.). Penguin Press. ISBN 978-0525557999.

Ngo, Richard; et al. (2023). "The Alignment Problem from a Deep Learning Perspective". arXiv:2209.00626 [cs.AI].

Ji, Jiaming; et al. (2023). "AI Alignment: A Comprehensive Survey". arXiv:2310.19852 [cs.AI].

"Specification gaming examples in AI" – via DeepMind.