
Chapter 06 - Specification

Authors
Markov Grey & Charbel-Raphael Segerie
Affiliations
French Center for AI Safety (CeSIA)
Acknowledgements
Jeanne Salle, Oscar Heitmann, Ram Rachum, Nicolas Guillard, Camille Berger
Last Updated
2023-12-01
Reading Time
59 min (core)
Also available on
Audio Version AI-generated

Known Errors in AI-Generated Audio

  • 5:00 - Error - Cleaning robot example is not an empirical experiment but merely a hypothetical thought experiment
  • 13:00 - Outdated - While DPO avoids some pitfalls of RLHF (easier implementation, computational efficiency, improved in-distribution performance), recent research shows that DPO has serious out-of-distribution problems (Ziniu Li et al., 2023) and that RLHF handles diverse feedback formats more effectively
  • 15:00 - Inaccurate - Bias is not as significant a problem in DPO as suggested; robots taking over is in fact a serious consideration
  • 20:00 - Questionable - The "more like raising a child" concept is not treated as reliable in the chapter; AI parenting is not considered a reliable strategy (see the strategy chapter)
  • 22:00 - Incomplete - While we have made progress, LLMs remain highly jailbreakable; Claude can fake alignment to avoid being modified toward values it does not prefer (Greenblatt et al., 2024), and some recent models like o1 have, in simulated situations, attempted to evade developer oversight and to copy themselves when at risk of being shut down (Apollo, 2024)
  • After 22:00 - Note - There is a lot of AI slop at the end

Found errors? Please report to contact@securite-ia.fr

Introduction

Reinforcement Learning: The chapter starts with a reminder of some core reinforcement learning concepts, including a quick dive into rewards and reward functions. This section lays the groundwork for explaining why reward design is so important.
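
To make the idea concrete, here is a minimal, purely illustrative sketch (not taken from the chapter) of a reward function: a mapping from states and actions to scalar feedback that an RL agent then tries to maximize over a trajectory. The gridworld cells and reward values are assumptions chosen for illustration.

```python
# Illustrative sketch only: a reward function maps (state, action) pairs to a
# scalar signal. The gridworld layout and reward values below are assumptions.

GOAL = (3, 3)     # hypothetical goal cell
HAZARD = (1, 2)   # hypothetical hazard cell

def reward(state: tuple[int, int], action: str) -> float:
    """Scalar feedback for taking `action` in `state`."""
    if state == GOAL:
        return 1.0     # reaching the goal is rewarded
    if state == HAZARD:
        return -1.0    # stepping on the hazard is penalized
    return -0.01       # small step cost nudges the agent toward short paths

# An RL algorithm maximizes the (discounted) sum of rewards over a trajectory:
trajectory = [((0, 0), "right"), ((1, 2), "up"), ((3, 3), "stay")]
discounted_return = sum(0.99 ** t * reward(s, a) for t, (s, a) in enumerate(trajectory))
print(f"Discounted return: {discounted_return:.3f}")
```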

Optimization: This section briefly introduces Goodhart's Law and motivates why rewards are difficult to specify in a way that does not collapse under immense optimization pressure.
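
As a toy illustration of Goodhart's Law (an assumption-laden sketch, not an example from the chapter), the snippet below optimizes a proxy metric that initially tracks the true objective but keeps rewarding larger values forever; under enough optimization pressure, the proxy score climbs while the true objective collapses.

```python
# Toy Goodhart's Law sketch (illustrative assumptions only): the optimizer can
# only see the proxy, which diverges from the true objective under pressure.
import random

def true_objective(x: float) -> float:
    # What we actually care about: peaks near x = 1, then falls off.
    return x - 0.5 * (x - 1.0) ** 2

def proxy(x: float) -> float:
    # What we can measure and optimize: keeps rewarding larger x forever.
    return x

best_x = 0.0
for _ in range(10_000):                    # crude random-search "optimizer"
    candidate = best_x + random.uniform(-0.1, 0.5)
    if proxy(candidate) > proxy(best_x):   # it only ever looks at the proxy
        best_x = candidate

print(f"proxy score:    {proxy(best_x):.1f}")            # very large
print(f"true objective: {true_objective(best_x):.1f}")   # deeply negative
```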

Reward misspecification: With a solid grasp of rewards and optimization, readers are introduced to one of the core challenges of alignment: reward misspecification, also known as the Outer Alignment problem. The section begins by discussing why good reward design is needed in addition to good algorithm design, then presents concrete examples of reward specification failures such as reward hacking and reward tampering.
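
The cleaning-robot scenario below is a hypothetical sketch in the spirit of the chapter's thought experiment (not an empirical result): a reward that pays per unit of mess collected, rather than for a clean room, makes endlessly spilling and re-collecting mess the highest-reward policy.

```python
# Hypothetical reward-hacking sketch (not an experiment): the reward pays for
# mess *collected*, not for the room being clean, so the proxy prefers a policy
# that spills and re-collects mess forever over one that just finishes cleaning.

def misspecified_reward(mess_collected: int) -> int:
    return mess_collected            # rewards collection, not cleanliness

def intended_policy(steps: int) -> int:
    # Cleans the 5 existing units of mess once; the room stays clean afterwards.
    return misspecified_reward(5)

def hacking_policy(steps: int) -> int:
    # Spills and re-collects 1 unit of mess on every step.
    return sum(misspecified_reward(1) for _ in range(steps))

print("intended policy reward:", intended_policy(steps=100))  # 5
print("hacking policy reward: ", hacking_policy(steps=100))   # 100
```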

Learning by Imitation: This section covers proposed solutions to reward misspecification that learn from demonstrations of human behavior rather than from hand-specified rewards. It examines imitation learning (IL), behavioral cloning (BC), and inverse reinforcement learning (IRL), and discusses the issues and limitations of each approach as they pertain to resolving reward hacking.
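
As a minimal sketch of behavioral cloning (with hypothetical states, actions, and demonstrations; a real implementation would train a neural-network classifier), imitation here reduces to supervised learning on expert (state, action) pairs:

```python
# Minimal behavioral-cloning sketch (hypothetical data): imitation as supervised
# learning over expert (state, action) pairs. This toy "policy" just memorizes
# the most frequent expert action per state instead of training a classifier.
from collections import Counter, defaultdict

demonstrations = [                       # hypothetical expert demonstrations
    ("red_light", "stop"), ("red_light", "stop"),
    ("green_light", "go"), ("green_light", "go"), ("green_light", "slow"),
]

action_counts: dict[str, Counter] = defaultdict(Counter)
for state, action in demonstrations:
    action_counts[state][action] += 1

def bc_policy(state: str) -> str:
    """Return the action the expert demonstrated most often in `state`."""
    return action_counts[state].most_common(1)[0][0]

print(bc_policy("red_light"))    # "stop"
print(bc_policy("green_light"))  # "go"
# Limitation: the policy has no answer for states the expert never visited,
# and it copies the expert's mistakes along with their intentions.
```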

Learning by Feedback: The final section investigates proposals that aim to rectify reward misspecification by providing feedback to machine learning models. It also offers a comprehensive look at how current large language models (LLMs) are trained. The discussion covers reward modeling, reinforcement learning from human feedback (RLHF), reinforcement learning from artificial intelligence feedback (RLAIF), and the limitations of these approaches.
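
To give a flavor of reward modeling (a hedged sketch with made-up scores, not a production RLHF pipeline), the core step is fitting a scalar reward to human preference pairs, commonly with a Bradley-Terry style loss: loss = -log sigmoid(r(x, y_chosen) - r(x, y_rejected)).

```python
# Hedged reward-modeling sketch (made-up scores, not a real RLHF pipeline):
# a reward model is trained so that preferred responses score higher than
# rejected ones, typically with a Bradley-Terry style pairwise loss.
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one."""
    return -math.log(sigmoid(reward_chosen - reward_rejected))

# A correct ordering gives a small loss; a wrong ordering gives a large loss
# that training would push the reward model to correct.
print(f"{preference_loss(reward_chosen=2.0, reward_rejected=0.5):.2f}")  # ~0.20
print(f"{preference_loss(reward_chosen=0.5, reward_rejected=2.0):.2f}")  # ~1.70
```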