
Chapter 07 - Generalization

Authors
Markov Grey
Affiliations
French Center for AI Safety (CeSIA)
Acknowledgements
RA Writer, Charbel-Raphael Segerie, Jeanne Salle, Oscar Heitmann, Camille Berger, Josh Thorsteinson, Nicolas Guillard
Last Updated
2023-12-13
Reading Time
35 min (core)
Also available on
Audio Version AI-generated

Known Errors in AI-Generated Audio

  • Note: This is an AI-generated audio version.
  • General: Nothing in this podcast is explicitly false, but it is a bit superficial.
  • End section: There is a lot of slop at the end.

Found errors? Please report to contact@securite-ia.fr

Introduction

Goal Misgeneralization : This section introduces the concept of goals as distinct from rewards. It explains what can happen when a model's capabilities generalize but its goals do not. The section draws on examples from game-playing agents, LLMs, and other thought experiments to show how this can be a potentially catastrophic failure mode distinct from reward misspecification.
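To make the distinction concrete, here is a minimal, hypothetical sketch (not drawn from the chapter's own examples): a tabular Q-learning agent in a one-dimensional corridor where, during training, the rewarded coin always sits in the rightmost cell. On that training distribution, "go to the coin" and "go right" are indistinguishable, so the learned policy can keep its navigation ability while latching onto the wrong goal. All names and hyperparameters below are illustrative choices.

```python
# Toy sketch of goal misgeneralization (hypothetical setup, not from the chapter):
# during training the coin is always in the rightmost cell, so the reward cannot
# distinguish "get the coin" from "go right". At deployment the coin moves, and the
# learned policy still walks competently to the far right.

import random

CORRIDOR_LEN = 10
ACTIONS = [-1, +1]  # step left, step right

def greedy(q_row):
    """Index of the highest-valued action."""
    return max(range(len(ACTIONS)), key=lambda i: q_row[i])

def run_episode(q, coin_pos, start_pos, learn=False,
                epsilon=0.3, alpha=0.5, gamma=0.9, max_steps=40):
    """Run one episode; if learn=True, update Q-values with tabular Q-learning."""
    pos = start_pos
    for _ in range(max_steps):
        if learn and random.random() < epsilon:
            a = random.randrange(len(ACTIONS))   # explore
        else:
            a = greedy(q[pos])                   # exploit
        new_pos = min(max(pos + ACTIONS[a], 0), CORRIDOR_LEN - 1)
        reward = 1.0 if new_pos == coin_pos else 0.0
        if learn:
            target = reward + (0.0 if reward else gamma * max(q[new_pos]))
            q[pos][a] += alpha * (target - q[pos][a])
        pos = new_pos
        if reward:
            return True, pos                     # reached the coin
    return False, pos                            # timed out

random.seed(0)
q = [[0.0, 0.0] for _ in range(CORRIDOR_LEN)]

# Training distribution: the coin is ALWAYS at the far right, so "go right"
# and "go to the coin" yield identical rewards on every training episode.
for _ in range(3000):
    run_episode(q, coin_pos=CORRIDOR_LEN - 1,
                start_pos=random.randrange(CORRIDOR_LEN - 1), learn=True)

# Deployment: the coin moves to the left of the start cell. The agent still
# navigates competently, but it pursues "rightmost cell", not the coin.
got_coin, final_pos = run_episode(q, coin_pos=2, start_pos=CORRIDOR_LEN // 2)
print(f"Coin at cell 2 -> reached coin: {got_coin}, agent stopped at cell {final_pos}")
```

When the coin is moved at deployment, the printout shows the agent ending at the far-right cell rather than on the coin: its capabilities generalize, but its goal does not.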

Inner Alignment : The next section begins with an explanation of the machine learning process and how it can be viewed as a form of search. If training is a search over algorithms, then one type of algorithm that can be "found" is itself an optimizer. This motivates a discussion of the distinction between base optimizers and mesa-optimizers.

Deceptive Alignment : Having introduced mesa-optimizers, the next section describes the different types of mesa-optimizers that can arise, along with their corresponding failure modes. It also explores training dynamics that could make the emergence of deceptive alignment more or less likely.