Overview
- Goal Misgeneralization: This section introduces the concept of goals as distinct from rewards. It explains what it might look like if a model's capabilities generalize while its goals do not. The section draws on examples of game-playing agents, LLMs, and other thought experiments to show how this could be a potentially catastrophic failure mode distinct from reward misspecification.
- Inner Alignment: The next section begins with an explanation of the machine learning training process and how it can be seen as analogous to search. Because training searches over a space of possible algorithms, one type of algorithm that can be "found" is itself an optimizer (a toy illustration follows this list). This motivates a discussion of the distinction between base optimizers and mesa-optimizers.
- Deceptive Alignment: Building on the notion of mesa-optimizers, this section introduces the different types of mesa-optimizers that can arise, along with their corresponding failure modes. It also explores training dynamics that could increase or decrease the likelihood of deceptive alignment emerging.
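To make the base/mesa-optimizer distinction concrete, here is a minimal sketch (our own illustration, not from the source; all names and the toy task are hypothetical): a "base optimizer" does random search over candidate programs for a simple task, and one of those candidates is itself a small optimizer that searches at runtime.

```python
import random

random.seed(0)
TARGET = 7.3  # the training objective: output a number close to TARGET

def loss(output: float) -> float:
    return abs(output - TARGET)

def constant_program(c: float):
    """A simple candidate algorithm: always outputs the constant c."""
    return lambda: c

def mesa_optimizer_program(guess: float, steps: int = 100):
    """A candidate that is itself an optimizer: at runtime it searches for
    an output scoring well on its *own* internal objective, which in this
    toy case happens to coincide with the training loss."""
    def run() -> float:
        best = guess
        for _ in range(steps):
            candidate = best + random.uniform(-1, 1)
            if loss(candidate) < loss(best):  # inner search loop
                best = candidate
        return best
    return run

# The base optimizer: random search over a space of candidate programs.
candidates = [constant_program(random.uniform(0, 10)) for _ in range(50)]
candidates.append(mesa_optimizer_program(guess=0.0))

# Evaluate each candidate once and select the lowest-loss program.
results = [(program, program()) for program in candidates]
best_program, best_output = min(results, key=lambda pr: loss(pr[1]))
print(f"selected program's loss: {loss(best_output):.4f}")
```

With enough inner search steps, the searching candidate tends to beat every fixed constant, so the base search selects it; this is the intuition for why a training process that searches over algorithms may end up selecting an algorithm that is itself an optimizer.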