Interpretability

Authors
Jeanne Salle & Charbel-Raphael Segerie
Affiliations
French Center for AI Safety (CeSIA)
Acknowledgements
Markov Grey
Last Updated
2024-11-01
Reading Time
56 min (core)
Overview

We currently don’t understand how AI models work. We know how to train and build them, meaning we can design them and teach them to perform tasks, such as recognizing objects in images or generating coherent text in response to prompts. However, this does not mean we can always explain their behavior after training. As of now, we can’t explain why a network made a specific decision or produced a particular output. The goal of interpretability is to understand the inner workings of these networks and explain how they function, which in turn could allow us to better trust and control AI models.

Before reading this chapter, it is recommended that you be familiar with the transformer and CNN architectures.

Video 9.1: Optional Video. If you are unfamiliar with convolutional neural networks (CNNs), this video will help you get up to speed before reading this chapter.
Video 9.2: Optional Video. If you are unfamiliar with transformers, the videos on transformers in this playlist will help you get up to speed before reading this chapter.

For each method presented in this chapter, we first provide a high-level overview, followed by a more in-depth technical explanation. The technical explanations can be skipped.