Interpretability
Overview
We currently don’t understand how AI models work. We know how to build and train them, meaning we can design architectures and teach them to perform tasks such as recognizing objects in images or generating coherent text in response to prompts. However, this does not mean we can explain their behavior after training. As of now, we can’t explain why a network made a specific decision or produced a particular output. The goal of interpretability is to understand the inner workings of these networks and explain how they function, which in turn could allow us to better trust and control AI models.
Before reading this chapter, it is recommended that you be familiar with the transformer and CNN architectures.
For each method presented in this chapter, we first provide a high-level overview, followed by a more in-depth and technical explanation. The technical explanations can be skipped.