Advanced Machine Learning - Co-Teacher
MSc. Course, ITU Copenhagen, Computer Science, 2024
Lecture on Mechanistic Interpretability
The lecture focused on the challenges of interpreting deep neural networks (DNNs) and, more specifically, transformer-based large language models (LLMs).
The lecture covered a variety of topics related to the interpretability of AI systems, including:
- The difference between explainability-focused and interpretability-focused research
- Feature visualization techniques
- Attention visualization (see the sketch after this list)
- Circuit analysis
- Perspectives on the transformer architecture (the residual stream view)
- The problem of polysemanticity
- How to extract monosemantic features
- Representation engineering
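As a small illustration of the attention visualization topic, the sketch below prints, for a single attention head, the token each position attends to most strongly. It assumes the Hugging Face `transformers` library and the public `gpt2` checkpoint, which are stand-ins rather than the exact tools used in the lecture.

```python
# Minimal attention-visualization sketch (assumes Hugging Face transformers + GPT-2).
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_attentions=True)

inputs = tokenizer("The cat sat on the mat", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one (batch, heads, seq, seq) tensor per layer.
attn = outputs.attentions[0][0, 0]  # layer 0, head 0
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
for i, tok in enumerate(tokens):
    j = attn[i].argmax().item()  # position this token attends to most
    print(f"{tok!r:>12} -> {tokens[j]!r} ({attn[i, j].item():.2f})")
```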
If interested, I can provide the slides and lecture notes from the presentation.