Advanced Machine Learning - Co-Teacher

MSc. Course, ITU Copenhagen, Computer Science, 2024

Lecture on Mechanistic Interpretability

The lecture focused on challenges related to interpreting deep neural networks (DNNs) and more specifically transformer-based large language models (LLMs).

The lecture covered a variety of topics related to the interpretability of AI systems, including:

  • The difference between explainability-focused and interpretability-focused research
  • Feature visualization techniques
  • Attention visualization
  • Circuit analysis
  • Perspectives on the transformer architecture (the residual stream view)
  • The problem of polysemanticity
    • How to extract monosemantic features
  • Representation engineering

If interested, I can provide slides and lecture notes from the presentation.