I am teaching this module as part of the lecture series on Advanced Topics in Machine Learning and Natural Language Processing at the University of Cambridge.
Interpreting the Black Box: Explainable Neural Network Models
Neural networks are one of the most powerful classes of machine learning models, achieving state-of-the-art results on a wide range of benchmarks. A key factor behind their success is their ability to discover representations that capture the relevant underlying structure of the training data. However, most of these architectures are 'black box' models: it is very difficult to infer why a neural model has made a specific prediction.
Information in a neural architecture generally passes through multiple non-linear layers and is combined by millions of weights, making it extremely challenging to provide human-interpretable explanations or visualizations of the decision process. Recent work on adversarial examples has also shown that neural networks are often vulnerable to carefully constructed modifications of their inputs that are nearly imperceptible to humans, leading researchers to ask what these models are actually learning and how they can be improved.
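To make the idea of such modifications concrete, here is a minimal sketch of the Fast Gradient Sign Method, one of the simplest ways adversarial inputs are constructed. It assumes a PyTorch classifier `model` that returns logits, an input batch `x`, and gold labels `y`; these names are placeholders for illustration, not code from any of the papers below.

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, y, epsilon=0.01):
    """Fast Gradient Sign Method: nudge each input feature by +/- epsilon
    in the direction that increases the loss on the true label."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)   # model is assumed to return logits
    loss.backward()
    # A tiny step along the gradient sign is usually imperceptible to humans,
    # yet it can be enough to flip the model's prediction.
    return (x_adv + epsilon * x_adv.grad.sign()).detach()
```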
Creating neural network architectures that are interpretable is an active research area, as such models would provide multiple benefits:
- Data analysis. Knowing which information the model uses to make its decisions can reveal patterns and regularities in the dataset, providing novel insight into the task being solved.
- Model improvement. Understanding why the model makes specific incorrect decisions can show us how to improve it and can guide model development.
- Providing explanations. When automated systems are making potentially life-changing decisions, users will want to receive human-interpretable explanations for why these specific decisions were made.
Recent regulations also require that machine learning systems whose decisions can affect users be able to provide an explanation for their behaviour, making the need for interpretable models even more pressing.
In this module we will discuss different methods for interpreting the internal decisions of neural models, as well as approaches for explicitly designing architectures to be human-interpretable.
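As a small taste of such methods, the sketch below shows one of the simplest: a gradient-based saliency map, where the gradient of a class score with respect to the input serves as a rough measure of per-feature importance. It assumes a hypothetical PyTorch classifier `model` returning logits and a single-example input batch `x`; gradient-based ideas of this kind also underlie the Grad-CAM paper listed below.

```python
import torch

def input_gradient_saliency(model, x, target_class):
    """Gradient-based saliency map: how strongly does each input feature
    influence the score of the class we are interested in?"""
    x = x.clone().detach().requires_grad_(True)
    logits = model(x)                    # assumed shape: (1, num_classes)
    logits[0, target_class].backward()   # differentiate the chosen class score
    # The absolute gradient gives a rough per-feature importance map.
    return x.grad.abs()
```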
Papers for student presentations:
Why Should I Trust You?: Explaining the Predictions of Any Classifier (KDD 2016). Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin.
Generating Visual Explanations (ECCV 2016). Lisa Anne Hendricks, Zeynep Akata, Marcus Rohrbach, Jeff Donahue, Bernt Schiele, and Trevor Darrell.
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (ICML 2015). Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, and Yoshua Bengio.
Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization (ICCV 2017). Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra.
Explainable Prediction of Medical Codes from Clinical Text (NAACL 2018). James Mullenbach, Sarah Wiegreffe, Jon Duke, Jimeng Sun, and Jacob Eisenstein.
Evaluating Neural Network Explanation Methods Using Hybrid Documents and Morphosyntactic Agreement (ACL 2018). Nina Poerner, Hinrich Schütze, and Benjamin Roth.