Project proposals for the 2019/2020 MPhil in Advanced Computer Science, at the University of Cambridge.
Below are some project suggestions for the current academic year. I recommend discussing them with me before applying.
1) Supervised Interpretability for Text Classification
- Proposer: Marek Rei
- Supervisors: Marek Rei, Andrew Caines and Helen Yannakoudakis
- Resources: FCE dataset, CoNLL 2010 shared task dataset, SemEval 2013 Twitter dataset, Stanford Sentiment Treebank, e-SNLI.
- Course: ACS
- Requirements: Python experience. Pytorch or TensorFlow experience recommended.
Interpretability is important for any practical application of machine learning models, as it allows a model to justify its decisions and provides a way of analysing its errors. Possible approaches to neural network interpretability for natural language understanding include 1) attention architectures that allow us to inspect where the model is focusing when making decisions (e.g. Bahdanau et al. 2014), and 2) black-box interpretability methods that fit a simple interpretable model over the output of a more complex system (e.g. Ribeiro et al. 2016).
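To make the black-box approach concrete, here is a minimal sketch of obtaining token-level importance scores with the lime package. The toy sentences and the TF-IDF classifier are placeholders standing in for the datasets listed above; any model exposing a predict_proba-style interface would work.

```python
# A minimal sketch of the black-box approach (Ribeiro et al. 2016):
# the toy data and TF-IDF classifier are illustrative placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from lime.lime_text import LimeTextExplainer

# Toy sentiment data standing in for e.g. the Stanford Sentiment Treebank.
train_texts = ["a wonderful, moving film", "an absolute delight",
               "dull and painfully predictable", "a complete waste of time"]
train_labels = [1, 1, 0, 0]

classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(train_texts, train_labels)

# LIME fits a sparse linear model over perturbed copies of the input,
# producing an importance weight for each token.
explainer = LimeTextExplainer(class_names=["negative", "positive"])
explanation = explainer.explain_instance(
    "a dull but moving film", classifier.predict_proba, num_features=5)
print(explanation.as_list())  # [(token, weight), ...]
```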
We previously showed how attention-based models can be used for zero-shot sequence labeling (Rei & Søgaard, 2018). Recent work has found that the LIME method can give more accurate results on a related token-level interpretability task, but is roughly 6000 times slower than the attention-based approach (Thorne et al. 2019). In this project, we aim to combine the benefits of the two approaches. We can use the LIME model to generate predictions on the training data, then train a regular model to either a) predict the output of the LIME model as a multi-task objective, or b) use the LIME output to supervise the attention weights of the regular model. The aim is to create an architecture that is as fast as an attention-based model and as accurate on zero-shot sequence labeling as LIME. As an added bonus, the extra supervision could potentially also improve text classification performance.
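A minimal sketch of option b) in PyTorch, assuming the LIME token scores have been precomputed and normalised into a distribution over each sentence; the model sizes, loss functions and the mixing weight alpha are illustrative, not a fixed design.

```python
import torch
import torch.nn as nn

class AttentionClassifier(nn.Module):
    """Attention-based text classifier; returns both the sentence-level
    prediction and the per-token attention distribution."""
    def __init__(self, vocab_size, emb_dim=100, hidden_dim=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim,
                            bidirectional=True, batch_first=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)
        self.out = nn.Linear(2 * hidden_dim, 1)

    def forward(self, tokens):
        h, _ = self.lstm(self.embed(tokens))                # (batch, seq, 2h)
        a = torch.softmax(self.attn(h).squeeze(-1), dim=1)  # (batch, seq)
        pooled = torch.bmm(a.unsqueeze(1), h).squeeze(1)    # (batch, 2h)
        return torch.sigmoid(self.out(pooled)).squeeze(-1), a

model = AttentionClassifier(vocab_size=10000)
bce, mse = nn.BCELoss(), nn.MSELoss()

def joint_loss(tokens, labels, lime_scores, alpha=0.1):
    # lime_scores: precomputed LIME importances per token, normalised to
    # sum to 1 per sentence so they match the attention distribution.
    pred, attention = model(tokens)
    return bce(pred, labels) + alpha * mse(attention, lime_scores)
```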
References:
Zero-shot Sequence Labeling: Transferring Knowledge from Sentences to Tokens. NAACL 2018. Marek Rei and Anders Søgaard.
Generating Token-Level Explanations for Natural Language Inference. ACL 2019. James Thorne, Andreas Vlachos, Christos Christodoulopoulos and Arpit Mittal.
Neural machine translation by jointly learning to align and translate. arXiv 2014. Dzmitry Bahdanau, Kyunghyun Cho and Yoshua Bengio.
Why should I trust you?: Explaining the predictions of any classifier. KDD 2016. Marco Tulio Ribeiro, Sameer Singh and Carlos Guestrin.
2) Unsupervised Error Detection
- Proposer: Marek Rei
- Supervisors: Chris Bryant, Marek Rei and Paula Buttery
- Resources: BEA-2019 Shared Task Data
- Course: ACS
- Requirements: Python experience. Pytorch or TensorFlow experience recommended.
- Note: This project is also listed on Chris Bryant's page, as he is a co-supervisor.
Automated systems for detecting errors in learner writing are valuable tools for second language learning and assessment. Previous work has mostly treated error detection as a supervised sequence labeling task, requiring manually annotated training corpora (Rei & Yannakoudakis 2016, Rei 2017, Rei et al. 2017, Kasewa et al. 2018, Bell et al. 2019). Some recent work has also explored error correction and detection without annotated training data, but it still relies on hand-curated lexicons of all possible word forms (Bryant & Briscoe 2018, Stahlberg et al. 2019). In this project, we will explore fully unsupervised error detection, using only unannotated corpora and methods that can also be applied to other languages for which no error detection corpora are available.
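As a simple unsupervised baseline, tokens that a language model finds unexpected can be flagged as candidate errors. Below is a minimal sketch using GPT-2 through the transformers library; the model choice and scoring are illustrative, and the first token receives no score because the model needs a prefix to condition on.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def token_surprisals(sentence):
    """Return (token, -log p(token | prefix)) pairs under GPT-2."""
    ids = tokenizer.encode(sentence, return_tensors="pt")
    with torch.no_grad():
        logits = model(ids)[0]                      # (1, seq, vocab)
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = ids[0, 1:]                            # shift: predict next token
    surprisals = -log_probs[torch.arange(targets.size(0)), targets]
    tokens = tokenizer.convert_ids_to_tokens(targets.tolist())
    return list(zip(tokens, surprisals.tolist()))

# Tokens with unusually high surprisal are candidate errors; choosing the
# threshold (or a smarter scoring function) is part of the project.
for token, surprisal in token_surprisals("I look forward to see you soon ."):
    print(f"{token:>10}  {surprisal:.2f}")
```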
One possible strategy is to construct a neural error detection model, provide it with various kinds of information learned from plain text corpora, and train it as a discriminative error detector on synthetic data (a sketch of synthetic data generation follows the list below). Several components and extensions can be investigated:
- Pre-trained contextual word representations (BERT, ELMo, Flair, etc).
- Language models (GPT-2, LSTM-LM, Kneser-Ney, etc).
- Different methods for constructing synthetic data.
- Word and phrase occurrence statistics in different corpora.
- Features from other existing tools (POS taggers, parsers).
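A minimal sketch of one way to construct synthetic training data: corrupting sentences from a plain text corpus so that every introduced error receives a token-level label. The confusion sets and corruption probabilities here are illustrative placeholders; developing more realistic generation methods is part of the project.

```python
import random

# Illustrative confusion sets; in practice these could be derived from
# corpus statistics, POS tags or morphological analysis.
CONFUSIONS = {
    "to": ["too", "two"], "their": ["there", "they're"],
    "is": ["are", "was"], "a": ["an", "the"],
}

def corrupt(tokens, p=0.15, rng=random.Random(0)):
    """Corrupt a correct sentence into (tokens, labels) training data,
    where label 1 marks an artificially introduced error."""
    out_tokens, labels = [], []
    for tok in tokens:
        if rng.random() < p and tok.lower() in CONFUSIONS:
            out_tokens.append(rng.choice(CONFUSIONS[tok.lower()]))
            labels.append(1)              # substitution error
        elif rng.random() < p / 4:
            out_tokens += [tok, tok]      # duplication error
            labels += [0, 1]
        else:
            out_tokens.append(tok)
            labels.append(0)
    return out_tokens, labels

print(corrupt("I want to go to their house".split()))
```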
References:
Compositional sequence labeling models for error detection in learner writing. ACL 2016. Marek Rei and Helen Yannakoudakis.
Semi-supervised multitask learning for sequence labeling. ACL 2017. Marek Rei.
Artificial error generation with machine translation and syntactic patterns. BEA 2017. Marek Rei, Mariano Felice, Zheng Yuan and Ted Briscoe.
Wronging a right: Generating better errors to improve grammatical error detection. EMNLP 2018. Sudhanshu Kasewa, Pontus Stenetorp and Sebastian Riedel.
Context is key: Grammatical error detection with contextual word representations. BEA 2019. Samuel Bell, Helen Yannakoudakis and Marek Rei.
Language model based grammatical error correction without annotated training data. BEA 2018. Christopher Bryant and Ted Briscoe.
Neural grammatical error correction with finite state transducers. NAACL 2019. Felix Stahlberg, Christopher Bryant and Bill Byrne.