This page is for the course on Machine Learning for Language Modelling that I am teaching at the University of Tartu.
Time and location
| | Date | Time | Room |
| --- | --- | --- | --- |
| Lecture 1 | 6. April 2015 | 14:15 | J. Liivi 2 - 612 |
| Lecture 2 | 7. April 2015 | 14:15 | J. Liivi 2 - 611 |
| Lecture 3 | 8. April 2015 | 14:15 | J. Liivi 2 - 202 |
| Lecture 4 | 9. April 2015 | 14:15 | J. Liivi 2 - 202 |
| Practical | 10. April 2015 | 14:15 | J. Liivi 2 - 004 |
Lecture slides
- Lecture 1 - N-gram language models
- Lecture 2 - N-gram language model smoothing
- Lecture 3 - Neural network language models
- Lecture 4 - Neural network language model optimisation
- Practical
- Homework
Homework
Deadline: 5. May 2015, 23:59 Estonian time
- Implement a bigram language model with „stupid“ backoff (a scoring sketch follows this list).
- Finish a neural network language model. Java skeleton code will be provided, and you need to fill out the parts that perform feedforward and backpropagation passes through the network. Alternatively, you can implement a neural language model from scratch, in any language you wish.
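For the first task, here is a minimal sketch of the counting and scoring logic. It assumes whitespace-tokenised input with one sentence per line; the class and method names are illustrative (this is not part of the provided skeleton code), and 0.4 is the backoff weight suggested in the original stupid backoff paper.

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal bigram scorer with "stupid" backoff; names are illustrative. */
public class StupidBackoffBigram {
    private final Map<String, Integer> unigramCounts = new HashMap<>();
    private final Map<String, Integer> bigramCounts = new HashMap<>();
    private long totalTokens = 0;
    private static final double ALPHA = 0.4; // backoff weight from the original paper

    /** Update counts from one training sentence. */
    public void addSentence(String sentence) {
        String[] tokens = ("<s> " + sentence.trim() + " </s>").split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            unigramCounts.merge(tokens[i], 1, Integer::sum);
            totalTokens++;
            if (i > 0) {
                bigramCounts.merge(tokens[i - 1] + " " + tokens[i], 1, Integer::sum);
            }
        }
    }

    /** Stupid backoff score S(word | previous); not a normalised probability. */
    public double score(String previous, String word) {
        int bigram = bigramCounts.getOrDefault(previous + " " + word, 0);
        int context = unigramCounts.getOrDefault(previous, 0);
        if (bigram > 0 && context > 0) {
            return (double) bigram / context;
        }
        // Back off to the unigram relative frequency, scaled by ALPHA.
        return ALPHA * unigramCounts.getOrDefault(word, 0) / (double) totalTokens;
    }
}
```

Note that stupid backoff deliberately returns unnormalised relative-frequency scores rather than a proper probability distribution; that is what keeps it simple and fast.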
For your submission of both language models:
- Include the source code
- Evaluate your language models on the language modelling task and include your output file in your submission e-mail
- Include your achieved accuracy on the LM task. I expect at least 0.68 with either model
- Include instructions on how to compile and run your system, to reproduce the result
- Do not use existing language modelling or neural network optimisation libraries for your submission. These tasks are about learning how they work. Matrix algebra libraries are fine.
- Package everything up, upload it somewhere (e.g. Dropbox) and e-mail it to me: marek.rei@gmail.com
- You have three weeks from the end of the course. I recommend not leaving it to the last minute, as language model debugging and training can take time.
The neural network skeleton code is available on github: neurallm-exercise. More information in the readme file.
Dataset
I have prepared and preprocessed a dataset for you to use when developing and training your language models. It is created from Wikipedia text, separated into sentences, tokenised, and lowercased. The data is separated into training, development and test sets. The training set contains approximately 10M words. You can process these files as you wish, create separate subsets, etc.
The „unk100“ files have been preprocessed so that all words that occur fewer than 100 times in the training data are replaced by a special UNK token.
The „topNK“ files contain the first N thousand lines of the full file. Training on the full dataset is likely too time-consuming for the neural network language model, so I have made some smaller subsets.
Download the dataset here: lm-dataset.tar.gz
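If you want to build similar subsets yourself, for example with a different frequency cutoff, the unk100-style preprocessing only needs two passes over the file. A minimal sketch, assuming one tokenised sentence per line; the file names here are placeholders rather than the names used in the archive.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Replace words occurring fewer than a cutoff with UNK (file names are illustrative). */
public class UnkPreprocess {
    public static void main(String[] args) throws IOException {
        int minCount = 100;
        List<String> lines = Files.readAllLines(Paths.get("train.txt"));

        // First pass: count word frequencies in the training data.
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            for (String token : line.split("\\s+")) {
                counts.merge(token, 1, Integer::sum);
            }
        }

        // Second pass: rewrite words below the cutoff as UNK.
        List<String> output = new ArrayList<>();
        for (String line : lines) {
            StringBuilder sb = new StringBuilder();
            for (String token : line.split("\\s+")) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(counts.get(token) < minCount ? "UNK" : token);
            }
            output.add(sb.toString());
        }
        Files.write(Paths.get("train.unk100.txt"), output);
    }
}
```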
Practical
We'll train two neural network models. You can use the dataset described in the previous section as training data.
First, we'll look at the word2vec toolkit: http://code.google.com/p/word2vec/
- Download and compile the code.
- Train word vectors: ./word2vec -train trainingfile -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15
- Run ./distance vectors.bin and enter words to find other words with the most similar vectors. For example, try these words as input: island, university, france, night
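For reference, the distance tool ranks candidate words by the cosine similarity between their vectors. A minimal sketch of that measure, assuming two word vectors already loaded as float arrays:

```java
/** Cosine similarity, the measure used to rank neighbouring words. */
public class CosineSimilarity {
    /** Both vectors are assumed to have the same length. */
    public static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```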
Second, we'll try out RNNLM, the recurrent neural network language model: http://rnnlm.org/
- Download the toolkit and compile it. You might need to set "CC = g++" (or whichever compiler you have installed) in the makefile.
- Train a model: ./rnnlm -train trainingfile -valid validationfile -rnnlm model -hidden 15 -class 100 -bptt 4
- Test the model and measure perplexity (PPL): ./rnnlm -rnnlm model -test test (how perplexity is computed is sketched after this list)
- Change some parameters and see how that changes perplexity on the test set. For example, try a larger hidden layer (only 15 at the moment). In order to see the full list of settings, run ./rnnlm without parameters. You need to delete the model file every time, or use a different name, otherwise the system will continue training as opposed to starting fresh.
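As a reminder of what the PPL number reported by rnnlm means: perplexity is the exponentiated average negative log-probability that the model assigns to the words in the test set, so lower is better. A minimal sketch, assuming you have collected the model's probability for each test word into an array:

```java
/** Perplexity of a test set, given the model's probability for each word. */
public class Perplexity {
    public static double perplexity(double[] wordProbabilities) {
        // PPL = exp(-(1/N) * sum_i log p(w_i | history))
        double logProbSum = 0;
        for (double p : wordProbabilities) {
            logProbSum += Math.log(p);
        }
        return Math.exp(-logProbSum / wordProbabilities.length);
    }
}
```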