This page is for the course on Machine Learning for Language Modelling that I am teaching at the University of Tartu.
Time and location
| | Date | Time | Room |
| --- | --- | --- | --- |
| Lecture 1 | 6. April 2015 | 14:15 | J. Liivi 2 - 612 |
| Lecture 2 | 7. April 2015 | 14:15 | J. Liivi 2 - 611 |
| Lecture 3 | 8. April 2015 | 14:15 | J. Liivi 2 - 202 |
| Lecture 4 | 9. April 2015 | 14:15 | J. Liivi 2 - 202 |
| Practical | 10. April 2015 | 14:15 | J. Liivi 2 - 004 |
Lecture slides
- Lecture 1 - N-gram language models
- Lecture 2 - N-gram language model smoothing
- Lecture 3 - Neural network language models
- Lecture 4 - Neural network language model optimisation
- Practical
- Homework
Homework
Deadline: 5. May 2015, 23:59 Estonian time
- Implement a bigram language model with „stupid“ backoff (a scoring sketch follows this list).
- Finish a neural network language model. Java skeleton code will be provided, and you need to fill out the parts that perform feedforward and backpropagation passes through the network. Alternatively, you can implement a neural language model from scratch, in any language you wish.
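For the first task, here is a minimal sketch of the counting and scoring logic. It assumes whitespace-tokenised input with one sentence per line; the class and method names are illustrative (this is not part of the provided skeleton code), and 0.4 is the backoff weight suggested in the original stupid backoff paper.

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal bigram scorer with "stupid" backoff; names are illustrative. */
public class StupidBackoffBigram {
    private final Map<String, Integer> unigramCounts = new HashMap<>();
    private final Map<String, Integer> bigramCounts = new HashMap<>();
    private long totalTokens = 0;
    private static final double ALPHA = 0.4; // backoff weight from the original paper

    /** Update counts from one training sentence. */
    public void addSentence(String sentence) {
        String[] tokens = ("<s> " + sentence.trim() + " </s>").split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            unigramCounts.merge(tokens[i], 1, Integer::sum);
            totalTokens++;
            if (i > 0) {
                bigramCounts.merge(tokens[i - 1] + " " + tokens[i], 1, Integer::sum);
            }
        }
    }

    /** Stupid backoff score S(word | previous); not a normalised probability. */
    public double score(String previous, String word) {
        int bigram = bigramCounts.getOrDefault(previous + " " + word, 0);
        int context = unigramCounts.getOrDefault(previous, 0);
        if (bigram > 0 && context > 0) {
            return (double) bigram / context;
        }
        // Back off to the unigram relative frequency, scaled by ALPHA.
        return ALPHA * unigramCounts.getOrDefault(word, 0) / (double) totalTokens;
    }
}
```

Note that stupid backoff deliberately returns unnormalised relative-frequency scores rather than a proper probability distribution; that is what keeps it simple and fast.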
For your submission of both language models:
- Include the source code
- Evaluate your language models on the language modelling task and include your output file in your submission e-mail
- Include your achieved accuracy on the LM task. I expect at least 0.68 with either model
- Include instructions on how to compile and run your system, to reproduce the result
- Do not use existing language modelling or neural network optimisation libraries for your submission. These tasks are about learning how they work. Matrix algebra libraries are fine.
- Package everything up, upload it somewhere (e.g. Dropbox) and e-mail it to me: marek.rei@gmail.com
- You have three weeks from the end of the course. I recommend not leaving it to the last minute, as language model debugging and training can take time.
The neural network skeleton code is available on github: neurallm-exercise. More information in the readme file.
Dataset
I have prepared and preprocessed a dataset for you to use when developing and training your language models. It is created from Wikipedia text, separated into sentences, tokenised, and lowercased. The data is separated into training, development and test sets. The training set contains approximately 10M words. You can process these files as you wish, create separate subsets, etc.
The „unk100“ files have been preprocessed so that all words that occur fewer than 100 times in the training data are replaced by a special UNK token.
The „topNK“ files contain the first N thousand lines of the full file. Training on the full dataset is likely too time-consuming for the neural network language model, so I have made some smaller subsets.
Download the dataset here: lm-dataset.tar.gz
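If you want to build similar subsets yourself, for example with a different frequency cutoff, the unk100-style preprocessing only needs two passes over the file. A minimal sketch, assuming one tokenised sentence per line; the file names here are placeholders rather than the names used in the archive.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Replace words occurring fewer than a cutoff with UNK (file names are illustrative). */
public class UnkPreprocess {
    public static void main(String[] args) throws IOException {
        int minCount = 100;
        List<String> lines = Files.readAllLines(Paths.get("train.txt"));

        // First pass: count word frequencies in the training data.
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            for (String token : line.split("\\s+")) {
                counts.merge(token, 1, Integer::sum);
            }
        }

        // Second pass: rewrite words below the cutoff as UNK.
        List<String> output = new ArrayList<>();
        for (String line : lines) {
            StringBuilder sb = new StringBuilder();
            for (String token : line.split("\\s+")) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(counts.get(token) < minCount ? "UNK" : token);
            }
            output.add(sb.toString());
        }
        Files.write(Paths.get("train.unk100.txt"), output);
    }
}
```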
Practical
We'll train two neural network models. You can use the dataset described in the previous section as training data.
First, we'll look at the word2vec toolkit: http://code.google.com/p/word2vec/
- Download and compile the code.
- Train word vectors: ./word2vec -train trainingfile -output vectors.bin -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -binary 1 -iter 15
- Run ./distance vectors.bin and enter words to find other words with the most similar vectors. For example, try these words as input: island, university, france, night
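For reference, the distance tool ranks candidate words by the cosine similarity between their vectors. A minimal sketch of that measure, assuming two word vectors already loaded as float arrays:

```java
/** Cosine similarity, the measure used to rank neighbouring words. */
public class CosineSimilarity {
    /** Both vectors are assumed to have the same length. */
    public static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```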
Second, we'll try out RNNLM, the recurrent neural network language model: http://rnnlm.org/
- Download the toolkit and compile it. You might need to set "CC = g++" (or whichever compiler you have installed) in the makefile.
- Train a model: ./rnnlm -train trainingfile -valid validationfile -rnnlm model -hidden 15 -class 100 -bptt 4
- Test the model and measure perplexity (PPL): ./rnnlm -rnnlm model -test test (how perplexity is computed is sketched after this list)
- Change some parameters and see how that changes perplexity on the test set. For example, try a larger hidden layer (only 15 at the moment). In order to see the full list of settings, run ./rnnlm without parameters. You need to delete the model file every time, or use a different name, otherwise the system will continue training as opposed to starting fresh.
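As a reminder of what the PPL number reported by rnnlm means: perplexity is the exponentiated average negative log-probability that the model assigns to the words in the test set, so lower is better. A minimal sketch, assuming you have collected the model's probability for each test word into an array:

```java
/** Perplexity of a test set, given the model's probability for each word. */
public class Perplexity {
    public static double perplexity(double[] wordProbabilities) {
        // PPL = exp(-(1/N) * sum_i log p(w_i | history))
        double logProbSum = 0;
        for (double p : wordProbabilities) {
            logProbSum += Math.log(p);
        }
        return Math.exp(-logProbSum / wordProbabilities.length);
    }
}
```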