This page is for the module on Constructing and Evaluating Word Embeddings that I am teaching at the University of Cambridge, together with Dr Ekaterina Kochmar.
Description
Representing words as low-dimensional vectors allows systems to take advantage of semantic similarities, generalise to unseen examples and improve pattern detection accuracy on nearly all NLP tasks. Advances in neural networks and representation learning have opened new and exciting ways of learning word embeddings with unique properties.
In this topic we will provide an introduction to the classical vector space models and cover the most influential research in neural embeddings from the past couple of years, including word similarity and semantic analogy tasks, word2vec models and task-specific representation learning. We will also discuss the most recent advances in the field, including multilingual embeddings and multimodal vectors that make use of image recognition.
By the end of the course you will be able to construct word representations using both traditional and neural network models, understand the different properties of these models, and know how to choose an approach for a specific task. You will also have an overview of the most recent and notable advances in the field.
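To make the word similarity and semantic analogy tasks mentioned above concrete, here is a minimal sketch in plain NumPy. The tiny 3-dimensional vectors are invented purely for illustration (real embeddings typically have hundreds of dimensions); only the vector arithmetic is the point.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity: the standard closeness measure between word vectors.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy 3-dimensional embeddings, made up purely for illustration.
embeddings = {
    "king":  np.array([0.8, 0.6, 0.1]),
    "queen": np.array([0.8, 0.1, 0.6]),
    "man":   np.array([0.3, 0.9, 0.0]),
    "woman": np.array([0.3, 0.3, 0.5]),
}

# Word similarity task: how close are two words in the vector space?
print(cosine(embeddings["king"], embeddings["queen"]))

# Semantic analogy task: king - man + woman should land near queen;
# we pick the closest remaining vocabulary word to the offset vector.
target = embeddings["king"] - embeddings["man"] + embeddings["woman"]
candidates = [w for w in embeddings if w not in ("king", "man", "woman")]
print(max(candidates, key=lambda w: cosine(embeddings[w], target)))
```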
Lecture slides
Introductory lecture on word embeddings
Background Reading:
Mikolov et al. (2013). Efficient Estimation of Word Representations in Vector Space
Mikolov et al. (2013). Linguistic Regularities in Continuous Space Word Representations
Papers for student presentations:
Socher et al. (2012). Semantic Compositionality through Recursive Matrix-Vector Spaces
Hermann and Blunsom (2014, ACL). Multilingual Models for Compositional Distributed Semantics
Faruqui et al. (2015, best paper at NAACL). Retrofitting Word Vectors to Semantic Lexicons
Norouzi et al. (2014, ICLR). Zero-Shot Learning by Convex Combination of Semantic Embeddings
Useful links
Word2vec, a tool for creating word embeddings
https://code.google.com/archive/p/word2vec/
Word vectors pretrained on 100B words. More information on the word2vec homepage; a short loading sketch follows this list of links.
https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing
Tool for converting word2vec vectors between binary and plain-text formats. You can use this to convert the pre-trained vectors to plain-text.
https://github.com/marekrei/convertvec
Vectors trained using 3 different methods (counting, word2vec and dependency relations) on the same dataset (the British National Corpus, BNC).
http://www.marekrei.com/projects/vectorsets/
An online tool for evaluating word vectors on 12 different word similarity datasets.
http://www.wordvectors.org/
t-SNE, a tool for visualising word embeddings in 2D.
http://lvdmaaten.github.io/tsne/
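The pretrained vectors linked above can be loaded in a few lines. The sketch below is one common route rather than part of the original course materials: it assumes the gensim library is installed and that the Google Drive download is the usual GoogleNews-vectors-negative300.bin.gz file (adjust the filename to whatever you actually downloaded, or set binary=False for vectors converted to plain text with convertvec).

```python
# A minimal sketch, assuming gensim is installed (pip install gensim) and that
# the downloaded file is GoogleNews-vectors-negative300.bin.gz (an assumption;
# use whatever filename you actually have).
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin.gz", binary=True)
# For vectors converted to plain text (e.g. with convertvec), use binary=False.

# Word similarity: cosine similarity between two word vectors.
print(vectors.similarity("king", "queen"))

# Semantic analogy: king - man + woman ~ ?
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```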