RNN language models are composed of:
1. Embedding layer
2. Recurrent layer(s) (RNN/LSTM/GRU/...)
3. Softmax layer (linear transformation + softmax operation)
The embedding matrix and the matrix of the linear transformation just before the softmax operation are the same size (size_of_vocab * recurrent_state_size).
They both contain one representation for each word in the vocabulary.
## __Weight Tying__
This paper shows that using the same matrix as both the input embedding and the pre-softmax linear transformation (the output embedding) improves the performance of a wide variety of language models while massively reducing the number of parameters.
In weight-tied models, each word has just one representation, which is used in both the input and output embedding.
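As a rough illustration of the idea above, the sketch below uses plain Python lists instead of a real deep learning framework. The function names (`make_matrix`, `embed`, `output_logits`, `count_params`) and the toy sizes are assumptions made for this example, not code from the paper. The recurrent layer and the output bias are left out to keep the focus on the shared matrix.

```python
import random

def make_matrix(rows, cols, seed=0):
    """Return a rows x cols matrix of small random floats."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def embed(matrix, word_id):
    """Input embedding: look up the row for word_id."""
    return matrix[word_id]

def output_logits(matrix, hidden):
    """Output embedding: one dot product per vocabulary row."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in matrix]

def count_params(vocab_size, state_size, tied):
    """Embedding parameters only; recurrent weights are the same either way."""
    per_matrix = vocab_size * state_size
    return per_matrix if tied else 2 * per_matrix

vocab_size, state_size = 5, 3  # toy sizes, assumed for illustration
shared = make_matrix(vocab_size, state_size)

# Tied model: the same matrix object serves both roles.
input_embedding = shared
output_embedding = shared

# The embedding is used directly as a stand-in for the recurrent layer's output.
hidden = embed(input_embedding, 2)
logits = output_logits(output_embedding, hidden)
```

Because both roles point at one matrix, any update made through the output side is also seen by the input lookup, and the embedding parameter count is halved relative to the un-tied setup.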
## __Why does weight tying work?__
1. In the paper we show that in un-tied language models, the output embedding contains much better word representations than the input embedding. We show that when the embedding matrices are tied, the quality of the shared embeddings is comparable to that of the output embedding in the un-tied model. So in the tied model, the quality of the input and output embeddings is superior to the quality of those embeddings in the un-tied model.
2. In most language modeling tasks, the models tend to overfit because of the small size of the datasets. When the number of parameters is reduced in a way that makes sense, there is less overfitting because of the reduction in the capacity of the network.
## __Can I tie the input and output embeddings of the decoder of a translation model?__
Yes, we show that this reduces the model's size while not hurting its performance.
In addition, we show that if you preprocess your data using BPE, because of the large overlap between the subword vocabularies of the source and target language, __Three-Way Weight Tying__ can be used. In Three-Way Weight Tying, we tie the input embedding in the encoder to the input and output embeddings of the decoder (so each word has one representation which is used across three matrices).
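A minimal sketch of Three-Way Weight Tying, assuming a joint subword vocabulary built from both languages. The subword lists are made up for illustration (they are not real BPE output), and the names `build_joint_vocab`, `overlap_ratio`, and `roles` are hypothetical.

```python
def build_joint_vocab(source_subwords, target_subwords):
    """Map every subword from either language to one shared row index."""
    joint = sorted(set(source_subwords) | set(target_subwords))
    return {subword: index for index, subword in enumerate(joint)}

def overlap_ratio(source_subwords, target_subwords):
    """Fraction of the joint vocabulary that appears in both languages."""
    source, target = set(source_subwords), set(target_subwords)
    return len(source & target) / len(source | target)

# Toy subword vocabularies (assumed data, for illustration only).
source_subwords = ["in", "er", "s", "the", "haus"]
target_subwords = ["in", "er", "s", "der", "haus"]

joint_vocab = build_joint_vocab(source_subwords, target_subwords)
state_size = 4
shared_table = [[0.0] * state_size for _ in joint_vocab]

# One table, three roles: every role refers to the same object.
roles = {
    "encoder_input": shared_table,
    "decoder_input": shared_table,
    "decoder_output": shared_table,
}

untied_params = 3 * len(joint_vocab) * state_size  # three separate tables
tied_params = len(joint_vocab) * state_size        # one shared table
```

The higher the overlap between the two subword vocabularies, the fewer rows of the shared table are used by only one language.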
[This](http://ofir.io/Neural-Language-Modeling-From-Scratch/) blog post contains more details about the weight tying method.
Incorporating an unsupervised language modeling objective to help train a bidirectional LSTM for sequence labeling. At the same time as training the tagger, the forward-facing LSTM is optimised to predict the next word and the backward-facing LSTM is optimised to predict the previous word. The model learns a better composition function and improves performance on NER, error detection, chunking and POS-tagging, without using additional data.
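The joint objective described above can be sketched roughly as a tagging loss plus a weighted sum of the two language modeling losses. The weight `gamma`, its default value, and the function names are assumptions for this example. The probabilities passed in stand for whatever the model assigns to the correct tag or word at each position.

```python
import math

def lm_targets(tokens):
    """Pair each position with the word its LSTM direction should predict."""
    forward = list(zip(tokens[:-1], tokens[1:]))   # (current, next word)
    backward = list(zip(tokens[1:], tokens[:-1]))  # (current, previous word)
    return forward, backward

def nll(probabilities):
    """Negative log-likelihood of the probabilities given to the correct items."""
    return -sum(math.log(p) for p in probabilities)

def joint_loss(tag_probs, next_word_probs, prev_word_probs, gamma=0.1):
    """Tagging loss plus gamma-weighted forward and backward LM losses."""
    return nll(tag_probs) + gamma * (nll(next_word_probs) + nll(prev_word_probs))
```

Setting `gamma` to zero recovers the plain tagger, so the language modeling terms act purely as an auxiliary training signal.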
A toolkit for automatically annotating error correction data with error types. It takes original and corrected sentences as input, aligns them to infer error spans, and uses rules to assign error types. They use the tool to perform fine-grained evaluation of CoNLL-14 shared task participants.
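To give a feel for the align-then-classify pipeline, here is a simplified sketch using Python's `difflib`. It is not the toolkit's own alignment algorithm or rule set: the token-level diff and the three coarse labels (`M` for missing, `U` for unnecessary, `R` for replacement) are stand-ins chosen for this example, and `extract_edits` is a hypothetical name.

```python
from difflib import SequenceMatcher

def extract_edits(original, corrected):
    """Return (label, orig_start, orig_end, correction) for each edited span."""
    orig_tokens, corr_tokens = original.split(), corrected.split()
    matcher = SequenceMatcher(a=orig_tokens, b=corr_tokens, autojunk=False)
    labels = {"insert": "M", "delete": "U", "replace": "R"}
    edits = []
    for op, o_start, o_end, c_start, c_end in matcher.get_opcodes():
        if op == "equal":
            continue  # unchanged tokens carry no error
        correction = " ".join(corr_tokens[c_start:c_end])
        edits.append((labels[op], o_start, o_end, correction))
    return edits

print(extract_edits("She went school", "She went to school"))
```

A real system would go further and split multi-token replacements and assign linguistic categories (for example using part-of-speech information) to each span.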
The paper proposes integrating a pre-trained language model into a sequence labeling model. The baseline model for sequence labeling is a two-layer LSTM/GRU. They concatenate the hidden states from pre-trained language models onto the output of the first LSTM layer. This provides an improvement on NER and chunking tasks.
They propose neural models for dialogue state tracking, making a binary decision for each possible slot-value pair, based on the latest context from the user and the system. The context utterances and the slot-value option are encoded into vectors, either by summing word representations or using a convnet. These vectors are then further combined to produce a binary output. The systems are evaluated on two dialogue datasets and show improvement over baselines that use hand-constructed lexicons.
Proposing character-based extensions to a neural MT system for grammatical error correction. OOV words are represented in the encoder and decoder using character-based RNNs. They evaluate on the CoNLL-14 dataset, integrate probabilities from a large language model, and achieve good results.
This paper attempts to open up the black box of neural machine translation models and inspect what the representations look like, specifically with respect to morphology. The technique they use is to train word-based and character-based seq2seq-style models on multiple source-target language pairs, of varying morphological complexity, and then ignore the target side to focus on the representations learned about the source language. Once they have an encoder trained to generate these representations, they attempt to use the encoder to create feature representations for external tasks that directly evaluate for morphology and part of speech information. (Contrast this with methods that may, for example, try to inspect activation patterns of individual neurons in a trained model.)
The first experiment shows that representations learned from character-based models are superior for POS tagging in the source language. The gap is bigger for morphologically rich languages like Arabic. The same result holds for morphological tagging. For infrequent words the gap is especially large, since the system can memorize morphological information for frequent words. They also show that the increases in accuracy are due to getting previously unseen words correct (both for POS and morphological prediction) and that the biggest increase in accuracy is in predicting plural and determined noun categories. Next, they show that in a deeper network, the middle layer (of 3) has the best representations for predicting POS and morphological information. The authors suggest the higher layers are more focused on semantics or other higher abstractions.
Overall, this work empirically confirms some conventional wisdom, that character representations are better for unseen words because of their ability to represent morphology.
First published: 2017/04/22. Abstract: Lexical features are a major source of information in state-of-the-art
coreference resolvers. Lexical features implicitly model some of the linguistic
phenomena at a fine granularity level. They are especially useful for
representing the context of mentions. In this paper we investigate a drawback
of using many lexical features in state-of-the-art coreference resolvers. We
show that if coreference resolvers mainly rely on lexical features, they can
hardly generalize to unseen domains. Furthermore, we show that the current
coreference resolution evaluation is clearly flawed by only evaluating on a
specific split of a specific dataset in which there is a notable overlap
between the training, development and test sets.
Kind of a response to, and deeper dive into, the Durrett/Klein "easy victories" paper. Suggests that the lexical features they used ("easy victories") are very prone to overfitting. They first show that several state-of-the-art systems that use lexical features, trained on CoNLL data, perform poorly on WikiCoref, which was annotated using the same guidelines. Meanwhile the Stanford sieve system performs about the same on both.
Then they show that a high percentage of gold standard linked headwords in the test set have been seen in the training set, and that a much lower percentage of errors are in the training set, implying that lexical features just allow you to memorize what kinds of things can be linked.
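The overlap analysis above boils down to counting how many headword pairs from the test set were already seen in training. A minimal sketch, with made-up toy pairs and a hypothetical `seen_ratio` helper (neither comes from the paper):

```python
def seen_ratio(pairs, training_pairs):
    """Fraction of headword pairs that also occur in the training set."""
    if not pairs:
        return 0.0
    training = set(training_pairs)
    return sum(pair in training for pair in pairs) / len(pairs)

# Toy (anaphor head, antecedent head) pairs, assumed for illustration.
training_pairs = [("president", "obama"), ("company", "apple"), ("city", "paris")]
gold_test_links = [("president", "obama"), ("company", "apple"),
                   ("river", "nile"), ("city", "paris")]
system_errors = [("river", "bank"), ("company", "apple")]

gold_seen = seen_ratio(gold_test_links, training_pairs)
error_seen = seen_ratio(system_errors, training_pairs)
```

A large gap between the two ratios is the kind of evidence the summary describes: correct links are mostly pairs the system has already seen, while errors are mostly on unseen pairs.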
They suggest developing more robust features, including using embeddings as lexical features and using lexical representations only for context, and, on the evaluation side, using test sets from different domains than the training set.
Multilingual embeddings are useful for creating embeddings for low resource languages for things like transfer learning (e.g., learning a POS tagger in a low-resource language using training data from a high resource language). However, they typically require some small amount of supervision in the form of aligned corpora, seed pairs, or dictionaries. This approach attempts to learn a mapping from a source embedding space into a target embedding space without supervision.
The approach uses two networks a la adversarial training. One network (the generator) is parameterized by a projection matrix that attempts to map source words into the target space. The other network (the discriminator) attempts to discriminate true target embeddings from projected source embeddings. Since adversarial training is known to be unstable (a "research frontier" as the authors say), quite a bit of the paper describes tricks and training methods the authors investigated to get training to converge and understand how to select models.
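The two-network setup can be sketched with standard GAN-style losses. This is an assumed formulation for illustration: the discriminator here is a single logistic unit, and the function names and loss forms are not taken from the paper, which explores several training variants.

```python
import math

def project(matrix, vector):
    """Generator: map a source embedding into the target space."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def discriminator(weights, bias, vector):
    """Probability that the vector is a true target embedding."""
    score = sum(w * x for w, x in zip(weights, vector)) + bias
    return 1.0 / (1.0 + math.exp(-score))

def adversarial_losses(projection, disc_weights, disc_bias, source_vec, target_vec):
    """Return (discriminator loss, generator loss) for one source/target sample."""
    fake = project(projection, source_vec)
    p_real = discriminator(disc_weights, disc_bias, target_vec)
    p_fake = discriminator(disc_weights, disc_bias, fake)
    # The discriminator wants p_real near 1 and p_fake near 0.
    disc_loss = -math.log(p_real) - math.log(1.0 - p_fake)
    # The generator wants its projections to be mistaken for real targets.
    gen_loss = -math.log(p_fake)
    return disc_loss, gen_loss
```

The two losses pull in opposite directions, which is one reason this kind of training is hard to stabilize.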
They evaluate on many pairs, including both similar and dissimilar language pairs, and get very nice results. In summary, better than seed-based approaches with 0-100 seeds, competitive with 100-1000 seeds. Much of what would be traditional discussion is instead devoted to details of training regimen, so unfortunately there is little discussion of why this works. Given the difficulty one might encounter attempting to train this, I think it might be a little preliminary to try using this for applications, but continued research in training adversarial networks for NLP and properties of embedding spaces could potentially make this approach reliable enough for real applications.
They get multilingual alignments from dictionaries, train a BiLSTM POS tagger in the source language, and then automatically tag many tokens in the target language. They then manually annotate 1000 tokens in the target language and train a system with a combined loss over the distant tagging and the gold tagging. They add an additional output layer that is learned for the gold annotations.