They get multilingual alignments from dictionaries, train a BiLSTM POS tagger in the source language, automatically tag many tokens in the target language, and manually annotate 1000 tokens in the target language. They then train a system with a combined loss over the distant (automatic) tags and the gold tags, adding an extra output layer that is learned for the gold annotations.
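To make the combined loss concrete, here is a minimal sketch of how I read it, not the authors' implementation. The feature vectors stand in for shared BiLSTM hidden states, and the names `combined_loss`, `distant_layer`, `gold_layer`, and `gold_weight` are my own. Treating the extra layer as a separate output head over the shared features is an assumption; the paper may compose it differently.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of scores.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, tag_index):
    # Negative log-probability of the reference tag.
    return -math.log(softmax(logits)[tag_index])

def linear(layer, features):
    # layer = (weights, bias); one weight row per tag.
    weights, bias = layer
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, bias)]

def combined_loss(distant_batch, gold_batch, distant_layer, gold_layer,
                  gold_weight=1.0):
    # Each batch is a list of (features, tag_index) pairs. The features are
    # assumed to come from the same shared encoder; only the output layer
    # differs (assumption: a separate head for the gold annotations).
    distant = sum(cross_entropy(linear(distant_layer, h), t)
                  for h, t in distant_batch) / len(distant_batch)
    gold = sum(cross_entropy(linear(gold_layer, h), t)
               for h, t in gold_batch) / len(gold_batch)
    return distant + gold_weight * gold

# Toy usage: 3 tags, 2-dimensional features, all-zero output layers.
zero_layer = ([[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]], [0.0, 0.0, 0.0])
distant_batch = [([0.2, -0.1], 0), ([0.5, 0.3], 2)]  # automatically tagged
gold_batch = [([0.1, 0.4], 1)]                       # manually annotated
loss = combined_loss(distant_batch, gold_batch, zero_layer, zero_layer)
```

With all-zero layers every tag is equally likely, so each of the two terms equals log(3). In a real system both terms would be backpropagated into the shared encoder, and `gold_weight` would control how much the small gold set counts against the much larger distant set.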
Multilingual embeddings are useful for low-resource languages in settings such as transfer learning (e.g., learning a POS tagger in a low-resource language using training data from a high-resource language). However, they typically require some small amount of supervision in the form of aligned corpora, seed pairs, or dictionaries. This approach attempts to learn a mapping from a source embedding space into a target embedding space without supervision.
The approach uses two networks a la adversarial training. One network (the generator) is parameterized by a projection matrix that attempts to map source words into the target space. The other network (the discriminator) attempts to discriminate true target embeddings from projected source embeddings. Since adversarial training is known to be unstable (a "research frontier" as the authors say), quite a bit of the paper describes tricks and training methods the authors investigated to get training to converge and understand how to select models.
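The two objectives can be sketched as follows. This is a toy illustration under my own assumptions, not the authors' setup: the discriminator here is a single logistic unit (`d_weights`, `d_bias`), the generator is just the projection matrix `W`, and the losses are the standard GAN cross-entropy terms. None of the paper's stabilization tricks are included.

```python
import math

def sigmoid(z):
    # Stable logistic function for positive and negative inputs.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def project(W, x):
    # Generator: map a source embedding into the target space.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def discriminator(d_weights, d_bias, v):
    # Probability that v is a true target embedding (assumed logistic unit).
    return sigmoid(sum(w * vi for w, vi in zip(d_weights, v)) + d_bias)

def adversarial_losses(W, d_weights, d_bias, source_vecs, target_vecs):
    eps = 1e-12  # guards against log(0)
    d_loss = 0.0
    g_loss = 0.0
    for y in target_vecs:
        # Discriminator should score real target embeddings near 1.
        d_loss -= math.log(discriminator(d_weights, d_bias, y) + eps)
    for x in source_vecs:
        p_real = discriminator(d_weights, d_bias, project(W, x))
        # Discriminator should score projected source embeddings near 0 ...
        d_loss -= math.log(1.0 - p_real + eps)
        # ... while the generator wants them to be scored near 1.
        g_loss -= math.log(p_real + eps)
    n = len(source_vecs) + len(target_vecs)
    return d_loss / n, g_loss / len(source_vecs)

# Toy usage with an identity projection and an untrained discriminator.
W = [[1.0, 0.0], [0.0, 1.0]]
source_vecs = [[0.3, -0.2], [0.1, 0.9]]
target_vecs = [[0.5, 0.5]]
d_loss, g_loss = adversarial_losses(W, [0.0, 0.0], 0.0, source_vecs, target_vecs)
```

An untrained discriminator outputs 0.5 everywhere, so both losses start at about log(2). Training alternates gradient steps that lower `d_loss` with respect to the discriminator and `g_loss` with respect to `W`, which is where the instability comes from.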
They evaluate on many pairs, including both similar and dissimilar language pairs, and get very nice results. In summary, the method beats seed-based approaches given 0-100 seeds and is competitive with them given 100-1000 seeds. Much of what would be traditional discussion is instead devoted to details of the training regimen, so unfortunately there is little discussion of why this works. Given the difficulty one might encounter attempting to train this, I think it might be a little premature to try using this for applications, but continued research in training adversarial networks for NLP and in the properties of embedding spaces could potentially make this approach reliable enough for real applications.
First published: 2017/04/22 Abstract: Lexical features are a major source of information in state-of-the-art
coreference resolvers. Lexical features implicitly model some of the linguistic
phenomena at a fine granularity level. They are especially useful for
representing the context of mentions. In this paper we investigate a drawback
of using many lexical features in state-of-the-art coreference resolvers. We
show that if coreference resolvers mainly rely on lexical features, they can
hardly generalize to unseen domains. Furthermore, we show that the current
coreference resolution evaluation is clearly flawed by only evaluating on a
specific split of a specific dataset in which there is a notable overlap
between the training, development and test sets.
Kind of a response/deeper dive into the Durrett/Klein "easy victories" paper. Suggests that the lexical features they used ("easy victories") are very prone to overfitting. They first show that several state-of-the-art systems that use lexical features, trained on CoNLL data, perform poorly on WikiCoref, which was annotated using the same guidelines. Meanwhile, the Stanford sieve system performs about the same on both.
Then they show that a high percentage of gold-standard linked headwords in the test set have been seen in the training set, and that a much lower percentage of the headwords involved in errors have been seen in the training set, implying that lexical features mostly let the system memorize what kinds of things can be linked.
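The overlap statistic is simple to compute. Below is a minimal sketch with made-up headword pairs; the function name `seen_fraction` and all of the data are hypothetical, and the paper's exact counting (e.g., how mentions are reduced to headwords) may differ.

```python
def normalize(pair):
    # Treat (a, b) and (b, a) as the same headword pair.
    return tuple(sorted(pair))

def seen_fraction(pairs, train_pairs):
    # Fraction of the given headword pairs that also occur as coreferent
    # headword pairs in the training data.
    train = {normalize(p) for p in train_pairs}
    if not pairs:
        return 0.0
    return sum(normalize(p) in train for p in pairs) / len(pairs)

# Hypothetical toy data, for illustration only.
train_pairs = [("president", "he"), ("company", "it")]
test_gold_pairs = [("he", "president"), ("company", "it"), ("city", "it")]
test_error_pairs = [("city", "it"), ("team", "they")]

gold_seen = seen_fraction(test_gold_pairs, train_pairs)
error_seen = seen_fraction(test_error_pairs, train_pairs)
```

In this toy data two of the three gold test pairs were seen in training, while neither error pair was. That is the same pattern the review describes: high overlap for gold links, low overlap for errors.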
They suggest developing more robust features, including using embeddings as lexical features and using lexical representations only for context. On the evaluation side, they suggest using test sets from different domains than the training set.
This paper attempts to open up the black box of neural machine translation models and inspect what the representations look like, specifically with respect to morphology. The technique they use is to train word-based and character-based seq2seq-style models on multiple source-target language pairs, of varying morphological complexity, and then ignore the target side to focus on the representations learned about the source language. Once they have an encoder trained to generate these representations, they attempt to use the encoder to create feature representations for external tasks that directly evaluate for morphology and part of speech information. (Contrast this with methods that may, for example, try to inspect activation patterns of individual neurons in a trained model.)
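The probing setup can be sketched roughly as follows. Everything here is a stand-in: `frozen_encoder` is a hypothetical fixed feature function, not a trained NMT encoder, and a simple multi-class perceptron is swapped in for whatever classifier the authors used. The point is only the shape of the experiment: the encoder is never updated, and a separate classifier is trained on its outputs.

```python
def frozen_encoder(word):
    # Hypothetical stand-in for a trained NMT encoder. It is fixed and is
    # never updated while the probe is trained.
    return [1.0, float(word.endswith("s")), float(word.endswith("ed"))]

def predict(weights, feats):
    # Highest-scoring tag; ties go to the alphabetically first tag.
    return max(sorted(weights),
               key=lambda t: sum(w * f for w, f in zip(weights[t], feats)))

def train_probe(examples, tags, epochs=20):
    # Multi-class perceptron trained on frozen features only.
    dim = len(frozen_encoder(examples[0][0]))
    weights = {t: [0.0] * dim for t in tags}
    for _ in range(epochs):
        for word, gold in examples:
            feats = frozen_encoder(word)
            pred = predict(weights, feats)
            if pred != gold:
                for i, f in enumerate(feats):
                    weights[gold][i] += f
                    weights[pred][i] -= f
    return weights

# Toy usage: train the probe, then tag words it has never seen.
examples = [("cats", "NOUN_PL"), ("dogs", "NOUN_PL"),
            ("walked", "VERB_PAST"), ("jumped", "VERB_PAST")]
probe = train_probe(examples, ["NOUN_PL", "VERB_PAST"])
unseen_noun = predict(probe, frozen_encoder("birds"))
unseen_verb = predict(probe, frozen_encoder("played"))
```

Because the stand-in features look at suffixes, the probe tags the unseen words "birds" and "played" correctly. Probe accuracy is then read as a measure of how much morphological information the frozen representation carries.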
The first experiment shows that representations learned from character-based models are superior for POS tagging in the source language. The gap is bigger for morphologically rich languages like Arabic. The same result holds for morphological tagging. The gap is especially large for infrequent words, since the system can memorize morphological information for frequent words. They also show that the increases in accuracy are due to getting previously unseen words correct (both for POS and morph prediction) and that the biggest increase in accuracy is in predicting plural and determined noun categories. Next, they show that in a deeper network, the middle layer (of 3) has the best representations for predicting POS/morph information. The authors suggest the higher layers are more focused on semantics or other higher abstractions.
Overall, this work empirically confirms a piece of conventional wisdom: character representations are better for unseen words because of their ability to represent morphology.