They get multilingual alignments from dictionaries, train a BiLSTM POS tagger in the source language, automatically tag many tokens in the target language, manually annotate 1,000 tokens in the target language, and then train a system with a combined loss over the distant tags and the gold tags. They add an additional output layer that is learned for the gold annotations.
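A minimal sketch of how that combined setup could look, assuming PyTorch; the layer sizes, tag inventory, and batching scheme are illustrative choices of mine, not the paper's:

```python
import torch.nn as nn

class DualOutputTagger(nn.Module):
    """BiLSTM encoder shared by two softmax output layers: one trained on
    the distant (cross-lingually projected) tags, one on the small gold set."""

    def __init__(self, vocab_size, n_tags, emb_dim=64, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                              bidirectional=True)
        self.distant_out = nn.Linear(2 * hidden, n_tags)
        self.gold_out = nn.Linear(2 * hidden, n_tags)  # extra layer for gold

    def forward(self, tokens):                  # tokens: (batch, seq)
        h, _ = self.bilstm(self.emb(tokens))    # h: (batch, seq, 2*hidden)
        return self.distant_out(h), self.gold_out(h)

def combined_loss(model, tokens, tags, is_gold):
    """Score a batch against the output layer matching its annotation type."""
    distant_logits, gold_logits = model(tokens)
    logits = gold_logits if is_gold else distant_logits
    return nn.functional.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                       tags.reshape(-1))
```

Training would then alternate (or mix) batches from the large distantly tagged corpus and the 1,000 gold tokens, so both output layers share the same BiLSTM encoder.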
Multilingual embeddings are useful for creating embeddings for low-resource languages for tasks like transfer learning (e.g., learning a POS tagger in a low-resource language using training data from a high-resource language). However, they typically require some small amount of supervision in the form of aligned corpora, seed pairs, or dictionaries. This approach attempts to learn a mapping from a source embedding space into a target embedding space without supervision.
The approach uses two networks, à la adversarial training. One network (the generator) is parameterized by a projection matrix that attempts to map source words into the target space. The other network (the discriminator) attempts to distinguish true target embeddings from projected source embeddings. Since adversarial training is known to be unstable (a "research frontier," as the authors say), quite a bit of the paper describes the tricks and training methods the authors investigated to get training to converge and to understand how to select models.
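To make the setup concrete, here is a minimal sketch in PyTorch; the embedding dimension, discriminator shape, and plain SGD are my assumptions, and it deliberately omits the stabilization tricks that occupy much of the paper:

```python
import torch
import torch.nn as nn

dim = 300  # embedding dimensionality (illustrative)

# Generator: a single projection matrix W, mapping source -> target space.
W = nn.Linear(dim, dim, bias=False)
# Discriminator: tells real target embeddings from projected source ones.
D = nn.Sequential(nn.Linear(dim, 512), nn.LeakyReLU(0.2), nn.Linear(512, 1))

bce = nn.BCEWithLogitsLoss()
opt_w = torch.optim.SGD(W.parameters(), lr=0.1)
opt_d = torch.optim.SGD(D.parameters(), lr=0.1)

def train_step(src, tgt):  # src, tgt: batches of word embeddings
    # Discriminator step: push real targets toward 1, projections toward 0.
    d_loss = (bce(D(tgt), torch.ones(len(tgt), 1)) +
              bce(D(W(src).detach()), torch.zeros(len(src), 1)))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: update W so that projections fool the discriminator.
    g_loss = bce(D(W(src)), torch.ones(len(src), 1))
    opt_w.zero_grad()
    g_loss.backward()
    opt_w.step()
    return d_loss.item(), g_loss.item()
```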
They evaluate on many language pairs, both similar and dissimilar, and get very nice results: better than seed-based approaches given 0-100 seeds, and competitive with them given 100-1,000 seeds. Much of what would traditionally be discussion is instead devoted to details of the training regimen, so unfortunately there is little discussion of why this works. Given the difficulty one might encounter attempting to train this, I think it might be a little preliminary to try using it for applications, but continued research on training adversarial networks for NLP and on the properties of embedding spaces could make this approach reliable enough for real applications.
(Reposting under ACL 2017 version)
Kind of a response to / deeper dive into the Durrett/Klein "easy victories" paper. Argues that the lexical features they used (the "easy victories") are very prone to overfitting. They first show that several state-of-the-art systems that use lexical features, trained on CoNLL data, perform poorly on WikiCoref, which was annotated using the same guidelines; meanwhile, the Stanford sieve system performs about the same on both.
Then they show that a high percentage of the gold-standard linked headwords in the test set were seen in the training set, while a much lower percentage of the erroneously linked headwords were, implying that lexical features mainly let the model memorize which specific heads can be linked.
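Roughly the kind of overlap statistic at play, as a sketch; representing links as (anaphor head, antecedent head) pairs and all names here are hypothetical:

```python
def seen_fraction(links, train_links):
    """Fraction of (anaphor head, antecedent head) pairs in `links`
    that also occur, in either order, among the training links."""
    train = {frozenset(pair) for pair in train_links}
    return sum(frozenset(pair) in train for pair in links) / len(links)

# The paper's contrast, roughly: if gold test links score much higher here
# than erroneous system links, lexical features are mostly memorizing, e.g.
# seen_fraction(gold_test_links, train_links) vs.
# seen_fraction(error_links, train_links)
```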
They suggest developing more robust features, including using embeddings as lexical features and using lexical representations only for context; on the evaluation side, they suggest using test sets from a different domain than the training set.