A Distribution-based Model to Learn Bilingual Word Embeddings
Cao, Hailong; Zhao, Tiejun; Zhang, Shu; Meng, Yao. 2016

Paper summary (tmills): A joint model for training bilingual embeddings without supervision. As background, they point out that embeddings in different languages tend to be linear transforms of each other. They then create a new network and objective that builds on CBOW training, but with two terms in the loss function: the traditional CBOW language-model loss, and an additional loss over the mean and variance of each dimension of the embedding space. The hypothesis, I guess, is that constraining each dimension of each language's embeddings to be similar (in mean and variance) will put them in the same space from the start (i.e., no transformation needed). During training, they sample a word randomly from the source (or target) language, feed forward, and compute the distribution-based gradient using the target (or source) means as the gold standard, adding it to the standard gradient.
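To make the distribution term concrete, here is a minimal numpy sketch of what a per-dimension mean/variance matching loss could look like, based on my reading of the summary above. The function name and the choice of squared differences are my assumptions, not the paper's actual formulation:

```python
import numpy as np

def distribution_loss(E_src, E_tgt):
    """Sketch of a per-dimension distribution-matching loss between two
    embedding matrices (rows = words, columns = dimensions).

    Penalizes squared differences between the two languages' per-dimension
    means and variances. Hypothetical reconstruction, not the authors' code.
    """
    mu_s, mu_t = E_src.mean(axis=0), E_tgt.mean(axis=0)
    var_s, var_t = E_src.var(axis=0), E_tgt.var(axis=0)
    # Sum squared moment mismatches over all embedding dimensions
    return np.sum((mu_s - mu_t) ** 2) + np.sum((var_s - var_t) ** 2)
```

On this reading, the loss is zero when the two embedding tables have identical per-dimension statistics, which is consistent with the idea that matching moments pulls the spaces into rough alignment without an explicit linear transform.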
They evaluate on English-French and English-Chinese using aligned corpora, and get improved, though still not strong, performance.
There are two issues I can think of that might explain the (lack of) accuracy. First, the mean/variance similarity constraints might not be constraining enough. Second, the fact that the other language's in-progress embedding statistics are used as gold may cause training to get stuck in local optima. FWIW, Zhang et al. '17 (ACL) also question the distributional assumption (that each dimension is Gaussian-distributed).