This paper builds on the paper "Learned in Translation: Contextualized Word Vectors" , which learned contextualized word representations by using the sequence of encodings generated by a Bidirectional LSTM as the representation of the sequence of input words. This paper says “if we're learning a deep LSTM, ie one with more than one layer, why should we use only the last layer that it produces as the representation of the word?”. This paper instead suggests that it could be valuable for transfer learning if each task can learn a weighting of layer encodings that is most valuable for that task. In a prime example of “your model is a special case of my model,” they note that this framework can easily learn the approach of only using the final encoding layer by just only giving that layer a non-zero weight. As intuition for why this might be a valuable thing to do: different layers tend to capture different levels of meaning, with lower layers more likely to capture part of speech information, and higher layers more likely to capture more rich semantic context. https://i.imgur.com/s8Qn6YY.png One difficulty in comparing this paper directly to the “take the top layer encoding from a LSTM” paper is that they were trained on different problems: the top layer paper learned using a machine translation objective, where, by contrast, this one learns by using a much simpler language model. Here, a simple language model means a RNN that is trained to predict the next word, given the hidden state built up over all prior words. Because we want to pull in word context from both directions, this is’t just a LSTM but a bidirectional LSTM, which - surprise surprise - tries to pick the word *before* a word in a sentence by using all of the words that come after it. This has the advantage of not requiring parallel data, the way machine translation does, but also makes it difficult to make direct comparisons to prior work that isolate the effect of multi-layer combination, as separate from the switch between machine translation and direct language modeling. Although this is likely also a benefit you see with just top-layer contextual vectors, it is interesting to examine the attached table, and look at how effectively the model is able to learn different representations of the word “play” depending on the context in which it appears; in each case, the nearest neighbor of a context of “play” is a sentence in which the word is used in the same context.