# Skip-Thought Vectors

## Introduction

* The paper describes an unsupervised approach to training a generic, distributed sentence encoder.
* It also describes a vocabulary expansion method for encoding words not seen at training time.
* [Link to the paper](https://arxiv.org/abs/1506.06726)

## Skip-Thoughts

* Train an encoder-decoder model where the encoder maps the input sentence to a sentence vector and the decoder generates the sentences surrounding the original sentence.
* The model is called **skip-thoughts** and the encoded vectors are called **skip-thought vectors**.
* Similar to the [skip-gram](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) model in the sense that surrounding sentences are used to learn sentence vectors.

### Architecture

* Training data is in the form of sentence tuples (previous sentence, current sentence, next sentence).
* **Encoder**
    * RNN encoder with GRU.
* **Decoder**
    * RNN decoder with conditional GRU.
    * Conditioned on the encoder output.
    * Extra matrices are introduced to bias the update gate, reset gate and hidden state, given the encoder output.
    * **Vocabulary matrix (V)** - Weight matrix having one row (vector) for each word in the vocabulary.
    * Separate decoders for the previous and next sentence, which share only **V**.
    * Given the decoder context **h** (at any time step), the encoder output, and the words already generated for the output sentence, the probability of choosing *w* as the next word is proportional to exp(**V**(*w*) · **h**), where **V**(*w*) is the row of **V** for *w*.
* **Objective**
    * Sum of the log-probabilities of the forward (next) and backward (previous) sentences, conditioned on the encoder output.

## Vocabulary Expansion

* Take a model such as Word2Vec, which can be trained to induce word representations, and train it to obtain embeddings for all the words that the encoder is likely to see.
* Learn a matrix **W** such that *encoder(word) = W · Word2Vec(word)* (a matrix-vector product) for all words that are common to both the Word2Vec model and the encoder model.
* Use **W** to generate embeddings for words that are not seen during encoder training.

## Dataset

* [BookCorpus dataset](https://arxiv.org/abs/1506.06724) containing books across 16 genres.

## Training

* **uni-skip**
    * Unidirectional encoder with 2400 dimensions.
* **bi-skip**
    * Bidirectional model with forward (sentence given in correct order) and backward (sentence given in reverse order) encoders of 1200 dimensions each.
* **combine-skip**
    * Concatenation of uni-skip and bi-skip vectors.
* Initialization
    * Recurrent matrices - orthogonal initialization.
    * Non-recurrent matrices - uniform distribution in [-0.1, 0.1].
* Mini-batches of size 128.
* Gradient clipping at norm = 10.
* Adam optimizer.

## Experiments

* After learning skip-thoughts, freeze the model and use the encoder as a feature extractor only.
* Evaluated the vectors with linear models on the following tasks:

### Semantic Relatedness

* Given a sentence pair, predict how closely related the two sentences are.
* The **skip-thoughts** method outperforms all systems from the SemEval 2014 competition and is outperformed only by dependency tree-LSTMs.
* Using features learned from an image-sentence embedding model on COCO boosts performance and brings it on par with dependency tree-LSTMs.

### Paraphrase detection

* **skip-thoughts** outperforms recursive nets with dynamic pooling if no hand-crafted features are used.
* **skip-thoughts** with basic pairwise statistics produces results comparable to state-of-the-art systems that rely on complicated features and hand engineering.

### Image-sentence Ranking

* MS COCO dataset
* Task
    * Image annotation - Given an image, rank the sentences on the basis of how well they describe the image.
    * Image search - Given a caption, find the image that is being described.
* Though the system does not outperform the baseline system in all cases, the results do indicate that skip-thought vectors can capture image descriptions without having to learn their representations from scratch.

### Classification

* **skip-thoughts** perform about as well as bag-of-words baselines but are outperformed by methods where the sentence representation has been learnt for the task at hand.
* Combining **skip-thoughts** with bi-gram Naive Bayes (NB) features improves the performance.

## Future Work

* Variants to be explored include:
    * Fine-tuning the encoder-decoder model during the downstream task instead of freezing the weights.
    * Deep encoders and decoders.
    * Larger context windows.
    * Encoding and decoding paragraphs.
    * Other encoders, such as convnets.
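
## Sketch: Vocabulary Expansion

The vocabulary expansion step (learning **W** so that encoder embeddings can be approximated from Word2Vec embeddings) can be sketched as below. This is a minimal illustration, not the paper's code: the function names `fit_linear_map` and `apply_map`, the toy 2-d and 3-d vectors, the word list, and the choice of plain batch gradient descent on a squared-error loss are all assumptions made for the example.

```python
def apply_map(W, x):
    """Return the matrix-vector product W x (W is a list of rows)."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]


def fit_linear_map(src_vecs, tgt_vecs, lr=0.1, steps=500):
    """Fit W by minimizing the mean squared error ||W x - y||^2.

    Plain batch gradient descent; lr and steps are illustrative values
    chosen for the toy data below, not tuned settings.
    """
    n = len(src_vecs)
    src_dim, tgt_dim = len(src_vecs[0]), len(tgt_vecs[0])
    W = [[0.0] * src_dim for _ in range(tgt_dim)]
    for _ in range(steps):
        grad = [[0.0] * src_dim for _ in range(tgt_dim)]
        for x, y in zip(src_vecs, tgt_vecs):
            pred = apply_map(W, x)
            for i in range(tgt_dim):
                err = pred[i] - y[i]
                for j in range(src_dim):
                    grad[i][j] += 2.0 * err * x[j] / n
        for i in range(tgt_dim):
            for j in range(src_dim):
                W[i][j] -= lr * grad[i][j]
    return W


# Hypothetical toy embeddings: 2-d "Word2Vec" space, 3-d "encoder" space.
word2vec = {
    "cat": [1.0, 0.0],
    "dog": [0.0, 1.0],
    "pet": [1.0, 1.0],
    "car": [-1.0, 1.0],
    "zebra": [2.0, -1.0],  # known to Word2Vec, unseen by the encoder
}
encoder_emb = {
    "cat": [1.0, 0.0, 0.5],
    "dog": [2.0, -1.0, 0.5],
    "pet": [3.0, -1.0, 1.0],
    "car": [1.0, -1.0, 0.0],
}

# Fit W only on the words that both models share.
shared = sorted(set(word2vec) & set(encoder_emb))
W = fit_linear_map([word2vec[w] for w in shared],
                   [encoder_emb[w] for w in shared])

# Map an unseen word into the encoder's embedding space.
zebra_vec = apply_map(W, word2vec["zebra"])
print(zebra_vec)
```

In practice the shared vocabulary is large and the embeddings are high-dimensional, so a closed-form or library least-squares solver would likely be preferable to the hand-written loop used here.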
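
## Sketch: Decoder Word Probabilities and Objective

The next-word distribution and the training objective described under Architecture can be illustrated with a small sketch. The function names, the toy vocabulary matrix `V`, and the decoder state values are hypothetical; a real decoder would compute each **h** with its conditional GRU, and the two decoders would differ in everything except the shared **V**.

```python
import math


def next_word_probs(V, h):
    """Softmax over the vocabulary: P(w) is proportional to exp(V[w] . h)."""
    scores = [sum(v * hi for v, hi in zip(row, h)) for row in V]
    top = max(scores)  # subtract the max score for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sentence_log_prob(V, hidden_states, word_ids):
    """Sum of log P(word_t | h_t) over the words of one target sentence."""
    return sum(
        math.log(next_word_probs(V, h)[w])
        for h, w in zip(hidden_states, word_ids)
    )


# Hypothetical 3-word vocabulary with 2-d decoder states.
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Made-up decoder states and target word ids for the two surrounding sentences.
prev_states, prev_words = [[0.5, 0.1], [0.2, 0.9]], [0, 1]
next_states, next_words = [[0.3, 0.3], [1.0, -0.5]], [2, 0]

# Objective for one sentence tuple: log-probability of the previous sentence
# plus log-probability of the next sentence (to be maximized during training).
objective = (sentence_log_prob(V, prev_states, prev_words)
             + sentence_log_prob(V, next_states, next_words))
print(objective)
```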