TLDR; The authors lift the skip-gram idea from word2vec to the sentence level, training encoder-decoder models that predict the previous and next sentences. The resulting general-purpose vector representations are called skip-thought vectors. The authors evaluate the performance of these vectors as features on semantic relatedness and classification tasks, achieving competitive results, but not beating fine-tuned models.

#### Key Points

- Code at https://github.com/ryankiros/skip-thoughts
- Training is done on a large book corpus (74M sentences, 1B tokens) and takes 2 weeks.
- Two variations: a bidirectional encoder and a unidirectional encoder, with 1200 and 2400 units per encoder respectively. GRU cells, Adam optimizer, gradient clipping at norm 10.
- The vocabulary can be expanded by learning a mapping from a large word2vec vocabulary to the smaller skip-thought vocabulary. One could also use a sampled/hierarchical softmax during training for a larger vocabulary, or train on characters.

#### Questions/Notes

- The authors clearly state that this is not the goal of the paper, but I'd be curious how more sophisticated (non-linear) classifiers perform with skip-thought vectors. The authors probably tried this but it didn't do well ;)
- The fact that story generation doesn't seem to work well shows that the model has problems learning or understanding long-term dependencies. I wonder if this could be solved by deeper encoders or attention.
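The vocabulary-expansion trick mentioned above amounts to fitting a linear map from the word2vec embedding space into the encoder's word-embedding space on the words both models share. A minimal NumPy sketch, with random stand-ins for the two embedding tables (the shapes are assumptions; the paper maps 300-d word2vec vectors into the encoder's word-embedding space):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: one row per word shared by both vocabularies.
n_shared, d_w2v, d_rnn = 1000, 300, 620
X_w2v = rng.normal(size=(n_shared, d_w2v))  # word2vec embeddings
X_rnn = rng.normal(size=(n_shared, d_rnn))  # encoder's word embeddings

# Least-squares fit of W so that X_w2v @ W ≈ X_rnn on the shared words.
W, *_ = np.linalg.lstsq(X_w2v, X_rnn, rcond=None)

# Any word2vec vector can now be projected into the encoder's space,
# giving an embedding for a word never seen during skip-thought training.
unseen = rng.normal(size=(d_w2v,))
expanded = unseen @ W  # shape: (d_rnn,)
```

With real embeddings the shared vocabulary plays the role of the training pairs, and every remaining word2vec word gets projected the same way.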
# Skip-Thought Vectors

## Introduction

* The paper describes an unsupervised approach to train a generic, distributed sentence encoder.
* It also describes a vocabulary expansion method to encode words not seen at training time.
* [Link to the paper](https://arxiv.org/abs/1506.06726)

## Skip-Thoughts

* Train an encoder-decoder model where the encoder maps the input sentence to a sentence vector and the decoder generates the sentences surrounding the original sentence.
* The model is called **skip-thoughts** and the encoded vectors are called **skip-thought vectors**.
* Similar to the [skip-gram](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) model in the sense that surrounding sentences are used to learn sentence vectors.

### Architecture

* Training data is in the form of sentence tuples (previous sentence, current sentence, next sentence).
* **Encoder**
    * RNN encoder with GRU cells.
* **Decoder**
    * RNN decoder with a conditional GRU.
    * Conditioned on the encoder output.
    * Extra matrices are introduced to bias the update gate, reset gate and hidden state, given the encoder output.
* **Vocabulary matrix (V)** - weight matrix with one row (vector) for each word in the vocabulary.
* Separate decoders for the previous and next sentence, which share only **V**.
* Given the decoder state **h** (at any time step), the encoder output, and the words already generated for the output sentence, the probability of choosing *w* as the next word is proportional to *exp(**V**(*w*) · **h**)*.
* **Objective**
    * Sum of the log-probabilities of the forward and backward sentences conditioned on the encoder output.

## Vocabulary Expansion

* Use a model like Word2Vec, which can be trained to induce word representations, to obtain embeddings for all the words that are likely to be seen by the encoder.
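The conditional GRU in the architecture section — each gate biased by the encoder output through an extra matrix — and the word-probability rule can be sketched in NumPy. Names and dimensions here are illustrative, not taken from the paper's code; the previous word's embedding is assumed to have the same size as the hidden state for brevity:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d, v = 8, 20  # hidden size and vocabulary size (toy values)
rng = np.random.default_rng(1)
p = {k: rng.normal(scale=0.1, size=(d, d))
     for k in ("Wr", "Ur", "Cr", "Wz", "Uz", "Cz", "W", "U", "C")}
V = rng.normal(scale=0.1, size=(v, d))  # vocabulary matrix: one row per word

def decoder_step(x, h_prev, h_enc):
    """One conditional-GRU step: every gate also sees the encoder output h_enc."""
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h_prev + p["Cr"] @ h_enc)  # reset gate
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h_prev + p["Cz"] @ h_enc)  # update gate
    h_bar = np.tanh(p["W"] @ x + p["U"] @ (r * h_prev) + p["C"] @ h_enc)
    return (1 - z) * h_prev + z * h_bar

h = decoder_step(rng.normal(size=d), np.zeros(d), rng.normal(size=d))

# P(next word = w) ∝ exp(V[w] · h): a softmax over the vocabulary matrix.
logits = V @ h
probs = np.exp(logits) / np.exp(logits).sum()
```

The previous- and next-sentence decoders would each hold their own copy of the gate matrices while sharing only `V`.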
* Learn a matrix **W** such that *encoder(word) ≈ **W** · Word2Vec(word)* for all words common to both the Word2Vec model and the encoder model.
* Use **W** to generate embeddings for words that were not seen during encoder training.

## Dataset

* [BookCorpus dataset](https://arxiv.org/abs/1506.06724), containing books across 16 genres.

## Training

* **uni-skip**
    * Unidirectional encoder with 2400 dimensions.
* **bi-skip**
    * Bidirectional model with forward (sentence given in correct order) and backward (sentence given in reverse order) encoders of 1200 dimensions each.
* **combine-skip**
    * Concatenation of the uni-skip and bi-skip vectors.
* Initialization
    * Recurrent matrices - orthogonal initialization.
    * Non-recurrent matrices - uniform distribution in [-0.1, 0.1].
* Mini-batches of size 128.
* Gradient clipping at norm 10.
* Adam optimizer.

## Experiments

* After learning skip-thoughts, freeze the model and use the encoder as a feature extractor only.
* The vectors are evaluated with linear models on the following tasks:

### Semantic Relatedness

* Given a sentence pair, predict how closely related the two sentences are.
* The **skip-thoughts** method outperforms all systems from the SemEval 2014 competition and is outperformed only by dependency tree-LSTMs.
* Adding features learned from an image-sentence embedding model on COCO boosts performance and brings it on par with dependency tree-LSTMs.

### Paraphrase Detection

* **skip-thoughts** outperforms recursive nets with dynamic pooling if no hand-crafted features are used.
* **skip-thoughts** with basic pairwise statistics produces results comparable with state-of-the-art systems that rely on complicated features and hand engineering.

### Image-Sentence Ranking

* MS COCO dataset.
* Tasks:
    * Image annotation - given an image, rank the sentences by how well they describe the image.
    * Image search - given a caption, find the image that is being described.
* Though the system does not outperform the baseline system in all cases, the results do indicate that skip-thought vectors can capture image descriptions without having to learn their representations from scratch.

### Classification

* **skip-thoughts** performs about as well as bag-of-words baselines but is outperformed by methods where the sentence representation has been learnt for the task at hand.
* Combining **skip-thoughts** with bi-gram Naive Bayes (NB) features improves performance.

## Future Work

* Variants to be explored include:
    * Fine-tuning the encoder-decoder model on the downstream task instead of freezing the weights.
    * Deep encoders and decoders.
    * Larger context windows.
    * Encoding and decoding paragraphs.
    * Other encoders, such as convnets.
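For the sentence-pair tasks above, the pairwise statistics fed to the linear model are, for the relatedness experiments, the component-wise product and absolute difference of the two skip-thought vectors (paraphrase detection adds further statistics on top). A short sketch with random stand-ins for the encoded sentences:

```python
import numpy as np

def pair_features(u, v):
    """Pairwise features for two sentence vectors: component-wise
    product and absolute difference, concatenated."""
    return np.concatenate([u * v, np.abs(u - v)])

rng = np.random.default_rng(2)
u, v = rng.normal(size=2400), rng.normal(size=2400)  # e.g. two uni-skip vectors
feats = pair_features(u, v)  # shape (4800,): the input to a linear classifier
```

Both statistics are symmetric in the two sentences, so the feature vector does not depend on the order of the pair.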