[link]
Summary by mrdrozdov 3 years ago
### General Approach
The Neural Tree Indexer (NTI) approach succeeded to reach 87.3\% test accuracy on SNLI. Here I'll attempt to clearly describe the steps involved based on the publication [1] and open sourced codebase [2].
NTI is a method to apply attention over a tree, specifically applied to sentence pairs. There are three main steps, each giving an incrementally more expressive representation of the input. It's worth noting that the tree is a full binary tree, so sentence lengths are padded to a factor of 2. In this case, the padded length used is $2^5 = 32$.
- **Sequence Encoding.** Run an RNN over your sentence to get new hidden states for each element.
$$h_t = f_1^{rnn}(i_t, h_{t-1})$$
- **Tree Encoding.** Using the hidden states from the previous step, use a variant of TreeLSTM to combine leaves until you have a single hidden state representing the entire sentence. Keep all of the intermediary hidden states for the next step.
$$ h_t^{tree} = f^{tree}(h_l^{tree},h_r^{tree})$$
- **Attention on Opposite Tree.** Until now we've only been describing how to encode a single sentence. When incorporating attention, we attend on the opposite tree by using the hidden states from the previous step. For instance, here is how we'd encode the premise (where the $p,h$ superscripts denote the premise or hypothesis, and $\vec{h}^{h,tree}$ denotes all of the hidden states of the non-attended hypothesis tree.:
$$h_t^p = f_1^{rnn}(i_t^p, h_{t-1}^p) \\
h_t^{p,tree} = f^{tree}(h_l^{p,tree},h_r^{p,tree}) \\
i_t^{p,attn} = f^{attn}(h_t^{p,tree}, \vec{h}^{h,tree}) \\
h_t^{p,attn} = f_2^{rnn}(i_t^{p,attn}, h_{t-1}^{p,attn})
$$
### Datasets
NTI was evaluated on three datasets. Some variant of the model achieved state-of-the-art in some category for each dataset:
- SNLI [3]: Sentence Pair Classification.
- WikiQA [4]: Answer Sentence Selection.
- Stanford Sentiment TreeBank (SST) [5]: Sentence Classification.
### Implementation Details
- Batch size is $32$ pairs (so $32$ of each premise and hypothesis).
- Tree is full binary tree with $2^5 = 32$ leaves.
- All sentences are padded left to length $32$, matching the full binary tree.
- Steps 1 (sentence encoding) runs on all sentence simultaneously. So is Step 2 (tree encoding). Step 3 (attention) is done first on the premise, then on the hypothesis.
- The variant of TreeLSTM used is S-LSTM. It's available as a standard function in Chainer.
- Dropout is applied liberally in each step. The keep rate is fixed at $80\%$.
- MLP has $1$ hidden layer with dimension $1024$. Dimensions of the entire MLP are: $(2 \times H) \times 1024 \times 3$. $H$ is the size of the hidden states and is $300$.
- Uses Chainer's Adam optimizer with $\alpha=0.0003,\beta_1=0.9,\beta_2=0.999,\epsilon=10^{-8}$. Gradient clipping using L2 norm of $40$. Parameters periodically scaled by $0.00003$ (weight decay).
- Weights are initialized uniformly random between $-0.1$ and $0.1$.
[1]: https://arxiv.org/abs/1607.04492
[2]: https://bitbucket.org/tsendeemts/nti/overview
[3]: nlp.stanford.edu/projects/snli/
[4]: https://www.microsoft.com/en-us/research/publication/wikiqa-a-challenge-dataset-for-open-domain-question-answering/
[5]: http://www.socher.org/index.php/Main/SemanticCompositionalityThroughRecursiveMatrix-VectorSpaces

more
less