Empirical Methods in Natural Language Processing or EMNLP is a leading conference in the area of Natural Language Processing. EMNLP is organized by the ACL special interest group on linguistic data (SIGDAT).
The authors investigate convolutional networks for essay scoring. They use a two-level convolution – first over words and then over sentences. Evaluation is performed on the Kaggle ASAP dataset, training separate models on individual topics, and also reporting some cross-topic results.
They describe a neural model for text generation, which keeps track of a checklist of items that need to be mentioned in the text.
The basic system is an encoder-decoder GRU model for text generation. On top of that, the model uses attention over items that need to be mentioned and items that have already been mentioned, both of which are encoded as vectors. An additional cost objective encourages the checklist to be filled by the end of the text. Evaluation is performed on recipe and dialogue generation.
The authors construct a neural network for automated essay scoring.
Convolution window of 3 is passed over the text, which is used as input to an LSTM. The output of the LSTM is averaged over all timesteps and then a single value in the range of [0,1] is predicted as a scaled-down score for the essay. They evaluate by measuring quadratic weighted Kappa on the Kaggle essay scoring dataset.
The authors tackle the problem of domain adaptation for NER, where the label set of the target domain is different from the source domain.
They first train a CRF model on the source domain. Next, they train a LR classifier to predict labels in the target domain, based on predicted label scores from the model. Finally, the weights from the classifier are used to initialise another CRF model, which is then fine-tuned on the target domain data.
The authors compare different image recognition models and image data sources for multimodal word representation learning.
Image recognition models used for vector generation
Experiments are performed on SimLex-999 (similarity) and MEN (relatedness). The performance of different models (AlexNet, GoogLeNet, VGGNet) is found to be quite similar, with VGGNet performing slightly better at the cost of requiring more computation. Using search engines for image sources gives good coverage; ImageNet performs quite well with VGGNet; ESP Game dataset gave the lowest performance. Combining visual and linguistic vectors was found to be beneficial on both English and Italian.
They create an LSTM neural language model that 1) has better handling of numerical values, and 2) is conditioned on a knowledge base.
First the the numerical value each token is given as an additional signal to the network at each time step. While we normally represent token “25” as a normal word embedding, we now also have an extra feature with numerical value float(25). Second, they condition the language model on text in a knowledge base. All the information in the KB is converted to a string, passed through an LSTM and then used to condition the main LM.
They evaluate on a dataset of 16,003 clinical records which come paired with small KB tuples of 20 possible attributes. The numerical grounding helps quite a bit, and the best results are obtained when the KB conditioning is also added.
They start with the neural machine translation model using alignment, by Bahdanau et al. (2014), and add an extra variational component.
The authors use two neural variational components to model a distribution over latent variables z that captures the semantics of a sentence being translated. First, they model the posterior probability of z, conditioned on both input and output. Then they also model the prior of z, conditioned only on the input. During training, these two distributions are optimised to be similar using Kullback-Leibler distance, and during testing the prior is used. They report improvements on Chinese-English and English-German translation, compared to using the original encoder-decoder NMT framework.
This paper builds on top of a bunch of existing ideas for building neural conversational agents so as to control against generic and repetitive responses.
Their model is the sequence-to-sequence model with attention (Bahdanau et al.), first trained with the usual MLE loss and fine-tuned with policy gradients to optimize for specific conversational properties. Specifically, they define 3 rewards:
1. Ease of answering — Measured as the likelihood of responding to a query with a list of hand-picked dull responses (more negative log likelihood is higher reward).
2. Information flow — Consecutive responses from the same agent (person) should have different information, measured as negative of log cosine distance (more negative is better).
3. Semantic coherence — Mutual information between source and target (the response should make sense wrt query). $P(a|q) + P(q|a)$ where a is answer, q is question.
The model is pre-trained with the usual supervised objective function, taking source as concatenation of two previous utterances. Then they have two stages of policy gradient training, first with just a mutual information reward and then with a combination of all three. The policy network (sequence-to-sequence model) produces a probability distribution over actions (responses) given state (previous utterances). To estimate the gradient in an iteration, the network is frozen and responses are sampled from the model, the rewards for which are then averaged and gradients are computed for first L tokens of response using MLE and remaining T-L tokens with policy gradients, with L being gradually annealed to zero (moving towards just the long-term reward).
Evaluation is done based on length of dialogue, diversity (distinct unigram, bigrams) and human studies on
1. Which of two outputs has better quality (single turn)
2. Which of two outputs is easier to respond to, and
3. Which of two conversations have better quality (multi turn).
- Interesting results
- Avoids generic responses
- 'Ease of responding' reward encourages responses to be question-like
- Adding in hand-engineereed approximate reward functions based on conversational properties and using those to fine-tune a pre-trained network using policy gradients is neat.
- Policy gradient training also encourages two dialogue agents to interact with each other and explore the complete action space (space of responses), which seems desirable to identify modes of the distribution and not converge on a single, high-scoring, generic response.
## Weaknesses / Notes
- Evaluating conversational agents is hard. BLEU / perplexity are intentionally avoided as they don't necessarily reward desirable conversational properties.