How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation
Paper summary

#### Introduction

* The paper explores the strengths and weaknesses of different evaluation metrics for end-to-end dialogue systems (in the unsupervised setting).
* [Link to the paper](https://arxiv.org/abs/1603.08023)

#### Evaluation Metrics Considered

##### Word-Based Similarity Metrics

###### BLEU

* Analyses the co-occurrence of n-grams in the ground-truth and proposed responses.
* BLEU-N: N-gram precision computed over the entire dataset.
* A brevity penalty is added to avoid a bias towards short sentences.

###### METEOR

* Creates an explicit alignment between the candidate and target responses (using WordNet, stemmed tokens, etc.).
* Computes the harmonic mean of precision and recall between the proposed and ground-truth responses.

###### ROUGE

* F-measure based on the Longest Common Subsequence (LCS) between the candidate and target responses.

##### Embedding-Based Metrics

###### Greedy Matching

* Each token in the actual response is greedily matched with the most similar token in the predicted response, based on the cosine similarity of their word embeddings (and vice versa).
* The total score is averaged over all words.

###### Embedding Average

* Calculate a sentence-level embedding by averaging the word-level embeddings.
* Compare the sentence-level embeddings of the candidate and target sentences.

###### Vector Extrema

* For each dimension of the word vectors, take the most extreme value among all word vectors in the sentence, and use that value in the sentence-level embedding.
* The idea is that by taking the extrema along each dimension, we can ignore common words (which are pulled towards the origin in the vector space).

#### Dialogue Models Considered

##### Retrieval Models

###### TF-IDF

* Compute the TF-IDF vectors for each context and response in the corpus.
* C-TFIDF computes the cosine similarity between an input context and all other contexts in the corpus and returns the response paired with the highest-scoring context.
* R-TFIDF computes the cosine similarity between the input context and each response directly.
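As a rough illustration of the retrieval baselines, the C-TFIDF variant can be sketched as below. The toy corpus, the `tfidf_vectors` helper, and whitespace tokenisation are all simplifications for illustration, not the paper's actual implementation:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Minimal TF-IDF: term frequency times inverse document frequency."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))          # document frequency per word
    n = len(docs)
    vecs = []
    for doc in docs:
        tf = Counter(doc.split())
        vecs.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse (dict) vectors."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# toy (context, response) pairs -- purely illustrative
contexts  = ["how are you today", "what time is it", "where do you live"]
responses = ["i am fine thanks", "it is ten o'clock", "i live in montreal"]

def c_tfidf(query):
    """C-TFIDF: match the query against contexts, return the paired response."""
    vecs = tfidf_vectors(contexts + [query])
    scores = [cosine(vecs[-1], v) for v in vecs[:-1]]
    return responses[max(range(len(scores)), key=scores.__getitem__)]
```

R-TFIDF would be the same sketch with `responses` in place of `contexts` when building the vectors to score against.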
###### Dual Encoder

* Two RNNs respectively compute vector representations of the input context and the response.
* The model then calculates the probability that the given response is the ground-truth response for that context.

##### Generative Models

###### LSTM Language Model

* An LSTM trained to predict the next word in the (context, response) pair.
* Given a context, the model encodes it with the LSTM and generates a response using a greedy beam-search procedure.

###### Hierarchical Recurrent Encoder-Decoder (HRED)

* Uses a hierarchy of encoders.
* Each utterance in the context passes through an 'utterance-level' encoder, and the outputs of these encoders are passed through a 'context-level' encoder.
* Handles long-term dependencies better than the conventional encoder-decoder.

#### Observations

* A human survey was conducted to determine the correlation between human judgements of response quality and the score assigned by each metric.
* The metrics (especially BLEU-3 and BLEU-4) correlate poorly with human evaluation.
* Best-performing metrics:
  * Among word-overlap metrics: BLEU-2.
  * Among embedding-based metrics: vector average.
* Embedding-based metrics would benefit from weighting words by saliency.
* BLEU could still be a good evaluation metric in constrained tasks such as mapping dialogue acts to natural-language sentences.
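The correlation analysis behind these observations can be sketched with a hand-rolled Pearson correlation over per-response scores. The numbers below are invented for illustration (the paper uses real human ratings and also reports Spearman correlations):

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation between two score vectors."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc)))

# hypothetical per-response scores: human ratings vs. two automatic metrics
human = [4.5, 2.0, 3.5, 1.0, 5.0, 2.5]
bleu2 = [0.30, 0.10, 0.25, 0.05, 0.40, 0.12]  # tracks the ratings fairly well
bleu4 = [0.00, 0.02, 0.00, 0.01, 0.00, 0.02]  # near-zero, uninformative

r_bleu2 = pearson(human, bleu2)
r_bleu4 = pearson(human, bleu4)
```

A metric with a high correlation against human ratings (here `bleu2`) would be judged useful; one whose scores are flat or noisy (here `bleu4`, since 4-gram overlaps are rarely non-zero in open-domain dialogue) would not.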
Liu, Chia-Wei and Lowe, Ryan and Serban, Iulian Vlad and Noseworthy, Michael and Charlin, Laurent and Pineau, Joelle
arXiv e-Print archive - 2016 via Bibsonomy