Evaluating Prerequisite Qualities for Learning End-to-End Dialog Systems on ShortScience.org

Evaluating Prerequisite Qualities for Learning End-to-End Dialog Systems
Jesse Dodge and Andreea Gane and Xiang Zhang and Antoine Bordes and Sumit Chopra and Alexander Miller and Arthur Szlam and Jason Weston
arXiv e-Print archive - 2015 via arXiv
Keywords: cs.CL, cs.LG

Summary by Shagun Sodhani 7 years ago

#### Introduction

* The paper presents a suite of benchmark tasks to evaluate end-to-end dialogue systems such that performing well on the tasks is a necessary (but not sufficient) condition for a fully functional dialogue agent.
* [Link to the paper](https://research.facebook.com/publications/evaluating-prerequisite-qualities-for-learning-end-to-end-dialog-systems/)

#### Dataset

* Created using large-scale real-world sources - OMDB (Open Movie Database), MovieLens and Reddit.
* Consists of ~75K movie entities and ~3.5M training examples.

#### Tasks

##### QA Task

* Answering factoid questions independently of any previous dialogue.
* A KB (Knowledge Base) is created from OMDB and stored as triples of the form (Entity, Relation, Entity).
* Questions (in natural language) are generated from templates built using [SimpleQuestions](https://arxiv.org/abs/1506.02075).
* Instead of giving out just one response, the system ranks all the candidate answers in order of their relevance.
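
The steps above (KB triples, template-generated questions, ranked answers) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy triples, the relation names, the two templates and the scoring rule are hypothetical and are not taken from the paper or the dataset.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (entity, relation, entity)

# Toy knowledge base; entries are illustrative, not from the actual dataset.
KB: List[Triple] = [
    ("Blade Runner", "directed_by", "Ridley Scott"),
    ("Blade Runner", "release_year", "1982"),
    ("Alien", "directed_by", "Ridley Scott"),
]

# Hypothetical question templates, one per relation.
TEMPLATES: Dict[str, str] = {
    "directed_by": "Who directed {entity}?",
    "release_year": "When was {entity} released?",
}


def generate_examples(kb: List[Triple]) -> List[Tuple[str, str]]:
    """Turn each triple into a (question, answer) pair via its relation template."""
    examples = []
    for subject, relation, obj in kb:
        template = TEMPLATES.get(relation)
        if template is not None:
            examples.append((template.format(entity=subject), obj))
    return examples


def rank_answers(question: str, kb: List[Triple]) -> List[str]:
    """Rank all candidate answers (every KB object) by a toy relevance score.

    A candidate earns one point per triple whose subject appears in the
    question and one more if that triple's template reproduces the question.
    """
    scores: Dict[str, int] = {}
    for subject, relation, obj in kb:
        score = 0
        if subject in question:
            score += 1
        template = TEMPLATES.get(relation)
        if template is not None and template.format(entity=subject) == question:
            score += 1
        scores[obj] = scores.get(obj, 0) + score
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda candidate: (-scores[candidate], candidate))
```

The point of the sketch is the output shape: the system returns a full ordering over candidates rather than a single answer, so it can be scored with ranking metrics.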

##### Recommendation Task

* Providing personalised responses to the user via recommendation, instead of providing universal facts as in the QA task.
* Built from the MovieLens dataset, which provides a *user x item* matrix of ratings.
* Statements (for any user) are generated by sampling movies that the user rated highly and forming a statement about these movies using natural language templates.
* As in the QA task, a ranked list of responses is generated.
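
A minimal sketch of how one such example could be assembled from a ratings table is shown below. The rating threshold, the number of context movies, the wording of the template and the toy ratings are all assumptions for illustration and should not be read as the paper's actual settings.

```python
import random
from typing import Dict, Optional, Tuple

# Toy user x item ratings; values are illustrative, not from MovieLens.
RATINGS: Dict[str, Dict[str, int]] = {
    "user_1": {"Alien": 5, "Blade Runner": 5, "Heat": 4, "Gigli": 1},
}


def build_example(
    user: str,
    ratings: Dict[str, Dict[str, int]],
    rng: random.Random,
    threshold: int = 4,   # hypothetical cut-off for "rated highly"
    n_context: int = 2,   # hypothetical number of movies in the statement
) -> Optional[Tuple[str, str]]:
    """Sample highly rated movies and return (statement, held-out target)."""
    liked = sorted(m for m, r in ratings[user].items() if r >= threshold)
    if len(liked) <= n_context:
        return None  # not enough liked movies to hold one out
    sampled = rng.sample(liked, n_context + 1)
    context, target = sampled[:-1], sampled[-1]
    # Hypothetical natural language template.
    statement = "I liked " + ", ".join(context) + ". Can you suggest another movie?"
    return statement, target
```

The held-out movie plays the role of the correct response; a model is then judged on how highly it ranks that movie among all candidates.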

##### QA + Recommendation Task

* Maintaining short dialogues involving both factoid and personalised content.
* The dataset consists of short conversations of 3 exchanges (3 turns from each participant).

##### Reddit Discussion Task

* Identify the most likely response in discussions on Reddit.
* The data is processed to flatten each potential conversation so that it appears to be a two-participant conversation.

##### Joint Task

* Combines all the previous tasks into one single task to test all the skills at once.

#### Models Tested

* **Memory Networks** - Comprise a memory component that includes both long-term memory and short-term context.

* **Supervised Embedding Models** - Sum the word embeddings of the input and the target independently and compare them with a similarity metric.

* **Recurrent Language Models** - RNN, LSTM, SeqToSeq

* **Question Answering Systems** - Systems that answer natural language questions by converting them into search queries over a KB.

* **SVD (Singular Value Decomposition)** - A standard baseline for recommendation.

* **Information Retrieval Models** - Given a message, either find the most similar message in the training dataset and return its response, or directly find the candidate response most similar to the input.
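
Two of the simpler baselines above can be sketched in a few lines. This is a toy illustration under stated assumptions: the Jaccard word overlap used for the information retrieval baseline, the dot product used as the similarity metric, the single shared embedding table and its hand-set 2-dimensional values are all hypothetical choices, whereas a real supervised embedding model would learn its embeddings from data.

```python
from typing import Dict, List, Tuple

DIM = 2  # toy embedding size, chosen only for readability


def tokens(text: str) -> List[str]:
    """Lowercase and split on whitespace after dropping '?' and '.'."""
    return text.lower().replace("?", "").replace(".", "").split()


def overlap(a: str, b: str) -> float:
    """Jaccard word overlap, a stand-in similarity for the IR baseline."""
    set_a, set_b = set(tokens(a)), set(tokens(b))
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def ir_respond(message: str, training_pairs: List[Tuple[str, str]]) -> str:
    """Return the response attached to the most similar training message."""
    _, best_response = max(training_pairs, key=lambda pair: overlap(message, pair[0]))
    return best_response


def bag_embedding(text: str, table: Dict[str, List[float]]) -> List[float]:
    """Sum the embeddings of the words in the text (unknown words add zero)."""
    total = [0.0] * DIM
    for word in tokens(text):
        for i, value in enumerate(table.get(word, [0.0] * DIM)):
            total[i] += value
    return total


def embedding_score(message: str, candidate: str, table: Dict[str, List[float]]) -> float:
    """Embed input and candidate independently, then compare by dot product."""
    m = bag_embedding(message, table)
    c = bag_embedding(candidate, table)
    return sum(x * y for x, y in zip(m, c))


# Toy training pairs and hand-set embeddings, for illustration only.
TRAIN = [("Who directed Alien?", "Ridley Scott"), ("When was Heat released?", "1995")]
TABLE = {
    "directed": [1.0, 0.0],
    "scott": [0.9, 0.1],
    "released": [0.0, 1.0],
    "1995": [0.1, 0.9],
}
```

Ranking every candidate response by `embedding_score` yields the ordered list that the tasks expect, while `ir_respond` shows the nearest-neighbour variant that simply copies the response of the closest training message.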

#### Results

##### QA Task

* QA System > Memory Networks > Supervised Embeddings > LSTM

##### Recommendation Task

* Supervised Embeddings > Memory Networks > LSTM > SVD

##### Tasks Involving Dialog History

* Covers the QA + Recommendation Task and the Reddit Discussion Task.
* Memory Networks > Supervised Embeddings > LSTM

##### Joint Task

* Supervised word embeddings perform very poorly even when using a large number of dimensions (2000 dimensions).
* Memory Networks perform better than embedding models as they can utilise the local context and the long-term memory. But they do not perform as well on standalone QA tasks.
