#### Introduction

* The paper presents a suite of benchmark tasks for evaluating end-to-end dialogue systems, chosen so that performing well on the tasks is a necessary (but not sufficient) condition for a fully functional dialogue agent.
* [Link to the paper](https://research.facebook.com/publications/evaluating-prerequisite-qualities-for-learning-end-to-end-dialog-systems/)

#### Dataset

* Created using large-scale real-world sources: OMDB (Open Movie Database), MovieLens and Reddit.
* Consists of ~75K movie entities and ~3.5M training examples.

#### Tasks

##### QA Task

* Answering factoid questions without relation to the previous dialogue.
* The KB (Knowledge Base) is created using OMDB and stored as triplets of the form (Entity, Relation, Entity).
* Questions (in natural language form) are generated by creating templates using [SimpleQuestions](https://arxiv.org/abs/1506.02075).
* Instead of giving out just 1 response, the system ranks all the answers in order of their relevance.

##### Recommendation Task

* Providing personalised responses to the user via recommendation, instead of providing universal facts as in the QA task.
* Uses the MovieLens dataset, with a *user x item* matrix of ratings.
* Statements (for any user) are generated by sampling movies rated highly by the user and forming a statement about these movies using natural language templates.
* As in the QA task, a list of ranked responses is generated.

##### QA + Recommendation Task

* Maintaining short dialogues involving both factoid and personalised content.
* The dataset consists of short conversations of 3 exchanges (3 turns from each participant).

##### Reddit Discussion Task

* Identify the most likely response in discussions on Reddit.
* The data is processed to flatten each discussion so that it appears to be a two-participant conversation.

##### Joint Task

* Combines all the previous tasks into a single task to test all the skills at once.
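
The QA task's data generation (KB triplets turned into natural-language questions via templates) can be sketched roughly as follows. This is a minimal illustration, not the paper's pipeline: the relation names (`directed_by`, `release_year`), the template strings, and the helper `generate_qa_pairs` are all hypothetical, and the three triples are just a tiny hand-picked sample.

```python
from collections import defaultdict

# Illustrative (Entity, Relation, Entity) triples; relation names are hypothetical.
KB = [
    ("Blade Runner", "directed_by", "Ridley Scott"),
    ("Blade Runner", "release_year", "1982"),
    ("Alien", "directed_by", "Ridley Scott"),
]

# Hypothetical natural-language templates, one per relation.
TEMPLATES = {
    "directed_by": "Who directed {subject}?",
    "release_year": "When was {subject} released?",
}


def generate_qa_pairs(kb, templates):
    """Build (question, gold_answers) pairs from KB triples and templates."""
    # Group objects by (subject, relation) so a question can have several answers.
    answers = defaultdict(set)
    for subject, relation, obj in kb:
        answers[(subject, relation)].add(obj)

    pairs = []
    for (subject, relation), objs in answers.items():
        if relation in templates:
            question = templates[relation].format(subject=subject)
            pairs.append((question, sorted(objs)))
    return pairs


qa_pairs = generate_qa_pairs(KB, TEMPLATES)
for question, gold in qa_pairs:
    print(question, "->", gold)
```

A system under evaluation would then rank all candidate entities for each generated question, and the gold answers determine how good that ranking is.
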
#### Models Tested

* **Memory Networks** - Comprise a memory component that includes both long-term memory and short-term context.
* **Supervised Embedding Models** - Sum the word embeddings of the input and the target independently and compare them with a similarity metric.
* **Recurrent Language Models** - RNN, LSTM, SeqToSeq.
* **Question Answering Systems** - Systems that answer natural language questions by converting them into search queries over a KB.
* **SVD (Singular Value Decomposition)** - Standard benchmark for recommendation.
* **Information Retrieval Models** - Given a message, either find the most similar message in the training dataset and report its output, or find the response most similar to the input directly.

#### Result

##### QA Task

* QA System > Memory Networks > Supervised Embeddings > LSTM

##### Recommendation Task

* Supervised Embeddings > Memory Networks > LSTM > SVD

##### Tasks Involving Dialog History

* Covers the QA + Recommendation Task and the Reddit Discussion Task.
* Memory Networks > Supervised Embeddings > LSTM

##### Joint Task

* Supervised word embeddings perform very poorly, even when using a large number of dimensions (2000 dimensions).
* Memory Networks perform better than embedding models, as they can utilise both the local context and the long-term memory. However, they do not perform as well on the standalone QA task as they do when trained on it alone.
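
The scoring idea behind the supervised embedding models listed under Models Tested (sum the word embeddings of the input and of each candidate, then compare with a similarity metric) can be sketched as below. This is a simplified illustration under stated assumptions: the embeddings are random and untrained, a single table is shared between inputs and targets, cosine similarity is used as the metric, and all function names and the dimension `DIM` are hypothetical.

```python
import math
import random

DIM = 16  # embedding size, chosen only for illustration


def build_embeddings(vocab, dim=DIM, seed=0):
    """Assign each word a random vector (a stand-in for trained embeddings)."""
    rng = random.Random(seed)
    return {word: [rng.gauss(0.0, 1.0) for _ in range(dim)] for word in vocab}


def embed(text, embeddings, dim=DIM):
    """Represent a text as the sum of its word embeddings (bag of words)."""
    total = [0.0] * dim
    for word in text.lower().split():
        vec = embeddings.get(word)
        if vec is not None:  # unknown words are skipped
            total = [t + v for t, v in zip(total, vec)]
    return total


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_candidates(message, candidates, embeddings):
    """Return the candidates sorted from most to least similar to the message."""
    query = embed(message, embeddings)
    scored = [(cosine(query, embed(c, embeddings)), c) for c in candidates]
    return [c for _, c in sorted(scored, key=lambda pair: pair[0], reverse=True)]


candidates = ["ridley scott", "who directed alien", "1982"]
vocab = {w for text in candidates for w in text.split()}
embeddings = build_embeddings(vocab)
ranked = rank_candidates("who directed alien", candidates, embeddings)
print(ranked)
```

With untrained vectors the ranking is meaningless beyond the trivial case of an identical string; in a trained model the embeddings would be learned so that correct responses score above incorrect ones.
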