End-to-end optimization of goal-driven and visually grounded dialogue systems
Paper summary

The authors propose an end-to-end approach to learning to play GuessWhat?!, a game that involves both images and text. They use supervised learning as a baseline and reinforcement learning to improve on it.

**GuessWhat?! rules** (*from the paper*): "GuessWhat?! is a cooperative two-player game in which both players see the picture of a rich visual scene with several objects. One player – the oracle – is randomly assigned an object (which could be a person) in the scene. This object is not known by the other player – the questioner – whose goal is to locate the hidden object. To do so, the questioner can ask a series of yes-no questions which are answered by the oracle."

**Why use reinforcement learning in a dialogue context?**

Supervised learning in a dialogue system usually gives poor results because the agent only learns to reproduce the exact sentences found in the training set. Reinforcement learning is a better fit: it does not try to match sentences exactly, but allows more flexibility as long as the dialogue ends with a positive reward. The problem is that, in a general dialogue context, it is hard to tell whether a dialogue was "good" (positive reward) or "bad" (negative reward). In GuessWhat?! the reward is easy to define: if the guesser finds the object assigned to the oracle, it gets a positive reward; otherwise it gets a negative one.

The dataset consists of 150k human-human dialogues.

**Models used**

*Oracle model*: its goal is to answer 'yes' or 'no' to the question asked by the agent.
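As a rough illustration, here is a minimal numpy sketch of such a yes/no oracle: a question encoding, the object's bounding box, and a one-hot category are concatenated and passed through a single-hidden-layer MLP. All dimensions, weights, and function names here are hypothetical, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes, not the paper's actual dimensions.
Q_DIM, N_CATEGORIES, HIDDEN = 16, 5, 32
IN_DIM = Q_DIM + 4 + N_CATEGORIES            # question + (x, y, w, h) + category

W1 = rng.normal(0, 0.1, (IN_DIM, HIDDEN))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(0, 0.1, (HIDDEN, 1))
b2 = np.zeros(1)

def oracle_prob_yes(question_enc, bbox, category_id):
    """P(answer = 'yes') from concatenated question/spatial/category features."""
    one_hot = np.eye(N_CATEGORIES)[category_id]
    x = np.concatenate([question_enc, bbox, one_hot])
    h = np.tanh(x @ W1 + b1)                 # single hidden layer
    logit = h @ W2 + b2
    return 1.0 / (1.0 + np.exp(-logit[0]))   # sigmoid -> probability of "yes"

p = oracle_prob_yes(rng.normal(size=Q_DIM), np.array([0.1, 0.2, 0.3, 0.4]), 2)
```

In the paper the question encoding comes from an LSTM; here it is a random vector so the sketch stays self-contained.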
The following are concatenated:

- an LSTM encoding of the question asked
- the location of the object (coordinates of its bounding box)
- the object category

The resulting vector is fed to a single-hidden-layer MLP.

https://i.imgur.com/SjWkciI.png

*Questioner model*: the questioner is split into two models.

- Question generator:
  - **Input**: the history of questions already asked (if any) and the words of the current question generated so far
  - **Model**: LSTM with a softmax output
  - **Output**: the next word of the question
- Guesser:
  - **Input**: the image plus all the questions and answers
  - **Model**: MLP with a softmax output
  - **Output**: one object selected among all the objects in the image

**Training procedure**

All the components above are first trained in a supervised way. Once this training is done, the system is good enough to play on its own, but the questioner is still fairly weak. To improve it, the question generator is fine-tuned with the REINFORCE algorithm: the reward is positive if the guesser picks the correct object, and negative otherwise.

**Main results**

Results are reported on both new objects (images already seen during training, but with a target object never selected before) and new images. Scores are given as a percentage of the human score, not as absolute accuracy (100% means human-level performance).

|                       | New objects | New images |
|-----------------------|-------------|------------|
| Baseline (supervised) | 53.4%       | 53%        |
| REINFORCE             | 63.2%       | 62%        |

REINFORCE brings a clear improvement. This is mainly because the supervised model does not know when to stop asking questions and commit to an answer; REINFORCE is more accurate but tends to stop too early (and therefore sometimes gives wrong answers).

One last point about the learned language: it is still quite poor. The questions are mostly of the form "Is it ... ?", and since the oracle only answers yes/no questions, the interaction remains relatively limited.
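The REINFORCE fine-tuning described above can be sketched as follows. A toy categorical policy over a tiny vocabulary stands in for the LSTM question generator; the update pushes up the log-probability of every sampled word in proportion to (reward - baseline). All names and hyperparameters are illustrative, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 6
theta = np.zeros(VOCAB)                      # policy logits over a toy vocabulary

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reinforce_step(sampled_words, reward, baseline, lr=0.1):
    """One policy-gradient step: grad log pi(w) * (reward - baseline) per word."""
    global theta
    advantage = reward - baseline            # reward = 1 if the guesser succeeded
    for w in sampled_words:
        probs = softmax(theta)
        grad = -probs
        grad[w] += 1.0                       # gradient of log softmax(theta)[w]
        theta = theta + lr * advantage * grad

before = softmax(theta)[3]
# Pretend the sampled question used words [3, 3, 1] and the guesser succeeded.
reinforce_step([3, 3, 1], reward=1.0, baseline=0.5)
after = softmax(theta)[3]
# With a positive advantage, the sampled words become more likely.
```

The baseline term is the usual variance-reduction trick; in practice the paper's setup samples full dialogues and only then receives the terminal reward from the guesser.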

Summary by Mathieu Seurin