SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient
Lantao Yu, Weinan Zhang, Jun Wang, Yong Yu (2016)
Paper summary by decodyng
GANs for images have made impressive progress in recent years, reaching ever-higher levels of subjective realism. It’s also interesting to think about domains where the GAN architecture is less of a good fit. An example of one such domain is natural language.
As opposed to images, which are made of continuous pixel values, sentences are fundamentally sequences of discrete values: that is, words. In a GAN, when the discriminator makes its assessment of the realness of the image, the gradient for that assessment can be backpropagated through to the pixel level. The discriminator can say “move that pixel just a bit, and this other pixel just a bit, and then I’ll find the image more realistic”. However, there is no smoothly flowing continuous space of words. Even if you use continuous embeddings of words, a small change to an embedding vector almost certainly won’t land on another word; you’d just be somewhere in the middle of nowhere in word space. In short: the discrete nature of language sequences doesn’t allow the gradient to propagate backwards through to the generator.
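The embedding argument above can be illustrated with a toy sketch. The three words and their 2-D vectors are made up for illustration, and `nudge` and `nearest_word` are hypothetical helpers, not anything from the paper. A small gradient-style step moves the vector off every word in the vocabulary, and snapping back to the nearest word just returns the original word.

```python
import math

# Toy 2-D embeddings; the words and vectors are invented for illustration.
EMBEDDINGS = {
    "cat": (1.0, 0.0),
    "dog": (0.0, 1.0),
    "car": (-1.0, 0.0),
}

def nearest_word(vec):
    """Return the vocabulary word whose embedding is closest to vec."""
    return min(EMBEDDINGS, key=lambda w: math.dist(EMBEDDINGS[w], vec))

def nudge(vec, grad, step=0.05):
    """Take one small step along grad, as a discriminator gradient might request."""
    return tuple(v + step * g for v, g in zip(vec, grad))

# Move "cat" slightly in the direction of "dog".
nudged = nudge(EMBEDDINGS["cat"], grad=(-1.0, 1.0))

# The nudged point is not the embedding of any word...
is_a_word = nudged in EMBEDDINGS.values()
# ...and rounding to the nearest word discards the small change entirely.
snapped = nearest_word(nudged)
```

Snapping to the nearest word is a non-differentiable step, so even this workaround gives the generator no gradient to follow.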
The authors of this paper propose a solution: instead of trying to treat their GAN as one big differentiable system, they frame the problem of “generate a sequence that will seem realistic to the discriminator” as a reinforcement learning problem. After all, this property - of your reward just being generated *somewhere* in the environment, not something analytic, not something you can backprop through - is one of the key constraints of reinforcement learning. Here, the more real the discriminator finds your sequence, the higher the reward. One approach to RL, and the one this paper uses, is that of a policy network, where your parametrized network produces a distribution over actions. You can’t update your model to deterministically increase reward, but you can shift probability mass around in your policy so that the expected reward of following that policy is higher.
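As a minimal sketch of “shifting probability toward higher expected reward”, here is a REINFORCE-style update on a tiny softmax policy over three actions. This is a generic policy-gradient illustration under simplifying assumptions, not the paper's implementation. `black_box_reward` is a hypothetical stand-in for the discriminator, and the learning rate, step count, and seed are arbitrary.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def black_box_reward(action):
    """Hypothetical stand-in for the discriminator: queryable, not differentiable."""
    return 1.0 if action == 2 else 0.0

def reinforce(num_actions=3, steps=2000, lr=0.1, seed=0):
    """Train the policy logits with the score-function (REINFORCE) estimator."""
    rng = random.Random(seed)
    logits = [0.0] * num_actions
    for _ in range(steps):
        probs = softmax(logits)
        # Sample an action from the current policy.
        action = rng.choices(range(num_actions), weights=probs)[0]
        reward = black_box_reward(action)
        # d/d logit_k of log pi(action) is (1[k == action] - probs[k]).
        for k in range(num_actions):
            grad_log_prob = (1.0 if k == action else 0.0) - probs[k]
            logits[k] += lr * reward * grad_log_prob
    return softmax(logits)

final_probs = reinforce()
```

The reward function is never differentiated; only the log-probability of the sampled action is. That is why the same trick applies when the reward comes from a discriminator judging discrete tokens.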
This key kernel of an idea - GANs for language, but using a policy network framework to get around not having a backprop-able loss/reward - gets you most of the way to understanding what these authors did, but it’s still useful to mechanically walk through the specifics:
- At each step, the “state” is the words generated so far in the sequence, and the agent’s “action” is its choice of the next word
- The Discriminator can only be applied to completed sequences, since it's difficult to determine whether an incoherent half-sentence is realistic language. So, when the agent is trying to calculate the reward of an action at a state, it uses Monte Carlo search: “rolling out” many possible futures by sampling from the policy until each sequence is complete, and then taking the average Discriminator judgment over the futures that follow each action as that action's expected reward
- The Generator is an LSTM that produces a softmax over words, which can be interpreted as a policy if it’s sampled from randomly
- One of the nice benefits of this approach is that it can work well for cases where we don't have a hand-crafted quality assessment metric, the way we have BLEU score for translation
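The rollout step in the list above can be sketched as follows. Everything here is a toy assumption for illustration: a two-token vocabulary, a fixed sequence length, a uniform-random `policy_sample` in place of the LSTM generator, and a `discriminator` that simply scores the fraction of "a" tokens. Only the structure (complete the sequence by sampling, score it, average) reflects the summary.

```python
import random

VOCAB = ["a", "b"]  # hypothetical two-token vocabulary
SEQ_LEN = 4         # hypothetical fixed sequence length

def policy_sample(prefix, rng):
    """Toy generator policy: picks the next token uniformly at random."""
    return rng.choice(VOCAB)

def discriminator(sequence):
    """Toy discriminator: scores a complete sequence by its fraction of 'a' tokens."""
    assert len(sequence) == SEQ_LEN, "only complete sequences can be scored"
    return sequence.count("a") / SEQ_LEN

def rollout_reward(prefix, action, n_rollouts, rng):
    """Estimate the reward of taking `action` after `prefix` by averaging
    discriminator scores over sequences completed by sampling the policy."""
    total = 0.0
    for _ in range(n_rollouts):
        seq = list(prefix) + [action]
        # Roll out: keep sampling tokens until the sequence is complete.
        while len(seq) < SEQ_LEN:
            seq.append(policy_sample(seq, rng))
        total += discriminator(seq)
    return total / n_rollouts

rng = random.Random(0)
reward_a = rollout_reward(["a"], "a", n_rollouts=500, rng=rng)
reward_b = rollout_reward(["a"], "b", n_rollouts=500, rng=rng)
```

With this toy discriminator the true expected rewards are 0.75 for "a" and 0.5 for "b", so the averaged estimates should land near those values. In a real setup these per-action estimates would be the rewards fed into the policy-gradient update.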
First published: 2016/09/18
Abstract: As a new way of training generative models, Generative Adversarial Nets (GAN)
that uses a discriminative model to guide the training of the generative model
has enjoyed considerable success in generating real-valued data. However, it
has limitations when the goal is for generating sequences of discrete tokens. A
major reason lies in that the discrete outputs from the generative model make
it difficult to pass the gradient update from the discriminative model to the
generative model. Also, the discriminative model can only assess a complete
sequence, while for a partially generated sequence, it is non-trivial to
balance its current score and the future one once the entire sequence has been
generated. In this paper, we propose a sequence generation framework, called
SeqGAN, to solve the problems. Modeling the data generator as a stochastic
policy in reinforcement learning (RL), SeqGAN bypasses the generator
differentiation problem by directly performing gradient policy update. The RL
reward signal comes from the GAN discriminator judged on a complete sequence,
and is passed back to the intermediate state-action steps using Monte Carlo
search. Extensive experiments on synthetic data and real-world tasks
demonstrate significant improvements over strong baselines.