Towards Diverse and Natural Image Descriptions via a Conditional GAN
Bo Dai, Sanja Fidler, Raquel Urtasun, Dahua Lin, 2017
This paper proposes a conditional GAN-based image captioning model.
Given an image, the generator produces a caption; given an image and
a caption, the discriminator/evaluator distinguishes between generated
and real captions. Key ideas:
- Since caption generation involves sequential sampling, which is
non-differentiable, the model is trained with policy gradients: the
action is the choice of word at each time step, the policy is the
generator's distribution over words, and the reward is the score the
evaluator assigns to the generated caption.
- The evaluator expects a completely generated caption as input (along
with the image), so the reward arrives only once a sentence is finished,
which in practice leads to convergence issues. To provide feedback for
partial sequences during training, Monte Carlo rollouts are used: given
a partial generated sequence, n completions are sampled and run through
the evaluator, and their mean score serves as the reward (see the first
sketch after this list).
- The evaluator's objective function consists of three terms (see the
second sketch after this list):
  - image-caption pairs from the training data (positive)
  - the image paired with captions produced by the generator (negative)
  - the image paired with captions of other images sampled from the training data (negative)
- Both the generator and the evaluator are pretrained with supervised MLE,
then fine-tuned with policy gradients. At inference time, the evaluator's
score is used as the beam search objective.
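
To make the reward assignment concrete, here is a minimal PyTorch-style sketch of the policy-gradient update with Monte Carlo rollouts. The `generator.sample_word`, `generator.sample_continuation`, and `evaluator(image, caption)` interfaces are hypothetical stand-ins for this sketch, not the paper's actual API.

```python
import torch

def rollout_reward(generator, evaluator, image, prefix, max_len, n_rollouts=16):
    """Reward for a partial caption: sample n completions from the current
    policy and average the evaluator's scores (Monte Carlo rollout)."""
    scores = []
    for _ in range(n_rollouts):
        caption = generator.sample_continuation(image, prefix, max_len)  # hypothetical
        with torch.no_grad():
            scores.append(evaluator(image, caption))  # score in [0, 1]
    return torch.stack(scores).mean()

def policy_gradient_step(generator, evaluator, optimizer, image, max_len=20):
    """One REINFORCE update: the action at step t is the sampled word, and its
    reward is the rollout estimate for the prefix it extends. (EOS handling
    and variance-reduction baselines are omitted for brevity.)"""
    prefix, log_probs, rewards = [], [], []
    for t in range(max_len):
        word, log_p = generator.sample_word(image, prefix)  # hypothetical interface
        prefix.append(word)
        log_probs.append(log_p)
        rewards.append(rollout_reward(generator, evaluator, image, prefix, max_len))
    # REINFORCE: maximize E[reward] <=> minimize -sum(log pi(a|s) * reward).
    loss = -(torch.stack(log_probs) * torch.stack(rewards)).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```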
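And a sketch of the evaluator's three-term objective, again assuming a hypothetical `evaluator(images, captions)` that returns the probability that each caption matches its image; the equal weighting of the two negative terms is an assumption of this sketch.

```python
import torch

def evaluator_loss(evaluator, images, real_caps, fake_caps, mismatched_caps):
    """Three-term objective: ground-truth pairs should score high; generated
    captions and captions taken from other images should score low."""
    eps = 1e-8
    s_real = evaluator(images, real_caps)        # (image, human caption): positive
    s_fake = evaluator(images, fake_caps)        # (image, generated caption): negative
    s_mism = evaluator(images, mismatched_caps)  # (image, other image's caption): negative
    return -(torch.log(s_real + eps).mean()
             + torch.log(1.0 - s_fake + eps).mean()
             + torch.log(1.0 - s_mism + eps).mean())
```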
This is a neat paper with insightful ideas (Monte Carlo rollouts for
assigning rewards to partial sequences, the evaluator's score as the beam
search objective), and is perhaps the first work on conditional-GAN-based
image captioning.
## Weaknesses / Notes