Efficient Interactive Annotation of Segmentation Datasets with Polygon-RNN++
David Acuna, Huan Ling, Amlan Kar, Sanja Fidler. 2018
Paper summary

In this paper, the authors develop a system for automatic as well as interactive annotation (i.e. polygonal segmentation) of datasets. In the automatic mode, bounding boxes are generated by another network (e.g. Faster R-CNN), while in the interactive mode, the input bounding box around an object of interest comes from a human in the loop.
The system is composed of the following parts:
1. **Residual encoder with skip connections**. This step acts as a feature extractor. A ResNet-50 with a few modifications (reduced stride, dilated convolutions, removal of the average-pooling and FC layers) serves as the base CNN encoder. Instead of using only the last feature map of the network, the authors concatenate outputs from different layers, resized to the highest feature resolution, to capture a multi-level representation. This is shown in the figure below:
2. **Recurrent decoder** is a two-layer ConvLSTM that takes the image features and the previous (or first) vertex position as input and outputs a one-hot encoding over a 28x28 grid of possible vertex positions, plus one extra class indicating that the polygon is closed (i.e. the end of the sequence). Attention weights per location are computed from the CNN features and the hidden states of the first and second ConvLSTM layers. Training is formulated as reinforcement learning, with the recurrent decoder treated as a sequential decision-making agent. The reward function is the IoU between the mask enclosed by the predicted polygon and the ground-truth mask.
3. **Evaluator network** chooses the best polygon among multiple candidates. The CNN features, the last state tensor of the ConvLSTM, and the predicted polygon are used as input, and the output is the predicted IoU. The best polygon is selected from the polygons generated from the 5 top-scoring first-vertex predictions.
4. **Upscaling with a Graph Neural Network** takes the list of vertices generated by the ConvLSTM decoder, adds a node between each pair of consecutive nodes (to produce finer details at higher resolution), and predicts the relative offset of each node at the higher resolution. Specifically, it extracts features around every predicted vertex and forwards them through a GGNN (Gated Graph Neural Network) to obtain the final location (i.e. offset) of each vertex, treated as a classification task.
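As a rough sketch of the multi-level feature fusion in step 1 (the paper uses learned convolutions and bilinear resizing inside the ResNet-50 backbone; the nearest-neighbor upsampling and the example shapes here are illustrative assumptions):

```python
import numpy as np

def fuse_skip_features(feature_maps):
    """Resize each feature map to the highest spatial resolution and
    concatenate along the channel axis.

    feature_maps: list of arrays shaped (channels, H, W), where every
    resolution is assumed to be an integer divisor of the largest one.
    Uses nearest-neighbor upsampling for simplicity.
    """
    target_h = max(f.shape[1] for f in feature_maps)
    upsampled = []
    for f in feature_maps:
        scale = target_h // f.shape[1]
        # repeat each pixel `scale` times along both spatial axes
        up = f.repeat(scale, axis=1).repeat(scale, axis=2)
        upsampled.append(up)
    # multi-level representation: channels from all layers, one resolution
    return np.concatenate(upsampled, axis=0)
```

For example, fusing maps of shape (64, 28, 28), (128, 14, 14), and (256, 7, 7) yields a single (448, 28, 28) tensor at the finest resolution.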
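The node-insertion step that precedes the GGNN refinement in step 4 can be sketched as simple midpoint interleaving (the subsequent learned offset prediction is omitted here):

```python
import numpy as np

def add_midpoints(vertices):
    """Insert a node between each pair of consecutive polygon vertices
    (treating the polygon as closed), doubling the node count before
    the GGNN predicts per-node offsets at the higher resolution.

    vertices: (N, 2) array of (x, y) coordinates.
    """
    nxt = np.roll(vertices, -1, axis=0)     # successor of each vertex
    mids = (vertices + nxt) / 2.0           # midpoint of each edge
    out = np.empty((2 * len(vertices), 2))
    out[0::2] = vertices                    # original vertices
    out[1::2] = mids                        # inserted midpoints
    return out
```

Applied to a unit square's four corners, this returns eight nodes: each corner followed by the midpoint of the edge leaving it.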
The whole system is not trained end-to-end. While the network was trained on the Cityscapes dataset, it has shown reasonable generalization to different modalities (e.g. medical data). It would be interesting to observe generalization in the opposite direction: train on medical data and see how the model performs on Cityscapes.
arXiv e-Print archive - 2018 via Local arXiv
First published: 2018/03/26

Abstract: Manually labeling datasets with object masks is extremely time consuming. In
this work, we follow the idea of Polygon-RNN to produce polygonal annotations
of objects interactively using humans-in-the-loop. We introduce several
important improvements to the model: 1) we design a new CNN encoder
architecture, 2) show how to effectively train the model with Reinforcement
Learning, and 3) significantly increase the output resolution using a Graph
Neural Network, allowing the model to accurately annotate high-resolution
objects in images. Extensive evaluation on the Cityscapes dataset shows that
our model, which we refer to as Polygon-RNN++, significantly outperforms the
original model in both automatic (10% absolute and 16% relative improvement in
mean IoU) and interactive modes (requiring 50% fewer clicks by annotators). We
further analyze the cross-domain scenario in which our model is trained on one
dataset, and used out of the box on datasets from varying domains. The results
show that Polygon-RNN++ exhibits powerful generalization capabilities,
achieving significant improvements over existing pixel-wise methods. Using
simple online fine-tuning we further achieve a high reduction in annotation
time for new datasets, moving a step closer towards an interactive annotation
tool to be used in practice.