First published: 2016/05/30
Abstract: While convolutional neural networks have gained impressive success recently
in solving structured prediction problems such as semantic segmentation, it
remains a challenge to differentiate individual object instances in the scene.
Instance segmentation is very important in a variety of applications, such as
autonomous driving, image captioning, and visual question answering. Techniques
that combine large graphical models with low-level vision have been proposed to
address this problem; however, we propose an end-to-end recurrent neural
network (RNN) architecture with an attention mechanism to model a human-like
counting process, and produce detailed instance segmentations. The network is
jointly trained to sequentially produce regions of interest as well as a
dominant object segmentation within each region. The proposed model achieves
competitive results on the CVPPP, KITTI, and Cityscapes datasets.
This paper combines recurrent attention for detecting objects in an image \cite{1406.6247}, extended to multiple objects \cite{1412.7755}, with semantic segmentation \cite{1505.04366}.
Segmenting within subregions avoids a global resolution bias (inside its subregion, the object takes up the majority of pixels) and allows objects at multiple scales to be segmented.
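To make the subregion argument concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation: the boxes are hand-picked rather than predicted by a recurrent attention network, a simple threshold stands in for the learned segmentation of the dominant object, and the names `crop_resample`, `foreground_fraction`, and `segment_instances` are hypothetical. It only illustrates the loop of attending to one region at a time, segmenting at a fixed patch resolution, and writing the result back to the full image.

```python
def crop_resample(image, box, size):
    """Nearest-neighbour crop of box = (top, left, height, width) to size x size."""
    top, left, h, w = box
    return [
        [image[top + (r * h) // size][left + (c * w) // size] for c in range(size)]
        for r in range(size)
    ]


def foreground_fraction(grid):
    """Share of non-zero cells in a 2-D list."""
    cells = [v for row in grid for v in row]
    return sum(1 for v in cells if v) / len(cells)


def segment_instances(image, boxes, size=8):
    """Give each box's object its own instance id on a full-size canvas."""
    canvas = [[0] * len(image[0]) for _ in image]
    # One box per step; in the paper the boxes come from the recurrent model.
    for instance_id, box in enumerate(boxes, start=1):
        patch = crop_resample(image, box, size)
        # Assumption: a threshold replaces the learned segmentation network.
        mask = [[1 if v else 0 for v in row] for row in patch]
        top, left, h, w = box
        for y in range(h):
            for x in range(w):
                # Map each full-resolution pixel back to its patch cell.
                hit = mask[(y * size) // h][(x * size) // w]
                # Earlier instances keep pixels they have already claimed.
                if hit and canvas[top + y][left + x] == 0:
                    canvas[top + y][left + x] = instance_id
    return canvas


# 24x24 binary image with a small (2x2) and a large (8x8) object.
image = [[0] * 24 for _ in range(24)]
for y in range(1, 3):
    for x in range(1, 3):
        image[y][x] = 1
for y in range(8, 16):
    for x in range(8, 16):
        image[y][x] = 1

boxes = [(0, 0, 4, 4), (4, 4, 16, 16)]  # hand-picked (top, left, height, width)
canvas = segment_instances(image, boxes)
small_frac = foreground_fraction(crop_resample(image, boxes[0], 8))
large_frac = foreground_fraction(crop_resample(image, boxes[1], 8))
```

In the full image the small object covers 4 of 576 pixels and the large one 64 of 576, yet after cropping each box to the same 8x8 patch both fill the same share of their patch, so the segmentation step sees objects at a comparable scale.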
Here is a video that demos the method described in the paper:
https://youtu.be/BMVDhTjEfBU