This paper introduces a CNN based segmentation of an object that is defined by a user using four extreme points (i.e. bounding box). Interestingly, in a related work, it has been shown that clicking extreme points is about 5 times more efficient than drawing a bounding box in terms of speed. https://i.imgur.com/9GJvf17.png The extreme points have several goals in this work. First, they are used as a bounding box to crop the object of interest. Secondly, they are utilized to create a heatmap with activations in the regions of extreme points. The heatmap is created as a 2D Gaussian centered around each of the extreme points. This heatmap is matched to the size of the resized crop (i.e. 512x512) and is concatenated with the original RGB channels of the crop. The concatenated input of channel depth=4 is fed to the network which is a ResNet-101 with FC and last two maxpool layers removed. In order to maintain the same receptive field, an astrous convolution is used. Pyramid scene parsing module from PSPNet is used to aggregate global context. The network is trained with a standard cross-entropy loss weighted by a normalization factor (i.e. a frequency of a class in a dataset). How does it compare to "Efficient Interactive Annotation of Segmentation Datasets with Polygon-RNN++ " paper in terms of accuracy? Specifically, if the polygon is wrong it is easy to correct points on the polygon that are wrong. However, it is unclear how to obtain preferred segmentation when no matter how many (greater than four) extreme points are selected, the object of interest is not segmented properly.