Fully Convolutional Networks for Semantic Segmentation on ShortScience.org

Fully Convolutional Networks for Semantic Segmentation
Jonathan Long and Evan Shelhamer and Trevor Darrell
arXiv e-Print archive - 2014 via Local arXiv
Keywords: cs.CV

Summary by Martin Thoma 7 years ago

## Terms

* Semantic Segmentation: Traditional segmentation divides the image into visually similar patches. Semantic segmentation, on the other hand, divides the image into semantically meaningful patches. This usually means classifying each pixel (e.g. this pixel belongs to a cat, that pixel belongs to a dog, the other pixel is background).
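
As a minimal sketch of what "classify each pixel" means, the snippet below turns a map of per-pixel class scores into a label map by taking the highest-scoring class at every position. The class names, the score values, and the helper `segment` are illustrative assumptions, not something taken from the paper.

```python
# Hypothetical class list and per-pixel scores for a 2 x 3 image (illustrative values).
CLASSES = ["background", "cat", "dog"]

# scores[row][col] holds one score per class for that pixel.
scores = [
    [[0.9, 0.05, 0.05], [0.2, 0.7, 0.1], [0.1, 0.8, 0.1]],
    [[0.8, 0.1, 0.1], [0.1, 0.2, 0.7], [0.3, 0.3, 0.4]],
]


def segment(score_map):
    """Return a label map: the index of the highest-scoring class at each pixel."""
    return [
        [max(range(len(pixel)), key=pixel.__getitem__) for pixel in row]
        for row in score_map
    ]


label_map = segment(scores)
named = [[CLASSES[index] for index in row] for row in label_map]
```

The output has the same height and width as the score map, with one class label per pixel.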


## Main ideas

* A complete neural network that was trained for image classification can be applied to a larger image as a single convolution. Such networks (e.g. VGG, AlexNet, GoogLeNet) can be pre-trained on ImageNet.
* Use upsampling to (1) reduce training and prediction time and (2) improve the consistency of the output. (See [What are deconvolutional layers?](http://datascience.stackexchange.com/a/12110/8820) for an explanation.)
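
To make the upsampling idea concrete, here is a rough one-dimensional sketch of a transposed convolution ("deconvolution") with stride 2. The function name `transposed_conv1d` and the input values are made up for illustration, and the kernel is fixed to linear interpolation weights rather than learned; this is an assumption chosen to keep the example readable, not the paper's implementation.

```python
def transposed_conv1d(x, kernel, stride):
    """Scatter each input value, scaled by the kernel, into a longer output."""
    out = [0.0] * ((len(x) - 1) * stride + len(kernel))
    for i, value in enumerate(x):
        for j, weight in enumerate(kernel):
            out[i * stride + j] += value * weight
    return out


coarse = [1.0, 3.0, 5.0]           # coarse output of a strided network (illustrative)
linear_kernel = [0.5, 1.0, 0.5]    # linear interpolation weights for stride 2
dense = transposed_conv1d(coarse, linear_kernel, stride=2)
```

Away from the borders, the dense output interpolates linearly between neighbouring coarse values. In a network the kernel weights could be learned instead of fixed.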


## How FCNs work

1. Train a neural network for image classification on input images of a fixed size ($d \times w \times h$).
2. Interpret the network as a single convolutional filter for each output neuron (so $k$ output neurons give $k$ filters), where each filter covers the complete image area on which the original network was trained.
3. Run the network as a CNN over an image of any size (but at least $d \times w \times h$) with a stride $s \in \mathbb{N}_{\geq 1}$.
4. If $s > 1$, an upsampling layer (deconvolutional layer) is needed to convert the coarse output into a dense output.
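
The steps above can be sketched with a toy example. It assumes a single-channel image ($d = 1$) and a "network" that is just one linear layer with $k = 2$ outputs over a $2 \times 2$ window; the names `classify_window` and `run_as_filter` and all weights are hypothetical. The point is only that the same fixed-size classifier yields one score vector on a training-sized input and a coarse grid of score vectors on a larger one.

```python
def classify_window(window, weights, biases):
    """Fixed-size classifier: one weight per window pixel for each of the k outputs."""
    scores = []
    for class_weights, bias in zip(weights, biases):
        total = bias
        for weight_row, pixel_row in zip(class_weights, window):
            for weight, pixel in zip(weight_row, pixel_row):
                total += weight * pixel
        scores.append(total)
    return scores


def run_as_filter(image, weights, biases, stride):
    """Slide the classifier over the image; returns a grid of k scores per position."""
    win_h, win_w = len(weights[0]), len(weights[0][0])
    rows = range(0, len(image) - win_h + 1, stride)
    cols = range(0, len(image[0]) - win_w + 1, stride)
    return [
        [
            classify_window([row[x:x + win_w] for row in image[y:y + win_h]], weights, biases)
            for x in cols
        ]
        for y in rows
    ]


# k = 2 hypothetical filters, each the size of the training window (2 x 2).
weights = [
    [[1, 1], [1, 1]],
    [[1, 0], [0, -1]],
]
biases = [0, 0]

training_sized = [[0, 1], [6, 7]]                                   # 2 x 2 input
image = [[col + 6 * row for col in range(6)] for row in range(4)]   # 4 x 6 input

single = run_as_filter(training_sized, weights, biases, stride=2)   # 1 x 1 grid
coarse = run_as_filter(image, weights, biases, stride=2)            # 2 x 3 grid
```

With stride 2 the output grid is coarser than the input, which is why step 4 calls for an upsampling layer.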

## Nice properties

* FCNs take images of arbitrary size and produce an output of corresponding spatial size.
* Computationally efficient: overlapping windows share their computation instead of being classified patch by patch.

## See also:

https://www.quora.com/What-are-the-benefits-of-converting-a-fully-connected-layer-in-a-deep-neural-network-to-an-equivalent-convolutional-layer

> They allow you to treat the convolutional neural network as one giant filter. You can then spatially apply the neural net as a convolution to images larger than the original training image size, getting a spatially dense output.
>
> Let's say you train a neural net (with some loss function) with a convolutional layer (3 x 3, stride of 2), pooling layer (3 x 3, stride of 2), and a fully connected layer with 10 units, using 25 x 25 images. Note that the receptive field size of each max pooling unit is 7 x 7, so the pooling output is 5 x 5. You can convert the fully connected layer to a set of 10 5 x 5 convolutional filters (unit strides). If you do that, the entire net can be treated as a filter with receptive field size 35 x 35 and stride of 4. You can then take that net and apply it to a 50 x 50 image, and you'd get a 3 x 3 x 10 spatially dense output.
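
The arithmetic in the quoted example can be checked with a short sketch. It assumes no padding and floor rounding for every layer, which the quote does not state; the helper names are made up for illustration. Under these assumptions the code reproduces the 7 x 7 pooling receptive field and the 5 x 5 pooling output, but it gives an overall receptive field of 23 x 23 (stride 4) and a 7 x 7 x 10 output for a 50 x 50 image, rather than the quoted 35 x 35 and 3 x 3 x 10. Other padding or rounding conventions would change these numbers.

```python
def out_size(n, kernel, stride):
    """Output length of a convolution or pooling layer (no padding, floor rounding)."""
    return (n - kernel) // stride + 1


# (kernel, stride) for the conv layer, the pooling layer, and the
# fully connected layer viewed as a 5 x 5 convolution with unit stride.
LAYERS = [(3, 2), (3, 2), (5, 1)]


def forward_size(n, layers=LAYERS):
    """Spatial output size after passing an n x n input through all layers."""
    for kernel, stride in layers:
        n = out_size(n, kernel, stride)
    return n


def receptive_field(layers=LAYERS):
    """Return (receptive field, effective stride) of one output unit in input pixels."""
    field, jump = 1, 1
    for kernel, stride in layers:
        field += (kernel - 1) * jump
        jump *= stride
    return field, jump


pool_field = receptive_field(LAYERS[:2])   # receptive field of one pooling unit
net_field = receptive_field()              # receptive field of one converted FC unit
train_out = forward_size(25)               # output size on the 25 x 25 training image
large_out = forward_size(50)               # output size on a 50 x 50 image
```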

Great summary! A figure would be helpful for the "How FCNs work" section to show the overlap.
