* [Detailed Summary](https://blog.heuritech.com/2017/04/11/beganstateoftheartgenerationoffaceswithgenerativeadversarialnetworks/)
* [Tensorflow implementation](https://github.com/carpedm20/BEGANtensorflow)

### Summary

* They suggest a GAN algorithm that is based on an autoencoder with Wasserstein distance.
* Their method generates highly realistic human faces.
* Their method has a convergence measure, which reflects the quality of the generated images.
* Their method has a diversity hyperparameter, which can be used to set the tradeoff between image diversity and image quality.

### How

* Like other GANs, their method uses a generator G and a discriminator D.
* Generator
  * The generator is fairly standard.
  * It gets a noise vector `z` as input and uses upsampling+convolutions to generate images.
  * It uses ELUs and no BN.
* Discriminator
  * The discriminator is a full autoencoder (i.e. it converts input images to `8x8x3` tensors, then reconstructs them back to images).
  * It has skip-connections from the `8x8x3` layer to each upsampling layer.
  * It also uses ELUs and no BN.
* Their method now has the following steps:
  1. Collect real images `x_real`.
  2. Generate fake images `x_fake = G(z)`.
  3. Reconstruct the real images `r_real = D(x_real)`.
  4. Reconstruct the fake images `r_fake = D(x_fake)`.
  5. Using an Lp-Norm (e.g. L1-Norm), compute the reconstruction loss of real images `d_real = Lp(x_real, r_real)`.
  6. Using an Lp-Norm (e.g. L1-Norm), compute the reconstruction loss of fake images `d_fake = Lp(x_fake, r_fake)`.
  7. The loss of D is now `L_D = d_real - d_fake`.
  8. The loss of G is now `L_G = -L_D`.
* About the loss
  * `d_real` and `d_fake` are really losses (e.g. L1-loss or L2-loss). In the paper they use `L(...)` for that. Here they are referenced as `d_*` in order to avoid confusion with the reconstructions `r_*`.
  * The loss `L_D` is based on the Wasserstein distance, as in WGAN.
  * `L_D` assumes that the losses `d_real` and `d_fake` are normally distributed and tries to move their mean values.
    Ideally, the discriminator produces very different means for real/fake images, while the generator leads to very similar means.
  * Their formulation of the Wasserstein distance does not require K-Lipschitz functions, which is why they don't have the weight clipping from WGAN.
* Equilibrium
  * The generator and discriminator are at equilibrium, if `E[d_fake] = E[d_real]`. (That's undesirable, because it means that D can't differentiate between fake and real images, i.e. G doesn't get a proper gradient any more.)
  * Let `g = E[d_fake] / E[d_real]`, then:
    * Low `g` means that `E[d_fake]` is low and/or `E[d_real]` is high, which means that real images are not as well reconstructed as fake images. This means that the discriminator will be more heavily trained towards reconstructing real images correctly (as that is the main source of error).
    * High `g` conversely means that real images are well reconstructed (compared to fake ones) and that the discriminator will be trained more towards fake ones.
  * `g` gives information about how much G and D should be trained each (so that neither of the two overwhelms the other).
  * They introduce a hyperparameter `gamma` (from interval `[0,1]`), which reflects the target value of the balance `g`.
  * Using `gamma`, they change their losses `L_D` and `L_G` slightly:
    * `L_D = d_real - k_t d_fake`
    * `L_G = d_fake`
    * `k_{t+1} = k_t + lambda_k (gamma d_real - d_fake)`.
  * `k_t` is a control term that controls how much D is supposed to focus on the fake images. It changes with every batch.
  * `k_t` is clipped to `[0,1]` and initialized at `0` (max focus on reconstructing real images).
  * `lambda_k` is like the learning rate of the control term, set to `0.001`.
  * Note that `gamma d_real - d_fake = 0 <=> gamma d_real = d_fake <=> gamma = d_fake / d_real`.
* Convergence measure
  * They measure the convergence of their model using `M`:
    * `M = d_real + |gamma d_real - d_fake|`
    * `M` goes down, if `d_real` goes down (D becomes better at autoencoding real images).
    * `M` goes down, if the difference in reconstruction error between real and fake images goes down, i.e. if G becomes better at generating fake images.
* Other
  * They use Adam with learning rate 0.0001. They decrease it by a factor of 2 whenever M stalls.
  * Higher initial learning rate could lead to mode collapse or visual artifacts.
  * They generate images of max size 128x128.
  * They don't use more than 128 filters per conv layer.

### Results

* NOTES:
  * Below example images are NOT from generators trained on CelebA. They used a custom dataset of celebrity images. They don't show any example images from the dataset. The generated images look like there is less background around the faces, making the task easier.
  * Few example images. Unclear how much cherry picking was involved. Though the results from the tensorflow example (see link at top) make it look like the examples are representative (aside from speckle-artifacts).
  * No LSUN Bedrooms examples. Human faces are comparatively easy to generate.
* Example images at 128x128:
  * ![Examples](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/BEGAN__examples.jpg?raw=true "Examples")
* Effect of changing the target balance `gamma`:
  * ![Examples gamma](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/BEGAN__examples_gamma.jpg?raw=true "Examples gamma")
  * High gamma leads to more diversity at lower quality.
* Interpolations:
  * ![Interpolations](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/BEGAN__interpolations.jpg?raw=true "Interpolations")
* Convergence measure `M` and associated image quality during the training:
  * ![M](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/BEGAN__convergence.jpg?raw=true "M")
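
The BEGAN loss bookkeeping described under "How" can be sketched in a few lines. This is only an illustration on scalar batch losses, assuming `L_D = d_real - k_t * d_fake`, `L_G = d_fake`, the clipped update of `k_t` and `M = d_real + |gamma * d_real - d_fake|`. The function name `began_step` and the default `gamma=0.5` are assumptions made for the example, not values from the notes.

```python
def began_step(d_real, d_fake, k_t, gamma=0.5, lambda_k=0.001):
    """Compute the BEGAN losses, the next control term and the convergence measure.

    d_real / d_fake are the scalar reconstruction losses of one batch.
    gamma=0.5 is an illustrative default (assumption), lambda_k=0.001 is from the notes.
    """
    loss_d = d_real - k_t * d_fake      # discriminator loss L_D
    loss_g = d_fake                     # generator loss L_G
    balance = gamma * d_real - d_fake   # zero when d_fake / d_real == gamma
    # k_t behaves like a slowly adapting weight and is clipped to [0, 1].
    k_next = min(max(k_t + lambda_k * balance, 0.0), 1.0)
    m = d_real + abs(balance)           # convergence measure M
    return loss_d, loss_g, k_next, m


if __name__ == "__main__":
    k = 0.0  # k_t is initialized at 0
    for d_real, d_fake in [(0.20, 0.05), (0.15, 0.06), (0.10, 0.05)]:
        loss_d, loss_g, k, m = began_step(d_real, d_fake, k)
        print(f"L_D={loss_d:.4f} L_G={loss_g:.4f} k={k:.6f} M={m:.4f}")
```

In a real training loop `d_real` and `d_fake` would come from the autoencoder's reconstruction errors, and only `loss_d` and `loss_g` would be backpropagated. `k_t` and `M` are plain bookkeeping values.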
### Summary

* They propose a two-stage GAN architecture that generates 256x256 images of (relatively) high quality.
* The model gets text as an additional input and the images match the text.

### How

* Most of the architecture is the same as in any GAN:
  * Generator G generates images.
  * Discriminator D discriminates between fake and real images.
  * G gets a noise variable `z`, so that it doesn't always do the same thing.
* Two-staged image generation:
  * Instead of one step, as in most GANs, they use two steps, each consisting of a G and D.
  * The first generator creates 64x64 images via upsampling.
  * The first discriminator judges these images via downsampling convolutions.
  * The second generator takes the image from the first generator, downsamples it via convolutions, then applies some residual convolutions and then re-upsamples it to 256x256.
  * The second discriminator is comparable to the first one (downsampling convolutions).
  * Note that the second generator does not get an additional noise term `z`, only the first one gets it.
  * For upsampling, they use 3x3 convolutions with ReLUs, BN and nearest neighbour upsampling.
  * For downsampling, they use 4x4 convolutions with stride 2, Leaky ReLUs and BN (the first convolution doesn't seem to use BN).
* Text embedding:
  * The generated images are supposed to match input texts.
  * These input texts are embedded to vectors.
  * These vectors are added as:
    1. An additional input to the first generator.
    2. An additional input to the second generator (concatenated after the downsampling and before the residual convolutions).
    3. An additional input to the first discriminator (concatenated after the downsampling).
    4. An additional input to the second discriminator (concatenated after the downsampling).
  * In case the text embeddings need to be matrices, the values are simply reshaped to `(N, 1, 1)` and then repeated to `(N, H, W)`.
  * The texts are converted to embeddings via a network at the start of the model.
    * Input to that network: Unclear.
      (Concatenated word vectors? Seems to not be described in the text.)
    * The input is transformed to a vector via a fully connected layer (the text model is apparently not recurrent).
    * The vector is transformed via fully connected layers to a mean vector and a sigma vector.
    * These are then interpreted as normal distributions, from which the final output vector is sampled. This uses the reparameterization trick, similar to the method in VAEs.
    * Just like in VAEs, a KL-divergence term is added to the loss, which prevents each single normal distribution from deviating too far from the unit normal distribution `N(0,1)`.
    * The authors argue that using the VAE-like formulation (instead of directly predicting an output vector via FC layers) compensated for the lack of labels (smoother manifold).
    * Note: This way of generating text embeddings seems very simple. (No recurrence, only about two layers.) It probably won't do much more than just roughly checking for the existence of specific words and word combinations (e.g. "red head").
* Visualization of the architecture:
  * ![Architecture](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/StackGAN__architecture.jpg?raw=true "Architecture")

### Results

* Note: No example images of the two-stage architecture for LSUN bedrooms.
* Using only the first stage of the architecture (first G and D) reduces the Inception score significantly.
* Adding the text to both the first and second generator improves the Inception score slightly.
* Adding the VAE-like text embedding generation (as opposed to only FC layers) improves the Inception score slightly.
* Generating images at higher resolution (256x256 instead of 128x128) improves the Inception score significantly.
  * Note: The 256x256 architecture has more residual convolutions than the 128x128 one.
  * Note: The 128x128 and the 256x256 are both upscaled to 299x299 images before computing the Inception score.
    That should make the 128x128 images quite blurry and hence of low quality.
* Example images, with text and stage 1/2 results:
  * ![Examples](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/StackGAN__examples.jpg?raw=true "Examples")
* More examples of birds:
  * ![Examples birds](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/StackGAN__examples_birds.jpg?raw=true "Examples birds")
* Examples of failures:
  * ![Failure Cases](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/StackGAN__failures.jpg?raw=true "Failure Cases")
  * The authors argue that most failure cases happen when stage 1 messes up.
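
The VAE-like text embedding step (mean vector, sigma vector, sampling via the reparameterization trick, KL term towards `N(0,1)`) can be sketched as follows. This is a minimal pure-Python illustration, not the authors' implementation. The function names are hypothetical and `sigma` is assumed to hold positive standard deviations.

```python
import math
import random


def sample_embedding(mu, sigma, rng=random):
    """Reparameterization trick: c = mu + sigma * eps with eps ~ N(0, 1).

    The randomness lives only in eps, so gradients could flow through mu and sigma.
    """
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def kl_to_unit_normal(mu, sigma):
    """KL(N(mu, sigma^2) || N(0, 1)), summed over independent dimensions.

    Assumes every sigma is strictly positive (standard deviation, not variance).
    """
    return 0.5 * sum(s * s + m * m - 1.0 - math.log(s * s) for m, s in zip(mu, sigma))


if __name__ == "__main__":
    mu = [0.3, -0.1, 0.8]      # made-up outputs of the "mean" FC layer
    sigma = [1.0, 0.5, 1.5]    # made-up outputs of the "sigma" FC layer
    print(sample_embedding(mu, sigma, random.Random(0)))
    print(kl_to_unit_normal(mu, sigma))
```

The KL term is zero exactly when `mu = 0` and `sigma = 1` in every dimension, which is what pulls the per-text distributions towards the unit normal distribution.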
https://github.com/bioinfjku/SNNs

### Summary

* They suggest a variation of ELUs, which leads to networks being automatically normalized.
* The effects are comparable to Batch Normalization, while requiring significantly less computation (barely more than a normal ReLU).

### How

* They define Self-Normalizing Neural Networks (SNNs) as neural networks which automatically keep their activations at zero-mean and unit-variance (per neuron).
* SELUs
  * They use SELUs to turn their networks into SNNs.
  * Formula:
    * ![SELU](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/SelfNormalizing_Neural_Networks__SELU.jpg?raw=true "SELU")
      * with `alpha = 1.6733` and `lambda = 1.0507`.
  * They prove that with properly normalized weights the activations approach a fixed point of zero-mean and unit-variance. (Different settings for alpha and lambda can lead to other fixed points.)
    * They prove that this is still the case when previous layer activations and weights do not have optimal values.
    * They prove that this is still the case when the variance of previous layer activations is very high or very low and argue that the mean of those activations is not so important.
  * Hence, SELUs with these hyperparameters should have self-normalizing properties.
  * SELUs are here used as a basis because:
    1. They can have negative and positive values, which allows controlling the mean.
    2. They have saturating regions, which allows dampening high variances from previous layers.
    3. They have a slope larger than one, which allows increasing low variances from previous layers.
    4. They generate a continuous curve, which ensures that there is a fixed point between variance damping and increasing.
  * ReLUs, Leaky ReLUs, Sigmoids and Tanhs do not offer the above properties.
* Initialization
  * SELUs for SNNs work best with normalized weights.
  * They suggest making sure per layer that:
    1. The first moment (sum of weights) is zero.
    2. The second moment (sum of squared weights) is one.
  * This can be done by drawing weights from a normal distribution `N(0, 1/n)`, where `n` is the number of incoming weights per neuron (i.e. the size of the previous layer).
* Alpha-dropout
  * SELUs don't perform as well with normal Dropout, because their point of low variance is not 0.
  * They suggest a modification of Dropout called Alpha-dropout.
  * In this technique, values are not dropped to 0 but to `alpha' = -lambda * alpha = -1.0507 * 1.6733 = -1.7581`.
  * Similar to dropout, activations are changed during training to compensate for the dropped units.
  * Each activation `x` is changed to `a(xd + alpha'(1-d)) + b`.
    * `d = B(1, q)` is the dropout variable consisting of 1s and 0s.
    * `a = (q + alpha'^2 q(1-q))^(-1/2)`
    * `b = -(q + alpha'^2 q(1-q))^(-1/2) ((1-q)alpha')`
  * They report good results with dropout rates around 0.05 to 0.1.

### Results

* Note: All of their tests are with fully connected networks. No convolutions.
* Example training results:
  * ![MNIST CIFAR10](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/SelfNormalizing_Neural_Networks__MNIST_CIFAR10.jpg?raw=true "MNIST CIFAR10")
  * Left: MNIST, Right: CIFAR10
  * Networks have N layers each, see legend. No convolutions.
* 121 UCI Tasks
  * They manage to beat SVMs and RandomForests, while other networks (Layer Normalization, BN, Weight Normalization, Highway Networks, ResNet) perform significantly worse than their network (and usually don't beat SVMs/RFs).
* Tox21
  * They achieve better results than other networks (again, Layer Normalization, BN, etc.).
  * They achieve almost the same result as the so far best model on the dataset, which consists of a mixture of neural networks, SVMs and Random Forests.
* HTRU2
  * They achieve better results than other networks.
  * They beat the best non-neural method (Naive Bayes).
* Among all tested other networks, MSRAinit performs best, which refers to a network without any normalization, only ReLUs and Microsoft Weight Initialization (see paper: `Delving deep into rectifiers: Surpassing human-level performance on imagenet classification`).
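
The SELU activation and the alpha-dropout correction can be sketched in plain Python. The sketch assumes the constants quoted in the notes (`alpha = 1.6733`, `lambda = 1.0507`), a keep probability `q`, and the affine correction `a(xd + alpha'(1-d)) + b` with `alpha' = -lambda * alpha`. The function names are hypothetical and the code works on single scalars for readability.

```python
import math

ALPHA = 1.6733
LAMBDA = 1.0507
ALPHA_PRIME = -LAMBDA * ALPHA  # value that dropped units are set to, about -1.7581


def selu(x):
    """Scaled ELU: lambda * x for x > 0, lambda * alpha * (exp(x) - 1) otherwise."""
    if x > 0:
        return LAMBDA * x
    return LAMBDA * ALPHA * (math.exp(x) - 1.0)


def alpha_dropout_affine(q):
    """Return (a, b) that restore zero mean and unit variance for keep probability q."""
    a = (q + ALPHA_PRIME ** 2 * q * (1.0 - q)) ** -0.5
    b = -a * (1.0 - q) * ALPHA_PRIME
    return a, b


def alpha_dropout(x, d, q):
    """Apply alpha-dropout to one activation x; d is 1 (keep) or 0 (drop)."""
    a, b = alpha_dropout_affine(q)
    return a * (x * d + ALPHA_PRIME * (1 - d)) + b


if __name__ == "__main__":
    print(selu(1.0), selu(-1.0))
    print(alpha_dropout(0.7, 1, 0.95), alpha_dropout(0.7, 0, 0.95))
```

For very negative inputs `selu` saturates at `-lambda * alpha`, which is why dropped units are set to `alpha'` instead of `0`.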
### Summary

* They suggest a slightly altered algorithm for GANs.
* The new algorithm is more stable than previous ones.

### How

* Each GAN contains a Generator that generates (fake-)examples and a Discriminator that discriminates between fake and real examples.
* Both fake and real examples can be interpreted as coming from a probability distribution.
* The basis of each GAN algorithm is to somehow measure the difference between these probability distributions and change the network parameters of G so that the fake-distribution becomes more and more similar to the real distribution.
* There are multiple distance measures to do that:
  * Total Variation (TV)
  * KL-Divergence (KL)
  * Jensen-Shannon divergence (JS)
    * This one is based on the KL-Divergence and is the basis of the original GAN, as well as LAPGAN and DCGAN.
  * Earth-Mover distance (EM), aka Wasserstein-1
    * Intuitively, one can imagine both probability distributions as hilly surfaces. EM then reflects how much mass has to be moved to convert the fake distribution to the real one.
* Ideally, a distance measure has everywhere nice values and gradients (e.g. no +/- infinity values; no binary 0 or 1 gradients; gradients that get continuously smaller when the generator produces good outputs).
* In that regard, EM beats JS and JS beats TV and KL (roughly speaking). So they use EM.
* EM
  * EM is defined as
    * ![EM](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/WGAN__EM.jpg?raw=true "EM")
    * (inf = infimum, more or less a minimum)
  * which is intractable, but following the Kantorovich-Rubinstein duality it can also be calculated via
    * ![EM tractable](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/WGAN__EM_tractable.jpg?raw=true "EM tractable")
    * (sup = supremum, more or less a maximum)
  * However, the second formula is here only valid if the network is a K-Lipschitz function (under every set of parameters).
    * This can be guaranteed by simply clipping the discriminator's weights to the range `[-0.01, 0.01]`.
  * Then in practice the following version of the tractable EM is used, where `w` are the parameters of the discriminator:
    * ![EM tractable in practice](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/WGAN__EM_tractable_practice.jpg?raw=true "EM tractable in practice")
* The full algorithm is mostly the same as for DCGAN:
  * ![Algorithm](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/WGAN__algorithm.jpg?raw=true "Algorithm")
  * Line 2 leads to training the discriminator multiple times per batch (i.e. more often than the generator).
    * This is similar to the `max w in W` in the third formula (above).
    * This was already part of the original GAN algorithm, but is here more actively used.
    * Because of the EM distance, even a "perfect" discriminator still gives good gradients (in contrast to e.g. JS, where the discriminator should not be too far ahead). So the discriminator can be safely trained more often than the generator.
  * Lines 5 and 10 are derived from EM. Note that there is no more Sigmoid at the end of the discriminator!
  * Line 7 is derived from the K-Lipschitz requirement (clipping of weights).
  * High learning rates or using momentum-based optimizers (e.g. Adam) made the training unstable, which is why they use a small learning rate with RMSprop.

### Results

* Improved stability. The method converges to decent images with models which failed completely when using JS-divergence (like in DCGAN).
  * For example, WGAN worked with generators that did not have batch normalization or only consisted of fully connected layers.
* Apparently no more mode collapse. (Mode collapse in GANs = the generator starts to generate often/always practically the same image, independent of the noise input.)
* There is a relationship between loss and image quality. Lower loss (at the generator) indicates higher image quality.
  Such a relationship did not exist for JS divergence.
* Example images:
  * ![Example images](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/WGAN__examples.jpg?raw=true "Example images")
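
The WGAN training signals (no sigmoid, difference of mean scores, weight clipping) can be sketched in plain Python. This is a minimal illustration that assumes the discriminator outputs raw scalar scores and that its weights are available as a flat list. The function names are hypothetical, and a real implementation would operate on framework tensors.

```python
def mean(values):
    return sum(values) / len(values)


def critic_loss(scores_real, scores_fake):
    """Loss to minimize for the discriminator: -(E[f(x_real)] - E[f(x_fake)]).

    Minimizing it maximizes the estimated EM distance between real and fake.
    """
    return -(mean(scores_real) - mean(scores_fake))


def generator_loss(scores_fake):
    """Loss to minimize for the generator: -E[f(G(z))]."""
    return -mean(scores_fake)


def clip_weights(weights, c=0.01):
    """Clamp every weight to [-c, c] to keep the network (roughly) K-Lipschitz."""
    return [max(-c, min(c, w)) for w in weights]


if __name__ == "__main__":
    print(critic_loss([1.0, 3.0], [0.0, 1.0]))
    print(generator_loss([0.0, 1.0]))
    print(clip_weights([-1.0, 0.005, 0.3]))
```

The clipping step would be applied after every discriminator update, and the discriminator would be updated several times per generator update.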
### Summary

* They suggest a new version of YOLO, a model to detect bounding boxes in images.
* Their new version is more accurate, faster and is trained to recognize up to 9000 classes.

### How

* Their base model is the previous YOLOv1, which they improve here.
* Accuracy improvements
  * They add batch normalization to the network.
  * Pretraining usually happens on ImageNet at 224x224, fine tuning for bounding box detection then on another dataset, say Pascal VOC 2012, at higher resolutions, e.g. 448x448 in the case of YOLOv1. This is problematic, because the pretrained network has to learn to deal with higher resolutions and a new task at the same time. They instead first pretrain on low resolution ImageNet examples, then on higher resolution ImageNet examples and only then switch to bounding box detection. That improves their accuracy by about 4 percentage points mAP.
  * They switch to anchor boxes, similar to Faster R-CNN. That's largely the same as in YOLOv1. Classification is now done per tested anchor box shape, instead of per grid cell. The regression of x/y-coordinates is now a bit smarter and uses sigmoids to only translate a box within a grid cell.
  * In Faster R-CNN the anchor box shapes are manually chosen (e.g. small squared boxes, large squared boxes, thin but high boxes, ...). Here instead they learn these shapes from data. That is done by applying k-Means to the bounding boxes in a dataset. They cluster them into k=5 clusters and then use the centroids as anchor box shapes. Their accuracy this way is the same as with 9 manually chosen anchor boxes. (Using k=9 further increases their accuracy significantly, but also increases model complexity. As they want to predict 9000 classes they stay with k=5.)
  * To better predict small bounding boxes, they add a pass-through connection from a higher resolution layer to the end of the network.
  * They train their network now at multiple scales. (As the network is now fully convolutional, they can easily do that.)
* Speed improvements
  * They get rid of their fully connected layers. Instead the network is now fully convolutional.
  * They have also removed a handful or so of their convolutional layers.
* Capability improvement (weakly supervised learning)
  * They suggest a method to predict bounding boxes of the 9000 most common classes in ImageNet. They add a few more abstract classes to that (e.g. dog for all breeds of dogs) and arrive at over 9000 classes (9418 to be precise).
  * They train on ImageNet and MSCOCO.
  * ImageNet only contains class labels, no bounding boxes. MSCOCO only contains general classes (e.g. "dog" instead of the specific breed).
  * They train iteratively on both datasets. MSCOCO is used for detection and classification, while ImageNet is only used for classification. For an ImageNet example of class `c`, they search among the predicted bounding boxes for the one that has highest predicted probability of being `c` and backpropagate only the classification loss for that box.
  * In order to compensate for the different abstraction levels of the classes (e.g. "dog" vs a specific breed), they make use of WordNet. Based on that data they generate a hierarchy/tree of classes, e.g. one path through that tree could be: object > animal > canine > dog > hunting dog > terrier > yorkshire terrier. They let the network predict paths in that hierarchy, so that the prediction "dog" for a specific dog breed is not completely wrong.
  * Visualization of the hierarchy:
    * ![YOLO9000 hierarchy](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/YOLO9000__hierarchy.jpg?raw=true "YOLO9000 hierarchy")
  * They predict many small softmaxes for the paths in the hierarchy, one per node:
    * ![YOLO9000 softmaxes](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/YOLO9000__softmaxes.jpg?raw=true "YOLO9000 softmaxes")

### Results

* Accuracy
  * They reach about 73.4 mAP when training on Pascal VOC 2007 and 2012.
    That's slightly behind Faster R-CNN with VGG16 with 75.9 mAP, trained on MSCOCO+2007+2012.
* Speed
  * They reach 91 fps (10ms/image) at image resolution 288x288 and 40 fps (25ms/image) at 544x544.
* Weakly supervised learning
  * They test their 9000-class-detection on ImageNet's detection task, which contains bounding boxes for 200 object classes.
  * They achieve 19.7% mAP for all classes and 16.0% mAP for the 156 classes which are not part of MSCOCO.
  * For some classes they get 0 mAP accuracy.
  * The system performs well for all kinds of animals, but struggles with non-living objects, like sunglasses.
* Example images (notice the class labels):
  * ![YOLO9000 examples](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/YOLO9000__examples.jpg?raw=true "YOLO9000 examples")
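
The idea of predicting paths in a class hierarchy with several small softmaxes can be sketched as follows. The tiny tree, the logits and the function names below are made up for illustration. The sketch assumes that each softmax runs over the children of one node and that the probability of a class is the product of the conditional probabilities along its path to the root.

```python
import math

# Hypothetical mini hierarchy: parent -> children.
CHILDREN = {
    "object": ["animal", "vehicle"],
    "animal": ["dog", "cat"],
    "dog": ["terrier", "poodle"],
}
PARENT_OF = {kid: parent for parent, kids in CHILDREN.items() for kid in kids}

# Made-up network outputs, one logit per non-root node.
LOGITS = {"animal": 2.0, "vehicle": 0.5, "dog": 1.5, "cat": 1.0, "terrier": 0.2, "poodle": 0.9}


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def conditional_probs(children, logits):
    """One small softmax per parent node, taken over that parent's children."""
    probs = {}
    for kids in children.values():
        for kid, p in zip(kids, softmax([logits[k] for k in kids])):
            probs[kid] = p
    return probs


def absolute_prob(node, parent_of, cond):
    """Multiply conditional probabilities along the path from node up to the root."""
    p = 1.0
    while node in parent_of:
        p *= cond[node]
        node = parent_of[node]
    return p


if __name__ == "__main__":
    cond = conditional_probs(CHILDREN, LOGITS)
    for name in ["animal", "dog", "terrier"]:
        print(name, absolute_prob(name, PARENT_OF, cond))
```

With this formulation the probabilities of all breeds below "dog" sum up to the probability of "dog", so predicting "dog" for a terrier is only partially wrong.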
### Summary

* They suggest a model ("YOLO") to detect bounding boxes in images.
* In comparison to Faster R-CNN, this model is faster but less accurate.

### How

* Architecture
  * Input are images with a resolution of 448x448.
  * Output are `S*S*(B*5 + C)` values (per image).
    * `S` is the grid size (default value: 7). Each image is split up into `S*S` cells.
    * `B` is the number of "tested" bounding box shapes at each cell (default value: 2). So at each cell, the network might try one large and one small bounding box. The network predicts additionally for each such tested bounding box `5` values. These cover the exact position (x, y) and scale (height, width) of the bounding box as well as a confidence value. They allow the network to fine tune the bounding box shape and reject it, e.g. if there is no object in the grid cell. The confidence value is zero if there is no object in the grid cell and otherwise matches the IoU between predicted and true bounding box.
    * `C` is the number of classes in the dataset (e.g. 20 in Pascal VOC). For each grid cell, the model decides once to which of the `C` objects the cell belongs.
  * Rough overview of their outputs:
    * ![Method](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/YOLO__method.jpg?raw=true "Method")
  * In contrast to Faster R-CNN, their model does *not* use a separate region proposal network (RPN).
  * Per bounding box they actually predict the *square root* of height and width instead of the raw values. That way an absolute deviation of the same size is penalized more for small bounding boxes than for big ones.
  * They use a total of 24 convolutional layers and 2 fully connected layers.
    * Some of these convolutional layers are 1x1-convs that halve the number of channels (followed by 3x3s that double them again).
  * Overview of the architecture:
    * ![Architecture](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/YOLO__architecture.jpg?raw=true "Architecture")
  * They use Leaky ReLUs (alpha=0.1) throughout the network.
    The last layer uses linear activations (apparently even for the class prediction...!?).
  * Similarly to Faster R-CNN, they use a non maximum suppression that drops predicted bounding boxes if they are too similar to other predictions.
* Training
  * They pretrain their network on ImageNet, then finetune on Pascal VOC.
* Loss
  * They use sum-squared losses (apparently even for the classification, i.e. the `C` values).
  * They don't propagate classification loss (for `C`) for grid cells that don't contain an object.
  * For each grid cell they "test" `B` example shapes of bounding boxes (see above). Among these `B` shapes, they only propagate the bounding box losses (regarding x, y, width, height, confidence) for the shape that has highest IoU with a ground truth bounding box.
  * Most grid cells don't contain a bounding box. Their target confidence values are all zero, potentially dominating the total loss. To prevent that, the weighting of the confidence values in the loss function is reduced relative to the regression components (x, y, height, width).

### Results

* The coarse grid and B=2 setting lead to some problems. Namely, small objects are missed and bounding boxes can end up being dropped if they are too close to other bounding boxes.
* The model also has problems with unusual bounding box shapes.
* Overall their accuracy is about 10 percentage points lower than Faster R-CNN with VGG16 (63.4% vs 73.2%, measured in mAP on Pascal VOC 2007).
* They achieve 45fps (22ms/image), compared to 7fps (142ms/image) with Faster R-CNN + VGG16.
* Overview of results on Pascal VOC 2012:
  * ![Results on VOC2012](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/YOLO__results.jpg?raw=true "Results on VOC2012")
* They also suggest a faster variation of their model which reached 145fps (7ms/image) at a further drop of 10 percentage points mAP (to 52.7%).
* A significant part of their error seems to come from badly placed or sized bounding boxes (e.g.
  too wide or too much to the right).
* They mistake background less often for objects than Fast R-CNN. They test combining both models with each other and can improve Fast R-CNN's accuracy by about 2.5 percentage points mAP.
* They test their model on paintings/artwork (Picasso and People-Art datasets) and notice that it generalizes fairly well to that domain.
* Example results (notice the paintings at the top):
  * ![Examples](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/YOLO__examples.jpg?raw=true "Examples")
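
Two of the YOLO details from "How" are easy to check with a few lines of arithmetic: the size of the output `S*S*(B*5 + C)` and the effect of regressing square roots of width and height. The function names are hypothetical and the box sizes are made-up values relative to the image size.

```python
import math


def output_size(s=7, b=2, c=20):
    """Number of predicted values per image: S*S grid cells, B boxes with 5 values, C classes."""
    return s * s * (b * 5 + c)


def size_error(w_pred, h_pred, w_true, h_true):
    """Sum-squared error on the square roots of width and height."""
    return (math.sqrt(w_pred) - math.sqrt(w_true)) ** 2 + (math.sqrt(h_pred) - math.sqrt(h_true)) ** 2


if __name__ == "__main__":
    print(output_size())  # defaults from the notes: S=7, B=2, C=20
    # The same absolute deviation (0.05) on a small and on a large box:
    print(size_error(0.15, 0.15, 0.10, 0.10))
    print(size_error(0.85, 0.85, 0.80, 0.80))
```

With the default values the network predicts `7*7*(2*5 + 20) = 1470` numbers per image, and the small box receives the larger error for the same absolute deviation.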
### Summary

* They present a variation of Faster R-CNN.
  * Faster R-CNN is a model that detects bounding boxes in images.
* Their variation is about as accurate as the best performing versions of Faster R-CNN.
* Their variation is significantly faster than these versions (roughly 50ms per image).

### How

* PVANET reuses the standard Faster R-CNN architecture:
  * A base network that transforms an image into a feature map.
  * A region proposal network (RPN) that uses the feature map to predict bounding box candidates.
  * A classifier that uses the feature map and the bounding box candidates to predict the final bounding boxes.
* PVANET modifies the base network and keeps the RPN and classifier the same.
* Inception
  * Their base network uses eight Inception modules.
  * They argue that these are good choices here, because they are able to represent an image at different scales (aka at different receptive field sizes) due to their mixture of 3x3 and 1x1 convolutions.
    * ![Receptive field sizes in inception modules](images/PVANET__inception_fieldsize.jpg?raw=true "Receptive field sizes in inception modules")
  * Representing an image at different scales is useful here in order to detect both large and small bounding boxes.
  * Inception modules are also reasonably fast.
  * Visualization of their Inception modules:
    * ![Inception modules architecture](images/PVANET__inception_modules.jpg?raw=true "Inception modules architecture")
* Concatenated ReLUs
  * Before the eight Inception modules, they start the network with eight convolutions using concatenated ReLUs.
  * These CReLUs compute both the classic ReLU result (`max(0, x)`) and concatenate to that the ReLU of the negated input, i.e. something like `f(x) = max(0, x <concat> (-1)*x)`.
  * That is done, because among the early layers one can often find pairs of convolution filters that are the negated variations of each other.
    So by adding CReLUs, the network does not have to compute these any more, instead they are created (almost) for free, reducing the computation time by up to 50%.
  * Visualization of their final CReLU block:
    * TODO
    * ![CReLU modules](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/PVANET__crelu.jpg?raw=true "CReLU modules")
* Multi-Scale output
  * Usually one would generate the final feature map simply from the output of the last convolution.
  * They instead combine the outputs of three different convolutions, each representing a different scale (or level of abstraction).
  * They take one from an early point of the network (downscaled), one from the middle part (kept the same) and one from the end (upscaled).
  * They concatenate these and apply a 1x1 convolution to generate the final output.
* Other stuff
  * Most of their network uses residual connections (including the Inception modules) to facilitate learning.
  * They pretrain on ILSVRC2012 and then perform fine-tuning on MSCOCO, VOC 2007 and VOC 2012.
  * They use plateau detection for their learning rate, i.e. if a moving average of the loss does not improve any more, they decrease the learning rate. They say that this increases accuracy significantly.
  * The classifier in Faster R-CNN consists of fully connected layers. They compress these via Truncated SVD to speed things up. (That was already part of Fast R-CNN, I think.)

### Results

* On Pascal VOC 2012 they achieve 82.5% mAP at 46ms/image (Titan X GPU).
  * Faster R-CNN + ResNet-101: 83.8% at 2.2s/image.
  * Faster R-CNN + VGG16: 75.9% at 110ms/image.
  * R-FCN + ResNet-101: 82.0% at 133ms/image.
* Decreasing the number of region proposals from 300 per image to 50 almost doubles the speed (to 27ms/image) at a small loss of 1.5 percentage points mAP.
* Using Truncated SVD for the classifier reduces the required time per image by about 30% at roughly 1 percentage point of mAP loss.
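
The concatenated ReLU described under "How" can be sketched on a flat list of activations. This is a minimal illustration with hypothetical function names. In a convolutional network the concatenation would happen along the channel axis, and any additional scaling around the CReLU used in PVANET is not modelled here.

```python
def relu(x):
    return max(0.0, x)


def crelu(values):
    """Concatenate ReLU(x) with ReLU(-x), doubling the number of outputs."""
    return [relu(v) for v in values] + [relu(-v) for v in values]


if __name__ == "__main__":
    print(crelu([1.0, -2.0, 0.5]))
```

Because the negated half comes for free, a layer followed by CReLU only needs to compute half as many filters to produce the same number of output channels.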
### Summary

* They present a variation of Faster R-CNN, i.e. a model that predicts bounding boxes in images and classifies them.
* In contrast to Faster R-CNN, their model is fully convolutional.
* In contrast to Faster R-CNN, the computation per bounding box candidate (region proposal) is very low.

### How

* The basic architecture is the same as in Faster R-CNN:
  * A base network transforms an image to a feature map. Here they use ResNet-101 to do that.
  * A region proposal network (RPN) uses the feature map to locate bounding box candidates ("region proposals") in the image.
  * A classifier uses the feature map and the bounding box candidates and classifies each one of them into `C+1` classes, where `C` is the number of object classes to spot (e.g. "person", "chair", "bottle", ...) and `1` is added for the background.
    * During that process, small subregions of the feature maps (those that match the bounding box candidates) must be extracted and converted to fixed-size matrices. The method to do that is called "Region of Interest Pooling" (RoI-Pooling) and is based on max pooling. It is mostly the same as in Faster R-CNN.
  * Visualization of the basic architecture:
    * ![Architecture](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/RFCN__architecture.jpg?raw=true "Architecture")
* Position-sensitive classification
  * Fully convolutional bounding box detectors tend to not work well.
  * The authors argue that the problems come from the translation-invariance of convolutions, which is a desirable property in the case of classification but not when precise localization of objects is required.
  * They tackle that problem by generating multiple heatmaps per object class, each one being slightly shifted ("position-sensitive score maps").
  * More precisely:
    * The classifier generates per object class `c` a total of `k*k` heatmaps.
    * In the simplest form `k` is equal to `1`. Then only one heatmap is generated, which signals whether a pixel is part of an object of class `c`.
      * They use `k=3`, i.e. `3*3` heatmaps per class. The first of those heatmaps signals whether a pixel is part of the *top left* corner of a bounding box of class `c`. The second heatmap signals whether a pixel is part of the *top center* of a bounding box of class `c` (and so on).
    * The RoI-Pooling is applied to these heatmaps.
      * For `k=3`, each bounding box candidate is converted to `3*3` values. The first one represents the top left corner of the bounding box candidate. Its value is generated by taking the average of the values in that area in the first heatmap.
      * Once the `3*3` values are generated, the final score of class `c` for that bounding box candidate is computed by averaging the values.
    * That process is repeated for all classes and a softmax is used to determine the final class.
    * The graphic below shows examples for that:
      * ![Architecture](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/RFCN__examples.jpg?raw=true "Examples")
  * The above described RoI-Pooling uses only averages and hence is almost (computationally) free.
    * They make use of that during training by sampling many candidates and only backpropagating on those with high losses (online hard example mining, OHEM).
* À trous trick
  * In order to increase accuracy for small bounding boxes they use the à trous trick.
  * That means that they use a pretrained base network (here ResNet-101), then remove a pooling layer and set the à trous rate (aka dilation) of all convolutions after the removed pooling layer to `2`.
  * The à trous rate describes the distance between the sampling locations of a convolution. Usually that is `1` (sampled locations are right next to each other). If it is set to `2`, one value is "skipped" between each pair of neighbouring sampling locations.
  * By doing that, the convolutions still behave as if the pooling layer existed (and therefore their weights can be reused). At the same time, they work at an increased resolution, making them more capable of classifying small objects. (Runtime increases though.)
* Training of R-FCN happens similarly to Faster R-CNN.

### Results
* Similar accuracy as the most accurate Faster R-CNN configurations at a lower runtime of roughly 170ms per image.
* Switching to ResNet-50 decreases accuracy by about 2 percentage points mAP (at faster runtime). Switching to ResNet-152 seems to provide no measurable benefit.
* OHEM improves mAP by roughly 2 percentage points.
* The à trous trick improves mAP by roughly 2 percentage points.
* Training on `k=1` (one heatmap per class) results in a failure, i.e. a model that fails to predict bounding boxes. `k=7` is slightly more accurate than `k=3`.
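
The position-sensitive RoI-Pooling described above can be sketched roughly as follows for a single class. This is a simplified NumPy illustration, not the authors' implementation: the function name `ps_roi_score`, the box format `(x1, y1, x2, y2)` in feature-map pixels (end-exclusive) and the rounding of bin borders are assumptions made for the example.

```python
import numpy as np

def ps_roi_score(score_maps, box, k=3):
    """Score of one class for one box.

    score_maps: array (k*k, H, W), the k*k position-sensitive heatmaps of the class.
    box: (x1, y1, x2, y2) in feature-map pixels, x2/y2 exclusive (assumed format).
    """
    x1, y1, x2, y2 = box
    xs = np.linspace(x1, x2, k + 1).round().astype(int)
    ys = np.linspace(y1, y2, k + 1).round().astype(int)
    bin_values = []
    for row in range(k):
        for col in range(k):
            # Bin (row, col) only reads from its own heatmap, e.g. the top left
            # bin reads from the "top left" heatmap.
            patch = score_maps[row * k + col, ys[row]:ys[row + 1], xs[col]:xs[col + 1]]
            bin_values.append(patch.mean())
    # The class score is the average over all k*k bins.
    return float(np.mean(bin_values))

# Toy heatmaps for an object covering pixels 0..5: heatmap i fires only in "its" part.
k = 3
maps = np.zeros((k * k, 8, 8))
for row in range(k):
    for col in range(k):
        maps[row * k + col, 2 * row:2 * row + 2, 2 * col:2 * col + 2] = 1.0

aligned = ps_roi_score(maps, (0, 0, 6, 6))
shifted = ps_roi_score(maps, (2, 2, 8, 8))
print(aligned, shifted)
```

In this toy setup the well-aligned box gets a high score while the shifted box gets a low one, which is the localization sensitivity that a single heatmap per class (`k=1`) cannot provide.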

* R-CNN and its successor Fast R-CNN both rely on a "classical" method to find region proposals in images (i.e. "Which regions of the image look like they *might* be objects?").
* That classical method is selective search.
* Selective search is quite slow (about two seconds per image) and hence the bottleneck in Fast R-CNN.
* They replace it with a neural network (region proposal network, aka RPN).
* The RPN reuses the same features used for the remainder of the Fast R-CNN network, making the region proposal step almost free (about 10ms).

### How
* They now have three components in their network:
  * A model for feature extraction, called the "feature extraction network" (**FEN**). Initialized with the weights of a pretrained network (e.g. VGG16).
  * A model to use these features and generate region proposals, called the "Region Proposal Network" (**RPN**).
  * A model to use these features and region proposals to classify each region proposal's object and readjust the bounding box, called the "classification network" (**CN**). Initialized with the weights of a pretrained network (e.g. VGG16).
  * Usually, FEN will contain the convolutional layers of the pretrained model (e.g. VGG16), while CN will contain the fully connected layers.
  * (Note: Only "RPN" really pops up in the paper, the other two remain more or less unnamed. I added the two names to simplify the description.)
* Rough architecture outline:
  * ![Architecture](images/Faster_RCNN__architecture.jpg?raw=true "Architecture")
* The basic method at test time is as follows:
  1. Use FEN to convert the image to features.
  2. Apply RPN to the features to generate region proposals.
  3. Use Region of Interest Pooling (RoI-Pooling) to convert the features of each region proposal to a fixed-size vector.
  4. Apply CN to the RoI vectors to a) predict the class of each object (out of `K` object classes and `1` background class) and b) readjust the bounding box dimensions (top left coordinate, height, width).
* RPN
  * Basic idea:
    * Place anchor points on the image, all with the same distance to each other (regular grid).
    * Around each anchor point, extract rectangular image areas in various shapes and sizes ("anchor boxes"), e.g. thin/square/wide and small/medium/large rectangles. (More precisely: The features of these areas are extracted.)
    * Visualization:
      * ![Anchor Boxes](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Faster_RCNN__anchor_boxes.jpg?raw=true "Anchor Boxes")
    * Feed the features of these areas through a classifier and let it rate/predict the "regionness" of the rectangle in a range between 0 and 1. Values greater than 0.5 mean that the classifier thinks the rectangle might be a bounding box. (CN has to analyze that further.)
    * Feed the features of these areas through a regressor and let it optimize the region size (top left coordinate, height, width). That way you get all kinds of possible bounding box shapes, even though you only use a few base shapes.
  * Implementation:
    * The regular grid of anchor points naturally arises due to the downscaling of the FEN, it doesn't have to be implemented explicitly.
    * The extraction of anchor boxes and classification + regression can be efficiently implemented using convolutions.
    * They first apply a 3x3 convolution on the feature maps. Note that the convolution covers a large image area due to the downscaling.
      * The paper is not entirely clear here, but it sounds like they use 256 filters/kernels for that convolution.
    * Then they apply some 1x1 convolutions for the classification and regression.
      * They use `2*k` 1x1 convolutions for classification and `4*k` 1x1 convolutions for regression, where `k` is the number of different shapes of anchor boxes.
      * They use `k=9` anchor box types: Three sizes (small, medium, large), each in three shapes (thin, square, wide).
      * The way they build training examples (below) forces some 1x1 convolutions to react only to some anchor box types.
  * Training:
    * Positive examples are anchor boxes that have an IoU with a ground truth bounding box of 0.7 or more. If no anchor box has such an IoU with a specific ground truth box, the one with the highest IoU is used instead.
    * Negative examples are all anchor boxes whose IoU does not exceed 0.3 for any ground truth bounding box.
    * Any anchor box that falls in neither of these groups does not contribute to the loss.
    * Anchor boxes that would violate image boundaries are not used as examples.
    * The loss is similar to the one in Fast R-CNN: A sum consisting of log loss for the classifier and smooth L1 loss (=smoother absolute distance) for regression.
    * Per batch they only sample examples from one image (for efficiency).
    * They use 128 positive examples and 128 negative ones. If they can't come up with 128 positive examples, they add more negative ones.
  * Test:
    * They use non-maximum suppression (NMS) to remove near-duplicate region proposals, i.e. among all region proposals that have an IoU overlap of 0.7 or more, they pick the one that has the highest score.
    * They use the 300 proposals with the highest score after NMS (or fewer if there aren't that many).
* Feature sharing
  * They want to share the features of the FEN between the RPN and the CN.
  * So they need a special training method that fine-tunes all three components while keeping the features extracted by FEN useful for both RPN and CN at the same time (not only for one of them).
  * Their training methods are:
    * Alternating training: One batch for FEN+RPN, one batch for FEN+CN, then again one batch for FEN+RPN and so on.
    * Approximate joint training: Train one network of FEN+RPN+CN. Merge the gradients of RPN and CN that arrive at FEN via simple summation. This method does not compute a gradient from CN through the RPN's regression task, as that is non-trivial. (This runs 25-50% faster than alternating training, accuracy is mostly the same.)
    * Non-approximate joint training: This would compute the above mentioned missing gradient, but isn't implemented.
    * 4-step alternating training:
      1. Clone FEN to FEN1 and FEN2.
      2. Train the pair FEN1 + RPN.
      3. Train the pair FEN2 + CN using the region proposals from the trained RPN.
      4. Fine-tune the pair FEN2 + RPN. FEN2 is fixed, RPN takes the weights from step 2.
      5. Fine-tune the pair FEN2 + CN. FEN2 is fixed, CN takes the weights from step 3, region proposals come from RPN from step 4.
* Results
  * Example images:
    * ![Example images](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Faster_RCNN__examples.jpg?raw=true "Example images")
  * Pascal VOC (with VGG16 as FEN)
    * Using an RPN instead of SS (selective search) slightly improved mAP from 66.9% to 69.9%.
    * Training RPN and CN on the same FEN (sharing FEN's weights) does not worsen the mAP, but instead improves it slightly from 68.5% to 69.9%.
    * Using the RPN instead of SS significantly speeds up the network, from 1830ms/image (about 0.5fps) to 198ms/image (5fps). (Both stats with VGG16. They also use ZF as the FEN, which puts them at 17fps, but mAP is lower.)
    * Using more scales and shapes (ratios) for the anchor boxes per anchor point improves results.
      * 1 scale, 1 ratio: 65.8% mAP (scale `128*128`, ratio 1:1) or 66.7% mAP (scale `256*256`, same ratio).
      * 3 scales, 3 ratios: 69.9% mAP (scales `128*128`, `256*256`, `512*512`; ratios 1:1, 1:2, 2:1).
  * Two-stage vs one-stage
    * Instead of the two-stage system (first, generate proposals via RPN, then classify them via CN), they try a one-stage system.
    * In the one-stage system they move a sliding window over the computed feature maps and regress at every location the bounding box sizes and classify the box.
    * When doing this, their mAP drops from 58.7% to about 54%.
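
The anchor boxes and the IoU-based labelling of RPN training examples described above can be sketched as follows. This is a simplified plain-Python illustration under stated assumptions: the helper names (`make_anchors`, `iou`, `label_anchor`) are hypothetical, ratios are interpreted as height/width, and the fallback rule (use the highest-IoU anchor if none reaches 0.7) as well as the image-boundary check are omitted.

```python
def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) boxes (x1, y1, x2, y2) centred on (cx, cy)."""
    boxes = []
    for s in scales:
        for r in ratios:
            # Keep the area at s*s while changing the shape; r = height / width (assumption).
            w = s / r ** 0.5
            h = s * r ** 0.5
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes

def iou(a, b):
    """Intersection over union of two boxes (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def label_anchor(anchor, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """1 = positive example, 0 = negative example, -1 = ignored (no loss)."""
    best = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
    if best >= pos_thresh:
        return 1
    if best <= neg_thresh:
        return 0
    return -1

anchors = make_anchors(300, 300)
ground_truth = [(236, 236, 364, 364)]
labels = [label_anchor(a, ground_truth) for a in anchors]
print(len(anchors), labels)
```

With the default three scales and three ratios each anchor point yields `k=9` boxes, of which only the ones that overlap a ground truth box strongly become positive examples.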
* The original R-CNN had three major disadvantages:
  1. Two-stage training pipeline: Instead of only training a CNN, one had to train first a CNN and then multiple SVMs.
  2. Expensive training: Training was slow and required lots of disk space (feature vectors needed to be written to disk for all region proposals (2000 per image) before training the SVMs).
  3. Slow test: Each region proposal had to be handled independently.
* Fast R-CNN is an improved version of R-CNN and tackles the mentioned problems.
  * It no longer uses SVMs, only CNNs (single-stage).
  * It does one single feature extraction per image instead of per region, making it much faster (9x faster at training, 213x faster at test).
  * It is more accurate than R-CNN.

### How
* The basic architecture, training and testing methods are mostly copied from R-CNN.
* For each image at test time they do:
  * They generate region proposals via selective search.
  * They feed the image once through the convolutional layers of a pretrained network, usually VGG16.
  * For each region proposal they extract the respective region from the features generated by the network.
  * The regions can have different sizes, but the following steps need fixed-size vectors. So each region is downscaled via max pooling so that it has a size of 7x7 (so apparently they ignore regions of sizes below 7x7...?).
    * This is called Region of Interest Pooling (RoI-Pooling).
    * During the backwards pass, partial derivatives can be transferred to the maximum value (as usual in max pooling). These derivative values are summed up over the different regions (in the same image).
  * They reshape the 7x7 regions to vectors of length `F*7*7`, where `F` is the number of filters in the last convolutional layer.
  * They feed these vectors through another network which predicts:
    1. The class of the region (including background class).
    2. Top left x-coordinate, top left y-coordinate, log height and log width of the bounding box (i.e. it fine-tunes the region proposal's bounding box). These values are predicted once for every class (so `K*4` values).
* Architecture as image:
  * ![Architecture](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Fast_RCNN__architecture.jpg?raw=true "Architecture")
* Sampling for training
  * Efficiency
    * If the batch size is `B`, it is inefficient to sample region proposals from `B` images, as each image will require a full forward pass through the base network (e.g. VGG16).
    * It is much more efficient to use few images to share most of the computation between region proposals.
    * They use two images per batch (each 64 region proposals) during training.
    * This technique introduces correlations between examples in batches, but they did not observe any problems from that.
    * They call this technique "hierarchical sampling" (first images, then region proposals).
  * IoUs
    * Positive examples for specific classes during training are region proposals that have an IoU with ground truth bounding boxes of `>=0.5`.
    * Examples for background region proposals during training have IoUs with any ground truth box in the interval `[0.1, 0.5)`.
      * Not picking IoUs below 0.1 is similar to hard negative mining.
  * They use 25% positive examples, 75% negative/background examples per batch.
  * They apply horizontal flipping as data augmentation, nothing else.
* Outputs
  * For their class predictions they use a simple softmax with negative log likelihood.
  * For their bounding box regression they use a smooth L1 loss (similar to mean absolute error, but switches to mean squared error for very low values).
    * Smooth L1 loss is less sensitive to outliers than an L2 loss and less likely to suffer from exploding gradients.
  * The smooth L1 loss is only active for positive examples (not background examples). (Not active means that it is zero.)
* Training schedule
  * They use SGD.
  * They train 30k batches with learning rate 0.001, then 0.0001 for another 10k batches. (That is on Pascal VOC; they use more batches on larger datasets.)
  * They use twice the learning rate for the biases.
  * They use momentum of 0.9.
  * They use parameter decay of 0.0005.
* Truncated SVD
  * The final network for class prediction and bounding box regression has to be applied to every region proposal.
  * It contains one large fully connected hidden layer and one fully connected output layer (`K+1` classes plus `K*4` regression values).
  * For 2000 proposals that becomes slow.
  * So they compress the layers after training to fewer weights via truncated SVD.
    * A weights matrix is approximated via ![TSVD equation](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Fast_RCNN__tsvd.jpg?raw=true "TSVD equation")
    * U (`u x t`) contains the first `t` left-singular vectors of W.
    * Sigma is a `t x t` diagonal matrix of the top `t` singular values.
    * V (`v x t`) contains the first `t` right-singular vectors of W.
  * W is then replaced by two layers: One contains `Sigma V^T` as weights (no biases), the other contains `U` as weights (with original biases).
  * Parameter count goes down to `t(u+v)` from `uv`.

### Results
* They try three base models:
  * AlexNet (Small, S)
  * VGG_CNN_M_1024 (Medium, M)
  * VGG16 (Large, L)
* On VGG16 and Pascal VOC 2007, compared to original R-CNN:
  * Training time down to 9.5h from 84h (8.8x faster).
  * Test rate *with SVD* (1024 singular values) improves from 47 seconds per image to 0.22 seconds per image (213x faster).
    * Test rate *without SVD* improves similarly to 0.32 seconds per image.
  * mAP improves from 66.0% to 66.6% (66.9% without SVD).
  * Per class accuracy results:
    * ![VOC2012 results](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Fast_RCNN__pvoc2012.jpg?raw=true "VOC2012 results")
* Fixing the weights of VGG16's convolutional layers and only fine-tuning the fully connected layers (those are applied to each region proposal) decreases the accuracy to 61.4%.
  * This decrease in accuracy is most significant for the later convolutional layers, but marginal for the first layers.
  * Therefore they only train the convolutional layers starting with `conv3_1` (9 out of 13 layers), which speeds up training.
* Multi-task training
  * Training models on classification and bounding box regression instead of only on classification improves the mAP (from 62.6% to 66.9%).
  * Doing this in one network instead of two separate models (one for classification, one for bounding box regression) increases mAP by roughly 2-3 percentage points.
* They did not find a significant benefit of training the model on multiple scales (e.g. same image sometimes at 400x400, sometimes at 600x600, sometimes at 800x800 etc.).
  * Note that their raw CNN (everything before RoI-Pooling) is fully convolutional, so they can feed the images at any scale through the network.
* Increasing the amount of training data seemed to improve mAP a bit, but not as much as one might hope for.
* Using a softmax loss instead of an SVM seemed to marginally increase mAP (0-1 percentage points).
* Using more region proposals from selective search does not simply increase mAP. Instead it can lead to higher recall, but lower precision.
  * ![Proposal schemes](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Fast_RCNN__proposal_schemes.jpg?raw=true "Proposal schemes")
* Using densely sampled region proposals (as in sliding window) significantly reduces mAP (from 59.2% to 52.9%). If SVMs instead of softmaxes are used, the results are even worse (49.3%).
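
The truncated SVD compression of a fully connected layer described above can be sketched in NumPy as follows. It is only an illustration under assumptions: the function name `compress_fc` and the toy sizes are made up, and the weight matrix is deliberately built with low rank so that keeping `t` singular values reproduces the layer exactly (real layers are only approximated).

```python
import numpy as np

def compress_fc(W, t):
    """Split a (u x v) weight matrix into two smaller layers via truncated SVD."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    first = S[:t, None] * Vt[:t, :]  # Sigma_t V_t^T, shape (t, v), used without biases
    second = U[:, :t]                # U_t, shape (u, t), keeps the original biases
    return first, second

rng = np.random.default_rng(0)
u, v, rank = 64, 32, 4
# Toy layer with exact rank 4 (assumption made so the demo is lossless).
W = rng.normal(size=(u, rank)) @ rng.normal(size=(rank, v))
b = rng.normal(size=u)
x = rng.normal(size=v)

first, second = compress_fc(W, t=rank)
y_full = W @ x + b
y_compressed = second @ (first @ x) + b

params_before = u * v
params_after = first.size + second.size  # equals t * (u + v)
print(params_before, params_after, np.allclose(y_full, y_compressed))
```

The two small layers are applied one after the other, so the cost per region proposal scales with `t(u+v)` instead of `uv`.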
* Previously, methods to detect bounding boxes in images were often based on the combination of manual feature extraction with SVMs.
* They replace the manual feature extraction with a CNN, leading to significantly higher accuracy.
* They use supervised pretraining on auxiliary datasets to deal with the small amount of labeled data (instead of the sometimes used unsupervised pretraining).
* They call their method R-CNN ("Regions with CNN features").

### How
* Their system has three modules: 1) Region proposal generation, 2) CNN-based feature extraction per region proposal, 3) classification.
  * ![Architecture](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Rich_feature_hierarchies_for_accurate_object_detection_and_semantic_segmentation__architecture.jpg?raw=true "Architecture")
* Region proposal generation
  * A region proposal is a bounding box candidate that *might* contain an object.
  * By default they generate 2000 region proposals per image.
  * They suggest "simple" (i.e. not learned) algorithms for this step (e.g. objectness, selective search, CPMC).
  * They use selective search (makes it comparable to previous systems).
* CNN features
  * Uses a CNN to extract features, applied to each region proposal (replaces the previously used manual feature extraction).
  * So each region proposal is turned into a fixed-length vector.
  * They use AlexNet by Krizhevsky et al. as their base CNN (takes 227x227 RGB images, converts them into 4096-dimensional vectors).
  * They add `p=16` pixels to each side of every region proposal, extract the pixels and then simply resize them to 227x227 (ignoring aspect ratio, so images might end up distorted).
  * They generate one 4096d vector per region proposal, which is less than what some previous manual feature extraction methods used. That enables faster classification, less memory usage and thus more possible classes.
* Classification
  * A classifier that receives the extracted feature vectors (one per region proposal) and classifies them into a predefined set of available classes (e.g. "person", "car", "bike", "background / no object").
  * They use one SVM per available class.
  * The regions that were not classified as background might overlap (multiple bounding boxes on the same object).
    * They use greedy non-maximum suppression to fix that problem (for each class individually).
    * That method simply rejects regions if they overlap strongly with another region that has a higher score.
    * Overlap is determined via Intersection over Union (IoU).
* Training method
  * Pre-training of CNN
    * They use AlexNet pretrained on ImageNet (1000 classes).
    * They replace the last fully connected layer with a randomly initialized one that leads to `C+1` classes (`C` object classes, `+1` for background).
  * Fine-tuning of CNN
    * They use SGD with learning rate `0.001`.
    * Batch size is 128 (32 positive windows, 96 background windows).
    * A region proposal is considered positive if its IoU with any ground truth bounding box is `>=0.5`.
  * SVM
    * They train one SVM per class via hard negative mining.
    * For the SVMs they use an IoU threshold of `0.3` to label negative examples (proposals that overlap less than that with every ground truth box of the class), which performed better than 0.5.
### Results
* Pascal VOC 2010
  * They: 53.7% mAP
  * Closest competitor (SegDPM): 40.4% mAP
  * Closest competitor that uses the same region proposal method (UVA): 35.1% mAP
  * ![Scores on Pascal VOC 2010](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Rich_feature_hierarchies_for_accurate_object_detection_and_semantic_segmentation__scores.jpg?raw=true "Scores on Pascal VOC 2010")
* ILSVRC2013 detection
  * They: 31.4% mAP
  * Closest competitor (OverFeat): 24.3% mAP
* They feed a large number of region proposals through the network and log for each filter in the last conv layer which images activated it the most:
  * ![Activations](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Rich_feature_hierarchies_for_accurate_object_detection_and_semantic_segmentation__activations.jpg?raw=true "Activations")
* Usefulness of layers:
  * They remove later layers of the network and retrain in order to find out which layers are the most useful ones.
  * Their result is that both fully connected layers of AlexNet seemed to be very domain-specific and profit most from fine-tuning.
* Using VGG16:
  * Using VGG16 instead of AlexNet increased mAP from 58.5% to 66.0% on Pascal VOC 2007.
  * Computation time was 7 times higher.
* They train a linear regression model that improves the bounding box dimensions based on the extracted features of the last pooling layer. That improved their mAP by 3-4 percentage points.
* The region proposals generated by selective search have a recall of 98% on Pascal VOC and 91.6% on ILSVRC2013 (measured by IoU of `>=0.5`).
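
The greedy non-maximum suppression used on the per-class detections can be sketched as follows. This is a generic plain-Python version, not the authors' code: the function names are hypothetical and the IoU threshold of `0.3` is an assumption chosen for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def greedy_nms(boxes, scores, iou_thresh=0.3):
    """Return indices of kept boxes for one class, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Reject a box if it overlaps strongly with an already kept, higher scoring box.
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores)
print(kept)
```

The procedure would be run once per class, since boxes of different classes are allowed to overlap.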
* They compare the results of various models for pedestrian detection.
* The various models were developed over the course of ~10 years (2003-2014).
* They analyze which factors seemed to improve the results.
* They derive new models for pedestrian detection from that.

### Comparison: Datasets
* Available datasets
  * INRIA: Small dataset. Diverse images.
  * ETH: Video dataset. Stereo images.
  * TUD-Brussels: Video dataset.
  * Daimler: No color channel.
  * Daimler stereo: Stereo images.
  * Caltech-USA: Most often used. Large dataset.
  * KITTI: Often used. Large dataset. Stereo images.
* All datasets except KITTI are part of the "unified evaluation toolbox" that allows authors to easily test on all of these datasets.
* Evaluation initially used false positives per window (FPPW) and later switched to false positives per image (FPPI), because the per-window measure skewed the results.
* Common evaluation metrics:
  * MR: Log-average miss rate (lower is better)
  * AUC: Area under the precision-recall curve (higher is better)

### Comparison: Methods
* Families
  * They identified three families of methods: Deformable Parts Models, Deep Neural Networks, Decision Forests.
  * Decision Forests was the most popular family.
  * No specific family seemed to perform better than other families.
  * There was no evidence that non-linearity in kernels was needed (given sophisticated features).
* Additional data
  * Adding (coarse) optical flow data to each image seemed to consistently improve results.
  * There was some indication that adding stereo data to each image improves the results.
* Context
  * For sliding window detectors, adding context from around the window seemed to improve the results.
  * E.g. context can indicate whether there were detections next to the window, as people tend to walk in groups.
* Deformable parts
  * They saw no evidence that deformable part models outperformed other models.
* Multi-scale models
  * Training separate models for each sliding window scale seemed to improve results slightly.
* Deep architectures
  * They saw no evidence that deep neural networks outperformed other models. (Note: Paper is from 2014, might have changed already?)
* Features
  * Best performance was usually achieved with simple HOG+LUV features, i.e. by converting each window into:
    * 6 channels of gradient orientations
    * 1 channel of gradient magnitude
    * 3 channels of LUV color space
  * Some models use significantly more channels for gradient orientations, but there was no evidence that this was necessary to achieve good accuracy.
  * However, using a larger variety of features (and more sophisticated ones) seemed to improve results.

### Their new model:
* They choose Decision Forests as their model framework (2048 level-2 trees, i.e. 3 thresholds per tree).
* They use features from the [Integral Channels Features framework](http://pages.ucsd.edu/~ztu/publication/dollarBMVC09ChnFtrs_0.pdf). (Basically just a mixture of common/simple features per window.)
* They add optical flow as a feature.
* They add context around the window as a feature. (A second detector that detects windows containing two persons.)
* Their model significantly improves upon the state of the art (from 34 to 22% MR on the Caltech dataset).

![Table](https://raw.githubusercontent.com/aleju/papers/master/mixed/images/Ten_Years_of_Pedestrian_Detection_What_Have_We_Learned__table.png?raw=true "Table")

*Overview of models developed over the years, starting with Viola Jones (VJ) and ending with their suggested model (Katamari-v1). (DF = Decision Forest, DPM = Deformable Parts Model, DN = Deep Neural Network; I = Inria Dataset, C = Caltech Dataset)*
* Style transfer between images works, in its original form, by iteratively making changes to a content image so that its style matches the style of a chosen style image more and more.
* That iterative process is very slow.
* Alternatively, one can train a single feed-forward generator network to apply a style in one forward pass. The network is trained on a dataset of input images and their stylized versions (stylized versions can be generated using the iterative approach).
* So far, these generator networks were much faster than the iterative approach, but their quality was lower.
* They describe a simple change to these generator networks to increase the image quality (up to the same level as the iterative approach).

### How
* In the generator networks, they simply replace all batch normalization layers with instance normalization layers.
* Batch normalization normalizes using the information from the whole batch, while instance normalization normalizes each feature map on its own.
* Equations
  * Let `H` = Height, `W` = Width, `T` = Batch size
  * Batch Normalization:
    * ![Batch Normalization Equations](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Instance_Normalization_The_Missing_Ingredient_for_Fast_Stylization__batch_normalization.jpg?raw=true "Batch Normalization Equations")
  * Instance Normalization
    * ![Instance Normalization Equations](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Instance_Normalization_The_Missing_Ingredient_for_Fast_Stylization__instance_normalization.jpg?raw=true "Instance Normalization Equations")
* They apply instance normalization at test time too (identically).

### Results
* Same image quality as the iterative approach (at a fraction of the runtime).
* One content image with two different styles using their approach:
  * ![Example](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Instance_Normalization_The_Missing_Ingredient_for_Fast_Stylization__example.jpg?raw=true "Example")
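
The difference between the two normalizations can be sketched in NumPy as follows. This is a minimal illustration that assumes an array layout of `(T, C, H, W)` and leaves out the learned scale and shift parameters; the function names are chosen for the example.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalize each channel with statistics over the whole batch (T, H, W)."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def instance_norm(x, eps=1e-5):
    """Normalize each feature map of each image on its own (over H, W only)."""
    mean = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3, 8, 8))
x[0] *= 10.0  # give the first image a much higher contrast than the others

bn = batch_norm(x)
inorm = instance_norm(x)
print(bn[0].std(), bn[1].std())        # contrast differences survive batch normalization
print(inorm[0].std(), inorm[1].std())  # instance normalization removes them
```

Because the statistics are computed per image, the same function can be applied unchanged at test time, which matches the note above.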
Official code: https://github.com/anewell/posehgtrain

* They suggest a new model architecture for human pose estimation (i.e. "lay a skeleton over a person").
* Their architecture is based on progressive pooling followed by progressive upsampling, creating an hourglass form.
* Input are images showing a person's body.
* Outputs are K heatmaps (for K body joints), with each heatmap showing the likely position of a single joint on the person (e.g. "ankle", "wrist", "left hand", ...).

### How
* *Basic building block*
  * They use residuals as their basic building block.
  * Each residual has three layers: One 1x1 convolution for dimensionality reduction (from 256 to 128 channels), a 3x3 convolution, a 1x1 convolution for dimensionality increase (back to 256).
  * Visualized:
    * ![Building Block](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Stacked_Hourglass_Networks_for_Human_Pose_Estimation__building_block.jpg?raw=true "Building Block")
* *Architecture*
  * Their architecture starts with one standard 7x7 convolution that has strides of (2, 2).
  * They use MaxPooling (2x2, strides of (2, 2)) to downsample the images/feature maps.
  * They use Nearest Neighbour upsampling (factor 2) to upsample the images/feature maps.
  * After every pooling step they add three of their basic building blocks.
  * Before each pooling step they branch off the current feature map as a minor branch and apply three basic building blocks to it. Then they add it back to the main branch after that one has been upsampled again to the original size.
  * The feature maps between each basic building block have (usually) 256 channels.
  * Their HourGlass ends in two 1x1 convolutions that create the heatmaps.
  * They stack two of their HourGlass networks after each other. Between them they place an intermediate loss. That way, the second network can learn to improve the predictions of the first network.
  * Architecture visualized:
    * ![Architecture](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Stacked_Hourglass_Networks_for_Human_Pose_Estimation__architecture.jpg?raw=true "Architecture")
* *Heatmaps*
  * The outputs generated by the network are heatmaps, one per joint.
  * Each ground truth heatmap has a small gaussian peak at the correct position of a joint, everything else has value 0.
  * If a joint isn't visible, the ground truth heatmap for that joint is all zeros.
* *Other stuff*
  * They use batch normalization.
  * Activation functions are ReLUs.
  * They use RMSprop as their optimizer.
  * Implemented in Torch.

### Results
* They train and test on FLIC (only one HourGlass) and MPII (two stacked HourGlass networks).
* Training is done with augmentations (horizontal flip, up to 30 degrees rotation, scaling, no translation to keep the body of interest in the center of the image).
* Evaluation is done via PCK@0.2 (i.e. percentage of predicted keypoints that are within 0.2 head sizes of their ground truth annotation (head size of the specific body)).
* Results on FLIC are at >95%.
* Results on MPII are between 80.6% (ankle) and 97.6% (head). Average is 89.4%.
* Using two stacked HourGlass networks performs around 3% better than one HourGlass network (even when adjusting for parameters).
* Training time was 5 days on a Titan X (9xx generation).
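
The ground truth heatmaps described above can be sketched as follows. This is a NumPy illustration under assumptions: the heatmap size of 64x64, `sigma=1.0` and the function name `joint_heatmap` are chosen for the example, and values far from the peak are tiny rather than exactly zero.

```python
import numpy as np

def joint_heatmap(height, width, joint_xy=None, sigma=1.0):
    """Ground truth heatmap for one joint; all zeros if the joint is not visible."""
    if joint_xy is None:
        return np.zeros((height, width), dtype=np.float32)
    x0, y0 = joint_xy
    ys, xs = np.mgrid[0:height, 0:width]
    # Small gaussian peak centred on the joint position.
    heatmap = np.exp(-((xs - x0) ** 2 + (ys - y0) ** 2) / (2.0 * sigma ** 2))
    return heatmap.astype(np.float32)

visible = joint_heatmap(64, 64, joint_xy=(20, 30))
hidden = joint_heatmap(64, 64, joint_xy=None)
peak_row, peak_col = np.unravel_index(visible.argmax(), visible.shape)
print(peak_row, peak_col, visible.max(), hidden.sum())
```

At prediction time the joint position would be read off in the same way, as the location of the maximum in the predicted heatmap.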
They describe a CNN architecture that can be used to identify a person given an image of their face.

### How
* The expected input is the image of a face (i.e. it does not search for faces in images, the faces already have to be extracted by a different method).
* *Face alignment / Frontalization*
  * Target of this step: Get rid of variations within the face images, so that every face seems to look straight into the camera ("frontalized").
  * 2D alignment
    * They search for landmarks (fiducial points) on the face.
      * They use SVRs (features: LBPs) for that.
      * After every application of the SVR, the localized landmarks are used to transform/normalize the face. Then the SVR is applied again. By doing this, the locations of the landmarks are gradually refined.
    * They use the detected landmarks to normalize the face images (via scaling, rotation and translation).
  * 3D alignment
    * The 2D alignment can normalize variations within the 2D plane, but not out-of-plane variations (e.g. seeing that face from its left/right side). To normalize out-of-plane variations they need a 3D transformation.
    * They detect an additional 67 landmarks on the faces (again via SVRs).
    * They construct a human face mesh from a dataset (USF Human-ID).
    * They map the 67 landmarks to that mesh.
    * They then use some more complicated steps to recover the frontalized face image.
* *CNN architecture*
  * The CNN receives the frontalized face images (152x152, RGB).
* It then applies the following steps: * Convolution, 32 filters, 11x11, ReLU (-> 32x142x142, CxHxW) * Max pooling over 3x3, stride 2 (-> 32x71x71) * Convolution, 16 filters, 9x9, ReLU (-> 16x63x63) * Local Convolution, 16 filters, 9x9, ReLU (-> 16x55x55) * Local Convolution, 16 filters, 7x7, ReLU (-> 16x25x25) * Local Convolution, 16 filters, 5x5, ReLU (-> 16x21x21) * Fully Connected, 4096, ReLU * Fully Connected, 4030, Softmax * Local Convolutions use a different set of learned weights at every "pixel" (while a normal convolution uses the same set of weights at all locations). * They can afford to use local convolutions because of their frontalization, which roughly forces specific landmarks to be at specific locations. * They use dropout (apparently only after the first fully connected layer). * They normalize "the features" (probably the 4096 fully connected layer). Each component is divided by its maximum value across a training set. Additionally, the whole vector is L2-normalized. The goal of this step is to make the network less sensitive to illumination changes. * The whole network has about 120 million parameters. * Visualization of the architecture: * ![Architecture](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/DeepFace__architecture.jpg?raw=true "Architecture") * *Training* * The network receives images, each showing a face, and is trained to classify the identity of the face (e.g. gets image of Obama, has to return "that's Obama"). * They use cross-entropy as their loss. * *Face verification* * In order to tell whether two images of faces show the same person they try three different methods. * Each of these relies on the vector extracted by the first fully connected layer in the network (4096d). * Let these vectors be `f1` (image 1) and `f2` (image 2). The methods are then: 1. Inner product between `f1` and `f2`. The classification (same person/not same person) is then done by a simple threshold. 2.
Weighted X^2 (chi-squared) distance. Equation, per vector component i: `weight_i * (f1[i] - f2[i])^2 / (f1[i] + f2[i])`. The resulting vector is then fed into an SVM. 3. Siamese network. Means here simply that the absolute distance between `f1` and `f2` is calculated (`|f1 - f2|`), each component is weighted by a learned weight and then the sum of the components is calculated. If the result is above a threshold, the faces are considered to show the same person. ### Results * They train their network on the Social Face Classification (SFC) dataset. That seems to be a Facebook-internal dataset (i.e. not public) with 4.4 million faces of 4k people. * When applied to the LFW dataset: * Face recognition ("which person is shown in the image") (apparently they retrained the whole model on LFW for this task?): * Simple SVM with LBP (i.e. not their network): 91.4% mean accuracy. * Their model, with frontalization, with 2d alignment: ??? no value. * Their model, no frontalization (only 2d alignment): 94.3% mean accuracy. * Their model, no frontalization, no 2d alignment: 87.9% mean accuracy. * Face verification (two images -> same/not same person) (apparently also trained on LFW? unclear): * Method 1 (inner product + threshold): 95.92% mean accuracy. * Method 2 (X^2 vector + SVM): 97.00% mean accuracy. * Method 3 (siamese): Apparently 96.17% accuracy alone, and 97.25% when used in an ensemble with other methods (under special training schedule using SFC dataset). * When applied to the YTF dataset (YouTube video frames): * 92.5% accuracy via X^2-method.
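The first two DeepFace verification measures above are simple enough to write out. The sketch below is a plain-Python illustration under stated assumptions: the feature vectors are non-negative (they follow a ReLU), the small `eps` guard against division by zero is an addition, and the function names are made up for this example. In the summary the weights come from an SVM. Here they are just passed in.

```python
def inner_product(f1, f2):
    """Method 1: similarity as the inner product of the two feature vectors."""
    return sum(a * b for a, b in zip(f1, f2))


def chi_squared_terms(f1, f2, eps=1e-12):
    """Per-component terms (f1[i] - f2[i])^2 / (f1[i] + f2[i]).

    Assumes non-negative features. `eps` is an added guard against
    division by zero and is not part of the summarized method.
    """
    return [(a - b) ** 2 / (a + b + eps) for a, b in zip(f1, f2)]


def weighted_chi_squared(f1, f2, weights):
    """Method 2: weighted sum of the chi-squared terms."""
    return sum(w * t for w, t in zip(weights, chi_squared_terms(f1, f2)))
```

With `f1 = [1, 0, 3]` and `f2 = [1, 2, 1]` the per-component terms are roughly `0`, `2` and `1`, so identical components contribute nothing to the distance.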
* Most neural machine translation models currently operate on word vectors or one-hot vectors of words. * They instead generate the vector of each word on a character-level. * Thereby, the model can spot character-similarities between words and treat them in a similar way. * They do that only for the source language, not for the target language. ### How * They treat each word of the source text on its own. * To each word they then apply the model from [Character-aware neural language models](https://arxiv.org/abs/1508.06615), i.e. they do per word: * Embed each character into a 620-dimensional space. * Stack these vectors next to each other, resulting in a 2d-tensor in which each column is one of the vectors (i.e. shape `620xN` for `N` characters). * Apply convolutions of size `620xW` to that tensor, where a few different values are used for `W` (i.e. some convolutions cover few characters, some cover many characters). * Apply a tanh after these convolutions. * Apply a max-over-time to the results of the convolutions, i.e. for each convolution use only the maximum value. * Reshape to 1d-vector. * Apply two highway layers. * They get 1024-dimensional vectors (one per word). * Visualization of their steps: * ![Architecture](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Characterbased_Neural_Machine_Translation__architecture.jpg?raw=true "Architecture") * Afterwards they apply the model from [Neural Machine Translation by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473) to these vectors, yielding a translation to a target language. * Whenever that translation yields an unknown target-language word ("UNK"), they replace it with the respective (untranslated) word from the source text. ### Results * They use the German-English [WMT](http://www.statmt.org/wmt15/translationtask.html) dataset. * BLEU improvements (compared to neural translation without character-level words): * German-English improves by about 1.5 points.
* English-German improves by about 3 points. * Reduction in the number of unknown target-language words (same baseline again): * German-English goes down from about 1500 to about 1250. * English-German goes down from about 3150 to about 2650. * Translation examples (Phrase = phrase-based/non-neural translation, NN = non-character-based neural translation, CHAR = theirs): * ![Examples](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Characterbased_Neural_Machine_Translation__examples.jpg?raw=true "Examples")
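The per-word character pipeline described above (embed, convolve over `W` characters, tanh, max-over-time) can be illustrated with a toy example. This is a sketch, not the model's code: it uses tiny embedding vectors instead of 620 dimensions, skips the highway layers, and the zero padding for words shorter than a kernel is an assumption of this example.

```python
import math


def conv_max_over_time(char_vectors, kernel):
    """Slide one kernel over the characters, apply tanh, keep the maximum.

    char_vectors: N embedding vectors of length D (one per character).
    kernel: W weight vectors of length D (a filter covering W characters).
    """
    dim = len(kernel[0])
    width = len(kernel)
    # Assumption: words shorter than the kernel are padded with zero vectors.
    padding = [[0.0] * dim] * max(0, width - len(char_vectors))
    padded = list(char_vectors) + padding
    responses = []
    for start in range(len(padded) - width + 1):
        total = sum(
            k * c
            for offset in range(width)
            for k, c in zip(kernel[offset], padded[start + offset])
        )
        responses.append(math.tanh(total))
    return max(responses)


def word_features(char_vectors, kernels):
    """One feature per kernel, concatenated into the 1d word vector."""
    return [conv_max_over_time(char_vectors, kernel) for kernel in kernels]
```

Because only the maximum response per kernel is kept, the output length depends on the number of kernels and not on the number of characters, which is what lets words of different lengths map to vectors of the same size.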
* They suggest a new model for human pose estimation (i.e. to lay a "skeleton" over the image of a person). * Their model has a (more or less) recurrent architecture. * Initial estimates of keypoint locations are refined in several steps. * The idea of the recurrent architecture is derived from message passing, unrolled into one feed-forward model. ### How * Architecture * They generate the end result in multiple steps, similar to a recurrent network. * Step 1: * Receives the image (368x368 resolution). * Applies a few convolutions to the image in order to predict for each pixel the likelihood of belonging to a keypoint (head, neck, right elbow, ...). * Step 2 and later: * (Modified) Receives the image (368x368 resolution) and the previous likelihood scores. * (Same) Applies a few convolutions to the image in order to predict for each pixel the likelihood of belonging to a keypoint (head, neck, right elbow, ...). * (New) Concatenates the likelihoods with the likelihoods of the previous step. * (New) Applies a few more convolutions to the concatenation to compute the final likelihood scores. * Visualization of the architecture: * ![Architecture](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Convolutional_Pose_Machines__architecture.jpg?raw=true "Architecture") * Loss function * The basic loss function is a simple mean squared error between the expected output maps per keypoint and the predicted ones. * In the expected output maps they mark the correct positions of the keypoints using a small gaussian function. * They apply losses after each step in the architecture, arguing that this helps against vanishing gradients (they don't seem to be using BN). * The expected output maps of the first step actually have the positions of all keypoints of a certain type (e.g. neck) marked, i.e. if there are multiple people in the extracted image patch there might be multiple correct keypoint positions.
Only at step 2 and later they reduce that to the expected person (i.e. one keypoint position per map). ### Results * Example results: * ![Example results](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Convolutional_Pose_Machines__results.jpg?raw=true "Example results") * Selfcorrection of predictions over several timesteps: * ![Effect of timesteps](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Convolutional_Pose_Machines__timesteps.jpg?raw=true "Effect of timesteps") * They beat existing methods on the datasets MPII, LSP and FLIC. * Applying a loss function after each step (instead of only once after the last step) improved their results and reduced problems related to vanishing gradients. * The effective receptive field size of each step had a significant influence on the results. They increased it to up to 300px (about 80% of the image size) and saw continuous improvements in accuracy. * ![Receptive field size effect](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Convolutional_Pose_Machines__rf_size.jpg?raw=true "Receptive field size effect") 
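The idea of applying a loss after every step of the Convolutional Pose Machine, instead of only after the last one, can be sketched as a sum of per-step mean squared errors. This is a simplified illustration with hypothetical names. In particular it uses the same target maps for every step, whereas the notes above say the first step uses targets that mark all people in the patch.

```python
def mse(predicted_map, target_map):
    """Mean squared error between two 2D maps given as lists of rows."""
    diffs = [
        (p - t) ** 2
        for prow, trow in zip(predicted_map, target_map)
        for p, t in zip(prow, trow)
    ]
    return sum(diffs) / len(diffs)


def intermediate_supervision_loss(stage_predictions, target_maps):
    """Sum the per-keypoint MSE over all steps.

    stage_predictions[s][k] is the predicted map for keypoint k after step s.
    target_maps[k] is the expected map for keypoint k (simplification: the
    same targets are used for every step).
    """
    total = 0.0
    for stage in stage_predictions:
        for predicted_map, target_map in zip(stage, target_maps):
            total += mse(predicted_map, target_map)
    return total
```

Since every step contributes its own term, early layers receive a gradient signal directly from their own prediction and not only through all later steps.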
* They suggest a single architecture that tries to solve the following tasks: * Face localization ("Where are faces in the image?") * Face landmark localization ("For a given face, where are its landmarks, e.g. eyes, nose and mouth?") * Face landmark visibility estimation ("For a given face, which of its landmarks are actually visible and which of them are occluded by other objects/people?") * Face roll, pitch and yaw estimation ("For a given face, what is its rotation on the x/y/zaxis?") * Face gender estimation ("For a given face, which gender does the person have?") ### How * *Pretraining the base model* * They start with a basic model following the architecture of AlexNet. * They train that model to classify whether the input images are faces or not faces. * They then remove the fully connected layers, leaving only the convolutional layers. * *Locating bounding boxes of face candidates* * They then use a [selective search and segmentation algorithm](https://www.robots.ox.ac.uk/~vgg/rg/papers/sande_iccv11.pdf) on images to extract bounding boxes of objects. * Each bounding box is considered a possible face. * Each bounding box is rescaled to 227x227. * *Feature extraction per face candidate* * They feed each bounding box through the above mentioned pretrained network. * They extract the activations of the network from the layers `max1` (27x27x96), `conv3` (13x13x384) and `pool5` (6x6x256). * They apply to the first two extracted tensors (from max1, conv3) convolutions so that their tensor shapes are reduced to 6x6xC. * They concatenate the three tensors to a 6x6x768 tensor. * They apply a 1x1 convolution to that tensor to reduce it to 6x6x192. * They feed the result through a fully connected layer resulting in 3072dimensional vectors (per face candidate). * *Classification and regression* * They feed each 3072dimensional vector through 5 separate networks: 1. Detection: Does the bounding box contain a face or no face. (2 outputs, i.e. yes/no) 2. 
Landmark Localization: What are the coordinates of landmark features (e.g. mouth, nose, ...). (21 landmarks, each 2 values for x/y = 42 outputs total) 3. Landmark Visibility: Which landmarks are visible. (21 yes/no outputs) 4. Pose estimation: Roll, pitch, yaw of the face. (3 outputs) 5. Gender estimation: Male/female face. (2 outputs) * Each of these networks contains a single fully connected layer with 512 nodes, followed by the output layer with the above mentioned number of nodes. * *Architecture Visualization*: * ![Architecture](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/HyperFace__architecture.jpg?raw=true "Architecture") * *Training* * The base model is trained once (see above). * The feature extraction layers and the five classification/regression networks are trained afterwards (jointly). * The loss functions for the five networks are: 1. Detection: BCE (binary cross-entropy). Detected bounding boxes that have an overlap `>=0.5` with an annotated face are considered positive samples, bounding boxes with overlap `<0.35` are considered negative samples, everything in between is ignored. 2. Landmark localization: Roughly MSE (mean squared error), with some weighting for visibility. Only bounding boxes with overlap `>0.35` are considered. Coordinates are normalized with respect to the bounding box's center, width and height. 3. Landmark visibility: MSE (predicted visibility factor vs. expected visibility factor). Only for bounding boxes with overlap `>0.35`. 4. Pose estimation: MSE. 5. Gender estimation: BCE. * *Testing* * They use two postprocessing methods for detected faces: * Iterative Region Proposals: * They localize landmarks per face region. * Then they compute a more appropriate face bounding box based on the localized landmarks. * They feed that new bounding box through the network. * They compute the face score (face / not face, i.e. number between 0 and 1) for both bounding boxes and choose the one with the higher score.
* This shrinks down bounding boxes that turned out to be too big. * The method visualized: * ![IRP](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/HyperFace__irp.jpg?raw=true "IRP") * Landmarks-based Non-Maximum Suppression: * When multiple detected face bounding boxes overlap, one has to choose which of them to keep. * A method to do that is to only keep the bounding box with the highest face score. * They instead use a median-of-k method. * Their steps are: 1. Reduce every box in size so that it is a bounding box around the localized landmarks. 2. For every box, find all bounding boxes with a certain amount of overlap. 3. Among these bounding boxes, select the `k` ones with highest face score. 4. Based on these boxes, create a new box whose size is derived from the median coordinates of the landmarks. 5. Compute the median values for landmark coordinates, landmark visibility, gender, pose and use it as the respective values for the new box. ### Results * Example results: * ![Example results](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/HyperFace__example_results.jpg?raw=true "Example results") * They test on AFW, AFLW, PASCAL, FDDB, CelebA. * They achieve the best mean average precision values on PASCAL and AFW (compared to selected competitors). * AFW results visualized: * ![AFW](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/HyperFace__afw.jpg?raw=true "AFW") * Their approach achieves good performance on FDDB. It has some problems with small and/or blurry faces. * If the feature fusion is removed from their approach (i.e. extracting features only from one fully connected layer at the end of the base network instead of merging feature maps from different convolutional layers), the accuracy of the predictions goes down. * Their architecture ends in 5 shallow networks and shares many layers before them.
If instead these networks share no or few layers, the accuracy of the predictions goes down. * The postprocessing of bounding boxes (via Iterative Region Proposals and Landmarks-based Non-Maximum Suppression) has a quite significant influence on the performance. * Processing time per image is 3s, of which 2s is the selective search algorithm (for the bounding boxes).
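The overlap thresholds that HyperFace uses to pick training samples for the detection loss (`>=0.5` positive, `<0.35` negative, ignored in between) can be sketched as follows. The notes above only say "overlap", so treating it as intersection over union (IoU) is an assumption of this example, as are the `(x1, y1, x2, y2)` box format and the function names.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def detection_label(candidate, annotated_faces):
    """Assign a candidate box to the positive, negative or ignored set."""
    best = max((iou(candidate, face) for face in annotated_faces), default=0.0)
    if best >= 0.5:
        return "positive"
    if best < 0.35:
        return "negative"
    # Overlap in [0.35, 0.5): not used for the detection loss.
    return "ignored"
```

The gap between the two thresholds keeps ambiguous, partially overlapping candidates out of the detection loss, while the landmark and visibility losses described above still use boxes with overlap above 0.35.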
* When using pretrained networks (like VGG) to solve tasks, one has to use features generated by these networks. * These features come from specific layers, e.g. from the fully connected layers at the end of the network. * They test whether the features from fully connected layers or from the last convolutional layer are better suited for face attribute prediction. ### How * Base networks * They use standard architectures for their test networks, specifically the architectures of FaceNet and VGG (very deep version). * They modify these architectures to both use PReLUs. * They do not use the pretrained weights, instead they train the networks on their own. * They train them on the WebFace dataset (350k images, 10k different identities) to classify the identity of the shown person. * Attribute prediction * After training of the base networks, they train a separate SVM to predict attributes of faces. * The datasets used for this step are CelebA (100k images, 10k identities) and LFWA (13k images, 6k identities). * Each image in these datasets is annotated with 40 binary face attributes. * Examples for attributes: Eyeglasses, bushy eyebrows, big lips, ... * The features for the SVM are extracted from the base networks (i.e. feed forward a face through the network, then take the activations of a specific layer). * The following features are tested: * FC2: Activations of the second fully connected layer of the base network. * FC1: As FC2, but the first fully connected layer. * Spat 3x3: Activations of the last convolutional layer, maxpooled so that their widths and heights are both 3 (i.e. shape Cx3x3). * Spat 1x1: Same as "Spat 3x3", but maxpooled to Cx1x1. ### Results * The SVMs trained on "Spat 1x1" performed overall worst, the ones trained on "Spat 3x3" performed best. * The accuracy order was roughly: `Spat 3x3 > FC1 > FC2 > Spat 1x1`. * This effect was consistent for both networks (VGG, FaceNet) and for other training datasets as well. 
* FC2 performed particularly badly for the "blurry" attribute (most likely because that was unimportant to the classification task). * Accuracy comparison per attribute: * ![Comparison](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Face_Attribute_Prediction_Using_OfftheShelf_CNN_Features__comparison.png?raw=true "Comparison") * The conclusion is that when using pretrained networks one should not only try the last fully connected layer. Many characteristics of the input image might not appear any more in that layer (and later ones in general) as they were unimportant to the classification task.
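The "Spat 3x3" and "Spat 1x1" features above are max-pooled versions of the last convolutional layer. A minimal sketch of such pooling for a single channel is shown below. The bin boundaries follow the usual adaptive-pooling convention, which is an assumption here, since the notes do not say how the pooling regions were chosen.

```python
def adaptive_max_pool(channel, out_size):
    """Max-pool one HxW channel (list of rows) down to out_size x out_size.

    Assumes H >= out_size and W >= out_size. Bins may overlap by one
    row/column when H or W is not divisible by out_size.
    """
    height, width = len(channel), len(channel[0])
    pooled = []
    for i in range(out_size):
        r0 = (i * height) // out_size
        r1 = ((i + 1) * height + out_size - 1) // out_size  # ceil division
        row = []
        for j in range(out_size):
            c0 = (j * width) // out_size
            c1 = ((j + 1) * width + out_size - 1) // out_size
            row.append(max(channel[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled
```

Pooling to 3x3 keeps a coarse notion of where a feature fired (e.g. top of the face vs. bottom), while pooling to 1x1 discards all location information, which may be one reason for the accuracy gap reported above.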
* They describe a model to locate faces in images. * Their model uses information from suspected face regions *and* from the corresponding suspected body regions to classify whether a region contains a face. * The intuition is that seeing the region around the face (specifically where the body should be) can help in estimating whether a suspected face is really a face (e.g. it might also be part of a painting, statue or doll). ### How * Their whole model is called "CMS-RCNN" (Contextual Multi-Scale Region-CNN). * It is based on the "Faster R-CNN" architecture. * It uses the VGG network. * Subparts of their model are: MS-RPN, CMS-CNN. * MS-RPN finds candidate face regions. CMS-CNN refines their bounding boxes and classifies them (face / not face). * **MS-RPN** (Multi-Scale Region Proposal Network) * "Looks" at the feature maps of the network (VGG) at multiple scales (i.e. before/after pooling layers) and suggests regions for possible faces. * Steps: * Feed an image through the VGG network. * Extract the feature maps of the three last convolutions that are before a pooling layer. * Pool these feature maps so that they have the same heights and widths. * Apply L2 normalization to each feature map so that they all have the same scale. * Apply a 1x1 convolution to merge them to one feature map. * Regress face bounding boxes from that feature map according to the Faster R-CNN technique. * **CMS-CNN** (Contextual Multi-Scale CNN): * "Looks" at feature maps of face candidates found by MS-RPN and classifies whether these regions contain faces. * It also uses the same multi-scale technique (i.e. take feature maps from convs before pooling layers). * It uses some area around these face regions as additional information (suspected regions of bodies). * Steps: * Receive face candidate regions from MS-RPN. * Do per candidate region: * Calculate the suspected coordinates of the body (only based on the x/y-position and size of the face region, i.e. not learned).
* Extract the feature maps of the *face* region (at multiple scales) and apply RoIPooling to it (i.e. convert to a fixed height and width). * Extract the feature maps of the *body* region (at multiple scales) and apply RoIPooling to it (i.e. convert to a fixed height and width). * L2normalize each feature map. * Concatenate the (RoIpooled and normalized) feature maps of the face (at multiple scales) with each other (creates one tensor). * Concatenate the (RoIpooled and normalized) feature maps of the body (at multiple scales) with each other (creates another tensor). * Apply a 1x1 convolution to the face tensor. * Apply a 1x1 convolution to the body tensor. * Apply two fully connected layers to the face tensor, creating a vector. * Apply two fully connected layers to the body tensor, creating a vector. * Concatenate both vectors. * Based on that vector, make a classification of whether it is really a face. * Based on that vector, make a regression of the face's final bounding box coordinates and dimensions. * Note: They use in both networks the multiscale approach in order to be able to find small or tiny faces. Otherwise, after pooling these small faces would be hard or impossible to detect. ### Results * Adding context to the classification (i.e. the body regions) empirically improves the results. * Their model achieves the highest recall rate on FDDB compared to other models. However, it has lower recall if only very few false positives are accepted. * FDDB ROC curves (theirs is bold red): * ![FDDB results](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/CMSRCNN__fddb.jpg?raw=true "FDDB results") * Example results on FDDB: * ![FDDB examples](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/CMSRCNN__examples.jpg?raw=true "FDDB examples") 
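The reason for L2-normalizing feature maps from different scales before merging them can be shown with a toy example. The sketch below normalizes each flattened feature map as a whole and then concatenates them. That granularity is a simplifying assumption (the paper may normalize differently, e.g. per spatial position, and may add a learned scale), and the function names are hypothetical.

```python
import math


def l2_normalize(features, eps=1e-12):
    """Scale a flattened feature map to unit L2 norm.

    `eps` is an added guard against division by zero.
    """
    norm = math.sqrt(sum(v * v for v in features))
    return [v / (norm + eps) for v in features]


def fuse_scales(feature_maps):
    """L2-normalize each scale's features, then concatenate them."""
    fused = []
    for features in feature_maps:
        fused.extend(l2_normalize(features))
    return fused
```

Without the normalization, a layer with large activations (here `[0, 50]`) would dominate one with small activations (here `[3, 4]`) after concatenation. After normalization both contribute vectors of length 1.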
* PixelRNN * PixelRNNs generate new images pixel by pixel (and row by row) via LSTMs (or other RNNs). * Each pixel is therefore conditioned on the previously generated pixels. * Training of PixelRNNs is slow due to the RNN-architecture (hard to parallelize). * Previously PixelCNNs have been suggested, which use masked convolutions during training (instead of RNNs), but their image quality was worse. * They suggest changes to PixelCNNs that improve the quality of the generated images (while still keeping them faster than RNNs). ### How * PixelRNNs split up the distribution `p(image)` into many conditional probabilities, one per pixel, each conditioned on all previous pixels: `p(image) = <product> p(pixel i | pixel 1, pixel 2, ..., pixel i-1)`. * PixelCNNs implement that using convolutions, which are faster to train than RNNs. * These convolutions use masked filters, i.e. the center weight and also all weights right and/or below the center pixel are `0` (because they are current/future values and we only want to condition on the past). * In most generative models, several layers are stacked, ultimately ending in three float values per pixel (RGB images, one value for grayscale images). PixelRNNs (including this implementation) traditionally end in a softmax over 256 values per pixel and channel (so `3*256` per RGB pixel). * The following image shows the application of such a convolution with the softmax output (left) and the mask for a filter (right): * ![Masked convolution](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Conditional_Image_Generation_with_PixelCNN_Decoders__masked_convolution.png?raw=true "Masked convolution") * Blind spot * Using the mask on each convolutional filter effectively converts them into non-squared shapes (the green values in the image). * Advantage: Using such non-squared convolutions prevents future values from leaking into present values. * Disadvantage: Using such non-squared convolutions creates blind spots, i.e.
for each pixel, some past values (diagonally top-right from it) cannot influence the value of that pixel. * ![Blind spot](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Conditional_Image_Generation_with_PixelCNN_Decoders__blind_spot.png?raw=true "Blind Spot") * They combine horizontal (1xN) and vertical (Nx1) convolutions to prevent that. * Gated convolutions * PixelRNNs via LSTMs so far created visually better images than PixelCNNs. * They assume that one advantage of LSTMs is that they (also) have multiplicative gates, while stacked convolutional layers only operate with summations. * They alleviate that problem by adding gates to their convolutions: * Equation: `output image = tanh(weights_1 * image) <elementwise product> sigmoid(weights_2 * image)` * `*` is the convolutional operator. * `tanh(weights_1 * image)` is a classical convolution with tanh activation function. * `sigmoid(weights_2 * image)` are the gate values (0 = gate closed, 1 = gate open). * `weights_1` and `weights_2` are learned. * Conditional PixelCNNs * When generating images, they do not only want to condition on the previous values, but also on a latent vector `h` that describes the image to generate. * The new image distribution becomes: `p(image) = <product> p(pixel i | pixel 1, pixel 2, ..., pixel i-1, h)`. * To implement that, they simply modify the previously mentioned gated convolution, adding `h` to it: * Equation: `output image = tanh(weights_1 * image + weights_2 . h) <elementwise product> sigmoid(weights_3 * image + weights_4 . h)` * `.` denotes here the matrix-vector multiplication. * PixelCNN Autoencoder * The decoder in a standard autoencoder can be replaced by a PixelCNN, creating a PixelCNN-Autoencoder. ### Results * They achieve similar NLL-results as PixelRNN on CIFAR-10 and ImageNet, while training about twice as fast. * Here, "fast" means that they used 32 GPUs for 60 hours. * Using Conditional PixelCNNs on ImageNet (i.e.
adding class information to each convolution) did not improve the NLL-score, but it did improve the image quality. * ![ImageNet](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Conditional_Image_Generation_with_PixelCNN_Decoders__imagenet.png?raw=true "ImageNet") * They use a different neural network to create embeddings of human faces. Then they generate new faces based on these embeddings via PixelCNN. * ![Portraits](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Conditional_Image_Generation_with_PixelCNN_Decoders__portraits.png?raw=true "Portraits") * Their PixelCNN-Autoencoder generates significantly sharper (i.e. less blurry) images than a "normal" autoencoder.
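Two building blocks from the PixelCNN notes above are small enough to write down directly: the filter mask that zeroes the center weight and everything to the right of or below it, and the gated activation `tanh(a) * sigmoid(b)`. The sketch below works on plain Python numbers and lists. It assumes an odd filter size and takes the two convolution responses as inputs instead of computing them, and the function names are made up for this example.

```python
import math


def causal_mask(size):
    """Mask for a size x size filter (size assumed odd).

    1 = weight may be used (rows above the center, or left of the center in
    the center row), 0 = current/future position that must not be seen.
    """
    center = size // 2
    return [
        [1 if (row < center or (row == center and col < center)) else 0
         for col in range(size)]
        for row in range(size)
    ]


def gated_activation(filter_response, gate_response):
    """tanh(filter_response) multiplied by sigmoid(gate_response)."""
    gate = 1.0 / (1.0 + math.exp(-gate_response))  # 0 = closed, 1 = open
    return math.tanh(filter_response) * gate
```

`causal_mask(3)` gives `[[1, 1, 1], [1, 0, 0], [0, 0, 0]]`: the three pixels above and the one pixel to the left of the center are visible, everything else is masked out.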
* Usually GANs transform a noise vector `z` into images. `z` might be sampled from a normal or uniform distribution. * The effect of this is that the components in `z` are deeply entangled. * Changing single components has hardly any influence on the generated images. One has to change multiple components to affect the image. * The components end up not being interpretable. Ideally one would like to have meaningful components, e.g. for human faces one that controls the hair length and a categorical one that controls the eye color. * They suggest a change to GANs based on Mutual Information, which leads to interpretable components. * E.g. for MNIST a component that controls the stroke thickness and a categorical component that controls the digit identity (1, 2, 3, ...). * These components are learned in a (mostly) unsupervised fashion. ### How * The latent code `c` * "Normal" GANs parameterize the generator as `G(z)`, i.e. G receives a noise vector and transforms it into an image. * This is changed to `G(z, c)`, i.e. G now receives a noise vector `z` and a latent code `c` and transforms both into an image. * `c` can contain multiple variables following different distributions, e.g. in MNIST a categorical variable for the digit identity and a gaussian one for the stroke thickness. * Mutual Information * If using a latent code via `G(z, c)`, nothing forces the generator to actually use `c`. It can easily ignore it and just degenerate to `G(z)`. * To prevent that, they force G to generate images `x` in a way that `c` must be recoverable. So, if you have an image `x` you must be able to reliably tell which latent code `c` it has, which means that G must use `c` in a meaningful way. * This relationship can be expressed with mutual information, i.e. the mutual information between `x` and `c` must be high. * The mutual information between two variables X and Y is defined as `I(X; Y) = entropy(X) - entropy(X|Y) = entropy(Y) - entropy(Y|X)`.
* If the mutual information between X and Y is high, then knowing Y helps you to decently predict the value of X (and the other way round). * If the mutual information between X and Y is low, then knowing Y doesn't tell you much about the value of X (and the other way round). * The new GAN loss becomes `old loss - lambda * I(G(z, c); c)`, i.e. the higher the mutual information, the lower the result of the loss function. * Variational Mutual Information Maximization * In order to maximize `I(G(z, c); c)`, one has to know the distribution `P(c|x)` (from image to latent code), which however is unknown. * So instead they create `Q(c|x)`, which is an approximation of `P(c|x)`. * `I(G(z, c); c)` is then computed using a lower bound maximization, similar to the one in variational autoencoders (called "Variational Information Maximization", hence the name "InfoGAN"). * Basic equation: `LowerBoundOfMutualInformation(G, Q) = E[log Q(c|x)] + H(c) <= I(G(z, c); c)` * `c` is the latent code. * `x` is the generated image. * `H(c)` is the entropy of the latent codes (constant throughout the optimization). * Optimization w.r.t. Q is done directly. * Optimization w.r.t. G is done via the reparameterization trick. * If `Q(c|x)` approximates `P(c|x)` *perfectly*, the lower bound becomes the mutual information ("the lower bound becomes tight"). * In practice, `Q(c|x)` is implemented as a neural network. Both Q and D have to process the generated images, which means that they can share many convolutional layers, significantly reducing the extra cost of training Q. ### Results * MNIST * They use for `c` one categorical variable (10 values) and two continuous ones (uniform between -1 and +1). * InfoGAN learns to associate the categorical one with the digit identity and the continuous ones with rotation and width. * Applying `Q(c|x)` to an image and then classifying only on the categorical variable (i.e. fully unsupervised) yields 95% accuracy.
* Sampling new images with exaggerated continuous variables in the range `[-2,+2]` yields sound images (i.e. the network generalizes well). * ![MNIST examples](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/InfoGAN__mnist.png?raw=true "MNIST examples") * 3D face images * InfoGAN learns to represent the faces via pose, elevation, lighting. * They used five uniform variables for `c`. (So two of them apparently weren't associated with anything sensible? They are not mentioned.) * 3D chair images * InfoGAN learns to represent the chairs via identity (categorical) and rotation or width (apparently they did two experiments). * They used one categorical variable (four values) and one continuous variable (uniform `[-1, +1]`). * SVHN * InfoGAN learns to represent lighting and to spot the center digit. * They used four categorical variables (10 values each) and two continuous variables (uniform `[-1, +1]`). (Again, a few variables were apparently not associated with anything sensible?) * CelebA * InfoGAN learns to represent pose, presence of sunglasses (not perfectly), hair style and emotion (in the sense of "smiling or not smiling"). * They used 10 categorical variables (10 values each). (Again, a few variables were apparently not associated with anything sensible?) * ![CelebA examples](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/InfoGAN__celeba.png?raw=true "CelebA examples")
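The InfoGAN lower bound `E[log Q(c|x)] + H(c)` can be checked numerically for a single categorical code. The sketch below estimates the expectation from a batch of sampled codes and the probabilities that a hypothetical Q network assigned to them. The function names and the list-based data format are assumptions made for this illustration.

```python
import math


def entropy(probs):
    """Entropy H(c) of a categorical distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mi_lower_bound(codes, q_probs, prior):
    """Monte Carlo estimate of E[log Q(c|x)] + H(c).

    codes[i]   : index of the latent code that was fed into G for sample i.
    q_probs[i] : Q's predicted distribution over codes for the image G made.
    prior      : the distribution the codes were sampled from.
    """
    avg_log_q = sum(math.log(q[c]) for c, q in zip(codes, q_probs)) / len(codes)
    return avg_log_q + entropy(prior)
```

With a uniform prior over four codes, a Q that always outputs the uniform distribution gives a bound of 0 (the image tells it nothing about `c`), while a Q that always recovers the correct code with probability 1 gives `log(4)`, which is the maximum possible mutual information for that prior.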
* They suggest some small changes to the GAN training scheme that lead to visually improved results. * They suggest a new scoring method to compare the results of different GAN models with each other. ### How * Feature Matching * Usually G would be trained to mislead D as often as possible, i.e. to maximize D's output. * Now they train G to minimize the feature distance between real and fake images. I.e. they do: 1. Pick a layer $l$ from D. 2. Forward real images through D and extract the features from layer $l$. 3. Forward fake images through D and extract the features from layer $l$. 4. Compute the squared euclidean distance between the real and fake features and backpropagate. * Minibatch discrimination * They allow D to look at multiple images in the same minibatch. * That is, they feed the features (of each image) extracted by an intermediate layer of D through a linear operation, resulting in a matrix per image. * They then compute the L1-distances between these matrices. * They then let D make its judgement (fake/real image) based on the features extracted from the image and these distances. * They add this mechanism so that the diversity of images generated by G increases (which should also prevent collapses). * Historical averaging * They add a penalty term that punishes weights which are rather far away from their historical average values. * I.e. the cost is `distance(current parameters, average of parameters over the last t batches)`. * They argue that this can help the network to find equilibria that normal gradient descent would not find. * One-sided label smoothing * Usually one would use the labels 0 (image is fake) and 1 (image is real). * Using smoother labels (0.1 and 0.9) seems to make networks more resistant to adversarial examples. * So they smooth the labels of real images (apparently to 0.9?). * Smoothing the labels of fake images would lead to (mathematical) problems in some cases, so they keep these at 0.
* Virtual Batch Normalization (VBN) * Usually BN normalizes each example with respect to the other examples in the same batch. * They instead normalize each example with respect to the examples in a reference batch, which was picked once at the start of the training. * VBN is intended to reduce the dependence of each example on the other examples in the batch. * VBN is computationally expensive, because it requires forwarding of two minibatches. * They use VBN for their G. * Inception Scoring * They introduce a new scoring method for GAN results. * Their method is based on feeding the generated images through another network, here they use Inception. * For an image `x` and predicted classes `y` (softmax output of Inception): * They argue that they want `p(y|x)` to have low entropy, i.e. the model should be rather certain of seeing a class (or few classes) in the image. * They argue that they want `p(y)` to have high entropy, i.e. the predicted classes (and therefore image contents) should have high diversity. (This seems like something that is quite a bit dependent on the used dataset?) * They combine both measurements to the final score of `exp(KL(p(y|x) || p(y))) = exp( <mean over images xi> <sum over classes y> p(y|xi) * (log(p(y|xi)) - log(p(y))) )`. * `p(y)` can be approximated as the mean of the softmax outputs over many examples. * Relevant python code that they use (where `part` seems to be of shape `(batch size, number of classes)`, i.e. the softmax outputs): `kl = part * (np.log(part) - np.log(np.expand_dims(np.mean(part, 0), 0))); kl = np.mean(np.sum(kl, 1)); scores.append(np.exp(kl));` * They average this score over 50,000 generated images. * Semi-supervised Learning * For a dataset with K classes they extend D by K outputs (leading to K+1 outputs total). * They then optimize two loss functions jointly: * Unsupervised loss: The classic GAN loss, i.e. D has to predict the fake/real output correctly. (The other outputs seem to not influence this loss.) 
* Supervised loss: D must correctly predict the image's class label, if it happens to be a real image and if it was annotated with a class. * They note that training G with feature matching produces the best results for semi-supervised classification. * They note that training G with minibatch discrimination produces significantly worse results for semi-supervised classification. (But visually the samples look better.) * They note that using semi-supervised learning overall results in higher image quality than not using it. They speculate that this has to do with the class labels containing information about image statistics that are important to humans. ### Results * MNIST * They use weight normalization and white noise in D. * Samples of high visual quality when using minibatch discrimination with semi-supervised learning. * Very good results in semi-supervised learning when using feature matching. * Using feature matching decreases visual quality of generated images, but improves results of semi-supervised learning. * CIFAR-10 * D: 9-layer CNN with dropout, weight normalization. * G: 4-layer CNN with batch normalization (so no VBN?). * Visually very good generated samples when using minibatch discrimination with semi-supervised learning. (Probably new record quality.) * Note: No comparison with nearest neighbours from the dataset. * When using feature matching the results are visually not as good. * Again, very good results in semi-supervised learning when using feature matching. * SVHN * Same setup as in CIFAR-10 and similar results. * ImageNet * They tried to generate 128x128 images and compared to DCGAN. * They improved from "total garbage" to "garbage" (they now hit some textures, but structure is still wildly off). 
![CIFAR-10 Examples](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Improved_Techniques_for_Training_GANs__cifar.jpg?raw=true "CIFAR-10 Examples") *Generated CIFAR-10-like images (with minibatch discrimination and semi-supervised learning).* 
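
The Inception score described in this summary can be sketched without NumPy as follows. This is a minimal illustration, not the authors' code: the function name `inception_score` and the plain list-of-lists input are assumptions, and the softmax outputs are expected to come from some classifier that is not part of the sketch.

```python
import math


def inception_score(probs):
    """Score for softmax outputs `probs`, one row p(y|x_i) per generated image."""
    n = len(probs)
    k = len(probs[0])
    # Marginal p(y): mean of the softmax outputs over all images.
    p_y = [sum(row[j] for row in probs) / n for j in range(k)]
    # KL(p(y|x_i) || p(y)) per image; terms with p = 0 contribute 0.
    kls = []
    for row in probs:
        kl = sum(p * (math.log(p) - math.log(q))
                 for p, q in zip(row, p_y) if p > 0)
        kls.append(kl)
    # exp of the mean KL divergence over all images.
    return math.exp(sum(kls) / n)
```

With identical, uncertain predictions for every image the score is 1 (its minimum). With perfectly confident predictions spread evenly over K classes it reaches K.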
* They suggest a new method to generate images which maximize the activation of a specific neuron in a (trained) target network (abbreviated with "**DNN**"). * E.g. if your DNN contains a neuron that is active whenever there is a car in an image, the method should generate images containing cars. * Such methods can be used to investigate what exactly a network has learned. * There are plenty of methods like this one. They usually differ from each other by using different *natural image priors*. * A natural image prior is a restriction on the generated images. * Such a prior pushes the generated images towards realistic looking ones. * Without such a prior it is easy to generate images that lead to high activations of specific neurons, but don't look realistic at all (e.g. they might look psychedelic or like white noise). * That's because the space of possible images is extremely high-dimensional and can therefore hardly be covered reliably by a single network. Note also that training datasets usually only show a very limited subset of all possible images. * Their work introduces a new natural image prior. ### How * Usually, if one wants to generate images that lead to high activations, the basic/naive method is to: 1. Start with a noise image, 2. Feed that image through DNN, 3. Compute an error that is high if the activation of the specified neuron is low (analogous for high activation), 4. Backpropagate the error through DNN, 5. Change the noise image according to the gradient, 6. Repeat. * So, the noise image is basically treated like weights in the network. * Their alternative method is based on a Generator network **G**. * That G is trained according to the method described in [Generating Images with Perceptual Similarity Metrics based on Deep Networks]. * Very rough outline of that method: * First, a pretrained network **E** is given (they picked CaffeNet, which is a variation of AlexNet). * G then has to learn to invert E, i.e. 
G receives per image the features extracted by a specific layer in E (e.g. the last fully connected layer before the output) and has to generate (recreate) the image from these features. * Their modified steps are: 1. *(New step)* Start with a noise vector, 2. *(New step)* Feed that vector through G resulting in an image, 3. *(Same)* Feed that image through DNN, 4. *(Same)* Compute an error that is low if the activation of the specified neuron is high (analogous for low activations), 5. *(Same)* Backpropagate the error through DNN, 6. *(Modified)* Change the noise *vector* according to the gradient, 7. *(Same)* Repeat. * Visualization of their architecture: * ![Architecture](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Synthesizing_the_preferred_inputs_for_neurons_in_neural_networks_via_deep_generator_networks__architecture.jpg?raw=true "Architecture") * Additionally they do: * Apply an L2 norm to the noise vector, which adds pressure to each component to take low values. They say that this improved the results. * Clip each component of the noise vector to a range `[0, a]`, which improved the results significantly. * The range starts at `0`, because the network (E) inverted by their Generator (G) is based on ReLUs. * `a` is derived from test images fed through E and set to 3 standard deviations of the mean activation of that component (recall that the "noise" vector mirrors a specific layer in E). * They argue that this clipping is similar to a prior on the noise vector components. That prior reflects likely values of the layer in E that is used for the noise vector. ### Results * Examples of generated images: * ![Examples](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Synthesizing_the_preferred_inputs_for_neurons_in_neural_networks_via_deep_generator_networks__examples.jpg?raw=true "Examples") * Early vs. late layers * For G they have to pick a specific layer from E that G has to invert. 
They found that using "later" layers (e.g. the fully connected layers at the end) produced images with more reasonable overall structure than using "early" layers (e.g. first convolutional layers). Early layers led to repeating structures. * Datasets and architectures * Both G and DNN have to be trained on datasets. * They found that these networks can actually be trained on different datasets, the results will still look good. * However, they found that the architectures of DNN and E should be similar to create the best looking images (though this might also be down to depth of the tested networks). * Verification that the prior can generate any image * They tested whether the generated images really show what the DNN neurons prefer and not what the Generator/prior prefers. * To do that, they retrained DNNs on images that were both directly from the dataset as well as images that were somehow modified. * Those modifications were: * Treated RGB images as if they were BGR (creating images with weird colors). * Copy-pasted areas in the images around (creating mosaics). * Blurred the images (with Gaussian blur). * The DNNs were then trained to classify the "normal" images into 1000 classes and the modified images into 1000 other classes (2000 total). * So at the end there were (in the same DNN) neurons reacting strongly to specific classes of unmodified images and other neurons that reacted strongly to specific classes of modified images. * When generating images to maximize activations of specific neurons, the Generator was able to create both modified and unmodified images. Though it seemed to have some trouble with blurring. * That shows that the generated images probably indeed show what the DNN has learned and not just what G has learned. * Uncanonical images * The method can sometimes generate uncanonical images (e.g. instead of a full dog just blobs of texture). * They found that this seems to be mostly the case when the dataset images have uncanonical pose, i.e. 
are very diverse/multimodal. 
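
The modified optimization loop from this summary (update the noise vector instead of the image, add an L2 penalty, clip to `[0, a]`) can be sketched on a toy problem. Everything here is a stand-in chosen for illustration: the "generator" is a plain matrix `G`, the "DNN neuron" is a weight vector `w`, and the names `optimize_code`, `activation`, `upper`, `lr` and `l2` are hypothetical. A real setup would backpropagate through two deep networks instead of using the closed-form gradient below.

```python
def activation(G, w, z):
    """Neuron activation for code z: w . (G @ z), with G and w as toy linear models."""
    image = [sum(G[i][j] * z[j] for j in range(len(z))) for i in range(len(G))]
    return sum(wi * xi for wi, xi in zip(w, image))


def optimize_code(G, w, z, upper, steps=100, lr=0.1, l2=0.01):
    """Gradient ascent on the code z with an L2 penalty and clipping to [0, upper[j]]."""
    # For this linear toy, d(activation)/dz_j = sum_i w[i] * G[i][j] (constant in z).
    grad_act = [sum(w[i] * G[i][j] for i in range(len(G))) for j in range(len(z))]
    z = list(z)
    for _ in range(steps):
        for j in range(len(z)):
            # Increase the activation, decrease the penalty l2 * z_j^2.
            z[j] += lr * (grad_act[j] - 2.0 * l2 * z[j])
            # Clip each component to its allowed range [0, a_j].
            z[j] = min(upper[j], max(0.0, z[j]))
    return z
```

In this toy the first component is pushed up until it hits its upper bound and the second is pushed down to zero, which is the role the clipping plays as a prior on the code.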
* They describe an architecture for deep CNNs that contains short and long paths. (Short = few convolutions between input and output, long = many convolutions between input and output) * They achieve comparable accuracy to residual networks, without using residuals. ### How * Basic principle: * They start with two branches. The left branch contains one convolutional layer, the right branch contains a subnetwork. * That subnetwork again contains a left branch (one convolutional layer) and a right branch (a subnetwork). * This creates a recursion. * At the last step of the recursion they simply insert two convolutional layers as the subnetwork. * Each pair of branches (left and right) is merged using a pairwise mean. (Result: One of the branches can be skipped or removed and the result after the merge will still be sound.) * Their recursive expansion rule (left) and architecture (middle and right) visualized: ![Architecture](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/FractalNet_UltraDeep_Networks_without_Residuals__architecture.png?raw=true "Architecture") * Blocks: * Each of the recursively generated networks is one block. * They chain five blocks in total to create the network that they use for their experiments. * After each block they add a max pooling layer. * Their first block uses 64 filters per convolutional layer, the second one 128, followed by 256, 512 and again 512. * Drop-path: * They randomly drop whole convolutional layers between merge layers. * They define two methods for that: * Local drop-path: Drops each input to each merge layer with a fixed probability, but at least one always survives. (See image, first three examples.) * Global drop-path: Drops convolutional layers so that only a single column (and thereby path) in the whole network survives. (See image, right.) 
* Visualization: ![Drop-path](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/FractalNet_UltraDeep_Networks_without_Residuals__drop_path.png?raw=true "Drop-path") ### Results * They test on CIFAR-10, CIFAR-100 and SVHN with no or mild (crops, flips) augmentation. * They add dropout at the start of each block (probabilities: 0%, 10%, 20%, 30%, 40%). * They use for 50% of the batches local drop-path at 15% and for the other 50% global drop-path. * They achieve comparable accuracy to ResNets (a bit behind them actually). * Note: The best ResNet that they compare to is "ResNet with Identity Mappings". They don't compare to Wide ResNets, even though they perform best. * If they use image augmentations, dropout and drop-path don't seem to provide much benefit (only small improvement). * If they extract the deepest column and test on that one alone, they achieve nearly the same performance as with the whole network. * They derive from that, that their fractal architecture is actually only really used to help that deepest column to learn anything. (Without shorter paths it would just learn nothing due to vanishing gradients.) 
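
The merge layer with local drop-path described above can be sketched as follows. This is a simplified illustration on plain Python lists: the function name `local_drop_path_join` and its parameters are hypothetical, and real branch outputs would be feature maps rather than short vectors.

```python
import random


def local_drop_path_join(inputs, drop_prob, rng=random):
    """Mean over the branch outputs that survive dropping (at least one always does).

    inputs: list of equally long vectors, one per branch entering the merge layer.
    """
    # Drop each input independently with probability drop_prob.
    keep = [rng.random() >= drop_prob for _ in inputs]
    if not any(keep):
        # Everything was dropped: revive one randomly chosen input.
        keep[rng.randrange(len(inputs))] = True
    survivors = [vec for vec, k in zip(inputs, keep) if k]
    # Element-wise mean over the surviving inputs only.
    return [sum(vals) / len(survivors) for vals in zip(*survivors)]
```

Because the merge is a mean over the surviving inputs, the output keeps the same scale no matter how many branches were dropped, which is why a removed branch still leaves a sound result.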
* They describe a convolutional network that takes in photos and returns where (on the planet) these photos were likely made. * The output is a distribution over locations around the world (so not just one single location). This can be useful in the case of ambiguous images. ### How * Basic architecture * They simply use the Inception architecture for their model. * They have 97M parameters. * Grid * The network uses a grid of cells over the planet. * For each photo and every grid cell it returns the likelihood that the photo was made within the region covered by the cell (simple softmax layer). * The naive way would be to use a regular grid around the planet (i.e. a grid in which all cells have the same size). * Possible disadvantages: * In places where lots of photos are taken you still have the same grid cell size as in places where barely any photos are taken. * Maps are often distorted towards the poles (countries are represented much larger than they really are). This will likely affect the grid cells too. * They instead use an adaptive grid pattern based on S2 cells. * S2 cells interpret the planet as a sphere and project a cube onto it. * The 6 sides of the cube are then partitioned using quad trees, creating the grid cells. * They don't use the same depth for all quad trees. Instead they subdivide them only if their leafs contain enough photos (based on their dataset of geolocated images). * They remove some cells for which their dataset does not contain enough images, e.g. cells on oceans. (They also remove these images from the dataset. They don't say how many images are affected by this.) * They end up with roughly 26k cells, some of them reaching the street level of major cities. 
* Visualization of their cells: ![S2 cells](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/PlaNet__S2.jpg?raw=true "S2 cells") * Training * For each example photo that they feed into the network, they set the correct grid cell to `1.0` and all other grid cells to `0.0`. * They train on a dataset of 126M images with Exif geolocation information. The images were collected from all over the web. * They used Adagrad. * They trained on 200 CPUs for 2.5 months. * Album network * For photo albums they develop variations of their network. * They do that because albums often contain images that are very hard to geolocate on their own, but much easier if the other images of the album are seen. * They use LSTMs for their album network. * The simplest one just iterates over every photo, applies their previously described model to it and extracts the last layer (before output) from that model. These vectors (one per image) are then fed into an LSTM, which is trained to predict (again) the grid cell location per image. * More complicated versions use multiple passes or are bidirectional LSTMs (to use the information from the last images to classify the first ones in the album). ### Results * They beat previous models (based on hand-engineered features or nearest neighbour methods) by a significant margin. * In a small experiment they can beat experienced humans in geoguessr.com. * Based on a dataset of 2.3M photos from Flickr, their method correctly predicts the country where the photo was made in 30% of all cases (top-1; top-5: about 50%). City-level accuracy is about 10% (top-1; top-5: about 18%). * Example predictions (using a coarser grid with 354 cells): ![Examples](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/PlaNet__examples.png?raw=true "Examples") * Using the LSTM technique for albums significantly improves prediction accuracy for these images. 
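
The idea of the adaptive grid (subdivide cells only where there are many photos, drop cells with too few) can be sketched with an ordinary quadtree. Note the simplifications: this works on a flat unit square instead of the six cube faces of S2, and the function name `adaptive_cells` as well as the thresholds `max_photos`, `min_photos` and `max_depth` are made up for illustration.

```python
def adaptive_cells(points, x0=0.0, y0=0.0, size=1.0,
                   max_photos=4, min_photos=1, max_depth=8):
    """Return leaf cells (x0, y0, size, photo_count) of an adaptive quadtree.

    Cells are half-open squares [x0, x0+size) x [y0, y0+size).
    """
    inside = [(x, y) for x, y in points
              if x0 <= x < x0 + size and y0 <= y < y0 + size]
    if len(inside) < min_photos:
        return []  # too few photos (e.g. ocean): drop the cell
    if len(inside) <= max_photos or max_depth == 0:
        return [(x0, y0, size, len(inside))]  # sparse enough: keep as a leaf
    # Dense cell: split into four children and recurse.
    half = size / 2.0
    cells = []
    for dx in (0.0, half):
        for dy in (0.0, half):
            cells += adaptive_cells(inside, x0 + dx, y0 + dy, half,
                                    max_photos, min_photos, max_depth - 1)
    return cells
```

Dense regions end up with many small cells and sparse regions with few large ones, so every output class of the softmax is backed by a reasonable number of training photos.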
* They suggest a new stochastic optimization method, similar to the existing SGD, Adagrad or RMSProp. * Stochastic optimization methods have to find parameters that minimize/maximize a stochastic function. * A function is stochastic (non-deterministic), if the same set of parameters can generate different results. E.g. the loss of different minibatches can differ, even when the parameters remain unchanged. Even for the same minibatch the results can change due to e.g. dropout. * Their method tends to converge faster to optimal parameters than the existing competitors. * Their method can deal with non-stationary distributions (similar to e.g. SGD, Adadelta, RMSProp). * Their method can deal with very sparse or noisy gradients (similar to e.g. Adagrad). ### How * Basic principle * Standard SGD just updates the parameters based on `parameters = parameters - learningRate * gradient`. * Adam operates similarly to that, but adds more "cleverness" to the rule. * It assumes that the gradient values have means and variances and tries to estimate these values. * Recall here that the function to optimize is stochastic, so there is some randomness in the gradients. * The mean is also called "the first moment". * The (uncentered) variance is also called "the second (raw) moment". * Then an update rule very similar to SGD would be `parameters = parameters - learningRate * means`. * They instead use the update rule `parameters = parameters - learningRate * means/sqrt(variances)`. * They call `means/sqrt(variances)` a 'Signal to Noise Ratio'. * Basically, if the variance of a specific parameter's gradient is high, it is pretty unclear how it should be changed. So we choose a small step size in the update rule via `learningRate * mean/sqrt(highValue)`. * If the variance is low, it is easier to predict how far to "move", so we choose a larger step size via `learningRate * mean/sqrt(lowValue)`. 
* Exponential moving averages * In order to approximate the mean and variance values you could simply save the last `T` gradients and then average the values. * That however is a pretty bad idea, because it can lead to high memory demands (e.g. for millions of parameters in CNNs). * A simple average also has the disadvantage, that it would completely ignore all gradients before `T` and weight all of the last `T` gradients identically. In reality, you might want to give more weight to the last couple of gradients. * Instead, they use an exponential moving average, which fixes both problems and simply updates the average at every timestep via the formula `avg = alpha * avg + (1 - alpha) * value`, where `value` is the newest observation. * Let the gradient at timestep (batch) `t` be `g`, then we can approximate the mean and variance values using: * `mean = beta1 * mean + (1 - beta1) * g` * `variance = beta2 * variance + (1 - beta2) * g^2`. * `beta1` and `beta2` are hyperparameters of the algorithm. Good values for them seem to be `beta1=0.9` and `beta2=0.999`. * At the start of the algorithm, `mean` and `variance` are initialized to zero vectors. * Bias correction * Initializing the `mean` and `variance` vectors to zero is an easy and logical step, but has the disadvantage that bias is introduced. * E.g. at the first timestep, the mean of the gradient would be `mean = beta1 * 0 + (1 - beta1) * g`, with `beta1=0.9` then: `mean = 0.1 * g`. So `0.1g`, not `g`. Both the mean and the variance are biased (towards 0). * This seems pretty harmless, but it can be shown that it lowers the convergence speed of the algorithm by quite a bit. * So to fix this they perform bias corrections of the mean and the variance: * `correctedMean = mean / (1 - beta1^t)` (where `t` is the timestep). * `correctedVariance = variance / (1 - beta2^t)`. * Both formulas are applied at every timestep after the exponential moving averages (they do not influence the next timestep). 
![Algorithm](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Adam__algorithm.png?raw=true "Algorithm") 
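
The pieces described above (moving averages, bias correction, update rule) can be put together in a short sketch. This is a minimal illustration on plain Python lists, not a reference implementation: the function name `adam_step` is hypothetical, and the small constant `eps` (to avoid division by zero) and the default `lr=0.001` come from the paper's algorithm box rather than from the notes above.

```python
import math


def adam_step(params, grads, mean, variance, t,
              lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based timestep. Returns (params, mean, variance)."""
    new_params, new_mean, new_var = [], [], []
    for p, g, m, v in zip(params, grads, mean, variance):
        # Exponential moving averages of the gradient and the squared gradient.
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        # Bias correction; only used for this update, not stored.
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_mean.append(m)
        new_var.append(v)
    return new_params, new_mean, new_var
```

At the first timestep the bias-corrected ratio equals `g / |g|`, so the parameter moves by almost exactly `lr` regardless of the gradient's scale. For example, starting at `1.0` with gradient `2.0` gives roughly `0.999`.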
* What * They describe a new architecture for GANs. * The architecture is based on letting the Generator (G) create images in multiple steps, similar to DRAW. * They also briefly suggest a method to compare the quality of the results of different generators with each other. * How * In a classic GAN one samples a noise vector `z`, feeds that into a Generator (`G`), which then generates an image `x`, which is then fed through the Discriminator (`D`) to estimate its quality. * Their method operates in basically the same way, but internally G is changed to generate images in multiple time steps. * Outline of how their G operates: * Time step 0: * Input: Empty image `delta C-1`, randomly sampled `z`. * Feed `delta C-1` through a number of downsampling convolutions to create a tensor. (Not very useful here, as the image is empty. More useful in later timesteps.) * Feed `z` through a number of upsampling convolutions to create a tensor (similar to DCGAN). * Concat the output of the previous two steps. * Feed that concatenation through a few more convolutions. * Output: `delta C0` (changes to apply to the empty starting canvas). * Time step 1 (and later): * Input: Previous change `delta C0`, randomly sampled `z` (can be the same as in step 0). * Feed `delta C0` through a number of downsampling convolutions to create a tensor. * Feed `z` through a number of upsampling convolutions to create a tensor (similar to DCGAN). * Concat the output of the previous two steps. * Feed that concatenation through a few more convolutions. * Output: `delta C1` (changes to apply to the canvas). * At the end, after all timesteps have been performed: * Create final output image by summing all the changes, i.e. `delta C0 + delta C1 + ...`, which basically means `empty start canvas + changes from time step 0 + changes from time step 1 + ...`. 
* Their architecture as an image: * ![Architecture](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Generating_Images_with_Recurrent_Adversarial_Networks__architecture.png?raw=true "Architecture") * Comparison measure * They suggest a new method to compare GAN results with each other. * They suggest to train pairs of G and D, e.g. for two pairs (G1, D1), (G2, D2). Then they let the pairs compete with each other. * To estimate the quality of D they suggest `r_test = errorRate(D1, testset) / errorRate(D2, testset)`. ("Which D is better at spotting that the test set images are real images?") * To estimate the quality of the generated samples they suggest `r_sample = errorRate(D1, images by G2) / errorRate(D2, images by G1)`. ("Which G is better at fooling an unknown D, i.e. possibly better at generating lifelike images?") * They suggest to estimate which G is better using r_sample and then to estimate how valid that result is using r_test. * Results * Generated images of churches, with timesteps 1 to 5: * ![Churches](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Generating_Images_with_Recurrent_Adversarial_Networks__churches.jpg?raw=true "Churches") * Overfitting * They saw no indication of overfitting in the sense of memorizing images from the training dataset. * They however saw some indication of G just interpolating between some good images and of G reusing small image patches in different images. * Randomness of noise vector `z`: * Sampling the noise vector once seems to be better than resampling it at every timestep. * Resampling it at every time step often led to very similar looking output images. 
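
The two ratios of the comparison measure can be computed as in the following sketch. The input format is an assumption made for illustration: each argument is a list of booleans, where `True` means the discriminator judged an image to be real. The function names `error_rate` and `comparison_ratios` are hypothetical.

```python
def error_rate(predicted_real, truly_real):
    """Fraction of predictions that disagree with the true label of the images."""
    wrong = sum(1 for p in predicted_real if p != truly_real)
    return wrong / len(predicted_real)


def comparison_ratios(d1_on_test, d2_on_test, d1_on_g2, d2_on_g1):
    """Return (r_test, r_sample) for two GAN pairs (G1, D1) and (G2, D2).

    Raises ZeroDivisionError if D2 makes no errors at all.
    """
    # Test set images are real, so predicting "fake" is an error.
    r_test = error_rate(d1_on_test, True) / error_rate(d2_on_test, True)
    # Generated images are fake, so predicting "real" is an error.
    r_sample = error_rate(d1_on_g2, False) / error_rate(d2_on_g1, False)
    return r_test, r_sample
```

Under these definitions, `r_sample > 1` means that D1 is fooled by G2 more often than D2 is fooled by G1. An `r_test` near 1 indicates that both discriminators are about equally good on real images, so the comparison is not skewed by one of them being weaker overall.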
* They suggest a new architecture for GANs. * Their architecture adds another Generator for a reverse branch (from images to noise vector `z`). * Their architecture takes some ideas from VAEs/variational neural nets. * Overall they can improve on the previous state of the art (DCGAN). ### How * Architecture * Usually, in GANs one feeds a noise vector `z` into a Generator (G), which then generates an image (`x`) from that noise. * They add a reverse branch (G2), in which another Generator takes a real image (`x`) and generates a noise vector `z` from that. * The noise vector can now be viewed as a latent space vector. * Instead of letting G2 generate *discrete* values for `z` (as it is usually done), they instead take the approach commonly used in VAEs and use *continuous* variables instead. * That is, if `z` represents `N` latent variables, they let G2 generate `N` means and `N` variances of Gaussian distributions, with each distribution representing one value of `z`. * So the model could e.g. represent something along the lines of "this face looks a lot like a female, but with very low probability could also be male". * Training * The Discriminator (D) is now trained on pairs of either `(real image, generated latent space vector)` or `(generated image, randomly sampled latent space vector)` and has to tell them apart from each other. * Both Generators are trained to maximally confuse D. * G1 (from `z` to `x`) confuses D maximally, if it generates new images that (a) look real and (b) fit well to the latent variables in `z` (e.g. if `z` says "image contains a cat", then the image should contain a cat). * G2 (from `x` to `z`) confuses D maximally, if it generates good latent variables `z` that fit to the image `x`. * Continuous variables * The variables in `z` follow Gaussian distributions, which makes the training more complicated, as you can't trivially backpropagate through Gaussians. 
* When training G1 (from `z` to `x`) the situation is easy: You draw a random `z` vector following a Gaussian distribution (`N(0, I)`). (This is basically the same as in "normal" GANs. They just often use uniform distributions instead.) * When training G2 (from `x` to `z`) the situation is a bit harder. * Here we need to use the reparameterization trick. * That roughly means, that G2 predicts the means and variances of the Gaussian variables in `z` and then we draw a sample of `z` according to exactly these means and variances. * That sample gives us discrete values for our backpropagation. * If we do that sampling often enough, we get a good approximation of the true gradient (of the continuous variables). (Monte Carlo approximation.) * Results * Images generated based on CelebA dataset: * ![CelebA samples](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Adversarially_Learned_Inference__celebasamples.png?raw=true "CelebA samples") * Left column per pair: Real image, right column per pair: reconstruction (`x -> z` via G2, then `z -> x` via G1) * ![CelebA reconstructions](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Adversarially_Learned_Inference__celebareconstructions.png?raw=true "CelebA reconstructions") * Reconstructions of SVHN, notice how the digits often stay the same, while the font changes: * ![SVHN reconstructions](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Adversarially_Learned_Inference__svhnreconstructions.png?raw=true "SVHN reconstructions") * CIFAR-10 samples, still lots of errors, but some quite correct: * ![CIFAR-10 samples](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Adversarially_Learned_Inference__cifar10samples.png?raw=true "CIFAR-10 samples") 
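
The reparameterization trick mentioned for G2 can be sketched in a few lines. The function name `sample_z` and the list-based interface are assumptions for illustration. In a real model `means` and `variances` would be network outputs and the arithmetic would run inside an autodiff framework.

```python
import math
import random


def sample_z(means, variances, rng=random):
    """Draw z_i = mean_i + sqrt(variance_i) * eps_i with eps_i ~ N(0, 1).

    Returns (z, eps). The randomness lives only in eps, so z is a
    deterministic, differentiable function of the means and variances.
    """
    eps = [rng.gauss(0.0, 1.0) for _ in means]
    z = [m + math.sqrt(v) * e for m, v, e in zip(means, variances, eps)]
    return z, eps
```

Since `eps` is sampled independently of the network outputs, the gradient of `z` with respect to a mean is 1 and with respect to a standard deviation is the sampled `eps`, which is what allows backpropagation through the sampling step.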
* They describe an architecture that merges classical convolutional networks and residual networks. * The architecture can (theoretically) learn anything that a classical convolutional network or a residual network can learn, as it contains both of them. * The architecture can (theoretically) learn how many convolutional layers it should use per residual block (up to the amount of convolutional layers in the whole network). ### How * Just like residual networks, they have "blocks". Each block contains convolutional layers. * Each block contains residual units and nonresidual units. * They have two "streams" of data in their network (just matrices generated by each block): * Residual stream: The residual blocks write to this stream (i.e. it's their output). * Transient stream: The nonresidual blocks write to this stream. * Residual and nonresidual layers receive *both* streams as input, but only write to *their* stream as output. * Their architecture visualized: ![Architecture](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Resnet_in_Resnet__architecture.png?raw=true "Architecture") * Because of this architecture, their model can learn the number of layers per residual block (though BN and ReLU might cause problems here?): ![Learning layercount](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Resnet_in_Resnet__learning_layercount.png?raw=true "Learning layercount") * The easiest way to implement this should be along the lines of the following (some of the visualized convolutions can be merged): * Input of size CxHxW (both streams, each C/2 planes) * Concat * Residual block: Apply C/2 convolutions to the C input planes, with shortcut addition afterwards. * Transient block: Apply C/2 convolutions to the C input planes. * Apply BN * Apply ReLU * Output of size CxHxW. * The whole operation can also be implemented with just a single convolutional layer, but then one has to make sure that some weights stay at zero. 
### Results * They test on CIFAR-10 and CIFAR-100. * They search for optimal hyperparameters (learning rate, optimizer, L2 penalty, initialization method, type of shortcut connection in residual blocks) using a grid search. * Their model improves upon a wide ResNet and an equivalent non-residual CNN by a good margin (CIFAR-10: 0.5-1%, CIFAR-100: 1-2%). 
* Autoencoders typically have some additional criterion that pushes them towards learning meaningful representations. * E.g. L1 penalty on the code layer (z), Dropout on z, Noise on z. * Often, representations with sparse activations are considered meaningful (so that each activation reflects a clear concept). * This paper introduces another technique that leads to sparsity. * They use a rank ordering on z. * The first (according to the ranking) activations have to do most of the reconstruction work of the data (i.e. image). ### How * Basic architecture: * They use an Autoencoder architecture: Input -> Encoder -> z -> Decoder -> Output. * Their encoder and decoder seem to be empty, i.e. z is the only hidden layer in the network. * Their output is not just one image (or whatever is encoded), instead they generate one for every unit in layer z. * Then they order these outputs based on the activation of the units in z (rank ordering), i.e. the output of the unit with the highest activation is placed in the first position, the output of the unit with the 2nd highest activation gets the 2nd position and so on. * They then generate the final output image based on a cumulative sum. So for three reconstructed output images `I1, I2, I3` (rank ordered that way) they would compute `final image = I1 + (I1+I2) + (I1+I2+I3)`. * They then compute the error based on that reconstruction (`reconstruction - input image`) and backpropagate it. * Cumulative sum: * Using the cumulative sum puts most optimization pressure on units with high activation, as they have the largest influence on the reconstruction error. * The cumulative sum is best optimized by letting few units have high activations and generate most of the output (correctly). All the other units have ideally low to zero activations and low or no influence on the output. (Though if the output generated by the first units is wrong, you should then end up with an extremely high cumulative error sum...) 
* So their `z` coding should end up with few but high activations, i.e. it should become very sparse. * The cumulative generates an individual error per output, while an ordinary sum generates the same error for every output. They argue that this "blurs" the error less. * To avoid blow ups in their network they use TReLUs, which saturate below 0 and above 1, i.e. `min(1, max(0, input))`. * They use a custom derivative function for the TReLUs, which is dependent on both the input value of the unit and its gradient. Basically, if the input is `>1` (saturated) and the error is high, then the derivative pushes the weight down, so that the input gets into the unsaturated regime. Similarly for input values `<0` (pushed up). If the input value is between 0 and 1 and/or the error is low, then nothing is changed. * They argue that the algorithmic complexity of the rank ordering should be low, due to sorts being `O(n log(n))`, where `n` is the number of hidden units in `z`. ### Results * They autoencode 7x7 patches from CIFAR10. * They get very sparse activations. * Training and test loss develop identically, i.e. no overfitting. 
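The rank ordering, the cumulative sum and the TReLU described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the per-unit outputs are simply passed in as an array, because the notes do not say how each unit produces its output.

```python
import numpy as np

def trelu(x):
    """Truncated ReLU: saturates below 0 and above 1, i.e. min(1, max(0, x))."""
    return np.minimum(1.0, np.maximum(0.0, x))

def rank_ordered_reconstruction(z, per_unit_outputs):
    """Combine per-unit outputs via rank ordering and a cumulative sum.

    z: (n,) activations of the code layer.
    per_unit_outputs: (n, d) one output per unit in z (assumed to be given).
    Returns the final reconstruction and the rank order that was used.
    """
    order = np.argsort(-z)                # unit with the highest activation first
    ranked = per_unit_outputs[order]      # I1, I2, I3, ...
    partial = np.cumsum(ranked, axis=0)   # I1, I1+I2, I1+I2+I3, ...
    return partial.sum(axis=0), order     # I1 + (I1+I2) + (I1+I2+I3) + ...
```

With `n` units, the output ranked first appears in all `n` partial sums, while the output ranked last appears only once. That is why the highest activations carry most of the reconstruction error.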
* The authors start with a standard ResNet architecture (i.e. a residual network as suggested in "Identity Mappings in Deep Residual Networks"). * Their residual block: ![Residual block](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Wide_Residual_Networks__residual_block.png?raw=true "Residual block") * Several residual blocks of 16 filters per conv-layer, followed by 32 and then 64 filters per conv-layer. * They empirically try to answer the following questions: * How many residual blocks are optimal? (Depth) * How many filters should be used per convolutional layer? (Width) * How many convolutional layers should be used per residual block? * Does dropout between the convolutional layers help? ### Results * *Layers per block and kernel sizes*: * Using 2 convolutional layers per residual block seems to perform best: ![Convs per block](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Wide_Residual_Networks__convs_per_block.png?raw=true "Convs per block") * Using 3x3 kernel sizes for both layers seems to perform best. * However, using 3 layers with kernel sizes 3x3, 1x1, 3x3 and then using fewer residual blocks performs nearly as well and decreases the required time per batch. * *Width and depth*: * Increasing the width considerably improves the test error. * They achieve the best results (on CIFAR-10) when decreasing the depth to 28 convolutional layers, with each having 10 times their normal width (i.e. 16\*10 filters, 32\*10 and 64\*10): ![Depth and width results](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Wide_Residual_Networks__depth_and_width.png?raw=true "Depth and width results") * They argue that their results show no evidence that would support the common theory that thin and deep networks somehow regularize better than wide and shallow(er) networks. * *Dropout*: * They use dropout with p=0.3 (CIFAR) and p=0.4 (SVHN). 
* On CIFAR-10 dropout doesn't seem to consistently improve test error. * On CIFAR-100 and SVHN dropout seems to lead to improvements that are either small (wide and shallower net, i.e. depth=28, width multiplier=10) or significant (ResNet-50). ![Dropout](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Wide_Residual_Networks__dropout.png?raw=true "Dropout") * They also observed oscillations in error (both train and test) during the training. Adding dropout decreased these oscillations. * *Computational efficiency*: * Applying few big convolutions is much more efficient on GPUs than applying many small ones sequentially. * Their network with the best test error is 1.6 times faster than ResNet-1001, despite having about 3 times more parameters. 
* The authors reevaluate the original residual design of neural networks. * They compare various architectures of residual units and actually find one that works quite a bit better. ### How * The new variation starts the transformation branch of each residual unit with BN and a ReLU. * It removes BN and ReLU after the last convolution. * As a result, the information from previous layers can flow completely unaltered through the shortcut branch of each residual unit. * The image below shows some variations (of the position of BN and ReLU) that they tested. The new and better design is on the right: ![BN and ReLU positions](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Identity_Mappings_in_Deep_Residual_Networks__activations.png?raw=true "BN and ReLU positions") * They also tried various alternative designs for the shortcut connections. However, all of these designs performed worse than the original one. Only one (d) came close under certain conditions. Therefore, the recommendation is to stick with the old/original design. ![Shortcut designs](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Identity_Mappings_in_Deep_Residual_Networks__shortcuts.png?raw=true "Shortcut designs") ### Results * Significantly faster training for very deep residual networks (1001 layers). * Better regularization due to the placement of BN. * CIFAR10 and CIFAR100 results, old vs. new design: ![Old vs new results](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Identity_Mappings_in_Deep_Residual_Networks__old_vs_new.png?raw=true "Old vs new results") 
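The ordering of operations in the new residual unit can be sketched as follows. This is a rough illustration under simplifying assumptions: dense weight matrices (`w1`, `w2`) stand in for the convolutions, the batch normalization has no learned scale or shift, and all function names are hypothetical.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Normalize each feature over the batch axis (no learned scale/shift here).
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def relu(x):
    return np.maximum(0.0, x)

def preact_residual_unit(x, w1, w2):
    # Transformation branch: BN -> ReLU -> weights -> BN -> ReLU -> weights.
    h = relu(batch_norm(x)) @ w1
    h = relu(batch_norm(h)) @ w2
    # Shortcut branch: x is added unaltered, and no BN/ReLU follows the sum.
    return x + h
```

Because nothing is applied after the addition, a unit whose transformation branch outputs zeros passes `x` through exactly. This is the unaltered information flow through the shortcut that the summary refers to.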
* They describe a regularization method similar to dropout and stochastic depth. * The method could be viewed as a merge of the two techniques (dropout, stochastic depth). * The method seems to regularize better than either of the two alone. ### How * Let `x` be the input to a layer. That layer produces an output. The output can be: * Feed forward ("classic") network: `F(x)`. * Residual network: `x + F(x)`. * The standard dropout-like methods do the following: * Dropout in feed forward networks: Sometimes `0`, sometimes `F(x)`. Decided per unit. * Dropout in residual networks (rarely used): Sometimes `0`, sometimes `x + F(x)`. Decided per unit. * Stochastic depth (only in residual networks): Sometimes `x`, sometimes `x + F(x)`. Decided per *layer*. * Skip forward (only in residual networks): Sometimes `x`, sometimes `x + F(x)`. Decided per unit. * **Swapout** (any network): Sometimes `0`, sometimes `F(x)`, sometimes `x`, sometimes `x + F(x)`. Decided per unit. * Swapout can be represented using the formula `y = theta_1 * x + theta_2 * F(x)`. * `*` is the elementwise product. * `theta_1` and `theta_2` are tensors following Bernoulli distributions, i.e. their values are all exactly `0` or exactly `1`. * Setting the values of `theta_1` and `theta_2` per unit in the right way leads to the values `0` (both 0), `x` (1, 0), `F(x)` (0, 1) or `x + F(x)` (1, 1). * Deterministic and Stochastic Inference * Ideally, when using a dropout-like technique you would like to get rid of its stochastic effects during prediction, so that you can predict values with exactly *one* forward pass through the network (instead of having to average over many passes). * For Swapout it can be mathematically shown that you can't calculate a deterministic version of it that performs equally to the stochastic one (averaging over many forward passes). * This is even more the case when using Batch Normalization in a network. (Actually also when not using Swapout, but instead Dropout + BN.) 
* So for best results you should use the stochastic method (averaging over many forward passes). ### Results * They compare various dropout-like methods, including Swapout, applied to residual networks. (On CIFAR-10 and CIFAR-100.) * General performance: * Results with Swapout are better than with the other methods. * According to their results, the ranking of methods is roughly: Swapout > Dropout > Stochastic Depth > Skip Forward > None. * Stochastic vs deterministic method: * The stochastic method of Swapout (average over N forward passes) performs significantly better than the deterministic one. * Using about 15-30 forward passes seems to yield good results. * Optimal parameter choice: * Previously the Swapout formula `y = theta_1 * x + theta_2 * F(x)` was mentioned. * `theta_1` and `theta_2` are generated via Bernoulli distributions which have parameters `p_1` and `p_2`. * If using fixed values for `p_1` and `p_2` throughout the network, it seems to be best to either set both of them to `0.5` or to set `p_1` to `>0.5` and `p_2` to `<0.5` (preference towards `y = x`). * It's best however to start both at `1.0` (always `y = x + F(x)`) and to then linearly decay them to both `0.5` towards the end of the network, i.e. to apply less noise to the early layers. (This is similar to the results in the Stochastic Depth paper.) * Thin vs. wide residual networks: * The standard residual networks that they compared to used a `(16, 32, 64)` pattern for their layers, i.e. they started with layers of each having 16 convolutional filters, followed by some layers with each having 32 filters, followed by some layers with 64 filters. * They tried instead a `(32, 64, 128)` pattern, i.e. they doubled the number of filters. * Then they reduced the number of layers from 100 down to 20. * Their wider residual network performed significantly better than the deep and thin counterpart. However, their parameter count also increased by about `4` times. 
* Increasing the pattern again to `(64, 128, 256)` and increasing the number of layers from 20 to 32 leads to another performance improvement, beating a 1000-layer network of pattern `(16, 32, 64)`. (Parameter count is then `27` times the original value.) * Comments * Stochastic depth works layer-wise, while Swapout works unit-wise. When a layer in Stochastic Depth is dropped, its whole forward and backward pass don't have to be calculated. That saves time. Swapout is not going to save time. * They argue that dropout+BN would also profit from using stochastic inference instead of deterministic inference, just like Swapout does. However, they don't mention using it for dropout in their comparison, only for Swapout. * They show that linear decay for their parameters (less dropping on early layers, more on later ones) significantly improves the results of Swapout. However, they don't mention testing the same thing for dropout. Maybe dropout would also profit from it? * For the above two points: Dropout's test error is at 5.87, Swapout's test error is at 5.68. So the difference is already quite small, making any disadvantage for dropout significant. ![Visualization](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Swapout__visualization.png?raw=true "Visualization") *Visualization of how Swapout works. From left to right: An input `x`; a standard layer is applied to the input `F(x)`; a residual layer is applied to the input `x + F(x)`; Skip Forward is applied to the layer; Swapout is applied to the layer. Stochastic Depth would be all units being orange (`x`) or blue (`x + F(x)`).* 
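The Swapout formula, stochastic inference and the linear decay of `p_1`/`p_2` can be sketched in NumPy as below. This is a minimal single-layer illustration with hypothetical function names. In a real network the averaging for stochastic inference would be done over complete forward passes rather than over one layer.

```python
import numpy as np

def swapout(x, fx, p1, p2, rng):
    """y = theta_1 * x + theta_2 * F(x), with per-unit Bernoulli masks."""
    theta1 = rng.binomial(1, p1, size=x.shape)  # 1 keeps x for that unit
    theta2 = rng.binomial(1, p2, size=x.shape)  # 1 keeps F(x) for that unit
    return theta1 * x + theta2 * fx

def stochastic_inference(x, layer_fn, p1, p2, n_passes, rng):
    """Average the Swapout output over several stochastic passes."""
    outs = [swapout(x, layer_fn(x), p1, p2, rng) for _ in range(n_passes)]
    return np.mean(outs, axis=0)

def linear_decay(n_layers, start=1.0, end=0.5):
    """Per-layer Bernoulli parameter, decayed linearly from start to end."""
    return np.linspace(start, end, n_layers)
```

Setting `(p1, p2)` to `(1, 1)` recovers a plain residual layer, `(1, 0)` gives `y = x`, `(0, 1)` gives `y = F(x)` and `(0, 0)` gives all zeros, matching the four cases listed in the summary.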
* They describe a variation of convolutions that have a differently structured receptive field. * They argue that their variation works better for dense prediction, i.e. for predicting values for every pixel in an image (e.g. coloring, segmentation, upscaling). ### How * One can imagine the input into a convolutional layer as a 3d-grid. Each cell is a "pixel" generated by a filter. * Normal convolutions compute their output per cell as a weighted sum of the input cells in a dense area. I.e. all input cells are right next to each other. * In dilated convolutions, the cells are not right next to each other. E.g. 2-dilated convolutions skip 1 cell between each input cell, 3-dilated convolutions skip 2 cells etc. (Similar to striding.) * Normal convolutions are simply 1-dilated convolutions (skipping 0 cells). * One can use a 1-dilated convolution and then a 2-dilated convolution. The receptive field of the second convolution will then be 7x7 instead of the usual 5x5 due to the spacing. * Doubling the dilation factor per layer (1, 2, 4, 8, ...) leads to an exponential increase in the receptive field size, while every cell in the receptive field will still take part in the computation of at least one convolution. * They had problems with badly performing networks, which they fixed using an identity initialization for the weights. (Sounds like just using residual connections would have been easier.) ![Receptive field](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/MultiScale_Context_Aggregation_by_Dilated_Convolutions__receptive.png?raw=true "Receptive field") *Receptive fields of a 1-dilated convolution (1st image), followed by a 2-dilated conv. (2nd image), followed by a 4-dilated conv. (3rd image). The blue color indicates the receptive field size (notice the exponential increase in size). 
Stronger blue colors mean that the value has been used in more different convolutions.* ### Results * They took a VGG net, removed the pooling layers and replaced the convolutions with dilated ones (weights can be kept). * They then used the network to segment images. * Their results were significantly better than previous methods. * They also added another network with more dilated convolutions in front of the VGG one, again improving the results. ![Segmentation performance](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/MultiScale_Context_Aggregation_by_Dilated_Convolutions__segmentation.png?raw=true "Segmentation performance") *Their performance on a segmentation task compared to two competing methods. They only used VGG16 without pooling layers and with convolutions replaced by dilated convolutions.* 
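The receptive field sizes mentioned above (5x5 for two normal 3x3 convolutions, 7x7 for a 1-dilated followed by a 2-dilated one) can be checked with a small sketch. The function names are hypothetical, and the 1D convolution is a simplified stand-in for the 2D case, written without padding and without flipping the kernel.

```python
import numpy as np

def receptive_field(dilations, kernel_size=3):
    """Side length of the receptive field after stacking dilated convolutions."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d  # each layer widens the field by (k - 1) * d
    return rf

def dilated_conv1d(x, w, dilation):
    """1D dilated convolution: input cells are `dilation` steps apart."""
    span = (len(w) - 1) * dilation
    return np.array([
        sum(w[k] * x[i + k * dilation] for k in range(len(w)))
        for i in range(len(x) - span)
    ])
```

For example, `receptive_field([1, 1])` gives 5 and `receptive_field([1, 2])` gives 7, while the dilation pattern `[1, 2, 4]` from the figure gives 15.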
* The well-known method of Artistic Style Transfer can be used to generate new texture images (from an existing example) by skipping the content loss and only using the style loss. * The method however can have problems with large-scale structures and quasi-periodic patterns. * They add a new loss based on the spectrum of the images (synthesized image and style image), which decreases these problems and handles periodic patterns especially well. ### How * Everything is handled in the same way as in the Artistic Style Transfer paper (without content loss). * On top of that they add their spectrum loss: * The loss is based on a squared distance, i.e. $\frac{1}{2} d(I_s, I_t)^2$. * $I_s$ is the last synthesized image. * $I_t$ is the texture example. * $d(I_s, I_t)$ then does the following: * It assumes that $I_t$ is an example for a space of target images. * Within that set it finds the image $I_p$ which is most similar to $I_s$. That is done using a projection via Fourier Transformations. (See formula 5 in the paper.) * The returned distance is then $I_s - I_p$. ### Results * Equal quality for textures without quasi-periodic structures. * Significantly better quality for textures with quasi-periodic structures. ![Overview](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Texture_Synthesis_Through_CNNs_and_Spectrum_Constraints__overview.png?raw=true "Overview") *Overview of their method, i.e. generated textures using style and/or spectrum-based loss.* 
https://www.youtube.com/watch?v=PRD8LpPvdHI * They describe a method that can be used for two problems: * (1) Choose a style image and apply that style to other images. * (2) Choose an example texture image and create new texture images that look similar. * In contrast to previous methods their method can be applied very fast to images (style transfer) or noise (texture creation). However, per style/texture a single (expensive) initial training session is still necessary. * Their method builds upon their previous paper "Combining Markov Random Fields and Convolutional Neural Networks for Image Synthesis". ### How * Rough overview of their previous method: * Transfer styles using three losses: * Content loss: MSE between VGG representations. * Regularization loss: Sum of x-gradients and y-gradients (encouraging smooth areas). * MRF-based style loss: Sample `k x k` patches from VGG representations of content image and style image. For each patch from the content image find the nearest neighbor (based on normalized cross correlation) from the style patches. Loss is then the sum of squared errors of euclidean distances between content patches and their nearest neighbors. * Generation of new images is done by starting with noise and then iteratively applying changes that minimize the loss function. * They introduce two major changes: * (a) Get rid of the costly nearest neighbor search for the MRF loss. Instead, use a discriminator network that receives a patch and rates how real that patch looks. * This discriminator network is costly to train, but that only has to be done once (per style/texture). * (b) Get rid of the slow, iterative generation of images. Instead, start with the content image (style transfer) or noise image (texture generation) and feed that through a single generator network to create the output image (with transferred style or generated texture). * This generator network is costly to train, but that only has to be done once (per style/texture). 
* MDANs * They implement change (a) to the standard architecture and call that an "MDAN" (Markovian Deconvolutional Adversarial Networks). * So the architecture of the MDAN is: * Input: Image (RGB pixels) * Branch 1: Markovian Patch Quality Rater (aka Discriminator) * Starts by feeding the image through VGG19 until layer `relu3_1`. (Note: VGG weights are fixed/not trained.) * Then extracts `k x k` patches from the generated representations. * Feeds each patch through a shallow ConvNet (convolution with BN then fully connected layer). * Training loss is a hinge loss, i.e. max margin between classes +1 (real looking patch) and -1 (fake looking patch). (Could also take a single sigmoid output, but they argue that hinge loss isn't as likely to saturate.) * This branch will be trained continuously while synthesizing a new image. * Branch 2: Content Estimation/Guidance * Note: This branch is only used for style transfer, i.e. if using a content image, and not for texture generation. * Starts by feeding the currently synthesized image through VGG19 until layer `relu5_1`. (Note: VGG weights are fixed/not trained.) * Also feeds the content image through VGG19 until layer `relu5_1`. * Then uses an MSE loss between both representations (so similar to an MSE on RGB pixels that is often used in autoencoders). * Nothing in this branch needs to be trained; the loss only affects the synthesizing of the image. * MGANs * The MGAN is like the MDAN, but additionally implements change (b), i.e. they add a generator that takes an image and stylizes it. * The generator's architecture is: * Input: Image (RGB pixels) or noise (for texture synthesis) * Output: Image (RGB pixels) (stylized input image or generated texture) * The generator takes the image (pixels) and feeds that through VGG19 until layer `relu4_1`. * Similar to the DCGAN generator, they then apply a few fractionally strided convolutions (with BN and LeakyReLUs) to that, ending in a Tanh output. 
(Fractionally strided convolutions increase the height/width of the images, here to compensate for the VGG pooling layers.) * The output after the Tanh is the output image (RGB pixels). * They train the generator with pairs of `(input image, stylized image or texture)`. These pairs can be gathered by first running the MDAN alone on several images. (With significant augmentation a few dozen pairs already seem to be enough.) * One of two possible loss functions can then be used: * Simple standard choice: MSE on the euclidean distance between expected output pixels and generated output pixels. Can cause blurriness. * Better choice: MSE on a higher VGG representation. Simply feed the generated output pixels through VGG19 until `relu4_1` and then reuse the already generated (see above) VGG representation of the input image. This is very similar to the pixel-wise comparison, but tends to cause less blurriness. * Note: For some reason the authors call their generator a VAE, but don't mention any typical VAE technique, so it's not described like one here. * They use Adam to train their networks. * For texture generation they use Perlin Noise instead of simple white noise. In Perlin Noise, lower frequency components dominate more than higher frequency components. White noise didn't work well with the VGG representations in the generator (activations were close to zero). ### Results * Similar quality to previous methods, but much faster (compared to most methods). * For the Markovian Patch Quality Rater (MDAN branch 1): * They found that the weights of this branch can be used as initialization for other training sessions (e.g. other texture styles), leading to a decrease in required iterations/epochs. * Using VGG for feature extraction seems to be crucial. Training from scratch led to worse results. * Using larger patch sizes preserves more of the structure of the style image/texture. Smaller patches lead to more flexibility in generated patterns. 
* They found that using more than 3 convolutional layers or more than 64 filters per layer provided no visible benefit in quality. ![Example](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Markovian_GANs__example.png?raw=true "Example") *Result of their method, compared to other methods.* ![Architecture](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Markovian_GANs__architecture.png?raw=true "Architecture") *Architecture of their model.* 
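The patch extraction and the hinge loss of the Markovian Patch Quality Rater can be sketched as below. This is a rough illustration, not the authors' code: the function names are hypothetical, the margin of 1 is the usual hinge-loss convention and an assumption here, and the scores would in practice come from the shallow ConvNet applied to each patch.

```python
import numpy as np

def extract_patches(features, k):
    """Collect all k x k patches from a (channels, height, width) feature map."""
    c, h, w = features.shape
    return np.stack([features[:, i:i + k, j:j + k].ravel()
                     for i in range(h - k + 1)
                     for j in range(w - k + 1)])

def hinge_loss(scores, labels):
    """Max-margin loss; labels are +1 (real patch) or -1 (fake patch)."""
    return np.mean(np.maximum(0.0, 1.0 - labels * scores))
```

Patches that are scored on the correct side with a margin of at least 1 contribute zero loss, so the loss keeps a gradient only for patches that are still misjudged or judged with low confidence.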
* They describe a method to transfer image styles based on semantic classes. * This allows to: * (1) Transfer styles between images more accurately than with previous models. E.g. so that the background of an image does not receive the style of skin/hair/clothes/... seen in the style image. Skin in the synthesized image should receive the style of skin from the style image. Same for hair, clothes, etc. * (2) Turn simple doodles into artwork by treating the simplified areas in the doodle as semantic classes and annotating an artwork with these same semantic classes. (E.g. "this blob should receive the style from these trees.") ### How * Their method is based on [Combining Markov Random Fields and Convolutional Neural Networks for Image Synthesis](Combining_MRFs_and_CNNs_for_Image_Synthesis.md). * They use the same content loss and mostly the same MRFbased style loss. (Apparently they don't use the regularization loss.) * They change the input of the MRFbased style loss. * Usually that input would only be the activations of a VGGlayer (for the synthesized image or the style source image). * They add a semantic map with weighting `gamma` to the activation, i.e. `<representation of image> = <activation of specific layer for that image>  gamma * <semantic map>`. * The semantic map has N channels with 1s in a channel where a specific class is located (e.g. skin). * The semantic map has to be created by the user for both the content image and the style image. * As usually for the MRF loss, patches are then sampled from the representations. The semantic maps then influence the distance measure. I.e. patches are more likely to be sampled from the same semantic class. * Higher `gamma` values make it more likely to sample from the same semantic class (because the distance from patches from different classes gets larger). * One can create a small doodle with few colors, then use the colors as the semantic map. 
Then add a semantic map to an artwork and run the algorithm to transform the doodle into an artwork. ### Results * More control over the transferred styles than previously. * Less sensitive to the style weighting, because of the additional `gamma` hyperparameter. * Easy transformation from doodle to artwork. ![Example](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Neural_Doodle__example.png?raw=true "Example") *Turning a doodle into an artwork. Note that the doodle input image is also used as the semantic map of the input.* 
* They describe a method that applies the style of a source image to a target image. * Example: Let a normal photo look like a van Gogh painting. * Example: Let a normal car look more like a specific luxury car. * Their method builds upon the well-known artistic style paper and uses a new MRF prior. * The prior leads to locally more plausible patterns (e.g. fewer artifacts). ### How * They reuse the content loss from the artistic style paper. * The content loss was calculated by feeding the source and target image through a network (here: VGG19) and then estimating the squared error of the euclidean distance between one or more hidden layer activations. * They use layer `relu4_2` for the distance measurement. * They replace the original style loss with an MRF-based style loss. * Step 1: Extract from the source image `k x k` sized overlapping patches. * Step 2: Perform step (1) analogously for the target image. * Step 3: Feed the source image patches through a pretrained network (here: VGG19) and select the representations `r_s` from specific hidden layers (here: `relu3_1`, `relu4_1`). * Step 4: Perform step (3) analogously for the target image. (Result: `r_t`) * Step 5: For each patch of `r_s` find the best matching patch in `r_t` (based on normalized cross correlation). * Step 6: Calculate the sum of squared errors (based on euclidean distances) of each patch in `r_s` and its best match (according to step 5). * They add a regularizer loss. * The loss encourages smooth transitions in the synthesized image (i.e. few edges, corners). * It is based on the raw pixel values of the last synthesized image. * For each pixel in the synthesized image, they calculate the squared x-gradient and the squared y-gradient and then add both. * They use the sum of all those values as their loss (i.e. `regularizer loss = <sum over all pixels> xgradient^2 + ygradient^2`). 
* Their whole optimization problem is then roughly `image = argmin_image MRFstyleloss + alpha1 * contentloss + alpha2 * regularizerloss`. * In practice, they start their synthesis with a low resolution image and then progressively increase the resolution (each time performing some iterations of optimization). * In practice, they sample patches from the style image under several different rotations and scalings. ### Results * In comparison to the original artistic style paper: * Fewer artifacts. * Their method tends to preserve style better, but content worse. * Can handle photorealistic style transfer better, so long as the images are similar enough. If no good matches between patches can be found, their method performs worse. ![Non-photorealistic example images](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Combining_MRFs_and_CNNs_for_Image_Synthesis__examples.png?raw=true "Non-photorealistic example images") *Non-photorealistic example images. Their method vs. the one from the original artistic style paper.* ![Photorealistic example images](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Combining_MRFs_and_CNNs_for_Image_Synthesis__examples_real.png?raw=true "Photorealistic example images") *Photorealistic example images. Their method vs. the one from the original artistic style paper.* 
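Steps 5 and 6 of the MRF-based style loss and the regularizer loss can be sketched in NumPy as follows. This is a simplified illustration with hypothetical function names. It assumes that the feature maps `r_s` and `r_t` have already been extracted from the network, treats normalized cross correlation as the dot product of length-normalized patches, and approximates the gradients with finite differences between neighboring pixels.

```python
import numpy as np

def extract_patches(feat, k):
    """All overlapping k x k patches of a (channels, height, width) array."""
    c, h, w = feat.shape
    return np.stack([feat[:, i:i + k, j:j + k].ravel()
                     for i in range(h - k + 1)
                     for j in range(w - k + 1)])

def mrf_style_loss(r_s, r_t, k=3, eps=1e-8):
    """Sum of squared errors between each patch of r_s and its best match in r_t."""
    ps, pt = extract_patches(r_s, k), extract_patches(r_t, k)
    ps_n = ps / (np.linalg.norm(ps, axis=1, keepdims=True) + eps)
    pt_n = pt / (np.linalg.norm(pt, axis=1, keepdims=True) + eps)
    best = np.argmax(ps_n @ pt_n.T, axis=1)  # step 5: nearest neighbor by NCC
    return np.sum((ps - pt[best]) ** 2)      # step 6: sum of squared errors

def regularizer_loss(img):
    """Sum over all pixels of xgradient^2 + ygradient^2 (finite differences)."""
    dx = img[..., :, 1:] - img[..., :, :-1]
    dy = img[..., 1:, :] - img[..., :-1, :]
    return np.sum(dx ** 2) + np.sum(dy ** 2)
```

The full pairwise similarity matrix `ps_n @ pt_n.T` is what makes the nearest neighbor search costly for large feature maps.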
* They describe a model that upscales low resolution images to their high resolution equivalents ("Single Image Super Resolution"). * Their model uses a deeper architecture than previous models and has a residual component. ### How * Their model is a fully convolutional neural network. * Input of the model: The image to upscale, *already upscaled to the desired size* (but still blurry). * Output of the model: The upscaled image (without the blurriness). * They use 20 layers of padded 3x3 convolutions with size 64xHxW with ReLU activations. (No pooling.) * They have a residual component, i.e. the model only learns and outputs the *change* that has to be applied/added to the blurry input image (instead of outputting the full image). That change is applied to the blurry input image before using the loss function on it. (Note that this is a bit different from the currently used "residual learning".) * They use an MSE between the "correct" upscaling and the generated upscaled image (input image + residual). * They use SGD starting with a learning rate of 0.1 and decay it 3 times by a factor of 10. * They use weight decay of 0.0001. * During training they use a special gradient clipping adapted to the learning rate. Usually gradient clipping restricts the gradient values to `[-t, t]` (`t` is a hyperparameter). Their gradient clipping restricts the values to `[-t/lr, t/lr]` (where `lr` is the learning rate). * They argue that their special gradient clipping allows the use of significantly higher learning rates. * They train their model on multiple scales, e.g. 2x, 3x, 4x upscaling. (Not really clear how. They probably feed their upscaled image again into the network or something like that?) ### Results * Higher accuracy upscaling than all previous methods. * Handles upscaling factors above 2x well. * Residual network learns significantly faster than non-residual network. 
![Architecture](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Accurate_Image_SuperResolution__architecture.png?raw=true "Architecture") *Architecture of the model.* ![Examples](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Accurate_Image_SuperResolution__examples.png?raw=true "Examples") *Superresolution quality of their model (top, bottom is a competing model).* 
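The learning-rate-dependent gradient clipping and the residual output described above can be sketched as follows. The function names are hypothetical, and the snippet only illustrates the arithmetic, not the authors' training loop.

```python
import numpy as np

def adjustable_clip(grad, t, lr):
    """Clip gradient values to [-t/lr, t/lr] instead of the usual [-t, t]."""
    bound = t / lr
    return np.clip(grad, -bound, bound)

def reconstruct(blurry_input, predicted_residual):
    """The model predicts only the change that is added to the blurry input."""
    return blurry_input + predicted_residual

def mse(a, b):
    return np.mean((a - b) ** 2)
```

Since the update step is `lr * gradient`, clipping to `[-t/lr, t/lr]` bounds the size of each weight update by `t` regardless of the learning rate. That is one way to read the claim that this clipping permits higher learning rates.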
* They describe a model for human pose estimation, i.e. one that finds the joints ("skeleton") of a person in an image. * They argue that part of their model resembles a Markov Random Field (but in reality its implemented as just one big neural network). ### How * They have two components in their network: * PartDetector: * Finds candidate locations for human joints in an image. * Pretty standard ConvNet. A few convolutional layers with pooling and ReLUs. * They use two branches: A fine and a coarse one. Both branches have practically the same architecture (convolutions, pooling etc.). The coarse one however receives the image downscaled by a factor of 2 (half width/height) and upscales it by a factor of 2 at the end of the branch. * At the end they merge the results of both branches with more convolutions. * The output of this model are 4 heatmaps (one per joint? unclear), each having lower resolution than the original image. * SpatialModel: * Takes the results of the part detector and tries to remove all detections that were false positives. * They derive their architecture from a fully connected Markov Random Field which would be solved with one step of belief propagation. * They use large convolutions (128x128) to resemble the "fully connected" part. * They initialize the weights of the convolutions with joint positions gathered from the training set. * The convolutions are followed by log(), elementwise additions and exp() to resemble an energy function. * The end result are the input heatmaps, but cleaned up. ### Results * Beats all previous models (with and without spatial model). * Accuracy seems to be around 90% (with enough (16px) tolerance in pixel distance from ground truth). * Adding the spatial model adds a few percentage points of accuracy. * Using two branches instead of one (in the part detector) adds a bit of accuracy. Adding a third branch adds a tiny bit more. 
![Results](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Joint_Training_of_a_ConvNet_and_a_PGM_for_HPE__results.png?raw=true "Results") *Example results.* ![Part Detector](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Joint_Training_of_a_ConvNet_and_a_PGM_for_HPE__part_detector.png?raw=true "Part Detector") *Part Detector network.* ![Spatial Model](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Joint_Training_of_a_ConvNet_and_a_PGM_for_HPE__spatial_model.png?raw=true "Spatial Model") *Spatial Model (apparently only for two input heatmaps).*  # Rough chapterwise notes * (1) Introduction * Human Pose Estimation (HPE) from RGB images is difficult due to the high dimensionality of the input. * Approaches: * Deformablepart models: Traditionally based on handcrafted features. * Deeplearning based disciminative models: Recently outperformed other models. However, it is hard to incorporate priors (e.g. possible joint interconnectivity) into the model. * They combine: * A partdetector (ConvNet, utilizes multiresolution feature representation with overlapping receptive fields) * Partbased SpatialModel (approximates loopy belief propagation) * They backpropagate through the spatial model and then the partdetector. * (3) Model * (3.1) Convolutional Network PartDetector * This model locates possible positions of human key joints in the image ("part detector"). * Input: RGB image. * Output: 4 heatmaps, one per key joint (per pixel: likelihood). * They use a fully convolutional network. * They argue that applying convolutions to every pixel is similar to moving a sliding window over the image. * They use two receptive field sizes for their "sliding window": A large but coarse/blurry one, a small but fine one. * To implement that, they use two branches. Both branches are mostly identical (convolutions, poolings, ReLU). 
They simply feed a downscaled (half width/height) version of the input image into the coarser branch. At the end they upscale the coarser branch once and then merge both branches. * After the merge they apply 9x9 convolutions and then 1x1 convolutions to get it down to 4xHxW (H=60, W=90 where expected input was H=320, W=240). * (3.2) Higher-level Spatial-Model * This model takes the detected joint positions (heatmaps) and tries to remove those that are probably false positives. * It is a ConvNet, which tries to emulate (1) a Markov Random Field and (2) solving that MRF approximately via one step of belief propagation. * The raw MRF formula would be something like `<likelihood of joint A per px> = normalize( <product over joint v from joints V> <probability of joint A per px given a> * <probability of joint v at px?> + someBiasTerm)`. * They treat the probabilities as energies and remove the partition function (`normalize`) from the formula for various reasons (e.g. because they are only interested in the maximum value anyway). * They use exp() in combination with log() to replace the product with a sum. * They apply SoftPlus and ReLU so that the energies are always positive (and therefore play well with log). * Apparently `<probability of joint v at px?>` are the input heatmaps of the part detector. * Apparently `<probability of joint A per px given a>` is implemented as the weights of a convolution. * Apparently `someBiasTerm` is implemented as the bias of a convolution. * The convolutions that they use are large (128x128) to emulate a fully connected graph. * They initialize the convolution weights based on histograms gathered from the dataset (empirical distribution of joint displacements). * (3.3) Unified Models * They combine the part-based model and the spatial model into a single one. * They first train only the part-based model, then only the spatial model, then both.
* (4) Results * Used datasets: FLIC (4k training images, 1k test, mostly front-facing and standing poses), FLIC-plus (17k, 1k ?), extended-LSP (10k, 1k). * FLIC contains images showing multiple persons with only one being annotated. So for FLIC they add a heatmap of the annotated body torso to the input (i.e. the part-detector does not have to search for the person any more). * The evaluation metric roughly measures how often predicted joint positions are within a certain radius of the true joint positions. * Their model performs significantly better than competing models (on both FLIC and LSP). * Accuracy seems to be at around 80%-95% per joint (when choosing high enough evaluation tolerance, i.e. 10px+). * Adding the spatial model to the part detector increases the accuracy by around 10-15 percentage points. * Training the part detector and the spatial model jointly adds ~3 percentage points accuracy over training them separately. * Adding the second filter bank (coarser branch in the part detector) adds around 5 percentage points accuracy. Adding a third filter bank adds a tiny bit more accuracy.
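The log/exp trick of the spatial model in (3.2) can be sketched in a few lines. This is only a toy illustration under stated assumptions, not the authors' network: it works on 1-D "heatmaps" instead of images, uses kernels of size 3 instead of 128x128, and the names `conv1d_same`, `spatial_model`, `kernels` and `biases` are made up for the example. The kernels play the role of the pairwise priors (where joint A tends to be relative to joint v), and the product over joints is computed as a sum of logs followed by exp.

```python
import math

def softplus(x):
    """Smooth positive function, used here to keep weights and biases > 0."""
    return math.log1p(math.exp(x))

def conv1d_same(signal, kernel):
    """Correlate `signal` with an odd-length `kernel`, zero-padded to the same length."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(signal):
                acc += w * signal[j]
        out.append(acc)
    return out

def spatial_model(heatmaps, kernels, biases, eps=1e-6):
    """One message-passing step: exp(sum_v log(conv(prior_{a|v}, heatmap_v) + bias)).

    heatmaps: dict joint -> list of scores (1-D toy image).
    kernels[(a, v)], biases[(a, v)]: hypothetical raw parameters of the prior
    of joint a given joint v (made positive via softplus).
    """
    cleaned = {}
    for a in heatmaps:
        log_sum = [0.0] * len(heatmaps[a])
        for v in heatmaps:
            weights = [softplus(w) for w in kernels[(a, v)]]
            positive_map = [max(h, 0.0) for h in heatmaps[v]]  # ReLU on the input
            message = conv1d_same(positive_map, weights)
            bias = softplus(biases[(a, v)])
            log_sum = [s + math.log(m + bias + eps) for s, m in zip(log_sum, message)]
        cleaned[a] = [math.exp(s) for s in log_sum]
    return cleaned

# Toy example: the face heatmap has a true peak (index 2) and a false positive (index 6).
heatmaps = {"face": [0, 0, 1, 0, 0, 0, 1], "shoulder": [0, 0, 0, 1, 0, 0, 0]}
kernels = {
    ("face", "face"): [-10, 5, -10],          # a joint supports its own location
    ("shoulder", "shoulder"): [-10, 5, -10],
    ("face", "shoulder"): [-10, -10, 5],      # face expected one pixel left of shoulder
    ("shoulder", "face"): [5, -10, -10],      # shoulder expected one pixel right of face
}
biases = {pair: -10 for pair in kernels}
cleaned = spatial_model(heatmaps, kernels, biases)
```

In this toy setup the false positive at index 6 is suppressed because no shoulder detection supports it, while the peak at index 2 survives.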
* They present a hierarchical method for reinforcement learning. * The method combines "long"-term goals with short-term action choices. ### How * They have two components: * Meta-Controller: * Responsible for the "long"-term goals. * Is trained to pick goals (based on the current state) that maximize (extrinsic) rewards, just like you would usually optimize to maximize rewards by picking good actions. * The Meta-Controller only picks goals when the Controller terminates or has achieved the goal. * Controller: * Receives the current state and the current goal. * Has to pick a reward maximizing action based on those, just as the agent would usually do (only the goal is added here). * The reward is intrinsic. It comes from the Critic. The Critic gives reward whenever the current goal is reached. * For Montezuma's Revenge: * A goal is to reach a specific object. * The goal is encoded via a bitmask (as big as the game screen). The mask contains 1s wherever the object is. * They hand-extract the location of a few specific objects. * So basically: * The Meta-Controller picks the next object to reach via a Q-value function. * It receives extrinsic reward when objects have been reached in a specific sequence. * The Controller picks actions that lead to reaching the object based on a Q-value function. It iterates action-choosing until it terminates or has reached the goal-object. * The Critic awards intrinsic reward to the Controller whenever the goal-object was reached. * They use CNNs for the Meta-Controller and the Controller, similar in architecture to the Atari-DQN paper (shallow CNNs). * They use two replay memories, one for the Meta-Controller (size 50k) and one for the Controller (size 1M). * Both follow an epsilon-greedy policy (for picking goals/actions). Epsilon starts at 1.0 and is annealed down to 0.1. * They use a discount factor / gamma of 0.99. * They train with SGD. ### Results * Learns to play Montezuma's Revenge.
* Learns to act well in a more abstract MDP with delayed rewards and where simple Q-learning failed. # Rough chapter-wise notes * (1) Introduction * Basic problem: Learn goal directed behaviour from sparse feedback. * Challenges: * Explore state space efficiently * Create multiple levels of spatio-temporal abstractions * Their method: Combines deep reinforcement learning with hierarchical value functions. * Their agent is motivated to solve specific intrinsic goals. * Goals are defined in the space of entities and relations, which constrains the search space. * They define their value function as V(s, g) where s is the state and g is a goal. * First, their agent learns to solve intrinsically generated goals. Then it learns to chain these goals together. * Their model has two hierarchy levels: * Meta-Controller: Selects the current goal based on the current state. * Controller: Takes state s and goal g, then selects a good action based on s and g. The controller operates until g is achieved, then the meta-controller picks the next goal. * Meta-Controller gets extrinsic rewards, controller gets intrinsic rewards. * They use SGD to optimize the whole system (with respect to reward maximization). * (3) Model * Basic setting: Action a out of all actions A, state s out of S, transition function T(s,a)->s', reward by state F(s)->R. * epsilon-greedy is good for local exploration, but it's not good at exploring very different areas of the state space. * They use intrinsically motivated goals to better explore the state space. * Sequences of goals are arranged to maximize the received extrinsic reward. * The agent learns one policy per goal. * Meta-Controller: Receives current state, chooses goal. * Controller: Receives current state and current goal, chooses action. Keeps choosing actions until goal is achieved or a terminal state is reached. Has the optimization target of maximizing cumulative reward.
* Critic: Checks if current goal is achieved and if so provides intrinsic reward. * They use deep Q learning to train their model. * There are two Q-value functions. One for the controller and one for the meta-controller. * Both formulas are extended by the last chosen goal g. * The Q-value function of the meta-controller does not depend on the chosen action. * The Q-value function of the controller receives only intrinsic direct reward, not extrinsic direct reward. * Both Q-value functions are represented with DQNs. * Both are optimized to minimize MSE losses. * They use separate replay memories for the controller and meta-controller. * A memory is added for the meta-controller whenever the controller terminates. * Each new goal is picked by the meta-controller epsilon-greedy (based on the current state). * The controller picks actions epsilon-greedy (based on the current state and goal). * Both epsilons are annealed down. * (4) Experiments * (4.1) Discrete MDP with delayed rewards * Basic MDP setting, following roughly: Several states (s1 to s6) organized in a chain. The agent can move left or right. It gets high reward if it moves to state s6 and then back to s1, otherwise it gets small reward per reached state. * They use their hierarchical method, but without neural nets. * Baseline is Q-learning without a hierarchy/intrinsic rewards. * Their method performs significantly better than the baseline. * (4.2) ATARI game with delayed rewards * They play Montezuma's Revenge with their method, because that game has very delayed rewards. * They use CNNs for the controller and meta-controller (architecture similar to the Atari-DQN paper). * The critic reacts to (entity1, relation, entity2) relationships. The entities are just objects visible in the game. The relation is (apparently ?) always "reached", i.e. whether object1 arrived at object2. * They extract the objects manually, i.e. assume the existence of a perfect unsupervised object detector.
* They apparently encode the goals not as vectors, but instead just use a bitmask (game screen height and width), which has 1s at the pixels that show the object. * Replay memory sizes: 1M for controller, 50k for meta-controller. * gamma=0.99 * They first only train the controller (i.e. meta-controller completely random) and only then train both jointly. * Their method successfully learns to perform actions which lead to rewards with long delays. * It starts with easier goals and then learns harder goals.
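The interplay of meta-controller, controller and critic described above can be sketched as a two-level loop. This is a minimal sketch under stated assumptions, not the authors' code: the learned Q-networks and the epsilon-greedy exploration are replaced by hand-written placeholder policies (`pick_goal`, `pick_action`), the "game" is a hypothetical 1x5 grid, and all function names are invented for illustration. It only shows which transitions end up in which replay memory and where the intrinsic and extrinsic rewards come from.

```python
def goal_mask(height, width, box):
    """Binary mask of screen size with 1s where the goal object is.
    box = (top, left, bottom, right) with exclusive bottom/right (hypothetical format)."""
    top, left, bottom, right = box
    return [[1 if top <= r < bottom and left <= c < right else 0
             for c in range(width)]
            for r in range(height)]

def critic(agent_pos, mask):
    """Intrinsic reward: 1.0 if the agent stands on the goal object, else 0.0."""
    row, col = agent_pos
    return 1.0 if mask[row][col] == 1 else 0.0

def run_episode(start, goals, pick_goal, pick_action, step, max_steps=100):
    """Two-level control loop; returns (controller_memory, meta_memory)."""
    controller_memory, meta_memory = [], []
    pos, done, t = start, False, 0
    while not done and t < max_steps:
        goal = pick_goal(pos)                     # meta-controller: choose a goal
        mask = goals[goal]
        goal_start, extrinsic_sum, reached = pos, 0.0, False
        while not (done or reached) and t < max_steps:
            action = pick_action(pos, mask)       # controller: choose an action
            new_pos, extrinsic, done = step(pos, action)
            intrinsic = critic(new_pos, mask)     # critic: was the goal reached?
            controller_memory.append((pos, goal, action, intrinsic, new_pos))
            extrinsic_sum += extrinsic
            reached = intrinsic > 0.0
            pos, t = new_pos, t + 1
        # One meta-controller transition per finished goal, with summed extrinsic reward.
        meta_memory.append((goal_start, goal, extrinsic_sum, pos))
    return controller_memory, meta_memory

# Toy world: agent starts at column 0, a "key" lies at column 2, a "door" at column 4.
WIDTH = 5
goals = {"key": goal_mask(1, WIDTH, (0, 2, 1, 3)),
         "door": goal_mask(1, WIDTH, (0, 4, 1, 5))}

def pick_goal(pos):            # placeholder for an epsilon-greedy Q-value choice
    return "key" if pos[1] < 2 else "door"

def pick_action(pos, mask):    # placeholder policy: always move right
    return +1

def step(pos, action):         # extrinsic reward only when the door is reached
    col = min(max(pos[1] + action, 0), WIDTH - 1)
    at_door = col == WIDTH - 1
    return (0, col), (1.0 if at_door else 0.0), at_door

controller_memory, meta_memory = run_episode((0, 0), goals, pick_goal, pick_action, step)
```

In a full implementation both placeholder policies would be replaced by epsilon-greedy choices over learned Q-values, and both memories would be sampled for training.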
* They present a model which adds color to grayscale images (e.g. to old black and white images). * It works best with 224x224 images, but can handle other sizes too. ### How * Their model has three feature extraction components: * Low level features: * Receives 1xHxW images and outputs 512xH/8xW/8 matrices. * Uses 6 convolutional layers (3x3, strided, ReLU) for that. * Global features: * Receives the low level features and converts them to 256 dimensional vectors. * Uses 4 convolutional layers (3x3, strided, ReLU) and 3 fully connected layers (1024 -> 512 -> 256; ReLU) for that. * Mid-level features: * Receives the low level features and converts them to 256xH/8xW/8 matrices. * Uses 2 convolutional layers (3x3, ReLU) for that. * The global and mid-level features are then merged with a Fusion Layer. * The Fusion Layer is basically an extended convolutional layer. * It takes the mid-level features (256xH/8xW/8) and the global features (256) as input and outputs a matrix of shape 256xH/8xW/8. * It mostly operates like a normal convolutional layer on the mid-level features. However, its weight matrix is extended to also include weights for the global features (which will be added at every pixel). * So they use something like `fusion at pixel u,v = sigmoid(bias + weights * [global features, mid-level features at pixel u,v])`, and that with 256 different weight matrices and biases for 256 filters. * After the Fusion Layer they use another network to create the coloring: * This network receives 256xH/8xW/8 matrices (merge of global and mid-level features) and generates 2xHxW outputs (color in L\*a\*b\* color space). * It uses a few convolutional layers combined with layers that do nearest neighbour upsampling. * The loss for the colorization network is a MSE based on the true coloring. * They train the global feature extraction also on the true class labels of the used images. * Their model can handle any sized image.
If the image doesn't have a size of 224x224, it must be resized to 224x224 for the global feature extraction. The mid-level feature extraction only uses convolutions, therefore it can work with any image size. ### Results * The training set that they use is the "Places scene dataset". * After cleanup the dataset contains 2.3M training images (205 different classes) and 19k validation images. * Users rate images colored by their method in 92.6% of all cases as real-looking (ground truth: 97.2%). * If they exclude global features from their method, they only achieve 70% real-looking images. * They can also extract the global features from image A and then use them on image B. That transfers the style from A to B. But it only works well on semantically similar images. ![Architecture](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Let_there_be_Color__architecture.png?raw=true "Architecture") *Architecture of their model.* ![Old images](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Let_there_be_Color__old_images.png?raw=true "Old images") *Their model applied to old images.* # Rough chapter-wise notes * (1) Introduction * They use a CNN to color images. * Their network extracts global priors and local features from grayscale images. * Global priors: * Extracted from the whole image (e.g. time of day, indoor or outdoors, ...). * They use class labels of images to train those. (Not needed during test.) * Local features: Extracted from small patches (e.g. texture). * They don't generate a full RGB image, instead they generate the chrominance map using the CIE L\*a\*b\* colorspace. * Components of the model: * Low level features network: Generated first. * Mid level features network: Generated based on the low level features. * Global features network: Generated based on the low level features. * Colorization network: Receives mid level and global features, which were merged in a fusion layer.
* Their network can process images of arbitrary size. * Global features can be generated based on another image to change the style of colorization, e.g. to change the seasonal colors from spring to summer. * (3) Joint Global and Local Model * <repetition of parts of the introduction> * They mostly use ReLUs. * (3.1) Deep Networks * <standard neural net introduction> * (3.2) Fusing Global and Local Features for Colorization * Global features are used as priors for local features. * (3.2.1) Shared Low-Level Features * The low level features are shared, i.e. they are fed into the networks of both the global and the mid-level feature extractors. * They generate them from the input image using a ConvNet with 6 layers (3x3, 1x1 padding, strided/no pooling, ends in 512xH/8xW/8). * (3.2.2) Global Image Features * They process the low level features via another network into global features. * That network has 4 conv-layers (3x3, 2 strided layers, all 512 filters), followed by 3 fully connected layers (1024, 512, 256). * Input size (of low level features) is expected to be 224x224. * (3.2.3) Mid-Level Features * Takes the low level features (512xH/8xW/8) and uses 2 conv layers (3x3) to transform them to 256xH/8xW/8. * (3.2.4) Fusing Global and Local Features * The Fusion Layer is basically an extended convolutional layer. * It takes the mid-level features (256xH/8xW/8) and the global features (256) as input and outputs a matrix of shape 256xH/8xW/8. * It mostly operates like a normal convolutional layer on the mid-level features. However, its weight matrix is extended to also include weights for the global features (which will be added at every pixel). * So they use something like `fusion at pixel u,v = sigmoid(bias + weights * [global features, mid-level features at pixel u,v])`, and that with 256 different weight matrices and biases for 256 filters.
* (3.2.5) Colorization Network * The colorization network receives the 256xH/8xW/8 matrix from the fusion layer and transforms it to the 2xHxW chrominance map. * It basically uses two upsampling blocks, each starting with a nearest neighbour upsampling layer, followed by two 3x3 convs. * The last layer uses a sigmoid activation. * The network ends in a MSE. * (3.3) Colorization with Classification * To make training more effective, they train parts of the global features network via image class labels. * I.e. they take the output of the 2nd fully connected layer (at the end of the global network), add one small hidden layer after it, followed by a sigmoid output layer (size equals number of class labels). * They train that with cross entropy. So their global loss becomes something like `L = MSE(color accuracy) + alpha*CrossEntropy(class labels accuracy)`. * (3.4) Optimization and Learning * Low level feature extraction uses only convs, so the features can be extracted from any image size. * Global feature extraction uses fc layers, so the features can only be extracted from 224x224 images. * If an image has a size unequal to 224x224, it must be (1) resized to 224x224, fed through low level feature extraction, then fed through the global feature extraction and (2) separately (without resize) fed through the low level feature extraction and then fed through the mid-level feature extraction. * However, they only trained on 224x224 images (for efficiency). * Augmentation: 224x224 crops from 256x256 images; random horizontal flips. * They use Adadelta, because they don't want to set learning rates. (Why not adagrad/adam/...?) * (4) Experimental Results and Discussion * They set the alpha in their loss to `1/300`. * They use the "Places scene dataset". They filter images with low color variance (including grayscale images). They end up with 2.3M training images and 19k validation images. They have 205 classes. * Batch size: 128. * They train for about 11 epochs.
* (4.1) Colorization results * Good looking colorization results on the Places scene dataset. * (4.2) Comparison with State of the Art * Their method succeeds where other methods fail. * Their method can handle very different kinds of images. * (4.3) User study * When rated by users, 92.6% think that their coloring is real (ground truth: 97.2%). * Note: Users were told to only look briefly at the images. * (4.4) Importance of Global Features * Their model *without* global features only achieves 70% user rating. * There are too many ambiguities on the local level. * (4.5) Style Transfer through Global Features * They can perform style transfer by extracting the global features of image B and using them for image A. * (4.6) Colorizing the past * Their model performs well on old images despite the artifacts commonly found on those. * (4.7) Classification Results * Their method achieves nearly as high classification accuracy as VGG (see classification loss for global features). * (4.8) Comparison of Color Spaces * L\*a\*b\* color space performs slightly better than RGB and YUV, so they picked that color space. * (4.9) Computation Time * One image is usually processed within seconds. * CPU takes roughly 5x longer. * (4.10) Limitations and Discussion * Their approach is data driven, i.e. can only deal well with types of images that appeared in the dataset. * Style transfer works only really well for semantically similar images. * Style transfer cannot necessarily transfer specific colors, because the whole model only sees the grayscale version of the image. * Their model tends to strongly prefer the most common color for objects (e.g. grass always green). 
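The fusion formula from (3.2.4) can be written out per pixel. The following is a small sketch in plain Python, assuming that the layer acts like a 1x1 convolution (one weight vector per output filter, applied independently at each pixel); the function name `fusion_layer` and the tiny shapes in the example are illustrative choices, not taken from the paper.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fusion_layer(mid, global_feats, weights, biases):
    """Fuse mid-level feature maps with a global feature vector.

    mid:          [C_mid][H][W] mid-level features
    global_feats: [C_glob] global features (the same vector is used at every pixel)
    weights:      [C_out][C_glob + C_mid], one weight vector per output filter
    biases:       [C_out]
    returns:      [C_out][H][W]
    """
    height, width = len(mid[0]), len(mid[0][0])
    fused = []
    for w_row, b in zip(weights, biases):
        channel = []
        for u in range(height):
            row = []
            for v in range(width):
                # Concatenate [global features, mid-level features at pixel (u, v)].
                combined = list(global_feats) + [m[u][v] for m in mid]
                row.append(sigmoid(b + sum(w * x for w, x in zip(w_row, combined))))
            channel.append(row)
        fused.append(channel)
    return fused

# Tiny example: 1 mid-level channel, 1 global feature, 1 output filter, 1x2 "image".
mid = [[[0.0, -0.25]]]
global_feats = [0.5]
weights = [[1.0, 2.0]]   # first weight for the global feature, second for the mid-level one
biases = [0.0]
fused = fusion_layer(mid, global_feats, weights, biases)
```

In the described model the same computation would run with 256 mid-level channels, 256 global features and 256 output filters.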
https://www.youtube.com/watch?v=vQk_Sfl7kSc&feature=youtu.be * The paper describes a method to transfer the style (e.g. choice of colors, structure of brush strokes) of an image to a whole video. * The method is designed so that the transferred style is consistent over many frames. * Examples for such consistency: * No flickering of style between frames. So the next frame has always roughly the same style in the same locations. * No artefacts at the boundaries of objects, even if they are moving. * If an area gets occluded and then unoccluded a few frames later, the style of that area is still the same as before the occlusion. ### How * Assume that we have a frame to stylize $x$ and an image from which to extract the style $a$. * The basic process is the same as in the original Artistic Style Transfer paper, they just add a bit on top of that. * They start with a gaussian noise image $x'$ and change it gradually so that a loss function gets minimized. * The loss function has the following components: * Content loss *(old, same as in the Artistic Style Transfer paper)* * This loss makes sure that the content in the generated/stylized image still matches the content of the original image. * $x$ and $x'$ are fed forward through a pretrained network (VGG in their case). * Then the generated representations of the intermediate layers of the network are extracted/read. * One or more layers are picked and the difference between those layers for $x$ and $x'$ is measured via a MSE. * E.g. if we used only the representations of the layer conv5 then we would get something like `(conv5(x) - conv5(x'))^2` per example. (Where conv5() also executes all previous layers.) * Style loss *(old)* * This loss makes sure that the style of the generated/stylized image matches the style source $a$. * $x'$ and $a$ are fed forward through a pretrained network (VGG in their case). * Then the generated representations of the intermediate layers of the network are extracted/read.
* One or more layers are picked and the Gram Matrices of those layers are calculated. * Then the difference between those matrices is measured via a MSE. * Temporal loss *(new)* * This loss enforces consistency in style between a pair of frames. * The main sources of inconsistency are boundaries of moving objects and areas that get disoccluded. * They use the optical flow to detect motion. * Applying an optical flow method to two frames $(i, i+1)$ returns per pixel the movement of that pixel, i.e. if the pixel at $(x=1, y=2)$ moved to $(x=2, y=4)$ the optical flow at that pixel would be $(u=1, v=2)$. * The optical flow can be split into the forward flow (here `fw`) and the backward flow (here `bw`). The forward flow is the flow from frame $i$ to $i+1$ (as described in the previous point). The backward flow is the flow from frame $i+1$ to $i$ (reverse direction in time). * Boundaries * At boundaries of objects the derivative of the flow is high, i.e. the flow "suddenly" changes significantly from one pixel to the other. * So to detect boundaries they use (per pixel) roughly the equation `gradient(u)^2 + gradient(v)^2 > length((u,v))`. * Occlusions and disocclusions * If a pixel does not get occluded/disoccluded between frames, the optical flow method should be able to correctly estimate the motion of that pixel between the frames. The forward and backward flows then should be roughly equal, just in opposing directions. * If a pixel does get occluded/disoccluded between frames, it will not be visible in one of the two frames and therefore the optical flow method cannot reliably estimate the motion for that pixel. It is then expected that the forward and backward flow are unequal. * To measure that effect they roughly use (per pixel) a formula matching `length(fw + bw)^2 > length(fw)^2 + length(bw)^2`. * Mask $c$ * They create a mask $c$ with the size of the frame. * For every pixel they estimate whether the boundary-equation *or* the disocclusion-equation is true.
* If either of them is true, they add a 0 to the mask, otherwise a 1. So the mask is 1 wherever there is *no* disocclusion or motion boundary. * Combination * The final temporal loss is the mean (over all pixels) of $c*(x'-w)^2$. * $x'$ is the stylized frame that is currently being generated. * $w$ is the previous *stylized* frame (frame i-1), warped according to the optical flow between frame i-1 and i. * `c` is the mask value at the pixel. * By using the difference `x'-w` they ensure that the difference in styles between two frames is low. * By adding `c` they ensure the style-consistency only at pixels that probably should have a consistent style. * Long-term loss *(new)* * This loss enforces consistency in style between pairs of frames that are further apart from each other. * It is a simple extension of the temporal (short-term) loss. * The temporal loss was computed for frames (i-1, i). The long-term loss is the sum of the temporal losses for the frame pairs {(i-4,i), (i-2,i), (i-1,i)}. * The $c$ mask is recomputed for every pair and is 1 if no boundaries/disocclusions are detected, but only if there is not already a 1 for the same pixel in the mask of a later (closer) frame. The additional condition is intended to associate pixels with their closest neighbours in time to minimize possible errors. * Note that the long-term loss can completely replace the temporal loss as the latter one is contained in the former one. * Multi-pass approach *(new)* * They had problems with contrast around the boundaries of the frames. * To combat that, they use a multi-pass method in which they seem to calculate the optical flow in multiple forward and backward passes? (Not very clear here what they do and why it would help.) * Initialization with previous frame *(new)* * Instead of starting at a gaussian noise image every time, they instead use the previous stylized frame. * That immediately leads to more similarity between the frames.
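The disocclusion check, the masked temporal loss and the long-term masks described above can be sketched per pixel in plain Python. This is a rough sketch under assumptions: the inequality follows the simplified form given in these notes (any additional threshold constants are omitted), images are flattened to 1-D lists of grayscale values, and the function names are invented for this example.

```python
def is_disoccluded(fw, bw):
    """fw, bw: (u, v) forward and backward flow at corresponding pixels.
    Consistent flows cancel out (fw == -bw); otherwise the pixel is flagged."""
    su, sv = fw[0] + bw[0], fw[1] + bw[1]
    return su * su + sv * sv > (fw[0] ** 2 + fw[1] ** 2) + (bw[0] ** 2 + bw[1] ** 2)

def temporal_loss(stylized, warped_prev, mask):
    """Mean over all pixels of c * (x' - w)^2.
    stylized:    current stylized frame (flattened pixel values)
    warped_prev: previous stylized frame, already warped by the optical flow
    mask:        1 where the style should stay consistent, 0 at boundaries/disocclusions"""
    total = sum(c * (x - w) ** 2 for x, w, c in zip(stylized, warped_prev, mask))
    return total / len(stylized)

def long_term_masks(masks_nearest_first):
    """Restrict each mask to pixels not already claimed by a closer frame.
    masks_nearest_first: masks for the pairs (i-1, i), (i-2, i), (i-4, i), in that order."""
    claimed = [0] * len(masks_nearest_first[0])
    result = []
    for mask in masks_nearest_first:
        restricted = [c if not taken else 0 for c, taken in zip(mask, claimed)]
        claimed = [max(t, c) for t, c in zip(claimed, restricted)]
        result.append(restricted)
    return result

# Usage: pixel 1 is masked out, so its large difference does not contribute.
loss = temporal_loss([1.0, 5.0, 2.0], [0.0, 0.0, 2.0], [1, 0, 1])
masks = long_term_masks([[1, 0, 0], [1, 1, 0], [1, 1, 1]])
```

The long-term loss would then be the sum of `temporal_loss` over the frame pairs, each with its restricted mask.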
* They use an implementation of Q-learning (i.e. reinforcement learning) with CNNs to automatically play Atari games. * The algorithm receives the raw pixels as its input and has to choose buttons to press as its output. No hand-engineered features are used. So the model "sees" the game and "uses" the controller, just like a human player would. * The model achieves good results on various games, beating all previous techniques and sometimes even surpassing human players. ### How * Deep Q Learning * *This is yet another explanation of deep Q learning, see also [this blog post](http://www.nervanasys.com/demystifyingdeepreinforcementlearning/) for a longer explanation.* * While playing, sequences of the form (`state1`, `action1`, `reward`, `state2`) are generated. * `state1` is the current game state. The agent only sees the pixels of that state. (Example: Screen shows enemy.) * `action1` is an action that the agent chooses. (Example: Shoot!) * `reward` is the direct reward received for picking `action1` in `state1`. (Example: +1 for a kill.) * `state2` is the next game state, after the action was chosen in `state1`. (Example: Screen shows dead enemy.) * One can pick actions at random for some time to generate lots of such tuples. That leads to a replay memory. * Direct reward * After playing randomly for some time, one can train a model to predict the direct reward given a screen (we don't want to use the whole state, just the pixels) and an action, i.e. `Q(screen, action) -> direct reward`. * That function would need a forward pass for each possible action that we could take. So for e.g. 8 buttons that would be 8 forward passes. To make things more efficient, we can let the model directly predict the direct reward for each available action, e.g. for 3 buttons `Q(screen) -> (direct reward of action1, direct reward of action2, direct reward of action3)`. * We can then sample examples from our replay memory. The input per example is the screen.
The output is the reward as a tuple. E.g. if we picked button 1 of 3 in one example and received a reward of +1 then our output/label for that example would be `(1, 0, 0)`. * We can then train the model by playing completely randomly for some time, then sample some batches and train using a mean squared error. Then play a bit less randomly, i.e. start to use the action which the network thinks would generate the highest reward. Then train again, and so on. * Indirect reward * Doing the previous steps, the model will learn to anticipate the *direct* reward correctly. However, we also want it to predict indirect rewards. Otherwise, the model e.g. would never learn to shoot rockets at enemies, because the reward from killing an enemy would come many frames later. * To learn the indirect reward, one simply adds the reward value of the highest-reward action according to `Q(state2)` to the direct reward. * I.e. if we have a tuple (`state1`, `action1`, `reward`, `state2`), we would not add (`state1`, `action1`, `reward`) to the replay memory, but instead (`state1`, `action1`, `reward + highestReward(Q(screen2))`). (Where `highestReward()` returns the reward of the action with the highest reward according to Q().) * By training to predict `reward + highestReward(Q(screen2))` the network learns to anticipate the direct reward *and* the indirect reward. It takes a leap of faith to accept that this will ever converge to a good solution, but it does. * We then add `gamma` to the equation: `reward + gamma*highestReward(Q(screen2))`. `gamma` may be set to 0.9. It is a discount factor that devalues future states, e.g. because the world is not deterministic and therefore we can't exactly predict what's going to happen. Note that Q will automatically learn to stack it, e.g. `state3` will be discounted to `gamma^2` at `state1`. * This paper * They use the mentioned Deep Q Learning to train their model Q. * They use a k-th frame technique, i.e.
they let the model decide upon an action at (here) every 4th frame. * Q is implemented via a neural net. It receives 84x84x4 grayscale pixels that show the game and projects that onto the rewards of 4 to 18 actions. * The input is HxWx4 because they actually feed the last 4 frames into the network, instead of just 1 frame. So the network knows more about which things are moving and how. * The network architecture is: * 84x84x4 (input) * 16 convs, 8x8, stride 4, ReLU * 32 convs, 4x4, stride 2, ReLU * 256 fully connected neurons, ReLU * <N_actions> fully connected neurons, linear * They use a replay memory of 1 million frames. ### Results * They ran experiments on the Atari games Beam Rider, Breakout, Enduro, Pong, Qbert, Seaquest and Space Invaders. * Same architecture and hyperparameters for all games. * Rewards were based on score changes in the games, i.e. they used +1 (score increased) and -1 (score decreased). * Optimizer: RMSProp, Batch Size: 32. * Trained for 10 million examples/frames per game. * They had no problems with instability and their average Q value per game increased smoothly. * Their method beats all other state of the art methods. * They managed to beat a human player in games that required not so much "long"-term strategies (the fewer frames the better). * Video: starts at 46:05. https://youtu.be/dV80NAlEins?t=46m05s ![Algorithm](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Playing_Atari_with_Deep_Reinforcement_Learning__algorithm.png?raw=true "Algorithm") *The original full algorithm, as shown in the paper.* ### Rough chapter-wise notes * (1) Introduction * Problems when using neural nets in reinforcement learning (RL): * Reward signal is often sparse, noisy and delayed. * Data samples are often assumed to be independent, while they are correlated in RL. * Data distribution can change when the algorithm learns new behaviours. * They use Q-learning with a CNN and stochastic gradient descent.
* They use an experience replay mechanism (i.e. memory) from which they can sample previous transitions (for training). * They apply their method to Atari 2600 games in the Arcade Learning Environment (ALE). * They use only the visible pixels as input to the network, i.e. no manual feature extraction. * (2) Background * Standard deep Q learning explanation. * (3) Related Work * TD-Gammon: "Solved" backgammon. Worked similarly to Q-learning and used a multi-layer perceptron. * Attempts to copy TD-Gammon to other games failed. * Research was focused on linear function approximators as there were problems with non-linear ones diverging. * Recently again interest in using neural nets for reinforcement learning. Some attempts to fix divergence problems with gradient temporal-difference methods. * NFQ is a very similar method (to the one in this paper), but worked on the whole batch instead of minibatches, making it slow. It also first applied dimensionality reduction via autoencoders on the images instead of training on them end-to-end. * HyperNEAT was applied to Atari games and evolved a neural net for each game. The networks learned to exploit design flaws. * (4) Deep Reinforcement Learning * They want to connect a reinforcement learning algorithm with a deep neural network, e.g. to get rid of handcrafted features. * The network is supposed to run on the raw RGB images. * They use experience replay, i.e. store tuples of (pixels, chosen action, received reward) in a memory and use that during training. * They use Q-learning. * They use an epsilon-greedy policy. * Advantages from using experience replay instead of learning "live" during game playing: * Experiences can be reused many times (more efficient). * Samples are less correlated. * Learned parameters from one batch don't determine as much the distributions of the examples in the next batch. * They save the last N experiences and sample uniformly from them during training.
* (4.1) Preprocessing and Model Architecture * Raw Atari images are 210x160 pixels with 128 possible colors. * They downsample them to 110x84 pixels and then crop the 84x84 playing area out of them. * They also convert the images to grayscale. * They use the last 4 frames as input and stack them. * So their network input has shape 84x84x4. * They use one output neuron per possible action. So they can compute the Q-value (expected reward) of each action with one forward pass. * Architecture: 84x84x4 (input) => 16 8x8 convs, stride 4, ReLU => 32 4x4 convs, stride 2, ReLU => fc 256, ReLU => fc N actions, linear * 4 to 18 actions/outputs (depends on the game). * Aside from the outputs, the architecture is the same for all games. * (5) Experiments * Games that they played: Beam Rider, Breakout, Enduro, Pong, Qbert, Seaquest, Space Invaders * They use the same architecture and hyperparameters for all games. * They give a reward of +1 whenever the in-game score increases and -1 whenever it decreases. * They use RMSProp. * Mini batch size was 32. * They train for 10 million frames/examples. * They initialize epsilon (in their epsilon-greedy strategy) to 1.0 and decrease it linearly to 0.1 over the first one million frames. * They let the agent decide upon an action at every 4th in-game frame (3rd in Space Invaders). * (5.1) Training and stability * They plot the average reward and Q-value per N games to evaluate the agent's training progress. * The average reward increases in a noisy way. * The average Q-value increases smoothly. * They did not experience any divergence issues during their training. * (5.2) Visualizing the Value Function * The agent learns to predict the value function accurately, even for rather long sequences (here: ~25 frames). * (5.3) Main Evaluation * They compare to three other methods that use hand-engineered features and/or use the pixel data combined with significant prior knowledge. * They mostly outperform the other methods. 
* They managed to beat a human player in three games. The ones where the human won seemed to require strategies that stretched over longer time frames. 
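
The layer sizes from (4.1) and the epsilon schedule from (5) can be checked with a bit of arithmetic. The following is only a sketch: the function names are made up for illustration, and it assumes unpadded ("valid") convolutions, which the notes above do not state explicitly.

```python
def conv_out(size, kernel, stride):
    """Output width/height of an unpadded convolution (assumption: no padding)."""
    return (size - kernel) // stride + 1


def dqn_feature_sizes(input_size=84):
    """Spatial sizes after the two conv layers and the flattened feature count."""
    h1 = conv_out(input_size, kernel=8, stride=4)  # 16 filters, 8x8, stride 4
    h2 = conv_out(h1, kernel=4, stride=2)          # 32 filters, 4x4, stride 2
    flat = 32 * h2 * h2                            # input size of the 256-unit layer
    return h1, h2, flat


def epsilon_at(frame, start=1.0, end=0.1, anneal_frames=1_000_000):
    """Linear decay from `start` to `end` over `anneal_frames`, constant afterwards."""
    frac = min(max(frame, 0) / anneal_frames, 1.0)
    return start + frac * (end - start)


if __name__ == "__main__":
    print(dqn_feature_sizes())
    print(epsilon_at(0), epsilon_at(500_000), epsilon_at(5_000_000))
```

Under these assumptions the 84x84 input shrinks to 20x20 after the first conv layer and to 9x9 after the second, so the fully connected layer sees 32 * 9 * 9 = 2592 values.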
* AIR (attend, infer, repeat) is a recurrent autoencoder architecture to transform images into latent representations object by object. * As an autoencoder it is unsupervised. * The latent representation is generated in multiple time steps. * Each time step is intended to encode information about exactly one object in the image. * The information encoded for each object is (mostly) what-where information, i.e. which class the object has and where (in 2D: translation, scaling) it is shown. * AIR has a dynamic number of time steps. After encoding one object the model can decide whether it has encoded all objects or whether there is another one to encode. As a result the latent layer size is not fixed. * AIR uses an attention mechanism during the encoding to focus on each object. ### How * At its core, AIR is a variational autoencoder. * It maximizes a lower bound on the data likelihood instead of minimizing a "classic" reconstruction error (like MSE). * It has an encoder and a decoder. * The model uses a recurrent architecture via an LSTM. * It (ideally) encodes/decodes one object per time step. * Encoder * The encoder receives the image and generates latent information for one object (what object, where it is). * At the second timestep it receives the image, the previous timestep's latent information and the previous timestep's hidden layer. It then generates another piece of latent information (for another object). * And so on. * Decoder * The decoder receives latent information from the encoder (timestep by timestep) and treats it as what-where information when reconstructing the images. * It takes the what-part and uses a "normal" decoder to generate an image that shows the object. * It takes the where-part and the generated image and feeds both into a spatial transformer, which then transforms the generated image by translating and scaling it. * Dynamic size * AIR makes use of a dynamically sized latent layer. 
It is not necessarily limited to a fixed number of time steps. * Implementation: Instead of just letting the encoder generate what-where information, the encoder also generates a "present" information, which is 0 or 1. If it is 1, the recurrence will continue with encoding and decoding another object. Otherwise it will stop. * Attention * To add an attention mechanism, AIR first uses the LSTM's hidden layer to generate "where" and "present" information per object. * It stops if the "present" information is 0. * Otherwise it uses the "where" information to focus on the object using a spatial transformer. The object is then encoded to the "what" information. ### Results * On a dataset of images, each containing multiple MNIST digits, AIR learns to accurately count the digits and estimate their position and scale. * When AIR is trained on images of 0 to 2 digits and tested on images containing 3 digits it performs poorly. * When AIR is trained on images of 0, 1 or 3 digits and tested on images containing 2 digits its performance is mediocre. * DAIR performs well on both tasks. Likely because it learns to remove each digit from the image after it has investigated it. * When AIR is trained on 0 to 2 digits and a second network is trained (separately) to work with the generated latent layer (trained to sum the shown digits and rate whether they are shown in ascending order), then that second network reaches high accuracy with relatively few examples. That indicates usefulness for unsupervised learning. * When AIR is trained on a dataset of handwritten characters from different alphabets, it learns to represent distinct strokes in its latent layer. * When AIR is trained in combination with a renderer (inverse graphics), it is able to accurately recover latent parameters of rendered objects, better than supervised networks. That indicates usefulness for robots which have to interact with objects. 
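
The "present" mechanism described under *Dynamic size* boils down to a loop that keeps asking for another object until a 0 is emitted. The sketch below illustrates only that control flow; `step_fn` is a hypothetical stand-in for one LSTM step (it is not part of the paper's code) and `max_steps` is an arbitrary safety limit.

```python
def encode_length(n):
    """Represent an object count n as n ones followed by a terminating zero."""
    return [1] * n + [0]


def decode_length(z_pres):
    """Recover the object count by counting ones up to the first zero."""
    n = 0
    for flag in z_pres:
        if flag == 0:
            break
        n += 1
    return n


def run_inference(step_fn, max_steps=10):
    """Call step_fn(t) -> (z_pres, z_where, z_what) until z_pres is 0.

    step_fn is a hypothetical placeholder for the recurrent encoder step.
    """
    objects = []
    for t in range(max_steps):
        z_pres, z_where, z_what = step_fn(t)
        if z_pres == 0:
            break  # the model reports that no further object is present
        objects.append((z_where, z_what))
    return objects


if __name__ == "__main__":
    # Toy step function: pretends that the image contains exactly two objects.
    toy = lambda t: (1 if t < 2 else 0, ("x", "y", "scale"), "digit-%d" % t)
    print(encode_length(2), decode_length([1, 1, 0]), run_inference(toy))
```

The number of loop iterations, and therefore the size of the latent representation, is decided by the model at run time rather than fixed in advance.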
![Architecture](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Attend_Infer_Repeat__architecture.png?raw=true "Architecture.") *AIR architecture for MNIST. Left: Decoder for two objects that are each first generated (y_att) and then fed into a Spatial Transformer (y) before being combined into an image (x). Middle: Encoder with multiple time steps that generates what-where information per object and stops when the "present" information (z_pres) is 0. Right: Combination of both for MNIST with Spatial Transformer for the attention mechanism (top left).* ![DAIR Architecture](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Attend_Infer_Repeat__architecture_dair.png?raw=true "DAIR Architecture.") *Encoder with DAIR architecture. DAIR modifies the image after every timestep (e.g. to remove objects that were already encoded).*  ### Rough chapterwise notes * (1) Introduction * Assumption: Images are made up of distinct objects. These objects have visual and physical properties. * They developed a framework for efficient inference in images (i.e. get from the image to a latent representation of the objects, i.e. inverse graphics). * Parts of the framework: High dimensional representations (e.g. object images), interpretable latent variables (e.g. for rotation) and generative processes (to combine object images with latent variables). * Contributions: * A scheme for efficient variational inference in latent spaces of variable dimensionality. * Idea: Treat inference as an iterative process, implemented via an RNN that looks at one object at a time and learns an appropriate number of inference steps. (Attend-Infer-Repeat, AIR) * End-to-end training via amortized variational inference (continuous variables: gradient descent, discrete variables: black-box optimization). * AIR makes it possible to train generative models that automatically learn to decompose scenes. 
* AIR makes it possible to recover objects and their attributes from rendered 3D scenes (inverse rendering). * (2) Approach * Just like in VAEs, the scene interpretation is treated with a Bayesian approach. * There are latent variables `z` and images `x`. * Images are generated via a probability distribution `p(x|z)`. * This can be reversed via Bayes' rule: `p(x|z) = p(x)p(z|x) / p(z)`, which means that `p(z|x) = p(x|z)p(z) / p(x)`. * The prior `p(z)` must be chosen and captures assumptions about the distributions of the latent variables. * `p(x|z)` is the likelihood and represents the model that generates images from latent variables. * They assume that there can be multiple objects in an image. * Every object gets its own latent variables. * A probability distribution `p(x|z)` then converts each object (on its own) from the latent variables to an image. * The number of objects follows a probability distribution `p(n)`. * For the prior and likelihood they assume two scenarios: * 2D: Three dimensions for X, Y and scale. Additionally n dimensions for its shape. * 3D: Dimensions for X, Y, Z, rotation, object identity/category (multinomial variable). (No scale?) * Both 2D and 3D can be separated into latent variables for "where" and "what". * It is assumed that the prior latent variables are independent of each other. * (2.1) Inference * Inference for their model is intractable, therefore they use an approximation `q(z,n|x)`, which minimizes `KL(q(z,n|x) || p(z,n|x))`, i.e. KL(approximation || real), using amortized variational approximation. * Challenges for them: * The dimensionality of their latent variable layer is a random variable following `p(n)` (i.e. no static size). * Strong symmetries. * They implement inference via an RNN which encodes the image object by object. * The encoded latent variables can be Gaussians. * They encode the latent layer length `n` via a vector (instead of an integer). The vector has the form of `n` ones followed by one zero. 
* If the length vector is `#z` then they want to approximate `q(z,#z|x)`. * That can apparently be decomposed into `<product> q(latent variable value i, #z is still 1 at i | x, previous latent variable values) * q(has length n | z,x)`. * So instead of computing `#z` once, they instead compute at every time step whether there is another object in the image, which indirectly creates a chain of ones followed by a zero (the `#z` vector). * (2.2) Learning * The parameters theta (`p`, latent variable -> image) and phi (`q`, image -> latent variables) are jointly optimized. * Optimization happens by maximizing a lower bound `E[log(p(x,z,n) / q(z,n|x))]` called the negative free energy. * (2.2.1) Parameters of the model theta * Parameters theta of `log(p(x,z,n))` can easily be obtained using differentiation, so long as z and n are well approximated. * The differentiation of the lower bound with respect to theta can be approximated using Monte Carlo methods. * (2.2.2) Parameters of the inference network phi * phi are the parameters of q, i.e. of the RNN that generates z and #z in i timesteps. * At each timestep (i.e. per object) the RNN generates three kinds of information: What (object), where (it is), whether it is present (i <= n). * Each of these pieces of information is represented via variables. These variables can be discrete or continuous. * When differentiating w.r.t. a continuous variable they use the reparameterization trick. * When differentiating w.r.t. a discrete variable they use the likelihood ratio estimator. * (3) Models and Experiments * The RNN is implemented via an LSTM. * DAIR * The "normal" AIR model uses at every time step the image and the RNN's hidden layer to generate the next latent information (what object, where it is and whether it is present). * DAIR uses that latent information to change the image at every time step and then uses the difference (D) image for the next time step, i.e. 
DAIR can remove an object from the image after it has generated latent variables for it. * (3.1) Multi-MNIST * They generate a dataset of images containing multiple MNIST digits. * Each image contains 0 to 2 digits. * AIR is trained on the dataset. * It learns without supervision a good attention scanning policy for the images (to "hit" all digits), to count the digits visible in the image and to use a matching number of time steps. * During training, the model seems to first learn proper reconstruction of the digits and only then to do it with as few timesteps as possible. * (3.1.1) Strong Generalization * They test the generalization capabilities of AIR. * *Extrapolation task*: They generate images with 0 to 2 digits for training, then test on images with 3 digits. The model is unable to correctly count the digits (~0% accuracy). * *Interpolation task*: They generate images with 0, 1 or 3 digits for training, then test on images with 2 digits. The model performs moderately (~60% accuracy). * DAIR performs well in both cases (~80% accuracy for extrapolation, ~95% accuracy for interpolation). * (3.1.2) Representational Power * They train AIR on images containing 0, 1 or 2 digits. * Then they train a second network. That network takes the output of the first one and a) computes the sum of the digits and b) estimates whether they are shown in ascending order. * Accuracy for both tasks is ~95%. * The network reaches that accuracy significantly faster than a separately trained CNN (i.e. requires fewer labels / is more unsupervised). * (3.2) Omniglot * They train AIR on the Omniglot dataset (1.6k handwritten characters from 50 alphabets). * They allow the model to use up to 4 timesteps. * The model learns to reconstruct the images in timesteps that resemble strokes. * (3.3) 3D Scenes * Here, the generator `p(x|z)` is a 3D renderer, only `q(z|x)` must be approximated. * The model has to learn to count the objects and to estimate per object its identity (class) and pose. 
* They use "finite-differencing" to get gradients through the renderer and use "score function estimators" to get gradients with respect to discrete variables. * They first test with a setup where the object count is always 1. The network learns to accurately recover the object parameters. * A similar "normal" network has many more problems with recovering the parameters, especially rotation, because the conditional probabilities are multi-modal. The lower bound maximization strategy seems to work better in those cases. * In a second experiment with multiple complex objects, AIR also achieves high reconstruction accuracy. 
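
Section (2.2.2) mentions two gradient estimators: the reparameterization trick for continuous variables and the likelihood ratio (score function) estimator for discrete ones. The toy sketch below shows both on one-dimensional problems whose true gradients are known. The test functions, sample counts and seeds are arbitrary choices for illustration and are not taken from the paper.

```python
import random


def reparam_grad(mu, sigma=1.0, n=100_000, seed=0):
    """Estimate d/dmu E[z^2] for z ~ N(mu, sigma^2) via z = mu + sigma * eps."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        z = mu + sigma * rng.gauss(0.0, 1.0)
        total += 2.0 * z  # d(z^2)/dz times dz/dmu, where dz/dmu = 1
    return total / n


def score_function_grad(p, f, n=100_000, seed=0):
    """Estimate d/dp E[f(b)] for b ~ Bernoulli(p) via E[f(b) * d/dp log P(b)]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        b = 1 if rng.random() < p else 0
        # d/dp log P(b): 1/p if b == 1, -1/(1-p) if b == 0
        score = 1.0 / p if b == 1 else -1.0 / (1.0 - p)
        total += f(b) * score
    return total / n


if __name__ == "__main__":
    f = lambda b: (b - 0.2) ** 2
    print("reparameterization:", reparam_grad(1.5), "exact:", 2 * 1.5)
    print("score function:", score_function_grad(0.3, f), "exact:", f(1) - f(0))
```

The score function estimator only needs to evaluate `f`, never to differentiate through the sampled value, which is why it can handle the discrete "present" variable.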
* Certain activation functions, mainly sigmoid, tanh, hard-sigmoid and hard-tanh, can saturate. * That means that their gradient is either flat 0 after threshold values (e.g. -1 and +1) or that it approaches zero for high/low values. * If there's no gradient, training becomes slow or stops completely. * That's a problem, because sigmoid, tanh, hard-sigmoid and hard-tanh are still often used in some models, like LSTMs, GRUs or Neural Turing Machines. * To fix the saturation problem, they add noise to the output of the activation functions. * The noise increases as the unit saturates. * Intuitively, once the unit is saturating, it will occasionally "test" an activation in the non-saturating regime to see if that output performs better. ### How * The basic formula is: `phi(x,z) = alpha*h(x) + (1-alpha)u(x) + d(x)std(x)epsilon` * Variables in that formula: * Non-linear part `alpha*h(x)`: * `alpha`: A constant hyperparameter that determines the "direction" of the noise and the slope. Values below 1.0 let the noise point away from the unsaturated regime. Values above 1.0 let it point towards the unsaturated regime (higher alpha = stronger noise). * `h(x)`: The original activation function. * Linear part `(1-alpha)u(x)`: * `u(x)`: First-order Taylor expansion of h(x). * For sigmoid: `u(x) = 0.25x + 0.5` * For tanh: `u(x) = x` * For hard-sigmoid: `u(x) = max(min(0.25x+0.5, 1), 0)` * For hard-tanh: `u(x) = max(min(x, 1), -1)` * Noise/Stochastic part `d(x)std(x)epsilon`: * `d(x) = -sgn(x)sgn(1-alpha)`: Changes the "direction" of the noise. * `std(x) = c(sigmoid(p*v(x))-0.5)^2 = c(sigmoid(p*(h(x)-u(x)))-0.5)^2` * `c` is a hyperparameter that controls the scale of the standard deviation of the noise. * `p` controls the magnitude of the noise. Due to the `sigmoid(y)-0.5` this can influence the sign. `p` is learned. * `epsilon`: A noise creating random variable. Usually either a Gaussian or the positive half of a Gaussian (i.e. `z` or `|z|`). 
* The hyperparameter `c` can be initialized at a high value and then gradually decreased over time. That would be comparable to simulated annealing. * Noise could also be applied to the input, i.e. `h(x)` becomes `h(x + noise)`. ### Results * They replaced sigmoid/tanh/hard-sigmoid/hard-tanh units in various experiments (without further optimizations). * The experiments were: * Learn to execute source code (LSTM?) * Language model from Penn Treebank (2-layer LSTM) * Neural Machine Translation engine trained on Europarl (LSTM?) * Image caption generation with soft attention trained on Flickr8k (LSTM) * Counting unique integers in a sequence of integers (LSTM) * Associative recall (Neural Turing Machine) * Noisy activations practically always led to a small or moderate improvement in resulting accuracy/NLL/BLEU. * In one experiment annealed noise significantly outperformed unannealed noise, even beating careful curriculum learning. (Somehow there are not more experiments about that.) * The Neural Turing Machine learned far faster with noisy activations and also converged to a much better solution. ![Influence of alphas](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Noisy_Activation_Functions__alphas.png?raw=true "Influence of alphas.") *Hard-tanh with noise for various alphas. Noise increases in different ways in the saturating regimes.* ![Neural Turing Machine results](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Noisy_Activation_Functions__ntm.png?raw=true "Neural Turing Machine results.") *Performance during training of a Neural Turing Machine with and without noisy activation units.*  ### Rough chapterwise notes * (1) Introduction * ReLU and Maxout activation functions have improved the capabilities of training deep networks. * Previously, tanh and sigmoid were used, which were only suited for shallow networks, because they saturate, which kills the gradient. 
* They suggest a different avenue: Use saturating nonlinearities, but inject noise when they start to saturate (and let the network learn how much noise is "good"). * The noise makes it possible to train deep networks with saturating activation functions. * Many current architectures (LSTMs, GRUs, Neural Turing Machines, ...) require "hard" decisions (yes/no). But they use "soft" activation functions to implement those, because hard functions lack gradient. * The soft activation functions can still saturate (no more gradient) and don't match the nature of the binary decision problem. So it would be good to replace them with something better. * They instead use hard activation functions and compensate for the lack of gradient by using noise (during training). * Networks with hard activation functions outperform those with soft ones. * (2) Saturating Activation Functions * Activation Function = A function that maps a real value to a new real value and is differentiable almost everywhere. * Right saturation = The gradient of an activation function becomes 0 if the input value goes towards +infinity. * Left saturation = The gradient of an activation function becomes 0 if the input value goes towards -infinity. * Saturation = An activation function saturates if it right-saturates and left-saturates. * Hard saturation = There is a constant c beyond which the gradient becomes exactly 0. * Soft saturation = There is no such constant, i.e. the input value must go to +/- infinity. * Soft saturating activation functions can be converted to hard saturating ones by using a first-order Taylor expansion and then clipping the values to the required range (e.g. 0 to 1). * A hard saturating tanh is just `f(x) = x` with clipping to [-1, 1]: `max(min(f(x), 1), -1)`. * The gradient for hard activation functions is 0 above/below certain constants, which will make training significantly more challenging. 
* hard-sigmoid, sigmoid and tanh are contractive mappings, hard-tanh for some reason only when it's greater than the threshold. * The fixed point for tanh is 0, for the others !=0. That can influence the training performance. * (3) Annealing with Noisy Activation Functions * Suppose that there is an activation function like hard-sigmoid or hard-tanh with additional noise (iid, mean=0, variance=std^2). * If the noise's `std` is 0 then the activation function is the original, deterministic one. * If the noise's `std` is very high then the derivatives and gradient become high too. The noise then "drowns" the signal and the optimizer just moves randomly through the parameter space. * Let the signal to noise ratio be `SNR = std_signal / std_noise`. So if SNR is low then noise drowns the signal and exploration is random. * By letting SNR grow (i.e. decreasing `std_noise`) we switch the model to fine tuning mode (less coarse exploration). * That is similar to simulated annealing, where noise is also gradually decreased to focus on better and better regions of the parameter space. * (4) Adding Noise when the Unit Saturates * This approach does not always add the same noise. Instead, noise is added proportionally to the saturation magnitude. More saturation, more noise. * That results in a clean signal in "good" regimes (non-saturation, strong gradients) and a noisy signal in "bad" regimes (saturation). * Basic activation function with noise: `phi(x, z) = h(x) + (mu + std(x)*z)`, where `h(x)` is the saturating activation function, `mu` is the mean of the noise, `std` is the standard deviation of the noise and `z` is a random variable. * Ideally the noise is unbiased so that the expectation values of `phi(x,z)` and `h(x)` are the same. * `std(x)` should take higher values as h(x) enters the saturating regime. * To calculate how "saturating" an activation function is, one can compute `v(x) = h(x) - u(x)`, where `u(x)` is the first-order Taylor expansion of `h(x)`. 
* Empirically they found that a good choice is `std(x) = c(sigmoid(p*v(x)) - 0.5)^2` where `c` is a hyperparameter and `p` is learned. * (4.1) Derivatives in the Saturated Regime * For values below the threshold, the gradient of the noisy activation function is identical to the normal activation function. * For values above the threshold, the gradient of the noisy activation function is `phi'(x,z) = std'(x)*z`. (Assuming that z is unbiased so that mu=0.) * (4.2) Pushing Activations towards the Linear Regime * In saturated regimes, one would like to have more of the noise point towards the unsaturated regimes than away from them (i.e. let the model try often whether the unsaturated regimes might be better). * To achieve this they use the formula `phi(x,z) = alpha*h(x) + (1-alpha)u(x) + d(x)std(x)epsilon` * `alpha`: A constant hyperparameter that determines the "direction" of the noise and the slope. Values below 1.0 let the noise point away from the unsaturated regime. Values above 1.0 let it point towards the unsaturated regime (higher alpha = stronger noise). * `h(x)`: The original activation function. * `u(x)`: First-order Taylor expansion of h(x). * `d(x) = -sgn(x)sgn(1-alpha)`: Changes the "direction" of the noise. * `std(x) = c(sigmoid(p*v(x))-0.5)^2 = c(sigmoid(p*(h(x)-u(x)))-0.5)^2` with `c` being a hyperparameter and `p` learned. * `epsilon`: Either `z` or `|z|`. If `z` is a Gaussian, then `|z|` is called "half-normal" while just `z` is called "normal". Half-normal lets the noise only point towards one "direction" (towards the unsaturated regime or away from it), while normal noise lets it point in both directions (with the slope being influenced by `alpha`). * The formula can be split into three parts: * `alpha*h(x)`: Nonlinear part. * `(1-alpha)u(x)`: Linear part. * `d(x)std(x)epsilon`: Stochastic part. * Each of these parts resembles a path along which gradient can flow through the network. 
* During test time the activation function is made deterministic by using its expectation value: `E[phi(x,z)] = alpha*h(x) + (1-alpha)u(x) + d(x)std(x)E[epsilon]`. * If `epsilon` is half-normal then `E[epsilon] = sqrt(2/pi)`. If `epsilon` is normal then `E[epsilon] = 0`. * (5) Adding Noise to Input of the Function * Noise can also be added to the input of an activation function, i.e. `h(x)` becomes `h(x + noise)`. * The noise can either always be applied or only once the input passes a threshold. * (6) Experimental Results * They applied noise only during training. * They used existing setups and just changed the activation functions to noisy ones. No further optimizations. * `p` was initialized uniformly from [-1,1]. * Basic experiment settings: * NAN: Normal noise applied to the outputs. * NAH: Half-normal noise, i.e. `|z|`, i.e. noise is "directed" towards the unsaturated or saturated regime. * NANI: Normal noise applied to the *input*, i.e. `h(x+noise)`. * NANIL: Normal noise applied to the input with learned variance. * NANIS: Normal noise applied to the input, but only if the unit saturates (i.e. above/below thresholds). * (6.1) Exploratory analysis * A very simple MNIST network performed slightly better with noisy activations than without. But comparison was only to tanh and hard-tanh, not ReLU or similar. * In an experiment with a simple GRU, NANI (noisy input) and NAN (noisy output) performed practically identically. NANIS (noisy input, only when saturated) performed significantly worse. * (6.2) Learning to Execute * Problem setting: Predict the output of some lines of code. * They replaced sigmoids and tanhs with their noisy counterparts (NAH, i.e. half-normal noise on output). The model learned faster. * (6.3) Penn Treebank Experiments * They trained a standard 2-layer LSTM language model on Penn Treebank. * Their model used noisy activations, as opposed to the usual non-noisy ones. * They could improve upon the previously best value. 
Normal noise and half-normal noise performed roughly the same. * (6.4) Neural Machine Translation Experiments * They replaced all sigmoids and tanh units in the Neural Attention Model with noisy ones. Then they trained on the Europarl corpus. * They improved upon the previously best score. * (6.5) Image Caption Generation Experiments * They train a network with soft attention to generate captions for the Flickr8k dataset. * Using noisy activation units improved the result over normal sigmoids and tanhs. * (6.6) Experiments with Continuation * They build an LSTM and train it to predict how many unique integers there are in a sequence of random integers. * Instead of using a constant value for hyperparameter `c` of the noisy activations (scale of the standard deviation of the noise), they start at `c=30` and anneal down to `c=0.5`. * Annealed noise performed significantly better than unannealed noise. * Noise applied to the output (NAN) significantly beat noise applied to the input (NANIL). * In a second experiment they trained a Neural Turing Machine on the associative recall task. * Again they used annealed noise. * The NTM with annealed noise learned far faster than the one without annealed noise and converged to a perfect solution. 
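
The formula from (4.2) can be written out for a single scalar unit. The sketch below is a reading of these notes, not the authors' code: it assumes `h` is hard-tanh, `u(x) = x`, and the sign convention `d(x) = -sgn(x)sgn(1-alpha)`. The default values for `alpha`, `c` and `p` are arbitrary placeholders (in the paper `p` is learned).

```python
import math
import random


def sgn(v):
    """Sign of v as -1, 0 or 1."""
    return (v > 0) - (v < 0)


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def hard_tanh(x):
    return max(min(x, 1.0), -1.0)


def noise_std(x, c=1.0, p=1.0):
    """std(x) = c * (sigmoid(p * v(x)) - 0.5)^2 with v(x) = h(x) - u(x), u(x) = x."""
    v = hard_tanh(x) - x  # zero while unsaturated, grows with saturation
    return c * (sigmoid(p * v) - 0.5) ** 2


def noisy_hard_tanh(x, alpha=1.15, c=1.0, p=1.0,
                    half_normal=True, training=True, rng=None):
    """phi(x, z) = alpha*h(x) + (1-alpha)*u(x) + d(x)*std(x)*epsilon."""
    h, u = hard_tanh(x), x
    d = -sgn(x) * sgn(1.0 - alpha)
    if training:
        z = (rng or random).gauss(0.0, 1.0)
        eps = abs(z) if half_normal else z
    else:
        # Test time: replace epsilon by its expectation value.
        eps = math.sqrt(2.0 / math.pi) if half_normal else 0.0
    return alpha * h + (1.0 - alpha) * u + d * noise_std(x, c, p) * eps


if __name__ == "__main__":
    rng = random.Random(0)
    for x in (0.3, 1.5, 3.0):
        print(x, noise_std(x), noisy_hard_tanh(x, rng=rng))
```

Inside the linear regime `v(x)` is 0, so the noise vanishes and the unit returns `x` unchanged; the further the input moves into saturation, the larger `std(x)` becomes.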
* The authors train a variant of AlexNet that has significantly fewer parameters than the original network, while keeping the network's accuracy stable. * Advantages of this: * More efficient distributed training, because fewer parameters have to be transferred. * More efficient transfer via the internet, because the model's file size is smaller. * Possibly less memory demand in production, because fewer parameters have to be kept in memory. ### How * They define a Fire Module. A Fire Module consists of: * Squeeze Module: A 1x1 convolution that reduces the number of channels (e.g. from 128x32x32 to 64x32x32). * Expand Module: A 1x1 convolution and a 3x3 convolution, both applied to the output of the Squeeze Module. Their results are concatenated. * Using many 1x1 convolutions is advantageous, because they need fewer parameters than 3x3s. * They use ReLUs, only convolutions (no fully connected layers) and Dropout (50%, before the last convolution). * They use late max-pooling. They argue that applying pooling late (rather than early) improves accuracy while not needing more parameters. * They try residual connections: * One network without any residual connections (performed the worst). * One network with residual connections based on identity functions, but only between layers of same dimensionality (performed the best). * One network with residual connections based on identity functions and other residual connections with 1x1 convs (where dimensionality changed) (performance between the other two). * They use pruning from Deep Compression to reduce the parameters further. Pruning simply collects the 50% of all parameters of a layer that have the lowest absolute values and sets them to zero. That creates a sparse matrix. ### Results * 50x parameter reduction of AlexNet (1.2M parameters before pruning, 0.4M after pruning). * 510x file size reduction of AlexNet (from 240 MB to 0.47 MB) when combined with Deep Compression. * Top-1 accuracy remained stable. 
* Pruning apparently can be used safely, even after the network parameters have already been reduced significantly. * While pruning was generally safe, they found that two of their later layers reacted quite sensitively to it. Adding parameters to these (instead of removing them) actually significantly improved accuracy. * Generally they found 1x1 convs to react more sensitively to pruning than 3x3s. Therefore they focused pruning on 3x3 convs. * First pruning a network, then re-adding the pruned weights (initialized with 0s) and then retraining for some time significantly improved accuracy. * The network was rather resilient to significant channel reduction in the Squeeze Modules. Reducing to 25-50% of the original channels (e.g. 128x32x32 to 64x32x32) seemed to be a good choice. * The network was rather resilient to removing 3x3 convs and replacing them with 1x1 convs. A ratio of 2:1 to 1:1 (1x1 to 3x3) seemed to produce good results while mostly keeping the accuracy. * Adding some residual connections between the Fire Modules improved the accuracy. * Adding residual connections with identity functions *and also* residual connections with 1x1 convs (where dimensionality changed) improved the accuracy, but not as much as using *only* residual connections with identity functions (i.e. it's better to keep some modules without residual connections).  ### Rough chapterwise notes * (1) Introduction and Motivation * Advantages of having fewer parameters: * More efficient distributed training, because less data (parameters) has to be transferred. * Less data to transfer to clients, e.g. when a model used by some app is updated. * FPGAs often have hardly any memory, i.e. a model has to be small to be executed. * Target here: Find a CNN architecture with fewer parameters than an existing one but comparable accuracy. * (2) Related Work * (2.1) Model Compression * SVD method: Just apply SVD to the parameters of an existing model. 
* Network Pruning: Replace parameters below a threshold with zeros (-> sparse matrix), then retrain a bit. * Add quantization and Huffman encoding to network pruning = Deep Compression. * (2.2) CNN Microarchitecture * The term "CNN Microarchitecture" refers to the "organization and dimensions of the individual modules" (so an Inception module would have a complex CNN microarchitecture). * (2.3) CNN Macroarchitecture * CNN Macroarchitecture = "big picture" / organization of many modules in a network / general characteristics of the network, like depth * Adding connections between modules can help (e.g. residual networks) * (2.4) Neural Network Design Space Exploration * Approaches for Design Space Exploration (DSE): * Bayesian Optimization, Simulated Annealing, Randomized Search, Genetic Algorithms * (3) SqueezeNet: preserving accuracy with few parameters * (3.1) Architectural Design Strategies * A conv layer with N filters applied to CxHxW input (e.g. 3x128x128 for a possible first layer) with kernel size kHxkW (e.g. 3x3) has `N*C*kH*kW` parameters. * So one way to reduce the parameters is to decrease kH and kW, e.g. from 3x3 to 1x1 (reduces parameters by a factor of 9). * A second way is to reduce the number of channels (C), e.g. by using 1x1 convs before the 3x3 ones. * They think that accuracy can be improved by performing downsampling later in the network (if parameter count is kept constant). * (3.2) The Fire Module * The Fire Module has two components: * Squeeze Module: * One layer of 1x1 convs * Expand Module: * Concat the results of: * One layer of 1x1 convs * One layer of 3x3 convs * The Squeeze Module decreases the number of input channels significantly. * The Expand Module then increases the number of channels again. * (3.3) The SqueezeNet architecture * One standalone conv, then several fire modules, then a standalone conv, then global average pooling, then softmax. * Three late max-pooling layers. * Gradual increase of filter numbers. 
* (3.3.1) Other SqueezeNet details * ReLU activations * Dropout before the last conv layer. * No linear layers. * (4) Evaluation of SqueezeNet * Results of competing methods: * SVD: 5x compression, 56% top-1 accuracy * Pruning: 9x compression, 57.2% top-1 accuracy * Deep Compression: 35x compression, ~57% top-1 accuracy * SqueezeNet: 50x compression, ~57% top-1 accuracy * SqueezeNet combines low parameter counts with Deep Compression. * The accuracy does not go down because of that, i.e. apparently Deep Compression can even be applied to small models without giving up on performance. * (5) CNN Microarchitecture Design Space Exploration * (5.1) CNN Microarchitecture metaparameters * They test various values for several metaparameters. * (5.2) Squeeze Ratio * In a Fire Module there is first a Squeeze Module and then an Expand Module. The Squeeze Module decreases the number of input channels to which 1x1 and 3x3 both are applied (at the same time). * They analyzed how far you can go down with the Squeeze Module by training multiple networks and calculating the top-5 accuracy for each of them. * The accuracy by Squeeze Ratio (percentage of input channels kept in 1x1 squeeze, i.e. 50% = reduced by half, e.g. from 128 to 64): * 12%: ~80% top-5 accuracy * 25%: ~82% top-5 accuracy * 50%: ~85% top-5 accuracy * 75%: ~86% top-5 accuracy * 100%: ~86% top-5 accuracy * (5.3) Trading off 1x1 and 3x3 filters * Similar to the Squeeze Ratio, they analyze the optimal ratio of 1x1 filters to 3x3 filters. * E.g. 50% would mean that half of all filters in each Fire Module are 1x1 filters.
* Results: * 01%: ~76% top-5 accuracy * 12%: ~80% top-5 accuracy * 25%: ~82% top-5 accuracy * 50%: ~85% top-5 accuracy * 75%: ~85% top-5 accuracy * 99%: ~85% top-5 accuracy * (6) CNN Macroarchitecture Design Space Exploration * They compare the following networks: * (1) Without residual connections * (2) With residual connections between modules of same dimensionality * (3) With residual connections between all modules (except pooling layers) using 1x1 convs (instead of identity functions) where needed * Adding residual connections (2) improved top-1 accuracy from 57.5% to 60.4% without any new parameters. * Adding complex residual connections (3) worsened top-1 accuracy again to 58.8%, while adding new parameters. * (7) Model Compression Design Space Exploration * (7.1) Sensitivity Analysis: Where to Prune or Add parameters * They went through all layers (including each one in the Fire Modules). * In each layer they set the 50% smallest weights to zero (pruning) and measured the effect on the top-5 accuracy. * It turns out that doing that has basically no influence on the top-5 accuracy in most layers. * Two layers towards the end however had significant influence (accuracy went down by 5-10%). * Adding parameters to these layers improved top-1 accuracy from 57.5% to 59.5%. * Generally they found 1x1 layers to be more sensitive than 3x3 layers so they pruned them less aggressively. * (7.2) Improving Accuracy by Densifying Sparse Models * They found that first pruning a model and then retraining it again (initializing the pruned weights to 0) leads to higher accuracy. * They could improve top-1 accuracy by 4.3% in this way.
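The parameter formula `N*C*kH*kW` from (3.1) and the Fire Module layout from (3.2) can be checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' code: the function names and the example channel counts (128 input channels, 16 squeeze filters, 64 + 64 expand filters) are illustrative assumptions, and biases are ignored.

```python
def conv_params(n_filters, in_channels, k_h, k_w):
    """Weights of a conv layer: N * C * kH * kW (biases ignored)."""
    return n_filters * in_channels * k_h * k_w


def fire_params(in_channels, squeeze, expand_1x1, expand_3x3):
    """Weights of a Fire Module: 1x1 squeeze, then 1x1 and 3x3 expand in parallel."""
    squeeze_w = conv_params(squeeze, in_channels, 1, 1)
    # Both expand layers read the squeezed tensor, so their C is `squeeze`.
    expand_1x1_w = conv_params(expand_1x1, squeeze, 1, 1)
    expand_3x3_w = conv_params(expand_3x3, squeeze, 3, 3)
    return squeeze_w + expand_1x1_w + expand_3x3_w


# Hypothetical example: 128 channels in, 128 channels out.
plain = conv_params(128, 128, 3, 3)   # ordinary 3x3 conv layer
fire = fire_params(128, 16, 64, 64)   # Fire Module with the same output width
print(plain, fire, plain / fire)
```

With these assumed sizes the Fire Module needs 12x fewer weights than the plain 3x3 layer, which illustrates both strategies at once: fewer 3x3 filters and fewer input channels in front of them.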
* They define four subtasks of image understanding: * *Classification*: Assign a single label to a whole image. * *Captioning*: Assign a sequence of words (description) to a whole image* * *Detection*: Find objects/regions in an image and assign a single label to each one. * *Dense Captioning*: Find objects/regions in an image and assign a sequence of words (description) to each one. * DenseCap accomplishes the fourth task, i.e. it is a model that finds objects/regions in images and describes them with natural language. ### How * Their model consists of four subcomponents, which run for each image in sequence: * (1) **Convolutional Network**: * Basically just VGG16. * (2) **Localization Layer**: * This layer uses a convolutional network that has mostly the same architecture as in the "Faster RCNN" paper. * That ConvNet is applied to a grid of anchor points on the image. * For each anchor point, it extracts the features generated by the VGGNet (model 1) around that point. * It then generates the attributes of `k` (default: 12) boxes using a shallow convolutional net. These attributes are (roughly): Height, width, center x, center y, confidence score. * It then extracts the features of these boxes from the VGGNet output (model 1) and uses bilinear sampling to project them onto a fixed size (height, width) for the next model. The result are the final region proposals. * By default every image pixel is an anchor point, which results in a large number of regions. Hence, subsampling is used during training and testing. * (3) **Recognition Network**: * Takes a region (flattened to 1d vector) and projects it onto a vector of length 4096. * It uses fully connected layers to do that (ReLU, dropout). * Additionally, the network takes the 4096 vector and outputs new values for the region's position and confidence (for late fine tuning). * The 4096 vectors of all regions are combined to a matrix that is fed into the next component (RNN). 
* The intended purpose of this component seems to be to convert the "visual" features of each region to a more abstract, high-dimensional representation/description. * (4) **RNN Language Model**: * They take each 4096 vector and apply a fully connected layer + ReLU to it. * Then they feed it into an LSTM, followed by a START token. * The LSTM then generates words (as one-hot vectors), which are fed back into the model for the next time step. * This is continued until the LSTM generates an END token. * Their full loss function has five components: * Binary logistic loss for the confidence values generated by the localization layer. * Binary logistic loss for the confidence values generated by the recognition layer. * Smooth L1 loss for the region dimensions generated by the localization layer. * Smooth L1 loss for the region dimensions generated by the recognition layer. * Cross-entropy at every timestep of the language model. * The whole model can be trained end-to-end. * Results * They mostly use the Visual Genome dataset. * Their model finds lots of good regions in images. * Their model generates good captions for each region. (Only short captions with simple language however.) * The model seems to love colors. Like 30-50% of all captions contain a color. (Probably caused by the dataset?) * They compare to EdgeBoxes (other method to find regions in images). Their model seems to perform better. * Their model requires about 240ms per image (test time). * The generated regions and captions enable one to search for specific objects in images using text queries. ![Architecture](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/DenseCap__architecture.png?raw=true "Architecture.") *Architecture of the whole model. It starts with the VGGNet ("CNN"), followed by the localization layer, which generates region proposals. Then the recognition network converts the regions to abstract high-dimensional representations.
Then the language model ("RNN") generates the caption.* ![Elephant image with dense captioning.](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/DenseCap__elephant.png?raw=true "Elephant image with dense captioning.") ![Airplane image with dense captioning.](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/DenseCap__airplane.png?raw=true "Airplane image with dense captioning.")  ### Rough chapterwise notes * (1) Introduction * They define four subtasks of visual scene understanding: * Classification: Assign a single label to a whole image * Captioning: Assign a sequence of words (description) to a whole image * Detection: Find objects in an image and assign a single label to each one * Dense Captioning: Find objects in an image and assign a sequence of words (description) to each one * They developed a model for dense captioning. * It has three important components: * A convolutional network for scene understanding * A localization layer for region level predictions. It predicts regions of interest and then uses bilinear sampling to extract the activations of these regions. * A recurrent network as the language model * They evaluate the model on the large-scale Visual Genome dataset (94k images, 4.1M region captions). * (3) Model * Model architecture * Convolutional Network * They use VGG16, but remove the last pooling layer. * For an image of size W, H the output is 512xW/16xH/16. * That output is the input into the localization layer. * Fully Convolutional Localization Layer * Input to this layer: Activations from the convolutional network. * Output of this layer: Regions of interest, as fixed-size representations. * For B Regions: * Coordinates of the bounding boxes (matrix of shape Bx4) * Confidence scores (vector of length B) * Features (matrix of shape BxCxXxY) * Method: Faster R-CNN (pooling replaced by bilinear interpolation) * This layer is fully differentiable.
* The localization layer predicts boxes at anchor points. * At each anchor point it proposes `k` boxes using a small convolutional network. It assigns a confidence score and coordinates (center x, center y, height, width) to each proposal. * For an image with size 720x540 and k=12 the model would have to predict 17,280 boxes, hence subsampling is used. * During training they use minibatches with 256/2 positive and 256/2 negative region examples. A box counts as a positive example for a specific image if it has high overlap (intersection) with an annotated box for that image. * During test time they use greedy non-maximum suppression (NMS) (?) to subsample the 300 most confident boxes. * The region proposals have varying box sizes, but the output of the localization layer (which will be fed into the RNN) has to have a fixed size. * So they project each proposed region onto a fixed-size region. They use bilinear sampling for that projection, which is differentiable. * Recognition network * Each region is flattened to a one-dimensional vector. * That vector is fed through 2 fully connected layers (unknown size, ReLU, dropout), ending with a 4096 neuron layer. * The confidence score and box coordinates are also adjusted by the network during that process (fine tuning). * RNN Language Model * Each region is translated to a sentence. * The region is fed into an LSTM (after a linear layer + ReLU), followed by a special START token. * The LSTM outputs multiple words as one-hot vectors, where each vector has the length `V+1` (i.e. vocabulary size + END token). * Loss function is average cross-entropy between output words and target words. * During test time, words are sampled until an END tag is generated. * Loss function * Their full loss function has five components: * Binary logistic loss for the confidence values generated by the localization layer. * Binary logistic loss for the confidence values generated by the recognition layer.
* Smooth L1 loss for the region dimensions generated by the localization layer. * Smooth L1 loss for the region dimensions generated by the recognition layer. * Cross-entropy at every timestep of the language model. * The language model term has a weight of 1.0, all other components have a weight of 0.1. * Training and optimization * Initialization: CNN pretrained on ImageNet, all other weights from `N(0, 0.01)`. * SGD for the CNN (lr=?, momentum=0.9) * Adam everywhere else (lr=1e-6, beta1=0.9, beta2=0.99) * CNN is trained after epoch 1. CNN's first four layers are not trained. * Batch size is 1. * Image size is 720 on the longest side. * They use Torch. * 3 days of training time. * (4) Experiments * They use the Visual Genome Dataset (94k images, 4.1M regions with captions) * Their total vocabulary size is 10,497 words. (Rare words in captions were replaced with `<UNK>`.) * They throw away annotations with too many words as well as images with too few/too many regions. * They merge heavily overlapping regions to single regions with multiple captions. * Dense Captioning * Dense captioning task: The model receives one image and produces a set of regions, each having a caption and a confidence score. * Evaluation metrics * Evaluation of the output is non-trivial. * They compare predicted regions with regions from the annotation that have high overlap (above a threshold). * They then compare the predicted caption with the reference captions via the METEOR score (above a threshold). * Instead of setting one threshold for each comparison they use multiple thresholds. Then they calculate the Mean Average Precision using the various pairs of thresholds. * Baseline models * Sources of region proposals during test time: * GT: Ground truth boxes (i.e. found by humans). * EB: EdgeBox (completely separate and pretrained system). * RPN: Their localization and recognition networks trained separately on VG regions dataset (i.e. trained without the RNN language model).
* Models: * Region RNN model: Apparently the recognition layer and the RNN language model, trained on predefined regions. (Where do these regions come from? VG training dataset?) * Full Image RNN model: Apparently the recognition layer and the RNN language model, trained on full images from MSCOCO instead of small regions. * FCLN on EB: Apparently the recognition layer and the RNN language model, trained on regions generated by EdgeBox (EB) (on VG dataset?). * FCLN: Apparently their full model (trained on VG dataset?). * Discrepancy between region and image level statistics * When evaluating the models only on METEOR (language "quality"), the *Region RNN model* consistently outperforms the *Full Image RNN model*. * That's probably because the *Full Image RNN model* was trained on captions of whole images, while the *Region RNN model* was trained on captions of small regions, which tend to be a bit different from full image captions. * RPN outperforms external region proposals * Generating region proposals via RPN basically always beats EB. * Our model outperforms individual region description * Their full jointly trained model (FCLN) achieves the best results. * The full jointly trained model performs significantly better than `RPN + Region RNN model` (i.e. separately trained region proposal and region captioning networks). * Qualitative results * Finds plenty of good regions and generates reasonable captions for them. * Sometimes finds the same region twice. * Runtime evaluation * 240ms on 720x600 image with 300 region proposals. * 166ms on 720x600 image with 100 region proposals. * Recognition of region proposals takes up most time. * Generating region proposals takes up the 2nd most time. * Generating captions for regions (RNN) takes almost no time. * Image Retrieval using Regions and Captions * They try to search for regions based on search queries. * They search by letting their FCLN network or EB generate 100 region proposals per network. 
Then they calculate per region the probability of generating the search query as the caption. They use that probability to rank the results. * They pick images from the VG dataset, then pick captions within those images as search query. Then they evaluate the ranking of those images for the respective search query. * The results show that the model can learn to rank objects, object parts, people and actions as expected/desired. * The method described can also be used to detect an arbitrary number of distinct classes in images (as opposed to the usual 10 to 1000 classes), because the classes are contained in the generated captions. 
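The greedy non-maximum suppression step mentioned for test time (keeping the 300 most confident boxes) can be sketched as follows. This is a generic NMS sketch in plain Python, not the authors' Torch code: the corner box format `(x1, y1, x2, y2)`, the helper names and the default IoU threshold of 0.7 are assumptions made for illustration; only the limit of 300 boxes comes from the notes above.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.7, max_keep=300):
    """Return indices of kept boxes, most confident first.

    A box is kept only if it does not overlap an already kept
    (more confident) box by more than `iou_threshold`.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
            if len(keep) == max_keep:
                break
    return keep


# Hypothetical toy input: two heavily overlapping boxes and one separate box.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(greedy_nms(boxes, scores, iou_threshold=0.5))
```

In the toy input the second box overlaps the first with an IoU of about 0.68, so it is suppressed at a threshold of 0.5 but would survive at 0.7.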
* Stochastic Depth (SD) is a method for residual networks, which randomly removes/deactivates residual blocks during training. * As such, it is similar to dropout. * While dropout removes neurons, SD removes blocks (roughly the layers of a residual network). * One can argue that dropout randomly changes the width of layers, while SD randomly changes the depth of the network. * One can argue that using dropout is similar to training an ensemble of networks with different layer widths, while using SD is similar to training an ensemble of networks with different depths. * Using SD has the following advantages: * It decreases the effects of vanishing gradients, because on average the network is shallower during training (per batch), thereby increasing the gradient that reaches the early blocks. * It increases training speed, because on average less convolutions have to be applied (due to blocks being removed). * It has a regularizing effect, because blocks cannot easily coadapt any more. (Similar to dropout avoiding coadaption of neurons.) * If using an increasing removal probability for later blocks: It spends more training time on the early (and thus most important) blocks than on the later blocks. ### How * Normal formula for a residual block (test and train): * `output = ReLU(f(input) + identity(input))` * `f(x)` are usually one or two convolutions. * Formula with SD (during training): * `output = ReLU(b * f(input) + identity(input))` * `b` is either exactly `1` (block survived, i.e. is not removed) or exactly `0` (block was removed). * `b` is sampled from a bernoulli random variable that has the hyperparameter `p`. * `p` is the survival probability of a block (i.e. chance to *not* be removed). (Note that this is the opposite of dropout, where higher values lead to more removal.) * Formula with SD (during test): * `output = ReLU(p * f(input) + input)` * `p` is the average probability with which this residual block survives during training, i.e. 
the hyperparameter for the Bernoulli variable. * The test formula has to be changed, because the network will adapt during training to blocks being missing. Activating them all at the same time can lead to overly strong signals. This is similar to dropout, where weights also have to be changed during test. * There are two simple schemas to set `p` per layer: * Uniform schema: Every block gets the same `p` hyperparameter, i.e. the last block has the same chance of survival as the first block. * Linear decay schema: Survival probability is higher for early layers and decreases towards the end. * The formula is `p = 1 - (l/L)(1 - q)`. * `l`: Number of the block for which to set `p`. * `L`: Total number of blocks. * `q`: Desired survival probability of the last block (0.5 is a good value). * For linear decay with `q=0.5` and `L` blocks, on average `(3/4)L` blocks will be trained per minibatch. * For linear decay with `q=0.5` the average speedup will be about `1/4` (25%). If using `q=0.2` the speedup will be ~40%. ### Results * 152 layer networks with SD outperform identical networks without SD on CIFAR-10, CIFAR-100 and SVHN. * The improvement in test error is quite significant. * SD seems to have a regularizing effect. Networks with SD are not overfitting where networks without SD already are. * Even networks with >1000 layers are well trainable with SD. * The gradients that reach the early blocks of the networks are consistently significantly higher with SD than without SD (i.e. less vanishing gradient). * The linear decay schema consistently outperforms the uniform schema (in test error). The best value seems to be `q=0.5`, though values between 0.4 and 0.8 all seem to be good. For the uniform schema only 0.8 seems to be good.
![SVHN 152 layers](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Deep_Networks_with_Stochastic_Depth__svhn.png?raw=true "SVHN 152 layers") *Performance on SVHN with 152 layer networks with SD (blue, bottom) and without SD (red, top).* ![CIFAR10 1202 layers](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Deep_Networks_with_Stochastic_Depth__svhn1202.png?raw=true "CIFAR10 1202 layers") *Performance on CIFAR10 with 1202 layer networks with SD (blue, bottom) and without SD (red, top).* ![Optimal choice of p_L](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Deep_Networks_with_Stochastic_Depth__optimal_p.png?raw=true "Optimal choice of p_l") *Optimal choice of the survival probability `p_L` (in this summary `q`) for the last layer, for the uniform schema (same for all other layers) and the linear decay schema (decreasing towards `p_L`). Linear decay performs consistently better and allows for lower `p_L` values, leading to more speedup.*  ### Rough chapterwise notes * (1) Introduction * Problems of deep networks: * Vanishing Gradients: During backpropagation, gradients approach zero due to being repeatedly multiplied with small weights. Possible countermeasures: Careful initialization of weights, "hidden layer supervision" (?), batch normalization. * Diminishing feature reuse: Aequivalent problem to vanishing gradients during forward propagation. Results of early layers are repeatedly multiplied with later layer's (randomly initialized) weights. The total result then becomes meaningless noise and doesn't have a clear/strong gradient to fix it. * Long training time: The time of each forwardbackward increases linearly with layer depth. Current 152layer networks can take weeks to train on ImageNet. * I.e.: Shallow networks can be trained effectively and fast, but deep networks would be much more expressive. * During testing we want deep networks, during training we want shallow networks. 
* They randomly "drop out" (i.e. remove) complete layers during training (per minibatch), resulting in shallow networks. * Result: Lower training time *and* lower test error. * While dropout randomly removes width from the network, stochastic depth randomly removes depth from the networks. * While dropout can be thought of as training an ensemble of networks with different widths, stochastic depth can be thought of as training an ensemble of networks with different depths. * Stochastic depth acts as a regularizer, similar to dropout and batch normalization. It allows deeper networks without overfitting (because 1000 layers clearly wasn't enough!). * (2) Background * Some previous methods to train deep networks: Greedy layer-wise training, careful initializations, batch normalization, highway connections, residual connections. * <Standard explanation of residual networks> * <Standard explanation of dropout> * Dropout loses effectiveness when combined with batch normalization. Seems to have basically no benefit any more for deep residual networks with batch normalization. * (3) Deep Networks with Stochastic Depth * They randomly skip entire layers during training. * To do that, they use residual connections. They select random layers and use only the identity function for these layers (instead of the full residual block of identity + convolutions + add). * ResNet architecture: They use standard residual connections. ReLU activations, 2 convolutional layers (conv->BN->ReLU->conv->BN->add->ReLU). They use <= 64 filters per conv layer. * While the standard formula for residual connections is `output = ReLU(f(input) + identity(input))`, their formula is `output = ReLU(b * f(input) + identity(input))` with `b` being either 0 (inactive/removed layer) or 1 (active layer), i.e. `b` is a sample of a Bernoulli random variable. * The probabilities of the Bernoulli random variables are now hyperparameters, similar to dropout.
* Note that the probability here means the probability of *survival*, i.e. high value = more survivors. * The probabilities could be set uniformly, e.g. to 0.5 for each variable/layer. * They can also be set with a linear decay, so that the first layer has a very high probability of survival, while the last layer has a very low probability of survival. * Linear decay formula: `p = 1 - (l/L)(1 - q)` where `l` is the current layer's number, `L` is the total number of layers, `p` is the survival probability of layer `l` and `q` is the desired survival probability of the last layer (e.g. 0.5). * They argue that linear decay is better, as the early layers extract low level features and are therefore more important. * The expected number of surviving layers is simply the sum of the probabilities. * For linear decay with `q=0.5` and `L=54` (i.e. 54 residual blocks = 110 total layers) the expected number of surviving blocks is roughly `(3/4)L = (3/4)54 = 40`, i.e. on average 14 residual blocks will be removed per training batch. * With linear decay and `q=0.5` the expected speedup of training is about 25%. `q=0.2` leads to about 40% speedup (while in one test still achieving the test error of the same network without stochastic depth). * Depending on the `q` setting, they observe significantly lower test errors. They argue that stochastic depth has a regularizing effect (training an ensemble of many networks with different depths). * Similar to dropout, the forward pass rule during testing must be slightly changed, because the network was trained on missing values. The residual formula during test time becomes `output = ReLU(p * f(input) + input)` where `p` is the average probability with which this residual block survives during training. * (4) Results * Their model architecture: * Three chains of 18 residual blocks each, so 3*18 blocks per model. * Number of filters per conv.
layer: 16 (first chain), 32 (second chain), 64 (third chain) * Between each block they use average pooling. Then they zero-pad the new dimensions (e.g. from 16 to 32 at the end of the first chain). * CIFAR-10: * Trained with SGD (momentum=0.9, dampening=0, lr=0.1 after 1st epoch, 0.01 after epoch 250, 0.001 after epoch 375). * Weight decay/L2 of 1e-4. * Batch size 128. * Augmentation: Horizontal flipping, crops (4px offset). * They achieve 5.23% error (compared to 6.41% in the original paper about residual networks). * CIFAR-100: * Same settings as before. * 24.58% error with stochastic depth, 27.22% without. * SVHN: * They use both the hard and easy sub-datasets of images. * They preprocess to zero-mean, unit-variance. * Batch size 128. * Learning rate is 0.1 (start), 0.01 (after epoch 30), 0.001 (after epoch 35). * 1.75% error with stochastic depth, 2.01% error without. * Network without stochastic depth starts to overfit towards the end. * Stochastic depth with linear decay and `q=0.5` gives ~25% speedup. * 1202-layer CIFAR-10: * They trained a 1202-layer deep network on CIFAR-10 (previous tests: 152 layers). * Without stochastic depth: 6.72% test error. * With stochastic depth: 4.91% test error. * (5) Analytic experiments * Vanishing Gradient: * They analyzed the gradient that reaches the first layer. * The gradient with stochastic depth is consistently higher (throughout the epochs) than without stochastic depth. * The difference is very significant after decreasing the learning rate. * Hyperparameter sensitivity: * They evaluated the test error for different choices of the survival probability `q`. * Linear decay schema: Values between 0.4 and 0.8 perform best. 0.5 is suggested (nearly best value, good speedup). Even 0.2 improves the test error (compared to no stochastic depth). * Uniform schema: 0.8 performs best, other values mostly significantly worse. * Linear decay performs consistently better than the uniform schema.
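The linear decay rule and the train/test formulas from the notes above can be put together in a short sketch. This is a toy illustration in plain Python rather than the authors' implementation: the function names are made up, and the residual block operates on a single scalar with an arbitrary function `f` standing in for the convolutions.

```python
import random


def survival_probs(num_blocks, q=0.5):
    """Linear decay: p_l = 1 - (l / L) * (1 - q) for blocks l = 1..L."""
    return [1.0 - (l / num_blocks) * (1.0 - q) for l in range(1, num_blocks + 1)]


def residual_block(x, f, p, training, rng=random):
    """Stochastic depth on a scalar input (toy stand-in for a feature map)."""
    if training:
        # b is a Bernoulli sample: 1 = block survives, 0 = block is skipped.
        b = 1.0 if rng.random() < p else 0.0
        return max(0.0, b * f(x) + x)
    # Test time: the residual branch is scaled by its survival probability.
    return max(0.0, p * f(x) + x)


probs = survival_probs(54, q=0.5)
expected_depth = sum(probs)  # expected number of surviving blocks per batch
print(probs[0], probs[-1], expected_depth)
```

For `L=54` and `q=0.5` the sum of the survival probabilities is 40.25, which matches the "roughly `(3/4)L = 40`" figure in the notes.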
* They propose a CNN-based approach to detect faces in a wide range of orientations using a single model. However, since the training set is skewed, the network is more confident about upright faces. * The model does not require additional components such as segmentation, bounding-box regression, or SVM classifiers ### How * __Data augmentation__: to increase the number of positive samples (24K face annotations), the authors used randomly sampled sub-windows of the images with IOU > 50% and also randomly flipped these images. In total, there were 20K positive and 20M negative training samples. * __CNN Architecture__: 5 convolutional layers followed by 3 fully-connected. The fully-connected layers were converted to convolutional layers. Non-Maximal Suppression is applied to merge predicted bounding boxes. * __Training__: the CNN was trained using the Caffe library on the AFLW dataset with the following parameters: * Fine-tuning from the AlexNet model * Input image size = 227x227 * Batch size = 128 (32 positive, 96 negative) * Stride = 32 * __Test__: the model was evaluated on the PASCAL FACE, AFW, and FDDB datasets. * __Running time__: since the fully-connected layers were converted to convolutional layers, the input image at run time may be of any size, obtaining a heat map as output. To detect faces of different sizes though, the image is scaled up/down and new heatmaps are obtained. The authors found that rescaling the image 3 times per octave gives reasonably good performance. ![DDFD heatmap](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/DDFD__heatmap.png?raw=true "DDFD heatmap") * The authors realized that the model is more confident about upright faces than rotated/occluded ones. This trend is due to the lack of good training examples to represent such faces in the training process. Better results can be achieved by using better sampling strategies and more sophisticated data augmentation techniques.
![DDFD example](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/DDFD__example.png?raw=true "DDFD example") * The authors tested different strategies for NMS and the effect of bounding-box regression for improving face detection. NMS-avg had better performance than NMS-max in terms of average precision. On the other hand, adding a bounding-box regressor degraded the performance for both NMS strategies due to the mismatch between annotations of the training set and the test set. This mismatch is mostly for side-view faces. ### Results: * In comparison to R-CNN, the proposed face detector had significantly better performance independent of the NMS strategy. The authors attribute the inferior performance of R-CNN to a loss of recall, since selective search may miss some of the face regions, and to a loss in localization, since bounding-box regression is not perfect and may not be able to fully align the segmentation bounding-boxes, provided by selective search, with the ground truth. * In comparison to other state-of-the-art methods like structural model, TSM and cascade-based methods, DDFD achieves similar or better results. However, this comparison is not completely fair since most of these methods use extra information such as pose annotations or facial landmarks during training.
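The "3 times per octave" rescaling from the running-time note can be read as scale factors spaced by a factor of `2^(1/3)`, so that three steps double (or halve) the image size. The sketch below only computes such a list of scale factors; the function name and the choice of one octave down and one octave up are assumptions for illustration, not values from the paper.

```python
def pyramid_scales(octaves_down, octaves_up, per_octave=3):
    """Scale factors 2**(k / per_octave), from smallest to largest."""
    first = -octaves_down * per_octave
    last = octaves_up * per_octave
    return [2.0 ** (k / per_octave) for k in range(first, last + 1)]


# Hypothetical range: from half size to double size.
scales = pyramid_scales(octaves_down=1, octaves_up=1)
print([round(s, 3) for s in scales])
```

Each scaled copy of the image would then be passed through the fully convolutional network to obtain one heat map per scale.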
* They analyze the effects of using Batch Normalization (BN) and Weight Normalization (WN) in GANs (classical algorithm, like DCGAN). * They introduce a new measure to rate the quality of the generated images over time. ### How * They use BN as it is usually defined. * They use WN with the following formulas: * Strict weightnormalized layer: * ![Strict WN layer](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/On_the_Effects_of_BN_and_WN_in_GANs__strict_wn.jpg?raw=true "Strict WN layer") * Affine weightnormalized layer: * ![Affine WN layer](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/On_the_Effects_of_BN_and_WN_in_GANs__affine_wn.jpg?raw=true "Affine WN layer") * As activation units they use Translated ReLUs (aka "threshold functions"): * ![TReLU](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/On_the_Effects_of_BN_and_WN_in_GANs__trelu.jpg?raw=true "TReLU") * `alpha` is a learned parameter. * TReLUs play better with their WN layers than normal ReLUs. * Reconstruction measure * To evaluate the quality of the generated images during training, they introduce a new measure. * The measure is based on a L2Norm (MSE) between (1) a real image and (2) an image created by the generator that is as similar as possible to the real image. * They generate (2) by starting `G(z)` with a noise vector `z` that is filled with zeros. The desired output is the real image. They compute a MSE between the generated and real image and backpropagate the result. Then they use the generated gradient to update `z`, while leaving the parameters of `G` unaltered. They repeat this for a defined number of steps. * Note that the above described method is fairly timeconsuming, so they don't do it often. * Networks * Their networks are fairly standard. * Generator: Starts at 1024 filters, goes down to 64 (then 3 for the output). Upsampling via fractionally strided convs. 
  * Discriminator: Starts at 64 filters, goes up to 1024 (then 1 for the output). Downsampling via strided convolutions.
  * They test three variations of these networks:
    * Vanilla: No normalization. PReLUs in both G and D.
    * BN: BN in G and D, but not in the last layers and not in the first layer of D. PReLUs in both G and D.
    * WN: Strict weight-normalized layers in G and D, except for the last layers, which are affine weight-normalized layers. TPReLUs (Translated PReLUs) in both G and D.
* Other
  * They train with RMSProp and batch size 32.

### Results

* Their WN formulation trains stably, provided the learning rate is set to 0.0002 or lower.
* They argue that the stability they achieve is similar to the one in WGAN.
* BN had significant swings in quality.
* Vanilla collapsed sooner or later.
* Both BN and Vanilla reached an optimal point shortly after the start of the training. After that, the quality of the generated images only worsened.
* Plot of their quality measure:
  * ![Losses over time](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/On_the_Effects_of_BN_and_WN_in_GANs__losses_over_time.jpg?raw=true "Losses over time")
* Their quality measure is based on reconstruction of input images. The image below shows examples for that reconstruction (each person: original image, vanilla reconstruction, BN rec., WN rec.).
  * ![Reconstructions](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/On_the_Effects_of_BN_and_WN_in_GANs__reconstructions.jpg?raw=true "Reconstructions")
* Examples generated by their WN network:
  * ![WN Examples](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/On_the_Effects_of_BN_and_WN_in_GANs__wn_examples.jpg?raw=true "WN Examples")
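
The reconstruction measure from the "How" section of these notes (optimize only `z`, keep `G` fixed, report the remaining MSE) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the generator here is a hypothetical toy linear map (`WEIGHTS`) instead of a convolutional network, the gradient is written out by hand, and the step count and learning rate are arbitrary assumptions.

```python
def generate(weights, z):
    """Toy stand-in for G: a fixed linear map from latent vector z to a flat 'image'."""
    return [sum(w * zj for w, zj in zip(row, z)) for row in weights]


def mse(a, b):
    """Mean squared error between two equally long lists."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)


def reconstruction_error(weights, target, steps=200, lr=0.1):
    """Gradient descent on z only; the generator weights are never updated."""
    z = [0.0] * len(weights[0])  # z starts filled with zeros, as in the notes
    n = len(target)
    for _ in range(steps):
        out = generate(weights, z)
        # dMSE/dz_j = (2/n) * sum_i (out_i - target_i) * W_ij
        grad = [
            (2.0 / n) * sum((out[i] - target[i]) * weights[i][j] for i in range(n))
            for j in range(len(z))
        ]
        z = [zj - lr * gj for zj, gj in zip(z, grad)]
    return mse(generate(weights, z), target), z


# Hypothetical fixed generator and a "real image" that it is able to reproduce.
WEIGHTS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
TARGET = generate(WEIGHTS, [0.5, -0.25])
final_error, z_opt = reconstruction_error(WEIGHTS, TARGET)
```

A low final error means the generator can still reproduce the real image from some `z`; a generator that has collapsed would leave a large residual error.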
* Weight Normalization (WN) is a normalization technique, similar to Batch Normalization (BN).
* It normalizes each layer's weights.

### Differences to BN

* WN normalizes based on each weight vector's orientation and magnitude. BN normalizes based on each weight's mean and variance in a batch.
* WN works on each example on its own. BN works on whole batches.
* WN is more deterministic than BN (due to not working on batches).
* WN is better suited for noisy environments (RNNs, LSTMs, reinforcement learning, generative models). (Due to being more deterministic.)
* WN is computationally simpler than BN.

### How it's done

* WN is a module added on top of a linear or convolutional layer.
* If that layer's weights are `w` then WN learns two parameters `g` (scalar) and `v` (vector, identical dimension to `w`) so that `w = gv / ||v||` is fulfilled (`||v||` = euclidean norm of `v`).
  * `g` is the magnitude of the weights, `v` is their orientation.
  * `v` is initialized to zero mean and a standard deviation of 0.05.
* For networks without recursions (i.e. not RNN/LSTM/GRU):
  * Right after initialization, they feed a single batch through the network.
  * For each neuron/weight, they calculate the mean and standard deviation after the WN layer.
  * They then adjust the bias to `-mean/stdDev` and `g` to `1/stdDev`.
  * That makes the network start with each feature being roughly zero-mean and unit-variance.
  * The same method can also be applied to networks without WN.

### Results:

* They define BN-MEAN as a variant of BN which only normalizes to zero-mean (not unit-variance).
* CIFAR-10 image classification (no data augmentation, some dropout, some white noise):
  * WN, BN, BN-MEAN all learn similarly fast. The network without normalization learns slower, but catches up towards the end.
  * BN learns "more" per example, but is about 16% slower (time-wise) than WN.
  * WN reaches about the same test error as no normalization (both ~8.4%), BN achieves better results (~8.0%).
  * WN + BN-MEAN achieves the best results with 7.31%.
  * Optimizer: Adam
* Convolutional VAE on MNIST and CIFAR-10:
  * WN learns more per example and plateaus at better values than the network without normalization. (BN was not tested.)
  * Optimizer: Adamax
* DRAW on MNIST (heavy on LSTMs):
  * WN learns significantly more per example than the network without normalization.
  * Also ends up with better results. (The normal network might catch up though if run longer.)
* Deep Reinforcement Learning (Space Invaders):
  * WN seemed to overall acquire a bit more reward per epoch than the network without normalization. Variance (in acquired reward) however also grew.
  * Results not as clear as in DRAW.
  * Optimizer: Adamax

### Extensions

* They argue that initializing `g` to `exp(cs)` (`c` constant, `s` learned) might be better, but they didn't get better test results with that.
* Due to some gradient effects, `||v||` currently grows monotonically with every weight update. (Not necessarily when using optimizers that use separate learning rates per parameter.)
  * That growth effect makes the network more robust to different learning rates.
* Setting a small hard limit/constraint for `||v||` can lead to better test set performance (parameter updates are larger, introducing more noise).

![CIFAR-10 results](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Weight_Normalization__cifar10.png?raw=true "CIFAR-10 results")

*Performance of WN on CIFAR-10 compared to BN, BN-MEAN and no normalization.*

![DRAW, DQN results](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Weight_Normalization__draw_dqn.png?raw=true "DRAW, DQN results")

*Performance of WN for DRAW (left) and deep reinforcement learning (right).*
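
The reparameterization `w = gv / ||v||` and the data-dependent initialization described under "How it's done" can be illustrated for a single neuron with the sketch below. It assumes one scalar `g` and one bias per neuron and uses plain Python lists; the function names and the tiny batch are hypothetical and only serve the example.

```python
import math


def weight_norm(g, v):
    """Reparameterized weights: w = g * v / ||v||."""
    norm = math.sqrt(sum(x * x for x in v))
    return [g * x / norm for x in v]


def data_dependent_init(v, batch):
    """Choose g and bias so the neuron's outputs on `batch` have zero mean and unit variance."""
    norm = math.sqrt(sum(x * x for x in v))
    # Outputs of the layer with g = 1 and bias = 0.
    t = [sum(vi * xi for vi, xi in zip(v, x)) / norm for x in batch]
    mean = sum(t) / len(t)
    std = math.sqrt(sum((ti - mean) ** 2 for ti in t) / len(t))
    return 1.0 / std, -mean / std  # g = 1/stdDev, bias = -mean/stdDev


def pre_activation(g, v, bias, x):
    """Output of the weight-normalized neuron before the nonlinearity."""
    w = weight_norm(g, v)
    return sum(wi * xi for wi, xi in zip(w, x)) + bias


# Hypothetical direction vector and initialization batch.
V = [3.0, 4.0]
BATCH = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
G_INIT, B_INIT = data_dependent_init(V, BATCH)
OUTPUTS = [pre_activation(G_INIT, V, B_INIT, x) for x in BATCH]
```

After this initialization the outputs on the initialization batch are standardized, which is the starting condition the notes describe.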
* Inception v4 is like Inception v3, but
  * Slimmed down, i.e. some parts were simplified
  * One new version with residual connections (Inception-ResNet-v2), one without (Inception-v4)
* They didn't observe an improved error rate when using residual connections.
* They did however observe that using residual connections decreased their training times.
* They had to scale down the results of their residual modules (multiply them by a constant ~0.1). Otherwise their networks would die (only produce 0s).
* Results on ILSVRC 2012 (val set, 144 crops/image):
  * Top-1 Error:
    * Inception-v4: 17.7%
    * Inception-ResNet-v2: 17.8%
  * Top-5 Error (ILSVRC 2012 val set, 144 crops/image):
    * Inception-v4: 3.8%
    * Inception-ResNet-v2: 3.7%

### Architecture

* Basic structure of Inception-ResNet-v2 (layers, dimensions):
  * `Image -> Stem -> 5x Module A -> Reduction-A -> 10x Module B -> Reduction-B -> 5x Module C -> AveragePooling -> Dropout 20% -> Linear, Softmax`
  * `299x299x3 -> 35x35x256 -> 35x35x256 -> 17x17x896 -> 17x17x896 -> 8x8x1792 -> 8x8x1792 -> 1792 -> 1792 -> 1000`
* Modules A, B, C are very similar.
  * They contain 2 (B, C) or 3 (A) branches.
  * Each branch starts with a 1x1 convolution on the input.
  * All branches merge into one 1x1 convolution (which is then added to the original input, as usual in residual architectures).
  * Module A uses 3x3 convolutions, B 7x1 and 1x7, C 3x1 and 1x3.
* The reduction modules also contain multiple branches. One has max pooling (3x3 stride 2), the other branches end in convolutions with stride 2.
![Module A](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Inception_v4__module_a.png?raw=true "Module A")
![Module B](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Inception_v4__module_b.png?raw=true "Module B")
![Module C](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Inception_v4__module_c.png?raw=true "Module C")
![Reduction Module A](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Inception_v4__reduction_a.png?raw=true "Reduction Module A")

*From top to bottom: Module A, Module B, Module C, Reduction Module A.*

![Top 5 error](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Inception_v4__top5_error.png?raw=true "Top 5 error")

*Top-5 error by epoch, models with (red, solid, bottom) and without (green, dashed) residual connections.*

### Rough chapter-wise notes

### Introduction, Related Work

* Inception v3 was adapted to run on DistBelief. Inception v4 is designed for TensorFlow, which gets rid of some constraints and allows a simplified architecture.
* The authors don't think that residual connections are inherently needed to train deep nets, but they do speed up the training.
* History:
  * Inception v1 - Introduced inception blocks
  * Inception v2 - Added Batch Normalization
  * Inception v3 - Factorized the inception blocks further (more submodules)
  * Inception v4 - Adds residual connections

### Architectural Choices

* Previous architectures were constrained due to memory problems. TensorFlow got rid of that problem.
* Previous architectures were carefully/conservatively extended. Architectures ended up being quite complicated. This version slims down everything.
* They had problems with residual networks dying when they contained more than 1000 filters (per inception module apparently?). They could fix that by multiplying the results of the residual subnetwork (before the element-wise addition) with a constant factor of ~0.1.
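
The residual scaling mentioned in the last point above amounts to a one-line change in the merge step. The following is a minimal sketch on flat lists; `branch` is a hypothetical placeholder for the inception subnetwork, and `scale=0.1` follows the "~0.1" stated in the notes.

```python
def scaled_residual(x, branch, scale=0.1):
    """Return x + scale * branch(x), added element-wise."""
    residual = branch(x)
    return [xi + scale * ri for xi, ri in zip(x, residual)]


def toy_branch(x):
    """Hypothetical stand-in for a residual subnetwork with large outputs."""
    return [10.0 * v for v in x]


y = scaled_residual([1.0, 2.0], toy_branch)
```

With the scaling, a branch that outputs values ten times larger than its input only shifts the activations by roughly the size of the input itself.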
### Training methodology

* Kepler GPUs, TensorFlow, RMSProp (SGD+Momentum apparently performed worse)

### Experimental Results

* Their residual version of Inception v4 ("Inception-ResNet-v2") seemed to learn faster than the non-residual version.
* They both peaked out at almost the same value.
* Top-1 Error (ILSVRC 2012 val set, 144 crops/image):
  * Inception-v4: 17.7%
  * Inception-ResNet-v2: 17.8%
* Top-5 Error (ILSVRC 2012 val set, 144 crops/image):
  * Inception-v4: 3.8%
  * Inception-ResNet-v2: 3.7%
* GANs are based on adversarial training.
* Adversarial training is a basic technique to train generative models (so here primarily models that create new images).
* In adversarial training one model (G, generator) generates things (e.g. images). Another model (D, discriminator) sees real things (e.g. real images) as well as fake things (e.g. images from G) and has to learn how to differentiate the two.
* Neural networks are models that can be trained in an adversarial way (and are the only models discussed here).

### How

* G is a simple neural net (e.g. just one fully connected hidden layer). It takes a vector as input (e.g. 100 dimensions) and produces an image as output.
* D is a simple neural net (e.g. just one fully connected hidden layer). It takes an image as input and produces a quality rating as output (0-1, so sigmoid).
* You need a training set of things to be generated, e.g. images of human faces.
* Let the batch size be B.
* G is trained the following way:
  * Create B vectors of 100 random values each, e.g. sampled uniformly from [-1, +1]. (The number of values per vector depends on the chosen input size of G.)
  * Feed forward the vectors through G to create new images.
  * Feed forward the images through D to create ratings.
  * Use a cross entropy loss on these ratings. All of these (fake) images should be viewed as label=0 by D. If D gives them label=1, the error will be low (G did a good job).
  * Perform a backward pass of the errors through D (without training D). That generates gradients/errors per image and pixel.
  * Perform a backward pass of these errors through G to train G.
* D is trained the following way:
  * Create B/2 images using G (again, B/2 random vectors, feed forward through G).
  * Choose B/2 images from the training set. Real images get label=1.
  * Merge the fake and real images into one batch. Fake images get label=0.
  * Feed forward the batch through D.
  * Measure the error using cross entropy.
  * Perform a backward pass with the error through D.
* Train G for one batch, then D for one (or more) batches. Sometimes D can be too slow to catch up with G, then you need more iterations of D per batch of G.

### Results

* Good-looking images of MNIST numbers and human faces. (Grayscale, rather homogeneous datasets.)
* Not so good-looking images of CIFAR-10. (Color, rather heterogeneous datasets.)

![Generated Faces](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Generative_Adversarial_Networks__faces.jpg?raw=true "Generated Faces")

*Faces generated by MLP GANs. (Rightmost column shows examples from the training set.)*

### Rough chapter-wise notes

* Introduction
  * Discriminative models performed well so far, generative models not so much.
  * Their suggested new architecture involves a generator and a discriminator.
  * The generator learns to create content (e.g. images), the discriminator learns to differentiate between real content and generated content.
  * Analogy: The generator produces counterfeit art, the discriminator's job is to judge whether a piece of art is a counterfeit.
  * This principle could be used with many techniques, but they use neural nets (MLPs) for both the generator and the discriminator.
* Adversarial Nets
  * They have a Generator G (simple neural net)
    * G takes a random vector as input (e.g. vector of 100 random values between -1 and +1).
    * G creates an image as output.
  * They have a Discriminator D (simple neural net)
    * D takes an image as input (can be real or generated by G).
    * D creates a rating as output (quality, i.e. a value between 0 and 1, where 0 means "probably fake").
  * Outputs from G are fed into D. The result can then be backpropagated through D and then G. G is trained to maximize log(D(image)), so to create a high value of D(image).
  * D is trained to produce 0s for images from G and 1s for real images.
  * Both are trained simultaneously, i.e. one batch for G, then one batch for D, then one batch for G...
  * D can also be trained multiple times in a row. That allows it to catch up with G.
* Theoretical Results
  * Let
    * pd(x): Probability that image `x` appears in the training set.
    * pg(x): Probability that image `x` appears in the images generated by G.
  * If G is now fixed then the best possible D classifies according to: `D(x) = pd(x) / (pd(x) + pg(x))`
  * It is provable that there is only one global optimum for GANs, which is reached when G perfectly replicates the training set probability distribution. (Assuming unlimited capacity of the models and unlimited training time.)
  * It is provable that G and D will converge to the global optimum, so long as D gets enough steps per training iteration to model the distribution generated by G. (Again, assuming unlimited capacity/time.)
  * Note that these things are proven for the general principle of GANs. Implementing GANs with neural nets can then introduce problems typical for neural nets (e.g. getting stuck in saddle points).
* Experiments
  * They tested on MNIST, Toronto Face Database (TFD) and CIFAR-10.
  * They used MLPs for G and D.
  * G contained ReLUs and Sigmoids.
  * D contained Maxouts.
  * D had Dropout, G didn't.
  * They use a Parzen Window Estimate aka KDE (sigma obtained via cross validation) to estimate the quality of their images.
  * They note that KDE is not really a great technique for such high dimensional spaces, but it's the only one known.
  * Results on MNIST and TFD are great. (Note: both grayscale)
  * CIFAR-10 seems to match more the texture but not really the structure.
  * Noise is noticeable in CIFAR-10 (a bit in TFD too). Comes from MLPs (no convolutions).
  * Their KDE score for MNIST and TFD is competitive with or better than other approaches.
* Advantages and Disadvantages
  * Advantages
    * No Markov Chains, only backprop
    * Inference-free training
    * Wide variety of functions can be incorporated into the model (?)
    * Generator never sees any real example. It only gets gradients. (Prevents overfitting?)
    * Can represent a wide variety of distributions, including sharp ones (Markov chains only work with blurry images).
  * Disadvantages
    * No explicit representation of the distribution modeled by G (?)
    * D and G must be well synchronized during training
      * If G is trained too much (i.e. D can't catch up), it can collapse many components of the random input vectors to the same output ("Helvetica")
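
The label conventions from the training steps (real = 1, fake = 0 for D; G is rewarded when D rates its images as 1) and the optimal discriminator `D(x) = pd(x) / (pd(x) + pg(x))` can be written down as small helper functions. This is only a sketch on plain Python floats: the function names are hypothetical, and the inputs stand for D's sigmoid outputs on a batch rather than for a real network.

```python
import math


def optimal_discriminator(p_data, p_gen):
    """Best possible D for a fixed G: pd(x) / (pd(x) + pg(x))."""
    return p_data / (p_data + p_gen)


def bce(prediction, label):
    """Binary cross entropy for one rating in (0, 1), clipped for numerical safety."""
    eps = 1e-12
    p = min(max(prediction, eps), 1.0 - eps)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))


def discriminator_loss(ratings_real, ratings_fake):
    """D's loss: real images should get label 1, fake images label 0."""
    losses = [bce(p, 1.0) for p in ratings_real] + [bce(p, 0.0) for p in ratings_fake]
    return sum(losses) / len(losses)


def generator_loss(ratings_fake):
    """G's loss: low when D rates G's images as real (i.e. maximize log D(G(z)))."""
    return sum(bce(p, 1.0) for p in ratings_fake) / len(ratings_fake)
```

At the global optimum `pg = pd`, so the optimal D outputs 0.5 everywhere and its cross entropy settles at `log(2)` per example.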
* Traditionally neural nets use max pooling with 2x2 grids (2MP).
* 2MP reduces the image dimensions by a factor of 2.
* An alternative would be to use pooling schemes that reduce by factors other than two, e.g. `1 < factor < 2`.
* Pooling by a factor of `sqrt(2)` would allow twice as many pooling layers as 2MP, resulting in "softer" image size reduction throughout the network.
* Fractional Max Pooling (FMP) is such a method to perform max pooling by factors other than 2.

### How

* In 2MP you move a 2x2 grid always by 2 pixels.
* Imagine that these step sizes follow a sequence, i.e. for 2MP: `2222222...`
* If you mix in just a single `1` you get a pooling factor of `<2`.
* By choosing the right amount of `1s` vs. `2s` you can pool by any factor between 1 and 2.
* The sequences of `1s` and `2s` can be generated in fully *random* order or in *pseudorandom* order, where pseudorandom basically means "predictable sub patterns" (e.g. 211211211211211...).
* FMP can happen *disjoint* or *overlapping*. Disjoint means 2x2 grids, overlapping means 3x3.

### Results

* FMP seems to perform generally better than 2MP.
* Better results on various tests, including CIFAR-10 and CIFAR-100 (often quite significant improvement).
* Best configuration seems to be *random* sequences with *overlapping* regions.
* Results are especially better if each test is repeated multiple times per image (as the random sequence generation creates randomness, similar to dropout). The first 5-10 repetitions seem to be most valuable, but even 100+ give some improvement.
* An FMP factor of `sqrt(2)` was usually used.
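
The idea of mixing step sizes of `1` and `2` can be made concrete with a simplified 1-D sketch. It covers only the *random*, *disjoint* case and assumes that each pooling window is exactly as wide as its step; the function names are hypothetical, and the 2-D and overlapping variants are not shown.

```python
import random


def fmp_increments(n_in, n_out, rng):
    """Random sequence of 1s and 2s with n_out entries that sum to n_in."""
    if not n_out <= n_in <= 2 * n_out:
        raise ValueError("pooling factor n_in / n_out must lie in [1, 2]")
    twos = n_in - n_out          # each 2 consumes one extra input position
    ones = n_out - twos
    sequence = [2] * twos + [1] * ones
    rng.shuffle(sequence)        # "random" FMP: a random permutation of the steps
    return sequence


def fractional_max_pool_1d(values, n_out, rng):
    """Disjoint 1-D max pooling with randomly mixed window widths of 1 and 2."""
    pooled, start = [], 0
    for step in fmp_increments(len(values), n_out, rng):
        pooled.append(max(values[start:start + step]))
        start += step
    return pooled


# Pool 10 values down to 7, i.e. by a factor of about sqrt(2).
pooled_example = fractional_max_pool_1d(list(range(10)), 7, random.Random(0))
```

Calling the function repeatedly with a different `rng` state produces different pooling regions for the same input, which is the source of the test-time averaging effect mentioned under "Results".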
![Examples](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Fractional_Max_Pooling__examples.jpg?raw=true "Examples")

*Random FMP with a factor of sqrt(2) applied five times to the same input image (results upscaled back to original size).*

### Rough chapter-wise notes

* (1) Convolutional neural networks
  * Advantages of 2x2 max pooling (2MP): fast; a bit invariant to translations and distortions; quick reduction of image sizes
  * Disadvantages: "disjoint nature of pooling regions" can limit generalization (i.e. that they don't overlap?); reduction of image sizes can be too quick
  * Alternatives to 2MP: 3x3 pooling with stride 2, stochastic 2x2 pooling
  * All suggested alternatives to 2MP also reduce sizes by a factor of 2
  * The author wants a reduction by sqrt(2), as that would make it possible to use twice as many pooling layers
  * Fractional Max Pooling = Pooling that reduces image sizes by a factor of `1 < alpha < 2`
  * FMP introduces randomness into pooling (by the choice of pooling regions)
  * Settings of FMP:
    * Pooling factor `alpha` in range [1, 2] (1 = no change in image sizes, 2 = image sizes get halved)
    * Choice of pooling regions: Random or pseudorandom. Random is stronger (?). Random+Dropout can result in underfitting.
    * Disjoint or overlapping pooling regions. Results for overlapping are better.
* (2) Fractional max-pooling
  * For traditional 2MP, every grid's top left coordinate is at `(2i-1, 2j-1)` and its bottom right coordinate at `(2i, 2j)` (i=col, j=row).
  * It will reduce the original size N to 1/2N, i.e. `N_in = 2N_out`.
  * The paper analyzes `1 < alpha < 2`, but `alpha > 2` is also possible.
  * Grid top left positions can be described by sequences of integers, e.g. (only column): 1, 3, 5, ...
  * Disjoint 2x2 pooling might be 1, 3, 5, ... while overlapping would have the same sequence with a larger 3x3 grid.
  * The increment of the sequences can be random or pseudorandom for alphas < 2.
  * For 2x2 FMP you can represent any alpha with a "good" sequence of increments that all have values `1` or `2`, e.g. 2111121122111121...
  * In the case of random FMP, the optimal fraction of 1s and 2s is calculated. Then a random permutation of a sequence of 1s and 2s is generated.
  * In the case of pseudorandom FMP, the 1s and 2s follow a pattern that leads to the correct alpha, e.g. 112112121121211212...
  * Random FMP creates varying distortions of the input image. Pseudorandom FMP is a faithful downscaling.
* (3) Implementation
  * In their tests they use a convnet starting with 10 convolutions, then 20, then 30, ...
  * They add FMP with an alpha of sqrt(2) after every conv layer.
  * They calculate the desired output size, then go backwards through their network to the input. They multiply the size of the image by sqrt(2) with every FMP layer and add a flat 1 for every conv layer. The result is the required image size. They pad the images to that size.
  * They use dropout, with increasing strength from 0% to 50% towards the output.
  * They use LeakyReLUs.
  * Every time they apply an FMP layer, they generate a new sequence of 1s and 2s. That indirectly makes the network an ensemble of similar networks.
  * The output of the network can be averaged over several forward passes (for the same image). The result then becomes more accurate (especially up to >=6 forward passes).
* (4) Results
  * Tested on MNIST and CIFAR-100
  * Architectures (somehow different from (3)?):
    * MNIST: 36x36 img -> 6 times (32 conv (3x3?) -> FMP alpha=sqrt(2)) -> ? -> ? -> output
    * CIFAR-100: 94x94 img -> 12 times (64 conv (3x3?) -> FMP alpha=2^(1/3)) -> ? -> ? -> output
  * Overlapping pooling regions seemed to perform better than disjoint regions.
  * Random FMP seemed to perform better than pseudorandom FMP.
  * Other tests:
    * "The Online Handwritten Assamese Characters Dataset": FMP performed better than 2MP (though their network architecture seemed to have significantly more parameters)
    * "CASIA-OLHWDB1.1 database": FMP performed better than 2MP (again, seemed to have more parameters)
    * CIFAR-10: FMP performed better than the current best network (especially with many tests per image)
* ELUs are an activation function
* They are most similar to LeakyReLUs and PReLUs

### How (formula)

* f(x):
  * `if x >= 0: x`
  * `else: alpha(exp(x)-1)`
* f'(x) / Derivative:
  * `if x >= 0: 1`
  * `else: f(x) + alpha`
* `alpha` defines at which negative value the ELU saturates.
* E.g. `alpha=1.0` means that the minimum value that the ELU can reach is `-1.0`
* LeakyReLUs however can go to `-Infinity`, ReLUs can't go below 0.

![ELUs vs LeakyReLUs vs ReLUs](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/ELUs__slopes.png?raw=true "ELUs vs LeakyReLUs vs ReLUs")

*Form of ELUs(alpha=1.0) vs LeakyReLUs vs ReLUs.*

### Why

* They derive from the unit natural gradient that a network learns faster if the mean activation of each neuron is close to zero.
* ReLUs can go above 0, but never below. So their mean activation will usually be quite a bit above 0, which should slow down learning.
* ELUs, LeakyReLUs and PReLUs all have negative slopes, so their mean activations should be closer to 0.
* In contrast to LeakyReLUs and PReLUs, ELUs saturate at a negative value (usually -1.0).
* The authors think that is good, because it lets ELUs encode the degree of presence of input concepts, while they do not quantify the degree of absence.
* So ELUs can measure the presence of concepts quantitatively, but the absence only qualitatively.
* They think that this makes ELUs more robust to noise.

### Results

* In their tests on MNIST, CIFAR-10, CIFAR-100 and ImageNet, ELUs perform (nearly always) better than ReLUs and LeakyReLUs.
* However, they don't test PReLUs at all and use an alpha of 0.1 for LeakyReLUs (even though 0.33 is afaik standard) and don't test LeakyReLUs on ImageNet (only ReLUs).

![CIFAR-100](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/ELUs__cifar100.png?raw=true "CIFAR-100")

*Comparison of ELUs, LeakyReLUs, ReLUs on CIFAR-100. ELUs end up with the best values, beaten during the early epochs by LeakyReLUs.
(Learning rates were optimized for ReLUs.)*

### Rough chapter-wise notes

* Introduction
  * Currently popular choice: ReLUs
  * ReLU: max(0, x)
  * ReLUs are sparse and avoid the vanishing gradient problem, because their derivative is 1 when they are active.
  * ReLUs have a mean activation larger than zero.
  * Non-zero mean activation causes a bias shift in the next layer, especially if multiple of them are correlated.
  * The natural gradient (?) corrects for the bias shift by adjusting the weight update.
  * Having less bias shift would bring the standard gradient closer to the natural gradient, which would lead to faster learning.
  * Suggested solutions:
    * Centering activation functions at zero, which would keep the off-diagonal entries of the Fisher information matrix small.
    * Batch Normalization
    * Projected Natural Gradient Descent (implicitly whitens the activations)
  * These solutions have the problem that they might end up taking away previous learning steps, which would slow down learning unnecessarily.
  * Choosing a good activation function would be a better solution.
  * Previously, tanh was preferred over sigmoid for that reason (pushed mean towards zero).
  * Recent new activation functions:
    * LeakyReLUs: x if x > 0, else alpha*x
    * PReLUs: Like LeakyReLUs, but alpha is learned
    * RReLUs: Slope of part < 0 is sampled randomly
  * Such activation functions with non-zero slopes for negative values seemed to improve results.
  * The deactivation state of such units is not very robust to noise, can get very negative.
  * They suggest an activation function that can return negative values, but quickly saturates (for negative values, not for positive ones).
  * So the model can make a quantitative assessment for positive statements (there is an amount X of A in the image), but only a qualitative negative one (something indicates that B is not in the image).
  * They argue that this makes their activation function more robust to noise.
  * Their activation function still has activations with a mean close to zero.
* Zero Mean Activations Speed Up Learning
  * Natural Gradient = Update direction which corrects the gradient direction with the Fisher Information Matrix
  * Hessian-Free Optimization techniques use an extended Gauss-Newton approximation of Hessians and therefore can be interpreted as versions of natural gradient descent.
  * Computing the Fisher matrix is too expensive for neural networks.
  * Methods to approximate the Fisher matrix or to perform natural gradient descent have been developed.
  * Natural gradient = inverse(FisherMatrix) * gradientOfWeights
  * Lots of formulas. Apparently first explaining how the natural gradient descent works, then proving that natural gradient descent can deal well with non-zero-mean activations.
  * Natural gradient descent auto-corrects bias shift (i.e. non-zero-mean activations).
    * If that auto-correction does not exist, oscillations (?) can occur, which slow down learning.
  * Two ways to push means towards zero:
    * Unit zero mean normalization (e.g. Batch Normalization)
    * Activation functions with negative parts
* Exponential Linear Units (ELUs)
  * *Formula*
    * f(x):
      * if x >= 0: x
      * else: alpha(exp(x)-1)
    * f'(x) / Derivative:
      * if x >= 0: 1
      * else: f(x) + alpha
    * `alpha` defines at which negative value the ELU saturates.
    * `alpha=0.5` => minimum value is -0.5 (?)
  * ELUs avoid the vanishing gradient problem, because their positive part is the identity function (like e.g. ReLUs)
  * The negative values of ELUs push the mean activation towards zero.
  * Mean activations closer to zero resemble more the natural gradient, therefore they should speed up learning.
  * ELUs are more noise robust than PReLUs and LeakyReLUs, because their negative values saturate and thus should create a small gradient.
  * "ELUs encode the degree of presence of input concepts, while they do not quantify the degree of absence"
* Experiments Using ELUs
  * They compare ELUs to ReLUs and LeakyReLUs, but not to PReLUs (no explanation why).
  * They seem to use a negative slope of 0.1 for LeakyReLUs, even though 0.33 is standard afaik.
  * They use an alpha of 1.0 for their ELUs (i.e. minimum value is -1.0).
  * MNIST classification:
    * ELUs achieved lower mean activations than ReLU/LeakyReLU
    * ELUs achieved lower cross entropy loss than ReLU/LeakyReLU (and also seemed to learn faster)
    * They used 5 hidden layers of 256 units each (no explanation why so many)
    * (No convolutions)
  * MNIST Autoencoder:
    * ELUs performed consistently best (at different learning rates)
    * Usually ELU > LeakyReLU > ReLU
    * LeakyReLUs not far off, so if they had used a 0.33 value maybe these would have won
  * CIFAR-100 classification:
    * Convolutional network, 11 conv layers
    * LeakyReLUs performed better during the first ~50 epochs, ReLUs mostly on par with ELUs
    * LeakyReLUs about on par for epochs 50-100
    * ELUs win in the end (the learning rates used might not be optimal for ELUs, were designed for ReLUs)
  * CIFAR-100, CIFAR-10 (big convnet):
    * 6.55% error on CIFAR-10, 24.28% on CIFAR-100
    * No comparison with ReLUs and LeakyReLUs for same architecture
  * ImageNet
    * Big convnet with spatial pyramid pooling (?) before the fully connected layers
    * Network with ELUs performed better than ReLU network (better score at end, faster learning)
    * Networks were still learning at the end, they didn't run till convergence
    * No comparison to LeakyReLUs
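
The ELU formula and its derivative from the notes above translate directly into code. The sketch below works on single floats and uses `alpha=1.0` as the default, matching the value the notes report for the experiments; the function names are chosen here for illustration.

```python
import math


def elu(x, alpha=1.0):
    """ELU: identity for x >= 0, alpha * (exp(x) - 1) otherwise (saturates at -alpha)."""
    return x if x >= 0 else alpha * (math.exp(x) - 1.0)


def elu_grad(x, alpha=1.0):
    """Derivative of the ELU: 1 for x >= 0, f(x) + alpha otherwise."""
    return 1.0 if x >= 0 else elu(x, alpha) + alpha


def leaky_relu(x, slope=0.1):
    """LeakyReLU for comparison: unbounded for negative inputs."""
    return x if x > 0 else slope * x
```

For strongly negative inputs `elu` approaches `-alpha` and its gradient approaches 0, whereas `leaky_relu` keeps decreasing linearly; this is the saturation behaviour that the authors link to noise robustness.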
* Deep plain/ordinary networks usually perform better than shallow networks.
* However, when they get too deep their performance on the *training* set decreases. That should never happen and is a shortcoming of current optimizers.
* If the "good" insights of the early layers could be transferred through the network unaltered, while changing/improving the "bad" insights, that effect might disappear.

### What residual architectures are

* Residual architectures use identity functions to transfer results from previous layers unaltered.
* They change these previous results based on results from convolutional layers.
* So while a plain network might do something like `output = convolution(image)`, a residual network will do `output = image + convolution(image)`.
* If the convolution resorts to just doing nothing, that will make the result a lot worse in the plain network, but not alter it at all in the residual network.
* So in the residual network, the convolution can focus fully on learning what positive changes it has to perform, while in the plain network it *first* has to learn the identity function and then what positive changes it can perform.

### How it works

* Residual architectures can be implemented in most frameworks. You only need something like a split layer and an element-wise addition.
* Use one branch with an identity function and one with 2 or more convolutions (1 is also possible, but seems to perform poorly). Merge them with the element-wise addition.
* Rough example block (for a 64x32x32 input): https://i.imgur.com/NJVb9hj.png
* An example block when you have to change the dimensionality (e.g. here from 64x32x32 to 128x32x32): https://i.imgur.com/9NXvTjI.png
* The authors seem to prefer using either two 3x3 convolutions or the chain of 1x1 then 3x3 then 1x1. They use the latter one for their very deep networks.
* The authors also tested:
  * To use 1x1 convolutions instead of identity functions everywhere.
Performed a bit better than using 1x1 only for dimensionality changes. However, it also increases computation and memory demands.
  * To use zero-padding for dimensionality changes (no 1x1 convs, just fill the additional dimensions with zeros). Performed only a bit worse than 1x1 convs and a lot better than plain network architectures.
* Pooling can be used as in plain networks. No special architectures are necessary.
* Batch normalization can be used as usual (before nonlinearities).

### Results

* Residual networks seem to perform generally better than similarly sized plain networks.
* They seem to be able to achieve similar results with less computation.
* They enable well-trainable very deep architectures with up to 1000 layers and more.
* The activations of the residual layers are low compared to plain networks. That indicates that the residual networks indeed only learn to make "good" changes and default to "if in doubt, change nothing".

![Building blocks](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Deep_Residual_Learning_for_Image_Recognition__building_blocks.png?raw=true "Building blocks")

*Examples of basic building blocks (other architectures are possible). The paper doesn't discuss the placement of the ReLU (after add instead of after the layer).*

![Activations](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Deep_Residual_Learning_for_Image_Recognition__activations.png?raw=true "Activations")

*Activations of layers (after batch normalization, before nonlinearity) throughout the network for plain and residual nets. Residual networks have on average lower activations.*

### Rough chapter-wise notes

* (1) Introduction
  * In classical architectures, adding more layers can cause the network to perform worse on the training set.
  * That shouldn't be the case. (E.g. a shallower network could be trained and then get a few layers of identity functions on top of it to create a deep network.)
* To combat that problem, they stack residual layers. * A residual layer is an identity function and can learn to add something on top of that. * So if `x` is an input image and `f(x)` is a convolution, they do something like `x + f(x)` or even `x + f(f(x))`. * The classical architecture would be more like `f(f(f(f(x))))`. * Residual architectures can be easily implemented in existing frameworks using skip connections with identity functions (split + merge). * Residual architectures outperformed others in ILSVRC 2015 and COCO 2015. * (3) Deep Residual Learning * If some layers have to fit a function `H(x)` then they should also be able to fit `H(x) - x` (change between `x` and `H(x)`). * The latter case might be easier to learn than the former one. * The basic structure of a residual block is `y = x + F(x, W)`, where `x` is the input image, `y` is the output image (`x + change`) and `F(x, W)` is the residual subnetwork that estimates a good change of `x` (W are the subnetwork's weights). * `x` and `F(x, W)` are added using elementwise addition. * `x` and the output of `F(x, W)` must have equal dimensions (channels, height, width). * If different dimensions are required (mainly change in number of channels) a linear projection `V` is applied to `x`: `y = F(x, W) + Vx`. They use a 1x1 convolution for `V` (without nonlinearity?). * `F(x, W)` subnetworks can contain any number of layers. They suggest 2+ convolutions. Using only 1 layer seems to be useless. * They run some tests on a network with 34 layers and compare to a 34-layer network without residual blocks and with VGG (19 layers). * They say that their architecture requires only 18% of the FLOPs of VGG. (Though a lot of that probably comes from VGG's 2x4096 fully connected layers? They don't use any fully connected layers, only convolutions.) * A critical part is the change in dimensionality (e.g. from 64 kernels to 128).
They test (A) adding the new dimensions empty (padding), (B) using the mentioned linear projection with 1x1 convolutions and (C) using the same linear projection, but on all residual blocks (not only for dimensionality changes). * (A) doesn't add parameters, (B) does (i.e. breaks the pattern of using identity functions). * They use batch normalization before each nonlinearity. * Optimizer is SGD. * They don't use dropout. * (4) Experiments * When testing on ImageNet an 18-layer plain (i.e. not residual) network has lower training set error than a deep 34-layer plain network. * They argue that this effect probably does not come from vanishing gradients, because they (a) checked the gradient norms and they looked healthy and (b) use batch normalization. * They guess that deep plain networks might have exponentially low convergence rates. * For the residual architectures it's the other way round. Stacking more layers improves the results. * The residual networks also perform better (in error %) than plain networks with the same number of parameters and layers. (Both for training and validation set.) * Regarding the previously mentioned handling of dimensionality changes: * (A) Pad new dimensions: Performs worst. (Still far better than plain network though.) * (B) Linear projections for dimensionality changes: Performs better than A. * (C) Linear projections for all residual blocks: Performs better than B. (Authors think that's due to introducing new parameters.) * They also test on very deep residual networks with 50 to 152 layers. * For these deep networks their residual block has the form `1x1 conv -> 3x3 conv -> 1x1 conv` (i.e. dimensionality reduction, convolution, dimensionality increase). * These deeper networks perform significantly better. * In further tests on CIFAR-10 they can observe that the activations of the convolutions in residual networks are lower than in plain networks.
* So the residual networks default to doing nothing and only change (activate) when something needs to be changed. * They test a network with 1202 layers. It is still easily optimizable, but overfits the training set. * They also test on COCO and get significantly better results than a Faster R-CNN + VGG implementation.
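
The block structure `y = x + F(x, W)` and the projected variant `y = F(x, W) + Vx` from the notes above can be sketched in plain Python. This is only an illustration under simplifying assumptions, not the authors' implementation: small dense layers stand in for the convolutions, the ReLU after the addition is left out so that the identity behaviour is exact, and the names `dense`, `relu`, `residual_block` and `plain_block` are hypothetical helpers.

```python
def relu(v):
    """Elementwise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]


def dense(x, W):
    """Multiply vector x by weight matrix W (one row per output unit)."""
    return [sum(w * a for w, a in zip(row, x)) for row in W]


def plain_block(x, W1, W2):
    """Plain (non-residual) block: dense -> ReLU -> dense."""
    return dense(relu(dense(x, W1)), W2)


def residual_block(x, W1, W2, V=None):
    """y = shortcut(x) + F(x), where F is the plain block above.

    V is an optional linear projection for the shortcut. It is only
    needed when the output size differs from the input size.
    """
    f = plain_block(x, W1, W2)
    shortcut = x if V is None else dense(x, V)
    return [s + r for s, r in zip(shortcut, f)]


x = [1.0, -2.0, 3.0]
zeros_3x3 = [[0.0] * 3 for _ in range(3)]

# If F "does nothing" (all weights zero), the residual block passes x
# through unchanged, while the plain block destroys the input.
identity_out = residual_block(x, zeros_3x3, zeros_3x3)
plain_out = plain_block(x, zeros_3x3, zeros_3x3)

# Dimensionality change 3 -> 2: the shortcut needs a projection V.
V = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
zeros_2x3 = [[0.0] * 3 for _ in range(2)]
projected_out = residual_block(x, zeros_3x3, zeros_2x3, V)
```

With zero weights the residual block returns its input while the plain block returns zeros, which is the "if in doubt, change nothing" default described in the results.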
### What is BN: * Batch Normalization (BN) is a normalization method/layer for neural networks. * Usually inputs to neural networks are normalized to either the range of [0, 1] or [-1, 1] or to mean=0 and variance=1. The latter is called *Whitening*. * BN essentially performs Whitening to the intermediate layers of the networks. ### How it's calculated: * The basic formula is $x^* = (x - E[x]) / \sqrt{\text{var}(x)}$, where $x^*$ is the new value of a single component, $E[x]$ is its mean within a batch and $\text{var}(x)$ is its variance within a batch. * BN extends that formula further to $x^{**} = \gamma x^* + \beta$, where $x^{**}$ is the final normalized value. $\gamma$ (`gamma`) and $\beta$ (`beta`) are learned per layer. They make sure that BN can learn the identity function, which is needed in a few cases. * For convolutions, every layer/filter/kernel is normalized on its own (linear layer: each neuron/node/component). That means that every generated value ("pixel") is treated as an example. If we have a batch size of N and the image generated by the convolution has width=P and height=Q, we would calculate the mean (E) over `N*P*Q` examples (same for the variance). ### Theoretical effects: * BN reduces *Covariate Shift*. That is the change in distribution of activation of a component. By using BN, each neuron's activation becomes (more or less) a gaussian distribution, i.e. it's usually not active, sometimes a bit active, rarely very active. * Covariate Shift is undesirable, because the later layers have to keep adapting to the change of the type of distribution (instead of just to new distribution parameters, e.g. new mean and variance values for gaussian distributions). * BN reduces effects of exploding and vanishing gradients, because every activation becomes roughly normally distributed. Without BN, low activations of one layer can lead to lower activations in the next layer, and then even lower ones in the next layer and so on. ### Practical effects: * BN reduces training times.
(Because of less Covariate Shift and fewer exploding/vanishing gradients.) * BN reduces demand for regularization, e.g. dropout or L2 norm. (Because the means and variances are calculated over batches and therefore every normalized value depends on the current batch. I.e. the network can no longer just memorize values and their correct answers.) * BN allows higher learning rates. (Because of less danger of exploding/vanishing gradients.) * BN enables training with saturating nonlinearities in deep networks, e.g. sigmoid. (Because the normalization prevents them from getting stuck in saturating ranges, e.g. very high/low values for sigmoid.) ![MNIST and neuron activations](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Batch_Normalization__performance_and_activations.png?raw=true "MNIST and neuron activations") *BN applied to MNIST (a), and activations of a randomly selected neuron over time (b, c), where the middle line is the median activation, the top line is the 85th percentile and the bottom line is the 15th percentile.* ### Rough chapter-wise notes * (2) Towards Reducing Covariate Shift * Batch Normalization (*BN*) is a special normalization method for neural networks. * In neural networks, the inputs to each layer depend on the outputs of all previous layers. * The distributions of these outputs can change during the training. Such a change is called a *covariate shift*. * If the distributions stayed the same, it would simplify the training. Then only the parameters would have to be readjusted continuously (e.g. mean and variance for normal distributions). * If using sigmoid activations, it can happen that one unit saturates (very high/low values). That is undesired as it leads to vanishing gradients for all units below in the network. * BN fixes the means and variances of layer inputs to specific values (zero mean, unit variance). * That accomplishes: * No more covariate shift. * Fixes problems with vanishing gradients due to saturation.
* Effects: * Networks learn faster. (As they don't have to adjust to covariate shift any more.) * Optimizes gradient flow in the network. (As the gradient becomes less dependent on the scale of the parameters and their initial values.) * Higher learning rates are possible. (Optimized gradient flow reduces risk of divergence.) * Saturating nonlinearities can be safely used. (Optimized gradient flow prevents the network from getting stuck in saturated modes.) * BN reduces the need for dropout. (As it has a regularizing effect.) * How BN works: * BN normalizes layer inputs to zero mean and unit variance. That is called *whitening*. * Naive method: Train on a batch. Update model parameters. Then normalize. Doesn't work: Leads to exploding biases while distribution parameters (mean, variance) don't change. * A proper method has to include the current example *and* all previous examples in the normalization step. * This leads to calculating a covariance matrix and its inverse square root. That's expensive. The authors found a faster way. * (3) Normalization via Mini-Batch Statistics * Each feature (component) is normalized individually. (Due to cost, differentiability.) * Normalization according to: `componentNormalizedValue = (componentOldValue - E[component]) / sqrt(Var(component))` * Normalizing each component can reduce the expressivity of nonlinearities. Hence the formula is changed so that it can also learn the identity function. * Full formula: `newValue = gamma * componentNormalizedValue + beta` (gamma and beta learned per component) * E and Var are estimated for each mini batch. * BN is fully differentiable. Formulas for gradients/backpropagation are at the end of chapter 3 (page 4, left). * (3.1) Training and Inference with Batch-Normalized Networks * During test time, E and Var of each component can be estimated using all examples or alternatively with moving averages estimated during training.
* During test time, the BN formulas can be simplified to a single linear transformation. * (3.2) Batch-Normalized Convolutional Networks * The authors recommend placing BN layers after linear/fully-connected layers and before the nonlinearities. * They argue that the linear layers have a better distribution that is more likely to be similar to a gaussian. * Placing BN after the nonlinearity would also not eliminate covariate shift (for some reason). * Learning a separate bias isn't necessary as BN's formula already contains a bias-like term (beta). * For convolutions they apply BN equally to all features on a feature map. That creates effective batch sizes of m\*p\*q, where m is the number of examples in the batch and p, q are the feature map dimensions (height, width). BN for linear layers has a batch size of m. * gamma and beta are then learned per feature map, not per single pixel. (Linear layers: Per neuron.) * (3.3) Batch Normalization enables higher learning rates * BN normalizes activations. * Result: Changes to early layers don't amplify towards the end. * BN makes it less likely to get stuck in the saturating parts of nonlinearities. * BN makes training more resilient to parameter scales. * Usually, large learning rates cannot be used as they tend to scale up parameters. Then any change to a parameter amplifies through the network and can lead to gradient explosions. * With BN gradients actually go down as parameters increase. Therefore, higher learning rates can be used. * (something about singular values and the Jacobian) * (3.4) Batch Normalization regularizes the model * Usually: Examples are seen on their own by the network. * With BN: Examples are seen in conjunction with other examples (mean, variance). * Result: Network can't easily memorize the examples any more. * Effect: BN has a regularizing effect. Dropout can be removed or decreased in strength. * (4) Experiments * (4.1) Activations over time * They tested BN on MNIST with a 100x100x10 network.
(One network with BN before each nonlinearity, another network without BN for comparison.) * Batch Size was 60. * The network with BN learned faster. Activations of neurons (their means and variances over several examples) seemed to be more consistent during training. * Generalization of the BN network seemed to be better. * (4.2) ImageNet classification * They applied BN to the Inception network. * Batch Size was 32. * During training they used (compared to original Inception training) a higher learning rate with more decay, no dropout, less L2, no local response normalization and less distortion/augmentation. * They shuffle the data during training (i.e. each batch contains different examples). * Depending on the learning rate, they either achieve the same accuracy (as in the non-BN network) in 14 times fewer steps (5x learning rate) or a higher accuracy in 5 times fewer steps (30x learning rate). * BN enables training of Inception networks with sigmoid units (still a bit lower accuracy than ReLU). * An ensemble of 6 Inception networks with BN achieved better accuracy than the previously best network for ImageNet. * (5) Conclusion * BN is similar to a normalization layer suggested by Gülçehre and Bengio. However, they applied it to the outputs of nonlinearities. * They also didn't have the beta and gamma parameters (i.e. their normalization could not learn the identity function).
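
The two BN formulas from the notes above (normalize, then `gamma * normalized + beta`) and the test-time simplification to a single linear transformation can be sketched for one component in plain Python. This is a minimal sketch under stated assumptions, not a framework implementation: the function names are hypothetical, the small `eps` term is added for numerical stability, and the batch statistics are reused at "test time" where a real setup would use statistics over all examples or moving averages.

```python
import math


def batch_norm_train(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one component over a mini-batch, then scale and shift.

    values: activations of a single component, one entry per example.
    Returns the transformed values plus the batch mean and variance.
    """
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n  # biased batch variance
    out = [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]
    return out, mean, var


def fold_for_inference(mean, var, gamma=1.0, beta=0.0, eps=1e-5):
    """Collapse BN with fixed statistics into y = scale * x + shift."""
    scale = gamma / math.sqrt(var + eps)
    shift = beta - scale * mean
    return scale, shift


batch = [2.0, 4.0, 6.0, 8.0]
normed, mean, var = batch_norm_train(batch)

# Assumption for illustration: reuse the batch statistics as the fixed
# test-time statistics, so both code paths can be compared directly.
scale, shift = fold_for_inference(mean, var)
folded = [scale * v + shift for v in batch]

normed_mean = sum(normed) / len(normed)
normed_var = sum((v - normed_mean) ** 2 for v in normed) / len(normed)
```

For a convolutional layer the same computation would run once per feature map, with `values` holding all `m*p*q` activations of that map across the batch.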
* The paper describes a method to separate content and style from each other in an image. * The style can then be transferred to a new image. * Examples: * Let a photograph look like a painting by van Gogh. * Improve a dark beach photo by taking the style from a sunny beach photo. ### How * They use the pretrained 19-layer VGG net as their base network. * They assume that two images are provided: One with the *content*, one with the desired *style*. * They feed the content image through the VGG net and extract the activations of the last convolutional layer. These activations are called the *content representation*. * They feed the style image through the VGG net and extract the activations of all convolutional layers. They transform each layer to a *Gram Matrix* representation. These Gram Matrices are called the *style representation*. * How to calculate a *Gram Matrix*: * Take the activations of a layer. That layer will contain some convolution filters (e.g. 128), each one having its own activations. * Convert each filter's activations to a (1-dimensional) vector. * Pick all pairs of filters. Calculate the scalar product of both filters' vectors. * Add the scalar product result as an entry to a matrix of size `#filters x #filters` (e.g. 128x128). * Repeat that for every pair to get the Gram Matrix. * The Gram Matrix roughly represents the *texture* of the image. * Now you have the content representation (activations of a layer) and the style representation (Gram Matrices). * Create a new image of the size of the content image. Fill it with random white noise. * Feed that image through VGG to get its content representation and style representation. (This step will be repeated many times during the image creation.) * Make changes to the new image using gradient descent to optimize a loss function. * The loss function has two components: * The mean squared error between the new image's content representation and the previously extracted content representation.
* The mean squared error between the new image's style representation and the previously extracted style representation. * Add up both components to get the total loss. * Give both components a weight to alter for more/less style matching (at the expense of content matching). ![Examples](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/A_Neural_Algorithm_for_Artistic_Style__examples.jpg?raw=true "Examples") *One example input image with different styles added to it.* ### Rough chapter-wise notes * Page 1 * A painted image can be decomposed into its content and its artistic style. * Here they use a neural network to separate content and style from each other (and to apply that style to an existing image). * Page 2 * Representations get more abstract as you go deeper in networks, hence they should more resemble the actual content (as opposed to the artistic style). * They call the feature responses in higher layers *content representation*. * To capture style information, they use a method that was originally designed to capture texture information. * They somehow build a feature space on top of the existing one, that is somehow dependent on correlations of features. That leads to a "stationary" (?) and multi-scale representation of the style. * Page 3 * They use VGG as their base CNN. * Page 4 * Based on the extracted style features, they can generate a new image, which has equal activations in these style features. * The new image should match the style (texture, color, localized structures) of the artistic image. * The style features become more and more abstract with higher layers. They call that multi-scale representation the *style representation*. * The key contribution of the paper is a method to separate style and content representation from each other. * These representations can then be used to change the style of an existing image (by changing it so that its content representation stays the same, but its style representation matches the artwork).
* Page 6 * The generated images look most appealing if all features from the style representation are used. (The lower layers tend to reflect small features, the higher layers tend to reflect larger features.) * Content and style can't be separated perfectly. * Their loss function has two terms, one for content matching and one for style matching. * The terms can be increased/decreased to match content or style more. * Page 8 * Previous techniques worked only on limited or simple domains or used non-parametric approaches (see non-photorealistic rendering). * Previously neural networks have been used to classify the time period of paintings (based on their style). * They argue that separating content from style might be useful in many other domains (other than transferring style of paintings to images). * Page 9 * The style representation is gathered by measuring correlations between activations of neurons. * They argue that this is somehow similar to what "complex cells" in the primary visual system (V1) do. * They note that deep convnets seem to automatically learn to separate content from style, probably because it is helpful for style-invariant classification. * Page 9, Methods * They use the 19-layer VGG net as their basis. * They use only its convolutional layers, not the linear ones. * They use average pooling instead of max pooling, as that produced slightly better results. * Page 10, Methods * The information about the image that is contained in layers can be visualized. To do that, extract the features of a layer as the labels, then start with a white noise image and change it via gradient descent until the generated features have minimal distance (MSE) to the extracted features. * They build a style representation by calculating Gram Matrices for each layer. * Page 11, Methods * The Gram Matrix is generated in the following way: * Convert each filter of a convolutional layer to a 1-dimensional vector.
* For a pair of filters i, j calculate the value in the Gram Matrix by calculating the scalar product of the two vectors of the filters. * Do that for every pair of filters, generating a matrix of size #filters x #filters. That is the Gram Matrix. * Again, a white noise image can be changed with gradient descent to match the style of a given image (i.e. minimize MSE between two Gram Matrices). * That can be extended to match the style of several layers by measuring the MSE of the Gram Matrices of each layer and giving each layer a weighting. * Page 12, Methods * To transfer the style of a painting to an existing image, proceed as follows: * Start with a white noise image. * Optimize that image with gradient descent so that it minimizes both the content loss (relative to the image) and the style loss (relative to the painting). * Each distance (content, style) can be weighted to have more or less influence on the loss function. 
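
The Gram Matrix construction and the weighted two-term loss described above can be sketched in plain Python. This is a hedged illustration, not the paper's code: it covers a single layer (the paper combines several layers for style), feature maps are tiny hand-written lists instead of VGG activations, no gradient descent on the image is performed, and `gram_matrix`, `mse` and `total_loss` are hypothetical helper names.

```python
def gram_matrix(feature_maps):
    """Gram Matrix of one layer.

    feature_maps: list of filters, each already flattened to a
    1-dimensional list of activations. Entry (i, j) is the scalar
    product of the vectors of filters i and j.
    """
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in feature_maps]
            for fi in feature_maps]


def mse(A, B):
    """Mean squared error between two equally shaped nested lists."""
    diffs = [(a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb)]
    return sum(diffs) / len(diffs)


def total_loss(generated, content, style,
               content_weight=1.0, style_weight=1.0):
    """Weighted sum of content loss and (single-layer) style loss."""
    content_loss = mse(generated, content)
    style_loss = mse(gram_matrix(generated), gram_matrix(style))
    return content_weight * content_loss + style_weight * style_loss


# Two filters with two spatial positions each.
G = gram_matrix([[1.0, 2.0], [3.0, 4.0]])

# The style features are a spatial rearrangement of the content features:
# the activations differ, but the Gram Matrices are identical.
content_feats = [[1.0, 0.0], [0.0, 1.0]]
style_feats = [[0.0, 1.0], [1.0, 0.0]]

loss_keep_content = total_loss(content_feats, content_feats, style_feats)
loss_copy_style = total_loss(style_feats, content_feats, style_feats)
```

The second example shows why the Gram Matrix behaves like a texture descriptor: it ignores where activations occur, so rearranging them spatially changes the content loss but not the style loss.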