Welcome to ShortScience.org! 
[link]
In the years before this paper came out in 2017, a number of different graph convolution architectures  which use weightsharing and orderinvariant operations to create representations at nodes in a graph that are contextualized by information in the rest of the graph  had been suggested for learning representations of molecules. The authors of this paper out of Google sought to pull all of these proposed models into a single conceptual framework, for the sake of better comparing and testing the design choices that went into them. All empirical tests were done using the QM9 dataset, where 134,000 molecules have predicted chemical properties attached to them, things like the amount of energy released if bombs are sundered and the energy of electrons at different electron shells. https://i.imgur.com/Mmp8KO6.png An interesting note is that these properties weren't measured empirically, but were simulated by a very expensive quantum simulation, because the former wouldn't be feasible for this large of a dataset. However, this is still a moderately interesting test because, even if we already have the capability to computationally predict these features, a neural network would do much more quickly. And, also, one might aspirationally hope that architectures which learn good representations of molecules for quantum predictions are also useful for tasks with a less available automated prediction mechanism. The framework assumes the existence of "hidden" feature vectors h at each node (atom) in the graph, as well as features that characterize the edges between nodes (whether that characterization comes through sorting into discrete bond categories or through a continuous representation). The features associated with each atom at the lowest input level of the moleculesummarizing networks trained here include: the element ID, the atomic number, whether it accepts electrons or donates them, whether it's in an aromatic system, and which shells its electrons are in. https://i.imgur.com/J7s0q2e.png Given these building blocks, the taxonomy lays out three broad categories of function, each of which different architectures implement in slightly different ways. 1. The Message function, M(). This function is defined with reference to a node w, that the message is coming from, and a node v, that it's being sent to, and is meant to summarize the information coming from w to inform the node representation that will be calculated at v. It takes into account the feature vectors of one or both nodes at the next level down, and sometimes also incorporates feature vectors attached to the edge connecting the two nodes. In a notable example of weight sharing, you'd use the same Message function for every combination of v and w, because you need to be able to process an arbitrary number of pairs, with each v having a different number of neighbors. The simplest example you might imagine here is a simple concatenation of incoming node and edge features; a more typical example from the architectures reviewed is a concatenation followed by a neural network layer. The aggregate message being sent to the receiver node is calculated by summing together the messages from each incoming vector (though it seems like other options are possible; I'm a bit confused why the paper presented summing as the only orderinvariant option). 2. The Update function, U(). This function governs how to take the aggregated message vector sent to a particular node, and combine that with the priorlayer representation at that node, to come up with a nextlayer representation at that node. Similarly, the same Update function weights are shared across all atoms. 3. The Readout function, R(), which takes the finallayer representation of each atom node and aggregates the representations into a final graphlevel representation an orderinvariant way Rather than following in the footsteps of the paper by describing each proposed model type and how it can be described in this framework, I'll instead try to highlight some of the more interesting ways in which design choices differed across previously proposed architectures.  Does the message function being sent from w to v depend on the feature value at both w and v, or just v? To put the question more colloquially, you might imagine w wanting to contextually send different information based on different values of the feature vector at node v, and this extra degree of expressivity (not present in the earliest 2015 paper), seems like a quite valuable addition (in that all subsequent papers include it)  Are the edge features static, categorical things, or are they feature vectors that get iteratively updated in the same way that the node vectors do? For most of the architectures reviewed, the former is true, but the authors found that the highest performance in their tests came from networks with continuous edge vectors, rather than just having different weights for different category types of edge  Is the Readout function something as simple as a summation of all toplevel feature vectors, or is it more complex? Again, the authors found that they got the best performance by using a more complex approach, a Set2Set aggregator, which uses itemtoitem attention within the set of finallayer atom representations to construct an aggregated graplevel embedding The empirical tests within the paper highlight a few more interestingly relevant design choices that are less directly captured by the framework. The first is the fact that it's quite beneficial to explicitly include Hydrogen atoms as part of the graph, rather than just "attaching" them to their nearestby atoms as a count that goes on that atom's feature vector. The second is that it's valuable to start out your edge features with a continuous representation of the spatial distance between atoms, along with an embedding of the bond type. This is particularly worth considering because getting spatial distance data for a molecule requires solving the freeenergy problem to determine its spatial conformation, a costly process. We might ideally prefer a network that can work on bond information alone. The authors do find a nonspatialinformation network that can perform reasonably well  reaching full accuracy on 5 of 13 targets, compared to 11 with spatial information. However, the difference is notable, which, at least from my perspective, begs the question of whether it'd ever be possible to learn representations that can match the performance of spatiallyinformed ones without explicitly providing that information. 
[link]
# Object detection system overview. https://i.imgur.com/vd2YUy3.png 1. takes an input image, 2. extracts around 2000 bottomup region proposals, 3. computes features for each proposal using a large convolutional neural network (CNN), and then 4. classifies each region using classspecific linear SVMs. * RCNN achieves a mean average precision (mAP) of 53.7% on PASCAL VOC 2010. * On the 200class ILSVRC2013 detection dataset, RCNN’s mAP is 31.4%, a large improvement over OverFeat , which had the previous best result at 24.3%. ## There is a 2 challenges faced in object detection 1. localization problem 2. labeling the data 1 localization problem : * One approach frames localization as a regression problem. they report a mAP of 30.5% on VOC 2007 compared to the 58.5% achieved by our method. * An alternative is to build a slidingwindow detector. considered adopting a slidingwindow approach increases the number of convolutional layers to 5, have very large receptive fields (195 x 195 pixels) and strides (32x32 pixels) in the input image, which makes precise localization within the slidingwindow paradigm. 2 labeling the data: * The conventional solution to this problem is to use unsupervised pretraining, followed by supervise finetuning * supervised pretraining on a large auxiliary dataset (ILSVRC), followed by domain specific finetuning on a small dataset (PASCAL), * finetuning for detection improves mAP performance by 8 percentage points. * Stochastic gradient descent via back propagation was used to effective for training convolutional neural networks (CNNs) ## Object detection with RCNN This system consists of three modules * The first generates categoryindependent region proposals. These proposals define the set of candidate detections available to our detector. * The second module is a large convolutional neural network that extracts a fixedlength feature vector from each region. * The third module is a set of class specific linear SVMs. Module design 1 Region proposals * which detect mitotic cells by applying a CNN to regularlyspaced square crops. * use selective search method in fast mode (Capture All Scales, Diversification, Fast to Compute). * the time spent computing region proposals and features (13s/image on a GPU or 53s/image on a CPU) 2 Feature extraction. * extract a 4096dimensional feature vector from each region proposal using the Caffe implementation of the CNN * Features are computed by forward propagating a meansubtracted 227x227 RGB image through five convolutional layers and two fully connected layers. * warp all pixels in a tight bounding box around it to the required size * The feature matrix is typically 2000x4096 3 Test time detection * At test time, run selective search on the test image to extract around 2000 region proposals (we use selective search’s “fast mode” in all experiments). * warp each proposal and forward propagate it through the CNN in order to compute features. Then, for each class, we score each extracted feature vector using the SVM trained for that class. * Given all scored regions in an image, we apply a greedy nonmaximum suppression (for each class independently) that rejects a region if it has an intersectionover union (IoU) overlap with a higher scoring selected region larger than a learned threshold. ## Training 1 Supervised pretraining: * pretrained the CNN on a large auxiliary dataset (ILSVRC2012 classification) using imagelevel annotations only (bounding box labels are not available for this data) 2 Domainspecific finetuning. * use the stochastic gradient descent (SGD) training of the CNN parameters using only warped region proposals with learning rate of 0.001. 3 Object category classifiers. * use intersectionover union (IoU) overlap threshold method to label a region with The overlap threshold of 0.3. * Once features are extracted and training labels are applied, we optimize one linear SVM per class. * adopt the standard hard negative mining method to fit large training data in memory. ### Results on PASCAL VOC 201012 1 VOC 2010 * compared against four strong baselines including SegDPM, DPM, UVA, Regionlets. * Achieve a large improvement in mAP, from 35.1% to 53.7% mAP, while also being much faster https://i.imgur.com/0dGX9b7.png 2 ILSVRC2013 detection. * ran RCNN on the 200class ILSVRC2013 detection dataset * RCNN achieves a mAP of 31.4% https://i.imgur.com/GFbULx3.png #### Performance layerbylayer, without finetuning 1 pool5 layer * which is the max pooled output of the network’s fifth and final convolutional layer. *The pool5 feature map is 6 x6 x 256 = 9216 dimensional * each pool5 unit has a receptive field of 195x195 pixels in the original 227x227 pixel input 2 Layer fc6 * fully connected to pool5 * it multiplies a 4096x9216 weight matrix by the pool5 feature map (reshaped as a 9216dimensional vector) and then adds a vector of biases 3 Layer fc7 * It is implemented by multiplying the features computed by fc6 by a 4096 x 4096 weight matrix, and similarly adding a vector of biases and applying halfwave rectification #### Performance layerbylayer, with finetuning * CNN’s parameters finetuned on PASCAL. * finetuning increases mAP by 8.0 % points to 54.2% ### Network architectures * 16layer deep network, consisting of 13 layers of 3 _ 3 convolution kernels, with five max pooling layers interspersed, and topped with three fullyconnected layers. We refer to this network as “ONet” for OxfordNet and the baseline as “TNet” for TorontoNet. * RCNN with ONet substantially outperforms RCNN with TNet, increasing mAP from 58.5% to 66.0% * drawback in terms of compute time, with in terms of compute time, with than TNet. 1 The ILSVRC2013 detection dataset * dataset is split into three sets: train (395,918), val (20,121), and test (40,152) #### CNN features for segmentation. * full RCNN: The first strategy (full) ignores the re region’s shape and computes CNN features directly on the warped window. Two regions might have very similar bounding boxes while having very little overlap. * fg RCNN: the second strategy (fg) computes CNN features only on a region’s foreground mask. We replace the background with the mean input so that background regions are zero after mean subtraction. * full+fg RCNN: The third strategy (full+fg) simply concatenates the full and fg features https://i.imgur.com/n1bhmKo.png
1 Comments

[link]
* They suggest a new stochastic optimization method, similar to the existing SGD, Adagrad or RMSProp. * Stochastic optimization methods have to find parameters that minimize/maximize a stochastic function. * A function is stochastic (nondeterministic), if the same set of parameters can generate different results. E.g. the loss of different minibatches can differ, even when the parameters remain unchanged. Even for the same minibatch the results can change due to e.g. dropout. * Their method tends to converge faster to optimal parameters than the existing competitors. * Their method can deal with nonstationary distributions (similar to e.g. SGD, Adadelta, RMSProp). * Their method can deal with very sparse or noisy gradients (similar to e.g. Adagrad). ### How * Basic principle * Standard SGD just updates the parameters based on `parameters = parameters  learningRate * gradient`. * Adam operates similar to that, but adds more "cleverness" to the rule. * It assumes that the gradient values have means and variances and tries to estimate these values. * Recall here that the function to optimize is stochastic, so there is some randomness in the gradients. * The mean is also called "the first moment". * The variance is also called "the second (raw) moment". * Then an update rule very similar to SGD would be `parameters = parameters  learningRate * means`. * They instead use the update rule `parameters = parameters  learningRate * means/sqrt(variances)`. * They call `means/sqrt(variances)` a 'Signal to Noise Ratio'. * Basically, if the variance of a specific parameter's gradient is high, it is pretty unclear how it should be changend. So we choose a small step size in the update rule via `learningRate * mean/sqrt(highValue)`. * If the variance is low, it is easier to predict how far to "move", so we choose a larger step size via `learningRate * mean/sqrt(lowValue)`. * Exponential moving averages * In order to approximate the mean and variance values you could simply save the last `T` gradients and then average the values. * That however is a pretty bad idea, because it can lead to high memory demands (e.g. for millions of parameters in CNNs). * A simple average also has the disadvantage, that it would completely ignore all gradients before `T` and weight all of the last `T` gradients identically. In reality, you might want to give more weight to the last couple of gradients. * Instead, they use an exponential moving average, which fixes both problems and simply updates the average at every timestep via the formula `avg = alpha * avg + (1  alpha) * avg`. * Let the gradient at timestep (batch) `t` be `g`, then we can approximate the mean and variance values using: * `mean = beta1 * mean + (1  beta1) * g` * `variance = beta2 * variance + (1  beta2) * g^2`. * `beta1` and `beta2` are hyperparameters of the algorithm. Good values for them seem to be `beta1=0.9` and `beta2=0.999`. * At the start of the algorithm, `mean` and `variance` are initialized to zerovectors. * Bias correction * Initializing the `mean` and `variance` vectors to zero is an easy and logical step, but has the disadvantage that bias is introduced. * E.g. at the first timestep, the mean of the gradient would be `mean = beta1 * 0 + (1  beta1) * g`, with `beta1=0.9` then: `mean = 0.9 * g`. So `0.9g`, not `g`. Both the mean and the variance are biased (towards 0). * This seems pretty harmless, but it can be shown that it lowers the convergence speed of the algorithm by quite a bit. * So to fix this pretty they perform biascorrections of the mean and the variance: * `correctedMean = mean / (1beta1^t)` (where `t` is the timestep). * `correctedVariance = variance / (1beta2^t)`. * Both formulas are applied at every timestep after the exponential moving averages (they do not influence the next timestep). ![Algorithm](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Adam__algorithm.png?raw=true "Algorithm") 
[link]
This paper presents an unsupervised generative model, based on the variational autoencoder framework, but where the encoder is a recurrent neural network that sequentially infers the identity, pose and number of objects in some input scene (2D image or 3D scene). In short, this is done by extending the DRAW model to incorporate discrete latent variables that determine whether an additional object is present or not. Since the reparametrization trick cannot be used for discrete variables, the authors estimate the gradient through the sampling operation using a likelihood ratio estimator. Another innovation over DRAW is the application to 3D scenes, in which the decoder is a graphics renderer. Since it is not possible to backpropagate through the renderer, gradients are estimated using finitedifference estimates (which require going through the renderer several times). Experiments are presented where the evaluation is focused on the ability of the model to detect and count the number of objects in the image or scene. **My two cents** This is a nice, natural extension of DRAW. I'm particularly impressed by the results for the 3D scene setting. Despite the fact that setup is obviously synthetic and simplistic, I really surprised that estimating the decoder gradients using finitedifferences worked at all. It's also interesting to see that the proposed model does surprisingly well compared to a CNN supervised approach that directly predicts the objects identity and pose. Quite cool! To see the model in action, see [this cute video][1]. [1]: https://www.youtube.com/watch?v=4tc84kKdpY4 
[link]
[code](https://github.com/openai/improvedgan), [demo](http://infinitechamber35121.herokuapp.com/cifarminibatch/1/?), [related](http://www.inference.vc/understandingminibatchdiscriminationingans/) ### Feature matching problem: overtraining on the current discriminator solution: ￼$E_{x \sim p_{\text{data}}}f(x)  E_{z \sim p_{z}(z)}f(G(z))_{2}^{2}$ were f(x) activations intermediate layer in discriminator ### Minibatch discrimination problem: generator to collapse to a single point solution: for each sample i, concatenate to $f(x_i)$ features $b$ measuring its distance to other samples j (i and j are both real or generated samples in same batch): $\sum_j \exp(M_{i, b}  M_{j, b}_{L_1})$ ￼ this generates visually appealing samples very quickly ### Historical averaging problem: SGD fails by going into extended orbits solution: parameters revert to the mean $ \theta  \frac{1}{t} \sum_{i=1}^t \theta[i] ^2$ ￼ ### Onesided label smoothing problem: discriminator vulnerability to adversarial examples solution: discriminator target for positive samples is 0.9 instead of 1 ### Virtual batch normalization problem: using BN cause output of examples in batch to be dependent solution: use reference batch chosen once at start of training and each sample is normalized using itself and the reference. It's expensive so used only on generation ### Assessment of image quality problem: MTurk not reliable solution: use inception model p(yx) to compute ￼$\exp(\mathbb{E}_x \text{KL}(p(y  x)  p(y)))$ on 50K generated images x ### Semisupervised learning use the discriminator to also classify on K labels when known and use all real samples (labels and unlabeled) in the discrimination task ￼$D(x) = \frac{Z(x)}{Z(x) + 1}, \text{ where } Z(x) = \sum_{k=1}^{K} \exp[l_k(x)]$. In this case use feature matching but not minibatch discrimination. It also improves the quality of generated images.
3 Comments

[link]
Dinh et al. show that it is unclear whether flat minima necessarily generalize better than sharp ones. In particular, they study several notions of flatness, both based on the local curvature and based on the notion of “low change in error”. The authors show that the parameterization of the network has a significant impact on the flatness; this means that functions leading to the same prediction function (i.e., being indistinguishable based on their test performance) might have largely varying flatness around the obtained minima, as illustrated in Figure 1. In conclusion, while networks that generalize well usually correspond to flat minima, it is not necessarily true that flat minima generalize better than sharp ones. https://i.imgur.com/gHfolEV.jpg Figure 1: Illustration of the influence of parameterization on the flatness of the obtained minima. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
This paper presents an interpretation of dropout training as performing approximate Bayesian learning in a deep Gaussian process (DGP) model. This connection suggests a very simple way of obtaining, for networks trained with dropout, estimates of the model's output uncertainty. This estimate is based and computed from an ensemble of networks each obtained by sampling a new dropout mask. #### My two cents This is a really nice and thought provoking contribution to our understanding of dropout. Unfortunately, the paper in fact doesn't provide a lot of comparisons with either other ways of estimating the predictive uncertainty of deep networks, or to other approximate inference schemes in deep GPs (actually, see update below). The qualitative examples provided however do suggest that the uncertainty estimate isn't terrible. Irrespective of the quality of the uncertainty estimate suggested here, I find the observation itself really valuable. Perhaps future research will then shed light on how useful that method is compared to other approaches, including Bayesian dark knowledge \cite{conf/nips/BalanRMW15}. `Update: On September 27th`, the authors uploaded to arXiv a new version that now includes comparisons with 2 alternative Bayesian learning methods for deep networks, specifically the stochastic variational inference approach of Graves and probabilistic backpropagation of HernandezLobato and Adams. Dropout actually does very well against these baselines and, across datasets, is almost always amongst the best performing method! 
[link]
Deeper networks should never have a higher **training** error than smaller ones. In the worst case, the layers should "simply" learn identities. It seems as this is not so easy with conventional networks, as they get much worse with more layers. So the idea is to add identity functions which skip some layers. The network only has to learn the **residuals**. Advantages: * Learning the identity becomes learning 0 which is simpler * Loss in information flow in the forward pass is not a problem anymore * No vanishing / exploding gradient * Identities don't have parameters to be learned ## Evaluation The learning rate starts at 0.1 and is divided by 10 when the error plateaus. Weight decay of 0.0001 ($10^{4}$), momentum of 0.9. They use minibatches of size 128. * ImageNet ILSVRC 2015: 3.57% (ensemble) * CIFAR10: 6.43% * MS COCO: 59.0% mAp@0.5 (ensemble) * PASCAL VOC 2007: 85.6% mAp@0.5 * PASCAL VOC 2012: 83.8% mAp@0.5 ## See also * [DenseNets](http://www.shortscience.org/paper?bibtexKey=journals/corr/1608.06993) 
[link]
Spatial Pyramid Pooling (SPP) is a technique which allows Convolutional Neural Networks (CNNs) to use input images of any size, not only $224\text{px} \times 224\text{px}$ as most architectures do. (However, there is a lower bound for the size of the input image). ## Idea * Convolutional layers operate on any size, but fully connected layers need fixedsize inputs * Solution: * Add a new SPP layer on top of the last convolutional layer, before the fully connected layer * Use an approach similar to bag of words (BoW), but maintain the spatial information. The BoW approach is used for text classification, where the order of the words is discarded and only the number of occurences is kept. * The SPP layer operates on each feature map independently. * The output of the SPP layer is of dimension $k \cdot M$, where $k$ is the number of feature maps the SPP layer got as input and $M$ is the number of bins. Example: We could use spatial pyramid pooling with 21 bins: * 1 bin which is the max of the complete feature map * 4 bins which divide the image into 4 regions of equal size (depending on the input size) and rectangular shape. Each bin gets the max of its region. * 16 bins which divide the image into 4 regions of equal size (depending on the input size) and rectangular shape. Each bin gets the max of its region. ## Evaluation * Pascal VOC 2007, Caltech101: stateoftheart, without finetuning * ImageNet 2012: Boosts accuracy for various CNN architectures * ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2014: Rank #2 ## Code The paper claims that the code is [here](http://research.microsoft.com/enus/um/people/kahe/), but this seems not to be the case any more. People have tried to implement it with Tensorflow ([1](http://stackoverflow.com/q/40913794/562769), [2](https://github.com/fchollet/keras/issues/2080), [3](https://github.com/tensorflow/tensorflow/issues/6011)), but by now no public working implementation is available. ## Related papers * [Atrous Convolution](https://arxiv.org/abs/1606.00915)
1 Comments

[link]
The authors have a dataset of 780 electronic health records and they use it to detect various medical events such as adverse drug events, drug dosage, etc. The task is done by assigning a label to each word in the document. https://i.imgur.com/bZ7yM0z.png Annotation statistics for the corpus of health records. They look at CRFs, LSTMs and GRUs. Both LSTMs and GRUs outperform the CRF, but the best performance is achieved by a GRU trained on whole documents. 