Welcome to ShortScience.org! 
[link]
This is follow-up work to the ResNets paper. It studies the propagation formulations behind the connections of deep residual networks and performs ablation experiments. A residual block can be represented with the equations $y_l = h(x_l) + F(x_l, W_l)$ and $x_{l+1} = f(y_l)$, where $x_l$ is the input to the $l$-th unit and $x_{l+1}$ is its output. In the original ResNets paper, $h(x_l) = x_l$, $f$ is ReLU, and $F$ consists of 2-3 convolutional layers (bottleneck architecture) with BN and ReLU in between. In this paper, they propose a residual block with both $h(x)$ and $f(x)$ as identity mappings, which trains faster and performs better than their earlier baseline.

Main contributions:

- Identity skip connections work much better than the other multiplicative interactions that they experiment with:
    - Scaling ($h(x) = \lambda x$): Gradients can explode or vanish depending on whether the modulating scalar $\lambda > 1$ or $\lambda < 1$.
    - Gating ($1-g(x)$ for the skip connection and $g(x)$ for the function $F$): For gradients to propagate freely, $1-g(x)$ should approach 1, but then $g(x)$ approaches 0 and $F$ gets suppressed, hence this is suboptimal. This is similar to highway networks. $g(x)$ is a 1x1 convolutional layer.
    - Gating (shortcut-only): Initializing the gate biases so that the shortcut starts close to an identity mapping brings the test error much closer to the baseline.
    - 1x1 convolutional shortcut: These work well for shallower networks (~34 layers), but training error becomes high for deeper networks, probably because they impede gradient propagation.
- Experiments on activations:
    - BN after addition messes up information flow and performs considerably worse.
    - ReLU before addition forces the signal to be non-negative, so the signal is monotonically increasing, while ideally a residual function should be free to take values in $(-\infty, +\infty)$.
    - BN + ReLU pre-activation works best. This also prevents overfitting, due to BN's regularizing effect. Input signals to all weight layers are normalized.
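To make the scaling argument concrete, here is a minimal sketch. It only models the gradient term that flows purely through the shortcuts (ignoring the contribution of $F$), under the assumption that every unit uses the same scalar; the function name is hypothetical.

```python
def shortcut_gradient_factor(lam: float, depth: int) -> float:
    """Factor applied to the gradient that flows only through the shortcuts.

    With h(x) = lam * x in every unit, backpropagating through `depth`
    stacked units multiplies the shortcut-path gradient by lam ** depth.
    """
    return lam ** depth


if __name__ == "__main__":
    # Compare a slightly-too-small, an identity, and a slightly-too-large scalar.
    for lam in (0.9, 1.0, 1.1):
        factor = shortcut_gradient_factor(lam, 100)
        print(f"lambda={lam}: factor over 100 units = {factor:.3e}")
```

Only $\lambda = 1$ (the identity shortcut) leaves this term unchanged regardless of depth, which matches the paper's preference for identity mappings.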
## Strengths

- Thorough set of experiments to show that identity shortcut connections are easiest for the network to learn. The activation of any deeper unit can be written as the sum of the activation of a shallower unit and a residual function. This also implies that gradients can be directly propagated to shallower units. This is in contrast to usual feedforward networks, where gradients are essentially a series of matrix-vector products that may vanish as networks grow deeper.
- Improved accuracies over their previous ResNets paper.

## Weaknesses / Notes

- Residual units are useful and share the same core idea that worked in LSTM units. Even though stacked non-linear layers are capable of asymptotically approximating any arbitrary function, it is clear from recent work that residual functions are much easier to approximate than the complete function. The [latest Inception paper](http://arxiv.org/abs/1602.07261) also reports that training is accelerated and performance is improved by using identity skip connections across Inception modules.
- It seems like the degradation problem, which serves as motivation for residual units, exists in the first place for non-idempotent activation functions such as sigmoid and hyperbolic tangent. This merits further investigation, especially with recent work on function-preserving transformations such as [Network Morphism](http://arxiv.org/abs/1603.01670), which expands the Net2Net idea to sigmoid and tanh by using parameterized activations, initialized to identity mappings.
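The additive form mentioned under Strengths can be written out explicitly. Assuming both $h$ and $f$ are identity mappings, so that $x_{l+1} = x_l + F(x_l, W_l)$, unrolling the recursion from unit $l$ to a deeper unit $L$ gives

$$x_L = x_l + \sum_{i=l}^{L-1} F(x_i, W_i)$$

and, writing $\mathcal{E}$ for the loss (a symbol introduced here for illustration), the chain rule gives

$$\frac{\partial \mathcal{E}}{\partial x_l} = \frac{\partial \mathcal{E}}{\partial x_L} \left( 1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F(x_i, W_i) \right)$$

The constant term 1 is what lets the gradient at $x_L$ reach any shallower unit directly, without passing through a product of weight matrices.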
[link]
This paper builds on a number of existing ideas for building neural conversational agents, with the aim of guarding against generic and repetitive responses. Their model is the sequence-to-sequence model with attention (Bahdanau et al.), first trained with the usual MLE loss and then fine-tuned with policy gradients to optimize for specific conversational properties. Specifically, they define 3 rewards:

1. Ease of answering: measured via the likelihood of responding to a query with one of a list of hand-picked dull responses (a lower likelihood, i.e. a larger negative log-likelihood, gives a higher reward).
2. Information flow: consecutive responses from the same agent (person) should carry different information, measured as the negative log of the cosine similarity between them (lower similarity gives a higher reward).
3. Semantic coherence: mutual information between source and target (the response should make sense with respect to the query), $P(a|q) + P(q|a)$, where $a$ is the answer and $q$ is the question.

The model is pre-trained with the usual supervised objective function, taking as source the concatenation of the two previous utterances. Then they have two stages of policy gradient training, first with just a mutual information reward and then with a combination of all three. The policy network (sequence-to-sequence model) produces a probability distribution over actions (responses) given the state (previous utterances). To estimate the gradient in an iteration, the network is frozen and responses are sampled from the model; their rewards are averaged, and gradients are computed for the first $L$ tokens of the response using MLE and for the remaining $T-L$ tokens using policy gradients, with $L$ gradually annealed to zero (moving towards just the long-term reward).

Evaluation is based on length of dialogue, diversity (distinct unigrams and bigrams), and human studies on:

1. Which of two outputs has better quality (single turn),
2. Which of two outputs is easier to respond to, and
3. Which of two conversations has better quality (multi turn).
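As an illustration of the information-flow reward, here is a minimal sketch that takes two turn representations as plain lists of floats. The function names and the clamping constant `eps` are assumptions made here so the logarithm stays defined; the summary does not say how non-positive similarities are handled.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def information_flow_reward(h_prev, h_curr, eps=1e-8):
    """Negative log cosine similarity between consecutive turns of one agent.

    Similar turns (cosine near 1) get a reward near 0; dissimilar turns get
    a larger reward. Clamping at `eps` is an assumption, not from the paper.
    """
    sim = max(cosine_similarity(h_prev, h_curr), eps)
    return -math.log(sim)
```

Repeating the previous turn verbatim yields a reward of about 0, so the agent gains nothing from repetition.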
## Strengths

- Interesting results
    - Avoids generic responses
    - The 'ease of responding' reward encourages responses to be question-like
- Adding hand-engineered approximate reward functions based on conversational properties, and using them to fine-tune a pre-trained network with policy gradients, is neat.
- Policy gradient training also encourages two dialogue agents to interact with each other and explore the complete action space (space of responses), which seems desirable for identifying modes of the distribution rather than converging on a single, high-scoring, generic response.

## Weaknesses / Notes

- Evaluating conversational agents is hard. BLEU and perplexity are intentionally avoided as they don't necessarily reward desirable conversational properties.
[link]
This method is based on improving the speed of R-CNN \cite{conf/cvpr/GirshickDDM14}.

1. Where R-CNN would have two different objective functions, Fast R-CNN combines localization and classification losses into a "multi-task loss" in order to speed up training.
2. It also uses a pooling method based on \cite{journals/pami/HeZR015}, called the RoI pooling layer, that pools each region of interest to a fixed size, so the image regions don't have to be rescaled before being sent as input to the CNN. "RoI max pooling works by dividing the $h \times w$ RoI window into an $H \times W$ grid of sub-windows of approximate size $h/H \times w/W$ and then max-pooling the values in each sub-window into the corresponding output grid cell."
3. Backprop through the RoI pooling layer routes each incoming gradient to the input position that was the argmax of its sub-window, accumulating the gradients where RoIs overlap.

This method is further improved by the paper "Faster R-CNN" \cite{conf/nips/RenHGS15}.
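The quoted RoI max pooling rule can be sketched in a few lines. This is an illustrative version on a single-channel feature map stored as a list of lists; the `(top, left, h, w)` RoI format and the floor/ceil rounding of sub-window boundaries are assumptions made here, not details taken from the paper.

```python
def _ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool an h x w RoI window into an out_h x out_w grid.

    roi is (top, left, h, w) in feature-map coordinates.
    """
    top, left, h, w = roi
    pooled = []
    for i in range(out_h):
        # Row range of the i-th sub-window (approximately h / out_h rows).
        r0 = top + (i * h) // out_h
        r1 = top + _ceil_div((i + 1) * h, out_h)
        row = []
        for j in range(out_w):
            # Column range of the j-th sub-window.
            c0 = left + (j * w) // out_w
            c1 = left + _ceil_div((j + 1) * w, out_w)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled
```

Whatever the RoI size, the output always has `out_h` x `out_w` cells, which is what lets the following fully connected layers accept regions of arbitrary shape.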
[link]
# Object detection system overview

https://i.imgur.com/vd2YUy3.png

The system:

1. takes an input image,
2. extracts around 2000 bottom-up region proposals,
3. computes features for each proposal using a large convolutional neural network (CNN), and then
4. classifies each region using class-specific linear SVMs.

* R-CNN achieves a mean average precision (mAP) of 53.7% on PASCAL VOC 2010.
* On the 200-class ILSVRC2013 detection dataset, R-CNN’s mAP is 31.4%, a large improvement over OverFeat, which had the previous best result at 24.3%.

## Two challenges faced in object detection

1. The localization problem
2. Labeling the data

**1. Localization problem**

* One approach frames localization as a regression problem. Work taking this route reports a mAP of 30.5% on VOC 2007, compared to the 58.5% achieved by R-CNN.
* An alternative is to build a sliding-window detector. The authors considered this, but units high up in their network, which has five convolutional layers, have very large receptive fields (195 x 195 pixels) and strides (32x32 pixels) in the input image, which makes precise localization within the sliding-window paradigm an open technical challenge.

**2. Labeling the data**

* The conventional solution to this problem is to use unsupervised pre-training, followed by supervised fine-tuning.
* R-CNN instead uses supervised pre-training on a large auxiliary dataset (ILSVRC), followed by domain-specific fine-tuning on a small dataset (PASCAL).
* Fine-tuning for detection improves mAP by 8 percentage points.
* Stochastic gradient descent via backpropagation has been shown to be effective for training convolutional neural networks (CNNs).

## Object detection with R-CNN

This system consists of three modules:

* The first generates category-independent region proposals. These proposals define the set of candidate detections available to the detector.
* The second module is a large convolutional neural network that extracts a fixed-length feature vector from each region.
* The third module is a set of class-specific linear SVMs.
### Module design

**1. Region proposals**

* Among earlier approaches, one detects mitotic cells by applying a CNN to regularly-spaced square crops.
* R-CNN uses the selective search method in fast mode (Capture All Scales, Diversification, Fast to Compute).
* Computing region proposals and features takes 13s/image on a GPU or 53s/image on a CPU.

**2. Feature extraction**

* Extract a 4096-dimensional feature vector from each region proposal using the Caffe implementation of the CNN.
* Features are computed by forward propagating a mean-subtracted 227x227 RGB image through five convolutional layers and two fully connected layers.
* Warp all pixels in a tight bounding box around each proposal to the required size.
* The feature matrix is typically 2000x4096.

**3. Test-time detection**

* At test time, run selective search on the test image to extract around 2000 region proposals (we use selective search’s “fast mode” in all experiments).
* Warp each proposal and forward propagate it through the CNN in order to compute features. Then, for each class, we score each extracted feature vector using the SVM trained for that class.
* Given all scored regions in an image, we apply a greedy non-maximum suppression (for each class independently) that rejects a region if it has an intersection-over-union (IoU) overlap with a higher scoring selected region larger than a learned threshold.

## Training

**1. Supervised pre-training**

* Pre-train the CNN on a large auxiliary dataset (ILSVRC2012 classification) using image-level annotations only (bounding box labels are not available for this data).

**2. Domain-specific fine-tuning**

* Continue stochastic gradient descent (SGD) training of the CNN parameters using only warped region proposals, with a learning rate of 0.001.

**3. Object category classifiers**

* Use an intersection-over-union (IoU) overlap threshold of 0.3 to label regions.
* Once features are extracted and training labels are applied, we optimize one linear SVM per class.
* Adopt the standard hard negative mining method, since the training data is too large to fit in memory.

### Results on PASCAL VOC 2010-12

**1. VOC 2010**

* Compared against four strong baselines: SegDPM, DPM, UVA, and Regionlets.
* Achieves a large improvement in mAP, from 35.1% to 53.7%, while also being much faster.

https://i.imgur.com/0dGX9b7.png

**2. ILSVRC2013 detection**

* Ran R-CNN on the 200-class ILSVRC2013 detection dataset.
* R-CNN achieves a mAP of 31.4%.

https://i.imgur.com/GFbULx3.png

#### Performance layer-by-layer, without fine-tuning

**1. Layer pool5**

* It is the max-pooled output of the network’s fifth and final convolutional layer.
* The pool5 feature map is 6 x 6 x 256 = 9216-dimensional.
* Each pool5 unit has a receptive field of 195x195 pixels in the original 227x227 pixel input.

**2. Layer fc6**

* Fully connected to pool5.
* It multiplies a 4096x9216 weight matrix by the pool5 feature map (reshaped as a 9216-dimensional vector) and then adds a vector of biases.

**3. Layer fc7**

* It is implemented by multiplying the features computed by fc6 by a 4096 x 4096 weight matrix, and similarly adding a vector of biases and applying half-wave rectification.

#### Performance layer-by-layer, with fine-tuning

* The CNN’s parameters are fine-tuned on PASCAL.
* Fine-tuning increases mAP by 8.0 percentage points to 54.2%.

### Network architectures

* A 16-layer deep network, consisting of 13 layers of 3 x 3 convolution kernels, with five max pooling layers interspersed, and topped with three fully-connected layers. We refer to this network as “O-Net” for OxfordNet and the baseline as “T-Net” for TorontoNet.
* R-CNN with O-Net substantially outperforms R-CNN with T-Net, increasing mAP from 58.5% to 66.0%.
* The drawback is compute time: the forward pass of O-Net is considerably slower than that of T-Net.

**The ILSVRC2013 detection dataset**

* The dataset is split into three sets: train (395,918), val (20,121), and test (40,152).

#### CNN features for segmentation
* full R-CNN: The first strategy (full) ignores the region’s shape and computes CNN features directly on the warped window. Two regions might have very similar bounding boxes while having very little overlap.
* fg R-CNN: The second strategy (fg) computes CNN features only on a region’s foreground mask. We replace the background with the mean input so that background regions are zero after mean subtraction.
* full+fg R-CNN: The third strategy (full+fg) simply concatenates the full and fg features.

https://i.imgur.com/n1bhmKo.png
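The greedy non-maximum suppression step described under test-time detection can be sketched as follows. This is a minimal illustration, assuming boxes are `(x1, y1, x2, y2)` tuples and the threshold is passed in by the caller; the function names and example values are hypothetical, and the sketch handles one class at a time.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, threshold):
    """Return indices of kept boxes, highest score first.

    A box is rejected if its IoU with an already selected, higher scoring
    box is larger than `threshold`.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= threshold for k in keep):
            keep.append(i)
    return keep
```

For example, with boxes `[(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]`, scores `[0.9, 0.8, 0.7]`, and an illustrative threshold of 0.3, the second box is suppressed by the first and the third is kept.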

[link]
FaceNet directly maps face images to $\mathbb{R}^{128}$, where distances directly correspond to a measure of face similarity. They use a triplet loss function. The triplet is (face of person A, other face of person A, face of a person who is not A). Later, this is called (anchor, positive, negative).

The embedding is learned with a loss function inspired by LMNN. The idea is to minimize the distance between the two images of the same person and maximize the distance to the other person's image.

## LMNN

Large Margin Nearest Neighbor (LMNN) learns a pseudo-metric

$$d(x, y) = (x - y) M (x - y)^T$$

where $M$ is a positive semi-definite matrix. The only difference between a pseudo-metric and a metric is that $d(x, y) = 0 \Leftrightarrow x = y$ need not hold.

## Curriculum Learning: Triplet selection

Show simple examples first, then increase the difficulty. This is done by selecting the triplets.

They use the triplets which are *hard*. For the positive example, this means the distance between the anchor and the positive example is high. For the negative example, this means the distance between the anchor and the negative example is low.

They want to have

$$\|f(x_i^a) - f(x_i^p)\|_2^2 + \alpha < \|f(x_i^a) - f(x_i^n)\|_2^2$$

where $\alpha$ is a margin, $x_i^a$ is the anchor, $x_i^p$ is the positive face example and $x_i^n$ is the negative example. They increase $\alpha$ over time. It is crucial that $f$ maps the images not into the complete $\mathbb{R}^{128}$, but onto the unit sphere. Otherwise one could satisfy any margin trivially by rescaling the embedding, e.g. $f' = 2 \cdot f$ quadruples all squared distances.

## Tasks

* **Face verification**: Is this the same person?
* **Face recognition**: Who is this person?
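The triplet constraint from the triplet-selection section can be turned into a hinge-style loss. The sketch below works on plain lists of floats and L2-normalizes each embedding first, so all points lie on the unit sphere; the function names and the default margin of 0.2 are illustrative assumptions.

```python
import math


def l2_normalize(v):
    """Project a non-zero vector onto the unit sphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_distance(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Hinge on the constraint d(a, p) + alpha < d(a, n).

    The loss is zero once the negative is at least `alpha` farther from the
    anchor (in squared distance) than the positive is.
    """
    a, p, n = (l2_normalize(x) for x in (anchor, positive, negative))
    return max(0.0, squared_distance(a, p) - squared_distance(a, n) + alpha)
```

Because of the normalization, multiplying all embeddings by a constant leaves the loss unchanged, which is the point made above about the unit sphere.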
## Datasets

* 99.63% accuracy on Labeled Faces in the Wild (LFW)
* 95.12% accuracy on YouTube Faces DB

## Network

Two models are evaluated: The [Zeiler & Fergus model](http://www.shortscience.org/paper?bibtexKey=journals/corr/ZeilerF13) and an architecture based on the [Inception model](http://www.shortscience.org/paper?bibtexKey=journals/corr/SzegedyLJSRAEVR14).

## See also

* [DeepFace](http://www.shortscience.org/paper?bibtexKey=conf/cvpr/TaigmanYRW14#martinthoma)
[link]
_Disclaimer: I'm the first author of this paper._

The code for this paper can be found at https://github.com/fabioperez/skindataaugmentation.

In this work, we wanted to compare different data augmentation scenarios for skin lesion analysis. We tried 13 scenarios, including commonly used augmentation techniques (color and geometry transformations), unusual ones (random erasing, elastic transformation, and a novel lesion mix to simulate collision lesions), and combinations of those.

Examples of the augmentation scenarios:

https://i.imgur.com/TpgxzLZ.png

* a) no augmentation
* b) color (saturation, contrast, and brightness)
* c) color (saturation, contrast, brightness, and hue)
* d) affine (rotation, shear, scaling)
* e) random flips
* f) random crops
* g) random erasing
* h) elastic
* i) lesion mix
* j) basic set (f, d, e, c)
* k) basic set + erasing (f, g, d, e, c)
* l) basic set + elastic (f, d, h, e, c)
* m) basic set + mix (i, f, d, e, c)

We used the ISIC 2017 Challenge dataset (2000 training images, 150 validation images, and 600 test images). We tried three network architectures: Inception-v4, ResNet-152, and DenseNet-161.

We also compared different test-time data augmentation methods: a) no augmentation; b) 144-crops; c) the same data augmentation as in training (64 augmented copies of the original image). The final prediction was the average of all augmented predictions.

## Results

https://i.imgur.com/WK5VKUf.png

* The basic set (a combination of commonly used augmentations) is the best scenario.
* Data augmentation at test time is very beneficial.
* Elastic is better than no augmentation, but when incorporated into the basic set, it decreases performance.
* The best result was better than the winner of the challenge in 2017, without using ensembling.
* Test-time data augmentation performs very similarly to 144-crop, but uses fewer images during prediction (64 vs 144), so it's faster.
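The test-time augmentation scheme (average the predictions over augmented copies) can be sketched generically. Here `model` and `augment` are hypothetical callables supplied by the caller, not functions from the paper's repository: `model(image)` is assumed to return a list of class probabilities, and `augment(image, rng)` a randomly transformed copy.

```python
import random


def predict_with_tta(model, image, augment, n_copies=64, seed=0):
    """Average class probabilities over `n_copies` augmented copies of `image`."""
    rng = random.Random(seed)
    predictions = [model(augment(image, rng)) for _ in range(n_copies)]
    n_classes = len(predictions[0])
    # Element-wise mean across all augmented predictions.
    return [sum(p[c] for p in predictions) / n_copies for c in range(n_classes)]
```

With `n_copies=64` this matches the number of augmented copies mentioned above, compared to 144 forward passes for the 144-crop alternative.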
# Impact of data augmentation on dataset sizes

We also used the basic set scenarios on different dataset sizes by sampling random subsets of the original dataset, with sizes 1500, 1000, 500, 250 and 125.

https://i.imgur.com/m3Ut6ht.png

## Results

* Using data augmentation can be better than using more data (but you should always use more data, since the model can benefit from both). For instance, using 500 images with data augmentation at training and test time for Inception is better than training on 2000 images with no data augmentation.
* ResNet and DenseNet work better than Inception when less data is available.
* Test-time data augmentation is always better than not augmenting at test time.
* Using data augmentation only at training time was worse than not augmenting at all in some cases.
[link]
**Object detection** is the task of drawing one bounding box around each instance of the type of object one wants to detect. Typically, image classification is done before object detection. With neural networks, the usual procedure for object detection is to train a classification network and replace the last layer with a regression layer which essentially predicts pixel-wise whether the object is there or not. A bounding box inference algorithm is added at the end to make a consistent prediction (see [Deep Neural Networks for Object Detection](http://papers.nips.cc/paper/5207deepneuralnetworksforobjectdetection.pdf)).

The paper introduces RPNs (Region Proposal Networks). They are trained end-to-end to generate region proposals. They simultaneously regress region bounds and objectness scores at each location on a regular grid.

RPNs are one type of fully convolutional network. They take an image of any size as input and output a set of rectangular object proposals, each with an objectness score.

## See also

* [R-CNN](http://www.shortscience.org/paper?bibtexKey=conf/iccv/Girshick15#joecohen)
* [Fast R-CNN](http://www.shortscience.org/paper?bibtexKey=conf/iccv/Girshick15#joecohen)
* [Faster R-CNN](http://www.shortscience.org/paper?bibtexKey=conf/nips/RenHGS15#martinthoma)
* [Mask R-CNN](http://www.shortscience.org/paper?bibtexKey=journals/corr/HeGDG17)
[link]
Here the authors present a model which projects queries and documents into a low-dimensional space, where you can fetch relevant documents by computing the distance (*here cosine is used*) between the query vector and the document vectors.

### Model Description

#### Word Hashing Layer

They use a bag of letter trigrams for representing words (office -> #office# -> {#of, off, ffi, fic, ice, ce#}). This generalizes to unseen words and maps morphological variations of the same word to points which are close in n-gram space.

#### Context Window Vector

Then, for representing a sentence, they take a window of `Window Size` words around each word and concatenate their vectors to form a context window vector. If we take `Window Size` = 3:

(He is going to Office -> { [vec of 'he', vec of 'is', vec of 'going'], [vec of 'is', vec of 'going', vec of 'to'], [vec of 'going', vec of 'to', vec of 'Office'] })

#### Convolutional Layer and Max-Pool Layer

Run a convolutional layer over each of the context window vectors (for intuition, these are local features). Max pool over the resulting features to get global features. The output dimension is taken here to be 300.

#### Semantic Layer

Use a fully connected layer to project the 300-D vector to a 128-D vector.

They use two different networks, one for queries and the other for documents. For each query and document (we are given labeled documents, one of which is positive and the rest negative) they compute the cosine similarity of the 128-D output vectors. They then learn the weights of the convolutional filters and the fully connected layer by maximizing the conditional likelihood of the positive documents.

My thinking is that they use two different networks because there is a significant difference between query length and document length.
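The word hashing and context window steps above can be sketched as two small helpers. The function names are hypothetical, and lowercasing the word before hashing is an assumption made here for illustration.

```python
def letter_trigrams(word):
    """Letter trigrams of a word padded with '#' boundary markers."""
    padded = f"#{word.lower()}#"  # lowercasing is an assumption
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def context_windows(tokens, window_size=3):
    """All contiguous windows of `window_size` tokens, in order."""
    return [tokens[i:i + window_size]
            for i in range(len(tokens) - window_size + 1)]


if __name__ == "__main__":
    print(letter_trigrams("office"))
    print(context_windows("He is going to Office".split()))
```

In the model each token inside a window would be replaced by its trigram count vector and the vectors concatenated; the sketch only shows how the trigrams and windows themselves are formed.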
[link]
This paper describes a class of algorithms for classification or regression in the online setting. That is, the data is a sequence of pairs $(X_t, Y_t)$ (where $X_t$ may be a vector), and these data items arrive in some order: the algorithm must predict each $\hat{Y}_t$ using only $X_t$ and the previously seen pairs. In the regression setting, each misprediction has a loss like $(Y_t - \hat{Y}_t)^2$, and in the classification setting $Y_t$ is always 0 or 1 and the loss is $|Y_t - \hat{Y}_t|$.

Roughly, the algorithm makes linear predictions using some internal weight vector ($\hat{y} = w \cdot X$), and does a gradient-descent-like weight update. However, it tries to keep the $q$-norm ($q$ can be any number) of the weight vector "small", preventing the weights themselves from becoming too large. The algorithm is actually simple, and the weight update takes advantage of link functions, which the author defines.

The majority of the paper is focused on deriving loss bounds, showing that the loss incurred by this algorithm isn't much worse than that incurred by the best weight vector chosen in hindsight. Typical readers will be interested in the first few pages, as the latter part of the paper is mainly technical proofs.
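The online protocol and a link-function-based update can be sketched as below. The summary does not state the exact link function, so this sketch assumes one common choice from the $p$-norm algorithm literature, the gradient of $\frac{1}{2}\|\theta\|_q^2$ (with $q > 1$); the function names, learning rate, and the use of the squared loss are likewise illustrative assumptions. With $q = 2$ the link is the identity and the update reduces to plain gradient descent.

```python
import math


def pnorm_link(theta, q):
    """Gradient of 0.5 * ||theta||_q^2; assumed link function, q > 1."""
    norm = sum(abs(t) ** q for t in theta) ** (1.0 / q)
    if norm == 0.0:
        return [0.0] * len(theta)
    return [math.copysign(abs(t) ** (q - 1), t) / norm ** (q - 2) for t in theta]


def run_online(stream, dim, q=2.0, eta=0.1):
    """Predict each y from x before seeing it, then update.

    Returns the final weight vector and the per-round squared losses.
    """
    theta = [0.0] * dim  # internal parameters, updated additively
    losses = []
    for x, y in stream:
        w = pnorm_link(theta, q)                      # weights used to predict
        y_hat = sum(wi * xi for wi, xi in zip(w, x))  # linear prediction
        losses.append((y - y_hat) ** 2)
        # Gradient-descent-like step on the internal parameters.
        theta = [t - eta * (y_hat - y) * xi for t, xi in zip(theta, x)]
    return pnorm_link(theta, q), losses
```

The total of `losses` is the quantity that the paper's bounds compare against the loss of the best fixed weight vector chosen in hindsight.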
[link]
This paper proposes a variant of the Neural Turing Machine (NTM) for meta-learning or "learning to learn", in the specific context of few-shot learning (i.e. learning from few examples). Specifically, the proposed model is trained to ingest as input a training set of examples and improve its output predictions as examples are processed, in a purely feed-forward way. This is a form of meta-learning because the model is trained so that its forward pass effectively executes a form of "learning" from the examples it is fed as input.

During training, the model is fed multiple sequences (referred to as episodes) of labeled examples $({\bf x}_1, {\rm null}), ({\bf x}_2, y_1), \dots, ({\bf x}_T, y_{T-1})$, where $T$ is the size of the episode. For instance, if the model is trained to learn how to do 5-class classification from 10 examples per class, $T$ would be $5 \times 10 = 50$. Mainly, the paper presents experiments on the Omniglot dataset, which has 1623 classes. In these experiments, classes are separated into 1200 "training classes" and 423 "test classes", and each episode is generated by randomly selecting 5 classes (each assigned some arbitrary vector representation, e.g. a one-hot vector that is consistent within the episode, but not across episodes) and constructing a randomly ordered sequence of 50 examples from within the chosen 5 classes. Moreover, the correct label $y_t$ of a given input ${\bf x}_t$ is always provided only at the next time step, but the model is trained to be good at its prediction of the label of ${\bf x}_t$ at the current time step. This is akin to the scenario of online learning on a stream of examples, where the label of an example is revealed only once the model has made a prediction.

The proposed NTM is different from the original NTM of Alex Graves, mostly in how it writes into its memory. The authors propose to focus writing to either the least recently used memory location or the most recently used memory location.
Moreover, the least recently used memory location is reset to zero before every write (an operation that seems to be ignored when backpropagating gradients). Intuitively, the proposed NTM should learn a strategy by which, given a new input, it looks into its memory for information from other examples earlier in the episode (perhaps similarly to what a nearest neighbor classifier would do) to predict the class of the new input.

The paper presents experiments in learning to do multi-class classification on the Omniglot dataset and regression based on functions synthetically generated by a GP. The highlights are that:

1. The proposed model performs much better than an LSTM and better than an NTM with the original write mechanism of Alex Graves (for classification).
2. The proposed model even performs better than a 1-nearest neighbor classifier.
3. The proposed model is even shown to outperform human performance, for the 5-class scenario.
4. The proposed model has decent performance on the regression task, compared to GP predictions using the ground-truth kernel.

**My two cents**

This is probably one of my favorite ICML 2016 papers. I really think meta-learning is a problem that deserves more attention, and this paper presents both an interesting proposal for how to do it and an interesting empirical investigation of it.

Much like previous work [\[1\]][1] [\[2\]][2], learning is based on automatically generating a meta-learning training set. This is clever I think, since a very large number of such "meta-learning" examples (the episodes) can be constructed, thus transforming what is normally a "small data problem" (few-shot learning) into a "big data problem", for which deep learning is more effective.

I'm particularly impressed by how the proposed model outperforms a 1-nearest neighbor classifier. That said, the proposed NTM actually performs 4 reads at each time step, which suggests that a fairer comparison might be with a 4-nearest neighbor classifier.
I do wonder how this baseline would compare. I'm also impressed with the observation that the proposed model surpassed humans.

The paper also proposes to use 5-letter words to describe classes, instead of one-hot vectors. The motivation is that this should make it easier for the model to scale to many more than 5 classes. However, I don't entirely follow the logic as to why one-hot vectors are problematic. In fact, I would think that arbitrarily assigning 5-letter words to classes would instead imply some similarity between classes that share letters, a similarity that is arbitrary and doesn't reflect true class similarity.

Also, while I find it encouraging that the performance for regression of the proposed model is decent, I'm curious about how it would compare with a GP approach that incrementally learns the kernel's hyper-parameters (instead of using the ground-truth values, which makes this baseline unrealistically strong).

Finally, I'm still not 100% sure how exactly the NTM is able to implement the type of feed-forward inference I'd expect to be required. I would expect it to learn a memory representation of examples that combines information from the input vector ${\bf x}_t$ *and* its label $y_t$. However, since the label of an input is presented at the following time step in an episode, it is not intuitive to me how the read/write mechanisms are able to deal with this misalignment. My only guess is that since the controller is an LSTM, it can somehow remember ${\bf x}_t$ until it gets $y_t$ and appropriately include the combined information in the memory. This could be supported by the fact that using a non-recurrent feed-forward controller is much worse than using an LSTM controller. But I'm not 100% sure of this either.

All the above being said, this is still a really great paper, which I hope will help stimulate more research on meta-learning. Hopefully code for this paper can eventually be released, which would help in popularizing the topic.
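For readers who want to experiment in the meantime, the episode construction described at the start of this summary can be sketched as follows. This is an illustrative reconstruction from the description, not the authors' code: the function name, the integer label encoding, and the `examples_by_class` dictionary format are all assumptions.

```python
import random


def make_episode(examples_by_class, n_classes=5, per_class=10, seed=None):
    """Build one episode of (x_t, y_{t-1}) inputs and y_t targets.

    examples_by_class maps a class name to a list of examples. Labels are
    re-assigned at random in every episode, so they are only consistent
    within an episode.
    """
    rng = random.Random(seed)
    classes = rng.sample(sorted(examples_by_class), n_classes)
    labels = list(range(n_classes))
    rng.shuffle(labels)  # episode-specific class-to-label assignment
    label_of = dict(zip(classes, labels))

    pairs = [(x, label_of[c])
             for c in classes
             for x in rng.sample(examples_by_class[c], per_class)]
    rng.shuffle(pairs)  # random presentation order

    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    # The label of x_t is only revealed at step t + 1; the first step gets None.
    inputs = list(zip(xs, [None] + ys[:-1]))
    return inputs, ys
```

With the defaults this yields episodes of length $T = 5 \times 10 = 50$, matching the setting described above.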
[1]: http://snowedin.net/tmp/Hochreiter2001.pdf
[2]: http://www.thespermwhale.com/jaseweston/ram/papers/paper_16.pdf