Welcome to ShortScience.org! |
[link]
This paper proposes a variant of the Neural Turing Machine (NTM) for meta-learning, or "learning to learn", in the specific context of few-shot learning (i.e. learning from few examples). Specifically, the proposed model is trained to ingest as input a training set of examples and to improve its output predictions as examples are processed, in a purely feed-forward way. This is a form of meta-learning because the model is trained so that its forward pass effectively executes a form of "learning" from the examples it is fed as input.

During training, the model is fed multiple sequences (referred to as episodes) of labeled examples $({\bf x}_1, {\rm null}), ({\bf x}_2, y_1), \dots, ({\bf x}_T, y_{T-1})$, where $T$ is the size of the episode. For instance, if the model is trained to learn how to do 5-class classification from 10 examples per class, $T$ would be $5 \times 10 = 50$. The paper mainly presents experiments on the Omniglot dataset, which has 1623 classes. In these experiments, classes are separated into 1200 "training classes" and 423 "test classes", and each episode is generated by randomly selecting 5 classes (each assigned some arbitrary vector representation, e.g. a one-hot vector, that is consistent within the episode but not across episodes) and constructing a randomly ordered sequence of 50 examples from within the chosen 5 classes. Moreover, the correct label $y_t$ of a given input ${\bf x}_t$ is provided only at the next time step, but the model is trained to predict the label of ${\bf x}_t$ at the current time step. This is akin to online learning on a stream of examples, where the label of an example is revealed only once the model has made a prediction.

The proposed NTM differs from the original NTM of Alex Graves mostly in how it writes into its memory. The authors propose to focus writing on either the least recently used memory location or the most recently used memory location. Moreover, the least recently used memory location is reset to zero before every write (an operation that seems to be ignored when backpropagating gradients). Intuitively, the proposed NTM should learn a strategy by which, given a new input, it looks into its memory for information from other examples earlier in the episode (perhaps similarly to what a nearest neighbor classifier would do) to predict the class of the new input.

The paper presents experiments on learning to do multiclass classification on the Omniglot dataset and regression on functions synthetically generated by a GP. The highlights are that:

1. The proposed model performs much better than an LSTM, and better than an NTM with the original write mechanism of Alex Graves (for classification).
2. The proposed model even performs better than a 1-nearest neighbor classifier.
3. The proposed model is even shown to outperform human performance in the 5-class scenario.
4. The proposed model has decent performance on the regression task, compared to GP predictions using the ground-truth kernel.

**My two cents**

This is probably one of my favorite ICML 2016 papers. I really think meta-learning is a problem that deserves more attention, and this paper presents both an interesting proposal for how to do it and an interesting empirical investigation of it. Much like previous work [\[1\]][1] [\[2\]][2], learning is based on automatically generating a meta-learning training set.
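To make the episode construction concrete, here is a minimal sketch of how such episodes could be generated (my own illustration of the setup described above, not the authors' code; `class_images` and all other names are hypothetical):

```python
import numpy as np

def make_episode(class_images, n_classes=5, n_per_class=10, rng=np.random):
    """Build one episode: a shuffled sequence of (input, previous label) pairs.

    class_images: dict mapping class id -> array of example images (hypothetical).
    """
    # Pick 5 classes and give them episode-specific one-hot labels, so the
    # assignment is consistent within an episode but not across episodes.
    chosen = rng.choice(list(class_images.keys()), size=n_classes, replace=False)
    onehots = np.eye(n_classes)

    # Draw 10 examples per chosen class and shuffle into one length-50 sequence.
    pairs = [(class_images[c][i], onehots[k])
             for k, c in enumerate(chosen)
             for i in rng.choice(len(class_images[c]), n_per_class, replace=False)]
    rng.shuffle(pairs)

    xs = np.stack([x for x, _ in pairs])   # inputs x_1 .. x_T
    ys = np.stack([y for _, y in pairs])   # targets y_1 .. y_T
    # Offset labels by one step: the model sees (x_t, y_{t-1}) and predicts y_t.
    y_prev = np.concatenate([np.zeros_like(ys[:1]), ys[:-1]])  # null label at t=1
    return xs, y_prev, ys
```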
This is clever, I think: a very large number of such "meta-learning" examples (the episodes) can be constructed, thus transforming what is normally a "small data problem" (few-shot learning) into a "big data problem", for which deep learning is more effective.

I'm particularly impressed by how the proposed model outperforms a 1-nearest neighbor classifier. That said, the proposed NTM actually performs 4 reads at each time step, which suggests that a fairer comparison might be with a 4-nearest neighbor classifier (sketched at the end of this summary). I do wonder how this baseline would compare. I'm also impressed with the observation that the proposed model surpassed humans.

The paper also proposes to use 5-letter words to describe classes, instead of one-hot vectors. The motivation is that this should make it easier for the model to scale to many more than 5 classes. However, I don't entirely follow the logic as to why one-hot vectors are problematic. In fact, I would think that arbitrarily assigning 5-letter words to classes would instead imply some similarity between classes that share letters, a similarity that is arbitrary and doesn't reflect true class similarity.

Also, while I find it encouraging that the regression performance of the proposed model is decent, I'm curious how it would compare with a GP approach that incrementally learns the kernel's hyper-parameters (instead of using the ground-truth values, which makes this baseline unrealistically strong).

Finally, I'm still not 100% sure how exactly the NTM is able to implement the type of feed-forward inference I'd expect to be required. I would expect it to learn a memory representation of examples that combines information from the input vector ${\bf x}_t$ *and* its label $y_t$. However, since the label of an input is presented at the following time step in an episode, it is not intuitive to me how the read/write mechanisms are able to deal with this misalignment. My only guess is that, since the controller is an LSTM, it can somehow remember ${\bf x}_t$ until it gets $y_t$ and appropriately incorporate the combined information into the memory. This could be supported by the fact that using a non-recurrent feed-forward controller is much worse than using an LSTM controller. But I'm not 100% sure of this either.

All the above being said, this is still a really great paper, which I hope will help stimulate more research on meta-learning. Hopefully code for this paper can eventually be released, which would help in popularizing the topic.

[1]: http://snowedin.net/tmp/Hochreiter2001.pdf
[2]: http://www.thespermwhale.com/jaseweston/ram/papers/paper_16.pdf |
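As an aside on the 4-nearest-neighbor baseline mentioned above, here is a minimal sketch of what I have in mind (my own code, reusing the hypothetical `xs`, `ys` arrays from the episode sketch above and mimicking the online prediction setting):

```python
import numpy as np

def knn_episode_baseline(xs, ys, k=4):
    """Run a k-NN baseline over one episode: predict the label of x_t
    from the labeled examples seen earlier in the same episode."""
    correct, total = 0, 0
    for t in range(1, len(xs)):
        # Distances from x_t to all previously seen (already labeled) examples.
        d = np.linalg.norm(xs[:t].reshape(t, -1) - xs[t].reshape(1, -1), axis=1)
        neighbors = np.argsort(d)[:k]
        # Majority vote over the one-hot labels of the k nearest neighbors.
        pred = ys[neighbors].sum(axis=0).argmax()
        correct += pred == ys[t].argmax()
        total += 1
    return correct / total
```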
[link]
Grosse et al. use statistical tests to detect adversarial examples; additionally, machine learning algorithms are adapted to detect adversarial examples on the fly while performing classification. The idea of using statistical tests to detect adversarial examples is simple: assuming that there is a true data distribution, a machine learning algorithm can only approximate this distribution, i.e. each algorithm "learns" an approximate distribution. The ideal adversary exploits this discrepancy by drawing a sample where the data distribution and the learned distribution differ, resulting in mis-classification. In practice, they show that kernel-based two-sample hypothesis testing can be used to identify a set of adversarial examples (but not individual ones). In order to also detect individual ones, each classifier is augmented to also detect whether the input is an adversarial example. This approach is similar to adversarial training, where adversarial examples are included in the training set with the correct label. However, I believe that it is possible to again craft new adversarial examples against the augmented classifier, as is also possible with adversarial training. |
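To give a flavor of the statistic involved, here is a minimal sketch of a kernel two-sample statistic, the (biased) squared maximum mean discrepancy with a Gaussian kernel; this is my own simplified illustration, not the authors' exact test:

```python
import numpy as np

def mmd2(X, Y, sigma=1.0):
    """Biased estimate of squared MMD between samples X (n, d) and Y (m, d).
    A large value suggests the two samples come from different distributions."""
    def gram(A, B):
        # Gaussian kernel matrix between all pairs of rows in A and B.
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return gram(X, X).mean() + gram(Y, Y).mean() - 2 * gram(X, Y).mean()

# Usage idea: compare a batch of suspected adversarial inputs against a batch
# of clean training inputs; calibrate a rejection threshold by permutation test.
```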
[link]
Tramèr et al. introduce both a novel adversarial attack and a defense mechanism against black-box attacks termed ensemble adversarial training. I first want to highlight that, in addition to the proposed methods, the paper gives a very good discussion of state-of-the-art attacks and defenses and how to put them into context.

Tramèr et al. consider black-box attacks, focusing on transferable adversarial examples. Their main observation is as follows: one-shot attacks (i.e. one evaluation of the model's gradient) on adversarially trained models are likely to overfit to the model's training loss. This observation has two aspects that are experimentally validated in the paper. First, the loss of the adversarially trained model increases sharply when considering adversarial examples crafted on a different model; second, the network learns to fool the attacker by, locally, misleading the gradient, meaning that perturbations computed on adversarially trained models are specialized to the local loss. These observations are also illustrated in Figure 1; I refer to the paper for a detailed discussion.

https://i.imgur.com/dIpRz9P.png

Figure 1: Illustration of the discussed observations. On the left, the loss function of an adversarially trained model around a sample $\tilde{x} = x + \epsilon_1 x' + \epsilon_2 x''$, where $x'$ is a perturbation computed on the adversarially trained model and $x''$ is a perturbation computed on a different model. On the right, a zoomed-in version where it can be seen that the loss rises sharply in the direction of $\epsilon_1$, i.e. the model gives misleading gradients.

Based on the above observations, Tramèr et al. first introduce a new one-shot attack exploiting the fact that the adversarially trained model is trained on overfitted perturbations, and second introduce a new counter-measure for training more robust networks. Their attack is quite simple: they consider one Fast Gradient Sign Method (FGSM) step, but apply a random perturbation first to leave the local vicinity of the sample:

$x' = x + \alpha \text{sign}(\mathcal{N}(0, I))$

$x'' = x' + (\epsilon - \alpha)\text{sign}(\nabla_{x'} J(x', y))$

where $J$ is the loss function and $y$ the label corresponding to sample $x$. In experiments, they show that the attack has higher success rates on adversarially trained models.

To counter the proposed attack, they propose ensemble adversarial training. The key idea is to train the model utilizing not only adversarial samples crafted on the model itself but also samples transferred from pre-trained models. On MNIST, for example, they randomly select 64 FGSM samples from 4 different models (including the one in training). Experimentally, they show that ensemble adversarial training improves the defense against all considered attacks, including FGSM, iterative FGSM as well as the proposed attack.

Also view this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |
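In code, the attack amounts to only a few lines. Below is a minimal PyTorch-style sketch of the two steps above (my own illustration; `model` and `loss_fn` are hypothetical, and inputs are assumed to lie in $[0, 1]$):

```python
import torch

def r_fgsm(model, loss_fn, x, y, eps=0.3, alpha=0.15):
    """Sketch of the randomized one-shot attack described above.
    Step 1: random sign perturbation to leave the sample's local vicinity.
    Step 2: one FGSM step spending the remaining budget eps - alpha."""
    x_r = x + alpha * torch.randn_like(x).sign()
    x_r = x_r.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_r), y)
    grad = torch.autograd.grad(loss, x_r)[0]
    x_adv = x_r + (eps - alpha) * grad.sign()
    return x_adv.clamp(0, 1).detach()  # assumes inputs normalized to [0, 1]
```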
[link]
Carlini and Wagner study the effectiveness of adversarial example detectors as a defense strategy and show that most of them can be bypassed easily by known attacks. Specifically, they consider a set of adversarial example detection schemes, including neural networks as detectors and statistical tests. After extensive experiments, the authors provide a set of lessons, which include:

- Randomization is by far the most effective defense (e.g. dropout).
- Defenses seem to be dataset-specific. There is a discrepancy between defenses working well on MNIST and on CIFAR.
- Detection neural networks can easily be bypassed.

Additionally, they provide a set of recommendations for future work:

- When developing defense mechanisms, we always need to consider strong white-box attacks (i.e. attackers that are informed about the defense mechanisms).
- Reporting accuracy only is not meaningful; instead, false positives and negatives should be reported.
- Simple datasets such as MNIST and CIFAR are not enough for evaluation.

Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |
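To illustrate the randomization lesson, one common instantiation is to keep dropout active at test time and flag inputs on which the model's predictions are unusually unstable; here is a minimal sketch (my own illustration, not the paper's exact procedure):

```python
import torch

def mc_dropout_uncertainty(model, x, n_samples=20):
    """Run the model n_samples times with dropout active at inference and
    score each input by predictive entropy; unusually high uncertainty can
    be used to flag suspicious (possibly adversarial) inputs."""
    model.train()  # keep dropout layers stochastic during inference
    with torch.no_grad():
        probs = torch.stack([model(x).softmax(-1) for _ in range(n_samples)])
    mean = probs.mean(0)
    # Entropy of the averaged predictive distribution (higher = less certain).
    return -(mean * mean.clamp_min(1e-12).log()).sum(-1)
```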
[link]
1. U-Net learns segmentation from images in an end-to-end fashion.
2. The challenges it solves are:
 * Very few annotated images (approx. 30 per application).
 * Touching objects of the same class.

# How:
* The input image is fed into the network, the data is propagated through the network along all possible paths, and at the end a segmentation map comes out.
* In the U-Net architecture, each blue box corresponds to a multi-channel feature map. The number of channels is denoted on top of the box. The x-y size is provided at the lower left edge of the box. White boxes represent copied feature maps. The arrows denote the different operations. https://i.imgur.com/Usxmv6r.png
* Each step applies two 3x3 convolutions (unpadded convolutions), each followed by a rectified linear unit (ReLU), and a 2x2 max pooling operation with stride 2 for downsampling. At each downsampling step, the number of feature channels is doubled.
* The contracting path (left side, top to bottom) increases the number of feature channels while reducing spatial resolution, and the expansive path (right side, bottom to top) consists of a sequence of up-convolutions and concatenations with the corresponding high-resolution features from the contracting path.
* The network does not have any fully connected layers and only uses the valid part of each convolution, i.e., the segmentation map only contains the pixels for which the full context is available in the input image.

## Challenges:
1. Overlap-tile strategy for seamless segmentation of arbitrarily large images:
 * To predict the pixels in the border region of the image, the missing context is extrapolated by mirroring the input image.
 * In the figure, segmentation of the yellow area uses input data of the blue area and the raw data extrapolated by mirroring. https://i.imgur.com/NUbBRUG.png
2. Augmenting training data using deformations:
 * They use excessive data augmentation by applying elastic deformations to the available training images.
 * This allows the network to learn invariance to such deformations without needing to see these transformations in the annotated image corpus.
 * Deformation is the most common variation in tissue, and realistic deformations can be simulated efficiently. https://i.imgur.com/CyC8Hmd.png
3. Segmentation of touching objects of the same class:
 * They propose the use of a weighted loss, where the separating background labels between touching cells obtain a large weight in the loss function.
 * To ensure separation of touching objects, the segmentation masks used for training insert background between touching objects, and each pixel gets a loss weight (a minimal sketch of this weighting follows after this list). https://i.imgur.com/ds7psDB.png
4. Segmentation of neural structures in electron microscopy (EM):
 * An ongoing challenge since ISBI 2012; the dataset contains structures with low contrast, fuzzy membranes and other cell components.
 * The training data is a set of 30 images (512x512 pixels) from serial section transmission electron microscopy of the Drosophila first instar larva ventral nerve cord (VNC). Each image comes with a corresponding fully annotated ground truth segmentation map for cells (white) and membranes (black).
 * An evaluation can be obtained by sending the predicted membrane probability map to the organizers. The evaluation is done by thresholding the map at 10 different levels and computing the warping error, the Rand error and the pixel error.
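Here is a minimal sketch of such a weight map, based on the paper's formula $w(x) = w_c(x) + w_0 \exp\left(-\frac{(d_1(x) + d_2(x))^2}{2\sigma^2}\right)$, where $d_1$ and $d_2$ are the distances to the nearest and second-nearest cell border; my version is simplified (it omits the class-balancing term $w_c$):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def unet_weight_map(mask, w0=10.0, sigma=5.0):
    """Per-pixel loss weights: background pixels squeezed between two nearby
    cells get large weights, forcing the network to learn the thin separating
    borders. `mask` is a (H, W) array of integer instance labels (0 = background).
    Simplified sketch; the class-balancing term of the paper is omitted."""
    labels = [l for l in np.unique(mask) if l != 0]
    if len(labels) < 2:
        return np.ones_like(mask, dtype=float)
    # Distance from every pixel to each cell instance.
    dists = np.stack([distance_transform_edt(mask != l) for l in labels])
    dists.sort(axis=0)
    d1, d2 = dists[0], dists[1]  # nearest and second-nearest cell
    w = w0 * np.exp(-((d1 + d2) ** 2) / (2 * sigma ** 2))
    return 1.0 + w * (mask == 0)  # extra weight only on background pixels
```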
### Results:
* The u-net (averaged over 7 rotated versions of the input data) achieves, without any further pre- or post-processing, a warping error of 0.0003529, a Rand error of 0.0382 and a pixel error of 0.0611. https://i.imgur.com/6BDrByI.png
* In the ISBI cell tracking challenge 2015, one of the datasets contains cells recorded by phase contrast microscopy, with strong shape variations, weak outer borders, strong irrelevant inner borders, and cytoplasm with the same structure as the background. https://i.imgur.com/vDflYEH.png
* The first data set, PHC-U373, contains Glioblastoma-astrocytoma U373 cells on a polyacrylamide substrate recorded by phase contrast microscopy. It contains 35 partially annotated training images. Here the u-net achieves an average IOU ("intersection over union") of 92%, which is significantly better than the second best algorithm with 83%. https://i.imgur.com/of4rAYP.png
* The second data set, DIC-HeLa, contains HeLa cells on flat glass recorded by differential interference contrast (DIC) microscopy. It contains 20 partially annotated training images. Here the u-net achieves an average IOU of 77.5%, which is significantly better than the second best algorithm with 46%. https://i.imgur.com/Y9wY6Lc.png |
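For reference, the IOU metric reported above is straightforward to compute on binary masks (minimal sketch):

```python
import numpy as np

def iou(pred, gt):
    """Intersection over union between a predicted and a ground-truth mask."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union else 1.0
```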