Welcome to ShortScience.org! 
[link]
This paper argues that, in semisupervised learning, it's suboptimal to use the same weight for all examples (as happens implicitly, when the unsupervised component of the loss for each example is just added together directly. Instead, it tries to learn weights for each specific data example, through a metalearningesque process. The form of semisupervised learning being discussed here is labelbased consistency loss, where a labeled image is augmented and run through the current version of the model, and the model is optimized to try to induce the same loss for the augmented image as the unaugmented one. The premise of the authors argument for learning perexample weights is that, ideally, you would enforce consistency loss less on examples where a model was unconfident in its label prediction for an unlabeled example. As a way to solve this, the authors suggest learning a vector of parameters  one for each example in the dataset  where element i in the vector is a weight for element i of the dataset, in the summedup unsupervised loss. They do this via a twostep process, where first they optimize the parameters of the network given the example weights, and then the optimize the example weights themselves. To optimize example weights, they calculate a gradient of those weights on the posttraining validation loss, which requires backpropogating through the optimization process (to determine how different weights might have produced a different gradient, which might in turn have produced better validation loss). This requires calculating the inverse Hessian (second derivative matrix of the loss), which is, generally speaking, a quite costly operation for hugeparameter nets. To lessen this cost, they pretend that only the final layer of weights in the network are being optimized, and so only calculate the Hessian with respect to those weights. They also try to minimize cost by only updating the example weights for the examples that were used during the previous update step, since, presumably those were the only ones we have enough information to upweight or downweight. With this model, the authors achieve modest improvements  performance comparable to or withinerrorbounds better than the current state of the art, FixMatch. Overall, I find this paper a little baffling. It's just a crazy amount of effort to throw into something that is a minor improvement. A few issues I have with the approach:  They don't seem to have benchmarked against the simpler baseline of some inverse of using Dropoutestimated uncertainty as the weight on examples, which would, presumably, more directly capture the property of "is my model unsure of its prediction on this unlabeled example"  If the presumed need for this is the lack of certainty of the model, that's a nonstationary problem that's going to change throughout the course of training, and so I'd worry that you're basically taking steps in the direction of a moving target  Despite using techniques rooted in metalearning, it doesn't seem like this models learns anything generalizable  it's learning indexbased weights on specific examples, which doesn't give it anything useful it can do with some new data point it finds that it wasn't specifically trained on Given that, I think I'd need to see a much stronger case for dramatic performance benefits for something like this to seem like it was worth the increase in complexity (not to mention computation, even with the optimized Hessian scheme) 
[link]
Zhao et al. propose a generative adversarial network (GAN) based approach to generate meaningful and natural adversarial examples for images and text. With natural adversarial examples, the authors refer to meaningful changes in the image content instead of adding seemingly random/adversarial noise – as illustrated in Figure 1. These natural adversarial examples can be crafted by first learning a generative model of the data, e.g., using a GAN together with an inverter (similar to an encoder), see Figure 2. Then, given an image $x$ and its latent code $z$, adversarial examples $\tilde{z} = z + \delta$ can be found within the latent code. The hope is that these adversarial examples will correspond to meaningful, naturally looking adversarial examples in the image space. https://i.imgur.com/XBhHJuY.png Figure 1: Illustration of natural adversarial examples in comparison ot regular, FGSM adversarial examples. https://i.imgur.com/HT2StGI.png Figure 2: Generative model (GAN) together with the required inverter. In practice, e.g., on MNIST, any blackbox classifier can be attacked by randomly sampling possible perturbations $\delta$ in the random space (with increasing norm) until an adversarial perturbation is found. Here, the inverted from Figure 2 is trained on top of the critic of the GAN (although specific details are missing in the paper). Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
The main contribution of this paper is introducing a new transformation that the authors call Batch Normalization (BN). The need for BN comes from the fact that during the training of deep neural networks (DNNs) the distribution of each layer’s input change. This phenomenon is called internal covariate shift (ICS). #### What is BN? Normalize each (scalar) feature independently with respect to the mean and variance of the mini batch. Scale and shift the normalized values with two new parameters (per activation) that will be learned. The BN consists of making normalization part of the model architecture. #### What do we gain? According to the author, the use of BN provides a great speed up in the training of DNNs. In particular, the gains are greater when it is combined with higher learning rates. In addition, BN works as a regularizer for the model which allows to use less dropout or less L2 normalization. Furthermore, since the distribution of the inputs is normalized, it also allows to use sigmoids as activation functions without the saturation problem. #### What follows? This seems to be specially promising for training recurrent neural networks (RNNs). The vanishing and exploding gradient problems \cite{journals/tnn/BengioSF94} have their origin in the iteration of transformation that scale up or down the activations in certain directions (eigenvectors). It seems that this regularization would be specially useful in this context since this would allow the gradient to flow more easily. When we unroll the RNNs, we usually have ultra deep networks. #### Like * Simple idea that seems to improve training. * Makes training faster. * Simple to implement. Probably. * You can be less careful with initialization. #### Dislike * Does not work with stochastic gradient descent (minibatch size = 1). * This could reduce the parallelism of the algorithm since now all the examples in a mini batch are tied. * Results on ensemble of networks for ImageNet makes it harder to evaluate the relevance of BN by itself. (Although they do mention the performance of a single model). 
[link]
In this tutorial paper, Carl E. Rasmussen gives an introduction to Gaussian Process Regression focusing on the definition, the hyperparameter learning and future research directions. A Gaussian Process is completely defined by its mean function $m(\pmb{x})$ and its covariance function (kernel) $k(\pmb{x},\pmb{x}')$. The mean function $m(\pmb{x})$ corresponds to the mean vector $\pmb{\mu}$ of a Gaussian distribution whereas the covariance function $k(\pmb{x}, \pmb{x}')$ corresponds to the covariance matrix $\pmb{\Sigma}$. Thus, a Gaussian Process $f \sim \mathcal{GP}\left(m(\pmb{x}), k(\pmb{x}, \pmb{x}')\right)$ is a generalization of a Gaussian distribution over vectors to a distribution over functions. A random function vector $\pmb{\mathrm{f}}$ can be generated by a Gaussian Process through the following procedure: 1. Compute the components $\mu_i$ of the mean vector $\pmb{\mu}$ for each input $\pmb{x}_i$ using the mean function $m(\pmb{x})$ 2. Compute the components $\Sigma_{ij}$ of the covariance matrix $\pmb{\Sigma}$ using the covariance function $k(\pmb{x}, \pmb{x}')$ 3. A function vector $\pmb{\mathrm{f}} = [f(\pmb{x}_1), \dots, f(\pmb{x}_n)]^T$ can be drawn from the Gaussian distribution $\pmb{\mathrm{f}} \sim \mathcal{N}\left(\pmb{\mu}, \pmb{\Sigma} \right)$ Applying this procedure to regression, means that the resulting function vector $\pmb{\mathrm{f}}$ shall be drawn in a way that a function vector $\pmb{\mathrm{f}}$ is rejected if it does not comply with the training data $\mathcal{D}$. This is achieved by conditioning the distribution on the training data $\mathcal{D}$ yielding the posterior Gaussian Process $f \rvert \mathcal{D} \sim \mathcal{GP}(m_D(\pmb{x}), k_D(\pmb{x},\pmb{x}'))$ for noisefree observations with the posterior mean function $m_D(\pmb{x}) = m(\pmb{x}) + \pmb{\Sigma}(\pmb{X},\pmb{x})^T \pmb{\Sigma}^{1}(\pmb{\mathrm{f}}  \pmb{\mathrm{m}})$ and the posterior covariance function $k_D(\pmb{x},\pmb{x}')=k(\pmb{x},\pmb{x}')  \pmb{\Sigma}(\pmb{X}, \pmb{x}')$ with $\pmb{\Sigma}(\pmb{X},\pmb{x})$ being a vector of covariances between every training case of $\pmb{X}$ and $\pmb{x}$. Noisy observations $y(\pmb{x}) = f(\pmb{x}) + \epsilon$ with $\epsilon \sim \mathcal{N}(0,\sigma_n^2)$ can be taken into account with a second Gaussian Process with mean $m$ and covariance function $k$ resulting in $f \sim \mathcal{GP}(m,k)$ and $y \sim \mathcal{GP}(m, k + \sigma_n^2\delta_{ii'})$. The figure illustrates the cases of noisy observations (variance at training points) and of noisefree observationshttps://i.imgur.com/BWvsB7T.png (no variance at training points). In the Machine Learning perspective, the mean and the covariance function are parametrised by hyperparameters and provide thus a way to include prior knowledge e.g. knowing that the mean function is a second order polynomial. To find the optimal hyperparameters $\pmb{\theta}$, 1. determine the log marginal likelihood $L= \mathrm{log}(p(\pmb{y} \rvert \pmb{x}, \pmb{\theta}))$, 2. take the first partial derivatives of $L$ w.r.t. the hyperparameters, and 3. apply an optimization algorithm. It should be noted that a regularization term is not necessary for the log marginal likelihood $L$ because it already contains a complexity penalty term. Also, the tradeoff between datafit and penalty is performed automatically. Gaussian Processes provide a very flexible way for finding a suitable regression model. However, they require the high computational complexity $\mathcal{O}(n^3)$ due to the inversion of the covariance matrix. In addition, the generalization of Gaussian Processes to nonGaussian likelihoods remains complicated. 
[link]
The paper introduces a sequential variational autoencoder that generates complex images iteratively. The authors also introduce a new spatial attention mechanism that allows the model to focus on small subsets of the image. This new approach for image generation produces images that can’t be distinguished from the training data. #### What is DRAW: The deep recurrent attention writer (DRAW) model has two differences with respect to other variational autoencoders. First, the encoder and the decoder are recurrent networks. Second, it includes an attention mechanism that restricts the input region observed by the encoder and the output region observed by the decoder. #### What do we gain? The resulting images are greatly improved by allowing a conditional and sequential generation. In addition, the spatial attention mechanism can be used in other contexts to solve the “Where to look?” problem. #### What follows? A possible extension to this model would be to use a convolutional architecture in the encoder or the decoder. Although this might be less useful since we are already restricting the input of the network. #### Like: * As observed in the samples generated by the model, the attention mechanism works effectively by reconstructing images in a local way. * The attention model is fully differentiable. #### Dislike: * I think a better exposition of the attention mechanism would improve this paper. 
[link]
## Idea When we recommend items to users, some of them are not chosen by the user. These rejected recommendations are usually treated as hard mistakes. Authors argue that these bad recommendations still may influence user's choice even though they were not picked. For example user didn't click on "Die Hard" but watched another Bruce Willis movie. This seems to be a not so bad recommendation after all and maybe we should not penalize it as hard as we usually do. Ultimate goal is to invent a metric that would have good correlation between offline results and real online performance. ## User study Authors held a user study, showing people a set of 5 items: watched movie, 3 rejected recommendations and an item chosen after recommendation. Rejected recommendations were generated into 4 groups:  only high content similarity  only high collaborative similarity  only high popularity similarity  all medium similarities The question was "**How good is this recommendation 15?**"  Content  Collaborative  Popularity  Other            3.8  3.52  2.93  1.99  ## Proposal If standard precision is $$p_u = \frac{c_u \cap r_u}{r_u}$$ where $c_u$ are items chosen by user and $r_u$ items recommended to user, then we can define a refined precision as $$p_u^{sim} = p_u + \frac{\sum_{i \in r_u \setminus c_u}max_{j \in \{ c_u:\ t(u,j) > t(u,i)\}}sim(i, j)}{r_u}$$ where $t(u,i)$ is the time when user $u$ interacted with item $i$. ## Evaluation Authors used Xing dataset containing user interactions with a system for seeking employment opportunities. It contains logs of what was recommended and what was clicked. ### "Online" evaluation Measure correlation between different refinement types of precision of RS presented in dataset and actual user clicks.  Content  Collaborative  Regular          0.615  0.197  0.184  ### Offline evaluation Split logs 70/30 by time and measure correlation between number of clicks per user on test and metrics on train as if we were training a model on train part.  Train clicks  Content  Collaborative  Random            0.5  0.35  0.16  0.087  ## Open question What is the best way to calculate item similarity? 
[link]
# Conditional Generative Adversarial Nets ## Introduction * Conditional version of [Generative Adversarial Nets (GAN)](https://gist.github.com/shagunsodhani/1f9dc0444142be8bd8a7404a226880eb) where both generator and discriminator are conditioned on some data **y** (class label or data from some other modality). * [Link to the paper](https://arxiv.org/abs/1411.1784) ## Architecture * Feed **y** into both the generator and discriminator as additional input layers such that **y** and input are combined in a joint hidden representation. ## Experiment ### Unimodal Setting * Conditioning MNIST images on class labels. * *z* (random noise) and **y** mapped to hidden layers with ReLu with layer sizes of 200 and 1000 respectively and are combined to obtain ReLu layer of dimensionality 1200. * Discriminator maps *x* (input) and **y** to maxout layers and the joint maxout layer is fed to sigmoid layer. * Results do not outperform the stateoftheart results but do provide a proofoftheconcept. ### Multimodal Setting * Map images (from Flickr) to labels (or user tags) to obtain the onetomany mapping. * Extract image and text features using convolutional and language model. * Generative Model * Map noise and convolutional features to a single 200 dimensional representation. * Discriminator Model * Combine the representation of word vectors (corresponding to tags) and images. ## Future Work * While the results are not so good, they do show the potential of Conditional GANs, especially in the multimodal setting. 
[link]
The paper discusses and empirically investigates by empirical testing the effect of "catastrophic forgetting" (**CF**), i.e. the inability of a model to perform a task it was previously trained to perform if retrained to perform a second task. An illuminating example is what happens in ML systems with convex objectives: regardless of the initialization (i.e. of what was learnt by doing the first task), the training of the second task will always end in the global minimum, thus totally "forgetting" the first one. Neuroscientific evidence (and common sense) suggest that the outcome of the experiment is deeply influenced by the similarity of the tasks involved. Namely, if (i) the two tasks are *functionally identical but input is presented in a different format* or if (ii) *tasks are similar* and the third case for (iii) *dissimilar tasks*. Relevant examples may be provided respectively by (i) performing the same image classification task starting from two different image representations as RGB or HSL, (ii) performing image classification tasks with semantically similar as classifying two similar animals and (iii) performing a text classification followed by image classification. The problem is investigated by an empirical study covering two methods of training ("SGD" and "dropout") combined with 4 activations functions (logistic sigmoid, RELU, LWTA, Maxout). A random search is carried out on these parameters. From a practitioner's point of view, it is interesting to note that dropout has been set to 0.5 in hidden units and 0.2 in the visible one since this is a reasonably wellknown parameter. ## Why the paper is important It is apparently the first to provide a systematic empirical analysis of CF. Establishes a framework and baselines to face the problem. ## Key conclusions, takeaways and modelling remarks * dropout helps in preventing CF * dropout seems to increase the optimal model size with respect to the model without dropout * choice of activation function has a less consistent effect than dropout\no dropout choice * dissimilar task experiment provides a notable exception of then dissimilar task experiment * the previous hypothesis that LWTA activation is particularly resistant to CF is rejected (even if it performs best in the new task in the dissimilar task pair the behaviour is inconsistent) * choice of activation function should always be crossvalidated * If computational resources are insufficient for crossvalidation the combination dropout + maxout activation function is recommended. 
[link]
#### Introduction * Opendomain Question Answering (Open QA)  efficiently querying largescale knowledge base(KB) using natural language. * Two main approaches: * Information Retrieval * Transform question (in natural language) into a valid query(in terms of KB) to get a broad set of candidate answers. * Perform finegrained detection on candidate answers. * Semantic Parsing * Interpret the correct meaning of the question and convert it into an exact query. * Limitations: * Human intervention to create lexicon, grammar, and schema. * This work builds upon the previous work where an embedding model learns low dimensional vector representation of words and symbols. * [Link](https://arxiv.org/abs/1406.3676) to the paper. #### Task Definition * Input  Training set of questions (paired with answers). * KB providing a structure among the answers. * Answers are entities in KB and questions are strings with one identified KB entity. * The paper has used FREEBASE as the KB. * Datasets * WebQuestions  Built using FREEBASE, Google Suggest API, and Mechanical Turk. * FREEBASE triplets transformed into questions. * Clue Web Extractions dataset with entities linked with FREEBASE triplets. * Dataset of paraphrased questions using WIKIANSWERS. #### Embedding Questions and Answers * Model learns lowdimensional vector embeddings of words in question entities and relation types of FREEBASE such that questions and their answers are represented close to each other in the joint embedding space. * Scoring function $S(q, a)$, where $q$ is a question and $a$ is an answer, generates high score if $a$ answers $q$. * $S(q, a) = f(q)^{T} . g(a)$ * $f(q)$ maps question to embedding space. * $f(q) = W \phi (q)$ * $W$ is a matrix of dimension $K * N$ * $K$  dimension of embedding space (hyper parameter). * $N$  total number of words/entities/relation types. * $\psi(q)$  Sparse Vector encoding the number of times a word appears in $q$. * Similarly, $g(a) = W \psi (a)$ maps answer to embedding space. * $\psi(a)$ gives answer representation, as discussed below. #### Possible Representations of Candidate Answers * Answer represented as a **single entity** from FREEBASE and TBD is a oneofN encoded vector. * Answer represented as a **path** from question to answer. The paper considers only one or two hop paths resulting in 3ofN or 4ofN encoded vectors(middle entities are not recorded). * Encode the above two representations using **subgraph representation** which represents both the path and the entire subgraph of entities connected to answer entity as a subgraph. Two embedding representations are used to differentiate between entities in path and entities in the subgraph. * SubGraph approach is based on the hypothesis that including more information about the answers would improve results. #### Training and Loss Function * Minimize margin based ranking loss to learn matrix $W$. * Stochastic Gradient Descent, multithreaded with Hogwild. #### Multitask Training of Embeddings * To account for a large number of synthetically generated questions, the paper also multitasks the training of model with paraphrased prediction. * Scoring function $S_{prp} (q1, q2) = f(q1)^{T} f(q2)$, where $f$ uses the same weight matrix $W$ as before. * High score is assigned if $q1$ and $q2$ belong to same paraphrase cluster. * Additionally, the model multitasks the task of mapping embeddings of FREEBASE entities (mids) to actual words. #### Inference * For each question, a candidate set is generated. * The answer (from candidate set) with the highest set is reported as the correct answer. * Candidate set generation strategy * $C_1$  All KB triplets containing the KB entity from the question forms a candidate set. Answers would be limited to 1hop paths. * $C_2$  Rank all relation types and keep top 10 types and add only those 2hop candidates where the selected relations appear in the path. #### Results * $C_2$ strategy outperforms $C_1$ approach supporting the hypothesis that a richer representation for answers can store more information. * Proposed approach outperforms the baseline methods but is outperformed by an ensemble of proposed approach with semantic parsing via paraphrasing model. 
[link]
## Terms * Semantic Segmentation: Traditional segmentation divides the image in visually similar patches. Semantic segmentation on the other hand divides the image in semantically meaningful patches. This usually means to classify each pixel (e.g.: This pixel belongs to a cat, that pixel belongs to a dog, the other pixel is background). ## Main ideas * Complete neural networks which were trained for image classification can be used as a convolution. Those networks can be trained on Image Net (e.g. VGG, AlexNet, GoogLeNet) * Use upsampling to (1) reduce training and prediction time (2) improve consistency of output. (See [What are deconvolutional layers?](http://datascience.stackexchange.com/a/12110/8820) for an explanation.) ## How FCNs work 1. Train a neural network for image classification which is trained on input images of a fixed size ($d \times w \times h$) 2. Interpret the network as a single convolutional filter for each output neuron (so $k$ output neurons means you have $k$ filters) over the complete image area on which the original network was trained. 3. Run the network as a CNN over an image of any size (but at least $d \times w \times h$) with a stride $s \in \mathbb{N}_{\geq 1}$ 4. If $s > 1$, then you need an upsampling layer (deconvolutional layer) to convert the coarse output into a dense output. ## Nice properties * FCNs take images of arbitrary size and produce an image of the same output size. * Computationally efficient ## See also: https://www.quora.com/Whatarethebenefitsofconvertingafullyconnectedlayerinadeepneuralnetworktoanequivalentconvolutionallayer > They allow you to treat the convolutional neural network as one giant filter. You can then spatially apply the neural net as a convolution to images larger than the original training image size, getting a spatially dense output. > > Let's say you train a neural net (with some loss function) with a convolutional layer (3 x 3, stride of 2), pooling layer (3 x 3, stride of 2), and a fully connected layer with 10 units, using 25 x 25 images. Note that the receptive field size of each max pooling unit is 7 x 7, so the pooling output is 5 x 5. You can convert the fully connected layer to to a set of 10 5 x 5 convolutional filters (unit strides). If you do that, the entire net can be treated as a filter with receptive field size 35 x 35 and stride of 4. You can then take that net and apply it to a 50 x 50 image, and you'd get a 3 x 3 x 10 spatially dense output.
1 Comments
