ShortScience.org - Making Science Accessible!

Welcome to ShortScience.org!

arxiv.org
arxiv-vanity.com
scholar.google.com

Certifying Some Distributional Robustness with Principled Adversarial Training
Aman Sinha and Hongseok Namkoong and John Duchi
arXiv e-Print archive - 2017 via Local arXiv
Keywords: stat.ML, cs.LG
more

[link] Summary by David Stutz 5 years ago

Sinha et al. introduce a variant of adversarial training based on distributional robust optimization. I strongly recommend reading the paper for understanding the introduced theoretical framework. The authors also provide guarantees on the obtained adversarial loss – and show experimentally that this guarantee is a realistic indicator. The adversarial training variant itself follows the general strategy of training on adversarially perturbed training samples in a min-max framework. In each iteration, an attacker crafts an adversarial examples which the network is trained on. In a nutshell, their approach differs from previous ones (apart from the theoretical framework) in the used attacker. Specifically, their attacker optimizes

$\arg\max_z l(\theta, z) - \gamma \|z – z^t\|_p^2$

where $z^t$ is a training sample chosen randomly during training.

On a side note, I also recommend reading the reviews of this paper: https://openreview.net/forum?id=Hk6kPgZA-

Also view this summary at [davidstutz.de](https://davidstutz.de/category/reading/).

doi.ieeecomputersociety.org
sci-hub
scholar.google.com

Exploiting local features from deep networks for image retrieval
Ng, Joe Yue-Hei and Yang, Fan and Davis, Larry S.
Conference and Computer Vision and Pattern Recognition - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Vivek Gandhi 7 years ago

In this paper, the authors raise a very important point for instance based image retrieval. For a task like an image recognition features extracted from higher layer of deep networks works really well in general, but for task like instance based image retrieval features extracted from higher layers don't prove to be that useful, so the authors suggest that we take features from lower layer and on those features, apply [VLAD encoding](https://www.robots.ox.ac.uk/~vgg/publications/2013/arandjelovic13/arandjelovic13.pdf). On top of the VLAD encoding as part of post processing, we perform steps like intra-normalisation and then apply PCA and reduce the encoding to a size of 128 Dimension. The authors have performed their experiments using [Googlenet](https://www.cs.unc.edu/~wliu/papers/GoogLeNet.pdf) and [VGG-16](https://arxiv.org/pdf/1409.1556v6.pdf), and they tried Inception 3a, Inception 4a and Inception 4e on GoogleNet and conv4_2, conv5_1 and conv5_2 on VGG-16. The above mentioned layers has almost similar performance on the dataset they have used. The performance metric used by the authors is Mean Average Precision(MAP).

dx.doi.org
sci-hub
scholar.google.com

Training Deep and Recurrent Networks with Hessian-Free Optimization
Martens, James and Sutskever, Ilya
Springer Neural Networks: Tricks of the Trade (2nd ed.) - 2012 via Local Bibsonomy
Keywords: dblp

[link] Summary by Raza Habib 7 years ago

## Very Short Summary
The authors introduce a number of modifications to traditional hessian-free optimisation that makes the method work better for neural networks. The modifications are:
* Use the Generalised Gauss Newton Matrix (GGN) rather than the Hessian.
* Damp the GGN so that $G' = G + \lambda I$ and adjust $\lambda$ using levenberg-marquardt heuristic.
* Use an efficient recursion to calculate the GGN.
* Initialise each round of conjugated gradients with the final vector of the previous iteration.
* A new simpler termination criterion for CG. Terminate CG when the relative decrease in the objective falls below some threshold.
* Back-tracking of the CG solution. ie you store intermediate solutions to CG and only update if the new CG solution actually decreases the over all problem objective.

## Less Short Summary

### Hessian Free Optimisation in General
Hessian free optimisation is used when one wishes to optimise some objective $f(\theta)$ using second order methods but inversion or even computation of the Hessian is intractable or infeasible. The method is an iterative method and at each iteration, we take a second order approximation to the objective. i.e at iterantion n, we take a second order taylor expansion of $f$ to get:

$M^n(\theta) = f(\theta^n) + \nabla_{\theta}^Tf(\theta^n)(\theta - \theta^n) + (\theta - \theta^n)^TH(\theta - \theta^n) $

Where $H$ is the hessian matrix. If we minimise this second order approximation with respect to $\theta$ we would find that that $\theta^{n+1} = H^{-1}(-\nabla_{\theta}^Tf(\theta^n))$. However, inverting $H$ is usually not possible for even moderately sized neural networks.

There does however exist an efficient algorithm for calculating hessian vector products $Hv$ for any $v$. The insight of hessian-free optimisation is that one can solve linear problems of the form $Hx = v$ using only hessian vector products via the linear conjugated gradients algorithm. You therefore avoid the need to ever actually compute either the Hessian or its inverse.

To run vanilla hessian free all you need to do at each iteration is:

1) Calculate the gradient vector using standard backprop.

2) Calculate $H\theta$ product using an efficient recursion.

3) calculate the next update $\theta^{n+1} = ConjugatedGradients(H, -\nabla_{\theta}^Tf(\theta^n))$

The main contribution of this paper is to take the above algorithm and make the changes outlined in the very short summary.

## Take aways
Hessian-Free optimisation was perhaps the best method at the time of publication. Recently it seems that first order methods using per-parameter learning rates like ADAM or even learning-to-learn can outperform Hessian-Free. This is primarily because of the increased cost per iteration of Hessian Free. However it still seems that using curvature information if its available is beneficial though expensive.

More resent second order curvature appoximations like Kroeniker Factored Approximate Curvature (KFAC) and Kroeniker Factored Recursive Approximation (KFRA) are cheaper ways to achieve the same benefit.

www.wikidata.org
sci-hub
scholar.google.com

Gaussian Processes in Machine Learning
Rasmussen, Carl Edward
Springer Advanced Lectures on Machine Learning - 2003 via Local Bibsonomy
Keywords: dblp

[link] Summary by Friedrich-Maximilian Weberling 4 years ago

In this tutorial paper, Carl E. Rasmussen gives an introduction to Gaussian Process Regression focusing on the definition, the hyperparameter learning and future research directions.

A Gaussian Process is completely defined by its mean function $m(\pmb{x})$ and its covariance function (kernel) $k(\pmb{x},\pmb{x}')$. The mean function $m(\pmb{x})$ corresponds to the mean vector $\pmb{\mu}$ of a Gaussian distribution whereas the covariance function $k(\pmb{x}, \pmb{x}')$ corresponds to the covariance matrix $\pmb{\Sigma}$. Thus, a Gaussian Process $f \sim \mathcal{GP}\left(m(\pmb{x}), k(\pmb{x}, \pmb{x}')\right)$ is a generalization of a Gaussian distribution over vectors to a distribution over functions. A random function vector $\pmb{\mathrm{f}}$ can be generated by a Gaussian Process through the following procedure:

1. Compute the components $\mu_i$ of the mean vector $\pmb{\mu}$ for each input $\pmb{x}_i$ using the mean function $m(\pmb{x})$
2. Compute the components $\Sigma_{ij}$ of the covariance matrix $\pmb{\Sigma}$ using the covariance function $k(\pmb{x}, \pmb{x}')$
3. A function vector $\pmb{\mathrm{f}} = [f(\pmb{x}_1), \dots, f(\pmb{x}_n)]^T$ can be drawn from the Gaussian distribution $\pmb{\mathrm{f}} \sim \mathcal{N}\left(\pmb{\mu}, \pmb{\Sigma} \right)$

Applying this procedure to regression, means that the resulting function vector $\pmb{\mathrm{f}}$ shall be drawn in a way that a function vector $\pmb{\mathrm{f}}$ is rejected if it does not comply with the training data $\mathcal{D}$. This is achieved by conditioning the distribution on the training data $\mathcal{D}$ yielding the posterior Gaussian Process $f \rvert \mathcal{D} \sim \mathcal{GP}(m_D(\pmb{x}), k_D(\pmb{x},\pmb{x}'))$ for noise-free observations with the posterior mean function $m_D(\pmb{x}) = m(\pmb{x}) + \pmb{\Sigma}(\pmb{X},\pmb{x})^T \pmb{\Sigma}^{-1}(\pmb{\mathrm{f}} - \pmb{\mathrm{m}})$  and the posterior covariance function $k_D(\pmb{x},\pmb{x}')=k(\pmb{x},\pmb{x}') - \pmb{\Sigma}(\pmb{X}, \pmb{x}')$ with $\pmb{\Sigma}(\pmb{X},\pmb{x})$ being a vector of covariances between every training case of $\pmb{X}$ and $\pmb{x}$.

Noisy observations $y(\pmb{x}) = f(\pmb{x}) + \epsilon$ with $\epsilon \sim \mathcal{N}(0,\sigma_n^2)$ can be taken into account with a second Gaussian Process with mean $m$ and covariance function $k$ resulting in $f \sim \mathcal{GP}(m,k)$ and $y \sim \mathcal{GP}(m, k + \sigma_n^2\delta_{ii'})$. The figure illustrates the cases of noisy observations (variance at training points) and of noise-free observationshttps://i.imgur.com/BWvsB7T.png (no variance at training points).

In the Machine Learning perspective, the mean and the covariance function are parametrised by hyperparameters and provide thus a way to include prior knowledge e.g. knowing that the mean function is a second order polynomial. To find the optimal hyperparameters $\pmb{\theta}$, 

1. determine the log marginal likelihood $L= \mathrm{log}(p(\pmb{y} \rvert \pmb{x}, \pmb{\theta}))$,
2. take the first partial derivatives of $L$ w.r.t. the hyperparameters, and
3. apply an optimization algorithm.

It should be noted that a regularization term is not necessary for the log marginal likelihood $L$ because it already contains a complexity penalty term. Also, the tradeoff between data-fit and penalty is performed automatically.

Gaussian Processes provide a very flexible way for finding a suitable regression model. However, they require the high computational complexity $\mathcal{O}(n^3)$ due to the inversion of the covariance matrix. In addition, the generalization of Gaussian Processes to non-Gaussian likelihoods remains complicated.

www.elen.ucl.ac.be
sci-hub
scholar.google.com

Using very deep autoencoders for content-based image retrieval
Krizhevsky, Alex and Hinton, Geoffrey E.
ESANN - 2011 via Local Bibsonomy
Keywords: dblp

[link] Summary by nishnik 7 years ago

Here the author presents a way to retrieve similar images in O(distance) time (where distance can be termed as 100 - correlation in percentage between two images), but it uses O(2^n) memory (n is the number of bits in which we are encoding the image). Therefore this approach is independent of the size of database (Though the value of 'n' is very important, it is kind of precision measure for the correlation, if we choose very small 'n' the difference between similar images to given image can't be scored efficiently, while if we use very large 'n' the semantic similarity won't be captured efficiently enough).
(But this is only hashing, what is peculiar about this paper? It uses semantic hashing.)

For all the images, they are fed to autoencoder as input, a code is generated as output which is used for hashing. And then for the query image, again a code is generated and k-nearest images are retrieved from the output space.

But what is autoencoder? It is an artificial neural network with layers which are not intra-connected but inter-connected. They are comprised of two parts: encoder and decoder.

In the paper n is 28 that means the autoencoder will have 28 units in the middle layer.

*Training:*
1. The encoder training is done greedily one by one (layer by layer) using the way Restricted Boltzmann Machines are trained.
2. Then the decoder is just the inverse of the encoder layer (*Unsymmetrical auto-encoders generally give poor results.)
3. And then they are fine tuned using back-propagation, using the input image itself for calculating the loss.
(*Due to huge number of weights, the back-propagation training converges very slowly).

In short the encoder transforms the high dimensional input to low dimensional output.
And then the decoder represents it back to high dimensional space.

In the paper this low dimensional output has been used as the code representation for the input image. (Note that these are rounded so that they are binary.) And these capture the semantic content of the image. The author has also written that this approach wont be useful if applied to words with respect to documents as words w.r.t. documents are much more semantically meaningful than pixels w.r.t. images.

Then they have used one more thing to tackle translation related variance. They have taken 81 patches of each of the image (regularly spaced patches), and have applied the above mentioned algorithm to compute the hashes. (*It is a kind of convolution except for the fact that we are not summing anything, just iterating over the bags to find their representation. It won't be much effective for tackling rotational variance.)

In the array of size 2^28, for the images whose code comes to be `a1,a2..a28` the value of `a1,a2..a28` is computed in decimal representation as 'i' and the image is stored at index 'i'.

Now for a query image, it is broken into 81 overlapping patches which are fed to the network and its code is computed. Then all the images at indexes whose difference is less than 3 bits are returned and given a score based on difference in bits. And then scores are summed for each of the image and then images are returned as per descending order of score.

The author has used two layer searching, where in first layer for the given input image, the output images are returned using the 28-bit codes, then the 256 bit code of input image and the previously returned output images are compared and based on the 256-bit codes a refined order is returned.

Though I recommend people to study A Practical Guide to Training Restricted Boltzmann Machines for a better understanding of the learning used in the training part above.