[link]
Karmon et al. propose a gradient-descent based method for obtaining adversarial patch like localized adversarial examples. In particular, after selecting a region of the image to be modified, several iterations of gradient descent are run in order to maximize the probability of the target class and simultaneously minimize the probability in the true class. After each iteration, the perturbation is masked to the patch and projected onto the valid range of [0,1] for images. On ImageNet, the authors show that these adversarial examples are effective against a normal, undefended network. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |
[link]
Teye et al. show that neural networks with batch normalization can be used to give uncertainty estimates through Monte Carlo sampling. In particular, instead of using the test mode of batch normalization, where the statistics (mean and variance) of each batch normalization layer are fixed, these statistics are computed per batch, as in training mode. To this end, for a specific query image, random batches from the training set are sampled, and prediction uncertainty is estimated using Monte Carlo sampling to compute mean and variance. This is summarized in Algorithm 1, depicting the proposed Monte Carlo Batch Normalization method. In the paper, this approach is further interpreted as approximate inference in Bayesian models. https://i.imgur.com/nRdOvzs.jpg Algorithm 1: Monte Carlo approach for using batch normalization for uncertainty estimation. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |
[link]
This paper approaches the problem of optimizing parameters of a discrete distribution with respect to some loss function that is an expectation over that distribution. In other words, an experiment will probably be a variational autoencoder with discrete latent variables, but there are many real applications: $$ \mathcal{L} (\eta) : = \mathbb{E}_{z \sim q_{\eta} (z)} \left[ f_{\eta} (z) \right] $$ Using the [product rule of differentiation][product] the derivative of this loss function can be computed by enumerating all $1 \to K$ possible values of $z$: $$ \nabla_\eta \mathbb{E}_{z \sim q_{\eta} (z)} \left[ f_{\eta} (z) \right] = \nabla_\eta \sum_{k=1}^{K} q_\eta (k) f_\eta (k) \\ = \sum_{k=1}^{K} f_\eta (k) \nabla_\eta q_\eta (k) + q_\eta (k) \nabla_\eta f_\eta (k) $$ This expectation can also be expressed as the score function estimator (aka the REINFORCE estimator): $$ \nabla_\eta \mathbb{E}_{z \sim q_{\eta} (z)} \left[ f_{\eta} (z) \right] = \sum_{k=1}^{K} \left(f_\eta (k) \nabla_\eta q_\eta (k) + q_\eta (k) \nabla_\eta f_\eta (k)\right)\frac{q_\eta (k)}{q_\eta (k)} \\ = \sum_{k=1}^{K} q_\eta (k) f_\eta (k) \nabla_\eta \log q_\eta (k) + q_\eta (k) \nabla_\eta f_\eta (k) \\ = \mathbb{E}_{z \sim q_{\eta} (z)} \left[ f_\eta (k) \nabla_\eta \log q_\eta (k) + \nabla_\eta f_\eta (k) \right] \\ = \sum_{k=1}^{K} f_\eta (k) \nabla_\eta q_\eta (k) + q_\eta (k) \nabla_\eta f_\eta (k) = \mathbb{E}_{z \sim q_{\eta} (z)} \left[ g(z) \right] $$ In other words, both can be referred to as estimators $g(z)$. The authors note that this can be calculated over a subset of the $k$ most probable states (overloading their $k$ from possible values of $z$). Call this set $C_k$: $$ \nabla_\eta \mathbb{E}_{z \sim q_{\eta} (z)} \left[ f_{\eta} (z) \right] = \mathbb{E}_{z \sim q_{\eta} (z)} \left[ g(z) \right] \\ = \mathbb{E}_{z \sim q_{\eta} (z)} \left[ g(z) \mathbb{1}\{ z \in C_k\} + g(z) \mathbb{1} \{ z \notin C_k \} \right] \\ = \sum_{z \in C_k} q_\eta(z) g(z) + \mathbb{E}_{z \sim q_{\eta} (z)} \left[ g(z) \mathbb{1} \{ z \notin C_k \} \right] $$ As long as $k$ is small, it's easy to calculate the first term, and if most of the probability mass is contained in that set, then it shouldn't matter how well we approximate the second term. The authors choose an importance-sampling for the second term, but this is where I get confused. They denote their importance weighting function $q_\eta (z \notin C_k)$ which *could* mean all of the probability mass *not* under the states in $C_k$? Later, they define a decision variable $b$ that expresses whether we are in this set or not, and it's sampled with probability $q_\eta (z \notin C_k)$, so I think my interpretation is correct. The gradient estimator then becomes: $$ \hat{g} (v) = \sum_{z \in C_k} q_\eta (z) g(z) + q_\eta (z \notin C_k) g(v)\\ v \sim q_\eta | v \notin C_k $$ [product]: https://en.wikipedia.org/wiki/Product_rule Showing this is Rao-Blackwellization ---------------------------------------------- Another way to express $z$ would be to sample a Bernoulli r.v. with probability $\sum_{j \notin C_k} q_\eta (j) $, then if it's $1$ sample from $z \in C_k$ and if it's $0$ sample from $z \notin C_k$. As long as those samples are drawn using $q_\eta$ then: $$ T(u,v,b) \stackrel{d}{=} z \\ T := u^{1-b} v^b $$ where $u \sim q_\eta | C_k$, $v \sim q_\eta | v \notin C_k$ and $b \sim \text{Bernoulli}(\sum_{j \notin C_k} q_\eta (j))$. Expressing $z$ in this way means the gradient estimator from before can be written as: $$ \hat{g} (v) = \mathbb{E} \left[ g( T(u,v,b) ) | v \right] $$ And they left it as an exercise for the reader to expand that out and show it's the same as equation 6: $$ \mathbb{E} \left[ g( T(u,v,b) ) | v \right] = \mathbb{E} \left[ g( T(u,v,b)) \mathbb{1} \{ b=0 \} + g( T(u,v,b)) \mathbb{1} \{ b=1 \} \right] \\ = \mathbb{E} \left[ g(z) \mathbb{1} \{ z \in C_k \} + g( z) \mathbb{1} \{ z \notin C_k \} \right] = \mathbb{E} \left[ g(z) \right] $$ Writing the estimator as a conditional expectation of some statistic of the random variables under the distribution is sufficient to show that this is an instance of Rao-Blackwellization. To be safe, the authors also apply the [conditional variance decomposition][eve] to reinforce the property that RB estimators always have lower variance: $$ Var(Y) = E\left[ Var (Y|X) \right] + Var(E \left[ Y | X \right] ) \\ Var(g (z) ) = Var (\mathbb{E} \left[ g( T(u,v,b) ) | v \right] ) + E \left[ Var ( g( T(u,v,b) ) | v ) \right] \\ Var (\mathbb{E} \left[ g( T(u,v,b) ) | v \right] ) = Var (\hat{g} (v) ) = Var(g (z) ) - E \left[ Var ( g( T(u,v,b) ) | v ) \right] $$ They go on to show that the variance is less than or equal to $Var(g(z)) \sum_{j \notin C_k} q_\eta (j)$. Finally, they note that the variance of a simple estimator can also be reduced by taking multiple samples and averaging. They then provide an equation to calculate the optimal $k$ number of elements of $z$ to evaluate depending on how concentrated the distribution being evaluated is, and a proof showing that this will have a lower variance than the naive estimator. $$ \hat{k} = \underset{k \in {0, ..., N}}{\operatorname{argmin}} \frac{\sum_{j \notin C_k} q_\eta (j)}{N-k} $$ I'm not very interested in the experiments right now, but skimming through them it's interesting to see that this method performs very well on a high dimensional hard attention task on MNIST. Particularly because a Gumbel-softmax estimator falls apart in the same experiment. It would be nice to see results on RL problems as were shown in the [RELAX][] paper. [eve]: https://en.wikipedia.org/wiki/Law_of_total_variance [relax]: https://arxiv.org/abs/1711.00123 |
[link]
This reinforcement learning paper starts with the constraints imposed an engineering problem - the need to scale up learning problems to operate across many GPUs - and ended up, as a result, needing to solve an algorithmic problem along with it. In order to massively scale up their training to be able to train multiple problem domains in a single model, the authors of this paper implemented a system whereby many “worker” nodes execute trajectories (series of actions, states, and reward) and then send those trajectories back to a “learner” node, that calculates gradients and updates a central policy model. However, because these updates are queued up to be incorporated into the central learner, it can frequently happen that the policy that was used to collect the trajectories is a few steps behind from the policy on the central learner to which its gradients will be applied (since other workers have updated the learner since this worker last got a policy download). This results in a need to modify the policy network model design accordingly. IMPALA (Importance Weighted Actor Learner Architectures) uses an “Actor Critic” model design, which means you learn both a policy function and a value function. The policy function’s job is to choose which actions to take at a given state, by making some higher probability than others. The value function’s job is to estimate the reward from a given state onward, if a certain policy p is followed. The value function is used to calculate the “advantage” of each action at a given state, by taking the reward you receive through action a (and reward you expect in the future), and subtracting out the value function for that state, which represents the average future reward you’d get if you just sampled randomly from the policy from that point onward. The policy network is then updated to prioritize actions which are higher-advantage. If you’re on-policy, you can calculate a value function without needing to explicitly calculate the probabilities of each action, because, by definition, if you take actions according to your policy probabilities, then you’re sampling each action with a weight proportional to its probability. However, if your actions are calculated off-policy, you need correct for this, typically by calculating an “importance sampling” ratio, that multiplies all actions by a probability under the desired policy divided by the probability under the policy used for sampling. This cancels out the implicit probability under the sampling policy, and leaves you with your actions scaled in proportion to their probability under the policy you’re actually updating. IMPALA shares the basic structure of this solution, but with a few additional parameters to dynamically trade off between the bias and variance of the model. The first parameter, rho, controls how much bias you allow into your model, where bias here comes from your model not being fully corrected to “pretend” that you were sampling from the policy to which gradients are being applied. The trade-off here is that if your policies are far apart, you might downweight its actions so aggressively that you don’t get a strong enough signal to learn quickly. However, the policy you learn might be statistically biased. Rho does this by weighting each value function update by: https://i.imgur.com/4jKVhCe.png where rho-bar is a hyperparameter. If rho-bar is high, then we allow stronger weighting effects, whereas if it’s low, we put a cap on those weights. The other parameter is c, and instead of weighting each value function update based on policy drift at that state, it weights each timestep based on how likely or unlikely the action taken at that timestep was under the true policy. https://i.imgur.com/8wCcAoE.png Timesteps that much likelier under the true policy are upweighted, and, once again, we use a hyperparameter, c-bar, to put a cap on the amount of allowed upweighting. Where the prior parameter controlled how much bias there was in the policy we learn, this parameter helps control the variance - the higher c-bar, the higher the amount of variance there will be in the updates used to train the model, and the longer it’ll take to converge. |