[link]
The paper provides derivations and intuitions about the learning dynamics for VAEs based on observations about [$\beta$VAEs][beta]. Using this they derive an alternative way to constrain the training of VAEs that doesn't require typical heuristics, such as warmup or adding noise to the data. How exactly would this change a typical implementation? Typically, SGD is used to [optimize the ELBO directly](https://github.com/pytorch/examples/blob/master/vae/main.py#L91L95). Using GECO, I keep a moving average of my constraint $C$ (chosen based on what I want the VAE to do, but it can be just the likelihood plus a tolerance parameter) and use that to calculate Lagrange multipliers, which control the weighting of the constraint to the loss. [This implementation](https://github.com/denproc/TamingVAEs/blob/master/train.py#L83L97) from a class project appears to be correct. With the stabilization of training, I can't help but think of this as batchnorm for VAEs. [beta]: https://openreview.net/forum?id=Sy2fzU9gl 
[link]
A very simple (but impractical) discrete model of subclonal evolution would include the following events: * Division of a cell to create two cells: * **Mutation** at a location in the genome of the new cells * Cell death at a new timestep * Cell survival at a new timestep Because measurements of mutations are usually taken at one time point, this is taken to be at the end of a time series of these events, where a tiny of subset of cells are observed and a **genotype matrix** $A$ is produced, in which mutations and cells are arbitrarily indexed such that $A_{i,j} = 1$ if mutation $j$ exists in cell $i$. What this matrix allows us to see is the proportion of cells which *both have mutation $j$*. Unfortunately, I don't get to observe $A$, in practice $A$ has been corrupted by IID binary noise to produce $A'$. This paper focuses on difference inference problems given $A'$, including *inferring $A$*, which is referred to as **`noise_elimination`**. The other problems involve inferring only properties of the matrix $A$, which are referred to as: * **`noise_inference`**: predict whether matrix $A$ would satisfy the *three gametes rule*, which asks if a given genotype matrix *does not describe a branching phylogeny* because a cell has inherited mutations from two different cells (which is usually assumed to be impossible under the infinite sites assumption). This can be computed exactly from $A$. * **Branching Inference**: it's possible that all mutations are inherited between the cells observed; in which case there are *no branching events*. The paper states that this can be computed by searching over permutations of the rows and columns of $A$. The problem is to predict from $A'$ if this is the case. In both problems inferring properties of $A$, the authors use fully connected networks with two hidden layers on simulated datasets of matrices. For **`noise_elimination`**, computing $A$ given $A'$, the authors use a network developed for neural machine translation called a [pointer network][pointer]. They also find it necessary to map $A'$ to a matrix $A''$, turning every element in $A'$ to a fixed length row containing the location, mutation status and false positive/false negative rate. Unfortunately, reported results on real datasets are reported only for branching inference and are limited by the restriction on input dimension. The inferred branching probability reportedly matches that reported in the literature. [pointer]: https://arxiv.org/abs/1409.0473 
[link]
It's a shame that the authors weren't able to continue their series of [great][reinforce] [paper][attention] [titles][beams], although it looks like they thought about calling this paper **"Put Replacement In Your Basement"**. Also, although they don't say it in the title or abstract, this paper introduces an estimator the authors call the **"unordered set estimator"** which, as a name, is not the best. However, this is one of the most exciting estimators for gradients of nondifferentiable expectations over structured discrete distributions (that sounds specific but includes a vast number of important problems, for example see [the first paragraph of this paper][mcts] and, less important but fun, adversarial examples or GANS on sequences). For a distribution $p_\theta(s)$, where $s$ can take some set of discrete states, this paper is concerned with how to compute the gradient of the expectation of a function, $f$, over $p_\theta(s)$: $$ \nabla_\theta \mathbb{E}_{s \sim p_\theta(s)} \left[ f(s) \right] $$ To jump directly to the definition of the estimator I've got to define the *leaveoneout* ratio. Given a distribution $p$ over unordered sets $S^k$, the leaveoneout ratio is $$ R(S^k, s) = \frac{p^{D \backslash \{ s \}} (S^k \backslash \{ s \} )}{p(S^k)}. $$ Where $p^{D \backslash \{ s \}}$ is the distribution over sets that don't contain $s$, which is the value of the first sampled element. For a given element $s$, it's the ratio of probability of sampling the other elements *given $s$* in $S^k$ divided by the probability of sampling $S^k$. Then the *unordered set estimator* is $$ \nabla_\theta \mathbb{E}_{s \sim p_\theta(s)} \left[ f(s) \right] = \nabla_\theta \mathbb{E}_{s \sim p_\theta(s)} \left[ \sum_{s \in S^k} p(s) R(S^k, s) f(s) \right] \\ = \mathbb{E}_{s \sim p_\theta(s)} \left[ \nabla_\theta p_\theta (s) R(S^k, s) f(s) \right], $$ and they proceed to show that it's a RaoBlackwellization (so guaranteed as low or lower variance) than a single sample estimator, importance weighted estimators and the [the RB discrete estimator][rb]. The authors also show that they can use the multiple samples to compute a builtin baseline and further reduce variance. If we define the leaveoneout ratio on a distribution with one element already left out as $R^{D \backslash \{ s \} } (S^k, s')$ (the *secondorder* leaveoneout ratio): $$ \nabla_\theta \mathbb{E}_{s \sim p_\theta(s)} \left[ f(s) \right] \\ = \mathbb{E}_{s \sim p_\theta(s)} \left[ \sum_{s \in S^k} \nabla_\theta p_\theta (s) R(S^k, s) \left( f(s)  \sum_{s' \in S^k} p_{\theta} (s') R^{D \backslash \{ s \} } (S^k, s') f(s') \right) \right] $$ Thanks to [stochastic beam search][beams] these sets can be sampled quickly and easily, so it all adds up to a powerful generic method. If I need gradients in an SGD setting of an expectation involving discrete variables, and I can take multiple samples, this is the most promising method I know about at the moment. [reinforce]: https://openreview.net/forum?id=r1lgTGL5DE [attention]: https://arxiv.org/abs/1803.08475 [beams]: https://arxiv.org/abs/1903.06059 [mcts]: https://arxiv.org/pdf/1910.06862.pdf [rb]: https://arxiv.org/abs/1810.04777 
[link]
A common problem in phylogenetics is: 1. I have $p(\text{DNA sequences}  \text{tree})$ and $p(\text{tree})$. 2. I've used these to run an MCMC algorithm and generate many (approximate) samples from $p(\text{tree}  \text{DNA sequences})$. 3. I want to evaluate $p(\text{tree}  \text{DNA sequences})$. The first solution you might think of is to add up how many times you saw each *tree topology* and divide by the total number of MCMC samples; referred to in this paper as *simple sample relative frequencies* (SRF). An obvious problem with this method is that if you didn't happen to sample a tree topology you will assign it zero probability. This paper focuses on producing a better solution to this problem, by defining a distribution over trees that's easy to fit and provides support over the entire space of tree topologies. What is a Subsplit Bayesian Network? ============================== Phylogenies are leaflabeled bifurcating trees, binary trees with labeled leaves. **Clades** are nonempty subsets of the leaf labels; labeled in the following figure as $C_1 \to C_7$: https://i.imgur.com/khS3uSo.png To build a phylogeny, I can just take the full set of leaves and recursively split them into clades as described in the above diagram. This means that the distribution over phylogenies is equivalent to the distribution over clades. To make this distribution tractable, it is typically assumed that clades are conditionally independent given parent clades; called the *Conditional Clade Distribution* (CCD) assumption, attributed here to [Larget][] and [Hohna][]. So, a phylogeny may be described by its clades, this paper proposes that these clades may be described by their **subsplits**; the splitting process placing leaf nodes into one of two clades. Then, the authors note this process is a directed Bayesian network and that this Bayesian network must include all possible clade splits. It is therefore a complete binary tree with depth $N1$ (where $N$ is the number of leaf nodes). Fitting This Parameterisation to MCMC Samples ===================================== Under this parameterisation the likelihood of observing a given tree is: $$ p(\text{tree}) = p(S_1 = s_{1,k}) \prod_{i>1} p(S_i = s_{i,k}  S_{\pi_i} = s_{\pi_i, k}), $$ assuming a collection of subsplits $T_k = \{ S_i = s_{i,k}, i \geq 1 \}$. In this definition $S_{\pi_i}$ is index set of parent nodes of $S_i$; ie a subsplit can depend on any of the parent nodes. The maximum likelihood probabilities for possible subsplits can be solved in closed form; algorithmically involving counting the occurrences of subsplits and dividing by the number of trees observed. If this seems exactly the same as SRF, that's because it is; I [checked the published code to verify this][sbn_ds1]. The authors then consider the case of unrooted trees, where the log likelihood of observed trees can't be easily factorised. They then present a [simple averaging][sa] (not sure where this variational method is discussed in that paper, appears to be under a different name?) and an EM algorithm to fit SBN models over such distributions. They also discuss conditional probability sharing (parameterising conditional on parentchild relationships) immediately before this and it's not clear if this is used in the distribution fit by SA or EM. Experiments show the distributions fit using the EM algorithm perform well according to KL divergence from a "true" distribution defined by fitting a distribution using SRF to a larger dataset. They also show less dispersion estimating the probability of sampled trees versus that estimated by this "ground truth". [sa]: https://people.eecs.berkeley.edu/~wainwrig/Papers/WaiJor08_FTML.pdf [sbn_ds1]: https://github.com/morrislab/sbn/blob/master/experiments/DS1.ipynb [larget]: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3676676/ [hohna]: https://academic.oup.com/sysbio/article/61/1/1/1676649 
[link]
This paper compares methods to calculate the marginal likelihood, $p(D  \tau)$, when you have a tree topology $\tau$ and some data $D$ and you need to marginalise over the possible branch lengths $\mathbf{\theta}$ in the process of Bayesian inference. In other words, solving the following integral: $$ \int_{ [ 0, \infty ]^{2S  3} } p(D  \mathbf{\theta}, \tau ) p( \mathbf{\theta}  \tau) d \mathbf{\theta} $$ There are some details about this problem that are common to phylogenetic problems, such as an exponential prior on the branch lengths, but otherwise this is the common problem of approximate Bayesian inference. This paper compares the following methods: * ELBO (appears to be [BBVI][]) * Gamma Laplus Importance Sampling * Variational Bayes Importance Sampling * Beta' Laplus * Gamma Laplus * Maximum unnormalized posterior probability * Maximum likelihood * Naive Monte Carlo * Bridge Sampling * Conditional Predictive Ordinates * Harmonic Mean * Stabilized Harmonic Mean * Nested Sampling * Pointwise Predictive Density * Path Sampling * Modified Path Sampling * Stepping Stone * Generalized Stepping Stone I leave the in depth description of each algorithm to the paper and appendices, although it's worth mentioning that Laplus is a Laplace approximation where the approximating distribution is constrained to be positive. Some takeaways from the empirical results: * If runtime is not a concern power posterior methods are preferred: > The power posterior methods remain the best generalpurpose tools for phylogenetic modelcomparisons, though they are certainly too slow to explore the tree space produced by PT. * Bridge sampling is the next choice, if you need something faster. * Harmonic Mean is a bad estimator for phylogenetic tree problems. * Gamma Laplus is a good fast option. * Naive Monte Carlo is a poor estimator, which is probably to be expected. * Gamma Laplus is the best option for very fast algorithms: > Empirical posterior distributions on branch lengths are clearly not pointmasses, and yet simply normalizing the unnormalized posterior at the maximum outperforms 6 of the 19 tested methods. All methods were compared on metrics important to phylogenetic inference, such as *average standard deviation of split frequencies" (ASDSF), which is typically used to confirm whether parallel MCMC chains are sampling from the same distribution over tree topologies. Methods were also compared on KL divergence to the true posterior and RMSD (appears to be the mean squared error between CDFs?). [bbvi]: https://arxiv.org/abs/1401.0118 
[link]
This paper approaches the problem of optimizing parameters of a discrete distribution with respect to some loss function that is an expectation over that distribution. In other words, an experiment will probably be a variational autoencoder with discrete latent variables, but there are many real applications: $$ \mathcal{L} (\eta) : = \mathbb{E}_{z \sim q_{\eta} (z)} \left[ f_{\eta} (z) \right] $$ Using the [product rule of differentiation][product] the derivative of this loss function can be computed by enumerating all $1 \to K$ possible values of $z$: $$ \nabla_\eta \mathbb{E}_{z \sim q_{\eta} (z)} \left[ f_{\eta} (z) \right] = \nabla_\eta \sum_{k=1}^{K} q_\eta (k) f_\eta (k) \\ = \sum_{k=1}^{K} f_\eta (k) \nabla_\eta q_\eta (k) + q_\eta (k) \nabla_\eta f_\eta (k) $$ This expectation can also be expressed as the score function estimator (aka the REINFORCE estimator): $$ \nabla_\eta \mathbb{E}_{z \sim q_{\eta} (z)} \left[ f_{\eta} (z) \right] = \sum_{k=1}^{K} \left(f_\eta (k) \nabla_\eta q_\eta (k) + q_\eta (k) \nabla_\eta f_\eta (k)\right)\frac{q_\eta (k)}{q_\eta (k)} \\ = \sum_{k=1}^{K} q_\eta (k) f_\eta (k) \nabla_\eta \log q_\eta (k) + q_\eta (k) \nabla_\eta f_\eta (k) \\ = \mathbb{E}_{z \sim q_{\eta} (z)} \left[ f_\eta (k) \nabla_\eta \log q_\eta (k) + \nabla_\eta f_\eta (k) \right] \\ = \sum_{k=1}^{K} f_\eta (k) \nabla_\eta q_\eta (k) + q_\eta (k) \nabla_\eta f_\eta (k) = \mathbb{E}_{z \sim q_{\eta} (z)} \left[ g(z) \right] $$ In other words, both can be referred to as estimators $g(z)$. The authors note that this can be calculated over a subset of the $k$ most probable states (overloading their $k$ from possible values of $z$). Call this set $C_k$: $$ \nabla_\eta \mathbb{E}_{z \sim q_{\eta} (z)} \left[ f_{\eta} (z) \right] = \mathbb{E}_{z \sim q_{\eta} (z)} \left[ g(z) \right] \\ = \mathbb{E}_{z \sim q_{\eta} (z)} \left[ g(z) \mathbb{1}\{ z \in C_k\} + g(z) \mathbb{1} \{ z \notin C_k \} \right] \\ = \sum_{z \in C_k} q_\eta(z) g(z) + \mathbb{E}_{z \sim q_{\eta} (z)} \left[ g(z) \mathbb{1} \{ z \notin C_k \} \right] $$ As long as $k$ is small, it's easy to calculate the first term, and if most of the probability mass is contained in that set, then it shouldn't matter how well we approximate the second term. The authors choose an importancesampling for the second term, but this is where I get confused. They denote their importance weighting function $q_\eta (z \notin C_k)$ which *could* mean all of the probability mass *not* under the states in $C_k$? Later, they define a decision variable $b$ that expresses whether we are in this set or not, and it's sampled with probability $q_\eta (z \notin C_k)$, so I think my interpretation is correct. The gradient estimator then becomes: $$ \hat{g} (v) = \sum_{z \in C_k} q_\eta (z) g(z) + q_\eta (z \notin C_k) g(v)\\ v \sim q_\eta  v \notin C_k $$ [product]: https://en.wikipedia.org/wiki/Product_rule Showing this is RaoBlackwellization  Another way to express $z$ would be to sample a Bernoulli r.v. with probability $\sum_{j \notin C_k} q_\eta (j) $, then if it's $1$ sample from $z \in C_k$ and if it's $0$ sample from $z \notin C_k$. As long as those samples are drawn using $q_\eta$ then: $$ T(u,v,b) \stackrel{d}{=} z \\ T := u^{1b} v^b $$ where $u \sim q_\eta  C_k$, $v \sim q_\eta  v \notin C_k$ and $b \sim \text{Bernoulli}(\sum_{j \notin C_k} q_\eta (j))$. Expressing $z$ in this way means the gradient estimator from before can be written as: $$ \hat{g} (v) = \mathbb{E} \left[ g( T(u,v,b) )  v \right] $$ And they left it as an exercise for the reader to expand that out and show it's the same as equation 6: $$ \mathbb{E} \left[ g( T(u,v,b) )  v \right] = \mathbb{E} \left[ g( T(u,v,b)) \mathbb{1} \{ b=0 \} + g( T(u,v,b)) \mathbb{1} \{ b=1 \} \right] \\ = \mathbb{E} \left[ g(z) \mathbb{1} \{ z \in C_k \} + g( z) \mathbb{1} \{ z \notin C_k \} \right] = \mathbb{E} \left[ g(z) \right] $$ Writing the estimator as a conditional expectation of some statistic of the random variables under the distribution is sufficient to show that this is an instance of RaoBlackwellization. To be safe, the authors also apply the [conditional variance decomposition][eve] to reinforce the property that RB estimators always have lower variance: $$ Var(Y) = E\left[ Var (YX) \right] + Var(E \left[ Y  X \right] ) \\ Var(g (z) ) = Var (\mathbb{E} \left[ g( T(u,v,b) )  v \right] ) + E \left[ Var ( g( T(u,v,b) )  v ) \right] \\ Var (\mathbb{E} \left[ g( T(u,v,b) )  v \right] ) = Var (\hat{g} (v) ) = Var(g (z) )  E \left[ Var ( g( T(u,v,b) )  v ) \right] $$ They go on to show that the variance is less than or equal to $Var(g(z)) \sum_{j \notin C_k} q_\eta (j)$. Finally, they note that the variance of a simple estimator can also be reduced by taking multiple samples and averaging. They then provide an equation to calculate the optimal $k$ number of elements of $z$ to evaluate depending on how concentrated the distribution being evaluated is, and a proof showing that this will have a lower variance than the naive estimator. $$ \hat{k} = \underset{k \in {0, ..., N}}{\operatorname{argmin}} \frac{\sum_{j \notin C_k} q_\eta (j)}{Nk} $$ I'm not very interested in the experiments right now, but skimming through them it's interesting to see that this method performs very well on a high dimensional hard attention task on MNIST. Particularly because a Gumbelsoftmax estimator falls apart in the same experiment. It would be nice to see results on RL problems as were shown in the [RELAX][] paper. [eve]: https://en.wikipedia.org/wiki/Law_of_total_variance [relax]: https://arxiv.org/abs/1711.00123 
[link]
This paper directly relies on understanding [GumbelSoftmax][gs] ([Concrete][]) sampling. The place to start thinking about this paper is in terms of a distribution over possible permutations. You could, for example, have a categorical distribution over all the permutation matrices of a given size. Of size 2 this is easy: ``` p(M) 0.1 0.9 M: 1, 0 0, 1 0, 1 1, 0 ``` You could apply the GumbelSoftmax trick to this, and other selections of permutation matrices in small dimensions, but the number of possible permutations grows factorially with the dimension. If you want to infer the order of $100$ items, you now have a categorical over $100!$ variables, and you can't even store that number in floating point. By breaking up and applying a normalised weight to each permutation matrix, we are effectively doing a [Naive BirkhoffVon Neumann decomposition][naive] to return a sample from the space of Doubly Stochastic matrices. This paper proposes *much* more efficient ways to sample from this space in a way that is amenable to stochastic variational (read *SGD*) methods, such that they have an experiment working with permutations of 278 variables. There are two ingredients to a good stochastic variational recipe: 1. A differentiable sampling method; for example (most famously): $x = \mu + \sigma \epsilon \text{ where } \epsilon \sim \mathcal{N}(0,1)$ 2. A density over this sampling distribution; in the preceding example: $\mathcal{N}(\mu, \sigma)$ This paper presents two recipes: Stickbreaking transformations and Rounding towards permutation matrices. Stickbreaking  [Shakir Mohamed wrote a good blog post on stickbreaking methods.][sticks] One problem, which you could guess from the categoricaloverpermutations example above, is we can't possibly store a probability for each permutation matrix ($N!$ is a lot of numbers to store). Stickbreaking gives us another way to represent these probabilities implicitly through $B \in [0,1]^{(N1)\times (N1)}$. The row `x` gets recursively broken up like this: ``` B_11 B_12*(1x[:1].sum()) 1x[:2].sum() x:  x[0] x[1] x[2] <> 1.0 ``` For two dimensions, this becomes a little more complicated (and is actually a novel part of this paper) so I'll just refer you to the paper and say: *this is also possible in two dimensions*. OK, so now to sample from the distribution of Doubly Stochastic matrices, you just need to sample $(N1) \times (N1)$ values in the range $[0,1]$. The authors sample a Gaussian and pass it through a sigmoid. Along with a temperature parameter, the values get pushed closer to 0 or 1 and the result is a permutation matrix. To get the density, the authors *appear to* (this would probably be annoying in high dimensions) automatically differentiate through the $N^2$ steps of the stickbreaking transformation to get the Jacobian and use change of variables. Rounding  This method is more idiosyncratic, so I'll just copy the steps straight from the paper: > 1. Input $Z \in R^{N \times N} $, $M \in R_+^{N \times N}$, and $V \in R_+^{N \times N}$; > 2. Map $M \to \tilde{M}$, a point in the Birkhoff polytope, using the [SinkhornKnopp algorithm][sk]; > 3. Set $\Psi = \tilde{M} + V \odot Z$ where $\odot$ denotes elementwise multiplication; > 4. Find $\text{round}(\Psi)$, the nearest permutation matrix to $\Psi$, using the [Hungarian algorithm][ha]; > 5. Output $X = \tau \Psi + (1 \tau)\text{round}(\Psi)$. So you can just schedule $\tau \to 0$ and you'll be moving your distribution to be a distribution over permutation matrices. The big problem with this is we *can't easily get the density*. Step 4 is not differentiable. However, the authors argue the function is still piecewiselinear so we can just get around this. Once they've done that, it's possible to evaluate the density by change of variables again. Results  On a synthetic permutation matching problem, the rounding method gets a better match to the true posterior (the synthetic problem is small enough to enumerate the true posterior). It also performs better than competing methods on a real matching problem; matching the activations in neurons of C. Elegans to the location of neurons in the known connectome. [sticks]: http://blog.shakirm.com/2015/12/machinelearningtrickoftheday6trickswithsticks/ [naive]: https://en.wikipedia.org/wiki/Doubly_stochastic_matrix#Birkhoff_polytope_and_Birkhoff.E2.80.93von_Neumann_theorem [gs]: http://www.shortscience.org/paper?bibtexKey=journals/corr/JangGP16 [concrete]: http://www.shortscience.org/paper?bibtexKey=journals/corr/1611.00712 [sk]: https://en.wikipedia.org/wiki/Sinkhorn%27s_theorem#SinkhornKnopp_algorithm [ha]: https://en.wikipedia.org/wiki/Hungarian_algorithm 
[link]
The propagation rule used in this paper is the following: $$ H^l = \sigma \left(\tilde{D}^{\frac{1}{2}} \tilde{A} \tilde{D}^{\frac{1}{2}} H^{l1} W^l \right) $$ Where $\tilde{A}$ is the [adjacency matrix][adj] of the undirected graph (with self connections, so has a diagonal of 1s and is symmetric) and $H^l$ are the hidden activations at layer $l$. The $D$ matrices are performing row normalisation, $\tilde{D}^{\frac{1}{2}} \tilde{A} \tilde{D}^{\frac{1}{2}}$ is [equivalent to][pygcn] (with $\tilde{A}$ as `mx`): ``` rowsum = np.array(mx.sum(1)) # sum along rows r_inv = np.power(rowsum, 1).flatten() # 1./rowsum elementwise r_mat_inv = sp.diags(r_inv) # cast to sparse diagonal matrix mx = r_mat_inv.dot(mx) # sparse matrix multiply ``` The symmetric way to express this is part of the [symmetric normalized Laplacian][laplace]. This work, and the [other][hammond] [work][defferrard] on graph convolutional networks, is motivated by convolving a parametric filter over $x$. Convolution becomes easy if we can perform a *graph Fourier transform* (don't worry I don't understand this either), but that requires us having access to eigenvectors of the normalized graph Laplacian (which is expensive). [Hammond's early paper][hammond] suggested getting around this problem by using Chebyshev polynomials for the approximation. This paper takes the approximation even further, using only *first order* Chebyshev polynomial, on the assumption that this will be fine because the modeling capacity can be picked up by the deep neural network. That's how the propagation rule above is derived, but we don't really need to remember the details. In practice $\tilde{D}^{\frac{1}{2}} \tilde{A} \tilde{D}^{\frac{1}{2}} = \hat{A}$ is calculated prior and using a graph convolutional network involves multiplying your activations by this sparse matrix at every hidden layer. If you're thinking in terms of a batch with $N$ examples and $D$ features, this multiplication happens *over the examples*, mixing datapoints together (according to the graph structure). If you want to think of this in an orthodox deep learning way, it's the following: ``` activations = F.linear(H_lm1, W_l) # (N,D) activations = activations.permute(1,0) # (D,N) activations = F.linear(activations, hat_A) # (D,N) activations = activations.permute(1,0) # (N,D) H_l = F.relu(activations) ``` **Won't this be really slow though, $\hat{A}$ is $(N,N)$!** Yes, if you implemented it that way it would be very slow. But, many deep learning frameworks support sparse matrix operations ([although maybe not the backward pass][sparse]). Using that, a graph convolutional layer can be implemented [quite easily][pygcnlayer]. **Wait a second, these are batches, not minibatches?** Yup, minibatches are future work. **What are the experimental results, though?** There are experiments showing this works well for semisupervised experiments on graphs, as advertised. Also, the many approximations to get the propagation rule at the top are justified by experiment. **This summary is bad.** Fine, smarter people have written their own posts: [the author's][kipf], [Ferenc's][ferenc]. [adj]: https://en.wikipedia.org/wiki/Adjacency_matrix [pygcn]: https://github.com/tkipf/pygcn/blob/master/pygcn/utils.py#L56L63 [laplace]: https://en.wikipedia.org/wiki/Laplacian_matrix#Symmetric_normalized_Laplacian [hammond]: https://arxiv.org/abs/0912.3848 [defferrard]: https://arxiv.org/abs/1606.09375 [sparse]: https://discuss.pytorch.org/t/doespytorchsupportautogradonsparsematrix/6156/7 [pygcnlayer]: https://github.com/tkipf/pygcn/blob/master/pygcn/layers.py#L35L68 [kipf]: https://tkipf.github.io/graphconvolutionalnetworks/ [ferenc]: http://www.inference.vc/howpowerfularegraphconvolutionsreviewofkipfwelling20162/ 
[link]
Imagine you make a neural network mapping a scalar to a scalar. After you initialise this network in the traditional way, randomly with some given variance, you could take the gradient of the input with respect to the output for all reasonable values (between about  and 3 because networks typically assume standardised inputs). As the value increases, different rectified linear units in the network will randomly switch on, drawing a random walk in the gradients; another name for which is brown noise. ![](http://i.imgur.com/KMzfzMZ.png) However, do the same thing for deep networks, and any traditional initialisation you choose, and you'll see the random walk start to look like white noise. One intuition given in the paper is that as different rectifiers in the network switch off and on the input is taking a number of different paths though the network. The number of possible paths grows exponentially with the depth of the network, so as the input varies, the gradients become increasingly chaotic. **The explanations and derivations given in the paper are much better reasoned and thorough, please read those if you are interested**. Why should we care about this? Because the authors take the recent nonlinearity [CReLU][] (output is concatenation of `relu(x)` and `relu(x)`) and develop an initialisation that will avoid problems with gradient shattering. The initialisation is just to take your standard initialised weight matrix $\mathbf{W}$ and set the right half to be the negative of the left half ($\mathbf{W}_{\text{left}}$). As long as the input to the layer is also concatenated, the left half will be multiplied by `relu(x)` and the right by `relu(x)`. Then: $$ \mathbf{W}.\text{CReLU}(\mathbf{x}) = \begin{cases} \mathbf{W}_{\text{left}}\mathbf{x} & \text{ if } x > 0 \\ \mathbf{W}_{\text{left}}\mathbf{x} & \text{ if } x \leq 0\end{cases} $$ Doing this allows them to train deep networks without skip connections, and they show results on CIFAR10 with depths of up to 200 exceeding (slightly) a similar resnet. [crelu]: https://arxiv.org/abs/1603.05201 
[link]
Imagine you make a neural network mapping a scalar to a scalar. After you initialise this network in the traditional way, randomly with some given variance, you could take the gradient of the input with respect to the output for all reasonable values (between about 3 and 3 because networks typically assume standardised inputs). As the value increases, different rectified linear units in the network will randomly switch on, drawing a random walk in the gradients; another name for which is brown noise. ![](http://i.imgur.com/KMzfzMZ.png) However, do the same thing for deep networks, and any traditional initialisation you choose, and you'll see the random walk start to look like white noise. One intuition given in the paper is that as different rectifiers in the network switch off and on the input is taking a number of different paths though the network. The number of possible paths grows exponentially with the depth of the network, so as the input varies, the gradients become increasingly chaotic. **The explanations and derivations given in the paper are much better reasoned and thorough, please read those if you are interested**. Why should we care about this? Because the authors take the recent nonlinearity [CReLU][] (output is concatenation of `relu(x)` and `relu(x)`) and develop an initialisation that will avoid problems with gradient shattering. The initialisation is just to take your standard initialised weight matrix $\mathbf{W}$ and set the right half to be the negative of the left half ($\mathbf{W}_{\text{left}}$). As long as the input to the layer is also concatenated, the left half will be multiplied by `relu(x)` and the right by `relu(x)`. Then: $$ \mathbf{W}.\text{CReLU}(\mathbf{x}) = \begin{cases} \mathbf{W}_{\text{left}}\mathbf{x} & \text{ if } x > 0 \\ \mathbf{W}_{\text{left}}\mathbf{x} & \text{ if } x \leq 0\end{cases} $$ Doing this allows them to train deep networks without skip connections, and they show results on CIFAR10 with depths of up to 200 exceeding (slightly) a similar resnet. [crelu]: https://arxiv.org/abs/1603.05201
3 Comments

[link]
This paper described an algorithm of parametrically adding noise and applying a variational regulariser similar to that in ["Variational Dropout Sparsifies Deep Neural Networks"][vardrop]. Both have the same goal: make neural networks more efficient by removing parameters (and therefore the computation applied with those parameters). Although, this paper also has the goal of giving a prescription of how many bits to store each parameter with as well. There is a very nice derivation of the hierarchical variational approximation being used here, which I won't try to replicate here. In practice, the difference to prior work is that the stochastic gradient variational method uses hierarchical samples; ie it samples from a prior, then incorporates these samples when sampling over the weights (both applied through local reparameterization tricks). It's a powerful method, which allows them to test two different priors (although they are clearly not limited to just these), and compare both against competing methods. They are comparable, and the choice of prior offers some tradeoffs in terms of sparsity versus quantization. [vardrop]: http://www.shortscience.org/paper?bibtexKey=journals/corr/1701.05369 
[link]
The authors introduce their contribution as an alternative way to approximate the KL divergence between prior and variational posterior used in [Variational Dropout and the Local Reparameterization Trick][kingma] which allows unbounded variance on the multiplicative noise. When the noise variance parameter associated with a weight tends to infinity you can say that the weight is effectively being removed, and in their implementation this is what they do. There are some important details differing from the [original algorithm][kingma] on perweight variational dropout. For both methods we have the following initialization for each dense layer: ``` theta = initialize weight matrix with shape (number of input units, number of hidden units) log_alpha = initialize zero matrix with shape (number of input units, number of hidden units) b = biases initialized to zero with length the number of hidden units ``` Where `log_alpha` is going to parameterise the variational posterior variance. In the original paper the algorithm was the following: ``` mean = dot(input, theta) + b # standard dense layer # marginal variance over activations (eq. 10 in [original paper][kingma]) variance = dot(input^2, theta^2 * exp(log_alpha)) # sample from marginal distribution by scaling Normal activations = mean + sqrt(variance)*unit_normal(number of output units) ``` The final step is a standard [reparameterization trick][shakir], but since it is a marginal distribution this is referred to as a local reparameterization trick (directly inspired by the [fast dropout paper][fast]). The sparsifying algorithm starts with an alternative parameterisation for `log_alpha` ``` log_sigma2 = matrix filled with negative constant (default 8) with size (number of input units, number of hidden units) log_alpha = log_sigma2  log(theta^2) log_alpha = log_alpha clipped between 8 and 8 ``` The authors discuss this in section 4.1, the $\sigma_{ij}^2$ term corresponds to an additive noise variance on each weight with $\sigma_{ij}^2 = \alpha_{ij}\theta_{ij}^2$. Since this can then be reversed to define `log_alpha` the forward pass remains unchanged, but the variance of the gradient is reduced. It is quite a counterintuitive trick, so much so I can't quite believe it works. They then define a mask removing contributions to units where the noise variance has gone too high: ``` clip_mask = matrix shape of log_alpha, equals 1 if log_alpha is greater than thresh (default 3) ``` The clip mask is used to set elements of `theta` to zero, and then the forward pass is exactly the same as in the original paper. The difference in the approximation to the KL divergence is illustrated in figure 1 of the paper; the sparsifying version tends to zero as the variance increases, which matches the true KL divergence. In the [original paper][kingma] the KL divergence would explode, forcing them to clip the variances at a certain point. [kingma]: https://arxiv.org/abs/1506.02557 [shakir]: http://blog.shakirm.com/2015/10/machinelearningtrickoftheday4reparameterisationtricks/ [fast]: http://proceedings.mlr.press/v28/wang13a.html 
[link]
Normally, a Deep Convolutional Network (DCN) has a conditional probabilistic model where we have some parameterised function $f_{\theta}$, and some distribution over the targets to define the loss function (categorical for classification or Gaussian for regression). But, that conditional function is treated as a black box (probabilistically speaking, it's just fit by maximum likelihood). This paper breaks the entire network up into a number of latent factors. The latent factors are designed to represent familiar parts of DCNs; for example, maxpooling selects from a set of activations and we can model the uncertainty in this selection with a categorical random variable. To recreate every pixel you can imagine selecting a set of paths backwards through the network to reproduce that pixel activation. That's not how their model is parameterised, all I can do is point you to the paper for the real DRMM model definition. # Inference The big trick of this paper is that they have designed the probabilistic model so that maxsumproduct message passing (also introduced in this paper) inference equates to the forward prop in a DCN. What does that mean? Well, since the network structure is now a hierarchical probabilistic model we can hope to throw better learning algorithms at it, which is what they do. # Learning Using this probabilistic formulation, you can define a loss function that includes a reconstruction loss generating the image from responsibilities you can estimate during the forward pass. Then, you can have reconstruction gradients _and_ gradients wrt your target loss, so you can train semisupervised or unsupervised but still work with what is practically a DCN. In addition, it's possible to derive a full EM algorithm in this case, and the Mstep corresponds to just solving a generalised Least Squares problem. Which means gradientfree and better principled training of neural networks. # Experimental Results Not all of the theory presented in this paper is covered in the experiments. They do demonstrate their method working well on MNIST and CIFAR10 semisupervised (SOTA with additional variational tweaks). But, there are not yet any experiments using the full EM algorithm they describe (only reporting that results _appear promising_); all experiments use gradient optimisation. They report that their network will train 23x faster, but only demonstrate this on MNIST (and what about variation in the chosen optimizer?). Also, the model can be sampled from, but we don't have any visualisations of the kind of images we would get. 
[link]
This paper presents a way to differentiate through discrete random variables by replacing them with continuous random variables. Say you have a discrete [categorical variable][cat] and you're sampling it with the [Gumbel trick][gumbel] like this ($G_k$ is a Gumbel distributed variable and $\boldsymbol{\alpha}/\sum_k \alpha_k$ are our categorical probabilities): $$ z = \text{one_hot} \left( \underset{k}{\text{arg max}} [ G_k + \log \alpha_k ] \right) $$ This paper replaces the one hot and argmax with a softmax, and they add a $\lambda$ variable to control the "temperature". As $\lambda$ tends to zero it will equal the above equation. $$ z = \text{softmax} \left( \frac{ G_k + \log \alpha_k }{\lambda} \right) $$ I made [some notes][nb] on how this process works, if you'd like more intuition. Comparison to [Gumbelsoftmax][gs]  These papers are proposed precisely the same distribution with notation changes ([noted there][gs]). Both papers also reference each other and the differences. Although, they mention differences in the variatonal objectives to the Gumbelsoftmax. This paper also compares to [VIMCO][], which is probably a harder benchmark to compare against (multisample versus single sample). The results in both papers compare to SOTA score function based estimators and both report high scoring results (often the best). There are some details about implementations to consider though, such as scheduling and exactly how to define the variational objective. [cat]: https://en.wikipedia.org/wiki/Categorical_distribution [gumbel]: https://hips.seas.harvard.edu/blog/2013/04/06/thegumbelmaxtrickfordiscretedistributions/ [gs]: http://www.shortscience.org/paper?bibtexKey=journals/corr/JangGP16 [nb]: https://gist.github.com/gngdb/ef1999ce3a8e0c5cc2ed35f488e19748 [vimco]: https://arxiv.org/abs/1602.06725 
[link]
In [stochastic computation graphs][scg], like [variational autoencoders][vae], using discrete variables is hard because we can't just differentiate through Monte Carlo estimates. This paper introduces a distribution that is a smoothed version of the [categorical distribution][cat] and has a parameter that, as it goes to zero, will make it equal the categorical distribution. This distribution is continuous and can be reparameterised. In other words, the Gumbel trick way to sample a categorical $z$ looks like this ($g_i$ is gumbel distributed and $\boldsymbol{\pi}/\sum_j \pi_j$ are the categorical probabilties): $$ z = \text{one_hot} \left( \underset{i}{\text{arg max}} [ g_i + \log \pi_i ] \right) $$ This paper replaces the one hot and argmax with a [softmax][], and they introduce $\tau$ to control the "discreteness": $$ z = \text{softmax} \left( \frac{ g_i + \log \pi_i}{\tau} \right) $$ I made a [notebook that illustrates this][nb] while looking at another paper that came out at the same time, which I should probably compare against here. Comparison with [Concrete Distribution][concrete]  The concrete and Gumbelsoftmax distributions are exactly the same (notation switch: $\tau \to \lambda$, $\pi_i \to \alpha_k$, $G_k \to g_i$). Both papers have structured output prediction experiments (predict one half of MNIST digits from the other half). This paper shows Gumbelsoftmax always being better, but doesn't compare to VIMCO, which is sometimes better at test time in the concrete distribution paper. Sidenote  blog post  The authors posted a [nice blog post][blog] that is also a good short summary and explanation. [blog]: http://blog.evjang.com/2016/11/tutorialcategoricalvariational.html [scg]: https://arxiv.org/abs/1506.05254 [vae]: https://arxiv.org/abs/1312.6114 [cat]: https://en.wikipedia.org/wiki/Categorical_distribution [softmax]: https://en.wikipedia.org/wiki/Softmax_function [concrete]: http://www.shortscience.org/paper?bibtexKey=journals/corr/1611.00712 [nb]: https://gist.github.com/gngdb/ef1999ce3a8e0c5cc2ed35f488e19748 
[link]
When training a [VAE][] you will have an inference network $q(zx)$. If you have another source of information you'd like to base the approximate posterior on, like some labels $s$, then you would make $q(zx,s)$. But $q$ is a complicated function, and it can ignore $s$ if it wants, and still perform well. This paper describes an adversarial way to force $q$ to use $s$. This is made more complicated in the paper, because $s$ is not necessarily a label, and in fact _is real and continuous_ (because it's easier to backprop in that case). In fact, we're going to _learn_ the representation of $s$, but force it to contain the label information using the training procedure. To be clear, with $x$ as our input (image or whatever): $$ s = f_{s}(x) $$ $$ \mu, \sigma = f_{z}(x,s) $$ We sample $z$ using $\mu$ and $\sigma$ [according to the reparameterization trick, as this is a VAE][vae]: $$ z \sim \mathcal{N}(\mu, \sigma) $$ And then we use our decoder to turn these latent variables into images: $$ \tilde{x} = \text{Dec}(z,s) $$ Training Procedure  We are going to create four parallel loss functions, and incorporate a discriminator to train this: 1. Reconstruction loss plus variational regularizer; propagate a $x_1$ through the VAE to get $s_1$, $z_1$ (latent) and $\tilde{x}_{1}$. 2. Reconstruction loss with a different $s$: 1. Propagate $x_1'$, a different sample with the __same class__ as $x_1$ 2. Pass $z_1$ and $s_1'$ to your decoder. 3. As $s_1'$ _should_ include the label information, you should have reproduced $x_1$, so apply reconstruction loss to whatever your decoder has given you (call it $\tilde{x}_1'$). 3. Adversarial Loss encouraging realistic examples from the same class, regardless of $z$. 1. Propagate $x_2$ (totally separate example) through the network to get $s_2$. 2. Generate two $\tilde{x}_{2}$ variables, one with the prior by sampling from $p(z)$ and one using $z_{1}$. 3. Get the adversary to classify these as fake versus the real sample $x_{2}$. This is pretty well described in Figure 1 in the paper. Experiments show that $s$ ends up coding for the class, and $z$ codes for other stuff, like the angle of digits or line thickness. They also try to classify using $z$ and $s$ and show that $s$ is useful but $z$ is not (can only predict as well as chance). So, it works. [vae]: https://jaan.io/unreasonableconfusion/ 
[link]
This is heavily derived from the top voted summary above, but I had to review this paper for a lab meeting so I expanded on it a bit. I hope this doesn't offend anyone, but this should be, at worst, redundant to the summary above. This paper has two parts: five recommended techniques for training GANs and a thorough evaluation with visual Turing tests and semisupervised tasks. That is more concrete than the feature extraction and visualisation in, for example, Radford's [DCGAN paper][dcgan]. ### Feature Matching Problem: instability from overtraining on the current discriminator. Intuition is that the discriminator will have learnt the kind of useful representation we see in deep image models, and there is more information available by matching those than the single classifier output. Solution: Match activations at some hidden with an L2 loss. This is the same as the "content" loss in the [neural style paper][style]: $$ \newcommand{\aB}{\mathbf{a}} \newcommand{\bB}{\mathbf{b}} \newcommand{\cB}{\mathbf{c}} \newcommand{\dB}{\mathbf{d}} \newcommand{\eB}{\mathbf{e}} \newcommand{\fB}{\mathbf{f}} \newcommand{\gB}{\mathbf{g}} \newcommand{\hB}{\mathbf{h}} \newcommand{\iB}{\mathbf{i}} \newcommand{\jB}{\mathbf{j}} \newcommand{\kB}{\mathbf{k}} \newcommand{\lB}{\mathbf{l}} \newcommand{\mB}{\mathbf{m}} \newcommand{\nB}{\mathbf{n}} \newcommand{\oB}{\mathbf{o}} \newcommand{\pB}{\mathbf{p}} \newcommand{\qB}{\mathbf{q}} \newcommand{\rB}{\mathbf{r}} \newcommand{\sB}{\mathbf{s}} \newcommand{\tB}{\mathbf{t}} \newcommand{\uB}{\mathbf{u}} \newcommand{\vB}{\mathbf{v}} \newcommand{\wB}{\mathbf{w}} \newcommand{\xB}{\mathbf{x}} \newcommand{\yB}{\mathbf{y}} \newcommand{\zB}{\mathbf{z}} \newcommand{\Exp}{\mathbb{E}}  \Exp_{\xB \sim p_{\text{data}}} \fB (\xB)  \Exp_{\zB \sim p_{\zB}(\zB)} \fB (G(\zB)) _2^2 $$ Where $\fB (\xB)$ and $\fB (\zB)$ are the activations in some hidden layer corresponding to either a real or generated image. ### Minibatch Discrimination Problem: generators like to collapse to a single mode (ie just generate a single image), because it's a decent local optimum. Solution: make sure the discriminator can look at samples in combination, so it will know if it's getting the same (or similar) images more easily. Just give the discriminator features that tell it about the distance of each image to other images in the same batch. The diagram in the paper describes this best. They mention this tensor $T$ in the paper, but don't really explain what it is. In [the code][mbcode], it appears to be basically a weight matrix, which means that it is also learnt as part of the discriminator. ### Historical Averaging Problem: no guarantee with gradient descent that a two player game like this won't go into extended orbits. Solution: encourage parameters to revert to their historical mean, with an L2 penalty: $$ \newcommand{\thetaB}{\boldsymbol{\theta}}  \thetaB  \frac{1}{t} \sum_{i=1}^t \thetaB[i] ^2 $$ Orbits are penalised by always being far from their mean, and this is supposed to correspond to a "fictitious play" algorithm. I'm not sure if that's true, but maybe? ### Onesided label smoothing Problem: vulnerability of discriminator to adversarial examples? (Not explicitly stated). Solution: replace positive (ie probability that a sample is real?) labels with a target _smaller than 1_. ### Virtual Batch Normalisation Problem: batch normalisation is highly variable, as it is based on statistics of the current minibatch (enough so that you can sometimes avoid using dropout if you're using batchnorm). Solution: for every minibatch, use the statistics gathered from a _reference minibatch_ for your batch normalisation. For every minibatch, you'll have to first propagate through the reference minibatch with your current parameter settings, but then you can use the statistics you gather by doing this for the minibatch you're actually going to use for training. _Interesting sidenote_: in the [code][impcode], they are actually using [weight normalization][weightnorm] instead of batchnorm (not in all cases). Probably because both papers have Tim Salimans as first author. ### Assessing Image Quality Problem: noisy labelling from mechanical turk. Solution: aim for low entropy conditional categorical distribution when labelling samples with Google's inception model. The inception model gives you $p(y\xB)$, so you want to maximise: $$ \Exp_{\xB} KL (p(y\xB)p(y)) $$ Then they exponentiate the resulting value for no real reason, just to make values easier to compare. Since they say this matches human judgement in their experiments, this means we can all start using this measure and just cite this paper! ### Semisupervised Learning Problem: standard semisupervised, we have some data that is labelled and some that isn't, how to learn a conditional model that will give us $p(y\xB)$. Solution: make "generated" a class in your classification problem. Now you can put generated samples into your dataset, but even better you can produce a loss on unlabeled samples that you just _don't want them to be labeled as "generated"_. So we end up with the following two losses for supervised and unsupervised data: $$ L_{\text{supervised}} =  \Exp_{\xB,y \sim p_{\text{data}} (\xB, y)} \log p_{\text{model}} (y  \xB, y < K + 1) $$ $$ L_{\text{unsupervised}} =  \{ \Exp_{\xB \sim p_{\text{data}} (\xB)} \log [ 1 p_{\text{model}} (y = K+1  \xB) ] + \Exp_{\xB \sim G}\log [ p_{\text{model}} (y=K+1  \xB)] \} $$ With this method, and using feature matching but _not minibatch discrimination_, they show SOTA results for semisupervised learning on MNIST, SVHN and CIFAR10. [mbcode]: https://github.com/openai/improvedgan/blob/master/mnist_svhn_cifar10/nn.py#L132L170 [impcode]: https://github.com/openai/improvedgan/blob/master/mnist_svhn_cifar10/nn.py#L45L91 [weightnorm]: https://arxiv.org/abs/1602.07868 [short]: http://www.shortscience.org/paper?bibtexKey=journals/corr/SalimansGZCRC16#udibr [improved]: https://arxiv.org/abs/1606.03498 [dcgan]: https://arxiv.org/abs/1511.06434 [style]: https://arxiv.org/abs/1508.06576 