[link]
In the years before this paper came out in 2017, a number of different graph convolution architectures  which use weightsharing and orderinvariant operations to create representations at nodes in a graph that are contextualized by information in the rest of the graph  had been suggested for learning representations of molecules. The authors of this paper out of Google sought to pull all of these proposed models into a single conceptual framework, for the sake of better comparing and testing the design choices that went into them. All empirical tests were done using the QM9 dataset, where 134,000 molecules have predicted chemical properties attached to them, things like the amount of energy released if bombs are sundered and the energy of electrons at different electron shells. https://i.imgur.com/Mmp8KO6.png An interesting note is that these properties weren't measured empirically, but were simulated by a very expensive quantum simulation, because the former wouldn't be feasible for this large of a dataset. However, this is still a moderately interesting test because, even if we already have the capability to computationally predict these features, a neural network would do much more quickly. And, also, one might aspirationally hope that architectures which learn good representations of molecules for quantum predictions are also useful for tasks with a less available automated prediction mechanism. The framework assumes the existence of "hidden" feature vectors h at each node (atom) in the graph, as well as features that characterize the edges between nodes (whether that characterization comes through sorting into discrete bond categories or through a continuous representation). The features associated with each atom at the lowest input level of the moleculesummarizing networks trained here include: the element ID, the atomic number, whether it accepts electrons or donates them, whether it's in an aromatic system, and which shells its electrons are in. https://i.imgur.com/J7s0q2e.png Given these building blocks, the taxonomy lays out three broad categories of function, each of which different architectures implement in slightly different ways. 1. The Message function, M(). This function is defined with reference to a node w, that the message is coming from, and a node v, that it's being sent to, and is meant to summarize the information coming from w to inform the node representation that will be calculated at v. It takes into account the feature vectors of one or both nodes at the next level down, and sometimes also incorporates feature vectors attached to the edge connecting the two nodes. In a notable example of weight sharing, you'd use the same Message function for every combination of v and w, because you need to be able to process an arbitrary number of pairs, with each v having a different number of neighbors. The simplest example you might imagine here is a simple concatenation of incoming node and edge features; a more typical example from the architectures reviewed is a concatenation followed by a neural network layer. The aggregate message being sent to the receiver node is calculated by summing together the messages from each incoming vector (though it seems like other options are possible; I'm a bit confused why the paper presented summing as the only orderinvariant option). 2. The Update function, U(). This function governs how to take the aggregated message vector sent to a particular node, and combine that with the priorlayer representation at that node, to come up with a nextlayer representation at that node. Similarly, the same Update function weights are shared across all atoms. 3. The Readout function, R(), which takes the finallayer representation of each atom node and aggregates the representations into a final graphlevel representation an orderinvariant way Rather than following in the footsteps of the paper by describing each proposed model type and how it can be described in this framework, I'll instead try to highlight some of the more interesting ways in which design choices differed across previously proposed architectures.  Does the message function being sent from w to v depend on the feature value at both w and v, or just v? To put the question more colloquially, you might imagine w wanting to contextually send different information based on different values of the feature vector at node v, and this extra degree of expressivity (not present in the earliest 2015 paper), seems like a quite valuable addition (in that all subsequent papers include it)  Are the edge features static, categorical things, or are they feature vectors that get iteratively updated in the same way that the node vectors do? For most of the architectures reviewed, the former is true, but the authors found that the highest performance in their tests came from networks with continuous edge vectors, rather than just having different weights for different category types of edge  Is the Readout function something as simple as a summation of all toplevel feature vectors, or is it more complex? Again, the authors found that they got the best performance by using a more complex approach, a Set2Set aggregator, which uses itemtoitem attention within the set of finallayer atom representations to construct an aggregated graplevel embedding The empirical tests within the paper highlight a few more interestingly relevant design choices that are less directly captured by the framework. The first is the fact that it's quite beneficial to explicitly include Hydrogen atoms as part of the graph, rather than just "attaching" them to their nearestby atoms as a count that goes on that atom's feature vector. The second is that it's valuable to start out your edge features with a continuous representation of the spatial distance between atoms, along with an embedding of the bond type. This is particularly worth considering because getting spatial distance data for a molecule requires solving the freeenergy problem to determine its spatial conformation, a costly process. We might ideally prefer a network that can work on bond information alone. The authors do find a nonspatialinformation network that can perform reasonably well  reaching full accuracy on 5 of 13 targets, compared to 11 with spatial information. However, the difference is notable, which, at least from my perspective, begs the question of whether it'd ever be possible to learn representations that can match the performance of spatiallyinformed ones without explicitly providing that information. 
[link]
Dinh et al. show that it is unclear whether flat minima necessarily generalize better than sharp ones. In particular, they study several notions of flatness, both based on the local curvature and based on the notion of “low change in error”. The authors show that the parameterization of the network has a significant impact on the flatness; this means that functions leading to the same prediction function (i.e., being indistinguishable based on their test performance) might have largely varying flatness around the obtained minima, as illustrated in Figure 1. In conclusion, while networks that generalize well usually correspond to flat minima, it is not necessarily true that flat minima generalize better than sharp ones. https://i.imgur.com/gHfolEV.jpg Figure 1: Illustration of the influence of parameterization on the flatness of the obtained minima. Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
Kim et al. propose Concept Activation Vectors (CAV) that represent the direction of features corresponding to specific humaninterpretable concepts. In particular, given a network for a classification task, a concept is defined as a set of images with that concept. A linear classifier is then trained to distinguish images with concept from random images without the concept based on a chosen feature layer. The normal of the obtained linear classification boundary corresponds to the learned Concept Activation Vector (CAV). By considering the directional derivative along this direction for a given input allows to quantify how well the input aligns with the chosen concept. This way, images can be ranked and the model’ sensitivity to particular concepts can be quantified. The idea is also illustrated in Figure 1. https://i.imgur.com/KOqPeag.png Figure 1: Process of constructing Concept Activation Vectors (CAVs). Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
**Goal**: identifying training points most responsible for a given prediction. Given training points $z_1, \dots, z_n$, let loss function be $\frac{1}{n}\sum_{i=1}^nL(z_i, \theta)$ A function called influence function let us compute the parameter change if $z$ were upweighted by some small $\epsilon$. $$\hat{\theta}_{\epsilon, z} := \arg \min_{\theta \in \Theta} \frac{1}{n}\sum_{i=1}^n L(z_i, \theta) + \epsilon L(z, \theta)$$ $$\mathcal{I}_{\text{up, params}}(z) := \frac{d\hat{\theta}_{\epsilon, z}}{d\epsilon} = H_{\hat{\theta}}^{1} \nabla_\theta L(z, \hat{\theta})$$ $\mathcal{I}_{\text{up, params}}(z)$ shows how uplifting one point $z$ affect the estimate of the parameters $\theta$. Furthermore, we could determine how uplifting $z$ affect the loss estimate of a test point through chain rule. $$\mathcal{I}_{\text{up, loss}}(z, z_{\text{test}}) = \nabla_\theta L(z_{\text{test}}, \hat{\theta})^\top \mathcal{I}_{\text{up, params}}(z)$$ Apart from lifting one training point, change of the parameters with the change of a training point could also be estimated. $$\frac{d\hat{\theta}_{\epsilon, z_\delta, z}}{d\epsilon} = \mathcal{I}_{\text{up, params}}(z_\delta)  \mathcal{I}_{\text{up, params}}(z)$$ This measures how purturbation $\delta$ to training point $z$ affect the parameter estimation $\theta$. Section 3 describes some practicals about efficient implementing. This set of tool could be used for some interpretable machine learning tasks. 
[link]
## Task A neural network for classification typically has a **softmax** layer and outputs the class with the max probability. However, this probability does not represent the **confidence**. If the average confidence (average of max probs) for a dataset matches the accuracy, it is called **wellcalibrated**. Old models like LeNet (1998) was wellcalibrated, but modern networks like ResNet (2016) are no longer wellcalibrated. This paper explains what caused this and compares various calibration methods. ## Figure  Confidence Histogram https://i.imgur.com/dMtdWsL.png The bottom row: group the samples by confidence (max probailities) into bins, and calculates the accuracy (# correct / # bin size) within each bin.  ECE (Expected Calibration Error): average of accuracyconfidence of bins  MCE (Maximum Calibration Error): max of accuracyconfidence of bins ## Analysis  What The paper experiments how models are miscalibrated with different factors: (1) model capacity, (2) batch norm, (3) weight decay, (4) NLL. ## Solution  Calibration Methods Many calibration methods for binary classification and multiclass classification are evaluated. The method that performed the best is **temperature scailing**, which simply multiplies logits before the softmax by some constant. The paper used the validation set to choose the best constant. 
[link]
Cisse et al. propose parseval networks, deep neural networks regularized to learn orthonormal weight matrices. Similar to the work by Hein et al. [1], the mean idea is to constrain the Lipschitz constant of the network – which essentially means constraining the Lipschitz constants of each layer independently. For weight matrices, this can be achieved by constraining the matrixnorm. However, this (depending on the norm used) is often intractable during gradient descent training. Therefore, Cisse et al. propose to use a perlayer regularizer of the form: $R(W) = \W^TW – I\$ where $I$ is the identity matrix. During training, this regularizer is supposed to ensure that the learned weigth matrices are orthonormal – an efficient alternative to regular matrix manifold optimization techniques (see the paper). [1] Matthias Hein, Maksym Andriushchenko: Formal Guarantees on the Robustness of a Classifier against Adversarial Manipulation. CoRR abs/1705.08475 (2017) Also see this summary at [davidstutz.de](https://davidstutz.de/category/reading/). 
[link]
Modification of the 2level hierarchical softmax for better efficiency. An equation of computational complexity is used to find the optimal number of words in each class. In addition, the most common words are considered on the same level as other classes. https://i.imgur.com/dbKS3gh.png 
[link]
_Objective:_ Reduce learning time for [DQN](https://deepmind.com/research/dqn/)type architectures. They introduce a new network element, called DND (Differentiable Neural Dictionary) which is basically a dictionary that uses any key (especially embeddings) and computes the value by using kernel between keys. Plus it's differentiable. ## Architecture: They use basically a network in two steps: 1. A classical CNN network that computes and embedding for every image. 2. A DND for all possible actions (controller input) that stores the embedding as key and estimated reward as value. Also they use a buffer to store all tuples (previous image, action, reward, next image) and for training basic technique is used. [![screen shot 20170412 at 11 23 32 am](https://cloud.githubusercontent.com/assets/17261080/24951103/929300221f7311e797d2628e2f4b5a33.png)](https://cloud.githubusercontent.com/assets/17261080/24951103/929300221f7311e797d2628e2f4b5a33.png) ## Results: Clearly improves learning speed but in the end other techniques catchup and it gets outperformed. 
[link]
The authors propose an algorithm for metalearning that is compatible with any model trained with gradient descent, and show that it works on various domain including supervised learning and reinforcement learning. This is done by explicitly train the network such that a small number of gradient steps with a small amount of training data from a new task will produce good generalization performance on that task. ### Key Points  MAML is actually finding a good **initialization** of model parameters for several tasks.  Good initialization of parameters means that it can achieve good performance on several tasks with small number of gradient steps. ### Method  Simultaneously optimize the **initialization** of model parameters of different metatraining tasks, hoping that it can quickly adapt to new metatesting tasks. ![](https://cloud.githubusercontent.com/assets/7057863/25161911/46f2721e24f111e79fba8bc2f0782204.png)  Training procedure: ![](https://cloud.githubusercontent.com/assets/7057863/25161749/8d00902a24f011e793a86a9b74386f55.png) ### Exp  It acheived performance that is comparable to the stateoftheart on classification/regression/reinforcement learning tasks. ### Thought I think the experiments are thorough since they proved that this technique can be applied to both supervised and reinforcement learning. However, the method is not novel provided that [Optimization a A Midel For Fewshot Learning](https://openreview.net/pdf?id=rJY0Kcll) already proposed to learn initialization of parameters. 
[link]
### Summary The motivation of this paper is to make neural network interpretable so that they can be adopted in fields where interpretability is essential (ie: medical field). Thus, this paper present _DeepLIFT_, a method to interpret neural networks by **decomposing** the output prediction given a specific input by backpropagating the _contribution_ of all the neurons to every input features. The _contribution_ of a neuron is determined by comparing the activation of this neuron to a _reference activation_. This _reference activation_ is determined arbitrarily by a domain expert. Moreover, the authors argue that in some case, giving separate consideration to positive and negative contributions can reveal dependencies that are missed by other approaches. The authors show that their approaches can capture some dependencies that a gradientbased method cannot. ### Computing the contribution of a neuron Given the following notation: * $t$: Target output neuron * $t^0$: Reference activation of $t$ * $x_1, x_2, ..., x_n$: Set of neurons * $\Delta t$: The differencefromreference of a target * $\Delta x$: The differencefromreference of an input * $C_{\Delta x_i,\Delta t}$: Contributions scores of a neuron $$\Delta t = t  t^0$$ $$\Delta t = \sum_{i=1}^n C_{\Delta x_i \Delta t}$$ The advantage of the _difference from reference_ against purely gradient method is that the _diference from reference_ avoid all discontinuities as seen in the following figure https://i.imgur.com/vLZytJT.png ### "Backpropagating" the contribution to the input To compute the contribution to the input, the authors use a concept similar to the chain rule. Given a _multiplier_ $m_\Delta x _\Delta t$ computed as following: $$m_{\Delta x \Delta t} = \frac{C_{\Delta x \Delta t}} {\Delta x}$$ Given $z$ the output of a neuron, $y_j$ one neuron in the hidden layer before $z$ and $x_i$ one neuron at the input, before $y_j$. We can compute $m_{\Delta x_i \Delta z}$ as following: $$m_{\Delta x_i \Delta z}=\sum_j m_{\Delta x_i \Delta y_j} m_{\Delta y_j \Delta z}$$ ### Computing the contribution score The authors argues that it can be beneficial in some case to separate the positive and negative contributions. ie: $$\Delta _{x_i} = \Delta _{x_i}^+ + \Delta _{x_i}^$$ $$C_{\Delta _{x_i} \Delta _t} = C_{\Delta _{x_i}^+ \Delta _t} + C_{\Delta _{x_i}^ \Delta _t}$$ The authors propose three similars techniques to compute the contribution score 1. A linear rule where one does not take into consideration the nonlinearity function such that $C_{\Delta _{x_i} \Delta _t} = w_i \Delta _{x_i}$ 2. The _rescale rule_ applied to nonlinear function (ie: $y=f(x)$). If $\Delta _y = 0$ or is very close (less than $10^{7}$), then the authors use the gradient instead of the multiplier. 3. The _Reveal Cancel rule_ is similar than the _rescale rule_, but threat the positive and negative example differently. This allows to capture dependencies (ie: min/AND) that cannot be captured by _rescale rule_ or other method. The difference from reference can be computed as follow: $$\Delta y^+ = \frac{1}{2}(f(x^0 + \Delta x^+)  f(x^0)) + \frac{1}{2}(f(x^0 + \Delta x^+ + \Delta x^)  f(x^0+ \Delta x^)$$ $$\Delta y^ = \frac{1}{2}(f(x^0 + \Delta x^)  f(x^0)) + \frac{1}{2}(f(x^0 + \Delta x^+ + \Delta x^)  f(x^0+ \Delta x^+)$$
1 Comments

[link]
Imagine you make a neural network mapping a scalar to a scalar. After you initialise this network in the traditional way, randomly with some given variance, you could take the gradient of the input with respect to the output for all reasonable values (between about  and 3 because networks typically assume standardised inputs). As the value increases, different rectified linear units in the network will randomly switch on, drawing a random walk in the gradients; another name for which is brown noise. ![](http://i.imgur.com/KMzfzMZ.png) However, do the same thing for deep networks, and any traditional initialisation you choose, and you'll see the random walk start to look like white noise. One intuition given in the paper is that as different rectifiers in the network switch off and on the input is taking a number of different paths though the network. The number of possible paths grows exponentially with the depth of the network, so as the input varies, the gradients become increasingly chaotic. **The explanations and derivations given in the paper are much better reasoned and thorough, please read those if you are interested**. Why should we care about this? Because the authors take the recent nonlinearity [CReLU][] (output is concatenation of `relu(x)` and `relu(x)`) and develop an initialisation that will avoid problems with gradient shattering. The initialisation is just to take your standard initialised weight matrix $\mathbf{W}$ and set the right half to be the negative of the left half ($\mathbf{W}_{\text{left}}$). As long as the input to the layer is also concatenated, the left half will be multiplied by `relu(x)` and the right by `relu(x)`. Then: $$ \mathbf{W}.\text{CReLU}(\mathbf{x}) = \begin{cases} \mathbf{W}_{\text{left}}\mathbf{x} & \text{ if } x > 0 \\ \mathbf{W}_{\text{left}}\mathbf{x} & \text{ if } x \leq 0\end{cases} $$ Doing this allows them to train deep networks without skip connections, and they show results on CIFAR10 with depths of up to 200 exceeding (slightly) a similar resnet. [crelu]: https://arxiv.org/abs/1603.05201 
[link]
Here is a video overview: https://www.youtube.com/watch?v=tfow6GJepQ Here is an image of the poster: https://i.imgur.com/Ti9btj9.png
1 Comments

[link]
Short overview from ICML: https://youtube.com/watch?v=GMG5bFciuIA Long overview from ICML: https://youtu.be/o6dtDuldsEo 
[link]
The authors introduce their contribution as an alternative way to approximate the KL divergence between prior and variational posterior used in [Variational Dropout and the Local Reparameterization Trick][kingma] which allows unbounded variance on the multiplicative noise. When the noise variance parameter associated with a weight tends to infinity you can say that the weight is effectively being removed, and in their implementation this is what they do. There are some important details differing from the [original algorithm][kingma] on perweight variational dropout. For both methods we have the following initialization for each dense layer: ``` theta = initialize weight matrix with shape (number of input units, number of hidden units) log_alpha = initialize zero matrix with shape (number of input units, number of hidden units) b = biases initialized to zero with length the number of hidden units ``` Where `log_alpha` is going to parameterise the variational posterior variance. In the original paper the algorithm was the following: ``` mean = dot(input, theta) + b # standard dense layer # marginal variance over activations (eq. 10 in [original paper][kingma]) variance = dot(input^2, theta^2 * exp(log_alpha)) # sample from marginal distribution by scaling Normal activations = mean + sqrt(variance)*unit_normal(number of output units) ``` The final step is a standard [reparameterization trick][shakir], but since it is a marginal distribution this is referred to as a local reparameterization trick (directly inspired by the [fast dropout paper][fast]). The sparsifying algorithm starts with an alternative parameterisation for `log_alpha` ``` log_sigma2 = matrix filled with negative constant (default 8) with size (number of input units, number of hidden units) log_alpha = log_sigma2  log(theta^2) log_alpha = log_alpha clipped between 8 and 8 ``` The authors discuss this in section 4.1, the $\sigma_{ij}^2$ term corresponds to an additive noise variance on each weight with $\sigma_{ij}^2 = \alpha_{ij}\theta_{ij}^2$. Since this can then be reversed to define `log_alpha` the forward pass remains unchanged, but the variance of the gradient is reduced. It is quite a counterintuitive trick, so much so I can't quite believe it works. They then define a mask removing contributions to units where the noise variance has gone too high: ``` clip_mask = matrix shape of log_alpha, equals 1 if log_alpha is greater than thresh (default 3) ``` The clip mask is used to set elements of `theta` to zero, and then the forward pass is exactly the same as in the original paper. The difference in the approximation to the KL divergence is illustrated in figure 1 of the paper; the sparsifying version tends to zero as the variance increases, which matches the true KL divergence. In the [original paper][kingma] the KL divergence would explode, forcing them to clip the variances at a certain point. [kingma]: https://arxiv.org/abs/1506.02557 [shakir]: http://blog.shakirm.com/2015/10/machinelearningtrickoftheday4reparameterisationtricks/ [fast]: http://proceedings.mlr.press/v28/wang13a.html 