Welcome to ShortScience.org! 
[link]
They obtain multilingual alignments from dictionaries, train a BiLSTM POS tagger in the source language, automatically tag many tokens in the target language, and manually annotate 1000 tokens in the target language. They then train a system with a combined loss over the distant (automatic) tagging and the gold tagging. An additional output layer is learned for the gold annotations.
[link]
We want to find two matrices $W$ and $H$ such that $V = WH$. Often a goal is to determine underlying patterns in the relationships between the concepts represented by each row and column. $V$ is some $m$ by $n$ matrix and we want the inner dimension of the factorization to be $r$. So

$$\underbrace{V}_{m \times n} = \underbrace{W}_{m \times r} \underbrace{H}_{r \times n}$$

Let's consider an example matrix where three customers (the rows) are associated with three movies (the columns) by a rating value.

$$ V = \left[\begin{array}{c c c} 5 & 4 & 1 \\\\ 4 & 5 & 1 \\\\ 2 & 1 & 5 \end{array}\right] $$

We can decompose this into two matrices with $r = 1$. First let's do this without any nonnegativity constraint, using an SVD and truncating the matrices so that only the largest singular value is kept:

$$ W = \left[\begin{array}{c} 0.656 \\\\ 0.652 \\\\ 0.379 \end{array}\right], H = \left[\begin{array}{c c c} 6.48 & 6.26 & 3.20 \end{array}\right] $$

We can also decompose this into two matrices with $r = 1$ subject to the constraint that $w_{ij} \ge 0$ and $h_{ij} \ge 0$. (Note: this is only possible when $v_{ij} \ge 0$):

$$ W = \left[\begin{array}{c} 0.388 \\\\ 0.386 \\\\ 0.224 \end{array}\right], H = \left[\begin{array}{c c c} 11.22 & 10.57 & 5.41 \end{array}\right] $$

Both of these $r=1$ factorizations reconstruct matrix $V$ with the same error.

$$ V \approx WH = \left[\begin{array}{c c c} 4.36 & 4.11 & 2.10 \\\\ 4.33 & 4.08 & 2.09 \\\\ 2.52 & 2.37 & 1.21 \end{array}\right] $$

If they both yield the same reconstruction error then why is a nonnegativity constraint useful? We can see above that it is easy to observe patterns in both factorizations, such as similar customers and similar movies. `TODO: motivate why NMF is better`

#### Paper Contribution

This paper discusses two approaches for iteratively creating a nonnegative $W$ and $H$ based on random initial matrices.
The paper discusses a multiplicative update rule where the elements of $W$ and $H$ are iteratively transformed by scaling each value such that the error is not increased. The multiplicative approach is discussed in contrast to an additive, gradient descent based approach where small corrections are iteratively applied. The multiplicative rule can be recovered from the additive one by setting the learning rate ($\eta$) of each element to the ratio of that element of $H$ to the corresponding element of $W^\top W H$.

### Still a draft
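As a rough illustration of the multiplicative approach, here is a minimal NumPy sketch of the widely used update for the squared reconstruction error, applied to the example matrix $V$. It assumes this is the rule the summary refers to; the function name `nmf_multiplicative`, the iteration count, the seed and the small constant `eps` (which only guards against division by zero) are choices made for this example.

```python
import numpy as np

def nmf_multiplicative(V, r, n_iter=500, eps=1e-12, seed=0):
    """Factor a nonnegative V (m x n) into nonnegative W (m x r) and H (r x n)."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    # Random positive starting point, as in the summary.
    W = rng.random((m, r)) + eps
    H = rng.random((r, n)) + eps
    errors = []
    for _ in range(n_iter):
        # Each entry is rescaled by a nonnegative factor, so signs never flip.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(V - W @ H))  # Frobenius error
    return W, H, errors

V = np.array([[5.0, 4.0, 1.0],
              [4.0, 5.0, 1.0],
              [2.0, 1.0, 5.0]])
W, H, errors = nmf_multiplicative(V, r=1)
print(np.round(W @ H, 2))
```

The scaling of $W$ against $H$ is arbitrary (only the product $WH$ is determined), so the individual factors will generally differ from the values quoted above even when the reconstruction agrees.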
[link]
Deeper networks should never have a higher **training** error than smaller ones. In the worst case, the additional layers should "simply" learn identities. It seems this is not so easy with conventional networks, as they get much worse with more layers. So the idea is to add identity functions which skip some layers. The network only has to learn the **residuals**.

Advantages:

* Learning the identity becomes learning 0, which is simpler
* Loss of information flow in the forward pass is not a problem anymore
* No vanishing / exploding gradient
* Identities don't have parameters to be learned

## Evaluation

The learning rate starts at 0.1 and is divided by 10 when the error plateaus. Weight decay of 0.0001 ($10^{-4}$), momentum of 0.9. They use mini-batches of size 128.

* ImageNet ILSVRC 2015: 3.57% (ensemble)
* CIFAR-10: 6.43%
* MS COCO: 59.0% mAP@0.5 (ensemble)
* PASCAL VOC 2007: 85.6% mAP@0.5
* PASCAL VOC 2012: 83.8% mAP@0.5

## See also

* [DenseNets](http://www.shortscience.org/paper?bibtexKey=journals/corr/1608.06993)
[link]
This paper from 2016 introduced kallisto, a new k-mer based method to estimate isoform abundance from RNA-Seq data. The method provided a significant improvement in speed and memory usage compared to the previously used methods while yielding similar accuracy. In fact, kallisto is able to quantify expression in a matter of minutes instead of hours.

The standard (previous) methods for quantifying expression rely on mapping, i.e. on the alignment of the sequenced transcriptome reads to a reference genome. Reads are assigned to a position in the genome and the gene or isoform expression values are derived by counting the number of reads overlapping the features of interest.

The idea behind kallisto is to rely on a pseudoalignment which does not attempt to identify the positions of the reads in the transcripts, only the potential transcripts of origin. Thus, it avoids aligning each read to a reference genome. In fact, kallisto only uses the transcriptome sequences (not the whole genome) in its first step, which is the generation of the kallisto index. Kallisto builds a colored de Bruijn graph, the transcriptome de Bruijn graph (TDBG), from all the k-mers found in the transcriptome. Each node of the graph corresponds to a k-mer (a short sequence of $k$ nucleotides) and retains the information about the transcripts in which it can be found in the form of a color. Linear stretches having the same coloring in the graph correspond to transcripts. Once the TDBG is built, kallisto stores a hash table mapping each k-mer to its transcript(s) of origin along with the position within the transcript(s). This step is done only once and depends on a provided annotation file (containing the sequences of all the transcripts in the transcriptome).

Then, for a given sequenced sample, kallisto decomposes each read into its k-mers and uses those k-mers to find a path covering in the TDBG. This path covering of the transcriptome graph, where a path corresponds to a transcript, generates k-compatibility classes for each k-mer, i.e. sets of potential transcripts of origin on the nodes. The potential transcripts of origin for a read can be obtained using the intersection of the k-compatibility classes of its k-mers. To make the pseudoalignment faster, kallisto skips redundant k-mers, since neighboring k-mers often belong to the same transcripts. Figure 1, from the paper, summarizes these different steps.

https://i.imgur.com/eNH2kuO.png

**Figure 1**. Overview of kallisto. The input consists of a reference transcriptome and reads from an RNA-seq experiment. (a) An example of a read (in black) and three overlapping transcripts with exonic regions as shown. (b) An index is constructed by creating the transcriptome de Bruijn Graph (TDBG) where nodes (v1, v2, v3, ... ) are k-mers, each transcript corresponds to a colored path as shown and the path cover of the transcriptome induces a k-compatibility class for each k-mer. (c) Conceptually, the k-mers of a read are hashed (black nodes) to find the k-compatibility class of a read. (d) Skipping (black dashed lines) uses the information stored in the TDBG to skip k-mers that are redundant because they have the same k-compatibility class. (e) The k-compatibility class of the read is determined by taking the intersection of the k-compatibility classes of its constituent k-mers. [From Bray et al. Near-optimal probabilistic RNA-seq quantification, Nature Biotechnology, 2016.]

Then, kallisto optimizes the following RNA-Seq likelihood function using the expectation-maximization (EM) algorithm.

$$L(\alpha) \propto \prod_{f \in F} \sum_{t \in T} y_{f,t} \frac{\alpha_t}{l_t} = \prod_{e \in E}\left( \sum_{t \in e} \frac{\alpha_t}{l_t} \right )^{c_e}$$

In this function, $F$ is the set of fragments (or reads), $T$ is the set of transcripts, $l_t$ is the (effective) length of transcript $t$ and **y**$_{f,t}$ is a compatibility matrix defined as 1 if fragment $f$ is compatible with $t$ and 0 otherwise. The parameters $\alpha_t$ are the probabilities of selecting reads from a transcript $t$. These $\alpha_t$ are the parameters of interest since they represent the isoform abundances or relative expressions.

To make things faster, the compatibility matrix is collapsed (factorized) into equivalence classes. An equivalence class consists of all the reads compatible with the same subset of transcripts; on the right-hand side of the likelihood, $E$ is the set of equivalence classes and $c_e$ is the number of reads in class $e$. The EM algorithm is applied to equivalence classes (not to reads). Each $\alpha_t$ is optimized to maximise the likelihood of transcript abundances given observations of the equivalence classes.

The speed of the method makes it possible to evaluate the uncertainty of the abundance estimates for each RNA-Seq sample using a bootstrap technique. For a given sample containing $N$ reads, a bootstrap sample is generated by sampling $N$ counts from a multinomial distribution over the equivalence classes derived from the original sample. The EM algorithm is applied to those sampled equivalence class counts to estimate transcript abundances. The bootstrap information is then used in downstream analyses such as determining which genes are differentially expressed.

Practically, we can illustrate the different steps involved in kallisto using a small example. Starting from a tiny genome with 3 transcripts, assume that the RNA-Seq experiment produced 4 reads as depicted in the image below.

https://i.imgur.com/5JDpQO8.png

The first step is to build the TDBG and the kallisto index. All transcript sequences are decomposed into k-mers (here $k=5$) to construct the colored de Bruijn graph. Not all nodes are represented in the following drawing. The idea is that each different transcript will lead to a different path in the graph. The strand is not taken into account: kallisto is strand-agnostic.

https://i.imgur.com/4oW72z0.png

Once the index is built, the four reads of the sequenced sample can be analysed. They are decomposed into k-mers ($k=5$ here too) and the pre-built index is used to determine the k-compatibility class of each k-mer. Then, the k-compatibility class of each read is computed. For example, for read 1, the intersection of all the k-compatibility classes of its k-mers suggests that it might come from transcript 1 or transcript 2.

https://i.imgur.com/woektCH.png

This is done for the four reads, enabling the construction of the compatibility matrix **y**$_{f,t}$ which is part of the RNA-Seq likelihood function. In this equation, the $\alpha_t$ are the parameters that we want to estimate.

https://i.imgur.com/Hp5QJvH.png

Because the EM algorithm is too slow to be applied to millions of reads, the compatibility matrix **y**$_{f,t}$ is factorized into equivalence classes and a count is computed for each class (how many reads are represented by this equivalence class). The EM algorithm uses this collapsed information to maximize the new, equivalent RNA-Seq likelihood function and optimize the $\alpha_t$.

https://i.imgur.com/qzsEq8A.png

The EM algorithm stops when, for every transcript $t$ with $\alpha_t N > 0.01$, the value $\alpha_t N$ changes by less than 1%, where $N$ is the total number of reads.
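The steps above can be sketched in a few lines of plain Python. This is a toy illustration under simplifying assumptions, not kallisto's implementation: the index is a plain dictionary rather than a de Bruijn graph, no k-mers are skipped, strands are ignored, raw transcript lengths stand in for effective lengths, and the two transcripts and four reads are invented for the example. All function names are hypothetical.

```python
from collections import Counter, defaultdict

def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_index(transcripts, k):
    """Map each k-mer to the set of transcripts containing it (its k-compatibility class)."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return index

def pseudoalign(read, index, k):
    """Intersect the k-compatibility classes of all k-mers of the read."""
    compatible = None
    for kmer in kmers(read, k):
        hits = index.get(kmer, set())
        compatible = set(hits) if compatible is None else compatible & hits
        if not compatible:
            break  # no transcript is compatible with every k-mer
    return frozenset(compatible or ())

def em(class_counts, eff_len, max_iter=10000):
    """EM over equivalence classes (all non-empty); returns the alpha_t estimates."""
    names = sorted(eff_len)
    n_reads = sum(class_counts.values())
    alpha = {t: 1.0 / len(names) for t in names}
    for _ in range(max_iter):
        new = dict.fromkeys(names, 0.0)
        for eq_class, count in class_counts.items():
            # E-step: split the class count in proportion to alpha_t / l_t.
            weights = {t: alpha[t] / eff_len[t] for t in eq_class}
            total = sum(weights.values())
            for t, w in weights.items():
                new[t] += count * w / total
        # M-step: renormalise the expected counts.
        new = {t: v / n_reads for t, v in new.items()}
        # Stopping rule from the summary: transcripts with alpha_t * N > 0.01
        # must change by less than 1%.
        done = all(abs(new[t] - alpha[t]) < 0.01 * new[t]
                   for t in names if new[t] * n_reads > 0.01)
        alpha = new
        if done:
            break
    return alpha

transcripts = {"t1": "AAAAACCCCC", "t2": "AAAAAGGGGG"}  # invented sequences
k = 5
index = build_index(transcripts, k)
reads = ["AAAACC", "AAAAAC", "AAAAA", "AAGGGG"]
classes = Counter(pseudoalign(read, index, k) for read in reads)
classes.pop(frozenset(), None)  # drop reads compatible with no transcript
alpha = em(classes, {name: len(seq) for name, seq in transcripts.items()})
print(dict(classes), alpha)
```

Collapsing reads into `classes` before running `em` is what keeps the loop cost proportional to the number of equivalence classes rather than the number of reads.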
[link]
**Goal**: identifying the training points most responsible for a given prediction.

Given training points $z_1, \dots, z_n$, let the training loss be $\frac{1}{n}\sum_{i=1}^nL(z_i, \theta)$, with minimizer $\hat{\theta}$. An influence function lets us compute the parameter change if $z$ were upweighted by some small $\epsilon$.

$$\hat{\theta}_{\epsilon, z} := \arg \min_{\theta \in \Theta} \frac{1}{n}\sum_{i=1}^n L(z_i, \theta) + \epsilon L(z, \theta)$$

$$\mathcal{I}_{\text{up, params}}(z) := \left.\frac{d\hat{\theta}_{\epsilon, z}}{d\epsilon}\right|_{\epsilon=0} = -H_{\hat{\theta}}^{-1} \nabla_\theta L(z, \hat{\theta})$$

Here $H_{\hat{\theta}} = \frac{1}{n}\sum_{i=1}^n \nabla^2_\theta L(z_i, \hat{\theta})$ is the Hessian of the training loss. $\mathcal{I}_{\text{up, params}}(z)$ shows how upweighting one point $z$ affects the estimate of the parameters $\theta$.

Furthermore, we can determine how upweighting $z$ affects the loss at a test point through the chain rule.

$$\mathcal{I}_{\text{up, loss}}(z, z_{\text{test}}) = \nabla_\theta L(z_{\text{test}}, \hat{\theta})^\top \mathcal{I}_{\text{up, params}}(z)$$

Apart from upweighting one training point, the change of the parameters when a training point $z$ is replaced by a perturbed version $z_\delta$ can also be estimated.

$$\left.\frac{d\hat{\theta}_{\epsilon, z_\delta, -z}}{d\epsilon}\right|_{\epsilon=0} = \mathcal{I}_{\text{up, params}}(z_\delta) - \mathcal{I}_{\text{up, params}}(z)$$

This measures how a perturbation $\delta$ to the training point $z$ affects the parameter estimate $\hat{\theta}$.

Section 3 of the paper describes practical techniques for computing these quantities efficiently. This set of tools could be used for some interpretable machine learning tasks.
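To make the definitions concrete, here is a small NumPy sketch for least-squares regression, where the loss is $L(z, \theta) = \frac{1}{2}(x^\top\theta - y)^2$ and both the Hessian and the upweighted minimizer are available in closed form. The model, the synthetic data and the function names are assumptions made for this illustration; the paper targets much larger models, where the Hessian inverse has to be approximated.

```python
import numpy as np

def fit(X, y, z=None, eps=0.0):
    """Minimise (1/n) sum_i L(z_i, theta) + eps * L(z, theta) for squared loss."""
    n = len(y)
    A = X.T @ X / n
    b = X.T @ y / n
    if z is not None:
        x_z, y_z = z
        A = A + eps * np.outer(x_z, x_z)
        b = b + eps * y_z * x_z
    return np.linalg.solve(A, b)

def influence_up_params(X, y, z):
    """-H^{-1} grad L(z, theta_hat): parameter change per unit of upweighting z."""
    x_z, y_z = z
    theta_hat = fit(X, y)
    hessian = X.T @ X / len(y)
    grad_z = (x_z @ theta_hat - y_z) * x_z
    return -np.linalg.solve(hessian, grad_z)

def influence_up_loss(X, y, z, z_test):
    """Chain rule: change in test loss per unit of upweighting z."""
    x_t, y_t = z_test
    theta_hat = fit(X, y)
    grad_test = (x_t @ theta_hat - y_t) * x_t
    return grad_test @ influence_up_params(X, y, z)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
z = (X[0], y[0])

# Compare the influence function with actually refitting at a tiny eps.
eps = 1e-6
finite_difference = (fit(X, y, z, eps) - fit(X, y)) / eps
print(influence_up_params(X, y, z), finite_difference)
```

The finite-difference refit is only affordable here because the model is tiny; the point of the influence function is to avoid retraining.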
[link]
Basically they observe a pattern they call The Filter Lottery (TFL), where the random seed causes a high variance in the training accuracy:

![](http://i.imgur.com/5rWig0H.png)

They use the convolutional gradient norm ($CGN$) \cite{conf/fgr/LoC015} to determine how much impact a filter has on the overall classification loss function by taking the derivative of the loss function with respect to each weight in the filter.

$$CGN(k) = \sum_{i} \left|\frac{\partial L}{\partial w^k_i}\right|$$

They use the CGN to evaluate the impact of a filter on the error, and reinitialize a filter when the gradient norm of its weights falls below a specific threshold.
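A minimal NumPy sketch of this check, assuming the gradients of the loss with respect to the filter weights are already available as an array with one leading entry per filter. The function names, the threshold value and the Gaussian used for reinitialization are assumptions for illustration; the summary does not say which distribution the authors draw from.

```python
import numpy as np

def conv_gradient_norm(filter_grads):
    """CGN(k): sum of absolute gradients over all weights of filter k."""
    return np.abs(filter_grads).reshape(len(filter_grads), -1).sum(axis=1)

def reinit_weak_filters(weights, filter_grads, threshold, rng, scale=0.01):
    """Redraw every filter whose CGN is below the threshold (hypothetical init)."""
    weak = conv_gradient_norm(filter_grads) < threshold
    new_weights = weights.copy()
    new_weights[weak] = rng.normal(0.0, scale, size=weights[weak].shape)
    return new_weights, weak

rng = np.random.default_rng(0)
weights = np.ones((3, 2, 2))                # three 2x2 filters
grads = np.stack([np.zeros((2, 2)),         # filter 0: no gradient signal
                  np.ones((2, 2)),
                  np.full((2, 2), 0.5)])
new_weights, weak = reinit_weak_filters(weights, grads, threshold=0.1, rng=rng)
print(conv_gradient_norm(grads), weak)
```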
[link]
* They suggest a new stochastic optimization method, similar to the existing SGD, Adagrad or RMSProp.
  * Stochastic optimization methods have to find parameters that minimize/maximize a stochastic function.
  * A function is stochastic (non-deterministic) if the same set of parameters can generate different results. E.g. the loss of different mini-batches can differ, even when the parameters remain unchanged. Even for the same mini-batch the results can change due to e.g. dropout.
* Their method tends to converge faster to optimal parameters than the existing competitors.
* Their method can deal with non-stationary distributions (similar to e.g. SGD, Adadelta, RMSProp).
* Their method can deal with very sparse or noisy gradients (similar to e.g. Adagrad).

### How

* Basic principle
  * Standard SGD just updates the parameters based on `parameters = parameters - learningRate * gradient`.
  * Adam operates similarly, but adds more "cleverness" to the rule.
  * It assumes that the gradient values have means and variances and tries to estimate these values.
    * Recall here that the function to optimize is stochastic, so there is some randomness in the gradients.
    * The mean is also called "the first moment".
    * The (uncentered) variance is also called "the second (raw) moment".
  * Then an update rule very similar to SGD would be `parameters = parameters - learningRate * means`.
  * They instead use the update rule `parameters = parameters - learningRate * means/sqrt(variances)`.
    * They call `means/sqrt(variances)` a 'Signal to Noise Ratio'.
    * Basically, if the variance of a specific parameter's gradient is high, it is pretty unclear how it should be changed. So we choose a small step size in the update rule via `learningRate * mean/sqrt(highValue)`.
    * If the variance is low, it is easier to predict how far to "move", so we choose a larger step size via `learningRate * mean/sqrt(lowValue)`.
* Exponential moving averages
  * In order to approximate the mean and variance values you could simply save the last `T` gradients and then average the values.
    * That however is a pretty bad idea, because it can lead to high memory demands (e.g. for millions of parameters in CNNs).
    * A simple average also has the disadvantage that it would completely ignore all gradients before `T` and weight all of the last `T` gradients identically. In reality, you might want to give more weight to the last couple of gradients.
  * Instead, they use an exponential moving average, which fixes both problems and simply updates the average at every timestep via the formula `avg = alpha * avg + (1 - alpha) * newValue`.
  * Let the gradient at timestep (batch) `t` be `g`, then we can approximate the mean and variance values using:
    * `mean = beta1 * mean + (1 - beta1) * g`
    * `variance = beta2 * variance + (1 - beta2) * g^2`.
  * `beta1` and `beta2` are hyperparameters of the algorithm. Good values for them seem to be `beta1=0.9` and `beta2=0.999`.
  * At the start of the algorithm, `mean` and `variance` are initialized to zero-vectors.
* Bias correction
  * Initializing the `mean` and `variance` vectors to zero is an easy and logical step, but has the disadvantage that bias is introduced.
    * E.g. at the first timestep, the mean of the gradient would be `mean = beta1 * 0 + (1 - beta1) * g`, with `beta1=0.9` then: `mean = 0.1 * g`. So `0.1g`, not `g`. Both the mean and the variance are biased (towards 0).
    * This seems pretty harmless, but it can be shown that it lowers the convergence speed of the algorithm by quite a bit.
  * So to fix this they perform bias-corrections of the mean and the variance:
    * `correctedMean = mean / (1 - beta1^t)` (where `t` is the timestep).
    * `correctedVariance = variance / (1 - beta2^t)`.
  * Both formulas are applied at every timestep after the exponential moving averages (they do not influence the next timestep).
![Algorithm](https://raw.githubusercontent.com/aleju/papers/master/neuralnets/images/Adam__algorithm.png?raw=true "Algorithm") 
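Putting the bullets together, a minimal NumPy sketch of the update loop could look as follows. The variable names follow the notes above; the small `eps` in the denominator (to avoid dividing by zero), the default learning rate and the toy quadratic objective are assumptions for this example.

```python
import numpy as np

def adam(grad_fn, theta, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Run `steps` Adam updates on theta, given a function returning the gradient."""
    theta = np.asarray(theta, dtype=float).copy()
    mean = np.zeros_like(theta)      # first moment estimate, starts at zero
    variance = np.zeros_like(theta)  # second raw moment estimate, starts at zero
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        # Exponential moving averages of the gradient and the squared gradient.
        mean = beta1 * mean + (1 - beta1) * g
        variance = beta2 * variance + (1 - beta2) * g ** 2
        # Bias correction; does not feed back into the moving averages.
        corrected_mean = mean / (1 - beta1 ** t)
        corrected_variance = variance / (1 - beta2 ** t)
        theta -= lr * corrected_mean / (np.sqrt(corrected_variance) + eps)
    return theta

# Toy objective: f(theta) = sum((theta - 3)^2), with gradient 2 * (theta - 3).
grad = lambda theta: 2.0 * (theta - 3.0)
print(adam(grad, [0.0, 10.0], lr=0.1, steps=2000))
```

On the very first step the corrected mean equals `g` and the corrected variance equals `g^2`, so every parameter moves by almost exactly `lr`, regardless of the gradient's scale.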
[link]
The main contribution of [Understanding the difficulty of training deep feedforward neural networks](http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf) by Glorot and Bengio is a **normalized weight initialization**

$$W \sim U \left [ -\frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}}, \frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}} \right ]$$

where $n_j \in \mathbb{N}^+$ is the number of neurons in the layer $j$.

Showing some ways **how to debug neural networks** might be another reason to read the paper.

The paper analyzed standard multilayer perceptrons (MLPs) on an artificial dataset of $32 \text{px} \times 32 \text{px}$ images with either one or two of the 3 shapes: triangle, parallelogram and ellipse. The MLPs varied in the activation function which was used (either sigmoid, tanh or softsign).

However, no regularization was used and training ran for many mini-batch epochs. It might be that batch normalization / dropout changes the influence of initialization very much.

Questions that remain open for me:

* [How is weight initialization done today?](https://www.reddit.com/r/MLQuestions/comments/4jsge9)
* Figure 4: Why is this plot not simply completely dependent on the data?
* Is softsign still used? Why not?
* If the only advantage of softsign is that it has the plateau later, why doesn't anybody use $\frac{1}{1+e^{-0.1 \cdot x}}$ or something similar instead of the standard sigmoid activation function?
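For reference, the normalized initialization formula at the top of this summary can be sketched in NumPy as follows. The function name and the `(n_in, n_out)` weight layout are assumptions for this example.

```python
import numpy as np

def normalized_init(n_in, n_out, rng):
    """Draw an (n_in, n_out) weight matrix from U[-limit, limit]."""
    limit = np.sqrt(6.0) / np.sqrt(n_in + n_out)
    return rng.uniform(-limit, limit, size=(n_in, n_out))

rng = np.random.default_rng(0)
W = normalized_init(300, 200, rng)
print(W.min(), W.max(), W.var())
```

A uniform distribution on $[-a, a]$ has variance $a^2/3$, so the weights drawn this way have variance $\frac{2}{n_j + n_{j+1}}$.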

[link]
The authors have a dataset of 780 electronic health records and they use it to detect various medical events such as adverse drug events, drug dosage, etc. The task is done by assigning a label to each word in the document. https://i.imgur.com/bZ7yM0z.png Annotation statistics for the corpus of health records. They look at CRFs, LSTMs and GRUs. Both LSTMs and GRUs outperform the CRF, but the best performance is achieved by a GRU trained on whole documents. 
[link]
Proposes a two-stage approach for continual learning: an active learning phase and a consolidation phase. The active learning phase optimizes for a specific task, which is then consolidated into the knowledge base network via Elastic Weight Consolidation (Kirkpatrick et al., 2016). The active learning phase uses a separate network from the knowledge base, but it is not always trained from scratch: the authors suggest a heuristic based on task similarity. Improves EWC by deriving a new online method so that parameters don’t increase linearly with the number of tasks.

Desiderata for a continual learning solution:

- A continual learning method should not suffer from catastrophic forgetting. That is, it should be able to perform reasonably well on previously learned tasks.
- It should be able to learn new tasks while taking advantage of knowledge extracted from previous tasks, thus exhibiting positive forward transfer to achieve faster learning and/or better final performance.
- It should be scalable, that is, the method should be trainable on a large number of tasks.
- It should enable positive backward transfer as well, which means gaining improved performance on previous tasks after learning a new task which is similar or relevant.
- Finally, it should be able to learn without requiring task labels, and ideally, it should even be applicable in the absence of clear task boundaries.

Experiments:

- Sequential learning of handwritten characters of 50 alphabets taken from the Omniglot dataset.
- Sequential learning of 6 games in the Atari suite (Bellemare et al., 2012) (“Space Invaders”, “Krull”, “Beamrider”, “Hero”, “Stargunner” and “Ms. Pacman”).
- 8 navigation tasks in 3D environments inspired by experiments with Distral (Teh et al., 2017).