We want to find two matrices $W$ and $H$ such that $V = WH$. Often a goal is to determine underlying patterns in the relationships between the concepts represented by each row and column. $V$ is some $m$ by $n$ matrix and we want the inner dimension of the factorization to be $r$. So

$$\underbrace{V}_{m \times n} = \underbrace{W}_{m \times r} \underbrace{H}_{r \times n}$$

Let's consider an example matrix where three customers (the rows) are associated with three movies (the columns) by a rating value.

$$ V = \left[\begin{array}{c c c} 5 & 4 & 1 \\\\ 4 & 5 & 1 \\\\ 2 & 1 & 5 \end{array}\right] $$

We can decompose this into two matrices with $r = 1$. First let's do this without any non-negativity constraint, using a truncated SVD that keeps only the largest singular value:

$$ W = \left[\begin{array}{c} -0.656 \\\\ -0.652 \\\\ -0.379 \end{array}\right], H = \left[\begin{array}{c c c} -6.48 & -6.26 & -3.20 \end{array}\right] $$

We can also decompose this into two matrices with $r = 1$ subject to the constraint that $w_{ij} \ge 0$ and $h_{ij} \ge 0$. (Note: this is only possible when $v_{ij} \ge 0$.)

$$ W = \left[\begin{array}{c} 0.388 \\\\ 0.386 \\\\ 0.224 \end{array}\right], H = \left[\begin{array}{c c c} 11.22 & 10.57 & 5.41 \end{array}\right] $$

Both of these $r=1$ factorizations reconstruct matrix $V$ with the same error.

$$ V \approx WH = \left[\begin{array}{c c c} 4.36 & 4.11 & 2.10 \\\\ 4.33 & 4.08 & 2.09 \\\\ 2.52 & 2.37 & 1.21 \end{array}\right] $$

If they both yield the same reconstruction error, then why is a non-negativity constraint useful? We can see above that it is easy to observe patterns in both factorizations, such as similar customers and similar movies. `TODO: motivate why NMF is better`

#### Paper Contribution

This paper discusses two approaches for iteratively computing a non-negative $W$ and $H$ starting from random initial matrices. The first is a multiplicative update rule in which the elements of $W$ and $H$ are iteratively rescaled such that the error never increases. The multiplicative approach is discussed in contrast to an additive gradient-descent approach in which small corrections are applied iteratively. The multiplicative rule can be recovered from the additive one by setting the learning rate ($\eta$) for each element to a specific ratio: the current value of that element of $H$ divided by the corresponding element of $W^TWH$.

### Still a draft
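As a concrete illustration, here is a minimal NumPy sketch of the multiplicative updates for the squared-error objective $\|V - WH\|_F^2$ (the iteration count, epsilon guard, and initialization are my own choices, not the paper's):

```python
import numpy as np

def nmf_multiplicative(V, r, n_iter=500, eps=1e-9, seed=0):
    # Lee & Seung multiplicative updates for ||V - WH||_F^2.
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, r)) + eps
    H = rng.random((r, n)) + eps
    for _ in range(n_iter):
        # Each factor is rescaled elementwise; the error never increases.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

V = np.array([[5, 4, 1], [4, 5, 1], [2, 1, 5]], dtype=float)
W, H = nmf_multiplicative(V, r=1)
print(np.round(W @ H, 2))  # close to the rank-1 reconstruction shown above
```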
---
Deeper networks should never have a higher **training** error than smaller ones. In the worst case, the additional layers should "simply" learn identities. It seems that this is not so easy with conventional networks, as they get much worse with more layers. So the idea is to add identity functions which skip some layers: each block computes $y = F(x) + x$, and the network only has to learn the **residual** $F$ (see the sketch below). Advantages:

* Learning the identity becomes learning 0, which is simpler
* Loss of information flow in the forward pass is not a problem anymore
* No vanishing / exploding gradient
* Identities don't have parameters to be learned

## Evaluation

The learning rate starts at 0.1 and is divided by 10 when the error plateaus. Weight decay of 0.0001 ($10^{-4}$), momentum of 0.9. They use mini-batches of size 128.

* ImageNet ILSVRC 2015: 3.57% (ensemble)
* CIFAR-10: 6.43%
* MS COCO: 59.0% mAP@0.5 (ensemble)
* PASCAL VOC 2007: 85.6% mAP@0.5
* PASCAL VOC 2012: 83.8% mAP@0.5

## See also

* [DenseNets](http://www.shortscience.org/paper?bibtexKey=journals/corr/1608.06993)
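A minimal PyTorch sketch of such a residual block (my own illustration of the idea, not the paper's exact architecture):

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    # y = F(x) + x: only the residual F has learnable parameters.
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # The identity skip costs no parameters; learning F == 0
        # recovers the identity mapping exactly.
        return self.relu(self.f(x) + x)
```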
---
They get multilingual alignments from dictionaries, then train a BiLSTM POS tagger in the source language, then automatically tag many tokens in the target language, then manually annotate 1000 tokens in the target language, and finally train a system with a combined loss over the distant tagging and the gold tagging. They add an additional output layer that is learned only on the gold annotations (a sketch of the idea follows below).
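A minimal sketch of how such a combined loss might look (the function names and the weight $\lambda$ are my assumptions; the paper's exact formulation may differ):

```python
import torch.nn.functional as F

def combined_loss(shared_logits, gold_logits, distant_tags, gold_tags, lam=0.5):
    # Distant (automatically projected) tags train the shared output layer;
    # the 1000 manually annotated tokens train a separate gold output layer.
    distant = F.cross_entropy(shared_logits, distant_tags)
    gold = F.cross_entropy(gold_logits, gold_tags)
    return distant + lam * gold
```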
---
## General stuff about face recognition

Face recognition has 4 main tasks:

* **Face detection**: Given an image, draw a rectangle around every face
* **Face alignment**: Transform a face to be in a canonical pose
* **Face representation**: Find a representation of a face which is suitable for follow-up tasks (small size, computationally cheap to compare, invariant to irrelevant changes)
* **Face verification**: Images of two faces are given. Decide if it is the same person or not.

The face verification task is sometimes (more simply) a face classification task (given a face, decide which of a fixed set of people it is). Datasets being used are:

* **LFW** (Labeled Faces in the Wild): 13,323 web photos of 5,749 celebrities
* **YTF** (YouTube Faces): 3,425 YouTube videos of 1,595 subjects
* **SFC** (Social Face Classification): 4.4 million labeled faces from 4,030 people, 800 to 1,200 faces each
* **USF** (Human-ID database): 3D scans of faces

## Ideas in this paper

This paper deals with face alignment and face representation.

**Face Alignment**

They made an average face with the USF dataset. Then, for each new face, they apply the following procedure:

* Find 6 points in the face (2 eyes, 1 nose tip, 2 corners of the lips, 1 middle point of the bottom lip)
* Crop according to those
* Find 67 points in the face / apply them to a normalized 3D model of a face
* Transform (= align) the face to a normalized position

**Representation**

Train a neural network on 152x152 images of faces to classify 4,030 celebrities. Remove the softmax output layer and use the output of the second-to-last layer as the face representation. The network is:

* C1 (convolution): 32 filters of size $11 \times 11 \times 3$ (RGB channels) (returns $142\times 142$ "images")
* M2 (max pooling): $3 \times 3$, stride of 2 (returns $71\times 71$ "images")
* C3 (convolution): 16 filters of size $9 \times 9 \times 16$ (returns $63\times 63$ "images")
* L4 (locally connected): $16\times9\times9\times16$ (returns $55\times 55$ "images")
* L5 (locally connected): $16\times7\times7\times16$ (returns $25\times 25$ "images")
* L6 (locally connected): $16\times5\times5\times16$ (returns $21\times 21$ "images")
* F7 (fully connected): ReLU, 4096 units
* F8 (fully connected): softmax layer with 4030 output neurons

The training was done with:

* Stochastic Gradient Descent (SGD)
* Momentum of 0.9
* Performance scheduling (LR starting at 0.01, ending at 0.0001)
* Weight initialization: $w \sim \mathcal{N}(\mu=0, \sigma=0.01)$, $b = 0.5$
* ~15 epochs ($\approx$ 3 days) of training

## Evaluation results

* **Quality**:
    * 97.35% accuracy (or mean accuracy?) with an ensemble of DNNs on LFW
    * 91.4% accuracy with a single network on YTF
* **Speed**: DeepFace runs in 0.33 seconds per image (I'm not sure which size). This includes image decoding, face detection and alignment, **the** feed-forward network (why only one? wasn't the ensemble the best performing?) and the final classification output

## See also

* Andrew Ng: [C4W4L03 Siamese Network](https://www.youtube.com/watch?v=6jfw8MuKwpI)
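For the verification task, the extracted representations of two faces are compared. Here is a toy sketch of one such comparison using cosine similarity (the names and the threshold are my own placeholders; the paper itself also evaluates weighted chi-squared and Siamese-network similarities):

```python
import numpy as np

def same_person(feat_a, feat_b, threshold=0.8):
    # Compare two face descriptors (e.g., F7 activations); the threshold
    # would be tuned on a held-out verification set.
    a = feat_a / np.linalg.norm(feat_a)
    b = feat_b / np.linalg.norm(feat_b)
    return float(a @ b) > threshold
```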
---
Sinha et al. introduce a variant of adversarial training based on distributionally robust optimization. I strongly recommend reading the paper to understand the introduced theoretical framework. The authors also provide guarantees on the obtained adversarial loss and show experimentally that this guarantee is a realistic indicator. The adversarial training variant itself follows the general strategy of training on adversarially perturbed samples in a min-max framework: in each iteration, an attacker crafts an adversarial example which the network is then trained on. In a nutshell, their approach differs from previous ones (apart from the theoretical framework) in the attacker used. Specifically, their attacker optimizes

$$\arg\max_z l(\theta, z) - \gamma \|z - z^t\|_p^2$$

where $z^t$ is a training sample chosen randomly during training. On a side note, I also recommend reading the reviews of this paper: https://openreview.net/forum?id=Hk6kPgZA-

Also view this summary at [davidstutz.de](https://davidstutz.de/category/reading/).
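A minimal sketch of that inner maximization via plain gradient ascent (step count and step size are my own assumptions; for large enough $\gamma$ the inner problem becomes concave and thus tractable):

```python
import torch

def inner_maximize(model, loss_fn, z_t, y, gamma=1.0, steps=15, lr=0.1):
    # Gradient ascent on l(theta, z) - gamma * ||z - z_t||_2^2.
    z = z_t.clone().detach().requires_grad_(True)
    for _ in range(steps):
        obj = loss_fn(model(z), y) - gamma * ((z - z_t) ** 2).sum()
        grad, = torch.autograd.grad(obj, z)
        z = (z + lr * grad).detach().requires_grad_(True)
    return z.detach()  # the network is then trained on this perturbed sample
```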
---
This paper deals with an important problem: making a deep classification system explainable. After the (continuing) success of deep networks, researchers are trying to open the black box, and this work is one of the foremost efforts. The authors explored the strength of one deep learning method (a vision-language model) to explain the performance of another deep learning model (image classification). The approach jointly predicts a class label and explains why it predicted so in natural language.

The paper starts with a very important differentiation between two basic schools of *explanation* systems: the *introspection* explanation system and the *justification* explanation system. The introspection system looks into the model to get an explanation (e.g., "This is a Western Grebe because filter 2 has a high activation..."). On the other hand, a justification system justifies the decision by producing sentence details on how visual evidence is compatible with the system output (e.g., "This is a Western Grebe because it has red eyes..."). The paper focuses on *justification* explanation systems and proposes a novel one. The authors argue that unlike a description of an image or a sentence defining a class (not necessarily in the presence of an image), a visual explanation, conditioned on an input image, provides much more of an explanatory text on why the image is classified as a certain category, mentioning only image-relevant features.

The broad outline of the approach is given in Fig (2) of the paper.

https://i.imgur.com/tta2qDp.png

The first stage consists of a deep convolutional network for classification which generates a softmax distribution over the classes. As the task handles fine-grained bird species classification, it uses a compact bilinear feature representation known to work well for fine-grained classification tasks. The second stage is a stacked LSTM which generates natural language sentences, or explanations, justifying the decision of the first stage. The first LSTM of the stack receives the previously generated word. The second LSTM receives the output of the first LSTM along with image features and the predicted label distribution from the classification network. This LSTM produces the sequence of output words until an "end-of-sentence" token is generated. The intuition behind using the predicted label distribution for explanation is that it informs the explanation generation model which words and attributes are more likely to occur in the description.

Two kinds of losses are used for the second stage, *i.e.*, the language model. The first one is termed the *Relevance Loss*, which is the typical sentence generation loss seen in the literature: the sum of cross-entropy losses of the generated words with respect to the ground truth words. Its role is to optimize the alignment between generated and ground truth sentences. However, this loss is not very effective in producing sentences which include class-discriminative information, because class specificity is a global sentence property. This is illustrated with the following example: *whereas a sentence "This is an all black bird with a bright red eye" is class specific to a "Bronzed Cowbird", words and phrases in the sentence, such as "black" or "red eye", are less class discriminative on their own.* As a result, a cross-entropy loss on individual words turns out to be less effective in capturing global sentence properties such as class specificity.
The authors address this issue by proposing an additional loss, termed the *Discriminative Loss*, which is based on a reinforcement learning paradigm. Before computing the loss, a sentence is sampled. The sentence is passed through an LSTM-based classification network whose task is to produce the ground truth category $C$ given only the sampled sentence. The reward for this operation is simply the probability of the ground truth category $C$ given only the sentence. The intuition is: for the model to produce an output with a large reward, the generated sentence must include enough information to classify the original image properly. The *Discriminative Loss* is the expectation of the negative of this reward, and a weighted linear combination of the two losses is optimized during training (a sketch follows below).

My experience in reinforcement learning is limited. However, I must say I did not quite get why sampling of the sentences is required (which called for the special algorithm for backpropagation). If the idea is to see whether a generated sentence can be used to recover the ground truth category, could the last internal state of one of the stacked LSTMs not be used instead? It would have been better to get some more intuition behind the sampling operation. Another thing which (though fairly obvious) I felt is missing is a mention of the loss used in the fine-grained classification network.

The experimentation is rigorous. The proposed method is compared with four different baseline and ablation models - description, definition, explanation-label, and explanation-discriminative - with different permutations and combinations of the two types of losses, class prediction information, etc. The evaluation metrics also measure different qualities of the generated explanations, specifically image and class relevance. To measure image relevance, METEOR/CIDEr scores of the generated sentences against the ground truth (image-based) explanations are computed. To measure class relevance, CIDEr scores against class definition sentences (not necessarily based on images from the dataset) are computed. The proposed approach consistently shows better performance than any of the baseline or ablation methods.

I'd specifically mention one experiment where the effect of class conditioning is studied (end of Sec 5.2). The finding is quite interesting, as it shows that providing or not providing correct class information has a drastic effect on the generated explanations. It is seen that giving incorrect class information makes the explanation model hallucinate colors or attributes which are not present in the image but are specific to the class. This raises the question of whether it is worth giving the class information when the classifier is poor in the first place. But I think the answer lies in the observation that row 5 (with class prediction information) in table 1 is always better than row 4 (no class prediction information). Since row 5 is better than row 4, the classifier is also reasonable, and this in turn implies that end-to-end training can improve all the stages of a pipeline, which ultimately improves the overall performance of the system too!

In summary, the paper is a very good first step toward explaining intelligent systems and should encourage a lot more effort in this direction.
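A minimal sketch of that weighted combination, with a REINFORCE-style (score-function) gradient for the discriminative term (all names and the weight $\lambda$ are my own assumptions, not the paper's code):

```python
import torch.nn.functional as F

def explanation_loss(word_logits, gt_words, sampled_logprob, reward, lam=0.5):
    # Relevance loss: cross-entropy of generated words vs. ground truth words.
    relevance = F.cross_entropy(
        word_logits.view(-1, word_logits.size(-1)), gt_words.view(-1))
    # Discriminative loss: expected negative reward, where the reward is
    # p(ground-truth class | sampled sentence). The reward itself is not
    # differentiable w.r.t. the generator, so REINFORCE weights the sampled
    # sentence's log-probability by the (detached) reward.
    discriminative = -(reward.detach() * sampled_logprob).mean()
    return relevance + lam * discriminative
```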
---
This paper from 2016 introduced a new k-mer based method to estimate isoform abundance from RNA-Seq data called kallisto. The method provided a significant improvement in speed and memory usage compared to the previously used methods while yielding similar accuracy. In fact, kallisto is able to quantify expression in a matter of minutes instead of hours.

The standard (previous) methods for quantifying expression rely on mapping, i.e. on the alignment of the sequenced transcriptome reads to a genome of reference. Reads are assigned to a position in the genome and the gene or isoform expression values are derived by counting the number of reads overlapping the features of interest. The idea behind kallisto is to rely on a pseudoalignment which does not attempt to identify the positions of the reads in the transcripts, only the potential transcripts of origin. Thus, it avoids doing an alignment of each read to a reference genome. In fact, kallisto only uses the transcriptome sequences (not the whole genome) in its first step, which is the generation of the kallisto index.

Kallisto builds a colored de Bruijn graph (T-DBG) from all the k-mers found in the transcriptome. Each node of the graph corresponds to a k-mer (a short sequence of k nucleotides) and retains the information about the transcripts in which it can be found in the form of a color. Linear stretches having the same coloring in the graph correspond to transcripts. Once the T-DBG is built, kallisto stores a hash table mapping each k-mer to its transcript(s) of origin along with the position within the transcript(s). This step is done only once and is dependent on a provided annotation file (containing the sequences of all the transcripts in the transcriptome).

Then, for a given sequenced sample, kallisto decomposes each read into its k-mers and uses those k-mers to find a path covering in the T-DBG. This path covering of the transcriptome graph, where a path corresponds to a transcript, generates k-compatibility classes for each k-mer, i.e. sets of potential transcripts of origin on the nodes. The potential transcripts of origin for a read can be obtained using the intersection of the k-compatibility classes of its k-mers. To make the pseudoalignment faster, kallisto skips redundant k-mers, since neighboring k-mers often belong to the same transcripts. Figure 1, from the paper, summarizes these different steps.

https://i.imgur.com/eNH2kuO.png

**Figure 1**. Overview of kallisto. The input consists of a reference transcriptome and reads from an RNA-seq experiment. (a) An example of a read (in black) and three overlapping transcripts with exonic regions as shown. (b) An index is constructed by creating the transcriptome de Bruijn Graph (T-DBG) where nodes (v1, v2, v3, ...) are k-mers, each transcript corresponds to a colored path as shown and the path cover of the transcriptome induces a k-compatibility class for each k-mer. (c) Conceptually, the k-mers of a read are hashed (black nodes) to find the k-compatibility class of a read. (d) Skipping (black dashed lines) uses the information stored in the T-DBG to skip k-mers that are redundant because they have the same k-compatibility class. (e) The k-compatibility class of the read is determined by taking the intersection of the k-compatibility classes of its constituent k-mers. [From Bray et al. Near-optimal probabilistic RNA-seq quantification, Nature Biotechnology, 2016.]

Then, kallisto optimizes the following RNA-Seq likelihood function using the expectation-maximization (EM) algorithm.
$$L(\alpha) \propto \prod_{f \in F} \sum_{t \in T} y_{f,t} \frac{\alpha_t}{l_t} = \prod_{e \in E}\left( \sum_{t \in e} \frac{\alpha_t}{l_t} \right )^{c_e}$$

In this function, $F$ is the set of fragments (or reads), $T$ is the set of transcripts, $l_t$ is the (effective) length of transcript $t$ and $y_{f,t}$ is a compatibility matrix defined as 1 if fragment $f$ is compatible with $t$ and 0 otherwise. The parameters $\alpha_t$ are the probabilities of selecting reads from a transcript $t$. These $\alpha_t$ are the parameters of interest since they represent the isoform abundances or relative expressions.

To make things faster, the compatibility matrix is collapsed (factorized) into equivalence classes. An equivalence class consists of all the reads compatible with the same subset of transcripts. The EM algorithm is applied to equivalence classes (not to reads). Each $\alpha_t$ will be optimized to maximize the likelihood of transcript abundances given observations of the equivalence classes. The speed of the method makes it possible to evaluate the uncertainty of the abundance estimates for each RNA-Seq sample using a bootstrap technique. For a given sample containing $N$ reads, a bootstrap sample is generated from the sampling of $N$ counts from a multinomial distribution over the equivalence classes derived from the original sample. The EM algorithm is applied on those sampled equivalence class counts to estimate transcript abundances. The bootstrap information is then used in downstream analyses such as determining which genes are differentially expressed.

Practically, we can illustrate the different steps involved in kallisto using a small example. Starting from a tiny genome with 3 transcripts, assume that the RNA-Seq experiment produced 4 reads as depicted in the image below.

https://i.imgur.com/5JDpQO8.png

The first step is to build the T-DBG graph and the kallisto index. All transcript sequences are decomposed into k-mers (here k=5) to construct the colored de Bruijn graph. Not all nodes are represented in the following drawing. The idea is that each different transcript will lead to a different path in the graph. The strand is not taken into account; kallisto is strand-agnostic.

https://i.imgur.com/4oW72z0.png

Once the index is built, the four reads of the sequenced sample can be analysed. They are decomposed into k-mers (k=5 here too) and the pre-built index is used to determine the k-compatibility class of each k-mer. Then, the k-compatibility class of each read is computed. For example, for read 1, the intersection of all the k-compatibility classes of its k-mers suggests that it might come from transcript 1 or transcript 2.

https://i.imgur.com/woektCH.png

This is done for the four reads, enabling the construction of the compatibility matrix $y_{f,t}$, which is part of the RNA-Seq likelihood function. In this equation, the $\alpha_t$ are the parameters that we want to estimate.

https://i.imgur.com/Hp5QJvH.png

The EM algorithm being too slow to be applied on millions of reads, the compatibility matrix $y_{f,t}$ is factorized into equivalence classes and a count is computed for each class (how many reads are represented by this equivalence class). The EM algorithm uses this collapsed information to maximize the new equivalent RNA-Seq likelihood function and optimize the $\alpha_t$.

https://i.imgur.com/qzsEq8A.png

The EM algorithm stops when, for every transcript $t$ such that $\alpha_t N > 0.01$ (where $N$ is the total number of reads), $\alpha_t N$ changes by less than 1%.
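As an illustration of the E/M steps over equivalence classes, here is a toy NumPy sketch (my own simplification, not kallisto's implementation; it keeps only the basic $\alpha_t / l_t$ weighting):

```python
import numpy as np

def em_quantify(eq_classes, counts, lengths, n_iter=1000, tol=1e-8):
    # eq_classes: list of tuples of transcript indices per equivalence class
    # counts: number of reads in each equivalence class
    # lengths: effective length of each transcript
    n_t = len(lengths)
    alpha = np.full(n_t, 1.0 / n_t)
    for _ in range(n_iter):
        new = np.zeros(n_t)
        for ts, c in zip(eq_classes, counts):
            # E-step: split each class's count among its compatible
            # transcripts in proportion to alpha_t / l_t.
            w = np.array([alpha[t] / lengths[t] for t in ts])
            new[list(ts)] += c * w / w.sum()
        new /= new.sum()  # M-step: renormalize to get abundances
        if np.abs(new - alpha).max() < tol:
            break
        alpha = new
    return alpha

# Toy example: 3 transcripts, 2 equivalence classes
print(em_quantify([(0, 1), (2,)], counts=[3, 1], lengths=[1.0, 1.0, 1.0]))
```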
---
Meng and Chen propose MagNet, a combination of adversarial example detection and removal. At test time, given a clean or adversarial test image, the proposed defense works as follows. First, the input is passed through one or multiple detectors; if one of these detectors fires, the input is rejected. To this end, the authors consider detection based on the reconstruction error of an auto-encoder, or detection based on the divergence between the probability predictions on the input and on its auto-encoder reconstruction. Second, if not rejected, the input is passed through a reformer. The reformer reconstructs the input, e.g., through an auto-encoder, to remove potentially undetected adversarial noise.

Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).
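A minimal sketch of the reconstruction-error detector (hypothetical names; the threshold would be chosen so that only a small fixed fraction of clean validation inputs is rejected):

```python
import torch

def reconstruction_error_detector(autoencoder, x, threshold):
    # Per-example mean squared reconstruction error; adversarial inputs
    # tend to lie off the data manifold and reconstruct poorly.
    err = ((autoencoder(x) - x) ** 2).flatten(1).mean(dim=1)
    return err > threshold  # True = reject as adversarial
```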
---
**Goal**: identify the training points most responsible for a given prediction.

Given training points $z_1, \dots, z_n$, let the loss function be $\frac{1}{n}\sum_{i=1}^n L(z_i, \theta)$. The influence function lets us compute the parameter change if $z$ were upweighted by some small $\epsilon$:

$$\hat{\theta}_{\epsilon, z} := \arg \min_{\theta \in \Theta} \frac{1}{n}\sum_{i=1}^n L(z_i, \theta) + \epsilon L(z, \theta)$$

$$\mathcal{I}_{\text{up, params}}(z) := \frac{d\hat{\theta}_{\epsilon, z}}{d\epsilon}\bigg|_{\epsilon=0} = -H_{\hat{\theta}}^{-1} \nabla_\theta L(z, \hat{\theta})$$

where $H_{\hat{\theta}} = \frac{1}{n}\sum_{i=1}^n \nabla_\theta^2 L(z_i, \hat{\theta})$ is the Hessian of the training loss. $\mathcal{I}_{\text{up, params}}(z)$ shows how upweighting one point $z$ affects the estimate of the parameters $\theta$. Furthermore, we can determine how upweighting $z$ affects the loss at a test point through the chain rule:

$$\mathcal{I}_{\text{up, loss}}(z, z_{\text{test}}) = \nabla_\theta L(z_{\text{test}}, \hat{\theta})^\top \mathcal{I}_{\text{up, params}}(z)$$

Apart from upweighting one training point, the change of the parameters under a perturbation of a training point can also be estimated:

$$\frac{d\hat{\theta}_{\epsilon, z_\delta, -z}}{d\epsilon} = \mathcal{I}_{\text{up, params}}(z_\delta) - \mathcal{I}_{\text{up, params}}(z)$$

This measures how a perturbation $\delta$ to training point $z$ (replacing $z$ with $z_\delta$) affects the parameter estimate $\theta$. Section 3 describes some practical techniques for an efficient implementation. This set of tools can be used for various interpretable machine learning tasks.
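A toy sketch of $\mathcal{I}_{\text{up, loss}}$ with an explicit Hessian (only feasible for tiny models; the paper uses implicit Hessian-vector products instead, and all names here are my own):

```python
import torch

def influence_up_loss(train_loss_fn, point_loss_fn, theta, z, z_test):
    # I_up,loss(z, z_test) = -grad L(z_test)^T H^{-1} grad L(z).
    # theta: flat parameter vector with requires_grad=True;
    # train_loss_fn(theta): average loss over all training points.
    H = torch.autograd.functional.hessian(train_loss_fn, theta)
    g_z = torch.autograd.grad(point_loss_fn(z, theta), theta)[0]
    g_test = torch.autograd.grad(point_loss_fn(z_test, theta), theta)[0]
    return -(g_test @ torch.linalg.solve(H, g_z))
```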
---
This paper tackles the challenging task of hand shape and continuous Sign Language Recognition (SLR) directly from images obtained from a common RGB camera (rather than utilizing motion sensors like Kinect). The basic idea is to create a network that is end-to-end trainable with input (i.e. images) and output (i.e. hand shape labels, word labels) sequences. The network is composed of three parts (see the sketch after the results list):

- CNN as a feature extractor
- Bidirectional LSTMs for temporal modeling
- Connectionist Temporal Classification (CTC) as a loss layer

Results:

- Observed state-of-the-art results (at the time of publishing) on the "One-Million Hands" and "RWTH-PHOENIX-Weather-2014" datasets.
- Utilizing full images rather than hand patches provides better performance for continuous SLR.
- A network that recognizes hand shape and a network that recognizes word sequences can be combined and trained together to recognize word sequences. Fine-tuning the combined system across all layers works better than fixing the "feature extraction" layers.
- A combination of two networks, where each network is trained on a separate task, performs slightly better than training each network on word sequences.
- Marginal differences in performance are observed for different decoding and post-processing techniques during sequence-to-sequence prediction.
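A minimal PyTorch sketch of the CNN -> BiLSTM -> CTC pipeline (layer sizes and names are my own placeholders, not the paper's architecture):

```python
import torch.nn as nn

class CnnBiLstmCtc(nn.Module):
    def __init__(self, n_classes, feat_dim=256, hidden=128):
        super().__init__()
        # Per-frame feature extractor (stand-in for the paper's CNN).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim))
        self.rnn = nn.LSTM(feat_dim, hidden, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_classes)  # classes include the CTC blank

    def forward(self, frames):               # frames: (batch, time, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(b, t, -1)
        h, _ = self.rnn(feats)               # bidirectional temporal modeling
        return self.out(h).log_softmax(-1)   # (batch, time, n_classes)

# Training uses nn.CTCLoss, which expects log-probs shaped (time, batch, classes):
# loss = nn.CTCLoss(blank=0)(log_probs.permute(1, 0, 2), targets,
#                            input_lengths, target_lengths)
```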