Welcome to ShortScience.org! |

- ShortScience.org is a platform for post-publication discussion aiming to improve accessibility and reproducibility of research ideas.
- The website has 1197 public summaries, mostly in machine learning, written by the community and organized by paper, conference, and year.
- Reading summaries of papers is useful to obtain the perspective and insight of another reader, why they liked or disliked it, and their attempt to demystify complicated sections.
- Also, writing summaries is a good exercise to understand the content of a paper because you are forced to challenge your assumptions when explaining it.
- Finally, you can keep up to date with the flood of research by reading the latest summaries on our Twitter and Facebook pages.

Improving MMD-GAN Training with Repulsive Loss Function

Wei Wang and Yuan Sun and Saman Halgamuge

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, cs.CV, stat.ML

**First published:** 2018/12/24 (3 weeks ago)

**Abstract:** Generative adversarial nets (GANs) are widely used to learn the data sampling
process and their performance may heavily depend on the loss functions, given a
limited computational budget. This study revisits MMD-GAN that uses the maximum
mean discrepancy (MMD) as the loss function for GAN and makes two
contributions. First, we argue that the existing MMD loss function may
discourage the learning of fine details in data as it attempts to contract the
discriminator outputs of real data. To address this issue, we propose a
repulsive loss function to actively learn the difference among the real data by
simply rearranging the terms in MMD. Second, inspired by the hinge loss, we
propose a bounded Gaussian kernel to stabilize the training of MMD-GAN with the
repulsive loss function. The proposed methods are applied to the unsupervised
image generation tasks on CIFAR-10, STL-10, CelebA, and LSUN bedroom datasets.
Results show that the repulsive loss function significantly improves over the
MMD loss at no additional computational cost and outperforms other
representative loss functions. The proposed methods achieve an FID score of
16.21 on the CIFAR-10 dataset using a single DCGAN network and spectral
normalization.
more
less

Wei Wang and Yuan Sun and Saman Halgamuge

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, cs.CV, stat.ML

**TL;DR**: Rearranging the terms in Maximum Mean Discrepancy yields a much better loss function for the discriminator of Generative Adversarial Nets. **Keywords**: Generative adversarial nets, Maximum Mean Discrepancy, spectral normalization, convolutional neural networks, Gaussian kernel, local stability. **Summary** Generative adversarial nets (GANs) are widely used to learn the data sampling process and are notoriously difficult to train. The training of GANs may be improved from three aspects: loss function, network architecture, and training process. This study focuses on a loss function called the Maximum Mean Discrepancy (MMD), defined as: $$ MMD^2(P_X,P_G)=\mathbb{E}_{P_X}k_{D}(x,x')+\mathbb{E}_{P_G}k_{D}(y,y')-2\mathbb{E}_{P_X,P_G}k_{D}(x,y) $$ where $G,D$ are the generator and discriminator networks, $x,x'$ are real samples, $y,y'$ are generated samples, $k_D=k\circ D$ is a learned kernel that calculates the similariy between two samples. Overall, MMD calculates the distance between the real and the generated sample distributions. Thus, traditionally, the generator is trained to minimize $L_G=MMD^2(P_X,P_G)$, while the discriminator minimizes $L_D=-MMD^2(P_X,P_G)$. This study makes three contributions: - It argues that $L_D$ encourages the discriminator to ignores the fine details in real data. By minimizing $L_D$, $D$ attempts to maximize $\mathbb{E}_{P_X}k_{D}(x,x')$, the similarity between real samples scores. Thus, $D$ has to focus on common features shared by real samples rather than fine details that separate them. This may slow down training. Instead, a repulsive loss is proposed, with no additional computational cost to MMD: $$ L_D^{rep}=\mathbb{E}_{P_X}k_{D}(x,x')-\mathbb{E}_{P_G}k_{D}(y,y') $$ - Inspired by the hinge loss, this study proposes a bounded Gaussian kernel for the discriminator to facilitate stable training of MMD-GAN. - The spectral normalization method divides the weight matrix at each layer by its spectral norm to enforce that each layer is Lipschitz continuous. This study proposes a simple method to calculate the spectral norm of a convolutional kernel. The results show the efficiency of proposed methods on CIFAR-10, STL-10, CelebA and LSUN-bedroom datasets. In Appendix, we prove that MMD-GAN training using gradient method is locally exponentially stable (a property that the Wasserstein loss does not have), and show that the repulsive loss works well with gradient penalty. The paper has been accepted at ICLR 2019 ([OpenReview link](https://openreview.net/forum?id=HygjqjR9Km)). The code is available at [GitHub link](https://github.com/richardwth/MMD-GAN). |

Learning a SAT Solver from Single-Bit Supervision

Selsam, Daniel and Lamm, Matthew and Bünz, Benedikt and Liang, Percy and de Moura, Leonardo and Dill, David L.

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

Selsam, Daniel and Lamm, Matthew and Bünz, Benedikt and Liang, Percy and de Moura, Leonardo and Dill, David L.

arXiv e-Print archive - 2018 via Local Bibsonomy

Keywords: dblp

The goal is to solve SAT problems with weak supervision: In that case a model is trained only to predict ***the satisfiability*** of a formula in conjunctive normal form. As a byproduct, when the formula is satisfiable, an actual satisfying assignment can be worked out by clustering the network's activations in most cases. * **Pros (+):** Weak supervision, interesting structured architecture, seems to generalize nicely to harder problems by increasing the number message passing iterations. * **Cons (-):** Limited practical applicability since it is outperfomed by classical SAT solvers. --- # NeuroSAT ## Inputs We consider Boolean logic formulas in their ***conjunctive normal form*** (CNF), i.e. each input formula is represented as a conjunction ($\land$) of **clauses**, which are themselves disjunctions ($\lor$) of litterals (positive or negative instances of variables). The goal is to learn a classifier to predict whether such a formula is satisfiable. A first problem is how to encode the input formula in such a way that it preserves the CNF invariances (invariance to negating a litteral in all clauses, invariance to permutations in $\lor$ and $\land$ etc.). The authors use an ***undirected graph representation*** where: * $\mathcal V$: vertices are the litterals (positive and negative form of variables, denoted as $x$ and $\bar x$) and the clauses occuring in the input formula * $\mathcal E$: Edges are added to connect (i) the litterals with clauses they appear in and (ii) each litteral to its negative counterpart. The graph relations are encoded as an ***adjacency matrix***, $A$, with as many rows as there are litterals and as many columns as there are clauses. In particular, this structure does not constrain the vertices ordering, and does not make any preferential treatment between positive or negative litterals. However it still has some caveats, which can be avoided by pre-processing the formula. For instance when there are disconnected components in the graph, the averaging decision rule (see next paragraph) can lead to false positives. ## Message-passing model In a high-level view, the model keeps track of an embedding for each vertex (litterals, $L^t$ and clauses, $C^t$), updated via ***message-passing on the graph***, and combined via a Multi Layer perceptrion (MLP) to output the model prediction of the formula's satisfiability. The model updates are as follow: $$ \begin{align} C^t, h_C^t &= \texttt{LSTM}_\texttt{C}(h_C^{t - 1}, A^T \texttt{MLP}_{\texttt{L}}(L^{t - 1}) )\ \ \ \ \ \ \ \ \ \ \ (1)\\ L^t, h_L^t &= \texttt{LSTM}_\texttt{L}(h_L^{t - 1}, \overline{L^{t - 1}}, A\ \texttt{MLP}_{\texttt{C}}(C^{t }) )\ \ \ \ \ \ (2) \end{align} $$ where $h$ designates a hidden context vector for the LSTMs. The operator $L \mapsto \bar{L}$ returns $\overline{L}$, the embedding matrix $L$ where the row of each litteral is swapped with the one corresponding to the litteral's negation. In other words, in **(1)** each clause embedding is updated based on the litteral that composes it, while in **(2)** each litteral embedding is updated based on the clauses it appears in and its negated counterpart. After $T$ iterations of this message-passing scheme, the model computes a ***logit for the satisfiability classification problem***, which is trained via sigmoid cross-entropy: $$ \begin{align} L^t_{\mbox{vote}} &= \texttt{MLP}_{\texttt{vote}}(L^t)\\ y^t &= \mbox{mean}(L^t_{\mbox{vote}}) \end{align} $$ --- # Training and Inference ## Training Set The training set is built such that for any satisfiable training formula $S$, it also includes an unsatisfiable counterpart $S'$ which differs from $S$ ***only by negating one litteral in one clause***. These carefully curated samples should constrain the model to pick up substantial characteristics of the formula. In practice, the model is trained on formulas containing up to ***40 variables***, and on average ***200 clauses***. At this size, the SAT problem can still be solved by state-of-the-art solvers (yielding the supervision) but are large enough they prove challenging for Machine Learning models. ## Inferring the SAT assignment When a formula is satisfiable, one often also wants to know a ***valuation*** (variable assignment) that satisfies it. Recall that $L^t_{\mbox{vote}}$ encodes a "vote" for every litteral and its negative counterpart. Qualitative experiments show that thoses scores cannot be directly used for inferring the variable assignment, however they do induce a nice clustering of the variables (once the message passing has converged). Hence an assignment can be found as follows: * (1) Reshape $L^T_{\mbox{vote}}$ to size $(n, 2)$ where $n$ is the number of litterals. * (2) Cluster the litterals into two clusters with centers $\Delta_1$ and $\Delta_2$ using the following criterion: \begin{align} \|x_i - \Delta_1\|^2 + \|\overline{x_i} - \Delta_2\|^2 \leq \|x_i - \Delta_2\|^2 + \|\overline{x_i} - \Delta_1\|^2 \end{align} * (3) Try the two resulting assignments (set $\Delta_1$ to true and $\Delta_2$ to false, or vice-versa) and choose the one that yields satisfiability if any. In practice, this method retrieves a satistifiability assignment for over 70% of the satisfiable test formulas. --- # Experiments In practice, the ***NeuroSAT*** model is trained with embeddings of dimension 128 and 26 message passing iterations using standard MLPs: 3 layers followed by ReLU activations. The final model obtains 85% accuracy in predicting a formula's satisfiability on the test set. It also can generalize to ***larger problems***, requiring to increase the number of message passing iterations, although the classification performance decreases as the problem size grows (e.g. 25% for 200 variables). Interestingly, the model also generalizes well to other classes of problems that were first ***reduced to SAT***, although they have different structure than the random formulas generated for training, which seems to show that the model does learn some general structural characteristics of Boolean formulas. |

Data Augmentation for Skin Lesion Analysis

Fábio Perez and Cristina Vasconcelos and Sandra Avila and Eduardo Valle

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.CV

**First published:** 2018/09/05 (4 months ago)

**Abstract:** Deep learning models show remarkable results in automated skin lesion
analysis. However, these models demand considerable amounts of data, while the
availability of annotated skin lesion images is often limited. Data
augmentation can expand the training dataset by transforming input images. In
this work, we investigate the impact of 13 data augmentation scenarios for
melanoma classification trained on three CNNs (Inception-v4, ResNet, and
DenseNet). Scenarios include traditional color and geometric transforms, and
more unusual augmentations such as elastic transforms, random erasing and a
novel augmentation that mixes different lesions. We also explore the use of
data augmentation at test-time and the impact of data augmentation on various
dataset sizes. Our results confirm the importance of data augmentation in both
training and testing and show that it can lead to more performance gains than
obtaining new images. The best scenario results in an AUC of 0.882 for melanoma
classification without using external data, outperforming the top-ranked
submission (0.874) for the ISIC Challenge 2017, which was trained with
additional data.
more
less

Fábio Perez and Cristina Vasconcelos and Sandra Avila and Eduardo Valle

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.CV

_Disclaimer: I'm the first author of this paper._ The code for this paper can be found at https://github.com/fabioperez/skin-data-augmentation. In this work, we wanted to compare different data augmentation scenarios for skin lesion analysis. We tried 13 scenarios, including commonly used augmentation techniques (color and geometry transformations), unusual ones (random erasing, elastic transformation, and a novel lesion mix to simulate collision lesions), and a combination of those. Examples of the augmentation scenarios: https://i.imgur.com/TpgxzLZ.png a) no augmentation b) color (saturation, contrast, and brightness) c) color (saturation, contrast, brightness, and hue) d) affine (rotation, shear, scaling) e) random flips f) random crops g) random erasing h) elastic i) lesion mix j) basic set (f, d, e, c) k) basic set + erasing (f, g, d, e, c) l) basic set + elastic (f, d, h, e, c) m) basic set + mix (i, f, d, e, c) --- We used the ISIC 2017 Challenge dataset (2000 training images, 150 validation images, and 600 test images). We tried three network architectures: Inception-v4, ResNet-152, and DenseNet-161. We also compared different test-time data augmentation methods: a) no augmentation; b) 144-crops; c) same data augmentation as training (64 augmented copies of the original image). Final prediction was the average of all augmented predictions. ## Results https://i.imgur.com/WK5VKUf.png * Basic set (combination of commonly used augmentations) is the best scenario. * Data augmentation at test-time is very beneficial. * Elastic is better than no augmentation, but when compared incorporated to the basic set, decreases the performance. * The best result was better than the winner of the challenge in 2017, without using ensembling. * Test data augmentation is very similar with 144-crop, but takes less images during prediction (64 vs 144), so it's faster. # Impact of data augmentation on dataset sizes We also used the basic set scenarios on different dataset sizes by sampling random subsets of the original dataset, with sizes 1500, 1000, 500, 250 and 125. https://i.imgur.com/m3Ut6ht.png ## Results * Using data augmentation can be better than using more data (but you should always use more data since the model can benefit from both). For instance, using 500 images with data augmentation on training and test for Inception is better than training with no data augmentation with 2000 images. * ResNet and DenseNet works better than Inception for less data. * Test-time data augmentation is always better than not augmenting on test-time. * Using data augmentation on train only was worse than not augmenting at all in some cases. |

Collective Motion of Moshers at Heavy Metal Concerts

Jesse L. Silverberg and Matthew Bierbaum and James P. Sethna and Itai Cohen

arXiv e-Print archive - 2013 via Local arXiv

Keywords: physics.soc-ph, cs.SI, physics.bio-ph

**First published:** 2013/02/07 (5 years ago)

**Abstract:** Human collective behavior can vary from calm to panicked depending on social
context. Using videos publicly available online, we study the highly energized
collective motion of attendees at heavy metal concerts. We find these extreme
social gatherings generate similarly extreme behaviors: a disordered gas-like
state called a mosh pit and an ordered vortex-like state called a circle pit.
Both phenomena are reproduced in flocking simulations demonstrating that human
collective behavior is consistent with the predictions of simplified models.
more
less

Jesse L. Silverberg and Matthew Bierbaum and James P. Sethna and Itai Cohen

arXiv e-Print archive - 2013 via Local arXiv

Keywords: physics.soc-ph, cs.SI, physics.bio-ph

ProjectionNet: Learning Efficient On-Device Deep Networks Using Neural Projections

Ravi, Sujith

arXiv e-Print archive - 2017 via Local Bibsonomy

Keywords: dblp

Ravi, Sujith

arXiv e-Print archive - 2017 via Local Bibsonomy

Keywords: dblp

*Note: This is a review of both Self Governing Neural Networks and ProjectionNet.* # [Self Governing Neural Networks (SGNN): the Projection Layer](https://github.com/guillaume-chevalier/SGNN-Self-Governing-Neural-Networks-Projection-Layer) > A SGNN's word projections preprocessing pipeline in scikit-learn In this notebook, we'll use T=80 random hashing projection functions, each of dimensionnality d=14, for a total of 1120 features per projected word in the projection function P. Next, we'll need feedforward neural network (dense) layers on top of that (as in the paper) to re-encode the projection into something better. This is not done in the current notebook and is left to you to implement in your own neural network to train the dense layers jointly with a learning objective. The SGNN projection created hereby is therefore only a preprocessing on the text to project words into the hashing space, which becomes spase 1120-dimensional word features created dynamically hereby. Only the CountVectorizer needs to be fitted, as it is a char n-gram term frequency prior to the hasher. This one could be computed dynamically too without any fit, as it would be possible to use the [power set](https://en.wikipedia.org/wiki/Power_set) of the possible n-grams as sparse indices computed on the fly as (indices, count_value) tuples, too. ```python import sklearn from sklearn.feature_extraction.text import CountVectorizer from sklearn.pipeline import Pipeline, FeatureUnion from sklearn.random_projection import SparseRandomProjection from sklearn.base import BaseEstimator, TransformerMixin from sklearn.metrics.pairwise import cosine_similarity from collections import Counter from pprint import pprint ``` ## Preparing dummy data for demonstration: ```python class SentenceTokenizer(BaseEstimator, TransformerMixin): # char lengths: MINIMUM_SENTENCE_LENGTH = 10 MAXIMUM_SENTENCE_LENGTH = 200 def fit(self, X, y=None): return self def transform(self, X): return self._split(X) def _split(self, string_): splitted_string = [] sep = chr(29) # special separator character to split sentences or phrases. string_ = string_.strip().replace(".", "." + sep).replace("?", "?" + sep).replace("!", "!" + sep).replace(";", ";" + sep).replace("\n", "\n" + sep) for phrase in string_.split(sep): phrase = phrase.strip() while len(phrase) > SentenceTokenizer.MAXIMUM_SENTENCE_LENGTH: # clip too long sentences. sub_phrase = phrase[:SentenceTokenizer.MAXIMUM_SENTENCE_LENGTH].lstrip() splitted_string.append(sub_phrase) phrase = phrase[SentenceTokenizer.MAXIMUM_SENTENCE_LENGTH:].rstrip() if len(phrase) >= SentenceTokenizer.MINIMUM_SENTENCE_LENGTH: splitted_string.append(phrase) return splitted_string with open("./data/How-to-Grow-Neat-Software-Architecture-out-of-Jupyter-Notebooks.md") as f: raw_data = f.read() test_str_tokenized = SentenceTokenizer().fit_transform(raw_data) # Print text example: print(len(test_str_tokenized)) pprint(test_str_tokenized[3:9]) ``` 168 ["Have you ever been in the situation where you've got Jupyter notebooks " '(iPython notebooks) so huge that you were feeling stuck in your code?', 'Or even worse: have you ever found yourself duplicating your notebook to do ' 'changes, and then ending up with lots of badly named notebooks?', "Well, we've all been here if using notebooks long enough.", 'So how should we code with notebooks?', "First, let's see why we need to be careful with notebooks.", "Then, let's see how to do TDD inside notebook cells and how to grow a neat " 'software architecture out of your notebooks.'] ## Creating a SGNN preprocessing pipeline's classes ```python class WordTokenizer(BaseEstimator, TransformerMixin): def fit(self, X, y=None): return self def transform(self, X): begin_of_word = "<" end_of_word = ">" out = [ [ begin_of_word + word + end_of_word for word in sentence.replace("//", " /").replace("/", " /").replace("-", " -").replace(" ", " ").split(" ") if not len(word) == 0 ] for sentence in X ] return out ``` ```python char_ngram_range = (1, 4) char_term_frequency_params = { 'char_term_frequency__analyzer': 'char', 'char_term_frequency__lowercase': False, 'char_term_frequency__ngram_range': char_ngram_range, 'char_term_frequency__strip_accents': None, 'char_term_frequency__min_df': 2, 'char_term_frequency__max_df': 0.99, 'char_term_frequency__max_features': int(1e7), } class CountVectorizer3D(CountVectorizer): def fit(self, X, y=None): X_flattened_2D = sum(X.copy(), []) super(CountVectorizer3D, self).fit_transform(X_flattened_2D, y) # can't simply call "fit" return self def transform(self, X): return [ super(CountVectorizer3D, self).transform(x_2D) for x_2D in X ] def fit_transform(self, X, y=None): return self.fit(X, y).transform(X) ``` ```python import scipy.sparse as sp T = 80 d = 14 hashing_feature_union_params = { # T=80 projections for each of dimension d=14: 80 * 14 = 1120-dimensionnal word projections. **{'union__sparse_random_projection_hasher_{}__n_components'.format(t): d for t in range(T) }, **{'union__sparse_random_projection_hasher_{}__dense_output'.format(t): False # only AFTER hashing. for t in range(T) } } class FeatureUnion3D(FeatureUnion): def fit(self, X, y=None): X_flattened_2D = sp.vstack(X, format='csr') super(FeatureUnion3D, self).fit(X_flattened_2D, y) return self def transform(self, X): return [ super(FeatureUnion3D, self).transform(x_2D) for x_2D in X ] def fit_transform(self, X, y=None): return self.fit(X, y).transform(X) ``` ## Fitting the pipeline Note: at fit time, the only thing done is to discard some unused char n-grams and to instanciate the random hash, the whole thing could be independent of the data, but here because of discarding the n-grams, we need to "fit" the data. Therefore, fitting could be avoided all along, but we fit here for simplicity of implementation using scikit-learn. ```python params = dict() params.update(char_term_frequency_params) params.update(hashing_feature_union_params) pipeline = Pipeline([ ("word_tokenizer", WordTokenizer()), ("char_term_frequency", CountVectorizer3D()), ('union', FeatureUnion3D([ ('sparse_random_projection_hasher_{}'.format(t), SparseRandomProjection()) for t in range(T) ])) ]) pipeline.set_params(**params) result = pipeline.fit_transform(test_str_tokenized) print(len(result), len(test_str_tokenized)) print(result[0].shape) ``` 168 168 (12, 1120) ## Let's see the output and its form. ```python print(result[0].toarray().shape) print(result[0].toarray()[0].tolist()) print("") # The whole thing is quite discrete: print(set(result[0].toarray()[0].tolist())) # We see that we could optimize by using integers here instead of floats by counting the occurence of every entry. print(Counter(result[0].toarray()[0].tolist())) ``` (12, 1120) [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, -2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, -2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -2.005715251142432, 0.0, -2.005715251142432, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, -2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -2.005715251142432, 0.0, 0.0, 0.0, 0.0, -2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, -2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, -2.005715251142432, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -2.005715251142432, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -2.005715251142432, 0.0, 2.005715251142432, 0.0, 0.0, 2.005715251142432, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] {0.0, 2.005715251142432, -2.005715251142432} Counter({0.0: 1069, -2.005715251142432: 27, 2.005715251142432: 24}) ## Checking that the cosine similarity before and after word projection is kept Note that this is a yet low-quality test, as the neural network layers above the projection are absent, so the similary is not yet semantic, it only looks at characters. ```python word_pairs_to_check_against_each_other = [ # Similar: ["start", "started"], ["prioritize", "priority"], ["twitter", "tweet"], ["Great", "great"], # Dissimilar: ["boat", "cow"], ["orange", "chewbacca"], ["twitter", "coffee"], ["ab", "ae"], ] before = pipeline.named_steps["char_term_frequency"].transform(word_pairs_to_check_against_each_other) after = pipeline.named_steps["union"].transform(before) for i, word_pair in enumerate(word_pairs_to_check_against_each_other): cos_sim_before = cosine_similarity(before[i][0], before[i][1])[0,0] cos_sim_after = cosine_similarity( after[i][0], after[i][1])[0,0] print("Word pair tested:", word_pair) print("\t - similarity before:", cos_sim_before, "\t Are words similar?", "yes" if cos_sim_before > 0.5 else "no") print("\t - similarity after :", cos_sim_after , "\t Are words similar?", "yes" if cos_sim_after > 0.5 else "no") print("") ``` Word pair tested: ['start', 'started'] - similarity before: 0.8728715609439697 Are words similar? yes - similarity after : 0.8542062410985866 Are words similar? yes Word pair tested: ['prioritize', 'priority'] - similarity before: 0.8458888522202895 Are words similar? yes - similarity after : 0.8495862181305898 Are words similar? yes Word pair tested: ['twitter', 'tweet'] - similarity before: 0.5439282932204212 Are words similar? yes - similarity after : 0.4826046482460216 Are words similar? no Word pair tested: ['Great', 'great'] - similarity before: 0.8006407690254358 Are words similar? yes - similarity after : 0.8175049752615363 Are words similar? yes Word pair tested: ['boat', 'cow'] - similarity before: 0.1690308509457033 Are words similar? no - similarity after : 0.10236537810666581 Are words similar? no Word pair tested: ['orange', 'chewbacca'] - similarity before: 0.14907119849998599 Are words similar? no - similarity after : 0.2019908169580899 Are words similar? no Word pair tested: ['twitter', 'coffee'] - similarity before: 0.09513029883089882 Are words similar? no - similarity after : 0.1016460166230715 Are words similar? no Word pair tested: ['ab', 'ae'] - similarity before: 0.408248290463863 Are words similar? no - similarity after : 0.42850530886130067 Are words similar? no ## Next up So we have created the sentence preprocessing pipeline and the sparse projection (random hashing) function. We now need a few feedforward layers on top of that. Also, a few things could be optimized, such as using the power set of the possible n-gram values with a predefined character set instead of fitting it, and the Hashing's fit function could be avoided as well by passing the random seed earlier, because the Hasher doesn't even look at the data and it only needs to be created at some point. This would yield a truly embedding-free approach. Free to you to implement this. I wanted to have something that worked first, leaving optimization for later. ## License BSD 3-Clause License Copyright (c) 2018, Guillaume Chevalier All rights reserved. ## Extra links ### Connect with me - [LinkedIn](https://ca.linkedin.com/in/chevalierg) - [Twitter](https://twitter.com/guillaume_che) - [GitHub](https://github.com/guillaume-chevalier/) - [Quora](https://www.quora.com/profile/Guillaume-Chevalier-2) - [YouTube](https://www.youtube.com/c/GuillaumeChevalier) - [Dev/Consulting](http://www.neuraxio.com/en/) ### Liked this piece of code? Did it help you? Leave a [star](https://github.com/guillaume-chevalier/SGNN-Self-Governing-Neural-Networks-Projection-Layer/stargazers), [fork](https://github.com/guillaume-chevalier/SGNN-Self-Governing-Neural-Networks-Projection-Layer/network/members) and share the love! # ProjectionNets **Notes are from [Issue 1](https://github.com/guillaume-chevalier/SGNN-Self-Governing-Neural-Networks-Projection-Layer/issues/1)**: Very interesting. I've finally read the [previous supporting paper](https://arxiv.org/pdf/1708.00630.pdf), thanks for the shootout. Here are my thoughts after reading it. To sum up, I think that the projections are at word-level instead of at sentence level. This is for two reasons: 1. they use a hidden layer size of only 256 to represent words neurally (whereas sentence representations would be quite bigger), and 2. they seem to use an LSTM on top of the ProjectionNet (SGNN) to model short sentences in their benchmarks, which would mean the ProjectionNet doesn't encode at sentence-level but at least at a lower level (probably words). Here is my full review: On 80\*14 v.s. 1\*1120 projections: - I thought the set of 80 projection functions was not for time performance, but rather to make the union of potentially different features. I think that either way, if one projection function of 1120 entries would take as much time to compute as 80 functions of 14 entries (80\*14=1120) - please correct me if I'm wrong. On the hidden layer size of 256: - I find peculiar that the size of their FullyConnected layers is only of 256. I'd expect 300 for word-level neural representations and rather 2000 for sentence-level neural representations. This leads me to think that the projection layer is at the word-level with char features and not at the sentence-level with char features. On the benchmark against a nested RNN (see section "4.3 Semantic Intent Classification") in the previous supporting paper: - They say "We use an RNN sequence model with multilayer LSTM architecture (2 layers, 100 dimensions) as the baseline and trainer network. The LSTM model and its ProjectionNet variant are also compared against other baseline systems [...]". The fact they phrase their experiment as "The LSTM model and its ProjectionNet" leads me to think that they pre-tokenized texts on words and that the projection layer is applied at word-level from skip-gram char features. This would seem to go in the same direction of the fact they use a hidden layer (FullyConnected) size of only 256 rather than something over or equal to like 1000. On [teacher-student model training](https://www.quora.com/What-is-a-teacher-student-model-in-a-Convolutional-neural-network/answer/Guillaume-Chevalier-2): - They seem to use a regular NN like a crutch to assist the projection layer's top-level layer to reshape information correctly. They even train the teacher at the same time that they train the student SGNN, which is something I hadn't yet seen compared to regular teacher-student setups. I'd find simpler to use a Matching Networks directly which would be quite simpler than setting up student-teacher learning. I'm not sure how their "graph structured loss functions" works - I yet still assume that they'd need train the whole thing like in word2vec with skip-gram or CBOW (but here with the new type of skip-gram training procedure instead of the char feature-extraction skip-gram). I wonder why they did things in a so complicated way. Matching Networks (a.k.a. cosine similarity loss, a.k.a. self-attention queries dotted with either attention keys or values before a softmax) directly with negative sampling seems so much simpler. |

Investigating Human Priors for Playing Video Games

Rachit Dubey and Pulkit Agrawal and Deepak Pathak and Thomas L. Griffiths and Alexei A. Efros

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.AI, cs.LG

**First published:** 2018/02/28 (10 months ago)

**Abstract:** What makes humans so good at solving seemingly complex video games? Unlike
computers, humans bring in a great deal of prior knowledge about the world,
enabling efficient decision making. This paper investigates the role of human
priors for solving video games. Given a sample game, we conduct a series of
ablation studies to quantify the importance of various priors on human
performance. We do this by modifying the video game environment to
systematically mask different types of visual information that could be used by
humans as priors. We find that removal of some prior knowledge causes a drastic
degradation in the speed with which human players solve the game, e.g. from 2
minutes to over 20 minutes. Furthermore, our results indicate that general
priors, such as the importance of objects and visual consistency, are critical
for efficient game-play. Videos and the game manipulations are available at
https://rach0012.github.io/humanRL_website/
more
less

Rachit Dubey and Pulkit Agrawal and Deepak Pathak and Thomas L. Griffiths and Alexei A. Efros

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.AI, cs.LG

Authors investigated why humans play some video games better than machines. That is the case for games that do not have continuous rewards (e.g., scores). They experimented with a game -- inspired by _Montezuma's Revenge_ -- in which the player has to climb stairs, collect keys and jump over enemies. RL algorithms can only know if they succeed if they finish the game, as there is no rewards during the gameplay, so they tend to do much worse than humans in these games. To compare between humans and machines, they set up RL algorithms and recruite players from Amazon Mechanical Turk. Humans did much better than machines for the original game setup. However, authors wanted to check the impact of semantics and prior knowledge on humans performance. They set up scenarios with different levels of reduced semantic information, as shown in Figure 2. https://i.imgur.com/e0Dq1WO.png This is what the game originally looked like: https://rach0012.github.io/humanRL_website/main.gif And this is the version with lesser semantic clues: https://rach0012.github.io/humanRL_website/random2.gif You can try yourself in the [paper's website](https://rach0012.github.io/humanRL_website/). Not surprisingly, humans took much more time to complete the game in scenarios with less semantic information, indicating that humans strongly rely on prior knowledge to play video games. The authors argue that this prior knowledge should also be somehow included into RL algorithms in order to move their efficiency towards the human level. ## Additional reading [Why humans learn faster than AI—for now](https://www.technologyreview.com/s/610434/why-humans-learn-faster-than-ai-for-now/). [OpenReview submission](https://openreview.net/forum?id=Hk91SGWR-) |

Progressive Growing of GANs for Improved Quality, Stability, and Variation

Karras, Tero and Aila, Timo and Laine, Samuli and Lehtinen, Jaakko

arXiv e-Print archive - 2017 via Local Bibsonomy

Keywords: dblp

Karras, Tero and Aila, Timo and Laine, Samuli and Lehtinen, Jaakko

arXiv e-Print archive - 2017 via Local Bibsonomy

Keywords: dblp

## **Keywords** Progressive GAN , High resolution generator --- ## **Summary** 1. **Introduction** 1. **Goal of the paper** 1. Generation of very high quality images using progressively increasing size of the generator and discriminator. 1. Improved training and stability of GANs. 1. New metric for evaluating GAN results. 1. A high quality version of CELEBA-HQ dataset. 1. **Previous Research** 1. Generative methods help to produce new samples from higher-dimensional data distributions such as images . 1. The common approaches for generative methods are : 1. Autoregressive models : Produce sharp images and are slow to evaluate. eg PixelCNN 1. Variational Autoencoders : Easy to train but produces blurry images. 1. Generative Adversarial Neural Network : Produces sharp images at small resolutions but are highly unstable. 1. **Method** 1. **Basic GAN architecture** 1. Gan consists of two major parts : 1. _Generator_ : Creates a sample image from latent code which look very close to the training images. 1. _Discriminator_: Discriminator is trained to assess how close the sample image looks to the training image. 1. To measure the overlap between the training and the generated distributions many methods are used like Jensen-Shannon divergence , least-squares divergence and Wasserstein Distance. 1. Larger resolution generations cause problems because it becomes difficult for both the training and the generated networks amplifying the gradient problem. Larger resolutions also require large memory and can cause problems. 1. A mechanism is also proposed to stop the generator from participating in escalation that causes mode collapse problem. 1. **Progressive growing of GANs** 1. The primary method for the GAN training is to start off from a low resolution image and add extra layers in each step of the training process. 1. Lower resolution images are more stable as they have very less class information and as the resolution of the image increases further smaller details and features are added to the image. 1. This leads to a smooth increase in the quality of image instead of the network learning lot of details in one single step. 1. **Mini-batch separation** 1. GANs tend to capture only a very small set of features from the image. 1. "Minibatch discrimination" is used to generate feature vector for each individual image along with one for the the mini batch of images also. ![alt_text](https://i.imgur.com/dHFl5OV.png "image_tooltip") 1. **Conclusion** 1. Higher resolution images are able to be generated which are robust and efficient. 1. Improved quality of the generated images is given. 1. Reduced training time for a comparable result and output quality and resolution. --- ## **Notes** * Gradient Problem : At higher resolutions it becomes easier to tell the differences between the training and the testing images [1]. This is referred to as the gradient problem. * Mode Collapse : The generator is incapable of creating a large variety of samples and get stuck. ## **Open research questions** 1. Improved methods for a true photorealism generation of images. 1. Improved semantic sensibility and improved understanding of the dataset. ## **References** 1. [https://blog.acolyer.org/2018/05/10/progressive-growing-of-gans-for-improved-quality-stability-and-variation/](https://blog.acolyer.org/2018/05/10/progressive-growing-of-gans-for-improved-quality-stability-and-variation/) 1. [https://medium.com/@jonathan_hui/gan-why-it-is-so-hard-to-train-generative-advisory-networks-819a86b3750b](https://medium.com/@jonathan_hui/gan-why-it-is-so-hard-to-train-generative-advisory-networks-819a86b3750b) |

Do Deep Generative Models Know What They Don't Know?

Eric Nalisnick and Akihiro Matsukawa and Yee Whye Teh and Dilan Gorur and Balaji Lakshminarayanan

arXiv e-Print archive - 2018 via Local arXiv

Keywords: stat.ML, cs.LG

**First published:** 2018/10/22 (2 months ago)

**Abstract:** A neural network deployed in the wild may be asked to make predictions for
inputs that were drawn from a different distribution than that of the training
data. A plethora of work has demonstrated that it is easy to find or synthesize
inputs for which a neural network is highly confident yet wrong. Generative
models are widely viewed to be robust to such mistaken confidence as modeling
the density of the input features can be used to detect novel,
out-of-distribution inputs. In this paper we challenge this assumption. We find
that the density learned by flow-based models, VAEs, and PixelCNNs cannot
distinguish images of common objects such as dogs, trucks, and horses (i.e.
CIFAR-10) from those of house numbers (i.e. SVHN), assigning a higher
likelihood to the latter when the model is trained on the former. Moreover, we
find evidence of this phenomenon when pairing several popular image data sets:
FashionMNIST vs MNIST, CelebA vs SVHN, ImageNet vs CIFAR-10 / CIFAR-100 / SVHN.
To investigate this curious behavior, we focus analysis on flow-based
generative models in particular since they are trained and evaluated via the
exact marginal likelihood. We find such behavior persists even when we restrict
the flow models to constant-volume transformations. These transformations admit
some theoretical analysis, and we show that the difference in likelihoods can
be explained by the location and variances of the data and the model curvature.
Our results caution against using the density estimates from deep generative
models to identify inputs similar to the training distribution until their
behavior for out-of-distribution inputs is better understood.
more
less

Eric Nalisnick and Akihiro Matsukawa and Yee Whye Teh and Dilan Gorur and Balaji Lakshminarayanan

arXiv e-Print archive - 2018 via Local arXiv

Keywords: stat.ML, cs.LG

CNNs predictions are known to be very sensitive to adversarial examples, which are samples generated to be wrongly classifiied with high confidence. On the other hand, probabilistic generative models such as `PixelCNN` and `VAEs` learn a distribution over the input domain hence could be used to detect ***out-of-distribution inputs***, e.g., by estimating their likelihood under the data distribution. This paper provides interesting results showing that distributions learned by generative models are not robust enough yet to employ them in this way. * **Pros (+):** convincing experiments on multiple generative models, more detailed analysis in the invertible flow case, interesting negative results. * **Cons (-):** It would be interesting to provide further results for different datasets / domain shifts to observe if this property can be quanitfied as a characteristics of the model or of the input data. --- ## Experimental negative result Three classes of generative models are considered in this paper: * **Auto-regressive** models such as `PixelCNN` [1] * **Latent variable** models, such as `VAEs` [2] * Generative models with **invertible flows** [3], in particular `Glow` [4]. The authors train a generative model $G$ on input data $\mathcal X$ and then use it to evaluate the likelihood on both the training domain $\mathcal X$ and a different domain $\tilde{\mathcal X}$. Their main (negative) result is showing that **a model trained on the CIFAR-10 dataset yields a higher likelihood when evaluated on the SVHN test dataset than on the CIFAR-10 test (or even train) split**. Interestingly, the converse, when training on SVHN and evaluating on CIFAR, is not true. This result was consistantly observed for various architectures including [1], [2] and [4], although it is of lesser effect in the `PixelCNN` case. Intuitively, this could come from the fact that both of these datasets contain natural images and that CIFAR-10 is strictly more diverse than SVHN in terms of semantic content. Nonetheless, these datasets vastly differ in appearance, and this result is counter-intuitive as it goes against the direction that generative models can reliably be use to detect out-of-distribution samples. Furthermore, this observation also confirms the general idea that higher likelihoods does not necessarily coincide with better generated samples [5]. --- ## Further analysis for invertible flow models The authors further study this phenomenon in the invertible flow models case as they provide a more rigorous analytical framework (exact likelihood inference unlike VAE which only provide a bound on the true likelihood). More specifically invertible flow models are characterized with a ***diffeomorphism*** (invertible function), $f(x; \phi)$, between input space $\mathcal X$ and latent space $\mathcal Z$, and choice of the latent distribution $p(z; \psi)$. The ***change of variable formula*** links the density of $x$ and $z$ as follows: $$ \int_x p_x(x)d_x = \int_x p_z(f(x)) \left| \frac{\partial f}{\partial x} \right| dx $$ And the training objective under this transformation becomes $$ \arg\max_{\theta} \log p_x(\mathbf{x}; \theta) = \arg\max_{\phi, \psi} \sum_i \log p_z(f(x_i; \phi); \psi) + \log \left| \frac{\partial f_{\phi}}{\partial x_i} \right| $$ Typically, $p_z$ is chosen to be Gaussian, and samples are build by inverting $f$, i.e.,$z \sim p(\mathbf z),\ x = f^{-1}(z)$. And $f_{\phi}$ is build such that computing the log determinant of the Jacabian in the previous equation can be done efficiently. First, they observe that contribution of the flow can be decomposed in a ***density*** element (left term) and a ***volume*** element (right term), resulting from the change of variables formula. Experiment results with Glow [4] show that the higher density on SVHN mostly comes from the ***volume element contribution***. Secondly, they try to directly analyze the difference in likelihood between two domains $\mathcal X$ and $\tilde{\mathcal X}$; which can be done by a second-order expansion of the log-likelihood locally around the expectation of the distribution (assuming $\mathbb{E} (\mathcal X) \sim \mathbb{E}(\tilde{\mathcal X})$). For the constant volume Glow module, the resulting analytical formula indeed confirms that the log-likelihood of SVHN should be higher than CIFAR's, as observed in practice. --- ## References * [1] Conditional Image Generation with PixelCNN Decoders, van den Oord et al, 2016 * [2] Auto-Encoding Variational Bayes, Kingma and Welling, 2013 * [3] Density estimation using Real NVP, Dinh et al., ICLR 2015 * [4] Glow: Generative Flow with Invertible 1x1 Convolutions, Kingma and Dhariwal * [5] A Note on the Evaluation of Generative Models, Theis et al., ICLR 2016 |

Neural Ordinary Differential Equations

Ricky T. Q. Chen and Yulia Rubanova and Jesse Bettencourt and David Duvenaud

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, cs.AI, stat.ML

**First published:** 2018/06/19 (6 months ago)

**Abstract:** We introduce a new family of deep neural network models. Instead of
specifying a discrete sequence of hidden layers, we parameterize the derivative
of the hidden state using a neural network. The output of the network is
computed using a black-box differential equation solver. These continuous-depth
models have constant memory cost, adapt their evaluation strategy to each
input, and can explicitly trade numerical precision for speed. We demonstrate
these properties in continuous-depth residual networks and continuous-time
latent variable models. We also construct continuous normalizing flows, a
generative model that can train by maximum likelihood, without partitioning or
ordering the data dimensions. For training, we show how to scalably
backpropagate through any ODE solver, without access to its internal
operations. This allows end-to-end training of ODEs within larger models.
more
less

Ricky T. Q. Chen and Yulia Rubanova and Jesse Bettencourt and David Duvenaud

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, cs.AI, stat.ML

Summary by senior author [duvenaud on hackernews](https://news.ycombinator.com/item?id=18678078). A few years ago, everyone switched their deep nets to "residual nets". Instead of building deep models like this: h1 = f1(x) h2 = f2(h1) h3 = f3(h2) h4 = f3(h3) y = f5(h4) They now build them like this: h1 = f1(x) + x h2 = f2(h1) + h1 h3 = f3(h2) + h2 h4 = f4(h3) + h3 y = f5(h4) + h4 Where f1, f2, etc are neural net layers. The idea is that it's easier to model a small change to an almost-correct answer than to output the whole improved answer at once. In the last couple of years a few different groups noticed that this looks like a primitive ODE solver (Euler's method) that solves the trajectory of a system by just taking small steps in the direction of the system dynamics and adding them up. They used this connection to propose things like better training methods. We just took this idea to its logical extreme: What if we _define_ a deep net as a continuously evolving system? So instead of updating the hidden units layer by layer, we define their derivative with respect to depth instead. We call this an ODE net. Now, we can use off-the-shelf adaptive ODE solvers to compute the final state of these dynamics, and call that the output of the neural network. This has drawbacks (it's slower to train) but lots of advantages too: We can loosen the numerical tolerance of the solver to make our nets faster at test time. We can also handle continuous-time models a lot more naturally. It turns out that there is also a simpler version of the change of variables formula (for density modeling) when you move to continuous time. |

Learning Confidence for Out-of-Distribution Detection in Neural Networks

Terrance DeVries and Graham W. Taylor

arXiv e-Print archive - 2018 via Local arXiv

Keywords: stat.ML, cs.LG

**First published:** 2018/02/13 (11 months ago)

**Abstract:** Modern neural networks are very powerful predictive models, but they are
often incapable of recognizing when their predictions may be wrong. Closely
related to this is the task of out-of-distribution detection, where a network
must determine whether or not an input is outside of the set on which it is
expected to safely perform. To jointly address these issues, we propose a
method of learning confidence estimates for neural networks that is simple to
implement and produces intuitively interpretable outputs. We demonstrate that
on the task of out-of-distribution detection, our technique surpasses recently
proposed techniques which construct confidence based on the network's output
distribution, without requiring any additional labels or access to
out-of-distribution examples. Additionally, we address the problem of
calibrating out-of-distribution detectors, where we demonstrate that
misclassified in-distribution examples can be used as a proxy for
out-of-distribution examples.
more
less

Terrance DeVries and Graham W. Taylor

arXiv e-Print archive - 2018 via Local arXiv

Keywords: stat.ML, cs.LG

## Summary In a prior work 'On Calibration of Modern Nueral Networks', temperature scailing is used for outputing confidence. This is done at inference stage, and does not change the existing classifier. This paper considers the confidence at training stage, and directly outputs the confidence from the network. ## Architecture An additional branch for confidence is added after the penultimate layer, in parallel to logits and probs (Figure 2). https://i.imgur.com/vtKq9g0.png ## Training The network outputs the prob $p$ and the confidence $c$ which is a single scalar. The modified prob $p'=c*p+(1-c)y$ where $y$ is the label (hint). The confidence loss is $\mathcal{L}_c=-\log c$, the NLL is $\mathcal{L}_t= -\sum \log(p'_i)y_i$. ### Budget Parameter The authors introduced the confidence loss weight $\lambda$ and a budget $\beta$. If $\mathcal{L}_c>\beta$, increase $\lambda$, if $\mathcal{L}_c<\beta$, decrease $\lambda$. $\beta$ is found reasonable in [0.1,1.0]. ### Hinting with 50% Sometimes the model relies on the free label ($c=0$) and does not fit the complicated structure of data. The authors give hints with 50% so the model cannot rely 100% on the hint. They used $p'$ for only half of the bathes for each epoch. ### Misclassified Examples A high-capacity network with small dataset overfits well, and mis-classified samples are required to learn the confidence. The network likely assigns low confidence to samples. The paper used an aggressive data augmentation to create difficult examples. ## Inference Reject if $c\le\delta$. For out-of-distribution detection, they used the same input perturbation as in ODIN (2018). ODIN used temperature scailing and used the max prob, while this paper does not need temperature scailing since it directly outputs $c$. In evaluation, this paper outperformed ODIN. ## Reference ODIN: [Enhancing The Reliability of Out-of-distribution Image Detection in Neural Networks](http://www.shortscience.org/paper?bibtexKey=journals/corr/1706.02690#elbaro) |

Enhancing The Reliability of Out-of-distribution Image Detection in Neural Networks

Shiyu Liang and Yixuan Li and R. Srikant

arXiv e-Print archive - 2017 via Local arXiv

Keywords: cs.LG, stat.ML

**First published:** 2017/06/08 (1 year ago)

**Abstract:** We consider the problem of detecting out-of-distribution images in neural
networks. We propose ODIN, a simple and effective method that does not require
any change to a pre-trained neural network. Our method is based on the
observation that using temperature scaling and adding small perturbations to
the input can separate the softmax score distributions between in- and
out-of-distribution images, allowing for more effective detection. We show in a
series of experiments that ODIN is compatible with diverse network
architectures and datasets. It consistently outperforms the baseline approach
by a large margin, establishing a new state-of-the-art performance on this
task. For example, ODIN reduces the false positive rate from the baseline 34.7%
to 4.3% on the DenseNet (applied to CIFAR-10) when the true positive rate is
95%.
more
less

Shiyu Liang and Yixuan Li and R. Srikant

arXiv e-Print archive - 2017 via Local arXiv

Keywords: cs.LG, stat.ML

## Task Add '**rejection**' output to an existing classification model with softmax layer. ## Method 1. Choose some threshold $\delta$ and temperature $T$ 2. Add a perturbation to the input x (eq 2), let $\tilde x = x - \epsilon \text{sign}(-\nabla_x \log S_{\hat y}(x;T))$ 3. If $p(\tilde x;T)\le \delta$, rejects 4. If not, return the output of the original classifier $p(\tilde x;T)$ is the max prob with temperature scailing for input $\tilde x$ $\delta$ and $T$ are manually chosen. |

On Calibration of Modern Neural Networks

Chuan Guo and Geoff Pleiss and Yu Sun and Kilian Q. Weinberger

arXiv e-Print archive - 2017 via Local arXiv

Keywords: cs.LG

**First published:** 2017/06/14 (1 year ago)

**Abstract:** Confidence calibration -- the problem of predicting probability estimates
representative of the true correctness likelihood -- is important for
classification models in many applications. We discover that modern neural
networks, unlike those from a decade ago, are poorly calibrated. Through
extensive experiments, we observe that depth, width, weight decay, and Batch
Normalization are important factors influencing calibration. We evaluate the
performance of various post-processing calibration methods on state-of-the-art
architectures with image and document classification datasets. Our analysis and
experiments not only offer insights into neural network learning, but also
provide a simple and straightforward recipe for practical settings: on most
datasets, temperature scaling -- a single-parameter variant of Platt Scaling --
is surprisingly effective at calibrating predictions.
more
less

Chuan Guo and Geoff Pleiss and Yu Sun and Kilian Q. Weinberger

arXiv e-Print archive - 2017 via Local arXiv

Keywords: cs.LG

## Task A neural network for classification typically has a **softmax** layer and outputs the class with the max probability. However, this probability does not represent the **confidence**. If the average confidence (average of max probs) for a dataset matches the accuracy, it is called **well-calibrated**. Old models like LeNet (1998) was well-calibrated, but modern networks like ResNet (2016) are no longer well-calibrated. This paper explains what caused this and compares various calibration methods. ## Figure - Confidence Histogram https://i.imgur.com/dMtdWsL.png The bottom row: group the samples by confidence (max probailities) into bins, and calculates the accuracy (# correct / # bin size) within each bin. - ECE (Expected Calibration Error): average of |accuracy-confidence| of bins - MCE (Maximum Calibration Error): max of |accuracy-confidence| of bins ## Analysis - What The paper experiments how models are mis-calibrated with different factors: (1) model capacity, (2) batch norm, (3) weight decay, (4) NLL. ## Solution - Calibration Methods Many calibration methods for binary classification and multi-class classification are evaluated. The method that performed the best is **temperature scailing**, which simply multiplies logits before the softmax by some constant. The paper used the validation set to choose the best constant. |

Learning Latent Dynamics for Planning from Pixels

Danijar Hafner and Timothy Lillicrap and Ian Fischer and Ruben Villegas and David Ha and Honglak Lee and James Davidson

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, cs.AI, stat.ML

**First published:** 2018/11/12 (2 months ago)

**Abstract:** Planning has been very successful for control tasks with known environment
dynamics. To leverage planning in unknown environments, the agent needs to
learn the dynamics from interactions with the world. However, learning dynamics
models that are accurate enough for planning has been a long-standing
challenge, especially in image-based domains. We propose the Deep Planning
Network (PlaNet), a purely model-based agent that learns the environment
dynamics from pixels and chooses actions through online planning in latent
space. To achieve high performance, the dynamics model must accurately predict
the rewards ahead for multiple time steps. We approach this problem using a
latent dynamics model with both deterministic and stochastic transition
function and a generalized variational inference objective that we name latent
overshooting. Using only pixel observations, our agent solves continuous
control tasks with contact dynamics, partial observability, and sparse rewards.
PlaNet uses significantly fewer episodes and reaches final performance close to
and sometimes higher than top model-free algorithms.
more
less

Danijar Hafner and Timothy Lillicrap and Ian Fischer and Ruben Villegas and David Ha and Honglak Lee and James Davidson

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.LG, cs.AI, stat.ML

**Summary**: This paper presents three tricks that make model-based reinforcement more reliable when tested in tasks that require walking and balancing. The tricks are 1) are planning based on features, 2) using a recursive network that mixes probabilistic and deterministic information, and 3) looking forward multiple steps. **Longer summary** Imagine playing pool, armed with a tablet that can predict exactly where the ball will bounce, and the next bounce, and so on. That would be a huge advantage to someone learning pool, however small inaccuracies in the model could mislead you especially when thinking ahead to the 2nd and third bounce. The tablet is analogous to the dynamics model in model-based reinforcement learning (RL). Model based RL promises to solve a lot of the open problems with RL, letting the agent learn with less experience, transfer well, dream, and many others advantages. Despite the promise, dynamics models are hard to get working: they often suffer from even small inaccuracies, and need to be redesigned for specific tasks. Enter PlaNet, a clever name, and a net that plans well in range of environments. To increase the challenge the model must predict directly from pixels in fairly difficult tasks such as teaching a cheetah to run or balancing a ball in a cup. How do they do this? Three main tricks. - Planning in latest space: this means that the policy network doesn't need to look at the raw image, but looks at a summary of it as represented by a feature vector. - Recurrent state space models: They found that probabilistic information helps describe the space of possibilities but makes it harder for their RNN based model to look back multiple steps. However mixing probabilistic information and deterministic information gives it the best of both worlds, and they have results that show a starting performance increase when both compared to just one. - Latent overshooting: They train the model to look more than one step ahead, this helps prevent errors that build up over time Overall this paper shows great results that tackle the shortfalls of model based RL. I hope the results remain when tested on different and more complex environments. |

Discovery of Latent 3D Keypoints via End-to-end Geometric Reasoning

Supasorn Suwajanakorn and Noah Snavely and Jonathan Tompson and Mohammad Norouzi

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.CV, cs.LG, stat.ML

**First published:** 2018/07/05 (6 months ago)

**Abstract:** This paper presents KeypointNet, an end-to-end geometric reasoning framework
to learn an optimal set of category-specific 3D keypoints, along with their
detectors. Given a single image, KeypointNet extracts 3D keypoints that are
optimized for a downstream task. We demonstrate this framework on 3D pose
estimation by proposing a differentiable objective that seeks the optimal set
of keypoints for recovering the relative pose between two views of an object.
Our model discovers geometrically and semantically consistent keypoints across
viewing angles and instances of an object category. Importantly, we find that
our end-to-end framework using no ground-truth keypoint annotations outperforms
a fully supervised baseline using the same neural network architecture on the
task of pose estimation. The discovered 3D keypoints on the car, chair, and
plane categories of ShapeNet are visualized at http://keypointnet.github.io/.
more
less

Supasorn Suwajanakorn and Noah Snavely and Jonathan Tompson and Mohammad Norouzi

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.CV, cs.LG, stat.ML

What the paper is about: KeypointNet learns the optimal set of 3D keypoints and their 2D detectors for a specified downstream task. The authors demonstrate this by extracting 3D keypoints and their 2D detectors for the task of relative pose estimation across views. They show that, using keypoints extracted by KeypointNet, relative pose estimates are superior to ones that are obtained from a supervised set of keypoints. Approach: Training samples for KeypointNet comprise two views (images) of an object. The task is to then produce an ordered list of 3D keypoints that, upon orthogonal procrustes alignment, produce the true relative 3D pose across those views. The network has N heads, each of which extracts one (3D) keypoint (from a 2D image). There are two primary loss terms. A multi-view consistency loss measures the discrepancy between the two sets of extracted keypoints under the ground-truth transform. A relative-pose estimation loss penalizes the angular discrepency (under orthogonal procrustes) of the estimated transform using the extracted keypoints vs the GT transform. Additionally, they require keypoints to be distant from each other, and to lie within the object silhouette. |

The Reversible Residual Network: Backpropagation Without Storing Activations.

Aidan N. Gomez and Mengye Ren and Raquel Urtasun and Roger B. Grosse

Neural Information Processing Systems Conference - 2017 via Local dblp

Keywords:

Aidan N. Gomez and Mengye Ren and Raquel Urtasun and Roger B. Grosse

Neural Information Processing Systems Conference - 2017 via Local dblp

Keywords:

Residual Networks (ResNets) have greatly advanced the state-of-the-art in Deep Learning by making it possible to train much deeper networks via the addition of skip connections. However, in order to compute gradients during the backpropagation pass, all the units' activations have to be stored during the feed-forward pass, leading to high memory requirements for these very deep networks. Instead, the authors propose a **reversible architecture** based on ResNets, in which activations at one layer can be computed from the ones of the next. Leveraging this invertibility property, they design a more efficient implementation of backpropagation, effectively trading compute power for memory storage. * **Pros (+): ** The change does not negatively impact model accuracy (for equivalent number of model parameters) and it only requires a small change in the backpropagation algorithm. * **Cons (-): ** Increased number of parameters, thus need to change the unit depth to match the "equivalent" ResNet --- # Proposed Architecture ## RevNet This paper proposes to incorporate idea from previous reversible architectures, such as NICE [1], into a standard ResNet. The resulting model is called **RevNet** and is composed of reversible blocks, inspired from *additive coupling* [1, 2]: $ \begin{array}{r|r} \texttt{RevNet Block} & \texttt{Inverse Transformation}\\ \hline \mathbf{input }\ x & \mathbf{input }\ y \\ x_1, x_2 = \mbox{split}(x) & y1, y2 = \mbox{split}(y)\\ y_1 = x_1 + \mathcal{F}(x_2) & x_2 = y_2 - \mathcal{G}(y_1) \\ y_2 = x_2 + \mathcal{G}(y_1) & x_1 = y_1 - \mathcal{F}(x_2)\\ \mathbf{output}\ y = (y_1, y_2) & \mathbf{output}\ x = (x_1, x_2) \end{array} $ where $\mathcal F$ and $\mathcal G$ are residual functions, composed of sequences of convolutions, ReLU and Batch Normalization layers, analoguous to the ones in a standard ResNet block, although operations in the reversible blocks need to have a stride of 1 to avoid information loss and preserve invertibility. Finally, for the `split` operation, the authors consider spliting the input Tensor across the channel dimension as in [1, 2]. Similarly to ResNet, the final RevNet architecture is composed of these invertible residual blocks, as well as non-reversible subsampling operations (e.g., pooling) for which activations have to be stored. However the number of such operations is much smaller than the number of residual blocks in a typical ResNet architecture. ## Backpropagation ### Standard The backpropagaton algorithm is derived from the chain rule and is used to compute the total gradients of the loss with respect to the parameters in a neural network: given a loss function $L$, we want to compute the gradients of $L$ with respect to the parameters of each layer, indexed by $n \in [1, N]$, i.e., the quantities $ \overline{\theta_{n}} = \partial L /\ \partial \theta_n$. (where $\forall x, \bar{x} = \partial L / \partial x$). We roughly summarize the algorithm in the left column of **Table 1**: In order to compute the gradients for the $n$-th block, backpropagation requires the input and output activation of this block, $y_{n - 1}$ and $y_{n}$, which have been stored, and the derivative of the loss respectively to the output, $\overline{y_{n}}$, which has been computed in the backpropagation iteration of the upper layer; Hence the name backpropagation ### RevNet Since activations are not stored in RevNet, the algorithm needs to be slightly modified, which we describe in the right column of **Table 1**. In summary, we first need to recover the input activations of the RevNet block using its invertibility. These activations will be propagated to the earlier layers for further backpropagation. Secondly, we need to compute the gradients of the loss with respect to the inputs, i.e. $\overline{y_{n - 1}} = (\overline{y_{n -1, 1}}, \overline{y_{n - 1, 2}})$, using the fact that: $ \begin{align} \overline{y_{n - 1, i}} = \overline{y_{n, 1}}\ \frac{\partial y_{n, 1}}{y_{n - 1, i}} + \overline{y_{n, 2}}\ \frac{\partial y_{n, 2}}{y_{n - 1, i}} \end{align} $ Once again, this result will be propagated further down the network. Finally, once we have computed both these quantities we can obtain the gradients with respect to the parameters of this block, $\theta_n$. $ \begin{array}{|c|l|l|} \hline & \mathbf{ResNet} & \mathbf{RevNet} \\ \hline \mathbf{Block} & y_{n} = y_{n - 1} + \mathcal F(y_{n - 1}) & y_{n - 1, 1}, y_{n - 1, 2} = \mbox{split}(y_{n - 1})\\ && y_{n, 1} = y_{n - 1, 1} + \mathcal{F}(y_{n - 1, 2})\\ && y_{n, 2} = y_{n - 1, 2} + \mathcal{G}(y_{n, 1})\\ && y_{n} = (y_{n, 1}, y_{n, 2})\\ \hline \mathbf{Params} & \theta = \theta_{\mathcal F} & \theta = (\theta_{\mathcal F}, \theta_{\mathcal G})\\ \hline \mathbf{Backprop} & \mathbf{in:}\ y_{n - 1}, y_{n}, \overline{ y_{n}} & \mathbf{in:}\ y_{n}, \overline{y_{n }}\\ & \overline{\theta_n} =\overline{y_n} \frac{\partial y_n}{\partial \theta_n} &\texttt{# recover activations} \\ &\overline{y_{n - 1}} = \overline{y_{n}}\ \frac{\partial y_{n}}{\partial y_{n-1}} &y_{n, 1}, y_{n, 2} = \mbox{split}(y_{n}) \\ &\mathbf{out:}\ \overline{\theta_n}, \overline{y_{n -1}} & y_{n - 1, 2} = y_{n, 2} - \mathcal{G}(y_{n, 1})\\ &&y_{n - 1, 1} = y_{n, 1} - \mathcal{F}(y_{n - 1, 2})\\ &&\texttt{# gradients wrt. inputs} \\ &&\overline{y_{n -1, 1}} = \overline{y_{n, 1}} + \overline{y_{n,2}} \frac{\partial \mathcal G}{\partial y_{n,1}} \\ &&\overline{y_{n -1, 2}} = \overline{y_{n, 1}} \frac{\partial \mathcal F}{\partial y_{n,2}} + \overline{y_{n,2}} \left(1 + \frac{\partial \mathcal F}{\partial y_{n,2}} \frac{\partial \mathcal G}{\partial y_{n,1}} \right) \\ &&\texttt{ gradients wrt. parameters} \\ &&\overline{\theta_{n, \mathcal G}} = \overline{y_{n, 2}} \frac{\partial \mathcal G}{\partial \theta_{n, \mathcal G}}\\ &&\overline{\theta_{n, \mathcal F}} = \overline{y_{n,1}} \frac{\partial F}{\partial \theta_{n, \mathcal F}} + \overline{y_{n, 2}} \frac{\partial F}{\partial \theta_{n, \mathcal F}} \frac{\partial \mathcal G}{\partial y_{n,1}}\\ &&\mathbf{out:}\ \overline{\theta_{n}}, \overline{y_{n -1}}, y_{n - 1}\\ \hline \end{array} $ **Table 1:** Backpropagation in the standard case and for Reversible blocks --- ## Experiments ** Computational Efficiency.** RevNets trade off memory requirements, by avoiding storing activations, against computations. Compared to other methods that focus on improving memory requirements in deep networks, RevNet provides the best trade-off: no activations have to be stored, the spatial complexity is $O(1)$. For the computation complexity, it is linear in the number of layers, i.e. $O(L)$. One small disadvantage is that RevNets introduces additional parameters, as each block is composed of two residuals, $\mathcal F$ and $\mathcal G$, and their number of channels is also halved as the input is first split into two. **Results.** In the experiments section, the author compare ResNet architectures to their RevNets "counterparts": they build a RevNet with roughly the same number of parameters by halving the number of residual units and doubling the number of channels. Interestingly, RevNets achieve **similar performances** to their ResNet counterparts, both in terms of final accuracy, and in terms of training dynamics. The authors also analyze the impact of floating errors that might occur when reconstructing activations rather than storing them, however it appears these errors are of small magnitude and do not seem to negatively impact the model. To summarize, reversible networks seems like a very promising direction to efficiently train very deep networks with memory budget constraints. --- ## References * [1] NICE: Non-linear Independent Components Estimation, Dinh et al., ICLR 2015 * [2] Density estimation using Real NVP, Dinh et al., ICLR 2017 |

Visualizing the Loss Landscape of Neural Nets

Hao Li and Zheng Xu and Gavin Taylor and Christoph Studer and Tom Goldstein

arXiv e-Print archive - 2017 via Local arXiv

Keywords: cs.LG, cs.CV, stat.ML

**First published:** 2017/12/28 (1 year ago)

**Abstract:** Neural network training relies on our ability to find "good" minimizers of
highly non-convex loss functions. It is well-known that certain network
architecture designs (e.g., skip connections) produce loss functions that train
easier, and well-chosen training parameters (batch size, learning rate,
optimizer) produce minimizers that generalize better. However, the reasons for
these differences, and their effects on the underlying loss landscape, are not
well understood. In this paper, we explore the structure of neural loss
functions, and the effect of loss landscapes on generalization, using a range
of visualization methods. First, we introduce a simple "filter normalization"
method that helps us visualize loss function curvature and make meaningful
side-by-side comparisons between loss functions. Then, using a variety of
visualizations, we explore how network architecture affects the loss landscape,
and how training parameters affect the shape of minimizers.
more
less

Hao Li and Zheng Xu and Gavin Taylor and Christoph Studer and Tom Goldstein

arXiv e-Print archive - 2017 via Local arXiv

Keywords: cs.LG, cs.CV, stat.ML

- Presents a simple visualization method based on “filter normalization.” - Observed that __the deeper networks become, neural loss landscapes become more chaotic__; causes a dramatic drop in generalization error, and ultimately to a lack of trainability. - Observed that __skip connections promote flat minimizers and prevent the transition to chaotic behavior__; helps explain why skip connections are necessary for training extremely deep networks. - Quantitatively measures non-convexity. - Studies the visualization of SGD optimization trajectories. |

TURN TAP: Temporal Unit Regression Network for Temporal Action Proposals

Jiyang Gao and Zhenheng Yang and Chen Sun and Kan Chen and Ram Nevatia

arXiv e-Print archive - 2017 via Local arXiv

Keywords: cs.CV

**First published:** 2017/03/17 (1 year ago)

**Abstract:** Temporal Action Proposal (TAP) generation is an important problem, as fast
and accurate extraction of semantically important (e.g. human actions) segments
from untrimmed videos is an important step for large-scale video analysis. We
propose a novel Temporal Unit Regression Network (TURN) model. There are two
salient aspects of TURN: (1) TURN jointly predicts action proposals and refines
the temporal boundaries by temporal coordinate regression; (2) Fast computation
is enabled by unit feature reuse: a long untrimmed video is decomposed into
video units, which are reused as basic building blocks of temporal proposals.
TURN outperforms the state-of-the-art methods under average recall (AR) by a
large margin on THUMOS-14 and ActivityNet datasets, and runs at over 880 frames
per second (FPS) on a TITAN X GPU. We further apply TURN as a proposal
generation stage for existing temporal action localization pipelines, it
outperforms state-of-the-art performance on THUMOS-14 and ActivityNet.
more
less

Jiyang Gao and Zhenheng Yang and Chen Sun and Kan Chen and Ram Nevatia

arXiv e-Print archive - 2017 via Local arXiv

Keywords: cs.CV

## Temporal unit regression network keyword: temporal action proposal; computing efficiency **Summary**: In this paper, Jiyang et al designed a proposal generation and refinement network with high computation efficiency by reusing unit feature on coordinated regression and classification network. Especially, a new metric against temporal proposal called AR-F is raised to meet 2 metric criteria: 1. evaluate different method on the same dataset efficiently. 2. capable to evaluate same method's performance across several datasets(generalization capability) **Model**: * decompose video and extract feature to form clip pyramid: 1. A video is first decomposed into short units where each unit has 16/32 frames. 2. extract each unit's feature using C3D/Two-stream CNN model. 3. several units' features are average pooled to compose clip level feature. In order to provide context and adaptive for different length action, clip level feature also concatenate surround feature and scaled to different length by concatenating more or fewer clips. Feature for a slip is $f_c = P(\{u_j\}_{s_u-n_{ctx}}^{s_u})||P(\{u_j\}_{s_u}^{e_u})||P(\{u_j\}_{e_u}^{e_u+n_{ctx}}) $ 4. for each proposal pyramid, a classifier is used to judge if the proposal contains an action and a regressor is used to provide an offset for each proposal to refine proposal's temporal boundary. 5. finally, during prediction, NMS is used to remove redundant proposal thus provide high accuracy without changing the recall rate. https://i.imgur.com/zqvHOxj.png **Training**: There are two output need to be optimized, the classification result and the regression offset. Intuitively, the distance between the proposal and corresponding ground truth should be measured. In this paper, the authors used the L-1 metric for regressor targeted the positive proposals. total loss is measured as follow: $L = \lambda L_{reg}+L_{cls}$ $L_{reg} = \frac{1}{N_{pos}}\sum_{i = 1}^N*l_s^*|o_{s,i} - o_{s,i}^*+o_{e,i} - o_{e,i}^*|$ During training, the ratio between positive samples and negative samples is set to 1:10. And for each positive proposal, its ground truth is the one with which it has the highest IOU or which it has IOU more than 0.5. **result**: 1. Computation complexity: 880 fps using the C3D feature on TITAN X GPU, while 260 FPS using flow CNN feature on the same machine. 2. Accuracy: mAP@0.5 = 25.6% on THUMOS14 **Conclusion**: Within this paper, it generates proposals by generate candidate at each unit with different scale and then using regression to refine the boundary. *However, there are a lot of redundant proposals for each unit which is an unnecessary waste of computing source; Also, proposals are generated with the pre-defined length which restricted its adaptivity to different length action; Finally the proposals are generated on the unit level which will suffer granularity problem* |

Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs

Zheng Shou and Dongang Wang and Shih-Fu Chang

arXiv e-Print archive - 2016 via Local arXiv

Keywords: cs.CV

**First published:** 2016/01/09 (3 years ago)

**Abstract:** We address temporal action localization in untrimmed long videos. This is
important because videos in real applications are usually unconstrained and
contain multiple action instances plus video content of background scenes or
other activities. To address this challenging issue, we exploit the
effectiveness of deep networks in temporal action localization via three
segment-based 3D ConvNets: (1) a proposal network identifies candidate segments
in a long video that may contain actions; (2) a classification network learns
one-vs-all action classification model to serve as initialization for the
localization network; and (3) a localization network fine-tunes on the learned
classification network to localize each action instance. We propose a novel
loss function for the localization network to explicitly consider temporal
overlap and therefore achieve high temporal localization accuracy. Only the
proposal network and the localization network are used during prediction. On
two large-scale benchmarks, our approach achieves significantly superior
performances compared with other state-of-the-art systems: mAP increases from
1.7% to 7.4% on MEXaction2 and increases from 15.0% to 19.0% on THUMOS 2014,
when the overlap threshold for evaluation is set to 0.5.
more
less

Zheng Shou and Dongang Wang and Shih-Fu Chang

arXiv e-Print archive - 2016 via Local arXiv

Keywords: cs.CV

## Segmented SNN **Summary**: this paper use 3-stage 3D CNN to identify candidate proposals, recognize actions and localize temporal boundaries. **Models**: this network can be mainly divided into 3 parts: generate proposals, select proposal and refine temporal boundaries, and using NMS to remove redundant proposals. 1. generate multiscale(16,32,64,128,256.512) segment using sliding window with 75% overlap. high computing complexity! 2. network: Each stage of the three-stage network is using 3D convNets concatenating with 3 FC layers. * the proposal network is basically a classifier which will judge if each proposal contains action or not. * the classification network is used to classify each proposal which the proposal network think is valid into background and K action categories * the localization network functioned as a scoring system which raises scores of proposals that have high overlap with corresponding ground truth while decreasing the others. . |

BSN: Boundary Sensitive Network for Temporal Action Proposal Generation

Tianwei Lin and Xu Zhao and Haisheng Su and Chongjing Wang and Ming Yang

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.CV

**First published:** 2018/06/08 (7 months ago)

**Abstract:** Temporal action proposal generation is an important yet challenging problem,
since temporal proposals with rich action content are indispensable for
analysing real-world videos with long duration and high proportion irrelevant
content. This problem requires methods not only generating proposals with
precise temporal boundaries, but also retrieving proposals to cover truth
action instances with high recall and high overlap using relatively fewer
proposals. To address these difficulties, we introduce an effective proposal
generation method, named Boundary-Sensitive Network (BSN), which adopts "local
to global" fashion. Locally, BSN first locates temporal boundaries with high
probabilities, then directly combines these boundaries as proposals. Globally,
with Boundary-Sensitive Proposal feature, BSN retrieves proposals by evaluating
the confidence of whether a proposal contains an action within its region. We
conduct experiments on two challenging datasets: ActivityNet-1.3 and THUMOS14,
where BSN outperforms other state-of-the-art temporal action proposal
generation methods with high recall and high temporal precision. Finally,
further experiments demonstrate that by combining existing action classifiers,
our method significantly improves the state-of-the-art temporal action
detection performance.
more
less

Tianwei Lin and Xu Zhao and Haisheng Su and Chongjing Wang and Ming Yang

arXiv e-Print archive - 2018 via Local arXiv

Keywords: cs.CV

## Boundary sensitive network ### **keyword**: action detection in video; accurate proposal **Summary**: In order to generate precise temporal boundaries and improve recall with lesses proposals, Tianwei Lin et al use BSN which first combine temporal boundaries with high probability to form proposals and then select proposals by evaluating whether a proposal contains an action(confidence score+ boundary probability). **Model**: 1. video feature encoding: use the two-stream extractor to form the input of BSN. $F = \{f_{tn}\}_{n=1}^{l_s} = \{(f_{S,Tn}, f_{T,t_n}\}_{n=1}^{l_s)} $ 2. BSN: * temporal evaluation: input feature sequence, using 3-layer CNN+3 fiter with sigmoid, to generate start, end, and actioness probability * proposal generation: 1.combine bound with high start/end probability or if probility peak to form proposal; 2. use actioness probability to generate proposal feature for each proposal by sampling the actioness probability during proposal region. * proposal evaluation: using 1 hidden layer perceptron to evaluate confidence score based on proposal features. proposal $\varphi =(t_s,t_e,p_{conf},p_{t_s}^s,p_{t_e}^e) $ $p_{t_e}^e$ is the end probability,and $p_{conf}$ is confidence score https://i.imgur.com/VjJLQDc.png **Training**: * **Learn to generate probility curve**: In order to calculate the accuracy of proposals the loss in the temporal evaluation is calculated as following: $L_{TEM} = \lambda L^{action} + L ^{start} + L^{end}$; $L = \frac{1}{l_w} \sum_{i =1}^{l_w}(\frac{l_w}{l_w-\sum_i g_i} b_i*log(p_i)+\frac{l_w}{\sum_i g_i} (1-b_i)*log(1-p_i))$ $ b_i = sign(g_i-\theta_{IoP})$ Thus, if start region proposal is highly overlapped with ground truth, the start point probability should increase to lower the loss, after training, the information of ground truth region could be leveraged to predict the accurate probability for start. actions and end probability could apply the same rule. * **Learn to choose right proposal**: In order to choose the right proposal based on confidence score, push confidence score to match with IOU of the groud truth and proposal is important. So the loss to do this is described as follow: $L_p = \frac{1}{N_{train}} \sum_{i=1}^{N_{train}} (p_{conf,i}-g_{iou,i})^2$. $N_{train}$ is number of training proposals and among it the ratio of positive to negative proposal is 1:2.$g_{iou,i}$ is the ith proposal's overlap with its corresponding ground truth. During test and prediction, the final confidence is calculated to fetch and suppress proposals using gaussian decaying soft-NMS. and final confidence score for each proposal is $p_f = p_{conf}p_{ts}^sp_{te}^e$ Thus, after training, the confidence score should reveal the iou between the proposal and its corresponding ground truth based on the proposal feature which is generated through actionness probability, whereas final proposal is achieved by ranking final confidence score. **Conclusion**: Different with segment proposal or use RNN to decide where to look next, this paper generate proposals with boundary probability and select them using the confidence score-- the IOU between the proposal and corresponding ground truth. with sufficient data, it can provide right bound probability and confidence score. and the highlight of the paper is it can be very accurate within feature sequence. *However, it only samples part of the video for feature sequence. so it is possible it will jump over the boundary point. if an accurate policy to decide where to sample is used, accuracy should be further boosted. * * **computation complexity**: within this network, computation includes 1. two-stream feature extractor for video samples 2. probility generation module: 3-layers cnn for the generated sequence 3. proposal generation using combination 4. sampler to generate proposal feature 5. 1-hidden layer perceptron to generate confidence score. major computing complexity should attribute to feature extractor(1') and proposal relate module if lots of proposals are generated(3',4') **Performance**: when combined with SCNN-classifier, it reach map@0.5 = 36.9 on THUMOS14 dataset |

Gradient Reversal Against Discrimination

Edward Raff and Jared Sylvester

arXiv e-Print archive - 2018 via Local arXiv

Keywords: stat.ML, cs.AI, cs.LG

**First published:** 2018/07/01 (6 months ago)

**Abstract:** No methods currently exist for making arbitrary neural networks fair. In this
work we introduce GRAD, a new and simplified method to producing fair neural
networks that can be used for auto-encoding fair representations or directly
with predictive networks. It is easy to implement and add to existing
architectures, has only one (insensitive) hyper-parameter, and provides
improved individual and group fairness. We use the flexibility of GRAD to
demonstrate multi-attribute protection.
more
less

Edward Raff and Jared Sylvester

arXiv e-Print archive - 2018 via Local arXiv

Keywords: stat.ML, cs.AI, cs.LG

Given some input data $x$ and attribute $a_p$, the task is to predict label $y$ from $x$ while making $a_p$ *protected*, in other words, such that the model predictions are invariant to changes in $a_p$. * **Pros (+)**: Simple and intuitive idea, easy to train, naturally extended to protecting multiple attributes. * **Cons (-)**: Comparison to baselines could be more detailed / comprehensive, in particular the comparison to ALFR [4] which also relies on adversarial training. --- ## Proposed Method **Domain adversarial networks.** The proposed model builds on the *Domain Adversarial Network* [1], originally introduced for unsupervised domain adaptation. Given some labeled data $(x, y) \sim \mathcal X \times \mathcal Y$, and some unlabeled data $\tilde x \sim \tilde{\mathcal X}$, the goal is to learn a network that solves both classification tasks $\mathcal X \rightarrow \mathcal Y$ and $\tilde{\mathcal X} \rightarrow \mathcal Y$ while learning a shared representation between $\mathcal X$ and $\tilde{\mathcal X}$. The model is composed of a feature extractor $G_f$ which then branches off into a *target* branch, $G_t$, to predict the target label, and a *domain* branch, $G_d$, predicting whether the input data comes either from domain $\mathcal X$ or $\tilde{\mathcal X}$. The model parameters are trained with the following objective: $$ \begin{align} (\theta_{G_f}, \theta_{G_t} ) &= \arg\min \mathbb E_{(x, y) \sim \mathcal X \times \mathcal Y}\ \ell_t \left( G_t \circ G_f(x), y \right)\\ \theta_{G_d} &= \arg\max \mathbb E_{x \sim \mathcal X} \ \ell_d\left( G_d \circ G_f(x), 1 \right) + \mathbb E_{\tilde x \sim \tilde{\mathcal X}}\ \ell_d \left(G_d \circ G_f(\tilde x), 0\right)\\ \mbox{where } &\ell_t \mbox{ and } \ell_d \mbox{ are classification losses} \end{align} $$ The gradient updates for this saddle point problem can be efficiently implemented using the Gradient Reversal Layer introduced in [1] **GRAD-pred.** In **G**radient **R**eversal **A**gainst **D**iscrimination, samples come only from one domain $\mathcal X$, and the domain classifier $G_d$ is replaced by an *attribute* classifier, $G_p$, whose goal is to predict the value of the protected attribute $a_p$. In other words, the training objective strives to build a feature representation of $x$ that is good enough to predict the correct label $y$ but such that $a_p$ cannot easily be deduced from it. On the contrary, directly learning classification network $G_y \circ G_f$ penalized when predicting the correct value of attribute $a_p$ could instead lead to a model that learns $a_p$ and trivially outputs an incorrect value. This situation is prevented by the adversarial training scheme here. **GRAD-auto.** The authors also consider a variant of the described model where the target branch $G_t$ instead solves the auto-encoding/reconstruction task. The features learned by the encoder $G_f$ can then later be used as entry point of a smaller network for classification or any other task. --- ## Experiments **Evaluation metrics.** The model is evaluated on four metrics to qualify both accuracy and fairness, following the protocol in [2]: * *Accuracy*, the proportion of correct classifications * *Discrimination*, the average score differences (logits of the ground-truth class) between samples with $a_p = + 1$ and $a_p = -1 $ (assuming a binary attribute) * *Consistency*, the average difference between a sample score and the mean of its nearest neighbors' score. * *Delta = Accuracy - Discrimination*, a penalized version of accuracy **Baselines.** * **Vanilla** CNN trained without the protected attribute protection branch * **LFR** [2]: A classifier with an intermediate latent code $Z \in \{1 \dots K\}$ is trained with an objective that combines a classification loss (the model should accurately classify $x$), a reconstruction loss (the learned representation should encode enough information about the input to reconstruct it accurately) and a parity loss (estimate the probability $P(Z=z | x)$ for both populations with $a_p = 1$ and $a_p = -1$ and strive to make them equal) * **VFA** [3]: A VAE where the protected attribute $a_p$ is factorized out of the latent code $z$, and additional invariance is imposed via a MMD objective which tries to match the moments of the posterior distributions $q(z|a_p = -1)$ and $q(z| a_p = 1)$. * **ALFR** [4] : As in LFR, this paper proposes a model trained with a reconstruction loss and a classification loss. Additionally, they propose to quantify the dependence between the learned representation and the protected attribute by adding an adversary classifier that tries to extract the attribute value from the representation, formulated and trained as in the Generative Adversarial Network (GAN) setting. **Results.** GRAD always reaches highest consistency compared to baselines. For the other metrics, the results are more mitigated, although it usually achieves best or second best results. It's also not clear how to choose between GRAD-pred and GRAD-auto as there does not seem to be a clear winner, although GRAD-pred seems more intuitive when supervision is available, as it directly solves the classification task. Authors also report a small experiment showing that protecting several attributes at the same time can be more beneficial than protecting a single attribute. This can be expected as some attributes are highly correlated or interact in meaningful way. In particular, protecting several attributes at once can easily be done in the GRAD framework by making the attribute prediction branch multi-class for instance: however it is not clear in the paper how it is actually done in practice, nor whether the same idea could also be integrated in the baselines for further comparison. --- ## References * [1] Domain-Adversarial Training of Neural Networks, Ganin et al, JMRL 2016 * [2] Learning Fair Representations, Zemel et al, ICML 2013 * [3] The Variational Fair Autoencoder, Louizos et al, 2016 * [4] Censoring Representations with an Adversary, Edwards and Storkey, ICLR 2016 |

About