ShortScience.org - Making Science Accessible!

Exploring the Origins and Prevalence of Texture Bias in Convolutional Neural Networks
Hermann, Katherine L. and Kornblith, Simon
arXiv e-Print archive - 2019 via Local Bibsonomy
Keywords: dblp

[link] Summary by CodyWild 3 years ago

When humans classify images, we tend to use high-level information about the shape and position of the object. However, when convolutional neural networks classify images, they tend to use low-level, or textural, information more than high-level shape information. This paper tries to understand what factors lead to higher shape bias or texture bias.

To investigate this, the authors look at three datasets with disagreeing shape and texture labels. The first is GST, or Geirhos Style Transfer. In this dataset, style transfer is used to render the content of one class in the style of another (for example, a cat shape in the texture of an elephant). In the Navon dataset, a large-scale letter is rendered by tiling smaller letters. And, in the ImageNet-C dataset, a given class is rendered with a particular kind of distortion; here the distortion is considered to be the "texture label". In the rest of the paper, "shape bias" refers to the extent to which a model trained on normal images will predict the shape label rather than the texture label associated with a GST image. The other datasets are used in experiments where a model explicitly tries to learn either shape or texture.
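To make the "shape bias" number concrete, here is a minimal sketch of how it could be computed from a model's predictions on cue-conflict images. The function name `shape_bias` is hypothetical, and the normalization (counting only predictions that match either the shape or the texture label) is my assumption about the convention; the summary above does not spell it out.

```python
import numpy as np

def shape_bias(pred, shape_label, texture_label):
    """Fraction of shape-consistent predictions among the predictions that
    match either the shape label or the texture label (assumed convention)."""
    pred = np.asarray(pred)
    shape_hit = pred == np.asarray(shape_label)
    texture_hit = pred == np.asarray(texture_label)
    decided = shape_hit | texture_hit  # ignore predictions matching neither cue
    if not decided.any():
        return float("nan")
    return float(shape_hit[decided].mean())

# Toy cue-conflict set: every image has a cat shape with some other texture.
preds = ["cat", "elephant", "dog", "cat"]
shapes = ["cat", "cat", "cat", "cat"]
textures = ["elephant", "elephant", "elephant", "dog"]
bias = shape_bias(preds, shapes, textures)
```

In the toy data, three predictions match one of the two cues and two of those follow shape, so the third prediction ("dog") is simply left out of the ratio.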

https://i.imgur.com/aw1MThL.png

To start off, the authors try to understand whether CNNs are inherently more capable of learning texture information than shape information. To do this, they train models on either the shape or the texture label on each of the three aforementioned datasets. On GST and Navon, shape labels can be learned faster and more efficiently than texture ones. On ImageNet-C (i.e. distorted ImageNet), it seems to be easier to learn texture than shape, but recall that here texture corresponds to the type of noise, and I imagine that the cardinality of noise types is far smaller than that of ImageNet classes, so I'm not sure how informative this comparison is. Overall, this experiment suggests that CNNs are able to learn from shape alone, without low-level texture as a clue, in cases where the two sources of information disagree.

The paper moves on to try to understand what factors about a normal ImageNet model give it higher or lower shape bias - that is, a higher or lower likelihood of classifying a GST image according to its shape rather than texture. Predictably, data augmentations have an effect here. When data is augmented with aggressive random cropping, this increases texture bias relative to shape bias, presumably because when large chunks of an object are cropped away, its overall shape becomes a less useful feature. Center cropping is better for shape bias, probably because objects are likely to be at the center of the image, so center cropping has less of a chance of distorting them. On the other hand, more "naturalistic" augmentations like adding Gaussian noise or distorting colors lead to a higher shape bias in the resulting networks, up to 60% with all the modifications. However, the authors also find that pushing the shape bias up has the result of dropping final test accuracy.

https://i.imgur.com/Lb6RMJy.png

Interestingly, while the techniques that increase shape bias seem to also harm performance, the authors also find that higher-performing models tend to have higher shape bias (though with texture bias still outweighing shape) suggesting that stronger models learn how to use shape more effectively, but also that handicapping models' ability to use texture in order to incentivize them to use shape tends to hurt performance overall.

Overall, my take from this paper is that texture-level data is actually statistically informative and useful for classification - even in terms of generalization - even if it is too high-resolution to be useful as a visual feature for humans. CNNs don't seem inherently incapable of learning from shape, but removing their ability to rely on texture seems to lead to a notable drop in accuracy, suggesting there was real signal there that we're losing out on.

Off-Policy Deep Reinforcement Learning without Exploration
Fujimoto, Scott and Meger, David and Precup, Doina
International Conference on Machine Learning - 2019 via Local Bibsonomy
Keywords: dblp

[link] Summary by Robert Müller 4 years ago

Interacting with the environment sometimes comes at a high cost, for example in high-stakes scenarios like health care or teaching. Thus, instead of learning online, we might want to learn from a fixed buffer $B$ of transitions, which is filled in advance by a behavior policy.
The authors show that several so-called off-policy algorithms, like DQN and DDPG, fail dramatically in this pure off-policy setting.

They attribute this to the extrapolation error, which occurs in the update of a value estimate $Q(s,a)$, where the target policy selects an unfamiliar action $\pi(s')$ such that $(s', \pi(s'))$ is unlikely or not present in $B$. Extrapolation error is caused by the mismatch between the true state-action visitation distribution of the current policy and the state-action distribution in $B$ due to:
- state-action pairs $(s,a)$ missing in $B$, resulting in arbitrarily bad estimates of $Q_{\theta}(s, a)$ without sufficient data close to $(s,a)$.
- the finiteness of the batch of transition tuples $B$, leading to a biased estimate of the transition dynamics in the Bellman operator $T^{\pi}Q(s,a) \approx \mathbb{E}_{\boldsymbol{s' \sim B}}\left[r + \gamma Q(s', \pi(s')) \right]$
- transitions are sampled uniformly from $B$, resulting in a loss weighted w.r.t the frequency of data in the batch: $\frac{1}{\vert B \vert} \sum_{\boldsymbol{(s, a, r, s') \sim B}} \Vert r + \gamma Q(s', \pi(s')) - Q(s, a)\Vert^2$
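As a small illustration of the last bullet, here is a sketch of that batch loss for a tabular $Q$. The names `batch_td_loss`, `Q`, `policy` and the tuple layout of `batch` are my own assumptions for illustration, and terminal states are not handled, just as in the formula above.

```python
import numpy as np

def batch_td_loss(Q, policy, batch, gamma=0.99):
    """Mean squared TD error over a fixed batch of (s, a, r, s') tuples.

    Q is a [n_states, n_actions] table and policy maps each state index to an
    action index. Pairs that occur often in the batch dominate the average,
    which is the frequency weighting described above.
    """
    s, a, r, s_next = (np.asarray(x) for x in zip(*batch))
    target = r + gamma * Q[s_next, policy[s_next]]
    return float(np.mean((target - Q[s, a]) ** 2))

# Toy example: two states, two actions, an all-zero value table.
Q = np.zeros((2, 2))
policy = np.array([0, 1])
batch = [(0, 0, 1.0, 1), (1, 1, 0.0, 0)]
loss = batch_td_loss(Q, policy, batch)
```

Note that `Q[s_next, policy[s_next]]` is looked up whether or not the pair $(s', \pi(s'))$ ever appears in the batch, which is exactly where the extrapolation error enters.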

The proposed algorithm Batch-Constrained deep Q-learning (BCQ) aims to choose actions that:
 1. minimize the distance of the taken actions to the actions in the batch,
 2. lead to states contained in the buffer,
 3. maximize the value function,

where 1. is prioritized over the other two goals to mitigate the extrapolation error. 

Their proposed algorithm (for continuous environments) consists informally of the following steps that are repeated at each time $t$:
 1. update the generator model of the state conditional marginal likelihood $P_B^G(a \vert s)$
 2. sample $n$ actions from the generator model
 3. perturb each of the sampled actions by an adjustment in the range $\left[-\Phi, \Phi \right]$
 4. act according to the argmax of the respective Q-values of the perturbed actions
 5. update value function
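Steps 2 to 4 above can be sketched as follows. This is only an illustration under assumptions: `generator`, `perturb` and `q_value` are hypothetical callables standing in for the learned networks, and the clipping to $[-\Phi, \Phi]$ is a stand-in for however the perturbation is bounded in the actual method.

```python
import numpy as np

def bcq_select_action(state, generator, perturb, q_value, n=10, phi=0.05, rng=None):
    """Pick an action close to the batch that scores highest under Q."""
    rng = np.random.default_rng() if rng is None else rng
    # Step 2: sample n candidate actions that resemble the actions in the batch.
    candidates = np.stack([generator(state, rng) for _ in range(n)])
    # Step 3: add a bounded adjustment in [-phi, phi] to every candidate.
    deltas = np.stack([np.clip(perturb(state, a), -phi, phi) for a in candidates])
    perturbed = candidates + deltas
    # Step 4: act greedily with respect to Q among the perturbed candidates.
    scores = np.array([q_value(state, a) for a in perturbed])
    return perturbed[int(np.argmax(scores))]

# Toy stand-ins (assumptions): batch actions cluster around 0.5, Q prefers 1.0.
toy_generator = lambda s, rng: rng.normal(0.5, 0.1, size=1)
toy_perturb = lambda s, a: 1.0 - a
toy_q = lambda s, a: -float(np.sum((a - 1.0) ** 2))
action = bcq_select_action(np.zeros(3), toy_generator, toy_perturb, toy_q,
                           rng=np.random.default_rng(0))
```

A small `phi` keeps the chosen action near what the generator proposes, so the agent stays close to the batch even when Q would prefer an action far away from it.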

The experiments consider MuJoCo tasks with four scenarios of batch data creation:
 - 1 million time steps from training a DDPG agent with exploration noise $\mathcal{N}(0,0.5)$ added to the action. This aims for a diverse set of states and actions.
 - 1 million time steps from training a DDPG agent with exploration noise $\mathcal{N}(0,0.1)$ added to the actions as the behavior policy. The batch-RL agent and the behavior DDPG agent are trained concurrently from the same buffer.
 - 1 million transitions from rolling out an already trained DDPG agent.
 - 100k transitions from a behavior policy that acts randomly with probability 0.3 and otherwise follows an expert demonstration with added exploration noise $\mathcal{N}(0,0.3)$.

I like the fourth choice of behavior policy the most, as it comes closest to high-stakes scenarios like education or medicine, in which training data would be acquired from human experts who, being human, are not optimal but are still significantly better than learning from scratch.

The proposed BCQ algorithm is the only algorithm that is successful across all experiments. It matches or outperforms the behavior policy. An evaluation of the value estimates shows unstable and diverging estimates for all algorithms except BCQ, which exhibits a stable value function.

The paper outlines a very important issue that needs to be tackled in order to use reinforcement learning in real world applications.

Algorithms for Non-negative Matrix Factorization
Lee, Daniel D. and Seung, H. Sebastian
Neural Information Processing Systems Conference - 2000 via Local Bibsonomy
Keywords: dblp

[link] Summary by Joseph Paul Cohen 7 years ago

We want to find two matrices $W$ and $H$ such that $V = WH$. Often a goal is to determine underlying patterns in the relationships between the concepts represented by each row and column. $V$ is some $m$ by $n$ matrix and we want the inner dimension of the factorization to be $r$. So

$$\underbrace{V}_{m \times n} = \underbrace{W}_{m \times r} \underbrace{H}_{r \times n}$$

Let's consider an example matrix where three customers (the rows) are associated with three movies (the columns) by a rating value.

$$
V = \left[\begin{array}{c c c}
5 & 4 & 1  \\\\
4 & 5 & 1 \\\\
2 & 1 & 5
\end{array}\right]
$$


We can decompose this into two matrices with $r = 1$. First let's do this without any non-negativity constraint, using an SVD and truncating it so that only the largest singular value is kept:


$$
W = \left[\begin{array}{c c c}
-0.656 \\\\
-0.652 \\\\
-0.379
\end{array}\right],
H = \left[\begin{array}{c c c}
-6.65 & -6.26 & -3.20\\\\
\end{array}\right]
$$
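For readers who want to reproduce the unconstrained factorization, here is a minimal NumPy sketch of a rank-1 truncated SVD of the example matrix. Putting the singular value into $H$ rather than $W$ is my assumption about the scaling used above, and the overall sign of the pair ($W$, $H$) returned by an SVD routine is arbitrary.

```python
import numpy as np

V = np.array([[5.0, 4.0, 1.0],
              [4.0, 5.0, 1.0],
              [2.0, 1.0, 5.0]])

# Full SVD: V = U @ diag(s) @ Vt, singular values sorted in descending order.
U, s, Vt = np.linalg.svd(V)

r = 1                              # inner dimension of the factorization
W = U[:, :r]                       # m x r, unit-norm left singular vector
H = s[:r, None] * Vt[:r, :]        # r x n, right singular vector scaled by s[0]

reconstruction = W @ H
error = np.linalg.norm(V - reconstruction)  # Frobenius norm of the residual
```

Because the signs of `W` and `H` flip together, their product (and hence the error) is unaffected by which sign the routine happens to return.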

We can also decompose this into two matrices with $r = 1$ subject to the constraint that $w_{ij} \ge 0$ and  $h_{ij} \ge 0$. (Note: this is only possible when $v_{ij} \ge 0$):

$$
W = \left[\begin{array}{c c c}
0.388 \\\\
0.386 \\\\
0.224
\end{array}\right],
H = \left[\begin{array}{c c c}
11.22 & 10.57 & 5.41  \\\\
\end{array}\right]
$$

Both of these $r=1$ factorizations reconstruct matrix $V$ with the same error. 

$$
V \approx WH = \left[\begin{array}{c c c}
4.36 & 4.11 & 2.10 \\\\
4.33 & 4.08 & 2.09 \\\\
2.52 & 2.37 & 1.21
\end{array}\right]
$$


If they both yield the same reconstruction error then why is a non-negativity constraint useful? We can see above that it is easy to observe patterns in both factorizations such as similar customers and similar movies. `TODO: motivate why NMF is better`



#### Paper Contribution 

This paper discusses two approaches for iteratively creating a non-negative $W$ and $H$ based on random initial matrices. The paper discusses a multiplicative update rule where the elements of $W$ and $H$ are iteratively transformed by scaling each value such that error is not increased. 

The multiplicative approach is discussed in contrast to an additive gradient descent approach, where small corrections are iteratively applied. The multiplicative rule can be recovered from the additive one by choosing a per-element learning rate ($\eta$): for the update of $H$, $\eta$ is set to the ratio of the current element of $H$ to the corresponding element of $W^T W H$.
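A minimal sketch of the multiplicative updates for the squared-error objective, applied to the example matrix from above. The function name `nmf_multiplicative`, the iteration count, the random initialization and the small `eps` added to the denominators are my own choices for illustration.

```python
import numpy as np

def nmf_multiplicative(V, r, n_iter=500, eps=1e-12, seed=0):
    """Factor a non-negative V (m x n) into W (m x r) and H (r x n), both >= 0."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, r)) + eps   # random positive initialization
    H = rng.random((r, n)) + eps
    errors = []
    for _ in range(n_iter):
        # Each element is rescaled by a non-negative factor, so the signs
        # never flip and no explicit projection onto >= 0 is needed.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(V - W @ H))
    return W, H, errors

V = np.array([[5.0, 4.0, 1.0],
              [4.0, 5.0, 1.0],
              [2.0, 1.0, 5.0]])
W, H, errors = nmf_multiplicative(V, r=1)
```

The scaling between `W` and `H` is not unique (multiplying one by a constant and dividing the other gives the same product), so the individual factors need not match the numbers shown earlier even when the reconstruction does.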



### Still a draft

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Ioffe, Sergey and Szegedy, Christian
International Conference on Machine Learning - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by José Manuel Rodríguez Sotelo 7 years ago

The main contribution of this paper is a new transformation that the authors call Batch Normalization (BN). The need for BN comes from the fact that, during the training of deep neural networks (DNNs), the distribution of each layer’s inputs changes. This phenomenon is called internal covariate shift (ICS).

#### What is BN?
Normalize each (scalar) feature independently with respect to the mean and variance of the mini batch. Scale and shift the normalized values with two new parameters (per activation) that will be learned. The BN consists of making normalization part of the model architecture.
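The description above can be sketched in a few lines of NumPy for the training-time forward pass on a `[batch, features]` array. The function name `batch_norm_train` and the small constant `eps` in the denominator are my assumptions; the running statistics used at inference time are left out.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift."""
    mean = x.mean(axis=0)                    # per-feature mini-batch mean
    var = x.var(axis=0)                      # per-feature mini-batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)  # roughly zero mean, unit variance
    return gamma * x_hat + beta              # learned scale and shift

rng = np.random.default_rng(0)
x = 5.0 * rng.normal(size=(64, 3)) + 2.0     # badly scaled toy activations
y = batch_norm_train(x, gamma=np.ones(3), beta=np.zeros(3))
```

With `gamma` set to ones and `beta` to zeros the output is just the normalized input; because both are learned, the network can undo the normalization if that turns out to be useful.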

#### What do we gain?
According to the authors, the use of BN provides a great speed-up in the training of DNNs. In particular, the gains are greater when it is combined with higher learning rates. In addition, BN works as a regularizer for the model, which allows using less dropout or less L2 regularization. Furthermore, since the distribution of the inputs is normalized, it also allows using sigmoids as activation functions without the saturation problem.

#### What follows?
This seems to be especially promising for training recurrent neural networks (RNNs). The vanishing and exploding gradient problems \cite{journals/tnn/BengioSF94} have their origin in the repeated application of transformations that scale the activations up or down in certain directions (eigenvectors). It seems that BN would be especially useful in this context, since it would allow the gradient to flow more easily. When we unroll RNNs, we usually obtain very deep networks.

#### Like
* Simple idea that seems to improve training.
* Makes training faster.
* Simple to implement. Probably.
* You can be less careful with initialization.

#### Dislike
* Does not work with stochastic gradient descent (minibatch size = 1).
* This could reduce the parallelism of the algorithm since now all the examples in a mini batch are tied.
* Results on an ensemble of networks for ImageNet make it harder to evaluate the relevance of BN by itself. (Although they do mention the performance of a single model).

Fast R-CNN
Girshick, Ross B.
International Conference on Computer Vision - 2015 via Local Bibsonomy
Keywords: dblp

[link] Summary by Joseph Paul Cohen 8 years ago

This method is based on improving the speed of R-CNN \cite{conf/cvpr/GirshickDDM14}.

1. Where R-CNN would have two different objective functions, Fast R-CNN combines localization and classification losses into a "multi-task loss" in order to speed up training.
2. It also uses a pooling method based on \cite{journals/pami/HeZR015}, called the RoI pooling layer, that rescales each region to a fixed output size, so the images don't have to be scaled before being sent as input to the CNN. "RoI max pooling works by dividing the $h \times w$ RoI window into an $H \times W$ grid of sub-windows of approximate size $h/H \times w/W$ and then max-pooling the values in each sub-window into the corresponding output grid cell."
3. Backprop is performed for the RoI pooling layer by routing the gradient of each output cell back to the input value that was the argmax of its sub-window, summing the gradients wherever sub-windows or RoIs overlap.
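Points 2 and 3 can be sketched for a single-channel feature map and one RoI as follows. The function names, the `(top, left, h, w)` RoI layout in feature-map cells, and the floor/ceil rule for the sub-window boundaries are my assumptions for illustration.

```python
import numpy as np

def roi_max_pool(feature, roi, out_h, out_w):
    """Max-pool the RoI (top, left, h, w) into a fixed out_h x out_w grid."""
    top, left, h, w = roi
    out = np.empty((out_h, out_w), dtype=feature.dtype)
    argmax = np.empty((out_h, out_w, 2), dtype=int)
    for i in range(out_h):
        r0 = top + (i * h) // out_h                # floor of i * h / out_h
        r1 = top + -((-(i + 1) * h) // out_h)      # ceil of (i + 1) * h / out_h
        for j in range(out_w):
            c0 = left + (j * w) // out_w
            c1 = left + -((-(j + 1) * w) // out_w)
            sub = feature[r0:r1, c0:c1]
            dr, dc = divmod(int(np.argmax(sub)), sub.shape[1])
            out[i, j] = sub[dr, dc]
            argmax[i, j] = (r0 + dr, c0 + dc)      # remembered for backprop
    return out, argmax

def roi_max_pool_backward(grad_out, argmax, feature_shape):
    """Route each output gradient to the input cell that produced the max."""
    grad_in = np.zeros(feature_shape, dtype=grad_out.dtype)
    for i in range(grad_out.shape[0]):
        for j in range(grad_out.shape[1]):
            r, c = argmax[i, j]
            grad_in[r, c] += grad_out[i, j]        # accumulate on overlaps
    return grad_in

feature = np.arange(36, dtype=float).reshape(6, 6)
pooled, idx = roi_max_pool(feature, roi=(1, 1, 4, 4), out_h=2, out_w=2)
grad = roi_max_pool_backward(np.ones((2, 2)), idx, feature.shape)
```

With several RoIs, the backward pass would be run once per RoI and the resulting `grad_in` arrays added together, since one feature-map cell can be the argmax for more than one region.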

This method is further improved by the paper "Faster R-CNN" \cite{conf/nips/RenHGS15}.