Welcome to ShortScience.org! |
[link]
TLDR; The authors propose Progressive Neural Networks (ProgNN), a new way to do transfer learning without forgetting prior knowledge (as is done in finetuning). ProgNNs train a neural neural on task 1, freeze the parameters, and then train a new network on task 2 while introducing lateral connections and adapter functions from network 1 to network 2. This process can be repeated with further columns (networks). The authors evaluate ProgNNs on 3 RL tasks and find that they outperform finetuning-based approaches. #### Key Points - Finetuning is a destructive process that forgets previous knowledge. We don't want that. - Layer h_k in network 3 gets additional lateral connections from layers h_(k-1) in network 2 and network 1. Parameters of those connections are learned, but network 2 and network 1 are frozen during training of network 3. - Downside: # of Parameters grows quadratically with the number of tasks. Paper discussed some approaches to address the problem, but not sure how well these work in practice. - Metric: AUC (Average score per episode during training) as opposed to final score. Transfer score = Relative performance compared with single net baseline. - Authors use Average Perturbation Sensitivity (APS) and Average Fisher Sensitivity (AFS) to analyze which features/layers from previous networks are actually used in the newly trained network. - Experiment 1: Variations of Pong game. Baseline that finetunes only final layer fails to learn. ProgNN beats other baselines and APS shows re-use of knowledge. - Experiment 2: Different Atari games. ProgNets result in positive Transfer 8/12 times, negative transfer 2/12 times. Negative transfer may be a result of optimization problems. Finetuning final layers fails again. ProgNN beats other approaches. - Experiment 3: Labyrinth, 3D Maze. Pretty much same result as other experiments. #### Notes - It seems like the assumption is that layer k always wants to transfer knowledge from layer (k-1). But why is that true? Network are trained on different tasks, so the layer representations, or even numbers of layers, may be completely different. And Once you introduce lateral connections from all layers to all other layers the approach no longer scales. - Old tasks cannot learn from new tasks. Unlike humans. - Gating or residuals for lateral connection could make sense to allow to network to "easily" re-use previously learned knowledge. - Why use AUC metric? I also would've liked to see the final score. Maybe there's a good reason for this, but the paper doesn't explain. - Scary that finetuning the final layer only fails in most experiments. That's a very commonly used approach in non-RL domains. - Someone should try this on non-RL tasks. - What happens to training time and optimization difficult as you add more columns? Seems prohibitively expensive. |
[link]
Sinha et al. introduce a variant of adversarial training based on distributional robust optimization. I strongly recommend reading the paper for understanding the introduced theoretical framework. The authors also provide guarantees on the obtained adversarial loss – and show experimentally that this guarantee is a realistic indicator. The adversarial training variant itself follows the general strategy of training on adversarially perturbed training samples in a min-max framework. In each iteration, an attacker crafts an adversarial examples which the network is trained on. In a nutshell, their approach differs from previous ones (apart from the theoretical framework) in the used attacker. Specifically, their attacker optimizes $\arg\max_z l(\theta, z) - \gamma \|z – z^t\|_p^2$ where $z^t$ is a training sample chosen randomly during training. On a side note, I also recommend reading the reviews of this paper: https://openreview.net/forum?id=Hk6kPgZA- Also view this summary at [davidstutz.de](https://davidstutz.de/category/reading/). |
[link]
The paper proposes a standardized benchmark for a number of safety-related problems, and provides an implementation that can be used by other researchers. The problems fall in two categories: specification and robustness. Specification refers to cases where it is difficult to specify a reward function that encodes our intentions. Robustness means that agent's actions should be robust when facing various complexities of a real-world environment. Here is a list of problems: 1. Specification: 1. Safe interruptibility: agents should neither seek nor avoid interruption. 2. Avoiding side effects: agents should minimize effects unrelated to their main objective. 3. Absent supervisor: agents should not behave differently depending on presence of supervisor. 4. Reward gaming: agents should not try to exploit errors in reward function. 2. Robustness: 1. Self-modification: agents should behave well when environment allows self-modification. 2. Robustness to distributional shift: agents should behave robustly when test differs from train. 3. Robustness to adversaries: agents should detect and adapt to adversarial intentions in environment. 4. Safe exploration: agent should behave safely during learning as well. It is worth noting that problems 1.2, 1.4, 2.2, and 2.4 have been described back in "Concrete Problems in AI Safety". It is suggested that each of these problems be tackled in a "gridworld" environment — a 2D environment where the agent lives on a grid, and the only actions it has available are up/down/left/right movements. The benchmark consists of 10 environments, each corresponding to one of 8 problems mentioned above. Each of the environments is an extremely simple instance of the problem, but nevertheless they are of interest as current SotA algorithms usually don't solve the posed task. Specifically, the authors trained A2C and Rainbow with DQN update on each of the environments and showed that both algorithms fail on all of specification problems, except for Rainbow on 1.1. This is expected, as neither of those algorithms are designed for cases where reward function is misspecified. Both algorithms failed on 2.2--2.4, except for A2C on 2.3. On 2.1, the authors swapped A2C for Rainbow with Sarsa update and showed that Rainbow DQN failed while Rainbow Sarsa performed well. Overall, this is a good groundwork paper with only a few questionable design decisions, such as the design of actual reward in 1.2. It is unlikely to have impact similar to MNIST or ImageNet, but it should stimulate safety-related research. |
[link]
Domain translation - for example, mapping from a summer to a winter scene, or from a photorealistic image to an object segmentation map - is often performed by GANs through something called cycle consistency loss. This model works by having, for each domain, a generator to map domain A into domain B, and a discriminator to differentiate between real images from domain B, and those that were constructed through the cross-domain generator. With a given image in domain A, training happens by using the A→B generator to map it into domain B, and then then B→ A generator to map it back the original domain. These generators are then trained using two losses: one based on the B-domain discriminator, to push the generated image to look like it belongs from that domain, and another based on the L2 loss between the original domain A image, and the image you get on the other end when you translate it into B and back again. This paper addresses an effect (identified originally in an earlier paper) where in domains with a many to one mapping between domains (for example, mapping a realistic scene into a domain segmentation map, where information is inherently lost by translating pixels to object outlines), the cycle loss incentivizes the model to operate in a strange, steganographic way, where it saves information about the that would otherwise be lost in the form of low-amplitude random noise in the translated image. This low-amplitude information can't be isolated, but can be detected in a few ways. First, we can simply examine images and notice that information that could not have been captured in the lower-information domain is being perfectly reconstructed. Second, if you add noise to the translation in the lower-information domain, in such a way as to not perceptibly change the translation to human eyes, this can cause the predicted image off of that translation to deteriorate considerably, suggesting that the model was using information that could be modified by such small additions of noise to do its reconstruction. https://i.imgur.com/08i1j0J.png The authors of this paper ask whether it's possible to train models that don't perform this steganographic information-storing (which they call "self adversarial examples"). A typical approach to such a problem would be to train generators to perform translations with and without the steganographic information, but even though we can prove the existence of the information, we can't isolate it in a way that would allow us to remove it, and thus create these kinds of training pairs. The two tactics the paper uses are: 1) Simply training the generators to be able to translate a domain-mapped image with noise as well as one without noise, in the hope that this would train it not use information that can be interfered with by the application of such noise. 2) In addition to a L2 cycle loss, adding a discriminator to differentiate between the back-translated image and the original one. I believe the idea here is that if both of the encoders are adding in noise as a kind of secret signal, this would be a way for the discriminator to distinguish between the original and reconstructed image, and would thus be penalized. They find that both of these methods reduce the use of steganographic information, as determined both by sensitivity to noise (where less sensitivity of reconstruction to noise means less use of coded information) and reconstruction honesty (which constrains accuracy of reconstruction in many to one domains to be no greater than the prediction that a supervised predictor could make given the image from the compressed domain |
[link]
### Introduction * *Curriculum Learning* - When training machine learning models, start with easier subtasks and gradually increase the difficulty level of the tasks. * Motivation comes from the observation that humans and animals seem to learn better when trained with a curriculum like a strategy. * [Link](http://ronan.collobert.com/pub/matos/2009_curriculum_icml.pdf) to the paper. ### Contributions of the paper * Explore cases that show that curriculum learning benefits machine learning. * Offer hypothesis around when and why does it happen. * Explore relation of curriculum learning with other machine learning approaches. ### Experiments with convex criteria * Training perceptron where some input data is irrelevant(not predictive of the target class). * Difficulty can be defined in terms of the number of irrelevant samples or margin from the separating hyperplane. * Curriculum learning model outperforms no-curriculum based approach. * Surprisingly, in the case of difficulty defined in terms of the number of irrelevant examples, the anti-curriculum strategy also outperforms no-curriculum strategy. ### Experiments on shape recognition with datasets having different variability in shapes * Standard(target) dataset - Images of rectangles, ellipses, and triangles. * Easy dataset - Images of squares, circles, and equilateral triangles. * Start performing gradient descent on easy dataset and switch to target data set at a particular epoch (called *switch epoch*). * For no-curriculum learning, the first epoch is the *switch epoch*. * As *switch epoch* increases, the classification error comes down with the best performance when *switch epoch* is half the total number of epochs. * Paper does not report results for higher values of *switch epoch*. ### Experiments on language modeling * Standard data set is the set of all possible windows of the text of size 5 from Wikipedia where all words in the window appear in 20000 most frequent words. * Easy dataset considers only those windows where all words appear in 5000 most frequent words in vocabulary. * Each word from the vocabulary is embedded into a *d* dimensional feature space using a matrix **W** (to be learnt). * The model predicts the score of next word, given a window of words. * Expected value of ranking loss function is minimized to learn **W**. * Curriculum Learning-based model overtakes the other model soon after switching to the target vocabulary, indicating that curriculum-based model quickly learns new words. ### Curriculum as a continuation method * Continuation methods start with a smoothed objective function and gradually move to less smoothed function. * Useful in the case where the objective function in non-convex. * Consider a family of cost functions $C_\lambda (\theta)$ such that $C_0(\theta)$ can be easily optimized and $C_1(\theta)$ is the actual objective function. * Start with $C_0 (\theta)$ and increase $\lambda$, keeping $\theta$ at a local minimum of $C_\lambda (\theta)$. * Idea is to move $\theta$ towards a dominant (if not global) minima of $C_1(\theta)$. * Curriculum learning can be seen as a sequence of training criteria starting with an easy-to-optimise objective and moving all the way to the actual objective. * The paper provides a mathematical formulation of curriculum learning in terms of a target training distribution and a weight function (to model the probability of selecting anyone training example at any step). ### Advantages of Curriculum Learning * Faster training in the online setting as learner does not try to learn difficult examples when it is not ready. * Guiding training towards better local minima in parameter space, specifically useful for non-convex methods. ### Relation to other machine learning approaches * **Unsupervised preprocessing** - Both have a regularizing effect and lower the generalization error for the same training error. * **Active learning** - The learner would benefit most from the examples that are close to the learner's frontier of knowledge and are neither too hard nor too easy. * **Boosting Algorithms** - Difficult examples are gradually emphasised though the curriculum starts with a focus on easier examples and the training criteria do not change. * **Transfer learning** and **Life-long learning** - Initial tasks are used to guide the optimisation problem. ### Criticism * Curriculum Learning is not well understood, making it difficult to define the curriculum. * In one of the examples, anti-curriculum performs better than no-curriculum. Given that curriculum learning is modeled on the idea that learning benefits when examples are presented in order of increasing difficulty, one would expect anti-curriculum to perform worse. |