[link]
Rakelly et al. propose a method to do offpolicy meta reinforcement learning (rl). The method achieves a 20100x improvement on sample efficiency compared to onpolicy meta rl like MAML+TRPO. The key difficulty for offline meta rl arises from the metalearning assumption, that metatraining and metatest time match. However during test time the policy has to explore and sees as such onpolicy data which is in contrast to the offpolicy data that should be used at metatraining. The key contribution of PEARL is an algorithm that allows for online task inference in a latent variable at train and test time, which is used to train a Soft Actor Critic, a very sample efficient offpolicy algorithm, with additional dependence of the latent variable. The implementation of Rakelly et al. proposes to capture knowledge about the current task in a latent stochastic variable Z. A inference network $q_{\Phi}(z \vert c)$ is used to predict the posterior over latents given context c of the current task in from of transition tuples $(s,a,r,s')$ and trained with an information bottleneck. Note that the task inference is done on samples according to a sampling strategy sampling more recent transitions. The latent z is used as an additional input to policy $\pi(a \vert s, z)$ and Qfunction $Q(a,s,z)$ of a soft actor critic algorithm which is trained with offline data of the full replay buffer. https://i.imgur.com/wzlmlxU.png So the challenge of differing conditions at test and train times is resolved by sampling the content for the latent context variable at train time only from very recent transitions (which is almost onpolicy) and at test time by construction onpolicy. Sampling $z \sim q(z \vert c)$ at test time allows for posterior sampling of the latent variable, yielding efficient exploration. The experiments are performed across 6 Mujoco tasks with ProMP, MAML+TRPO and $RL^2$ with PPO as baselines. They show:  PEARL is 20100x more sampleefficient  the posterior sampling of the latent context variable enables deep exploration that is crucial for sparse reward settings  the inference network could be also a RNN, however it is crucial to train it with uncorrelated transitions instead of trajectories that have high correlated transitions  using a deterministic latent variable, i.e. reducing $q_{\Phi}(z \vert c)$ to a point estimate, leaves the algorithm unable to solve sparse reward navigation tasks which is attributed to the lack of temporally extended exploration. The paper introduces an algorithm that allows to combine meta learning with an offpolicy algorithm that dramatically increases the sampleefficiency compared to onpolicy meta learning approaches. This increases the chance of seeing meta rl in any sort of real world applications. 
[link]
Interacting with the environment comes sometimes at a high cost, for example in high stake scenarios like health care or teaching. Thus instead of learning online, we might want to learn from a fixed buffer $B$ of transitions, which is filled in advance from a behavior policy. The authors show that several so called offpolicy algorithms, like DQN and DDPG fail dramatically in this pure offpolicy setting. They attribute this to the extrapolation error, which occurs in the update of a value estimate $Q(s,a)$, where the target policy selects an unfamiliar action $\pi(s')$ such that $(s', \pi(s'))$ is unlikely or not present in $B$. Extrapolation error is caused by the mismatch between the true stateaction visitation distribution of the current policy and the stateaction distribution in $B$ due to:  stateaction pairs (s,a) missing in $B$, resulting in arbitrarily bad estimates of $Q_{\theta}(s, a)$ without sufficient data close to (s,a).  the finiteness of the batch of transition tuples $B$, leading to a biased estimate of the transition dynamics in the Bellman operator $T^{\pi}Q(s,a) \approx \mathbb{E}_{\boldsymbol{s' \sim B}}\left[r + \gamma Q(s', \pi(s')) \right]$  transitions are sampled uniformly from $B$, resulting in a loss weighted w.r.t the frequency of data in the batch: $\frac{1}{\vert B \vert} \sum_{\boldsymbol{(s, a, r, s') \sim B}} \Vert r + \gamma Q(s', \pi(s'))  Q(s, a)\Vert^2$ The proposed algorithm BatchConstrained deep Qlearning (BCQ) aims to choose actions that: 1. minimize distance of taken actions to actions in the batch 2. lead to states contained in the buffer 3. maximizes the value function, where 1. is prioritized over the other two goals to mitigate the extrapolation error. Their proposed algorithm (for continuous environments) consists informally of the following steps that are repeated at each time $t$: 1. update generator model of the state conditional marginal likelihood $P_B^G(a \vert s)$ 2. sample n actions form the generator model 3. perturb each of the sampled actions to lie in a range $\left[\Phi, \Phi \right]$ 4. act according to the argmax of respective Qvalues of perturbed actions 5. update value function The experiments considers Mujoco tasks with four scenarios of batch data creation:  1 million time steps from training a DDPG agent with exploration noise $\mathcal{N}(0,0.5)$ added to the action.This aims for a diverse set of states and actions.  1 million time steps from training a DDPG agent with an exploration noise $\mathcal{N}(0,0.1)$ added to the actions as behavior policy. The batchRL agent and the behavior DDPG are trained concurrently from the same buffer.  1 million transitions from rolling out a already trained DDPG agent  100k transitions from a behavior policy that acts with probability 0.3 randomly and follows otherwise an expert demonstration with added exploration noise $\mathcal{N}(0,0.3)$ I like the fourth choice of behavior policy the most as this captures high stake scenarios like education or medicine the closest, in which training data would be acquired by human experts that are by the nature of humans not optimal but significantly better than learning from scratch. The proposed BCQ algorithm is the only algorithm that is successful across all experiments. It matches or outperforms the behavior policy. Evaluation of the value estimates showcases unstable and diverging value estimates for all algorithms but BCQ that exhibits a stable value function. The paper outlines a very important issue that needs to be tackled in order to use reinforcement learning in real world applications. 
[link]
## Introduction Two distinct research paradigms have studied how prior tasks or experiences can be used by an agent to inform future learning. * Meta Learning: past experience is used to acquire a prior over model parameters or a learning procedure, and typically studies a setting where a set of metatraining tasks are made available together upfront * Online learning : a sequential setting where tasks are revealed one after another, but aims to attain zeroshot generalization without any taskspecific adaptation. We argue that neither setting is ideal for studying continual lifelong learning. Metalearning deals with learning to learn, but neglects the sequential and nonstationary aspects of the problem. Online learning offers an appealing theoretical framework, but does not generally consider how past experience can accelerate adaptation to a new task. ## Online Learning Online learning focuses on regret minimization. Most standard notion of regret is to compare to the cumulative loss of the best fixed model in hindsight: https://i.imgur.com/pbZG4kK.png One way minimize regret is with Follow the Leader (FTL): https://i.imgur.com/NCs73vG.png ## Online Metalearning Setting: let $U_t$ be the update procedure for task $t$ e.g. in MAML: https://i.imgur.com/Q4I4HkD.png The overall protocol for the setting is as follows: 1. At round t, the agent chooses a model defined by $w_t$ 2. The world simultaneously chooses task defined by $f_t$ 3. The agent obtains access to the update procedure $U_t$, and uses it to update parameters as $\tilde w_t = U_t(w_t)$ 4. The agent incurs loss $f_t(\tilde w_t )$. Advance to round t + 1. the goal for the agent is to minimize regrets over rounds. Achieving sublinear regrets means you're improving and converging to upper bound (joint training on all tasks) ## Algorithm and Analysis: Follow the metaleader (FTML): https://i.imgur.com/qWb9g8Q.png FTML’s regret is sublinear (under some assumption) 
[link]
This paper considers "the problem of learning logical structure [...] as expressed by satisfiability problems". This is an attempt at incorporating symbolic AI into neural networks. The key contribution of the paper is the introduction of "a differentiable smoothed MAXSAT solver", that is able to learn logical relationships from examples. The example given in the paper is Sudoku. The proposed model is able to learn jointly the rules of the game and how to solve the puzzles, **without prior on the rules**. The core of the system is a new layer that learns satisfiability constraints while being differentiable. This layer can be embedded in a typical ConvNet: ![satnet](https://userimages.githubusercontent.com/8659132/5987534085ae7780936e11e9946cc8e8bcb2f869.png) Previous attempts to solve Sudoku with a neural network were unsuccessful. The networks were able to reach a high accuracy on the training set but were completely unable to generalize on new puzzles, showing they were unable to learn the underlying logic. SATNet reaches 99% test accuracy on an encoded representation of Sudoku puzzles and 63% test accuracy on images of Sudoku puzzles. # Comments This layer can probably be used to solve other puzzle games, but I'm not familiar enough with SAT to know what kind of practical problems can be solved with this system. Operational research problems maybe? Code: https://github.com/locuslab/SATNet 
[link]
This paper was presented at ICML 2019. Do you remember greedy layerwise training? Are you curious what a modern take on the idea can achieve? This is the paper for you then. And it has its own very good summary: > We use standard convolutional and fully connected network architectures, but instead of globally backpropagating errors, each weight layer is trained by a local learning signal,that is not backpropagated down the network. The learning signal is provided by two separate singlelayer subnetworks, each with their own distinct loss function. One subnetwork is trained with a standard crossentropy loss, and the other with a similarity matching loss. If it's a bit unclear, this figure might help: ![local_error_signal](https://userimages.githubusercontent.com/8659132/59717441ff67298091e511e990d58f81c3468391.png) The crossentropy loss is the standard classification loss. The similarity loss is between the output of the layer and the onehot encoded labels: $$ L_{\mathrm{sim}}=\\ S(\text { NeuralNet }(H))S(Y) \\_{F}^{2} $$ The similarity is a cosine similarity matrix $S$ where the elements are: $$ s_{i j}=s_{j i}=\frac{\tilde{\mathbf{x}}_{i}^{T} \tilde{\mathbf{x}}_{j}}{\\\widetilde{\mathbf{x}}_{i}\\_{2}\\\widetilde{\mathbf{x}}_{j}\\_{2}} $$ The method is used to train VGGlike models on MNIST, FashionMNIST, CIFAR10 and 100, SVHN and STL10. While it gets nearSOTA up to CIFAR10, it's not there yet for more complex datasets. It gets 80% accuracy on CIFAR100 where SOTA is 90% accuracy. Still, this is better than a standard ResNet for example. Why would we prefer a local loss to a global loss? A big advantage is that the weights can be updated during the forward pass, thus avoiding storing the activations in memory. There was another paper on a similar topic, which I didn't read: [Greedy Layerwise Learning Can Scale to ImageNet](https://arxiv.org/abs/1812.11446). # Comments  While this is clearly not ready to just replace standard backprop, I find this line of work very interesting as it casts a doubt on one of the assumption of backprop: that we need a global signal to learn complex functions.  Though not mentioned in the paper, wouldn't a local loss naturally avoid vanishing and exploding gradients?
1 Comments
