[link]
This work attempts to use metalearning to learn an update rule for a reinforcement learning agent. In this context, "learning an update rule" means learning the parameters of an LSTM module that takes in information about the agent's recent reward and current model and outputs two values  a scalar and a vector  that are used to update the agent's model. I'm not going to go too deep into metalearning here, but, at a high level, meta learning methods optimize parameters governing an agent's learning, and, over the course of many training processes over many environments, optimize those parameters such that the reward over the full lifetime of training is higher. To be more concrete, the agent in a given environment learns two things:  A policy, that is, a distribution over predicted action given a state.  A "prediction vector". This fits in the conceptual slot where most RL algorithms would learn some kind of value or Q function, to predict how much future reward can be expected from a given state. However, in this context, this vector is *very explicitly* not a value function, but is just a vector that the agentmodel generates and updates. The notion here is that maybe our humandesigned construction of a value function isn't actually the best quantity for an agent to be predicting, and, if we metalearn, we might find something more optimal. I'm a little bit confused about the structure of this vector, but I think it's *intended* to be a categorical 1ofm prediction At each step, after acting in the environment, the agent passes to an LSTM:  The reward at the step  A binary of whether the trajectory is done  The discount factor  The probability of the action that was taken from state t  The prediction vector evaluated at state t  The prediction vector evaluated at state t+1 Given that as input (and given access to its past history from earlier in the training process), the LSTM predicts two things:  A scalar, pihat  A prediction vector, yhat These two quantities are used to update the existing policy and prediction model according to the rule below. https://i.imgur.com/xx1W9SU.png Conceptually, the scalar governs whether to increase or decrease probability assigned to the taken action under the policy, and yhat serves as a target for the prediction vector to be pulled towards. An important thing to note about the LSTM structure is that none of the quantities it takes as input are dependent on the action or observation space of the environment, so, once it is learned it can (hopefully) generalize to new environments. Given this, the basic meta learning objective falls out fairly easily  optimize the parameters of the LSTM to maximize lifetime reward, taken in expectation over training runs. However, things don't turn out to be quite that easy. The simplest version of this metalearning objective is wildly unstable and difficult to optimize, and the authors had to add a number of training hacks in order to get something that would work. (It really is dramatic, by the way, how absolutely essential these are to training something that actually learns a prediction vector). These include:  A entropy bonus, pushing the meta learned parameters to learn policies and prediction vectors that have higher entropy (which is to say: are less deterministic)  An L2 penalty on both pihat and yhat  A removal of the softmax that had originally been originally taken over the kdimensional prediction vector categorical, and switching that target from a KL divergence to a straight mean squared error loss. As far as I can tell, this makes the prediction vector no longer actually a 1ofk categorical, but instead just a continuous vector, with each value between 0 and 1, which makes it make more sense to think of k separate binaries? This I was definitely confused about in the paper overall https://i.imgur.com/EL8R1yd.png With the help of all of these regularizers, the authors were able to get something that trained, and that appeared to be able to perform comparably to or better than A2C  the humandesigned baseline  across the simple gridworlds it was being trained in. However, the two most interesting aspects of the evaluation were: 1. The authors showed that, given the values of the prediction vector, you could predict the true value of a state quite well, suggesting that the vector captured most of the information about what states were high value. However, beyond that, they found that the metalearned vector was able to be used to predict the value calculated with discount rates different that than one used in the metalearned training, which the handengineered alternative, TDlambda, wasn't able to do (it could only wellpredict values at the same discount rate used to calculate it). This suggests that the network really is learning some more robust notion of value that isn't tied to a specific discount rate. 2. They also found that they were able to deploy the LSTM update rule learned on grid worlds to Atari games, and have it perform reasonably well  beating A2C in a few cases, though certainly not all. This is fairly impressive, since it's an example of a rule learned on a different, much simpler set of environments generalizing to more complex ones, and suggests that there's something intrinsic to Reinforcement Learning that it's capturing
Your comment:
