[link]
This paper blends concepts from variational inference and hierarchical reinforcement learning, learning skills or “options” out of which master policies can be constructed, in a way that allows for both information transfer across tasks and specialization on any given task. The idea of hierarchical reinforcement learning is that instead of maintaining one single policy distribution (a learned mapping between worldstates and actions), a learning system will maintain multiple simpler policies, and then learn a metapolicy for transitioning between these objectlevel policies. The hope is that this setup leads to both greater transparency and compactness (because skills are compartmentalized), and also greater ability to transfer across tasks (because if skills are granular enough, different combinations of the same skills can be used to solve quite different tasks). The differentiating proposal of this paper is that, instead of learning skills that will be fixed with respect to the master, taskspecific policy, we instead learning crosstask priors over different skills, which can then be finetuned for a given specific task. Mathematically, this looks like a reward function that is a combination of (1) actual rewards on a trajectory, and (2) the difference in the log probability of a given trajectory under the taskspecific posterior and under the prior. https://i.imgur.com/OCvmGSQ.png This framing works in two directions: it allows a general prior to be pulled towards taskspecific rewards, to get more specialized value, but it also pulls the pertask skill towards the global prior. This is both a source of transfer knowledge and general regularization, and also an incentive for skills to be relatively consistent across tasks, because consistent posteriors will be more locally clustered around their prior. The paper argues that one advantage of this is a symmetrybreaking effect, avoiding a local minimum where two skills are both being used to solve subtask A, and it would be better for one of them to specialize on subtask B, but in order to do so the local effect would be worse performance of that skill on subtask A, which would be to the overall policy’s detriment because that skill was being actively used to solve that task. Under a priordriven system, the model would have an incentive to pick one or the other of the options and use that for a given subtask, based on whichever’s prior was closest in trajectoryspace. https://i.imgur.com/CeFQ9PZ.png On a mechanical level, this set of priors is divided into a few structural parts: 1) A termination distribution, which chooses whether to keep drawing actions from the skill/subpolicy you’re currently on, or trade it in for a new one. This one has a prior set at a Bernoulli distribution with some learned alpha 2) A skill transition distribution, which chooses, conditional on sampling a “terminate”, which skill to switch to next. This has a prior of a uniform distribution over skills, which incentivizes the learning system to not put all its sampling focus on one policy too early 3) A distribution of actions given a skill choice, which, as mentioned before, has both a crosstask prior and a pertask learned posterior
Your comment:
