At the core of an actor-critic algorithm is the idea of approximating an objective function $Q(s, a)$ (or Q-function) with a trainable function $Q_\theta$ called a critic. An actor $\pi_\phi$ is then trained to maximize $Q_\theta$ instead of the original $Q$. Often, a separate pair $(Q_\theta, \pi_\phi)$ is trained for each task $\mathcal{T}$. In this paper, the authors cleverly propose to share a single critic $Q_\theta$ across multiple tasks $\mathcal{T}_1, \ldots, \mathcal{T}_L$ by parametrizing it to depend on the task, i.e., $Q_\theta(s, a, z)$, where $z$ is a task vector. For simplicity, consider a supervised learning task (as in Sec. 3.2), where $Q^t$ is the objective function of the $t$-th task and takes as input a training example pair $(x, y)$. A single critic $Q_\theta$ is then trained to approximate all $L$ such objective functions, i.e., $\arg\min_{\theta} \sum_{t=1}^L \sum_{(x,y)} (Q^t(x,y) - Q_\theta(x,y,z^t))^2$. The task vector $z^t$ is produced by a task encoder (TAEN), which takes as input a minibatch of training examples of the $t$-th task and outputs the task vector. The TAEN is trained jointly with $Q_\theta$ and all $L$ actors $\pi^t$.

Once the critic (or meta-critic, as the authors call it) is trained, a new actor $\pi_\phi^{L+1}$ can be trained from a small set of training examples in three steps. First, the small set of training examples is used to compute a new task vector $z^{L+1}$. Second, a new critic is instantiated as $Q_\theta(\cdot,\cdot,z^{L+1})$. Third, the new actor is trained to maximize this new critic.

One big lesson I learned from this paper is that there are different ways to approach meta-learning. In the context of iterative learning of neural nets, I had only thought of meta-learning as learning to approximate an update direction, as in https://arxiv.org/abs/1606.04474, i.e., $\phi \leftarrow \phi + g_\theta(\phi, x, y)$. This paper, however, suggests that you can instead learn an objective function, i.e., $\phi \leftarrow \phi + \eta \nabla_{\phi} Q_\theta(x, \pi_{\phi}(x), z(D))$, where $z(D)$ is a task vector obtained from new data $D$. This is interesting, as it maximally reuses existing techniques from gradient-based learning and frees the meta-learner from having to relearn them.
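To make the shared-critic objective concrete, here is a minimal PyTorch sketch of the regression above. The module names (`TaskEncoder`, `MetaCritic`), architectures, and shapes are my own illustration, not the paper's implementation; the point is only that one critic is fit to many per-task targets $Q^t(x,y)$ via the task vector $z^t$.

```python
import torch
import torch.nn as nn

class TaskEncoder(nn.Module):
    """TAEN (sketch): a minibatch of one task's examples -> task vector z."""
    def __init__(self, in_dim, z_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                 nn.Linear(64, z_dim))

    def forward(self, xy):              # xy: (batch, in_dim) pairs of one task
        return self.net(xy).mean(dim=0)  # aggregate the minibatch into one z

class MetaCritic(nn.Module):
    """Q_theta(x, y, z) -> scalar score (sketch)."""
    def __init__(self, xy_dim, z_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(xy_dim + z_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 1))

    def forward(self, xy, z):
        z = z.expand(xy.shape[0], -1)   # broadcast z over the batch
        return self.net(torch.cat([xy, z], dim=-1)).squeeze(-1)

def critic_loss(critic, taen, tasks):
    """tasks: list of (xy, q_target) per task, where q_target = Q^t(x, y)."""
    loss = 0.0
    for xy, q_target in tasks:
        z = taen(xy)                    # z^t from this task's minibatch
        loss = loss + ((q_target - critic(xy, z)) ** 2).mean()
    return loss
```

In the paper this loss is minimized jointly with the $L$ actors; the sketch above isolates only the critic-regression term from the displayed $\arg\min$.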
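The three few-shot steps, and the learned-objective update $\phi \leftarrow \phi + \eta \nabla_{\phi} Q_\theta(x, \pi_{\phi}(x), z(D))$, then amount to fixing $z^{L+1}$ and ascending the frozen critic. A hedged sketch, reusing the hypothetical modules above (`adapt_new_task` and its signature are assumptions for illustration):

```python
def adapt_new_task(critic, taen, support_xy, actor, x_new, lr=1e-3, steps=100):
    with torch.no_grad():
        z_new = taen(support_xy)        # step 1: z^{L+1} from the small support set D
    critic.requires_grad_(False)        # step 2: fix the critic at Q_theta(., ., z^{L+1})
    opt = torch.optim.SGD(actor.parameters(), lr=lr)
    for _ in range(steps):              # step 3: phi <- phi + eta * grad_phi Q_theta
        xy = torch.cat([x_new, actor(x_new)], dim=-1)
        loss = -critic(xy, z_new).mean()  # maximize the frozen critic
        opt.zero_grad()
        loss.backward()
        opt.step()
    return actor
```

Note that only the actor's parameters are in the optimizer, so the meta-critic stays fixed: gradients flow through it to $\pi_\phi$, which is exactly the "learned objective" reading of the update rule.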