At the core of an actor-critic algorithm is the idea of approximating an objective function $Q(s, a)$ (the Q function) with a trainable function $Q_\theta$ called a critic. An actor $\pi_\phi$ is then trained to maximize $Q_\theta$ instead of the original $Q$. Usually, a separate pair of $Q_\theta$ and $\pi_\phi$ is trained for each task $\mathcal{T}$. In this paper, the authors cleverly propose to share a single critic $Q_\theta$ across multiple tasks $\mathcal{T}_1, \ldots, \mathcal{T}_L$ by making it depend on the task, i.e., $Q_\theta(s, a, z)$, where $z$ is a task vector.
For simplicity, consider a supervised learning task (as in Sec. 3.2), where $Q^t$ is the objective function for the $t$-th task and takes as input a training example pair $(x, y)$. A single critic $Q_\theta$ is then trained to approximate all $L$ such objective functions, i.e., $\arg\min_{\theta} \sum_{t=1}^L \sum_{(x,y)} (Q^t(x,y) - Q_\theta(x,y,z^t))^2$. The task vector $z^t$ is produced by a task encoder (TAEN), which takes as input a minibatch of training examples from the $t$-th task. The TAEN is trained jointly with $Q_\theta$ and all $L$ actors $\pi^t$.
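As a rough illustration of the regression objective above, here is a minimal pure-Python sketch that only evaluates the loss (no training loop). Everything in it is an assumption made for illustration and not the paper's architecture: the scalar `task_encoder`, the linear `critic`, the toy tasks built by `make_task`, and the choice to encode the task from labelled examples while evaluating the critic on separate candidate pairs.

```python
# Toy sketch of the meta-critic regression objective (pure Python, loss only).
# All names and parametrizations here are illustrative assumptions.

def task_encoder(support, w):
    """Hypothetical TAEN stand-in: maps labelled (x, y) examples to a scalar task vector z."""
    return w * sum(x * y for x, y in support) / len(support)

def critic(x, y, z, theta):
    """Hypothetical critic Q_theta(x, y, z): a linear function of its inputs."""
    a, b, c, d = theta
    return a * x + b * y + c * z + d

def meta_critic_loss(tasks, theta, w):
    """Sum over tasks t and pairs (x, y) of (Q^t(x, y) - Q_theta(x, y, z^t))^2."""
    total = 0.0
    for q_t, support, pairs in tasks:
        z_t = task_encoder(support, w)  # one task vector per task
        for x, y in pairs:
            total += (q_t(x, y) - critic(x, y, z_t, theta)) ** 2
    return total

def make_task(slope):
    """Toy task with true relation y = slope * x and objective Q^t(x, y) = -(y - slope * x)^2."""
    xs = [-1.0, -0.5, 0.5, 1.0]
    support = [(x, slope * x) for x in xs]                    # labelled examples for the encoder
    pairs = [(x, y) for x in xs for y in (-1.0, 0.0, 1.0)]    # candidate outputs scored by Q^t
    q_t = lambda x, y, s=slope: -(y - s * x) ** 2
    return q_t, support, pairs

tasks = [make_task(s) for s in (0.5, 1.0, 2.0)]
loss = meta_critic_loss(tasks, theta=(0.0, 0.0, 0.0, 0.0), w=1.0)
```

In an actual implementation, `theta` and `w` would be updated by gradient descent on this loss, together with the per-task actors.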
Once the critic (or meta-critic, as the authors call it) is trained, a new actor $\pi_\phi^{L+1}$ can be trained from a small set of training examples in three steps. First, the small set of training examples is used to compute a new task vector $z^{L+1}$. Second, a new critic is obtained by conditioning on it: $Q_\theta(\cdot,\cdot,z^{L+1})$. Third, the new actor is trained to maximize the new critic.
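The three steps above can be sketched on a toy problem. This is only a sketch under strong assumptions: the `critic` here is hand-written and stands in for a trained $Q_\theta$, the `task_encoder` is a least-squares slope and not a learned network, and the actor is the one-parameter model $\pi_\phi(x) = \phi x$. None of these choices come from the paper.

```python
# Toy sketch of adapting to a new task with a frozen meta-critic (pure Python).
# The critic below is hand-written and stands in for a trained Q_theta.

def task_encoder(examples):
    """Hypothetical encoder: least-squares slope of y on x, used as the task vector z."""
    return sum(x * y for x, y in examples) / sum(x * x for x, _ in examples)

def critic(x, a, z):
    """Stand-in for a trained Q_theta(x, a, z): highest when the action a equals z * x."""
    return -(a - z * x) ** 2

def critic_grad_a(x, a, z):
    """Analytic derivative of the stand-in critic with respect to the action a."""
    return -2.0 * (a - z * x)

def train_actor(examples, phi=0.0, eta=0.1, steps=200):
    z = task_encoder(examples)             # step 1: task vector from the few examples
    q_new = lambda x, a: critic(x, a, z)   # step 2: critic conditioned on the new task
    for _ in range(steps):                 # step 3: gradient ascent on the new critic
        for x, _ in examples:
            a = phi * x                    # actor pi_phi(x) = phi * x
            phi += eta * critic_grad_a(x, a, z) * x  # chain rule: dQ/da * da/dphi
    return phi, z, q_new

few_shot = [(1.0, 3.0), (2.0, 6.0), (-1.0, -3.0)]
phi, z, q_new = train_actor(few_shot)
```

Note that in this sketch the labels $y$ enter only through the task vector; the actor update itself never looks at them and relies entirely on the critic.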
One big lesson I learned from this paper is that there are different ways to approach meta-learning. In the context of iterative learning of neural nets, I had only thought of meta-learning as learning to approximate an update direction, as in https://arxiv.org/abs/1606.04474, i.e., $\phi \leftarrow \phi + g_\theta(\phi, x, y)$. This paper, however, suggests that you can instead learn an objective function and follow its gradient, i.e., $\phi \leftarrow \phi + \eta \nabla_{\phi} Q_\theta(x, \pi_{\phi}(x), z(D))$, where $z(D)$ is a task vector obtained from new data $D$.
This is interesting, as it maximally reuses existing techniques from gradient-based learning and frees the meta-learner from having to re-learn them.
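To spell out the gradient in the learned-objective update above: assuming $Q_\theta$ is differentiable in its second argument and $\pi_\phi$ is differentiable in $\phi$, the chain rule gives the following, where $a$ is my shorthand for the actor's output and not notation from the paper.

$$\nabla_{\phi} Q_\theta(x, \pi_{\phi}(x), z(D)) = \left( \frac{\partial \pi_{\phi}(x)}{\partial \phi} \right)^{\top} \nabla_{a} Q_\theta(x, a, z(D)) \Big|_{a = \pi_{\phi}(x)}$$

The learned part of the update is therefore only the critic's sensitivity to the actor's output; the step size $\eta$ and the backpropagation through $\pi_\phi$ are handled by ordinary gradient-based machinery.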
