The paper combines reinforcement learning with active learning so that the model learns when to request labels in order to improve prediction accuracy.

- At each time step $t$, the model can either predict the label or request it, in which case the true label is supplied at the next time step. The action is a one-hot vector output by an LSTM that takes the previous label (if requested) and the current image as input.
- A reward is issued depending on the action: -0.05 for requesting a label, +1 for a correct prediction and -1 for an incorrect prediction.
- The optimal strategy involves storing class embeddings and their labels in memory and only requesting a label when an unseen class is encountered.

The model is evaluated on the *Omniglot* dataset and learns a non-naive strategy: using a learned uncertainty measure, it requests fewer labels the more examples of a class it has encountered. The magnitude of the penalty for an incorrect prediction determines how many labels are requested and can be tuned to maximize prediction accuracy.

## Active Learning

Active learning is a special case of semi-supervised learning that aims to reduce the amount of supervision needed during training. The model typically selects which data points to label using criteria such as highest information content, highest uncertainty or other heuristics.

## Reinforcement Learning

Reinforcement learning agents try to learn an optimal policy $\pi^*(s_t)$ that, for a state $s_t$ at time $t$, chooses the action $a_t$ that maximizes the future rewards issued by the environment. The policy is derived from the optimal action-value function $Q^*(s_t, a_t)$ by acting greedily, $\pi^*(s_t) = \arg\max_{a_t} Q^*(s_t, a_t)$, and $Q^*$ can be approximated and learned in the form of a neural network.
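
To make the reward scheme and the greedy use of $Q$ concrete, here is a minimal sketch in plain Python. The reward values (-0.05, +1, -1) come from the summary above; everything else is an assumption made for illustration: the function names `greedy_action` and `reward`, and the layout of the action vector, where the last entry stands for "request label" and the other entries stand for class predictions. It is not the paper's implementation, and the Q-values are hard-coded rather than produced by an LSTM.

```python
from typing import List, Sequence

# Reward values as stated in the summary.
R_REQUEST = -0.05
R_CORRECT = 1.0
R_INCORRECT = -1.0


def greedy_action(q_values: Sequence[float]) -> List[int]:
    """Return a one-hot action for the entry with the highest Q-value.

    Assumed layout: q_values has num_classes + 1 entries and the
    last entry is the "request label" action.
    """
    best = max(range(len(q_values)), key=lambda i: q_values[i])
    return [1 if i == best else 0 for i in range(len(q_values))]


def reward(action: Sequence[int], true_label: int) -> float:
    """Reward for a one-hot action given the integer true label."""
    choice = list(action).index(1)
    request_index = len(action) - 1
    if choice == request_index:
        # The label was requested instead of predicted.
        return R_REQUEST
    return R_CORRECT if choice == true_label else R_INCORRECT


if __name__ == "__main__":
    # Three classes plus the request action; values are made up.
    q = [0.2, 0.9, 0.1, 0.3]
    a = greedy_action(q)
    print(a, reward(a, true_label=1))
```

Under this layout, making the penalty for an incorrect prediction more negative than -1 would make requesting a label relatively more attractive whenever the model is unsure, which is the trade-off the summary describes.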