[link]
A central question of this paper is: under what circumstances will you see agents that have been trained to optimize their own reward implement strategies  like tit for tat  that are are more sophisticated and higher overall reward than each agent simply pursuing its dominant strategy. The games under consideration here are “general sum” games like Iterated Prisoner’s Dilemma, where each agent’s dominant strategy is to defect, but with some amount of coordination or reciprocity, better overall outcomes are possible. Previously, models have achieved this via explicit hardcoding, but this paper strove to use a simpler, more general approach: allowing each agent A to optimize its reward not only with regard to a fixed opponent, but with regard to an opponent that will make a predictable update move in response to the action A is about to take. Specifically, this model  shorthanded as LOLA, Learning with OpponentLearning Awareness  maximizes a given agent’s expected discount reward, but looks at reward *conditional on* the ways the opponent will update to a given action. In a simplified world where the explicit reward function is known, it’s possible to literally take the derivative through the opponent’s expected update step, taking into account the ways your expected reward is changed by the response you expect of your opponent. Outside of this simplified framework, in the world of policy gradients, there’s no analytic loss function; you can no longer directly differentiate your reward function with respect to your opponent’s actions, but you can differentiate your expected reward estimator with respect to them. This concept is quite similar to a 2016 paper, Metz et al, that used this concept to train a more effective GAN, by allowing each network in the adversarial pair to “look ahead” to their opponent’s expected response as a way to avoid getting stuck in repetitive action/response cycles. In circumstances where the parameters of the opponent are not known  obviously closer to realistic for an adversarial scenario  the paper demonstrates proof of concept ability to model an opponent’s strategy based on their past actions, and use that to conduct responsestep estimates. https://i.imgur.com/5xddJRj.png It should of course be said in all this: even though this setup did produce results closer to what we would expect in rational reciprocity, it’s still very simplified. In most of the experiments, each agent had perfect knowledge of the opponent’s priorities and likely responses; in most game theory scenarios, constructing a model of your opponent is a nontrivial part of the difficulty. Nonetheless, I found it a
Your comment:

[link]
Normal RL agents in multiagent scenarios treat their opponents as a static part of the environment, not taking into account the fact that other agents are learning as well. This paper proposes LOLA, a learning rule that should take the agency and learning of opponents into account by optimizing "return under one step lookahead of opponent learning" So instead of optimizing under the current parameters of agent 1 and 2 $$V^1(\theta_i^1, \theta_i^2)$$ LOLA proposes to optimize taking into account one step of opponent (agent 2) learning $$V^1(\theta_i^1, \theta_i^2 + \Delta \theta^2_i)$$ where we assume the opponent's naive learning update $\Delta \theta^2_i = \nabla_{\theta^2} V^2(\theta^1, \theta^2) \cdot \eta$ and we add a secondorder correction term on top of this, the authors propose  a learning rule with policy gradients in the case that the agent does not have access to exact gradients  a way to estimate the parameters of the opponent, $\theta^2$, from its trajectories using maximum likelihood in the case you can't access them directly $$\hat \theta^2 = \text{argmax}_{\theta^2} \sum_t \log \pi_{\theta^2}(u_t^2s_t)$$ LOLA is tested on iterated prisoner's dilemma and converges to a titfortat strategy more frequently than the naive RL learning algorithm, and outperforms it. LOLA is tested on iterated matching pennies (similar to prisoner's dilemma) and stably converges to the Nash equilibrium whereas the naive learners do not. In testing on coin game (a higher dimensional version of prisoner's dilemma) they find that naive learners generally choose the defect option whereas LOLA agents have a mostlycooperative strategy. As well, the authors show that LOLA is a dominant learning rule in IPD, where both agents always do better if either is using LOLA (and even better if both are using LOLA). Finally, the authors also propose second order LOLA, which instead of assuming the opponent is a naive learner, assumes the opponent uses a LOLA learning rule. They show that second order LOLA does not lead to improved performance so there is no need to have a $n$th order LOLA arms race. 