Learning with Opponent-Learning Awareness on ShortScience.org

Learning with Opponent-Learning Awareness
Jakob N. Foerster and Richard Y. Chen and Maruan Al-Shedivat and Shimon Whiteson and Pieter Abbeel and Igor Mordatch
arXiv e-Print archive - 2017 via Local arXiv
Keywords: cs.AI, cs.GT

Summaries/Notes

Summary by mnoukhov 5 years ago

Standard RL agents in multi-agent settings treat their opponents as a static part of the environment and ignore the fact that the other agents are learning as well. This paper proposes LOLA, a learning rule that accounts for the agency and learning of opponents by optimizing "return under one step look-ahead of opponent learning".

So instead of optimizing agent 1's value under the current parameters of agents 1 and 2,
$$V^1(\theta_i^1, \theta_i^2)$$

LOLA optimizes the value after one step of learning by the opponent (agent 2):
$$V^1(\theta_i^1, \theta_i^2 + \Delta \theta^2_i)$$

Here the opponent is assumed to make a naive learning update with step size $\eta$, $\Delta \theta^2_i = \eta \nabla_{\theta^2} V^2(\theta_i^1, \theta_i^2)$. Because this update depends on $\theta^1$, differentiating the look-ahead value with respect to $\theta^1$ adds a second-order correction term to agent 1's usual gradient.
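
To make the correction term concrete, here is a sketch of the derivation as I read it from the paper. The iteration index $i$ is dropped for readability, and $\delta$ (agent 1's step size) is notation assumed here. A first-order Taylor expansion of the look-ahead value gives

$$V^1(\theta^1, \theta^2 + \Delta \theta^2) \approx V^1(\theta^1, \theta^2) + (\Delta \theta^2)^\top \nabla_{\theta^2} V^1(\theta^1, \theta^2)$$

Substituting $\Delta \theta^2 = \eta \nabla_{\theta^2} V^2(\theta^1, \theta^2)$, differentiating with respect to $\theta^1$, and keeping the term that differentiates through the opponent's update leads to an update of the form

$$\theta^1 \leftarrow \theta^1 + \delta \, \nabla_{\theta^1} V^1 + \delta \eta \, (\nabla_{\theta^2} V^1)^\top \nabla_{\theta^1} \nabla_{\theta^2} V^2$$

The last term is the second-order correction: it involves the mixed second derivative of the opponent's value $V^2$, which measures how agent 1's parameters change the opponent's learning step.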

On top of this, the authors propose:
- a learning rule based on policy gradients, for the case where the agent does not have access to exact gradients
- a way to estimate the opponent's parameters, $\theta^2$, from its trajectories by maximum likelihood, for the case where they cannot be accessed directly:
$$\hat \theta^2 = \text{argmax}_{\theta^2} \sum_t \log \pi_{\theta^2}(u_t^2|s_t)$$
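
The maximum-likelihood estimate above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a tabular opponent with two actions whose policy is $\pi_{\theta^2}(u=1|s) = \sigma(\theta^2_s)$, and the names `fit_opponent`, `trajectory`, `lr` and `steps` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fit_opponent(trajectory, n_states, lr=1.0, steps=500):
    """Estimate opponent logits by gradient ascent on the average log-likelihood.

    trajectory: list of (state, action) pairs with action in {0, 1}.
    Assumed policy: pi(u=1 | s) = sigmoid(theta[s]) (an illustrative choice).
    """
    theta = [0.0] * n_states
    T = len(trajectory)
    for _ in range(steps):
        grad = [0.0] * n_states
        for s, u in trajectory:
            # d/dtheta[s] of log pi(u | s) for a Bernoulli policy with logit theta[s]
            grad[s] += (u - sigmoid(theta[s])) / T
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta


# Toy observed opponent behaviour: (state, action) pairs.
trajectory = [(0, 1), (0, 1), (0, 1), (0, 0), (1, 0), (1, 0), (1, 1), (1, 0)]
theta_hat = fit_opponent(trajectory, n_states=2)
```

For this tabular policy the estimate simply recovers the empirical action frequencies in each state; with a neural-network policy the same objective would be optimized by backpropagation instead.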

On the iterated prisoner's dilemma (IPD), LOLA converges to a tit-for-tat strategy more frequently than the naive RL learner, and outperforms it. On iterated matching pennies (a zero-sum game, unlike the prisoner's dilemma), LOLA converges stably to the Nash equilibrium, whereas naive learners do not. On the coin game (a higher-dimensional version of the prisoner's dilemma), naive learners generally choose to defect, whereas LOLA agents learn a mostly cooperative strategy.
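
The matching-pennies behaviour can be illustrated on a much simpler toy problem. The sketch below is not the paper's experiment. It assumes a one-shot (not iterated) matching pennies game, a single logit per agent, finite differences in place of exact gradients, and hypothetical names and step sizes (`v1`, `v2`, `step`, `run`, `delta`, `eta`). The LOLA branch adds the second-order term: the gradient of an agent's own value with respect to the opponent's parameter, times the mixed second derivative of the opponent's value.

```python
import math

H = 1e-4  # finite-difference step (assumption: stands in for exact gradients)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def v1(th1, th2):
    # Expected payoff of agent 1: +1 if the coins match, -1 otherwise.
    p, q = sigmoid(th1), sigmoid(th2)  # probabilities of playing heads
    return (2 * p - 1) * (2 * q - 1)


def v2(th1, th2):
    return -v1(th1, th2)  # zero-sum game


def d1(f, a, b):
    return (f(a + H, b) - f(a - H, b)) / (2 * H)


def d2(f, a, b):
    return (f(a, b + H) - f(a, b - H)) / (2 * H)


def d12(f, a, b):
    # Mixed second derivative d^2 f / (da db) by central differences.
    return (f(a + H, b + H) - f(a + H, b - H)
            - f(a - H, b + H) + f(a - H, b - H)) / (4 * H * H)


def step(th1, th2, delta=0.1, eta=1.0, lola=True):
    g1 = d1(v1, th1, th2)  # naive gradient of agent 1
    g2 = d2(v2, th1, th2)  # naive gradient of agent 2
    if lola:
        # Second-order correction: account for the opponent's naive learning step.
        g1 += eta * d2(v1, th1, th2) * d12(v2, th1, th2)
        g2 += eta * d1(v2, th1, th2) * d12(v1, th1, th2)
    return th1 + delta * g1, th2 + delta * g2


def run(lola, steps=5000, start=(1.0, -1.0)):
    th1, th2 = start
    for _ in range(steps):
        th1, th2 = step(th1, th2, lola=lola)
    return th1, th2
```

In this toy setting, `run(lola=True)` should end with both logits near 0, meaning both agents play heads with probability close to 0.5 (the Nash equilibrium), while `run(lola=False)` spirals away from that point instead of settling on it.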

In addition, the authors show that LOLA is a dominant learning rule in the IPD: both agents always do better if either one uses LOLA, and better still if both do.

Finally, the authors propose second-order LOLA, which assumes the opponent uses a LOLA learning rule rather than being a naive learner. They show that second-order LOLA does not improve performance, so there is no need for an $n$th-order LOLA arms race.
