Learning to Play No-Press Diplomacy with Best Response Policy Iteration on ShortScience.org

arxiv.org
scholar.google.com
Learning to Play No-Press Diplomacy with Best Response Policy Iteration
Anthony, Thomas W. and Eccles, Tom and Tacchetti, Andrea and Kramár, János and Gemp, Ian M. and Hudson, Thomas C. and Porcel, Nicolas and Lanctot, Marc and Pérolat, Julien and Everett, Richard and Singh, Satinder and Graepel, Thore and Bachrach, Yoram
arXiv e-Print archive - 2020 via Local Bibsonomy
Keywords: dblp
Summaries/Notes 1
[link] Summary by CodyWild 5 years ago
This paper focuses on an effort by a Deepmind team to train an agent that can play the game Diplomacy - a complex, multiplayer game where players play as countries controlling units, trying to take over the map of Europe. Some relevant factors of this game, for the purposes of this paper, are: 

1) All players move at the same time, which means you need to model your opponent's current move, and play a move that succeeds in expectation over that predicted move distribution. This also means that, in order to succeed, your policy can't be easily predictable, since, if it is, you're much easier to exploit, since your opponents can more easily optimize their response to what they predict you'll do 

2) The action space is huge, even compared to something like Chess 

3) The game has significant multiagent complexity: rather than being straightforwardly zero-sum in its reward structure, like Chess or Go, it's designed to require alliances between players in order to succeed 

Prior work - DipNet -  had been able to outperform other hand-coded models through a deep network that imitated human actions, but the authors hadn't been able to get RL to successfully learn on top of that imitation baseline. 

The basic approach this model takes is one that will probably feel familiar if you've read Deepmind's prior work on Chess or Go: an interplay between a fast-to-evaluate neural net component, and a slower, more explicit, strategically designed component. The slower "expert" component uses the fast network component as part of its evaluation of different moves' consequences, and then, once the slow expert has generated a series of decisions, the neural net policy learns to imitate those decisions. 

In this case, the slower expert tries to explicitly calculate a Best Response strategy, given some belief about what your opponents will do at the state you're in. Since the action space is so large, it isn't possible to calculate a full best response (that is, your best *possible* action given the actions you expect your opponents to take), so this paper instead lays out a Sampled Best Response algorithm. It takes as input a state, as well as an opponent policy, a candidate policy, and a value network. (More on how those come to be layer). In the simplest case, the SBR algorithm works by: 

1. Sampling some number (B) of actions from the opponent policy given the state. These represent a sample distribution of what moves you think your opponents are likely to play
2. Sampling some number (C) of candidate actions from your candidate policy. 
3. For each candidate action, evaluating - for each opponent action - the state you'd reach if you took the candidate action and the opponent took the opponent action, according to your value network. 
4. This gives you an estimated Q value for each of your candidate actions; if you pick the action with the highest Q value, this approximates your best response action to the opponent policy distribution you pass in 

Once you have this SBR procedure, you can use it to bootstrap a better policy and a value network by starting with a policy and value learned from pure human-imitation, and then using SBR - with your current policy as both the opponent and candidate policy - to generate a dataset of trajectories (where each action in the trajectory is chosen according to the SBR procedure). With that trajectory dataset, you can train your policy and value networks to be able to better approximate the actions that SBR would have taken. 

The basic version of this procedure is called IBR - Iterated Best Response. This is because, at each stage, SBR will return a policy that tries to perform well against the current version of the opponent policy. This kind of behavior is potentially troublesome, since you can theoretically have it lead to cycles, like those in Rock Paper Scissors where paper beats rock, scissors beat paper, and then rock beats scissors again. At each stage, you find the best response to what your opponent is doing right now, but you don't actually make transitive progress. A common way of improving along the axis of this particular failure mode is to learn via a "fictitious play" rather than "self play". Putting aside the name - which I don't find very usefully descriptive - this translates to simulating playing against, not the current version of your opponent, but a distribution made up of past versions of the opponent. This helps prevent cycling, because choosing a strategy that only defeats the current version but would lose to a prior version is no longer a rewarding option. 

The Fictitious Play-based approach this paper proposes - FPPI2 - tries to resolve this problem. Instead of sampling actions in step (1) from only your current opponent policy, it uses a sampling procedure where, for each opponent action, it first samples a point in history, and then samples the opponent policy vector at that point in history (so the multiple opponent moves collectively represent a coherent point in strategy space). Given this action profile, the final version of the algorithm continues to use the most recent/updated timestep of the policy and value network for the candidate policy and value network used in SBR, so that you're (hopefully) sampling high quality actions, and making accurate assessments of states, even while you use the distribution of (presumably worse) policies to construct the opponent action distribution that you evaluate your candidate actions against. 

https://i.imgur.com/jrQAAQW.png

The authors don't evaluate against human players, but do show that their FPPI2 approach consistently outperforms the prior DipNet state of the art, and performed the best overall, though Iterated Best Response performs better than they say they expected.

Some other thoughts: 

- This system is still fairly exploitable, and doesn't become meaningfully less so during training, though it does become more difficult to exploit quickly
- This does seem like a problem overall where you do a lot of modeling of what you expect your opponents strategies to be, and it seems hard to be robust, especially to collusion of your opponents
- I was a bit disappointed that the initial framing of the paper put a lot of emphasis on the alliance-focused nature of the game, but then neither suggested mechanisms targeting that aspect of the game, nor seemed to do any specific analysis of
- I would have expected this game to benefit from some explicit modeling of different agents having different policies; possibly this just isn't something they could have had be the case under their evaluation scheme, which played against copies of a given policy?

Overall, my sense is that this is still a pretty early-stage checkpoint in the effort of playing Diplomacy, and that we've still got a ways to go, but it is interesting early work, and I'm curious where it leads.
Your comment:
Write your summary here (You can use $\LaTeX$ and markdown syntax):
Anon Private