Intrinsic Social Motivation via Causal Influence in Multi-Agent RL
Natasha Jaques, Angeliki Lazaridou, Edward Hughes, Caglar Gulcehre, Pedro A. Ortega, DJ Strouse, Joel Z. Leibo, Nando de Freitas
2018
Paper summary (decodyng)

This paper builds very directly on the idea of “empowerment” as an intrinsic reward for RL agents. Where empowerment incentivizes agents to increase the amount of influence they’re able to have over the environment, “social influence,” this paper’s metric, is based on the degree to which the actions of one agent influence the actions of other agents, within a multi-agent setting. The goals of the two frameworks are a little different. The notion of “empowerment” is built around a singular agent trying to find a short-term proxy for likelihood of long-term survival (since death is a feedback point no individual wants to hit). By contrast, the problems that the authors of this paper seek to solve are more explicitly multi-agent coordination problems: prisoner’s dilemma-style situations where collective reward requires cooperation. However, they share a mathematical basis: the idea that an agent’s influence on some other element of its environment (be it the external state, or another agent’s actions) is well modeled by the mutual information between its actions and that element.
While this is initially a bit of an odd conceptual jump, it does make sense: if an action gives statistical information that helps you predict an outcome, it’s likely (though obviously not certain) that the action influenced that outcome. In a multi-agent problem, where cooperation and potentially even communication can help solve the task, being able to influence other agents amounts to “finding ways to make oneself useful to other agents”: other agents aren’t going to change their behavior based on your actions, or “listen” to your “messages” (in the experiment where a communication channel was available between agents), unless those signals help them achieve *their* goals. So this incentive, to influence the behavior of other (self-interested) agents, is a good proxy for incentivizing useful cooperation.
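The identity underlying this argument is that mutual information between A’s action and B’s action equals the expected KL divergence between B’s conditional policy and B’s marginal policy. A minimal numerical sketch, using a hypothetical toy joint distribution (all numbers illustrative, not from the paper):

```python
import numpy as np

# Hypothetical toy joint distribution over (A's action, B's action):
# rows = A's two actions, columns = B's two actions.
# If B "listens" to A, p(b|a) differs across rows and MI > 0.
p_joint = np.array([[0.40, 0.10],   # A takes action 0
                    [0.10, 0.40]])  # A takes action 1

p_a = p_joint.sum(axis=1)             # A's marginal action distribution
p_b = p_joint.sum(axis=0)             # B's marginal action distribution
p_b_given_a = p_joint / p_a[:, None]  # B's policy conditional on A's action

# Mutual information written as the expected (over A's actions) KL
# divergence between B's conditional and marginal policies -- the form
# the influence reward is built from.
mi = sum(p_a[i] * np.sum(p_b_given_a[i] * np.log(p_b_given_a[i] / p_b))
         for i in range(2))

# Equivalent direct definition: sum_ab p(a,b) log(p(a,b) / (p(a) p(b)))
mi_direct = np.sum(p_joint * np.log(p_joint / np.outer(p_a, p_b)))

print(mi, mi_direct)  # equal; ~0.193 nats for this toy distribution
```

If A’s action carried no information about B’s (independent joint distribution), both quantities would be zero, i.e. no influence reward.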
Zooming in on the exact mathematical formulation (which differs slightly from, though it’s in a shared spirit with, the empowerment math): agent A’s Causal Influence reward is calculated as the KL divergence between the action distribution of the other agent (B) conditional on the action A actually took, and B’s action distribution averaged over the other actions A might have taken. (Connecting back to empowerment: mutual information is just the expected value of this quantity, taken over A’s action distribution.)
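The equation referenced here did not survive extraction; reconstructed from the paper’s definition (with notation adapted to the A/B agents used in this summary), the influence reward at time $t$ is:

```latex
c_t^{A} = D_{\mathrm{KL}}\!\left[\, p(a_t^{B} \mid a_t^{A}, s_t)
          \;\middle\|\;
          \sum_{\tilde{a}_t^{A}} p(a_t^{B} \mid \tilde{a}_t^{A}, s_t)\,
          p(\tilde{a}_t^{A} \mid s_t) \right]
```

where the second argument is B’s marginal policy, obtained by averaging B’s predicted behavior over A’s counterfactual actions $\tilde{a}_t^{A}$.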
One thing you may notice from the above equation is that, because we’re working in KL divergences, we expect agent A to have access to the full distribution of agent B’s policy conditional on A’s action, not just the action B actually took. We also require the ability to sample “counterfactuals,” i.e. what agent B would have done if agent A had acted differently. In a realistic model of two agents interacting with each other, in only one timeline, each having access only to the external and not the internal parameters of the other, neither of these quantities can be pulled from direct experience. Instead, they are calculated using an internal model: each agent builds its own MOA (Model of Other Agents), a predictive model of what another agent will do at a given time, conditional on the environment and the actions of all other agents. It’s this model that is used to sample the aforementioned counterfactuals, since that just involves passing in a different input. I’m not entirely sure, in each experiment, whether the MOAs are trained concurrently with the agent policies, or in a separate prior step.
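A minimal sketch of how an MOA could be used to compute the influence reward. This is not the paper’s implementation (there the MOA is a learned neural network); the `toy_moa` stand-in and all names here are illustrative:

```python
import numpy as np

def influence_reward(moa, state, a_actual, p_a_given_s):
    """Causal influence of A on B, computed from A's Model of Other Agents.

    moa(state, a) -> predicted distribution over B's actions given A's action a.
    p_a_given_s   -> A's own policy at this state (used to marginalize).
    """
    # B's predicted policy conditional on the action A actually took.
    p_b_cond = moa(state, a_actual)

    # Counterfactual marginal: average B's predicted policy over the actions
    # A *could* have taken, weighted by A's own policy. Each counterfactual
    # is just the MOA evaluated with a different input action.
    p_b_marg = sum(p_a_given_s[a] * moa(state, a)
                   for a in range(len(p_a_given_s)))

    # KL divergence between the conditional and counterfactual-marginal policy.
    return float(np.sum(p_b_cond * np.log(p_b_cond / p_b_marg)))

# Toy MOA: B tends to copy A's action (stand-in for a trained predictor).
def toy_moa(state, a):
    table = np.array([[0.8, 0.2],
                      [0.2, 0.8]])
    return table[a]

r = influence_reward(toy_moa, state=None, a_actual=0,
                     p_a_given_s=np.array([0.5, 0.5]))
print(r)  # positive: A's choice measurably shifts B's predicted behavior
```

If the MOA predicted the same distribution regardless of A’s action, the conditional and marginal would coincide and the reward would be zero, matching the intuition that influence is only rewarded when B actually “listens.”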
Testing on, again, prisoner’s dilemma-style problems requiring agents to take risky collaborative actions, the authors did find higher performance using their method compared to approaches where each agent just maximizes its own external reward (which, it should be said, does depend on other agents’ actions) with no explicit incentive towards collaboration. Interestingly, when they specifically tested giving agents access to a “communication channel” (the ability to output discrete signals or “words” visible to other agents), they found that training on the channel with only an influence reward was just as effective as training with both an influence and an external reward.
arXiv e-Print archive - 2018 via Local arXiv
cs.LG, cs.AI, cs.MA, stat.ML
First published: 2018/10/19

Abstract: We derive a new intrinsic social motivation for multi-agent reinforcement
learning (MARL), in which agents are rewarded for having causal influence over
another agent's actions. Causal influence is assessed using counterfactual
reasoning. The reward does not depend on observing another agent's reward
function, and is thus a more realistic approach to MARL than taken in previous
work. We show that the causal influence reward is related to maximizing the
mutual information between agents' actions. We test the approach in challenging
social dilemma environments, where it consistently leads to enhanced
cooperation between agents and higher collective reward. Moreover, we find that
rewarding influence can lead agents to develop emergent communication
protocols. We therefore employ influence to train agents to use an explicit
communication channel, and find that it leads to more effective communication
and higher collective reward. Finally, we show that influence can be computed
by equipping each agent with an internal model that predicts the actions of
other agents. This allows the social influence reward to be computed without
the use of a centralised controller, and as such represents a significantly
more general and scalable inductive bias for MARL with independent agents.