Social Influence as Intrinsic Motivation
for Multi-Agent Deep Reinforcement Learning
Natasha Jaques 1 2 Angeliki Lazaridou 2 Edward Hughes 2 Caglar Gulcehre 2 Pedro A. Ortega 2 DJ Strouse 3
Joel Z. Leibo 2 Nando de Freitas 2
Abstract
We propose a unified mechanism for achieving
coordination and communication in Multi-Agent
Reinforcement Learning (MARL), through
rewarding agents for having causal influence
over other agents’ actions. Causal influence is
assessed using counterfactual reasoning. At each
timestep, an agent simulates alternate actions that
it could have taken, and computes their effect on
the behavior of other agents. Actions that lead to
bigger changes in other agents’ behavior are considered influential and are rewarded. We show that
this is equivalent to rewarding agents for having
high mutual information between their actions.
Empirical results demonstrate that influence leads
to enhanced coordination and communication
in challenging social dilemma environments,
dramatically increasing the learning curves of the
deep RL agents, and leading to more meaningful
learned communication protocols. The influence
rewards for all agents can be computed in a
decentralized way by enabling agents to learn a
model of other agents using deep neural networks.
In contrast, key previous works on emergent
communication in the MARL setting were unable
to learn diverse policies in a decentralized manner
and had to resort to centralized training. Consequently, the influence reward opens up a window
of new opportunities for research in this area.
1. Introduction
Intrinsic Motivation for Reinforcement Learning (RL) refers
to reward functions that allow agents to learn useful behavior
across a variety of tasks and environments, sometimes in
the absence of environmental reward (Singh et al., 2004).
Previous approaches to intrinsic motivation often focus on
curiosity (e.g. Pathak et al. (2017); Schmidhuber (2010)),
or empowerment (e.g. Klyubin et al. (2005); Mohamed &
Rezende (2015)). Here, we consider the problem of deriving
intrinsic social motivation from other agents in multi-agent
RL (MARL). Social learning is incredibly important for
humans, and has been linked to our ability to achieve
unprecedented progress and coordination on a massive scale
(Henrich, 2015; Harari, 2014; Laland, 2017; van Schaik &
Burkart, 2011; Herrmann et al., 2007). While some previous
work has investigated intrinsic social motivation for RL (e.g.
Sequeira et al. (2011); Hughes et al. (2018); Peysakhovich &
Lerer (2018)), these approaches rely on hand-crafted rewards
specific to the environment, or on allowing agents to view the
rewards obtained by other agents. Such assumptions make it
impossible to achieve independent training of MARL agents
across multiple environments.
Achieving coordination among agents in MARL remains
a difficult problem. Prior work in this domain (e.g., Foerster
et al. (2017; 2016)), often resorts to centralized training to
ensure that agents learn to coordinate. While communication
among agents could help with coordination, training emergent communication protocols also remains a challenging
problem; recent empirical results underscore the difficulty
of learning meaningful emergent communication protocols,
even when relying on centralized training (e.g., Lazaridou
et al. (2018); Cao et al. (2018); Foerster et al. (2016)).
We propose a unified method for achieving both coordination
and communication in MARL by giving agents an intrinsic
reward for having a causal influence on other agents’ actions.
Causal influence is assessed using counterfactual reasoning;
at each timestep, an agent simulates alternate, counterfactual
actions that it could have taken, and assesses their effect
on another agent’s behavior. Actions that lead to relatively
higher change in the other agent’s behavior are considered
to be highly influential and are rewarded. We show how
this reward is related to maximizing the mutual information
between agents’ actions, and hypothesize that this inductive
bias will drive agents to learn coordinated behavior. Maximizing
mutual information as a form of intrinsic motivation has
been studied in the literature on empowerment (e.g. Klyubin
et al. (2005); Mohamed & Rezende (2015)). Social influence
can be seen as a novel, social form of empowerment.
To study our influence reward, we adopt the Sequential Social Dilemma (SSD) multi-agent environments of Leibo et al.
(2017). Through a series of three experiments, we show that
the proposed social influence reward allows agents to learn to
coordinate and communicate more effectively in these SSDs.
We train recurrent neural network policies directly from pixels, and show in the first experiment that deep RL agents
trained with the proposed social influence reward learn effectively and attain higher collective reward than powerful
baseline deep RL agents, which often completely fail to learn.
In the second experiment, the influence reward is used to directly train agents to use an explicit communication channel.
We demonstrate that the communication protocols trained
with the influence reward are more meaningful and effective
for obtaining better collective outcomes. Further, we find
a significant correlation between being influenced through
communication messages and obtaining higher individual reward, suggesting that influential communication is beneficial
to the agents that receive it. By examining the learning curves
in this second experiment, we again find that the influence
reward is essential to allow agents to learn to coordinate.
Finally, we show that influence agents can be trained
independently, when each agent is equipped with an internal
neural network Model of Other Agents (MOA), which has
been trained to predict the actions of every other agent. The
agent can then simulate counterfactual actions and use its
own internal MOA to predict how these will affect other
agents, thereby computing its own intrinsic influence reward.
Influence agents can thus learn socially, only through
observing other agents’ actions, and without requiring a
centralized controller or access to another agent’s reward
function. Therefore, the influence reward offers us a simple,
general and effective way of overcoming long-standing
unrealistic assumptions and limitations in this field of
research, including centralized training and the sharing of
reward functions or policy parameters. Moreover, both
the influence rewards as well as the agents’ policies can be
learned directly from pixels using expressive deep recurrent
neural networks. In this third experiment, the learning curves
once again show that the influence reward is essential for
learning to coordinate in these complex domains.
The paper is structured as follows. We describe the environments in Section 2, and the MARL setting in Section 3.
Section 4 introduces the basic formulation of the influence
reward, Section 5 extends it with the inclusion of explicit
communication protocols, and Section 6 advances it by
including models of other agents to achieve independent
training. Each of these three sections presents experiments
and results that empirically demonstrate the efficacy of
the social influence reward. Related work is presented in
Section 7. Finally, more details about the causal inference
procedure are given in Section 8.
2. Sequential Social Dilemmas
Sequential Social Dilemmas (SSDs) (Leibo et al., 2017)
are partially observable, spatially and temporally extended
multi-agent games with a game-theoretic payoff structure.
An individual agent can obtain higher reward in the
short-term by engaging in defecting, non-cooperative
behavior (and thus is greedily motivated to defect), but the
total payoff per agent will be higher if all agents cooperate.
Thus, the collective reward obtained by a group of agents
in these SSDs gives a clear signal about how well the agents
learned to cooperate (Hughes et al., 2018).
We experiment with two SSDs, a public goods game, Cleanup,
and a common-pool resource game, Harvest. In both games
apples (green tiles) provide the rewards, but are a limited
resource. Agents must coordinate harvesting apples with the
behavior of other agents in order to achieve cooperation (for
further details see Section 2 of the Supplementary Material).
The code for these games is available open source at https://github.com/eugenevinitsky/sequential_social_dilemma_games.
As the Schelling diagrams in Figure 2 of the Supplementary
Material reveal, all agents would benefit from learning to
cooperate in these games, because even agents that are being
exploited get higher reward than in the regime where more
agents defect. However, traditional RL agents struggle to
learn to coordinate or cooperate to solve these tasks effectively (Hughes et al., 2018). Thus, these SSDs represent challenging benchmark tasks for the social influence reward. Not
only must influence agents learn to coordinate their behavior
to obtain high reward, they must also learn to cooperate.
3. Multi-Agent RL for SSDs
We consider a MARL Markov game defined by the tuple
$\langle S, T, A, r \rangle$, in which multiple agents are trained to independently maximize their own individual reward; agents do
not share weights. The environment state is given by $s \in S$.
At each timestep $t$, each agent $k$ chooses an action $a_t^k \in A$.
The actions of all $N$ agents are combined to form a joint
action $a_t = [a_t^0, \ldots, a_t^N]$, which produces a transition in the
environment $T(s_{t+1} \mid a_t, s_t)$, according to the state transition
distribution $T$. Each agent then receives its own reward
$r^k(a_t, s_t)$, which may depend on the actions of other agents.
A history of these variables over time is termed a trajectory,
$\tau = \{s_t, a_t, r_t\}_{t=0}^{T}$. We consider a partially observable
setting in which the $k$th agent can only view a portion of
the true state, $s_t^k$. Each agent seeks to maximize its own
total expected discounted future reward, $R^k = \sum_{i=0}^{\infty} \gamma^i r_{t+i}^k$,
where $\gamma$ is the discount factor. A distributed asynchronous
advantage actor-critic (A3C) approach (Mnih et al., 2016)
is used to train each agent's policy $\pi^k$.
Our neural networks consist of a convolutional layer, fully
connected layers, a Long Short Term Memory (LSTM)
recurrent layer (Gers et al., 1999), and linear layers. All
networks take images as input and output both the policy $\pi^k$
and the value function $V^{\pi^k}(s)$, but some network variants
consume additional inputs and output either communication
policies or models of other agents' behavior. We will refer to
the internal LSTM state of the $k$th agent at timestep $t$ as $u_t^k$.
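To make this architecture concrete, the following is a minimal sketch of such an actor-critic network in PyTorch. This is a hypothetical re-implementation: the text does not specify the framework, and the layer sizes and observation dimensions are illustrative rather than the authors' settings.

# Minimal actor-critic network sketch: conv -> fully connected -> LSTM -> policy & value heads.
# Hypothetical re-implementation; layer sizes are illustrative only.
import torch
import torch.nn as nn


class A3CAgentNet(nn.Module):
    def __init__(self, in_channels=3, num_actions=8, hidden=128):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 6, kernel_size=3, stride=1)
        self.fc = nn.Linear(6 * 13 * 13, hidden)            # assumes a 15x15 egocentric view
        self.lstm = nn.LSTMCell(hidden, hidden)
        self.policy_head = nn.Linear(hidden, num_actions)    # logits for pi^k
        self.value_head = nn.Linear(hidden, 1)               # V^{pi^k}(s)

    def forward(self, obs, lstm_state):
        # obs: (batch, channels, height, width); lstm_state: (h, c), the internal state u_t^k
        x = torch.relu(self.conv(obs))
        x = torch.relu(self.fc(x.flatten(start_dim=1)))
        h, c = self.lstm(x, lstm_state)
        return self.policy_head(h), self.value_head(h), (h, c)


# Usage: one forward pass on a dummy 15x15 RGB observation.
net = A3CAgentNet()
obs = torch.zeros(1, 3, 15, 15)
state = (torch.zeros(1, 128), torch.zeros(1, 128))
logits, value, state = net(obs, state)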
4. Basic Social Influence
Social influence intrinsic motivation gives an agent additional reward for having a causal influence on another agent’s
actions. Specifically, it modifies an agent’s immediate reward
so that it becomes $r_t^k = \alpha e_t^k + \beta c_t^k$, where $e_t^k$ is the extrinsic or
environmental reward, and $c_t^k$ is the causal influence reward.
To compute the causal influence of one agent on another,
suppose there are two agents, k and j, and that agent j is
able to condition its policy on agent $k$'s action at time $t$, $a_t^k$.
Thus, agent $j$ computes the probability of its next action
as $p(a_t^j \mid a_t^k, s_t^j)$. We can then intervene on $a_t^k$ by replacing
it with a counterfactual action, $\tilde{a}_t^k$. This counterfactual
action is used to compute a new distribution over $j$'s next
action, $p(a_t^j \mid \tilde{a}_t^k, s_t^j)$. Essentially, agent $k$ asks a retrospective
question: “How would j’s action change if I had acted
differently in this situation?”.
By sampling several counterfactual actions, and averaging the resulting policy distribution of $j$ in
each case, we obtain the marginal policy of $j$,
$p(a_t^j \mid s_t^j) = \sum_{\tilde{a}_t^k} p(a_t^j \mid \tilde{a}_t^k, s_t^j)\, p(\tilde{a}_t^k \mid s_t^j)$; in other words,
$j$'s policy if it did not consider agent $k$. The discrepancy
between the marginal policy of $j$ and the conditional policy
of $j$ given $k$'s action is a measure of the causal influence
of $k$ on $j$; it gives the degree to which $j$ changes its planned
action distribution because of $k$'s action. Thus, the causal
influence reward for agent $k$ is:
$$c_t^k = \sum_{j=0, j \neq k}^{N} D_{KL}\!\Big[\, p(a_t^j \mid a_t^k, s_t^j) \,\Big\|\, \sum_{\tilde{a}_t^k} p(a_t^j \mid \tilde{a}_t^k, s_t^j)\, p(\tilde{a}_t^k \mid s_t^j) \Big]
= \sum_{j=0, j \neq k}^{N} D_{KL}\!\big[\, p(a_t^j \mid a_t^k, s_t^j) \,\big\|\, p(a_t^j \mid s_t^j) \big]. \qquad (1)$$
Note that it is possible to use a divergence metric other than
KL; we have found empirically that the influence reward is
robust to the choice of metric.
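To make the computation concrete, the following is a minimal numpy sketch of Eq. 1 for a single agent pair. It assumes centralized access to a hypothetical function conditional_policy_j that returns agent $j$'s policy $p(a_t^j \mid a_t^k, s_t^j)$ as a probability vector, and the counterfactual prior $p(\tilde{a}_t^k \mid s_t^j)$ is passed in explicitly; all probabilities are assumed strictly positive.

# Sketch of the counterfactual influence reward (Eq. 1) for one agent pair, assuming
# conditional_policy_j(a_k) returns p(a_t^j | a_t^k, s_t^j) as a probability vector.
import numpy as np


def influence_reward(conditional_policy_j, actual_a_k, counterfactual_a_ks, p_a_k_given_s):
    """KL between j's policy given k's actual action and j's marginal policy over
    counterfactual replacements of k's action."""
    conditional = conditional_policy_j(actual_a_k)            # p(a_j | a_k, s_j)
    # Marginalize over counterfactual actions a~_k, weighted by p(a~_k | s_j).
    marginal = sum(p * conditional_policy_j(a_cf)
                   for a_cf, p in zip(counterfactual_a_ks, p_a_k_given_s))
    # KL divergence D_KL[conditional || marginal]; assumes no zero probabilities.
    return float(np.sum(conditional * (np.log(conditional) - np.log(marginal))))


# Toy usage with a made-up 3-action conditional policy table and a uniform counterfactual prior.
table = {0: np.array([0.8, 0.1, 0.1]),
         1: np.array([0.1, 0.8, 0.1]),
         2: np.array([0.1, 0.1, 0.8])}
c = influence_reward(lambda a_k: table[a_k],
                     actual_a_k=0,
                     counterfactual_a_ks=[0, 1, 2],
                     p_a_k_given_s=np.array([1 / 3, 1 / 3, 1 / 3]))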
The reward in Eq. 1 is related to the mutual information (MI)
between the actions of agents $k$ and $j$, $I(a^k; a^j \mid s)$. As the
reward is computed over many trajectories sampled independently from the environment, we obtain a Monte-Carlo
estimate of $I(a^k; a^j \mid s)$. In expectation, the influence reward
incentivizes agents to maximize the mutual information
between their actions. The proof is given in Section 1 of
the Supplementary Material. Intuitively, training agents
to maximize the MI between their actions results in more
coordinated behavior.
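For intuition, the key step is that taking the expectation of the per-pair KL term in Eq. 1 over agent $k$'s actions recovers the conditional mutual information (a sketch; see the Supplementary Material for the full argument):
$$\mathbb{E}_{p(a^k \mid s)}\Big[ D_{KL}\big[\, p(a^j \mid a^k, s) \,\|\, p(a^j \mid s) \big] \Big]
= \sum_{a^k} p(a^k \mid s) \sum_{a^j} p(a^j \mid a^k, s) \log \frac{p(a^j \mid a^k, s)}{p(a^j \mid s)}
= I(a^k; a^j \mid s).$$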
Moreover, the variance of policy gradient updates increases
as the number of agents in the environment grows (Lowe et al.,
2017). This issue can hinder convergence to equilibrium for
large-scale MARL tasks. Social influence can reduce the
variance of policy gradients by introducing explicit dependencies across the actions of each agent. This is because the
variance of the gradients an agent receives, conditioned on
the other agents' actions, will be less than or equal to the marginal variance.
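One way to see this, via the standard law of total variance (a supporting argument, not spelled out in the text), is that for a gradient estimate $g$ and another agent's action $a^j$:
$$\mathrm{Var}(g) = \mathbb{E}\big[\mathrm{Var}(g \mid a^j)\big] + \mathrm{Var}\big(\mathbb{E}[g \mid a^j]\big) \;\geq\; \mathbb{E}\big[\mathrm{Var}(g \mid a^j)\big],$$
so the expected variance of the gradients conditioned on the other agent's action can never exceed the marginal variance.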
Note that for the basic influence model we make two assumptions: 1) we use centralized training to compute $c_t^k$ directly
from the policy of agent j, and 2) we assume that influence is
unidirectional: agents trained with the influence reward can
only influence agents that are not trained with the influence
reward (the sets of influencers and influencees are disjoint,
and the number of influencers is in $[1, N-1]$). Both of these
assumptions are relaxed in later sections. Further details, as
well as further explanation of the causal inference procedure
(including causal diagrams) are available in Section 8.
4.1. Experiment I: Basic Influence
Figure 1 shows the results of testing agents trained with the
basic influence reward against standard A3C agents, and an
ablated version of the model in which agents do not receive
the influence reward, but are able to condition their policy
on the actions of other agents (even when the other agents
are not within the agent’s partially observed view of the
environment). We term this ablated model the visible actions
baseline. In this and all other results figures, we measure
the total collective reward obtained using the best hyperparameter setting tested with 5 random seeds each. Error
bars show a 99.5% confidence interval (CI) over the random
seeds, computed within a sliding window of 200 agent steps.
We use a curriculum learning approach which gradually
increases the weight of the social influence reward over C
steps ($C \in [0.2, 3.5] \times 10^8$); this sometimes leads to a slight
delay before the influence models’ performance improves.
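For concreteness, such a curriculum might be implemented as a simple annealing schedule on the influence-reward coefficient; the linear form below is an assumption, since the exact schedule shape is not specified here.

# Hypothetical sketch of the influence-reward curriculum: the weight on the social
# influence reward is ramped up over C environment steps (linear ramp assumed).
def influence_weight(step, beta_final, curriculum_steps):
    """Anneal the influence-reward coefficient from 0 to beta_final over curriculum_steps."""
    return beta_final * min(1.0, step / curriculum_steps)

# e.g. with C = 2e8, the coefficient reaches half of beta_final after 1e8 agent steps.
beta_t = influence_weight(step=1e8, beta_final=0.5, curriculum_steps=2e8)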
As is evident in Figures 1a and 1b, introducing an awareness
of other agents’ actions helps, but having the social influence
reward eventually leads to significantly higher collective
reward in both games. Due to the structure of the SSD games,
we can infer that agents that obtain higher collective reward
learned to cooperate more effectively. In the Harvest MARL
setting, it is clear that the influence reward is essential to
achieve any reasonable learning.
(a) Cleanup (b) Harvest
Figure 1: Total collective reward obtained in Experiment 1.
Agents trained with influence (red) significantly outperform
the baseline and ablated agents. In Harvest, the influence
reward is essential to achieve any meaningful learning.
To understand how social influence helps agents achieve
cooperative behavior, we investigated the trajectories
produced by high scoring models in both Cleanup and
Harvest; the analysis revealed interesting behavior.
As an example, in the Cleanup video available here:
https://youtu.be/iH_V5WKQxmo a single agent
(shown in purple) was trained with the social influence
reward. Unlike the other agents, which continue to move
and explore randomly while waiting for apples to spawn,
the influencer only traverses the map when it is pursuing an
apple, then stops. The rest of the time it stays still.
Figure 2 shows a moment of high influence between the influencer and the yellow influencee. The influencer has chosen to move towards an apple that is outside of the egocentric field-of-view of the yellow agent. Because the influencer only moves when apples are available, this signals to the yellow agent that an apple must be present above it which it cannot see. This changes the yellow agent's distribution over its planned action, $p(a_t^j \mid a_t^k, s_t^j)$, and allows the purple agent to gain influence. A similar moment occurs when the influencer signals to an agent that has been cleaning the river that no apples have appeared by staying still (see Figure 6 in the Supplementary Material).
Figure 2: A moment of high influence when the purple influencer signals the presence of an apple (green tiles) outside the yellow influencee's field-of-view (yellow outlined box).
In this case study, the influencer agent learned to use its own
actions as a binary code which signals the presence or absence
of apples in the environment. We observe a similar effect in
Harvest. This type of action-based communication could be
likened to the bee waggle dance discovered by von Frisch (1969). Evidently, the influence reward gave rise not only to
cooperative behavior, but to emergent communication.
It is important to consider the limitations of the influence
reward. Whether it will always give rise to cooperative behavior may depend on the specifics of the environment and
task, and on tuning the trade-off between environmental and
influence reward. Although influence is arguably necessary
for coordination (e.g. two agents coordinating to manipulate
an object must have a high degree of influence between their
actions), it may be possible to influence another agent in a
non-cooperative way. The results provided here show that the
influence reward did lead to increased cooperation, in spite of
cooperation being difficult to achieve in these environments.
5. Influential Communication
Given the above results, we next experiment with using the
influence reward to train agents to use an explicit communication channel. At each timestep, each agent $k$ chooses a discrete communication symbol $m_t^k$; these symbols are concatenated into a combined message vector $m_t = [m_t^0, m_t^1, \ldots, m_t^N]$
for $N$ agents. This message vector $m_t$ is then given as input
to every other agent in the next timestep. Note that previous
work has shown that self-interested agents do not learn to use
this type of ungrounded, cheap talk communication channel effectively (Crawford & Sobel, 1982; Cao et al., 2018;
Foerster et al., 2016; Lazaridou et al., 2018).
Figure 3: The communication model has two heads, which
learn the environment policy, $\pi_e$, and a policy for emitting
communication symbols, $\pi_m$. Other agents' communication
messages $m_{t-1}$ are input to the LSTM.
To train the agents to communicate, we augment our initial
network with an additional A3C output head that learns a
communication policy $\pi_m$ and value function $V_m$ to determine which symbol to emit (see Figure 3). The normal policy
and value function used for acting in the environment, $\pi_e$ and
$V_e$, are trained only with environmental reward $e$. We use
the influence reward as an additional incentive for training
the communication policy, $\pi_m$, such that $r = \alpha e + \beta c$. Counterfactuals are employed to assess how much influence an
agent's communication message from the previous timestep,
$m_{t-1}^k$, has on another agent's action, $a_t^j$, where:
$$c_t^k = \sum_{j=0, j \neq k}^{N} D_{KL}\!\big[\, p(a_t^j \mid m_{t-1}^k, s_t^j) \,\big\|\, p(a_t^j \mid s_t^j) \big] \qquad (2)$$
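As an illustration, the sketch below extends the earlier actor-critic network with a second head for the communication policy $\pi_m$ and value $V_m$, and takes the previous combined message vector $m_{t-1}$ as an extra input, mirroring Figure 3. It is a hypothetical re-implementation in PyTorch; the sizes and the message embedding are assumptions, not the authors' exact architecture.

# Sketch of the two-headed model from Figure 3: an environment policy/value head
# (pi_e, V_e) and a communication policy/value head (pi_m, V_m). Hypothetical
# re-implementation; sizes are illustrative.
import torch
import torch.nn as nn


class CommAgentNet(nn.Module):
    def __init__(self, in_channels=3, num_actions=8, vocab_size=8, num_agents=5, hidden=128):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 6, kernel_size=3, stride=1)
        # Previous messages m_{t-1} from all agents are embedded and fed to the LSTM.
        self.msg_embed = nn.Embedding(vocab_size, 8)
        self.fc = nn.Linear(6 * 13 * 13 + 8 * num_agents, hidden)   # assumes a 15x15 view
        self.lstm = nn.LSTMCell(hidden, hidden)
        self.env_policy = nn.Linear(hidden, num_actions)   # pi_e (trained on env reward only)
        self.env_value = nn.Linear(hidden, 1)               # V_e
        self.comm_policy = nn.Linear(hidden, vocab_size)    # pi_m (trained with influence reward)
        self.comm_value = nn.Linear(hidden, 1)               # V_m

    def forward(self, obs, prev_messages, lstm_state):
        # prev_messages: (batch, num_agents) integer symbols from the previous timestep.
        x = torch.relu(self.conv(obs)).flatten(start_dim=1)
        m = self.msg_embed(prev_messages).flatten(start_dim=1)
        h, c = self.lstm(torch.relu(self.fc(torch.cat([x, m], dim=1))), lstm_state)
        return (self.env_policy(h), self.env_value(h),
                self.comm_policy(h), self.comm_value(h), (h, c))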
Importantly, rewarding influence through a communication
channel does not suffer from the limitation mentioned in the
previous section, i.e. that it may be possible to influence
another agent in a non-cooperative way. We can see this
for two reasons. First, there is nothing that compels agent
$j$ to act based on agent $k$'s communication message; if $m_t^k$
does not contain valuable information, $j$ is free to ignore it.
Second, because $j$'s action policy $\pi_e$ is trained only with environmental reward, $j$ will only change its intended action as
a result of observing $m_t^k$ (i.e., be influenced by $m_t^k$) if it contains information that helps $j$ to obtain environmental reward.
Therefore, we hypothesize that influential communication
must provide useful information to the listener.
5.1. Experiment II: Influential Communication
Figure 4 shows the collective reward obtained when training
the agents to use an explicit communication channel. Here,
the ablated model has the same structure as in Figure 3, but
the communication policy $\pi_m$ is trained only with environmental reward. We observe that the agents incentivized to
communicate via the social influence reward learn faster, and
achieve significantly higher collective reward for the majority
of training in both games. In fact, in the case of Cleanup, we
found that $\alpha = 0$ in the optimal hyperparameter setting, meaning that it was most effective to train the communication head
with zero extrinsic reward (see Table 2 in the Supplementary
Material). This suggests that influence alone can be a sufficient mechanism for training an effective communication
policy. In Harvest, once again influence is critical to allow
agents to learn coordinated policies and attain high reward.
(a) Cleanup
(b) Harvest
Figure 4: Total collective reward for deep RL agents with
communication channels. Once again, the influence reward
is essential to improve or achieve any learning.
To analyze the communication behaviour learned by the
agents, we introduce three metrics, partially inspired by
Bogin et al. (2018). Speaker consistency is a normalized
score $\in [0,1]$ which assesses the entropy of $p(a^k \mid m^k)$ and
$p(m^k \mid a^k)$ to determine how consistently a speaker agent
emits a particular symbol when it takes a particular action,
and vice versa (the formula is given in the Supplementary Material Section 4.4). We expect this measure to be high if, for
example, the speaker always emits the same symbol when it is
cleaning the river. We also introduce two measures of instantaneous coordination (IC), which are both measures of mutual information (MI): (1) symbol/action IC $= I(m_t^k; a_{t+1}^j)$
measures the MI between the influencer/speaker's symbol
and the influencee/listener's next action, and (2) action/action
IC $= I(a_t^k; a_{t+1}^j)$ measures the MI between the influencer's
action and the influencee's next action. To compute these
measures we first average over all trajectory steps, then take
the maximum value between any two agents, to determine
whether any pair of agents is coordinating. Note that these measures are all instantaneous, as they consider only short-term
dependencies across two consecutive timesteps, and cannot
capture whether an agent communicates influential compositional
messages, i.e., information that requires several consecutive
symbols to transmit and only then affects the other agent's
behavior.
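As an illustration of how the IC metrics can be estimated from logged trajectories, the sketch below computes a plug-in estimate of $I(m_t^k; a_{t+1}^j)$ from empirical counts of (speaker symbol, listener next action) pairs. This is a hypothetical analysis script, not the authors' code; it covers a single agent pair, and the maximum over pairs described above would be taken outside this function.

# Plug-in estimate of instantaneous coordination I(m_t^k ; a_{t+1}^j) from logged
# (speaker symbol, listener next action) pairs. Hypothetical analysis sketch.
import numpy as np


def instantaneous_coordination(speaker_symbols, listener_next_actions):
    """Empirical mutual information (in nats) between two discrete sequences."""
    symbols = np.asarray(speaker_symbols)
    actions = np.asarray(listener_next_actions)
    joint = np.zeros((symbols.max() + 1, actions.max() + 1))
    for m, a in zip(symbols, actions):
        joint[m, a] += 1
    joint /= joint.sum()
    p_m = joint.sum(axis=1, keepdims=True)   # marginal over symbols
    p_a = joint.sum(axis=0, keepdims=True)   # marginal over actions
    nonzero = joint > 0
    return float(np.sum(joint[nonzero] * np.log(joint[nonzero] / (p_m @ p_a)[nonzero])))


# Toy usage: a listener that copies the speaker's previous symbol yields high IC.
ic = instantaneous_coordination([0, 1, 2, 0, 1, 2], [0, 1, 2, 0, 1, 2])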
Figure 5: Metrics describing the quality of learned communication protocols. The models trained with influence
reward exhibit more consistent communication and more
coordination, especially in moments where influence is high.
Figure 5 presents the results. The speaker consistency
metric reveals that influence agents communicate about
their own actions more unambiguously than baseline agents,
indicating that the emergent communication is more meaningful. The IC metrics demonstrate that baseline agents
show almost no signs of coordinating behavior with communication, i.e., speakers saying A and listeners doing B
consistently. This result is aligned with both theoretical results in cheap-talk literature (Crawford & Sobel, 1982), and
recent empirical results in MARL (e.g. Foerster et al. (2016);
Lazaridou et al. (2018); Cao et al. (2018)).
In contrast, we do see high IC between influence agents,
but only when we limit the analysis to timesteps on which
influence was greater than or equal to the mean influence
(cf. influential moments in Figure 5). Inspecting the results
reveals a common pattern: influence is sparse in time. An
agent’s influence is only greater than its mean influence in
less than 10% of timesteps. Because the listener agent is
not compelled to listen to any given speaker, listeners selectively listen to a speaker only when it is beneficial, and
influence cannot occur all the time. Only when the listener
decides to change its action based on the speaker’s message
does influence occur, and in these moments we observe high
$I(m_t^k; a_{t+1}^j)$. It appears the influencers have learned a strategy of communicating meaningful information about their
own actions, and gaining influence when this becomes relevant enough for the listener to act on it.
Examining the relationship between the degree to which
agents were influenced by communication and the reward
they obtained gives a compelling result: agents that are the
most influenced also achieve higher individual environmental
reward. We sampled 100 different experimental conditions
(i.e., hyper-parameters and random seeds) for both games,
and normalized and correlated the influence and individual
rewards. We found that agents who are more often influenced
tend to achieve higher task reward in both Cleanup, ρ = .67,
p