LU Social Influence as Intrinsic Motivation for MARL Article Summary


Writing

Liberty University

Description

You will read the article attached below and then summarize it in approximately 350 words or more.

Make sure that you include:

  • Title and authors of the article
  • Why the article was written (introduction), and what it attempts to find or answer (hypothesis section)
  • How it answers the question or questions it proposes (method section)
  • What the article found (results)
  • What the results actually mean (discussion)

The journal article is attached as well as an example of the journal summary.

Unformatted Attachment Preview

Social Influence as Intrinsic Motivation for Multi-Agent Deep Reinforcement Learning

Natasha Jaques (1, 2), Angeliki Lazaridou (2), Edward Hughes (2), Caglar Gulcehre (2), Pedro A. Ortega (2), DJ Strouse (3), Joel Z. Leibo (2), Nando de Freitas (2)

(1) Media Lab, Massachusetts Institute of Technology, Cambridge, USA; (2) Google DeepMind, London, UK; (3) Institute for Advanced Study, Princeton University, Princeton, USA. Correspondence to: Natasha Jaques, Angeliki Lazaridou.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

We propose a unified mechanism for achieving coordination and communication in Multi-Agent Reinforcement Learning (MARL), through rewarding agents for having causal influence over other agents' actions. Causal influence is assessed using counterfactual reasoning. At each timestep, an agent simulates alternate actions that it could have taken, and computes their effect on the behavior of other agents. Actions that lead to bigger changes in other agents' behavior are considered influential and are rewarded. We show that this is equivalent to rewarding agents for having high mutual information between their actions. Empirical results demonstrate that influence leads to enhanced coordination and communication in challenging social dilemma environments, dramatically increasing the learning curves of the deep RL agents, and leading to more meaningful learned communication protocols. The influence rewards for all agents can be computed in a decentralized way by enabling agents to learn a model of other agents using deep neural networks. In contrast, key previous works on emergent communication in the MARL setting were unable to learn diverse policies in a decentralized manner and had to resort to centralized training. Consequently, the influence reward opens up a window of new opportunities for research in this area.

1. Introduction

Intrinsic Motivation for Reinforcement Learning (RL) refers to reward functions that allow agents to learn useful behavior across a variety of tasks and environments, sometimes in the absence of environmental reward (Singh et al., 2004). Previous approaches to intrinsic motivation often focus on curiosity (e.g. Pathak et al. (2017); Schmidhuber (2010)), or empowerment (e.g. Klyubin et al. (2005); Mohamed & Rezende (2015)). Here, we consider the problem of deriving intrinsic social motivation from other agents in multi-agent RL (MARL). Social learning is incredibly important for humans, and has been linked to our ability to achieve unprecedented progress and coordination on a massive scale (Henrich, 2015; Harari, 2014; Laland, 2017; van Schaik & Burkart, 2011; Herrmann et al., 2007). While some previous work has investigated intrinsic social motivation for RL (e.g. Sequeira et al. (2011); Hughes et al. (2018); Peysakhovich & Lerer (2018)), these approaches rely on hand-crafted rewards specific to the environment, or on allowing agents to view the rewards obtained by other agents. Such assumptions make it impossible to achieve independent training of MARL agents across multiple environments. Achieving coordination among agents in MARL remains a difficult problem. Prior work in this domain (e.g., Foerster et al. (2017; 2016)) often resorts to centralized training to ensure that agents learn to coordinate.
While communication among agents could help with coordination, training emergent communication protocols also remains a challenging problem; recent empirical results underscore the difficulty of learning meaningful emergent communication protocols, even when relying on centralized training (e.g., Lazaridou et al. (2018); Cao et al. (2018); Foerster et al. (2016)).

We propose a unified method for achieving both coordination and communication in MARL by giving agents an intrinsic reward for having a causal influence on other agents' actions. Causal influence is assessed using counterfactual reasoning; at each timestep, an agent simulates alternate, counterfactual actions that it could have taken, and assesses their effect on another agent's behavior. Actions that lead to relatively higher change in the other agent's behavior are considered to be highly influential and are rewarded. We show how this reward is related to maximizing the mutual information between agents' actions, and hypothesize that this inductive bias will drive agents to learn coordinated behavior. Maximizing mutual information as a form of intrinsic motivation has been studied in the literature on empowerment (e.g. Klyubin et al. (2005); Mohamed & Rezende (2015)). Social influence can be seen as a novel, social form of empowerment.

To study our influence reward, we adopt the Sequential Social Dilemma (SSD) multi-agent environments of Leibo et al. (2017). Through a series of three experiments, we show that the proposed social influence reward allows agents to learn to coordinate and communicate more effectively in these SSDs. We train recurrent neural network policies directly from pixels, and show in the first experiment that deep RL agents trained with the proposed social influence reward learn effectively and attain higher collective reward than powerful baseline deep RL agents, which often completely fail to learn.

In the second experiment, the influence reward is used to directly train agents to use an explicit communication channel. We demonstrate that the communication protocols trained with the influence reward are more meaningful and effective for obtaining better collective outcomes. Further, we find a significant correlation between being influenced through communication messages and obtaining higher individual reward, suggesting that influential communication is beneficial to the agents that receive it. By examining the learning curves in this second experiment, we again find that the influence reward is essential to allow agents to learn to coordinate.

Finally, we show that influence agents can be trained independently, when each agent is equipped with an internal neural network Model of Other Agents (MOA), which has been trained to predict the actions of every other agent. The agent can then simulate counterfactual actions and use its own internal MOA to predict how these will affect other agents, thereby computing its own intrinsic influence reward. Influence agents can thus learn socially, only through observing other agents' actions, and without requiring a centralized controller or access to another agent's reward function. Therefore, the influence reward offers us a simple, general and effective way of overcoming long-standing unrealistic assumptions and limitations in this field of research, including centralized training and the sharing of reward functions or policy parameters.
Moreover, both the influence rewards as well as the agents' policies can be learned directly from pixels using expressive deep recurrent neural networks. In this third experiment, the learning curves once again show that the influence reward is essential for learning to coordinate in these complex domains.

The paper is structured as follows. We describe the environments in Section 2, and the MARL setting in Section 3. Section 4 introduces the basic formulation of the influence reward, Section 5 extends it with the inclusion of explicit communication protocols, and Section 6 advances it by including models of other agents to achieve independent training. Each of these three sections presents experiments and results that empirically demonstrate the efficacy of the social influence reward. Related work is presented in Section 7. Finally, more details about the causal inference procedure are given in Section 8.

2. Sequential Social Dilemmas

Sequential Social Dilemmas (SSDs) (Leibo et al., 2017) are partially observable, spatially and temporally extended multi-agent games with a game-theoretic payoff structure. An individual agent can obtain higher reward in the short term by engaging in defecting, non-cooperative behavior (and thus is greedily motivated to defect), but the total payoff per agent will be higher if all agents cooperate. Thus, the collective reward obtained by a group of agents in these SSDs gives a clear signal about how well the agents learned to cooperate (Hughes et al., 2018).

We experiment with two SSDs, a public goods game, Cleanup, and a common-pool resource game, Harvest. In both games apples (green tiles) provide the rewards, but are a limited resource. Agents must coordinate harvesting apples with the behavior of other agents in order to achieve cooperation (for further details see Section 2 of the Supplementary Material). The code for these games is available in open source at https://github.com/eugenevinitsky/sequential_social_dilemma_games.

As the Schelling diagrams in Figure 2 of the Supplementary Material reveal, all agents would benefit from learning to cooperate in these games, because even agents that are being exploited get higher reward than in the regime where more agents defect. However, traditional RL agents struggle to learn to coordinate or cooperate to solve these tasks effectively (Hughes et al., 2018). Thus, these SSDs represent challenging benchmark tasks for the social influence reward. Not only must influence agents learn to coordinate their behavior to obtain high reward, they must also learn to cooperate.

3. Multi-Agent RL for SSDs

We consider a MARL Markov game defined by the tuple $\langle S, T, A, r \rangle$, in which multiple agents are trained to independently maximize their own individual reward; agents do not share weights. The environment state is given by $s \in S$. At each timestep $t$, each agent $k$ chooses an action $a_t^k \in A$. The actions of all $N$ agents are combined to form a joint action $a_t = [a_t^0, \ldots, a_t^N]$, which produces a transition in the environment $T(s_{t+1} \mid a_t, s_t)$, according to the state transition distribution $T$. Each agent then receives its own reward $r^k(a_t, s_t)$, which may depend on the actions of other agents. A history of these variables over time is termed a trajectory, $\tau = \{s_t, a_t, r_t\}_{t=0}^{T}$. We consider a partially observable setting in which the $k$-th agent can only view a portion of the true state, $s_t^k$.
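To make this setting concrete, the interaction loop of such a Markov game can be sketched as below. The environment handle (env.reset, env.step) and the per-agent policies[k].sample interface are hypothetical placeholders used only for illustration, not the authors' released code.

```python
def collect_trajectory(env, policies, num_agents, horizon=1000):
    """Roll out one episode and record the trajectory tau = {s_t, a_t, r_t}.

    Sketch only: `env` and `policies` are assumed interfaces, not the paper's code.
    """
    observations = env.reset()        # observations[k] = agent k's partial view s_t^k
    trajectory = []
    for t in range(horizon):
        # Each agent k independently samples its action a_t^k from its own policy,
        # conditioned only on its own partial observation.
        joint_action = [policies[k].sample(observations[k]) for k in range(num_agents)]
        # The transition T(s_{t+1} | a_t, s_t) and each reward r^k(a_t, s_t)
        # depend on the joint action of all N agents.
        next_observations, rewards, done = env.step(joint_action)
        trajectory.append((observations, joint_action, rewards))
        observations = next_observations
        if done:
            break
    return trajectory
```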
Each agent seeks to maximize its own total expected discounted future reward, $R^k = \sum_{i=0}^{\infty} \gamma^i r^k_{t+i}$, where $\gamma$ is the discount factor. A distributed asynchronous advantage actor-critic (A3C) approach (Mnih et al., 2016) is used to train each agent's policy $\pi^k$. Our neural networks consist of a convolutional layer, fully connected layers, a Long Short Term Memory (LSTM) recurrent layer (Gers et al., 1999), and linear layers. All networks take images as input and output both the policy $\pi^k$ and the value function $V^{\pi_k}(s)$, but some network variants consume additional inputs and output either communication policies or models of other agents' behavior. We will refer to the internal LSTM state of the $k$-th agent at timestep $t$ as $u_t^k$.

4. Basic Social Influence

Social influence intrinsic motivation gives an agent additional reward for having a causal influence on another agent's actions. Specifically, it modifies an agent's immediate reward so that it becomes $r_t^k = \alpha e_t^k + \beta c_t^k$, where $e_t^k$ is the extrinsic or environmental reward, and $c_t^k$ is the causal influence reward.

To compute the causal influence of one agent on another, suppose there are two agents, $k$ and $j$, and that agent $j$ is able to condition its policy on agent $k$'s action at time $t$, $a_t^k$. Thus, agent $j$ computes the probability of its next action as $p(a_t^j \mid a_t^k, s_t^j)$. We can then intervene on $a_t^k$ by replacing it with a counterfactual action, $\tilde{a}_t^k$. This counterfactual action is used to compute a new distribution over $j$'s next action, $p(a_t^j \mid \tilde{a}_t^k, s_t^j)$. Essentially, agent $k$ asks a retrospective question: "How would $j$'s action change if I had acted differently in this situation?". By sampling several counterfactual actions, and averaging the resulting policy distribution of $j$ in each case, we obtain the marginal policy of $j$, $p(a_t^j \mid s_t^j) = \sum_{\tilde{a}_t^k} p(a_t^j \mid \tilde{a}_t^k, s_t^j)\, p(\tilde{a}_t^k \mid s_t^j)$ — in other words, $j$'s policy if it did not consider agent $k$. The discrepancy between the marginal policy of $j$ and the conditional policy of $j$ given $k$'s action is a measure of the causal influence of $k$ on $j$; it gives the degree to which $j$ changes its planned action distribution because of $k$'s action. Thus, the causal influence reward for agent $k$ is:

$$c_t^k = \sum_{j=0,\, j \neq k}^{N} D_{KL}\!\Big[\, p(a_t^j \mid a_t^k, s_t^j) \,\Big\|\, \sum_{\tilde{a}_t^k} p(a_t^j \mid \tilde{a}_t^k, s_t^j)\, p(\tilde{a}_t^k \mid s_t^j) \Big] = \sum_{j=0,\, j \neq k}^{N} D_{KL}\!\Big[\, p(a_t^j \mid a_t^k, s_t^j) \,\Big\|\, p(a_t^j \mid s_t^j) \Big] \qquad (1)$$

Note that it is possible to use a divergence metric other than KL; we have found empirically that the influence reward is robust to the choice of metric.

The reward in Eq. 1 is related to the mutual information (MI) between the actions of agents $k$ and $j$, $I(a^k; a^j \mid s)$. As the reward is computed over many trajectories sampled independently from the environment, we obtain a Monte-Carlo estimate of $I(a^k; a^j \mid s)$. In expectation, the influence reward incentivizes agents to maximize the mutual information between their actions. The proof is given in Section 1 of the Supplementary Material. Intuitively, training agents to maximize the MI between their actions results in more coordinated behavior.

Moreover, the variance of policy gradient updates increases as the number of agents in the environment grows (Lowe et al., 2017). This issue can hinder convergence to equilibrium for large-scale MARL tasks. Social influence can reduce the variance of policy gradients by introducing explicit dependencies across the actions of each agent. This is because the conditional variance of the gradients an agent is receiving will be less than or equal to the marginalized variance.
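The computation in Eq. 1 can be sketched directly, as a rough illustration rather than the authors' implementation. The sketch assumes the centralized setting described above, in which agent $k$ can query agent $j$'s conditional policy for both the real and the counterfactual actions; the function names, the dictionary layout, and the use of agent $k$'s own policy distribution as the counterfactual weighting $p(\tilde{a}_t^k \mid s_t^j)$ are assumptions made for illustration.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) between two discrete distributions over the same action set."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
    q = np.clip(np.asarray(q, dtype=float), eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def influence_reward(actual_action, counterfactual_probs, conditional_policies):
    """Sketch of the causal influence reward c_t^k in Eq. 1 (illustrative only).

    actual_action:        the action a_t^k that agent k really took.
    counterfactual_probs: weights p(a~_t^k | s_t^j) over counterfactual actions
                          (approximated here by agent k's own policy distribution).
    conditional_policies: dict mapping agent id j -> callable returning the
                          distribution p(a_t^j | a_k, s_t^j) for a given action a_k.
    """
    num_actions = len(counterfactual_probs)
    reward = 0.0
    for j, conditional in conditional_policies.items():
        # Conditional policy of j given the action k actually took.
        p_conditional = np.asarray(conditional(actual_action), dtype=float)
        # Marginal policy of j: average the counterfactual distributions
        # p(a_t^j | a~_t^k, s_t^j), weighted by p(a~_t^k | s_t^j).
        p_marginal = np.zeros_like(p_conditional)
        for a_tilde in range(num_actions):
            p_marginal += counterfactual_probs[a_tilde] * np.asarray(conditional(a_tilde), dtype=float)
        # The influence of k on j is the divergence between the two distributions.
        reward += kl_divergence(p_conditional, p_marginal)
    return reward

# The shaped immediate reward of Section 4 would then be
# r_t^k = alpha * e_t^k + beta * influence_reward(...).
```

Because the marginal policy is a weighted average of the counterfactual conditionals, the reward vanishes when agent $j$'s action distribution does not depend on agent $k$'s action, which matches the intuition that un-influential actions earn no intrinsic reward. The communication reward in Eq. 2 (Section 5) has the same structure, with counterfactual messages $m_{t-1}^k$ in place of counterfactual actions.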
Note that for the basic influence model we make two assumptions: 1) we use centralized training to compute $c_t^k$ directly from the policy of agent $j$, and 2) we assume that influence is unidirectional: agents trained with the influence reward can only influence agents that are not trained with the influence reward (the sets of influencers and influencees are disjoint, and the number of influencers is in $[1, N-1]$). Both of these assumptions are relaxed in later sections. Further details, as well as further explanation of the causal inference procedure (including causal diagrams), are available in Section 8.

4.1. Experiment I: Basic Influence

Figure 1 shows the results of testing agents trained with the basic influence reward against standard A3C agents, and against an ablated version of the model in which agents do not receive the influence reward, but are able to condition their policy on the actions of other agents (even when the other agents are not within the agent's partially observed view of the environment). We term this ablated model the visible actions baseline. In this and all other results figures, we measure the total collective reward obtained using the best hyperparameter setting tested, with 5 random seeds each. Error bars show a 99.5% confidence interval (CI) over the random seeds, computed within a sliding window of 200 agent steps. We use a curriculum learning approach which gradually increases the weight of the social influence reward over $C$ steps ($C \in [0.2, 3.5] \times 10^8$); this sometimes leads to a slight delay before the influence models' performance improves.

As is evident in Figures 1a and 1b, introducing an awareness of other agents' actions helps, but having the social influence reward eventually leads to significantly higher collective reward in both games. Due to the structure of the SSD games, we can infer that agents that obtain higher collective reward learned to cooperate more effectively. In the Harvest MARL setting, it is clear that the influence reward is essential to achieve any reasonable learning.

Figure 1: Total collective reward obtained in Experiment 1 for (a) Cleanup and (b) Harvest. Agents trained with influence (red) significantly outperform the baseline and ablated agents. In Harvest, the influence reward is essential to achieve any meaningful learning.

To understand how social influence helps agents achieve cooperative behavior, we investigated the trajectories produced by high-scoring models in both Cleanup and Harvest; the analysis revealed interesting behavior. As an example, in the Cleanup video available here: https://youtu.be/iH_V5WKQxmo, a single agent (shown in purple) was trained with the social influence reward. Unlike the other agents, which continue to move and explore randomly while waiting for apples to spawn, the influencer only traverses the map when it is pursuing an apple, then stops. The rest of the time it stays still. Figure 2 shows a moment of high influence between the influencer and the yellow influencee. The influencer has chosen to move towards an apple that is outside of the egocentric field-of-view of the yellow agent. Because the influencer only moves when apples are available, this signals to the yellow agent that an apple must be present above it which it cannot see.
This changes the yellow agent's distribution over its planned action, $p(a_t^j \mid a_t^k, s_t^j)$, and allows the purple agent to gain influence. A similar moment occurs when the influencer signals to an agent that has been cleaning the river that no apples have appeared, by staying still (see Figure 6 in the Supplementary Material).

Figure 2: A moment of high influence when the purple influencer signals the presence of an apple (green tiles) outside the yellow influencee's field-of-view (yellow outlined box).

In this case study, the influencer agent learned to use its own actions as a binary code which signals the presence or absence of apples in the environment. We observe a similar effect in Harvest. This type of action-based communication could be likened to the bee waggle dance discovered by von Frisch (1969). Evidently, the influence reward gave rise not only to cooperative behavior, but to emergent communication.

It is important to consider the limitations of the influence reward. Whether it will always give rise to cooperative behavior may depend on the specifics of the environment and task, and on tuning the trade-off between environmental and influence reward. Although influence is arguably necessary for coordination (e.g. two agents coordinating to manipulate an object must have a high degree of influence between their actions), it may be possible to influence another agent in a non-cooperative way. The results provided here show that the influence reward did lead to increased cooperation, in spite of cooperation being difficult to achieve in these environments.

5. Influential Communication

Given the above results, we next experiment with using the influence reward to train agents to use an explicit communication channel. At each timestep, each agent $k$ chooses a discrete communication symbol $m_t^k$; these symbols are concatenated into a combined message vector $m_t = [m_t^0, m_t^1, \ldots, m_t^N]$ for $N$ agents. This message vector $m_t$ is then given as input to every other agent in the next timestep. Note that previous work has shown that self-interested agents do not learn to use this type of ungrounded, cheap talk communication channel effectively (Crawford & Sobel, 1982; Cao et al., 2018; Foerster et al., 2016; Lazaridou et al., 2018).

Figure 3: The communication model has two heads, which learn the environment policy, $\pi_e$, and a policy for emitting communication symbols, $\pi_m$. Other agents' communication messages $m_{t-1}$ are input to the LSTM.

To train the agents to communicate, we augment our initial network with an additional A3C output head that learns a communication policy $\pi_m$ and value function $V_m$ to determine which symbol to emit (see Figure 3). The normal policy and value function used for acting in the environment, $\pi_e$ and $V_e$, are trained only with environmental reward $e$. We use the influence reward as an additional incentive for training the communication policy, $\pi_m$, such that $r = \alpha e + \beta c$. Counterfactuals are employed to assess how much influence an agent's communication message from the previous timestep, $m_{t-1}^k$, has on another agent's action, $a_t^j$, where:

$$c_t^k = \sum_{j=0,\, j \neq k}^{N} D_{KL}\!\Big[\, p(a_t^j \mid m_{t-1}^k, s_t^j) \,\Big\|\, p(a_t^j \mid s_t^j) \Big] \qquad (2)$$

Importantly, rewarding influence through a communication channel does not suffer from the limitation mentioned in the previous section, i.e. that it may be possible to influence another agent in a non-cooperative way. We can see this for two reasons. First, there is nothing that compels agent $j$ to act based on agent $k$'s communication message; if $m_t^k$ does not contain valuable information, $j$ is free to ignore it.
Second, because $j$'s action policy $\pi_e$ is trained only with environmental reward, $j$ will only change its intended action as a result of observing $m_t^k$ (i.e. be influenced by $m_t^k$) if it contains information that helps $j$ to obtain environmental reward. Therefore, we hypothesize that influential communication must provide useful information to the listener.

5.1. Experiment II: Influential Communication

Figure 4 shows the collective reward obtained when training the agents to use an explicit communication channel. Here, the ablated model has the same structure as in Figure 3, but the communication policy $\pi_m$ is trained only with environmental reward. We observe that the agents incentivized to communicate via the social influence reward learn faster, and achieve significantly higher collective reward for the majority of training in both games. In fact, in the case of Cleanup, we found that $\alpha = 0$ in the optimal hyperparameter setting, meaning that it was most effective to train the communication head with zero extrinsic reward (see Table 2 in the Supplementary Material). This suggests that influence alone can be a sufficient mechanism for training an effective communication policy. In Harvest, once again influence is critical to allow agents to learn coordinated policies and attain high reward.

Figure 4: Total collective reward for deep RL agents with communication channels in (a) Cleanup and (b) Harvest. Once again, the influence reward is essential to improve or achieve any learning.

To analyze the communication behaviour learned by the agents, we introduce three metrics, partially inspired by Bogin et al. (2018). Speaker consistency is a normalized score in $[0, 1]$ which assesses the entropy of $p(a^k \mid m^k)$ and $p(m^k \mid a^k)$ to determine how consistently a speaker agent emits a particular symbol when it takes a particular action, and vice versa (the formula is given in Section 4.4 of the Supplementary Material). We expect this measure to be high if, for example, the speaker always emits the same symbol when it is cleaning the river. We also introduce two measures of instantaneous coordination (IC), which are both measures of mutual information (MI): (1) symbol/action IC $= I(m_t^k; a_{t+1}^j)$ measures the MI between the influencer/speaker's symbol and the influencee/listener's next action, and (2) action/action IC $= I(a_t^k; a_{t+1}^j)$ measures the MI between the influencer's action and the influencee's next action. To compute these measures we first average over all trajectory steps, then take the maximum value between any two agents, to determine if any pair of agents are coordinating. Note that these measures are all instantaneous, as they consider only short-term dependencies across two consecutive timesteps, and cannot capture whether an agent communicates influential compositional messages, i.e. information that requires several consecutive symbols to transmit and only then affects the other agent's behavior.

Figure 5: Metrics describing the quality of learned communication protocols. The models trained with the influence reward exhibit more consistent communication and more coordination, especially in moments where influence is high.

Figure 5 presents the results. The speaker consistency metric reveals that influence agents communicate about their own actions more unambiguously than baseline agents, indicating that the emergent communication is more meaningful. The IC metrics demonstrate that baseline agents show almost no signs of coordinating behavior with communication, i.e.
speakers saying A and listeners doing B consistently. This result is aligned with both theoretical results in the cheap-talk literature (Crawford & Sobel, 1982) and recent empirical results in MARL (e.g. Foerster et al. (2016); Lazaridou et al. (2018); Cao et al. (2018)). In contrast, we do see high IC between influence agents, but only when we limit the analysis to timesteps on which influence was greater than or equal to the mean influence (cf. influential moments in Figure 5).

Inspecting the results reveals a common pattern: influence is sparse in time. An agent's influence is greater than its mean influence in less than 10% of timesteps. Because the listener agent is not compelled to listen to any given speaker, listeners selectively listen to a speaker only when it is beneficial, and influence cannot occur all the time. Only when the listener decides to change its action based on the speaker's message does influence occur, and in these moments we observe high $I(m_t^k; a_{t+1}^j)$. It appears the influencers have learned a strategy of communicating meaningful information about their own actions, and gaining influence when this becomes relevant enough for the listener to act on it.

Examining the relationship between the degree to which agents were influenced by communication and the reward they obtained gives a compelling result: agents that are the most influenced also achieve higher individual environmental reward. We sampled 100 different experimental conditions (i.e., hyperparameters and random seeds) for both games, and normalized and correlated the influence and individual rewards. We found that agents who are more often influenced tend to achieve higher task reward in both Cleanup, ρ = .67, p
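To make the instantaneous coordination metrics above concrete, the symbol/action measure $I(m_t^k; a_{t+1}^j)$ can be estimated from empirical counts over recorded trajectories, as in the minimal sketch below. The function name and array layout are assumptions for illustration; the paper additionally restricts this analysis to high-influence timesteps and reports the maximum over agent pairs, and the action/action measure $I(a_t^k; a_{t+1}^j)$ uses the same estimator with the speaker's actions in place of its symbols.

```python
import numpy as np

def instantaneous_coordination(speaker_symbols, listener_next_actions,
                               num_symbols, num_actions):
    """Estimate I(m_t^k ; a_{t+1}^j) from paired samples along a trajectory.

    speaker_symbols:       symbols m_t^k emitted by the speaker at each timestep t.
    listener_next_actions: actions a_{t+1}^j taken by the listener at the next timestep.
    """
    # Empirical joint distribution over (symbol, next action) pairs.
    joint = np.zeros((num_symbols, num_actions))
    for m, a in zip(speaker_symbols, listener_next_actions):
        joint[m, a] += 1.0
    joint /= joint.sum()
    p_symbol = joint.sum(axis=1, keepdims=True)   # marginal over symbols, shape (S, 1)
    p_action = joint.sum(axis=0, keepdims=True)   # marginal over actions, shape (1, A)
    # Mutual information: sum_{m,a} p(m,a) * log( p(m,a) / (p(m) p(a)) ).
    product = p_symbol @ p_action                 # outer product of the marginals
    nonzero = joint > 0
    return float(np.sum(joint[nonzero] * np.log(joint[nonzero] / product[nonzero])))
```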

Explanation & Answer

Assignment Complete.

Running Head: JOURNAL SUMMARY


Journal Summary

Name:
Institutional Affiliation:
Date:


The journal article is titled Social Influence as Intrinsic Motivation for Multi-Agent Deep Reinforcement Learning, written by Natasha Jaques, Angeliki Lazaridou, Edward Hughes, Caglar Gulcehre, Pedro A. Ortega, DJ Strouse, Joel Z. Leibo and Nando de Freitas. As the title states, the article was written to propose a unified technique for achieving coordination and communication among learning agents in a setting known as Multi-Agent Reinforcement Learning (M...

