InforMARL: Scalable Multi-Agent Reinforcement Learning through Intelligent Information Aggregation.
ICML 2023

Abstract

We consider the problem of multi-agent navigation and collision avoidance when observations are limited to the local neighborhood of each agent. We propose InforMARL, a novel architecture for multi-agent reinforcement learning (MARL) which uses local information intelligently to compute paths for all the agents in a decentralized manner. Specifically, InforMARL aggregates information about the local neighborhood of agents for both the actor and the critic using a graph neural network and can be used in conjunction with any standard MARL algorithm. We show that (1) in training, InforMARL has better sample efficiency and performance than baseline approaches, despite using less information, and (2) in testing, it scales well to environments with arbitrary numbers of agents and obstacles. We illustrate these results using four task environments, including one with predetermined goals for each agent, and one in which the agents collectively try to cover all goals.

Motivation

MARL-based techniques have achieved significant successes in recent times, e.g., DeepMind's AlphaStar surpassing professional-level players in StarCraft II and OpenAI Five defeating the world champions in Dota 2. The performance of many of these MARL algorithms depends on the amount of information included in the state given as input to the neural networks [1]. In many practical multi-agent scenarios, each agent aims to share as little information as possible while still accomplishing the task at hand. This structure naturally arises in many multi-agent navigation settings, where agents may have a desired end goal but do not want to share their information due to communication constraints or proprietary concerns [2], [3]. These scenarios result in a decentralized structure, as agents only have locally available information about the overall system's state. In this paper, we focus on the question: "Can we train scalable multi-agent reinforcement learning policies that use limited local information about the environment to perform collision-free navigation effectively?"

A Motivating Experiment

The amount of information available to each agent determines whether or not the agent can learn a meaningful policy. Although having more information is generally correlated with better performance, it does not necessarily scale well with the number of agents. Prior works like MAPPO [1] and MADDPG [4] have typically fed a naïve concatenation of the states of all agents or entities in the environment into a neural network. Such an approach scales poorly (the network input size is determined by the number of agents) and does not transfer well to scenarios with a different number of agents than the training environment. We illustrate the dependence of the learned policies on the amount of information available to agents by defining three information modes (a minimal sketch of how these observations can be constructed follows the list):
  • Local: In the local information mode, agent $i$ observes $o_i = [p_i, v_i, g_i]$, where $p_i$ and $v_i$ are the position and velocity of agent $i$ in a global frame, and $g_i$ is the position of the goal relative to the agent's position.
  • Global: Here, $o_i = [p_i, v_i, g_i, p_i^{others}]$, where $p_i^{others}$ comprises the relative positions of all the other entities in the environment. The scenarios defined in the MAPE (and consequently, other approaches that use MAPE) use this type of information mode unless explicitly stated otherwise.
  • Neighborhood: In this information mode, agent $i$ observes $o_i = [p_i, v_i, g_i, p_i^{nbd}]$, where $p_i^{nbd}$ comprises the relative positions of all other entities which are within a distance nbd-dist of the agent. The maximum number of entities within the neighborhood is denoted max-nbd-entities, so the dimension of the observation vector is fixed. If there are fewer than max-nbd-entities within a distance nbd-dist of the agent, we pad this vector with zeros.
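
For concreteness, here is a minimal sketch of how the three observation vectors could be assembled for a single agent. The 2-D layout, the helper names, and the choice to keep the closest entities when there are more than max-nbd-entities are illustrative assumptions, not the exact implementation.

    import numpy as np

    # Illustrative sketch of the three information modes (2-D positions assumed).
    # Only nbd_dist and max_nbd_entities follow the terminology used above;
    # entity_pos is a list of the other entities' 2-D positions.

    def local_obs(pos, vel, goal):
        """Local mode: own position and velocity plus the goal position relative to the agent."""
        return np.concatenate([pos, vel, goal - pos])

    def global_obs(pos, vel, goal, entity_pos):
        """Global mode: local observation plus relative positions of *all* other entities.
        The observation size grows with the number of entities, so it does not
        transfer across scenarios with a different number of agents/obstacles."""
        rel = np.concatenate([p - pos for p in entity_pos]) if entity_pos else np.zeros(0)
        return np.concatenate([local_obs(pos, vel, goal), rel])

    def neighborhood_obs(pos, vel, goal, entity_pos, nbd_dist, max_nbd_entities):
        """Neighborhood mode: only entities within nbd_dist, zero-padded to a fixed size."""
        rel = [p - pos for p in entity_pos if np.linalg.norm(p - pos) <= nbd_dist]
        rel = sorted(rel, key=np.linalg.norm)[:max_nbd_entities]  # keep the closest entities (assumption)
        padded = np.zeros(2 * max_nbd_entities)                   # fixed-size slot, zero-padded
        if rel:
            flat = np.concatenate(rel)
            padded[:len(flat)] = flat
        return np.concatenate([local_obs(pos, vel, goal), padded])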



Figure 1:
MAPPO with the local, neighborhood (with 1, 3, and 5 max-nbd-entities), and global information given as states. The plots show the rewards during training for the 3-agent, 3-obstacle navigation scenario. The means and standard deviations of the rewards over training with five random seeds are shown. Comparing the global information mode to the others, we see that merely providing local information and a naïve concatenation of neighborhood information is not sufficient to learn an optimal policy.



Figure 2
The title of our paper in an alternate universe where academic papers can be informa(r)l

Method

This is where our MARL framework for navigation, InforMARL, comes into the picture. InforMARL consists of four modules, as shown in Figure 3. We describe each in detail below:


Figure 3:
Overview of our method - InforMARL

(i) Environment: The agents are depicted by green circles, the goals are depicted by red rectangles, and the unknown obstacles are depicted by gray circles. $x_i^{agg}$ represents the aggregated information from the neighborhood of agent $i$, which is the output of a graph neural network. A graph is created by connecting entities within the sensing-radius of the agents.

(ii) Information Aggregation: Each agent's observation $o_i$ is concatenated with $x_i^{agg}$. The inter-agent edges are bidirectional, while the edges between agents and non-agent entities are unidirectional.

(iii) Graph Information Aggregation: The aggregated vectors $x_i^{agg}$ from all the agents are averaged to get $X^{agg}$.

(iv) Actor-Critic: The concatenated vector $[o_i, x_i^{agg}]$ is fed into the actor network to get the action, and $X^{agg}$ is fed into the critic network to get the state-action values.
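
To make this data flow concrete, here is a minimal sketch of how the neighborhood graph and the aggregated vectors could be computed. The single round of mean message passing, the layer sizes, and all function and class names (build_edges, NeighborhoodAggregator, actor_critic_inputs) are illustrative assumptions, not the exact architecture used in the paper.

    import numpy as np
    import torch
    import torch.nn as nn

    def build_edges(positions, is_agent, sensing_radius):
        """Directed (sender, receiver) edges between entities within sensing_radius.
        Agent-agent pairs get edges in both directions; for agent/non-agent pairs we
        assume messages flow only from the non-agent entity into the agent."""
        edges = []
        n = len(positions)
        for i in range(n):
            for j in range(n):
                if i != j and np.linalg.norm(positions[i] - positions[j]) <= sensing_radius:
                    if is_agent[i] and is_agent[j]:
                        edges.append((i, j))   # reverse edge (j, i) is added when the loop reaches it
                    elif not is_agent[i] and is_agent[j]:
                        edges.append((i, j))   # non-agent entity -> agent only
        return edges

    class NeighborhoodAggregator(nn.Module):
        """One round of mean message passing that produces x_i^agg for each agent."""
        def __init__(self, feat_dim, agg_dim):
            super().__init__()
            self.msg = nn.Sequential(nn.Linear(feat_dim, agg_dim), nn.ReLU())

        def forward(self, node_feats, edges, num_agents):
            # node_feats: (num_entities, feat_dim); the first num_agents rows are the agents.
            agg = torch.zeros(node_feats.shape[0], self.msg[0].out_features)
            count = torch.zeros(node_feats.shape[0], 1)
            for s, r in edges:
                agg[r] = agg[r] + self.msg(node_feats[s])   # accumulate messages along incoming edges
                count[r] += 1
            agg = agg / count.clamp(min=1)                  # mean aggregation
            return agg[:num_agents]                         # x_i^agg for every agent i

    def actor_critic_inputs(obs, x_agg):
        """Actor sees [o_i, x_i^agg] per agent; critic sees X^agg, the mean over agents."""
        actor_in = torch.cat([obs, x_agg], dim=-1)
        critic_in = x_agg.mean(dim=0, keepdim=True)
        return actor_in, critic_in

Because the aggregation is computed per agent from its local neighborhood graph, the same networks can be reused unchanged when the number of agents or obstacles varies, which is what enables the transfer experiments below.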


Experiments and Results

Environments

We evaluate our method on the following environments:
  • Target: Each agent tries to reach its preassigned goal while avoiding collisions with other entities in the environment.
  • Coverage: Each agent tries to go to a goal while avoiding collisions with other entities, and ensuring that no more than one agent reaches the same goal.
  • Formation: There is a single landmark (the counterpart of a goal for this task) and the agents try to position themselves in an N-sided regular polygon with the landmark at its centre.
  • Line: There are two landmarks, and the agents try to position themselves equally spread out in a line between the two.

Target

Coverage

Formation

Line Formation


Figure 4
: The agents are shown as blue circles, the goals in green, and the obstacles in black in the Target and Coverage environments. The landmarks are shown in black in the Formation and Line environments.

Comparisons against other methods

We compare our method with a few different MARL baselines in the Target environment.

Figure 5
: Comparison of the training performance of InforMARL with the best-performing baselines using global and local information. InforMARL significantly outperforms most baseline algorithms. Although RMAPPO has similar performance, it requires global information. Refer to the Appendix of the paper for a complete comparison with more baselines.


The following metrics are compared (a small sketch of how they can be computed per episode is given after Table 1):
  • Total reward obtained in an episode by all the agents (higher is better).
  • Fraction of the episode taken by the agents to reach their goals (lower is better).
  • Total number of collisions the agents had in the episode, # col (lower is better).
  • Percentage of episodes in which all agents are able to get to their goals (higher is better).
The best-performing methods that use global information (RMAPPO) and local information (InforMARL) are highlighted.

Table 1
: Comparison of InforMARL with other baseline methods, for scenarios with 3, 7, and 10 agents in the Target environment. The results are averaged over 100 test episodes.
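
As a small illustration, the per-episode quantities behind these metrics could be computed as in the sketch below; the episode-log fields (steps_to_goal, team_reward, num_collisions) are hypothetical and only serve to show how each number is derived.

    # Hypothetical per-episode log: steps_to_goal[i] is the step at which agent i
    # reached its goal (episode_length if it never did), team_reward is the summed
    # reward of all agents, and num_collisions counts collisions in the episode.
    def episode_metrics(steps_to_goal, team_reward, num_collisions, episode_length):
        time_fraction = sum(steps_to_goal) / (len(steps_to_goal) * episode_length)  # lower is better
        success = all(t < episode_length for t in steps_to_goal)                    # all agents reached their goals
        return {
            "total_reward": team_reward,        # higher is better
            "time_fraction": time_fraction,     # fraction of the episode, lower is better
            "num_collisions": num_collisions,   # lower is better
            "success": success,                 # averaged over test episodes to get the success percentage
        }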

Scalability of InforMARL


Table 2
: Test performance of InforMARL for the Target environment, when trained on scenarios with n agents and tested on scenarios with m agents in the environment.

Performance in Other Task Environments


Table 3
: Performance of RMAPPO and InforMARL on the coverage, formation, and line tasks. We note that InforMARL was trained on the 3-agent scenario and tested on m = {3, 7} agents, while RMAPPO was trained and tested on the same number of agents (i.e., with m = n).

Effect of sensing radius


Figure 6
: Diminishing returns in performance gains from increasing the sensing radius for InforMARL. The dashed lines are the reward values after saturation for RMAPPO in the global (in green) and local (in red) information modes, respectively. They are provided for reference.

Ablation study for Graph Information Aggregation Module


Figure 7
: Training performance of InforMARL with, and without, the graph information aggregation module, for a 3-agent scenario. The two variants have similar sample complexities. However, the critic network with the graph information aggregation module has fewer parameters than the one without this module.

Conclusions

  • We showed that having just local observations as states is not enough for standard MARL algorithms to learn meaningful policies.
  • We also showed that although naïvely concatenating state information about all the entities in the environment helps learn good policies, these policies do not transfer to scenarios with a different number of entities than the one they were trained on.
  • InforMARL is able to learn transferable policies with standard MARL algorithms using just local observations and an aggregated neighborhood information vector. Furthermore, it has better sample complexity than other standard MARL algorithms that use global observations.

Future Work

  • Introduce more complex (potentially adversarial) dynamic obstacles in the environment.
  • Add a safety-guarantee layer for the actions of the agents to avoid collisions at all costs.
  • Investigate the use of InforMARL for curriculum learning and transfer learning to more complex environments.



 [Slides]


Citation


If you find our work or code useful in your research, please consider citing the following:




Related Links

  • Application of InforMARL for space traffic management with minimum information sharing.

Acknowledgements


The authors would like to thank the MIT SuperCloud and the Lincoln Laboratory Supercomputing Center for providing high-performance computing resources that have contributed to the research results reported within this paper. The NASA University Leadership initiative (grant #80NSSC20M0163) provided funds to assist the authors with their research, but this article solely reflects the opinions and conclusions of its authors and not any NASA entity. This research was sponsored in part by the United States AFRL and the United States Air Force Artificial Intelligence Accelerator and was accomplished under Cooperative Agreement Number FA8750-19-2-1000. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the United States Air Force or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein. Sydney Dolan was supported by the National Science Foundation Graduate Research Fellowship under Grant No. 1650114.

This website template was borrowed from Michaël Gharbi and Matthew Tannick.