InforMARL: Scalable Multi-Agent Reinforcement Learning through Intelligent Information Aggregation. ICML 2023
- Siddharth Nayak MIT
- Kenneth Choi MIT
- Wenqi Ding MIT
- Sydney Dolan MIT
- Karthik Gopalakrishnan Stanford University
- Hamsa Balakrishnan MIT
Abstract
We consider the problem of multi-agent navigation and
collision avoidance when observations are limited to the
local neighborhood of each agent. We propose InforMARL,
a novel architecture for multi-agent reinforcement learning (MARL)
which uses local information intelligently to compute paths
for all the agents in a decentralized manner. Specifically,
InforMARL aggregates information about the local neighborhood
of agents for both the actor and the critic using a graph
neural network and can be used in conjunction with any
standard MARL algorithm. We show that (1) in training,
InforMARL has better sample efficiency and performance
than baseline approaches, despite using less information,
and (2) in testing, it scales well to environments
with arbitrary numbers of agents and obstacles.
We illustrate these results using four task environments,
including one with predetermined goals
for each agent, and one in which the agents collectively
try to cover all goals.
Motivation
MARL-based techniques have achieved significant successes
in recent times, e.g., DeepMind's AlphaStar surpassing
professional-level players in StarCraft II and
OpenAI Five defeating the world champions in Dota 2.
The performance of many of these MARL algorithms depends
on the amount of information included in the state given
as input to the neural networks [1].
In many practical multi-agent scenarios, each agent aims
to share as little information as possible to accomplish
the task at hand. This structure naturally arises in many
multi-agent navigation settings, where agents may have a
desired end goal but do not want to share their information
due to communication constraints or proprietary concerns
[2],
[3].
These scenarios result in a decentralized structure,
as agents only have locally available information about
the overall system's state. In this paper, we focus on
the question: "Can we train scalable multi-agent
reinforcement learning policies that use limited local
information about the environment to perform collision-free
navigation effectively?"
A Motivating Experiment
The amount of information available to each agent determines whether or not the agent can learn a meaningful policy. Although having more information is generally correlated with better performance, it does not necessarily scale well with the number of agents. Prior works such as MAPPO [1] and MADDPG [4] typically feed a naïve concatenation of the states of all agents or entities in the environment into a neural network. Such an approach scales poorly (the network input size is determined by the number of agents) and does not transfer well to scenarios with a different number of agents than the training environment. We illustrate the dependence of the learned policies on the amount of information available to the agents by defining three information modes (a minimal sketch of all three follows this list):
- Local: In the local information mode, agent $i$ observes $o^{(i)} = [p^{(i)}, v^{(i)}, p^{(i)}_{\mathrm{goal}}]$, where $p^{(i)}$ and $v^{(i)}$ are the position and velocity of agent $i$ in a global frame, and $p^{(i)}_{\mathrm{goal}}$ is the position of the goal relative to the agent's position.
- Global: Here, $o^{(i)} = [p^{(i)}, v^{(i)}, p^{(i)}_{\mathrm{goal}}, x^{(i)}_{\mathrm{other}}]$, where $x^{(i)}_{\mathrm{other}}$ comprises the relative positions of all the other entities in the environment. The scenarios defined in the MAPE (and, consequently, other approaches that use MAPE) use this information mode unless explicitly stated otherwise.
- Neighbourhood: In this information mode, agent $i$ observes $o^{(i)} = [p^{(i)}, v^{(i)}, p^{(i)}_{\mathrm{goal}}, x^{(i)}_{\mathrm{nbd}}]$, where $x^{(i)}_{\mathrm{nbd}}$ comprises the relative positions of all other entities within a distance nbd-dist of the agent. The maximum number of entities within the neighborhood is denoted max-nbd-entities, so the dimension of the observation vector is fixed. If there are fewer than max-nbd-entities entities within a distance nbd-dist of the agent, we pad this vector with zeros.
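For concreteness, here is a minimal NumPy sketch of how the three observation modes could be assembled; the function and argument names are illustrative and not taken from the InforMARL codebase.

```python
import numpy as np

def make_observation(agent_idx, positions, velocities, goals, mode,
                     nbd_dist=1.0, max_nbd_entities=5):
    """Build the observation for one agent under the three information modes.

    positions, velocities: (num_entities, 2) arrays in a global frame,
        with the agents listed first and obstacles/landmarks after them.
    goals: (num_agents, 2) goal positions in the global frame.
    """
    p_i, v_i = positions[agent_idx], velocities[agent_idx]
    p_goal_rel = goals[agent_idx] - p_i              # goal relative to the agent
    local = np.concatenate([p_i, v_i, p_goal_rel])   # Local mode

    # Relative positions of every other entity in the environment.
    rel = np.delete(positions, agent_idx, axis=0) - p_i

    if mode == "local":
        return local
    if mode == "global":
        # Naive concatenation: the observation size grows with the number of entities.
        return np.concatenate([local, rel.ravel()])
    if mode == "neighbourhood":
        # Keep only entities within nbd_dist, then zero-pad to a fixed size.
        near = rel[np.linalg.norm(rel, axis=1) <= nbd_dist][:max_nbd_entities]
        padded = np.zeros((max_nbd_entities, 2))
        padded[:len(near)] = near
        return np.concatenate([local, padded.ravel()])
    raise ValueError(f"unknown mode: {mode}")
```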
Method
Our MARL framework for navigation, InforMARL, consists
of four modules, as shown in Figure 1. We describe each in detail below:
(i) Environment: The agents are depicted by green circles, the goals by red rectangles, and the unknown obstacles by gray circles. $x^{(i)}_{\mathrm{agg}}$ represents the aggregated information from agent $i$'s neighborhood and is the output of a graph neural network. A graph is created by connecting entities within the sensing radius of the agents.
(ii) Information Aggregation: Each agent's observation $o^{(i)}$ is concatenated with $x^{(i)}_{\mathrm{agg}}$. The inter-agent edges are bidirectional, while the edges between agents and non-agent entities are unidirectional.
(iii) Graph Information Aggregation: The aggregated vectors $x^{(i)}_{\mathrm{agg}}$ from all the agents are averaged to get $X_{\mathrm{agg}}$.
(iv) Actor-Critic: The concatenated vector $[o^{(i)}, x^{(i)}_{\mathrm{agg}}]$ is fed into the actor network to get the action, and $X_{\mathrm{agg}}$ is fed into the critic network to get the state-action values. (A compact code sketch of these modules follows this list.)
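Below is a compact PyTorch sketch of how these four modules could fit together. The single round of mean message passing, the layer sizes, and all names are illustrative simplifications under our own assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class InforMARLSketch(nn.Module):
    """Toy sketch of the four modules: graph construction over neighbours,
    GNN-style information aggregation, graph-level averaging, and an
    actor-critic head. This is a simplification, not the paper's architecture."""

    def __init__(self, obs_dim, ent_dim, hid=64, n_actions=5):
        super().__init__()
        self.embed = nn.Linear(ent_dim, hid)    # per-entity node features
        self.message = nn.Linear(hid, hid)      # one round of message passing
        self.actor = nn.Sequential(nn.Linear(obs_dim + hid, hid), nn.ReLU(),
                                   nn.Linear(hid, n_actions))
        self.critic = nn.Sequential(nn.Linear(hid, hid), nn.ReLU(),
                                    nn.Linear(hid, 1))

    def forward(self, obs, ent_feats, positions, is_agent, sensing_radius=1.0):
        # obs: (n_agents, obs_dim) local observations o^(i), agents listed first.
        # ent_feats: (n_entities, ent_dim) node features; positions: (n_entities, 2).
        # is_agent: (n_entities,) boolean mask marking which entities are agents.
        n_agents = obs.shape[0]

        # (i)/(ii) Build the graph: connect entities within the sensing radius,
        # with messages flowing only towards agents (non-agent entities just send).
        dist = torch.cdist(positions, positions)
        adj = (dist <= sensing_radius).float() * is_agent.float().unsqueeze(1)
        adj.fill_diagonal_(0)

        # x_agg^(i): mean of messages from the entities in agent i's neighbourhood.
        h = torch.relu(self.embed(ent_feats))
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        x_agg = torch.relu(self.message(adj @ h / deg))[:n_agents]

        # (iii) Graph-level aggregation and (iv) actor-critic heads.
        logits = self.actor(torch.cat([obs, x_agg], dim=-1))   # uses [o^(i), x_agg^(i)]
        X_agg = x_agg.mean(dim=0, keepdim=True)                # average over agents
        value = self.critic(X_agg)
        return logits, value
```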
Experiments and Results
Environments
We evaluate our method on the following environments:
- Target: Each agent tries to reach its preassigned goal while avoiding collisions with other entities in the environment.
- Coverage: Each agent tries to go to a goal while avoiding collisions with other entities and ensuring that no more than one agent reaches the same goal.
- Formation: There is a single landmark (the counterpart of a goal for this task), and the agents try to position themselves at the vertices of an N-sided regular polygon with the landmark at its centre.
- Line: There are two landmarks, and the agents try to position themselves equally spread out in a line between the two (a small geometric sketch of these two target configurations follows this list).
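The target configurations in the Formation and Line tasks are simple geometric constructions. The sketch below illustrates them; the helper names and the polygon radius are illustrative choices rather than the environments' actual parameters.

```python
import numpy as np

def formation_targets(landmark, n_agents, radius=1.0):
    """Vertices of an N-sided regular polygon centred on the landmark."""
    angles = 2 * np.pi * np.arange(n_agents) / n_agents
    return landmark + radius * np.stack([np.cos(angles), np.sin(angles)], axis=1)

def line_targets(landmark_a, landmark_b, n_agents):
    """Points equally spread out on the segment between the two landmarks."""
    fractions = np.linspace(0.0, 1.0, n_agents)[:, None]
    return landmark_a + fractions * (landmark_b - landmark_a)
```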
Figure 4: Example episodes in the Target, Coverage, Formation, and Line environments. The agents are shown as blue circles; the goals are shown in green and obstacles in black in the Target and Coverage environments; the landmarks are shown in black in the Formation and Line environments.
Comparisons against other methods
We compare our method with a few different MARL baselines in the Target environment.
Figure 5: Comparison of the training performance of InforMARL with the best-performing baselines using global and local information. InforMARL significantly outperforms most baseline algorithms. Although RMAPPO achieves similar performance, it requires global information. Refer to the Appendix of the paper for a complete comparison against more baselines.
The following metrics are compared (a small sketch of how they could be computed from logged episodes follows this list):
- Total reward obtained in an episode by all the agents (higher is better).
- Fraction of the episode taken by the agents to reach their goals (lower is better).
- The total number of collisions the agents had in the episode, # col (lower is better).
- Percentage of episodes in which all agents are able to reach their goals (higher is better).
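Here is a small sketch of how these per-episode metrics could be computed from logged rollouts; the data layout and names are assumptions, not the paper's evaluation code.

```python
import numpy as np

def episode_metrics(rewards, goal_steps, num_collisions, episode_len):
    """Summarise one episode.

    rewards: (T, N) per-step, per-agent rewards.
    goal_steps: length-N list with the step at which each agent first reached
        its goal, or None if it never did.
    num_collisions: total number of collisions during the episode.
    """
    reached = [s for s in goal_steps if s is not None]
    return {
        "total_reward": float(np.sum(rewards)),                 # higher is better
        "time_to_goal_frac": (float(np.mean(reached)) / episode_len
                              if reached else 1.0),             # lower is better
        "num_collisions": num_collisions,                       # lower is better
        "all_reached_goal": len(reached) == len(goal_steps),    # success indicator
    }
```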
Scalability of InforMARL
Performance in Other Task Environments
Effect of sensing radius
Ablation study for Graph Information Aggregation Module
Conclusions
- We showed that having just local observations as states is not enough for standard MARL algorithms to learn meaningful policies.
- We also showed that although naïvely concatenating state information about all the entities in the environment helps to learn good policies, these policies do not transfer to scenarios with a different number of entities than the one they were trained on.
- InforMARL is able to learn transferable policies with standard MARL algorithms using just local observations and an aggregated neighborhood information vector. Furthermore, it has better sample efficiency than other standard MARL algorithms that use global observations.
Future Work
- Introduce more complex (potentially adversarial) dynamic obstacles into the environment.
- Add a safety-guarantee layer on the agents' actions to avoid collisions at all costs.
- Investigate the use of InforMARL for curriculum learning and transfer learning to more complex environments.
Citation
If you find our work or code useful in your research, please consider citing the following:
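A BibTeX entry assembled from the title, authors, and venue listed at the top of this page (the citation key and booktitle formatting are our own choices):

```bibtex
@inproceedings{nayak2023informarl,
  title     = {Scalable Multi-Agent Reinforcement Learning through Intelligent Information Aggregation},
  author    = {Nayak, Siddharth and Choi, Kenneth and Ding, Wenqi and Dolan, Sydney and Gopalakrishnan, Karthik and Balakrishnan, Hamsa},
  booktitle = {International Conference on Machine Learning (ICML)},
  year      = {2023}
}
```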
Related Links
- Application of InforMARL for space traffic management with minimum information sharing.
Acknowledgements
The authors would like to thank the
MIT SuperCloud
and the Lincoln Laboratory Supercomputing Center for providing
high performance computing resources that have contributed to
the research results reported within this paper.
The NASA University Leadership Initiative (grant #80NSSC20M0163)
provided funds to assist the authors with their research,
but this article solely reflects the opinions and conclusions
of its authors and not any NASA entity. This research was
sponsored in part by the United States AFRL and the United
States Air Force Artificial Intelligence Accelerator and was
accomplished under Cooperative Agreement Number FA8750-19-2-1000.
The views and conclusions contained in this document are
those of the authors and should not be interpreted as
representing the official policies, either expressed or implied,
of the United States Air Force or the U.S. Government.
The U.S. Government is authorized to reproduce and distribute
reprints for Government purposes notwithstanding any copyright
notation herein. Sydney Dolan was supported by the National
Science Foundation Graduate Research Fellowship under Grant No. 1650114.
This website template was borrowed from Michaël Gharbi and
Matthew Tannick.