LLaMAR: Long-Horizon Planning for Multi-Agent Robots in Partially Observable Environments
NeurIPS 2024
- Siddharth Nayak MIT
- Adelmo Morrison Orozco MIT
- Marina Ten Have MIT
- Vittal Thirumalai MIT
- Jackson Zhang MIT
- Darren Chen MIT
- Aditya Kapoor TCS RnI
- Eric Robinson USAF-MIT AI Accelerator
- Karthik Gopalakrishnan Stanford University
- James Harrison Google DeepMind
- Brian Ichter Google DeepMind
- Anuj Mahajan Apple
- Hamsa Balakrishnan MIT
Abstract
The ability of Language Models (LMs) to understand natural
language makes them a powerful tool for parsing human
instructions into task plans for autonomous robots.
Unlike traditional planning methods that rely on domain-specific
knowledge and handcrafted rules, LMs generalize from
diverse data and adapt to various tasks with minimal
tuning, acting as a compressed knowledge base. However,
LMs in their standard form face challenges with
long-horizon tasks, particularly in partially observable
multi-agent settings. We propose an LM-based Long-Horizon
Planner for Multi-Agent Robotics (LLaMAR), a cognitive
architecture for planning that achieves state-of-the-art
results in long-horizon tasks within partially observable
environments. LLaMAR employs a plan-act-correct-verify
framework, allowing self-correction from action execution
feedback without relying on oracles or simulators.
Additionally, we present MAP-THOR, a comprehensive test
suite encompassing household tasks of varying complexity
within the AI2-THOR environment. Experiments show that
LLaMAR achieves a 30% higher success rate compared to
other state-of-the-art LM-based multi-agent planners.
Demo
LLaMAR leverages LMs to generalize from diverse data,
acting as a compressed knowledge base for planning without
relying on domain-specific knowledge or handcrafted rules.
This approach grounds LLMs to the environment by
incorporating real-time feedback and observations.
LLaMAR: Approach
Grounding LLMs to the environment means ensuring that the plans and actions the language models generate correspond accurately to the environment and its dynamics. In LLaMAR, we achieve this through action execution feedback, which allows agents to dynamically adjust their strategies, and through observation and memory modules, which let agents build a comprehensive understanding of the environment over time.
How does it work? LLaMAR enables agents to:
- Plan subtasks for task completion by creating two lists: open subtasks and closed subtasks
- Select high-level actions for agents to complete subtasks
- Identify and correct failures post-action execution using visual feedback
- Verify subtask completion based on action execution and update the open and closed subtask lists
LLaMAR’s cognitive architecture includes four specialized modules that collectively enhance its planning capabilities:
- Planner Module: This module breaks down high-level tasks into feasible subtasks based on current observations and memory.
- Actor Module: The Actor module selects high-level actions for each agent to execute the planned subtasks. It takes into account the corrective actions suggested by the Corrector module and updates the shared memory with the outcomes of these actions.
- Corrector Module: When an action fails, this module suggests plausible reasons for the failure and corrective actions for the subsequent timestep, which prevents the Actor Module from choosing the same action again.
- Verifier Module: This module checks whether subtasks are complete using current observations, the outcomes of executed actions, and memory. This distinguishes it from other self-verification methods, which rely on their internal environment model.
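The interaction between the four modules can be sketched as a simple control loop. This is a minimal illustration, not the authors' implementation: in LLaMAR each module is an LM call, whereas here the modules are plain Python callables, and the `Memory` layout, the `run_episode` function, and the toy fridge world are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Obs = Dict[str, object]
Actions = Dict[str, str]  # agent name -> high-level action


@dataclass
class Memory:
    """Shared memory (hypothetical layout, not the authors' data structure)."""
    open_subtasks: List[str] = field(default_factory=list)
    closed_subtasks: List[str] = field(default_factory=list)
    correction: Optional[str] = None  # latest suggestion from the Corrector


def run_episode(task: str, observe, execute, planner, actor, corrector, verifier,
                max_steps: int = 30) -> Tuple[bool, int, Memory]:
    memory = Memory()
    for step in range(1, max_steps + 1):
        obs = observe()
        # Plan: refresh the open list, skipping subtasks already verified.
        memory.open_subtasks = [s for s in planner(task, obs, memory)
                                if s not in memory.closed_subtasks]
        # Act: choose one high-level action per agent and execute it.
        actions = actor(obs, memory)
        outcomes = execute(actions)  # agent name -> True if the action succeeded
        # Correct: only consulted when some action failed.
        failed = {a: act for a, act in actions.items() if not outcomes[a]}
        memory.correction = corrector(obs, failed, memory) if failed else None
        # Verify: move completed subtasks from the open list to the closed list.
        for subtask in verifier(observe(), actions, memory):
            memory.open_subtasks.remove(subtask)
            memory.closed_subtasks.append(subtask)
        if not memory.open_subtasks:
            return True, step, memory
    return False, max_steps, memory


# --- Toy single-agent world standing in for AI2-THOR (purely illustrative) ---
world = {"fridge_open": False, "in_fridge": set()}
ITEMS = ("apple", "bread")


def observe() -> Obs:
    return {"fridge_open": world["fridge_open"], "in_fridge": set(world["in_fridge"])}


def execute(actions: Actions) -> Dict[str, bool]:
    outcomes = {}
    for agent, action in actions.items():
        if action == "open fridge":
            world["fridge_open"] = True
            outcomes[agent] = True
        elif world["fridge_open"]:
            world["in_fridge"].add(action.split()[1])  # "put <item> in fridge"
            outcomes[agent] = True
        else:
            outcomes[agent] = False  # cannot place items in a closed fridge
    return outcomes


def planner(task, obs, memory):
    return [f"put {item} in fridge" for item in ITEMS if item not in obs["in_fridge"]]


def actor(obs, memory):
    # Follow the Corrector's suggestion if there is one, else take the first open subtask.
    return {"agent_0": memory.correction or memory.open_subtasks[0]}


def corrector(obs, failed, memory):
    return "open fridge"


def verifier(obs, actions, memory):
    return [s for s in memory.open_subtasks if s.split()[1] in obs["in_fridge"]]


success, steps, memory = run_episode("Put the apple and bread in the fridge",
                                     observe, execute, planner, actor, corrector, verifier)
print(success, steps, memory.closed_subtasks)
# True 4 ['put apple in fridge', 'put bread in fridge']
```

In this toy run the first placement fails because the fridge is closed, the corrector proposes opening it, and the verifier closes each subtask only after the observation confirms the item is in the fridge, so verification relies on feedback from the environment rather than on an internal model.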
MAP-THOR
To evaluate the performance of LLaMAR and benchmark other
baseline methods, we create a benchmark dataset of tasks
which we call MAP-THOR (Multi-Agent Planning tasks in AI2-THOR).
While Smart-LLM (Kannan et al., 2023) introduces a dataset of
36 tasks within AI2-THOR (Kolve et al., 2017) classified by
complexity, their tasks are limited to single floor plans.
This limitation hinders testing the robustness of planners
across different room layouts. Additionally, some tasks in
their dataset cannot be performed by multiple agents,
regardless of task division, such as Pick up the pillow,
Open the laptop to turn it on, and Turn off the lamp.
By contrast, MAP-THOR includes tasks solvable by both single
and multiple agents. We classify the tasks into four
categories based on the ambiguity of the language instructions.
To test the planner robustness, we provide five different
floor plans for each task. We also include automatic checker
modules to verify subtask completion and evaluate plan quality.
Our dataset comprises 45 tasks, each defined for five distinct
floor plans, ensuring comprehensive testing and evaluation.
Task Classification
We conduct experiments with tasks of varying difficulty levels, where greater task difficulty corresponds to greater ambiguity in the language instructions.
To summarize, the tasks in MAP-THOR are classified into four categories:
- Explicit item type, quantity, and target location: Agents are explicitly instructed to transport specific items to specific target locations. For example, Put bread, lettuce, and a tomato in the fridge clearly defines the objects (tomato, lettuce, bread) and the target (fridge).
- Explicit item type and target location but implicit item quantity: The object type is explicitly described, but its quantity is not disclosed. For example, Put all the apples in the fridge. Agents must explore the environment to locate all specified items and also predict when to stop.
- Explicit target location but implicit item types and quantity: The target location is explicitly defined but the item types and their quantities are concealed. For example, Put all groceries in the fridge.
- Implicit target location, item type, and quantity: Item types and their quantities along with the target location are implicitly defined. For example, Clear the floor by placing the items at their appropriate positions. The agent is expected to place items like pens, books, and laptops on the study table, and litter in the trash can.
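The four categories above differ only in which of three pieces of information (item type, item quantity, target location) the instruction states explicitly. A minimal sketch of that mapping follows. The `TaskCategory`, `Instruction`, and `categorize` names are hypothetical, and the explicit/implicit flags are assumed to be hand-annotated per task rather than inferred from the text.

```python
from enum import Enum
from typing import NamedTuple


class TaskCategory(Enum):
    EXPLICIT_ALL = 1                # item type, quantity, and target all explicit
    IMPLICIT_QUANTITY = 2           # item type and target explicit, quantity implicit
    IMPLICIT_TYPE_AND_QUANTITY = 3  # only the target location is explicit
    IMPLICIT_ALL = 4                # target, item type, and quantity all implicit


class Instruction(NamedTuple):
    text: str
    type_explicit: bool
    quantity_explicit: bool
    target_explicit: bool


# (type_explicit, quantity_explicit, target_explicit) -> category
_CATEGORY_BY_FLAGS = {
    (True, True, True): TaskCategory.EXPLICIT_ALL,
    (True, False, True): TaskCategory.IMPLICIT_QUANTITY,
    (False, False, True): TaskCategory.IMPLICIT_TYPE_AND_QUANTITY,
    (False, False, False): TaskCategory.IMPLICIT_ALL,
}


def categorize(instr: Instruction) -> TaskCategory:
    """Map the annotated flags to one of the four categories listed above."""
    flags = (instr.type_explicit, instr.quantity_explicit, instr.target_explicit)
    try:
        return _CATEGORY_BY_FLAGS[flags]
    except KeyError:
        raise ValueError(f"combination {flags} is not one of the four categories") from None


examples = [
    Instruction("Put bread, lettuce, and a tomato in the fridge", True, True, True),
    Instruction("Put all the apples in the fridge", True, False, True),
    Instruction("Put all groceries in the fridge", False, False, True),
    Instruction("Clear the floor by placing the items at their appropriate positions",
                False, False, False),
]
for instr in examples:
    print(categorize(instr).value, instr.text)
```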
Metrics
We propose to use the following metrics to compare the performance of different algorithms on the tasks:
- Success Rate (SR): The fraction of episodes in which all subtasks are completed. Success equals 1 if all subtasks are successfully executed in an episode, otherwise it is 0.
- Transport Rate (TR): The fraction of subtasks completed within an episode, which provides a finer granularity of task completion.
- Coverage (C): The fraction of successful interactions with target objects. It is useful to verify if the LMs can infer the objects to interact with, in scenarios where the tasks have objects that are specified implicitly.
- Balance (B): The ratio between the minimum and maximum number of successful high-level actions executed by any agent that contributed towards making progress in a subtask necessary for the completion of the language instruction task. We only check the subset of high-level actions that must be executed to accomplish the critical subtasks that lead to the successful completion of the language instruction task.
- Average Steps (L): The number of high-level actions taken by the team to complete the task, capped at T=30 in our experiments. If the task is not completed within T steps, the episode is deemed a failure.
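The metric definitions above can be made concrete with a short sketch. The `Episode` record and function names are hypothetical, the two episodes are made-up illustrative numbers rather than results from the paper, and two choices are assumptions on our part: per-episode values are averaged across episodes, and Balance is taken as 0 when no agent executed a successful critical action.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List

T = 30  # step cap per episode, as stated in the text


@dataclass
class Episode:
    subtasks_done: int
    subtasks_total: int
    targets_hit: int          # target objects successfully interacted with
    targets_total: int
    agent_actions: List[int]  # successful critical high-level actions, one count per agent
    steps: int                # high-level actions taken by the team


def success(ep: Episode) -> float:
    return 1.0 if ep.subtasks_done == ep.subtasks_total else 0.0


def transport_rate(ep: Episode) -> float:
    return ep.subtasks_done / ep.subtasks_total


def coverage(ep: Episode) -> float:
    return ep.targets_hit / ep.targets_total


def balance(ep: Episode) -> float:
    most = max(ep.agent_actions)
    # Assumption: define Balance as 0 when no agent made any successful action.
    return min(ep.agent_actions) / most if most > 0 else 0.0


def steps(ep: Episode) -> float:
    return float(min(ep.steps, T))


def summarize(episodes: List[Episode]) -> Dict[str, float]:
    # Assumption: each metric is averaged over episodes.
    return {
        "SR": mean(success(e) for e in episodes),
        "TR": mean(transport_rate(e) for e in episodes),
        "C": mean(coverage(e) for e in episodes),
        "B": mean(balance(e) for e in episodes),
        "L": mean(steps(e) for e in episodes),
    }


episodes = [
    Episode(subtasks_done=3, subtasks_total=3, targets_hit=3, targets_total=3,
            agent_actions=[2, 4], steps=12),   # completed task
    Episode(subtasks_done=1, subtasks_total=2, targets_hit=2, targets_total=4,
            agent_actions=[1, 4], steps=30),   # hit the step cap, counted as a failure
]
summary = summarize(episodes)
print(summary)
# {'SR': 0.5, 'TR': 0.75, 'C': 0.75, 'B': 0.375, 'L': 21.0}
```

The second episode shows why Transport Rate is finer-grained than Success Rate: it contributes 0 to SR but 0.5 to TR because one of its two subtasks was completed.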
Experiments
Choice of the underlying LM
Citation
If you find our work or code useful in your research, please consider citing the following:
Acknowledgements
We would like to thank Keerthana Gopalakrishnan, Sydney Dolan,
Jasmine Aloor, and Victor Qin for helpful discussions and feedback.
OpenAI credits for GPT-4 access were provided through OpenAI's
Researcher Access Program. The research was sponsored by the
United States Air Force Research Laboratory and the
Department of the Air Force Artificial Intelligence Accelerator
and was accomplished under Cooperative Agreement Number FA8750-19-2-1000.
The views and conclusions contained in this document are those
of the authors and should not be interpreted as representing
the official policies, either expressed or implied, of the
Department of the Air Force or the U.S. Government. The U.S.
Government is authorized to reproduce and distribute reprints
for Government purposes notwithstanding any copyright notation
herein.
This website template was borrowed from Michaël Gharbi and
Matthew Tancik.