LLaMAR: Long-Horizon Planning for Multi-Agent Robots in Partially Observable Environments
Preprint

Abstract

The ability of Language Models (LMs) to understand natural language makes them a powerful tool for parsing human instructions into task plans for autonomous robots. Unlike traditional planning methods that rely on domain-specific knowledge and handcrafted rules, LMs generalize from diverse data and adapt to various tasks with minimal tuning, acting as a compressed knowledge base. However, LMs in their standard form face challenges with long-horizon tasks, particularly in partially observable multi-agent settings. We propose an LM-based Long-Horizon Planner for Multi-Agent Robotics (LLaMAR), a cognitive architecture for planning that achieves state-of-the-art results in long-horizon tasks within partially observable environments. LLaMAR employs a plan-act-correct-verify framework, allowing self-correction from action execution feedback without relying on oracles or simulators. Additionally, we present MAP-THOR, a comprehensive test suite encompassing household tasks of varying complexity within the AI2-THOR environment. Experiments show that LLaMAR achieves a 30% higher success rate compared to other state-of-the-art LM-based multi-agent planners.

Demo

Alice's POV
Bob's POV
Alice and Bob are completing the task "Put all groceries in the fridge" in the AI2-THOR environment. The agents split up the task amongst themselves and work together to complete the task.


LLaMAR leverages LMs to generalize from diverse data, acting as a compressed knowledge base for planning without relying on domain-specific knowledge or handcrafted rules. This approach grounds the LMs in the environment by incorporating real-time feedback and observations.

LLaMAR: Approach

LLaMAR consists of four LM-based modules, as shown in Figure 3. We describe each in detail below.


Figure 3:
Overview of LLaMAR’s modular cognitive architecture. LLaMAR leverages LMs within four key modules: Planner, Actor, Corrector, and Verifier, each with specific roles. The Planner breaks down the high-level language instruction into feasible subtasks to achieve the environment goal. The Actor determines the high-level actions each agent should perform. These actions trigger low-level policies that generate and execute a sequence of primitive actions in sync across all agents. Based on execution feedback, the Corrector suggests corrections for high-level actions and the Verifier Module validates completion of subtasks.


Grounding LMs in the environment involves ensuring that the generated plans and actions correspond accurately to the environment and its dynamics. In LLaMAR, we achieve this through action execution feedback, which allows agents to dynamically adjust their strategies, and through observation and memory modules that let agents build a comprehensive understanding of the environment over time.
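To make this concrete, the sketch below shows one way such a shared memory of observations and execution outcomes could be maintained in code; the class, field names, and prompt serialization are illustrative assumptions rather than LLaMAR's actual data structures.

```python
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    """Illustrative shared memory that accumulates grounding signals over time."""
    observations: list = field(default_factory=list)      # (agent, observation summary) per step
    action_outcomes: list = field(default_factory=list)   # (agent, action, success, failure_reason)

    def record_step(self, agent, observation, action, success, failure_reason=None):
        # Store what each agent saw and whether its action succeeded, so that later
        # LM prompts can condition on real execution feedback rather than assumptions.
        self.observations.append((agent, observation))
        self.action_outcomes.append((agent, action, success, failure_reason))

    def as_prompt_context(self, last_k=5):
        # Serialize the most recent history into text that can be appended to a prompt.
        lines = [
            f"{agent}: {action} -> " + ("success" if ok else f"failed ({reason})")
            for agent, action, ok, reason in self.action_outcomes[-last_k:]
        ]
        return "\n".join(lines)
```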

How does it work? As sketched in the example following this list, LLaMAR enables agents to:
  • Plan subtasks for task completion by creating two lists: open subtasks and closed subtasks
  • Select high-level actions for agents to complete subtasks
  • Identify and correct failures post-action execution using visual feedback
  • Verify subtask completion based on action execution and modify the open and closed subtasks list
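
The loop below is a minimal sketch of this plan-act-correct-verify cycle; `planner`, `actor`, `corrector`, and `verifier` stand in for LM calls, and `env.observe`/`env.execute` for the environment interface, none of which are the paper's exact APIs.

```python
def run_episode(task, env, memory, planner, actor, corrector, verifier, max_steps=30):
    """Illustrative plan-act-correct-verify loop; not the exact LLaMAR implementation."""
    open_subtasks, closed_subtasks = [], []
    corrections = None
    for _ in range(max_steps):
        obs = env.observe()                                  # per-agent observations
        # Plan: refresh the open-subtask list from the instruction, observations, and memory.
        open_subtasks = planner(task, obs, memory, open_subtasks, closed_subtasks)
        # Act: pick one high-level action per agent, honoring prior corrections.
        actions = actor(open_subtasks, obs, memory, corrections)
        feedback = env.execute(actions)                      # success/failure for each action
        # Correct: reason about failures so the actor avoids repeating them next step.
        corrections = corrector(actions, feedback, obs, memory)
        memory.update(obs, actions, feedback, corrections)
        # Verify: move completed subtasks from the open list to the closed list.
        open_subtasks, closed_subtasks = verifier(
            open_subtasks, closed_subtasks, actions, feedback, obs, memory)
        if not open_subtasks:
            return True                                      # all subtasks verified complete
    return False
```

Passing the four modules in as callables keeps the control loop agnostic to how each module queries the underlying LM.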


LLaMAR’s cognitive architecture includes four specialized modules that collectively enhance its planning capabilities; an illustrative sketch of how each module could be realized follows the list:
  • Planner Module: This module breaks down high-level tasks into feasible subtasks based on current observations and memory.
  • Actor Module: The Actor module selects high-level actions for each agent to execute the planned subtasks. It takes into account the corrective actions suggested by the Corrector module and updates the shared memory with the outcomes of these actions.
  • Corrector Module: This module identifies plausible reasons for action failures and suggests corrective actions for the subsequent timestep, preventing the Actor module from repeating the same failed action.
  • Verifier Module: This module checks subtask completion using current observations, executed actions, and memory, distinguishing itself from self-verification methods that rely on an internal environment model.
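
As a rough illustration of how these modules might each be realized as a separate LM query, the sketch below wires them to a generic `query_lm` call; the prompts, signatures, and client are assumptions, not the paper's actual prompts or API.

```python
def query_lm(system_prompt, user_prompt, images=None):
    """Placeholder for a call to the underlying (vision-)language model, e.g. GPT-4V."""
    raise NotImplementedError

def planner(task, obs, memory, open_subtasks, closed_subtasks):
    # Decompose the instruction into feasible subtasks given what has been seen so far.
    return query_lm(
        "You decompose a household instruction into feasible subtasks.",
        f"Task: {task}\nObservations: {obs}\nMemory: {memory}\n"
        f"Open: {open_subtasks}\nClosed: {closed_subtasks}\nReturn the updated open subtasks.")

def actor(open_subtasks, obs, memory, corrections=None):
    # Choose one high-level action per agent, taking the corrector's suggestions into account.
    return query_lm(
        "You assign one high-level action to each agent.",
        f"Open subtasks: {open_subtasks}\nObservations: {obs}\n"
        f"Memory: {memory}\nCorrections: {corrections}")

def corrector(actions, feedback, obs, memory):
    # Diagnose why actions failed and propose alternatives for the next timestep.
    return query_lm(
        "You explain action failures and suggest corrective actions.",
        f"Actions: {actions}\nExecution feedback: {feedback}\nObservations: {obs}\nMemory: {memory}")

def verifier(open_subtasks, closed_subtasks, actions, feedback, obs, memory):
    # Decide which open subtasks are now complete based on execution outcomes.
    return query_lm(
        "You verify which subtasks are complete.",
        f"Open: {open_subtasks}\nClosed: {closed_subtasks}\nActions: {actions}\n"
        f"Feedback: {feedback}\nObservations: {obs}\nMemory: {memory}")
```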

MAP-THOR

To evaluate the performance of LLaMAR and benchmark other baseline methods, we create a benchmark dataset of tasks which we call MAP-THOR (Multi-Agent Planning tasks in AI2-THOR). While Smart-LLM (Kannan et al., 2023) introduces a dataset of 36 tasks within AI2-THOR (Kolve et al., 2017) classified by complexity, their tasks are limited to single floor plans. This limitation hinders testing the robustness of planners across different room layouts. Additionally, some tasks in their dataset cannot be performed by multiple agents, regardless of task division, such as "Pick up the pillow", "Open the laptop to turn it on", and "Turn off the lamp".

By contrast, MAP-THOR includes tasks solvable by both single and multiple agents. We classify the tasks into four categories based on the ambiguity of the language instructions. To test the planner robustness, we provide five different floor plans for each task. We also include automatic checker modules to verify subtask completion and evaluate plan quality. Our dataset comprises 45 tasks, each defined for five distinct floor plans, ensuring comprehensive testing and evaluation.
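
As one example of what an automatic checker can look like, the snippet below verifies progress on "Put all groceries in the fridge" from AI2-THOR object metadata; the grocery type list and the metadata fields used ("objectType", "objectId", "parentReceptacles") are assumptions that may need adjusting for your AI2-THOR version.

```python
GROCERY_TYPES = {"Apple", "Bread", "Tomato", "Lettuce", "Potato"}  # illustrative object types

def check_groceries_in_fridge(event):
    """Illustrative checker for the groceries-in-fridge task using AI2-THOR metadata."""
    objects = event.metadata["objects"]
    groceries = [o for o in objects if o["objectType"] in GROCERY_TYPES]
    fridge_ids = {o["objectId"] for o in objects if o["objectType"] == "Fridge"}

    def in_fridge(obj):
        # parentReceptacles lists the objectIds of receptacles containing this object (or None).
        parents = obj.get("parentReceptacles") or []
        return any(pid in fridge_ids for pid in parents)

    done = sum(in_fridge(o) for o in groceries)
    # Return (completed, total) so callers can compute a per-task completion rate.
    return done, len(groceries)
```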

Task Classification

We conduct experiments with tasks of varying difficulty, where increased difficulty corresponds to increased ambiguity in the language instructions. The tasks in MAP-THOR are classified into four categories:
  • Explicit item type, quantity, and target location: Agents are explicitly instructed to transport specific items to specific target locations. For example, put bread, lettuce, and a tomato in the fridge clearly defines the objects (tomato, lettuce, bread) and the target (fridge).
  • Explicit item type and target location but implicit item quantity: The object type is explicitly described, but its quantity is not disclosed. For example, Put all the apples in the fridge. Agents must explore the environment to locate all specified items and also predict when to stop.
  • Explicit target location but implicit item types and quantity: The target location is explicitly defined but the item types and their quantities are concealed. For example, Put all groceries in the fridge.
  • Implicit target location, item type, and quantity: Item types and their quantities, along with the target location, are implicitly defined. For example, Clear the floor by placing the items at their appropriate positions. The agent is expected to place items like pens, books, and laptops on the study table, and litter in the trash can.


Metrics

We propose to use the following metrics to compare the performance of different algorithms on the tasks:
  • Success Rate (SR): The fraction of episodes in which all subtasks are completed. Success equals 1 if all subtasks are successfully executed in an episode, otherwise it is 0.
  • Transport Rate (TR): The fraction of subtasks completed within an episode, which provides a finer granularity of task completion.
  • Coverage (C): The fraction of successful interactions with target objects. It is useful to verify if the LMs can infer the objects to interact with, in scenarios where the tasks have objects that are specified implicitly.
  • Balance (B): The ratio between the minimum and maximum number of successful high-level actions executed by any agent that contributed progress towards the subtasks needed to complete the language instruction. We only count the subset of high-level actions required to accomplish the critical subtasks that lead to successful completion of the task.
  • Average Steps (L): The number of high-level actions taken by the team to complete the task, capped at T=30 in our experiments. If the task is not completed within T steps, the episode is deemed a failure.
For all the metrics, we propose to report the means along with the 95% confidence interval across all the tasks. Since SR is a binomial metric, we report the Clopper-Pearson Interval as the confidence interval.
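
The snippet below sketches how these metrics and the Clopper-Pearson interval for SR could be computed from per-episode logs; the episode-log keys are assumptions made for illustration.

```python
from scipy.stats import beta

def clopper_pearson(successes, trials, alpha=0.05):
    """Exact (Clopper-Pearson) confidence interval for a binomial proportion."""
    lo = 0.0 if successes == 0 else beta.ppf(alpha / 2, successes, trials - successes + 1)
    hi = 1.0 if successes == trials else beta.ppf(1 - alpha / 2, successes + 1, trials - successes)
    return lo, hi

def summarize(episodes):
    """episodes: list of dicts with illustrative keys 'subtasks_done', 'subtasks_total',
    'covered', 'targets', 'actions_per_agent', and 'steps'."""
    n = len(episodes)
    successes = sum(e["subtasks_done"] == e["subtasks_total"] for e in episodes)
    tr = sum(e["subtasks_done"] / e["subtasks_total"] for e in episodes) / n
    cov = sum(e["covered"] / e["targets"] for e in episodes) / n
    # Balance: per episode, min over agents of task-relevant successful actions divided by the max.
    bal = sum(min(e["actions_per_agent"]) / max(e["actions_per_agent"]) for e in episodes) / n
    steps = sum(e["steps"] for e in episodes) / n
    return {"SR": successes / n, "SR_95CI": clopper_pearson(successes, n),
            "TR": tr, "C": cov, "B": bal, "L": steps}
```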

Experiments

Choice of the underlying LM


Table 2:
To understand the impact of the underlying LM’s quality on decision-making, we experiment with different LMs. Specifically, we use both the language-only and vision-language variants of GPT-4, along with IDEFICS-2, LLaVA, and CogVLM. Among these, GPT-4 used solely with text inputs exhibits the poorest performance, which we attribute to the agents’ inability to reason about visual observations; this is particularly detrimental for the Corrector module. Substituting GPT-4V with the other vision-language models also results in a decline in performance, hence we use GPT-4V as the underlying VLM when comparing against the baselines.

Acknowledgements


We would like to thank Keerthana Gopalakrishnan, Sydney Dolan, Jasmine Aloor, and Victor Qin for helpful discussions and feedback. OpenAI credits for GPT-4 access were provided through OpenAI's Researcher Access Program. The research was sponsored by the United States Air Force Research Laboratory and the Department of the Air Force Artificial Intelligence Accelerator and was accomplished under Cooperative Agreement Number FA8750-19-2-1000. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Department of the Air Force or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

This website template was borrowed from Michaël Gharbi and Matthew Tancik.