MIT's HiP: A Multimodal Approach to Robotic Planning

MIT's Improbable AI Lab, part of the Computer Science and Artificial Intelligence Laboratory (CSAIL), has introduced a novel framework named Compositional Foundation Models for Hierarchical Planning (HiP). Published on the arXiv preprint server, HiP aims to enhance robotic planning by utilizing three distinct foundation models, each trained on different data modalities.

Unlike existing multimodal models such as RT-2, HiP avoids the need for paired vision, language, and action data. Instead, it employs a trio of foundation models, each capturing a different aspect of decision-making. According to NVIDIA AI researcher Jim Fan, HiP decomposes the complex task of embodied agent planning into three constituent models, making the decision-making process more tractable and transparent.

The potential applications of HiP extend to household chores, multistep construction, and manufacturing tasks. By leveraging linguistic, physical, and environmental intelligence, the system aims to assist robots in achieving long-horizon goals.

The three components of HiP's planning process operate in a hierarchical manner, as sketched below. At the base is a large language model (LLM) that formulates abstract task plans using common sense knowledge drawn from the internet. A video diffusion model then incorporates geometric and physical information learned from online footage, iteratively refining the LLM's plan into a visual trajectory. The top layer is an egocentric action model, which maps that trajectory to concrete task execution based on first-person images of the environment.
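To make the hierarchy concrete, here is a minimal, hypothetical Python sketch of how three such models might be composed. The stub functions (`language_planner`, `video_planner`, `action_model`), their signatures, and the refinement loop are illustrative assumptions, not the authors' implementation; in the real system each stub would be replaced by the corresponding foundation model.

```python
# Hypothetical sketch of HiP-style hierarchical planning (not the authors' code).
# Each "model" below is a stand-in stub for the LLM, the video diffusion model,
# and the egocentric action model described in the article.

from dataclasses import dataclass


@dataclass
class Observation:
    """Placeholder for an egocentric image or video frame."""
    description: str


def language_planner(goal: str) -> list[str]:
    """Stub for the LLM: decomposes a goal into abstract subgoals."""
    # A real system would prompt a large language model; here we fake a decomposition.
    return [f"{goal}: step {i}" for i in range(1, 4)]


def video_planner(subgoal: str, obs: Observation, refinement_steps: int = 3) -> list[Observation]:
    """Stub for the video diffusion model: imagines a visual trajectory toward
    the subgoal, refined over several iterations (analogous to denoising steps)."""
    trajectory = [obs]
    for step in range(refinement_steps):
        trajectory.append(Observation(f"{subgoal} (refined frame {step + 1})"))
    return trajectory


def action_model(prev: Observation, nxt: Observation) -> str:
    """Stub for the egocentric action model: maps consecutive imagined frames
    to a low-level action."""
    return f"move from '{prev.description}' to '{nxt.description}'"


def hierarchical_plan(goal: str, initial_obs: Observation) -> list[str]:
    """Compose the three levels: language -> video -> action."""
    actions: list[str] = []
    obs = initial_obs
    for subgoal in language_planner(goal):
        trajectory = video_planner(subgoal, obs)
        for prev, nxt in zip(trajectory, trajectory[1:]):
            actions.append(action_model(prev, nxt))
        obs = trajectory[-1]  # continue planning from the last imagined state
    return actions


if __name__ == "__main__":
    plan = hierarchical_plan(
        "stack the red block on the blue block",
        Observation("tabletop with scattered blocks"),
    )
    for action in plan:
        print(action)
```

The key design point the sketch tries to capture is the one-way flow of abstraction: language proposes subgoals, video imagination grounds them in plausible physical trajectories, and the action model translates imagined frames into executable steps, with each level re-planning from the state produced by the one below.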

In testing, HiP outperformed comparable frameworks in manipulation tasks. Its adaptability and intelligent planning were evident in scenarios like stacking differently colored blocks, arranging objects in a box, and completing kitchen sub-goals. The system's ability to adjust plans based on new information showcased its potential for real-world applications.

The researchers envision augmenting HiP with pre-trained models capable of processing touch and sound, further enhancing its planning capabilities. Although the approach depends on the availability of high-quality video foundation models, HiP's cost-effective training and demonstrated performance suggest a promising avenue for robotic planning across a range of domains.