Luban: Building Open-Ended Creative Agents via Autonomous Embodied Verification

Yuxuan Guo, Shaohui Peng, Jiaming Guo, Di Huang, Xishan Zhang, Rui Zhang, Yifan Hao, Ling Li, Zikang Tian, Mingju Gao, Yutai Li, Yiming Gan, Shuai Liang, Zihao Zhang, Zidong Du, Qi Guo, Xing Hu*, Yunji Chen*

University of Science and Technology of China & State Key Lab of Processors, ICT, CAS

*Corresponding Author

Figure 1: (a) Agents for well-defined long-horizon tasks vs. (b) the Luban agent for creative tasks.

Abstract

Building open agents has always been the ultimate goal in AI research, and creative agents are all the more enticing. Existing LLM agents excel at long-horizon tasks with well-defined goals (e.g., "mine diamonds" in Minecraft). However, they struggle on creative tasks with open goals and abstract criteria because they cannot bridge the gap between the two, and so lack the feedback needed for self-improvement while solving the task. In this work, we introduce autonomous embodied verification techniques that let agents fill this gap, laying the groundwork for creative tasks. Specifically, we propose the Luban agent, which targets creative building tasks in Minecraft and is equipped with two-level autonomous embodied verification inspired by human design practices: (1) visual verification of 3D structural speculations, which come from agent-synthesized CAD modeling programs; (2) pragmatic verification of the creation by generating and verifying environment-relevant functionality programs based on the abstract criteria. Extensive multi-dimensional human studies and Elo ratings show that Luban completes diverse creative building tasks in our proposed benchmark and outperforms other baselines (33% to 100%) in both visualization and pragmatism. Additional demos on a real-world robotic arm show Luban's creative potential in the physical world.
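For readers unfamiliar with Elo ratings, the following is a minimal sketch of the standard Elo update applied to one pairwise human preference between two agents' buildings. The K-factor of 32, the starting rating of 1000, and the function names are assumptions chosen for illustration; the page does not state the settings used in the study.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Return updated ratings after one comparison.

    score_a is 1.0 if A's building was preferred, 0.0 if B's was,
    and 0.5 for a tie. k=32 is an assumed K-factor, not a value
    taken from the paper.
    """
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b


# Hypothetical example: both agents start at 1000 and A is preferred once.
rating_a, rating_b = elo_update(1000.0, 1000.0, score_a=1.0)
```

With equal starting ratings the expected score is 0.5, so a single win moves the winner up and the loser down by the same amount (half of K).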

Method


Figure 2: Overview of the Luban agent. (a) The 3D structural speculation stage uses a VLM to synthesize the instructions I into a CAD program representing the building as a 3D object; this involves decomposition, subcomponent generation, and assembly. Visual verification evaluates the quality of the building from the rendered appearance of the CAD program's construction. (b) The construction stage uses a VLM to translate the building's 3D object program into executable construction actions that produce the building in the environment. Pragmatic verification evaluates the building's pragmatism by generating environment-relevant functionality annotations and action verification programs.
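The two verification levels in the caption can be pictured as a small control loop. The sketch below is a toy illustration only, not the authors' implementation: every name (`Candidate`, `speculate`, `visual_verification`, `pragmatic_verification`, `build`) is hypothetical, the VLM and renderer are replaced by plain Python callables, and the retry-with-feedback loop is an assumption about how verification results might be fed back.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Candidate:
    program: str               # stand-in for a synthesized CAD program
    visual_score: float = 0.0  # filled in by visual verification


def speculate(prompt: str, synthesize: Callable[[str, int], str], n: int) -> List[Candidate]:
    """Stage (a): propose n candidate structure programs."""
    return [Candidate(synthesize(prompt, seed)) for seed in range(n)]


def visual_verification(cands: List[Candidate], score_view: Callable[[str], float]) -> Candidate:
    """Score each candidate's appearance and keep the best one."""
    for c in cands:
        c.visual_score = score_view(c.program)
    return max(cands, key=lambda c: c.visual_score)


def pragmatic_verification(program: str, checks: Dict[str, Callable[[str], bool]]) -> List[str]:
    """Stage (b): run functionality checks and return names of failed ones."""
    return [name for name, check in checks.items() if not check(program)]


def build(instruction, synthesize, score_view, checks, n=3, max_rounds=2) -> Tuple[Candidate, List[str]]:
    """Assumed loop: speculate, pick by appearance, verify function, retry."""
    failed: List[str] = []
    best = None
    for _ in range(max_rounds):
        prompt = instruction if not failed else f"{instruction} (fix: {', '.join(failed)})"
        best = visual_verification(speculate(prompt, synthesize, n), score_view)
        failed = pragmatic_verification(best.program, checks)
        if not failed:
            break
    return best, failed


# Toy stand-ins for the VLM, the renderer-based scorer, and an action check.
def toy_synthesize(prompt: str, seed: int) -> str:
    parts = ["walls", "roof"][: seed + 1]
    if "door" in prompt:
        parts.append("door")
    return "+".join(parts)


toy_score = lambda program: float(len(program.split("+")))
toy_checks = {"door": lambda program: "door" in program.split("+")}

best, failed = build("build a house", toy_synthesize, toy_score, toy_checks)
```

In this toy run the first round selects a structure without a door, the failed "door" check is appended to the prompt, and the second round yields a candidate that passes. A real system would replace the string programs with CAD code, the scorer with a VLM judging rendered views, and the checks with action programs executed in the environment.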

Minecraft Showcases

Embodied Robotic Arm Showcases

BibTeX

@misc{2405.15414,
Author = {Yuxuan Guo and Shaohui Peng and Jiaming Guo and Di Huang and Xishan Zhang and Rui Zhang and Yifan Hao and Ling Li and Zikang Tian and Mingju Gao and Yutai Li and Yiming Gan and Shuai Liang and Zihao Zhang and Zidong Du and Qi Guo and Xing Hu and Yunji Chen},
Title = {Luban: Building Open-Ended Creative Agents via Autonomous Embodied Verification},
Year = {2024},
Eprint = {arXiv:2405.15414},
}