The Allen Institute for AI (Ai2) has unveiled MolmoAct, a model in a new class it calls Action Reasoning Models (ARMs), designed to bridge the gap between language understanding and spatial reasoning in robotics. Unlike traditional vision-language-action (VLA) models, which reason primarily in text, MolmoAct can think and plan in three-dimensional space.

Built on Ai2's open-source Molmo vision-language model family, MolmoAct addresses a fundamental limitation of current AI systems: primarily text-based models lack the spatial understanding needed to act in the physical world. The model operates in three key stages: grounding the scene with depth-aware perception tokens, planning waypoints in image space, and converting those plans into detailed robot commands.
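
As a rough illustration of that three-stage flow, the sketch below structures a single inference pass as perceive → plan → act. All class names, function names, and data shapes are hypothetical stand-ins, not MolmoAct's actual API; a real model would produce its tokens, waypoints, and commands from a learned policy rather than the stub functions shown here.

```python
"""Minimal sketch of an Action Reasoning Model's three-stage rollout.

Every name here is an illustrative placeholder that mirrors the pipeline
described in the article: perceive (depth-aware tokens) -> plan
(image-space waypoints) -> act (low-level robot commands).
"""
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PerceptionTokens:
    """Depth-aware tokens summarizing the scene (placeholder contents)."""
    tokens: List[float]


@dataclass
class Waypoint:
    """A planned point in image space: pixel coordinates plus estimated depth."""
    u: int
    v: int
    depth_m: float


def perceive(rgb_image) -> PerceptionTokens:
    # Stage 1: encode the camera frame into depth-aware perception tokens.
    # A real model would run a vision encoder here; we return dummy tokens.
    return PerceptionTokens(tokens=[0.0] * 8)


def plan_waypoints(instruction: str, scene: PerceptionTokens) -> List[Waypoint]:
    # Stage 2: reason over the instruction and scene tokens to produce a
    # trajectory of waypoints in image space.
    return [Waypoint(u=120, v=200, depth_m=0.45),
            Waypoint(u=180, v=160, depth_m=0.40)]


def decode_actions(waypoints: List[Waypoint]) -> List[Tuple[float, ...]]:
    # Stage 3: convert the image-space plan into low-level commands,
    # e.g. end-effector deltas (dx, dy, dz, gripper).
    return [(0.02, 0.00, -0.01, 1.0) for _ in waypoints]


if __name__ == "__main__":
    scene = perceive(rgb_image=None)                    # stand-in for a camera frame
    plan = plan_waypoints("pick up the mug", scene)
    actions = decode_actions(plan)
    print(f"{len(plan)} waypoints -> {len(actions)} commands")
```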

MolmoAct-7B is notably efficient to train. It was pre-trained on just 26.3 million samples using 256 NVIDIA H100 GPUs for about a day, then fine-tuned in roughly two hours on 64 H100s, whereas competing models have required 600 to 900 million samples and far more compute. On the SimplerEnv benchmark, MolmoAct achieved a state-of-the-art 72.1% success rate on out-of-distribution tasks, outperforming models from Physical Intelligence, Google, Microsoft, and NVIDIA.

The model also offers distinctive interpretability features: it overlays its planned motion trajectories directly onto the input images, and users can guide its actions by sketching paths on a device. True to Ai2's open-science mission, MolmoAct is fully open source, including training code, evaluation scripts, and a curated post-training dataset of roughly 10,000 robot episodes.
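
To give a sense of what such a trajectory overlay could look like in practice, the snippet below draws a set of image-space waypoints onto a camera frame with Pillow. This is a generic visualization sketch under assumed inputs, not Ai2's visualization code; the frame and waypoint coordinates are placeholders.

```python
"""Illustrative trajectory-overlay sketch (not Ai2's implementation).

Assumes a list of (u, v) image-space waypoints, such as those produced by
the planning stage sketched earlier, and draws them onto the input frame.
"""
from PIL import Image, ImageDraw


def overlay_trajectory(frame: Image.Image, waypoints, color=(255, 60, 60)):
    """Return a copy of the frame with the planned path and waypoints drawn on it."""
    out = frame.copy()
    draw = ImageDraw.Draw(out)
    if len(waypoints) > 1:
        draw.line(waypoints, fill=color, width=3)                 # planned motion path
    for (u, v) in waypoints:
        draw.ellipse([u - 4, v - 4, u + 4, v + 4], fill=color)    # waypoint marker
    return out


if __name__ == "__main__":
    frame = Image.new("RGB", (320, 240), "gray")                  # stand-in camera frame
    path = [(120, 200), (150, 180), (180, 160)]                   # model-planned or user-sketched
    overlay_trajectory(frame, path).save("trajectory_overlay.png")
```

The same drawing routine would apply whether the waypoints come from the model's own plan or from a path a user sketches to steer it.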