AI2's Molmo is bridging the gap between closed and open-source vision-capable models

The Allen Institute for AI has released Molmo, a family of open-source multimodal models that perform competitively with many closed-source vision-enabled models while being only a fraction of their size. Available in 72B, 7B, and 1B parameter sizes, Molmo models have been shown to rival, and in some cases surpass, proprietary models like GPT-4V, Gemini 1.5 Pro, and Claude 3.5 Sonnet. Additionally, Molmo models can point at what they see, potentially unlocking a new generation of use cases for vision-enabled AI systems.

Unlike many other open-source vision-enabled models, Molmo does not owe its impressive performance to synthetic data generated by proprietary models. Instead, two aspects are key. The first is that Molmo's entire pipeline, comprising its weights, code, data, and evaluations, is open and free of VLM distillation. The second is PixMo, a high-quality dataset collected using an innovative method that overcomes some limitations of human-annotated image captioning.

Having human annotators write freeform captions tends to produce descriptions that are not detailed enough, while enforcing a minimum word count adds delays or invites responses copied from proprietary VLMs. PixMo's collection method instead asks annotators to describe each image aloud for 60 to 90 seconds. The recordings are then transcribed and refined using language models. This strategy makes data collection faster and more affordable, with the added benefit that each recording functions as a sort of receipt for how the caption was produced.
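
To make the transcribe-and-refine step concrete, here is a minimal, hypothetical sketch in Python. The function names, the use of openai-whisper for speech-to-text, and the prompt wording are illustrative assumptions, not AI2's actual tooling.

```python
# Illustrative sketch only: the speech-to-text model, function names,
# and prompt below are assumptions, not AI2's published pipeline.
from typing import Callable

import whisper  # openai-whisper, used here as an example speech-to-text model


def transcribe_description(audio_path: str) -> str:
    """Transcribe a 60-90 second spoken image description."""
    model = whisper.load_model("base")
    result = model.transcribe(audio_path)
    return result["text"]


def refine_caption(transcript: str, language_model: Callable[[str], str]) -> str:
    """Ask a language model to turn the raw transcript into a clean caption."""
    prompt = (
        "Rewrite the following spoken image description as a detailed, "
        "well-structured written caption. Remove filler words and "
        "repetitions, but do not add information that is not present:\n\n"
        f"{transcript}"
    )
    return language_model(prompt)
```

Because every caption in this scheme is derived from an audio recording, the recording itself is the "receipt" mentioned above.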

Molmo's initial release includes a demo, inference code, a technical report on arXiv, and weights for four models: MolmoE-1B, Molmo-7B-O, Molmo-7B-D, and Molmo-72B, the family's most capable model. A more detailed technical report, the PixMo datasets, additional model weights and checkpoints, and training and evaluation code will follow soon.