
AI2's Molmo is bridging the gap between closed and open-source vision-capable models

The Allen Institute for AI has introduced Molmo, a family of open-source multimodal AI models that rival or surpass proprietary systems in performance, featuring novel data collection methods and unique pointing capabilities, with plans for full release of weights, datasets, and code.

by Ellie Ramirez-Camara
Credit: Allen Institute

The Allen Institute for AI has released Molmo, a family of open-source multimodal models that perform competitively with many closed-source vision-enabled models while being only a fraction of their size. Available in 72B, 7B, and 1B parameter sizes, the Molmo models have been shown to rival, and in some cases surpass, proprietary models such as GPT-4V, Gemini 1.5 Pro, and Claude 3.5 Sonnet. Additionally, Molmo models can learn to point at what they see, potentially unlocking a new generation of use cases for vision-enabled AI systems.
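As a rough illustration of how the released checkpoints might be tried out, the sketch below loads a Molmo checkpoint through Hugging Face Transformers and prompts it to point at an object. The repository id allenai/Molmo-7B-D-0924 and the processor.process / generate_from_batch helpers are assumptions based on the usage pattern published alongside the weights, not something stated in this article.

```python
# Illustrative only: the repo id and the processor.process / generate_from_batch helpers
# are assumptions drawn from the published model card, not an API documented here.
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

repo = "allenai/Molmo-7B-D-0924"  # assumed Hugging Face repository id
processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True, device_map="auto")
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True, device_map="auto")

# Any image works; here the model is asked to "point" at something it sees.
image = Image.open(requests.get("https://picsum.photos/id/237/536/354", stream=True).raw)
inputs = processor.process(images=[image], text="Point to the dog's nose.")
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

output = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer,
)
# Decode only the newly generated tokens (the answer, including any point coordinates).
answer = processor.tokenizer.decode(
    output[0, inputs["input_ids"].size(1):], skip_special_tokens=True
)
print(answer)
```

Pointing responses come back as text that encodes image coordinates, which an application can then overlay on the original image.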

Unlike other open-source vision-enabled models, Molmo does not owe its impressive performance to training on synthetic data output by proprietary models. Instead, two aspects are key to Molmo's performance. The first is that Molmo's entire pipeline, comprising its weights, code, data, and evaluations, is open and free of VLM distillation. The other is PixMo, a high-quality dataset collected with an innovative method that overcomes some of the limitations of human-annotated image captioning.

Asking human annotators to write freeform captions tends to yield descriptions that are not detailed enough, while enforcing a minimum word count introduces additional delays or invites responses copied from proprietary VLMs. PixMo's collection method sidesteps both problems: annotators record themselves describing an image for 60 to 90 seconds, and the spoken descriptions are then transcribed and refined using language models. The strategy makes data collection faster and more affordable, with the added benefit that the recording serves as a sort of receipt showing the caption came from a person rather than a model. A rough sketch of this flow appears below.
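The following is a minimal, purely illustrative sketch of a PixMo-style collection step as described above; the ASR model (openai-whisper) and the llm callable are stand-ins chosen for the example, not the tools AI2 actually used.

```python
# Illustrative sketch of the described flow: record -> transcribe -> LLM refinement.
# The ASR choice (openai-whisper) and the `llm` callable are assumptions for this example.
import whisper


def transcribe_description(audio_path: str) -> str:
    """Turn the annotator's 60-90 second spoken description into raw text."""
    asr = whisper.load_model("base")
    return asr.transcribe(audio_path)["text"]


def refine_caption(raw_transcript: str, llm) -> str:
    """Use a language model to clean up disfluencies and produce a dense caption.
    `llm` is any callable mapping a prompt string to a completion string (hypothetical)."""
    prompt = (
        "Rewrite the following spoken image description as a detailed, "
        "well-structured caption, preserving every visual detail mentioned:\n\n"
        f"{raw_transcript}"
    )
    return llm(prompt)


def collect_datapoint(audio_path: str, image_id: str, llm) -> dict:
    """One PixMo-style datapoint; the audio recording doubles as a receipt that
    the caption came from a human annotator rather than a proprietary VLM."""
    transcript = transcribe_description(audio_path)
    return {
        "image_id": image_id,
        "audio_receipt": audio_path,
        "raw_transcript": transcript,
        "caption": refine_caption(transcript, llm),
    }
```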

Molmo's initial release includes a demo, inference code, a technical report on arXiv, and weights for four models: MolmoE-1B, Molmo-7B-O, Molmo-7B-D, and Molmo-72B, the strongest model in the family. A more detailed technical report, the PixMo datasets, additional model weights and checkpoints, and training and evaluation code will follow soon.


