Hugging Face has released Idefics2, a multimodal model for the community

Hugging Face launched Idefics 2, the latest iteration of the Apache 2.0 licensed, 8B-parameter general multimodal model with enhanced OCR capabilities that performs at the top of its size class and competes with larger models such as LLava-Next-34B and MM1-30B-chat.

Idefics 2 is an Apache 2.0 licensed, 8B-parameter general multimodal model with enhanced OCR capabilities that can take text and images as input to answer questions about images, describe visual content, generate narratives grounded on image inputs, extract relevant information from documents, and do basic math. Its performance at visual question-answering benchmarks is at the top of its size class and competes with larger-sized models such as LLava-Next-34B and MM1-30B-chat. Idefics 2 is also integrated out-of-the-box in 🤗 Transformers for simplified fine-tuning.

Idefics 2 was trained on openly available datasets, including Wikipedia and OBELICS for interleaved web documents; Public Multimodal Dataset, LAION-COCO for image-caption pairs; PDFA (en), IDL and Rendered-text for OCR data, and WebSight for image-to-code. Instruction fine-tuning was achieved using The Cauldron, a multimodal instruction fine-tuning dataset released in parallel with Idefics 2. The Cauldron compiles 50 manually-curated datasets for multiturn conversations. Additional instruction fine-tuning was performed using text-based instruction fine-tuning datasets. Improvements over the previous Idefics version include image manipulation in the original resolution and aspect ratio to circumvent the need to resize images, enhanced OCR capabilities by feeding the model with data needing to be transcribed from images or documents, and a simpler visual features architecture. Additional information on the dataset, training, and license, plus a "getting started" guide and additional resources, can be found here.

Subscribe

Hugging Face has released Idefics2, a multimodal model for the community

Comments

Read Next

Mercor reports it fell victim to a cyberattack linked to the recently compromised LiteLLM

Legal tech darling Harvey confirms new funding round at a $11B valuation

Wikipedia bans LLM usage for article rewriting and generation

Granola, the viral AI-powered note-taking app, has become a unicorn

Interloom raises $16.5M to build an operational knowledge "memory" for enterprise AI agents