
Hugging Face has released Idefics2, a multimodal model for the community

Hugging Face has launched Idefics 2, the latest iteration of its Apache 2.0 licensed, 8B-parameter general multimodal model. With enhanced OCR capabilities, it performs at the top of its size class and competes with larger models such as LLava-Next-34B and MM1-30B-chat.

by Ellie Ramirez-Camara
Credit: Hugging Face

Idefics 2 is an Apache 2.0 licensed, 8B-parameter general multimodal model with enhanced OCR capabilities. It takes text and images as input to answer questions about images, describe visual content, generate narratives grounded in image inputs, extract relevant information from documents, and do basic math. Its performance on visual question-answering benchmarks is at the top of its size class and competes with larger models such as LLava-Next-34B and MM1-30B-chat. Idefics 2 is also integrated out of the box in 🤗 Transformers for simplified fine-tuning.
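
For reference, loading the model through 🤗 Transformers follows the usual AutoProcessor/AutoModelForVision2Seq pattern. The sketch below assumes the HuggingFaceM4/idefics2-8b checkpoint name and uses a placeholder image URL; check the model card for the exact, up-to-date snippet.

```python
# Minimal inference sketch for Idefics 2 via 🤗 Transformers.
# Assumes the "HuggingFaceM4/idefics2-8b" checkpoint and a recent transformers
# release with Idefics2 support; device_map="auto" additionally requires accelerate.
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

checkpoint = "HuggingFaceM4/idefics2-8b"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(checkpoint)
model = AutoModelForVision2Seq.from_pretrained(checkpoint, device_map="auto")

# Placeholder image URL for illustration only.
image = Image.open(requests.get("https://example.com/invoice.png", stream=True).raw)

# Build a chat-style prompt that interleaves an image with a text question.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is the total amount on this document?"},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```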

Idefics 2 was trained on openly available datasets, including Wikipedia and OBELICS for interleaved web documents; the Public Multimodal Dataset and LAION-COCO for image-caption pairs; PDFA (en), IDL, and Rendered-text for OCR data; and WebSight for image-to-code data. Instruction fine-tuning was carried out with The Cauldron, a multimodal instruction fine-tuning dataset released alongside Idefics 2 that compiles 50 manually curated datasets for multi-turn conversations, complemented by text-only instruction fine-tuning datasets. Improvements over the previous Idefics version include handling images at their original resolution and aspect ratio, which removes the need to resize them; enhanced OCR capabilities, obtained by training the model on data that requires transcribing text from images and documents; and a simpler architecture for the visual features. Additional information on the dataset, training, and license, plus a "getting started" guide and additional resources, can be found here.
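
For readers who want to inspect The Cauldron directly, the sketch below assumes it is hosted on the Hugging Face Hub under a HuggingFaceM4/the_cauldron identifier with one configuration per source dataset and images/texts fields; the actual identifier and schema may differ, so consult the dataset card.

```python
# Sketch: browsing The Cauldron instruction fine-tuning data with 🤗 Datasets.
# The dataset identifier, subset name, and field names below are assumptions.
from datasets import load_dataset

# Each subset is assumed to correspond to one of the ~50 curated source datasets.
ds = load_dataset("HuggingFaceM4/the_cauldron", "ai2d", split="train")

sample = ds[0]
print(sample["texts"])   # assumed: multi-turn user/assistant exchanges about the image(s)
print(sample["images"])  # assumed: associated image(s)
```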
