Pixtral 12B is Mistral AI's first open-source multimodal model

Mistral AI has released its first-ever multimodal model, Pixtral 12B. Based on the recently launched NeMo 12B, Pixtral 12B delivers competitive performance in various tasks involving image inputs in a compact 24 GB presentation. Pixtral 12B can understand images and text, without limit to the amount and size of the pictures that can be provided as input via a URL or encoded using base64. The model can also process documents including images, features a 128k context window, and is released under an Apache 2.0 license.

Pixtral 12B was benchmarked against some of the best open and closed-source multimodal models in its size class, including Qwen2 7B, LLaVA-OV 7B, Phi-3 Vision, Claude 3 Haiku, and Gemini 1.5 8B. Pixtral's solid performance is expected to provide an open-source alternative for use cases dominated by models such as Claude 3.5 Sonnet and GPT-4o, such as image captioning, image-to-text transcription (OCR), data processing and extraction, and image analysis. Pixtral 12B is currently available through a torrent link at GitHub and Hugging Face, and it has been confirmed that it will be accessible through Mistral's Le Plateforme and Le Chat soon.

Subscribe

Pixtral 12B is Mistral AI's first open-source multimodal model

Comments

Read Next

Databricks raises $4B Series L, reports $4.8B in revenue and YoY growth of over 55%

Kling VIDEO 2.6 introduces simultaneous video and audio generation

Runway launches GWM-1 models for worlds simulation, robotics, and avatar creation