Grok-1.5V is xAI's first multimodal foundation model. It retains its predecessors' text capabilities and adds strong visual information processing, enabling the model to extract data from sources including documents, diagrams, charts, screenshots, and photographs. To evaluate Grok-1.5V, the research team at xAI developed RealWorldQA, a benchmark that measures real-world spatial understanding by asking questions that involve comparing several objects in a picture, describing an object's position, or determining true size by accounting for perspective. The RealWorldQA benchmark was released under a CC BY-ND 4.0 license alongside the Grok-1.5V preview announcement. It contains 700 anonymized pictures taken from various real-world sources, including vehicles, annotated with easily verifiable question-answer pairs. Grok-1.5V will be available to early testers and existing Grok users shortly, as xAI plans to continue exploring multimodal AI as part of its journey toward AGI.
Grok just got vision: xAI announces Grok-1.5V preview
xAI recently announced the preview availability of Grok-1.5V, its first multimodal foundation model. Grok-1.5V delivers competitive performance in visual information processing tasks and is xAI's first model to target real-world spatial understanding.