Gemini: Google's largest and most proficient model yet is also natively multimodal
As multimodal becomes the next frontier in the generative AI race, the biggest names in the field are scrambling to get ahead of the competition by launching the next state-of-the-art model. Back in October, OpenAI announced the availability of DALL-E 3 within ChatGPT. A month earlier, the company had also announced that users could show images to ChatGPT and communicate with it via voice. Around that time, Adept also open-sourced its most compact multimodal model, Fuyu-8B, which notably does not rely on a separate image encoder to process images. These announcements barely scratch the surface of the current interest in multimodal generative AI.
Just a month after OpenAI's DevDay, where many of the updates concerning GPT-4 Turbo were officially announced, Google has announced the launch of Gemini, its family of natively multimodal models. Rather than stitching independent text, audio, and image processing components into a composite system, Gemini was pre-trained from the start on different data types, including text, images, and code. It was then fine-tuned with additional multimodal data, which further improved its effectiveness. This training regime enables Gemini to process visual and textual inputs seamlessly via multimodal prompting.
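To give a concrete sense of what multimodal prompting looks like in practice, the short Python sketch below sends an image and a text instruction to the model in a single request. It assumes the google-generativeai SDK and the "gemini-pro-vision" model name that accompanied the launch; treat the exact identifiers and the file name as assumptions to check against the current documentation rather than a definitive recipe.

```python
# Minimal sketch of a multimodal (image + text) prompt.
# Assumes the google-generativeai Python SDK and the "gemini-pro-vision"
# model name from the launch announcement; verify against current docs.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # key obtained from Google AI Studio

model = genai.GenerativeModel("gemini-pro-vision")
chart = Image.open("experiment_results.png")  # hypothetical local image

# A single request interleaving an image with a textual instruction.
response = model.generate_content(
    [chart, "Summarize the trend shown in this chart and flag any outliers."]
)
print(response.text)
```

The point of the example is that the image and the text travel together as parts of one prompt, rather than being routed through separate vision and language systems.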
Google's announcement opens with a note from Google and Alphabet CEO Sundar Pichai, who describes the Gemini family of models as the realization of the vision that led to the formation of Google DeepMind and as an unrivaled science and research effort by the company. This framing seems right on target: as we will see, the Gemini models are an impressive achievement that is poised to transform the user experience of many of Google's products, and there is little doubt that they will also be harnessed for novel applications.
Gemini's capabilities for sophisticated multimodal reasoning are extensively showcased in the announcement and the myriad resources Google released. The diversity and complexity of the tasks it can complete are undeniably impressive: from reviewing large collections of research papers and updating graphs based on its findings, to excelling at competitive programming (by powering AlphaCode 2), to detecting patterns and reasoning about sequences of images, it seems there is little that Gemini cannot do.
Indeed, this impression is confirmed by the results of the comprehensive benchmark testing Gemini has undergone: it achieves state-of-the-art performance on 30 of 32 widely used benchmarks in LLM research and development. Furthermore, Gemini Ultra is the first model to surpass human experts on MMLU (massive multitask language understanding), a test that measures multitask accuracy by evaluating knowledge and problem-solving skills across 57 tasks covering subjects such as elementary mathematics, US history, law, computer science, medicine, and more.
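To make "multitask accuracy" concrete: a common convention for a benchmark like MMLU is to score each of the 57 tasks separately and then macro-average the per-task accuracies, roughly as below. This is a sketch of the usual aggregation, not necessarily the exact protocol behind Google's reported numbers.

$$
\text{MMLU score} \;=\; \frac{1}{57}\sum_{t=1}^{57} \frac{\text{correct answers in task } t}{\text{questions in task } t}
$$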
Thanks to its multimodal training, Gemini also achieves state-of-the-art performance in image benchmarks without the assistance of OCR. Finally, the model also achieved a state-of-the-art score of 59.4% on the new MMMU (Massive Multi-discipline Multimodal Understanding) benchmark that comprises "11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering."
Another of Gemini's most salient features is its flexibility. The family includes three sizes, each intended for a specific purpose: Gemini Ultra is the largest and most capable model, built for highly complex tasks; Gemini Pro is the best model for scaling across a wide range of tasks; and Gemini Nano is the most efficient model, designed for on-device tasks. Pixel 8 Pro is the first smartphone engineered to run Gemini Nano, which will power features such as Summarize in the Recorder app and Smart Reply in Gboard. Android developers can build with Gemini Nano via AICore, a new system capability available in Android 14, currently in early preview.
Given that Gemini Pro is meant to be the scalable model, it is no surprise that Bard now relies on a fine-tuned version of Gemini Pro for more advanced reasoning. The company also expects to roll Gemini out across several other products, including Search, Ads, and Chrome. Developers and enterprises will have access to Gemini Pro via the Gemini API in Google AI Studio or Google Cloud Vertex AI starting December 13. Google AI Studio is the company's web-based developer tool for building and launching apps quickly with an API key, whereas Vertex AI is a fully managed AI platform for customizing Gemini while benefiting from Google Cloud features for enterprise security, data privacy, governance, and compliance.
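To make the contrast between the two access paths concrete, the sketch below issues a text-only request through Vertex AI instead of an AI Studio API key. It assumes the google-cloud-aiplatform package and its preview generative_models module as they stood around launch, plus a hypothetical project ID; module paths and model names should be verified against the current documentation.

```python
# Minimal sketch of calling Gemini Pro through Vertex AI.
# Assumes the google-cloud-aiplatform SDK's preview generative_models
# module from the launch period; project ID and region are hypothetical.
import vertexai
from vertexai.preview.generative_models import GenerativeModel

vertexai.init(project="my-gcp-project", location="us-central1")

model = GenerativeModel("gemini-pro")
response = model.generate_content(
    "Explain the difference between Gemini Pro and Gemini Nano in two sentences."
)
print(response.text)
```

The practical difference from the AI Studio route is that authentication here comes from the environment's Google Cloud credentials rather than a standalone API key, which is what ties the request into the platform's enterprise governance features.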
These developer tools will be matched by the availability of Cloud TPU v5p, Google's "most powerful, scalable, and flexible AI accelerator thus far." Gemini 1.0 was trained on Google's in-house v4 and v5e TPUs, the same AI accelerators at the core of AI-powered products such as Search, YouTube, and Gmail. The new v5p TPUs will speed up Gemini's continued development and enable customers to train and deploy their own large-scale AI models more efficiently.
Gemini Ultra is coming soon, once extensive trust and safety checks are complete. Google will first offer Gemini Ultra to select partners, customers, and experts for early feedback, with a broader rollout to developers and enterprises planned for early next year. Its availability will be accompanied by the launch of Bard Advanced, an AI experience powered by Google's best models and capabilities. The company's AI Principles guided Gemini's development, and extensive efforts were made to ensure the models meet the most up-to-date safety and security standards.