The Google Brain team has introduced Imagen Video, a text-to-video AI system that generates video clips from a text prompt. The text-conditioned video diffusion model can generate videos at up to 1280×768 resolution and 24 frames per second.
Given a text prompt, Imagen Video generates high-definition video using a base video generation model followed by a sequence of interleaved spatial and temporal super-resolution video models.
Imagen Video is a so-called "diffusion" model. It consists of a text encoder (frozen T5-XXL), a base video diffusion model, and interleaved spatial and temporal super-resolution diffusion models. A diffusion model generates new data (e.g., video) by learning how to "break down" many existing data samples with noise and then "restore" them.
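The "break down" and "restore" idea can be illustrated with a minimal NumPy sketch of a generic diffusion step. This is not Imagen Video's actual parameterization: the cosine noise schedule, the function names (noise_levels, corrupt, restore), and the tiny array sizes are all assumptions made for illustration, and the true noise stands in for the prediction a trained network would supply.

```python
import numpy as np

def noise_levels(t):
    """Assumed cosine schedule for t in [0, 1]; alpha**2 + sigma**2 == 1."""
    return np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)

def corrupt(x0, eps, t):
    """'Break down': blend the clean sample with Gaussian noise."""
    alpha, sigma = noise_levels(t)
    return alpha * x0 + sigma * eps

def restore(x_t, eps_hat, t):
    """'Restore': invert the blend using an estimate of the noise."""
    alpha, sigma = noise_levels(t)
    return (x_t - sigma * eps_hat) / alpha

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(4, 8, 8, 3))  # toy clip: 4 frames of 8x8 RGB
eps = rng.normal(size=x0.shape)

x_t = corrupt(x0, eps, t=0.7)
# A trained model would predict the noise from x_t and the text prompt;
# here the true noise is used as an oracle, so recovery is exact.
x0_hat = restore(x_t, eps, t=0.7)
```

In a real system the noise estimate is imperfect, so generation proceeds over many small denoising steps rather than a single inversion.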
A notable design feature is the Video U-Net architecture, in which spatial operations are performed independently on each frame with shared parameters, on a tensor of shape (batch × time, height, width, channels), while temporal operations work on the entire 5-dimensional tensor (batch, time, height, width, channels).
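The two tensor layouts can be sketched with plain NumPy reshapes. The operations below (center_per_frame, temporal_smooth) are hypothetical stand-ins chosen for illustration, not the convolution and attention layers of the actual model; only the shape handling mirrors the description above.

```python
import numpy as np

def apply_spatial(video, op):
    """Fold time into the batch axis so `op` sees each frame independently."""
    b, t, h, w, c = video.shape
    frames = video.reshape(b * t, h, w, c)   # (batch x time, H, W, C)
    out = op(frames)                         # same op, same parameters, every frame
    return out.reshape(b, t, *out.shape[1:])

def center_per_frame(frames):
    """Hypothetical spatial op: subtract each frame's spatial mean."""
    return frames - frames.mean(axis=(1, 2), keepdims=True)

def temporal_smooth(video):
    """Hypothetical temporal op on the 5-D tensor: 3-frame moving average."""
    pad = ((0, 0), (1, 1), (0, 0), (0, 0), (0, 0))
    padded = np.pad(video, pad, mode="edge")  # repeat first and last frame
    return (padded[:, :-2] + padded[:, 1:-1] + padded[:, 2:]) / 3.0

rng = np.random.default_rng(0)
video = rng.normal(size=(2, 4, 8, 8, 3))      # (batch, time, H, W, C)

spatial_out = apply_spatial(video, center_per_frame)
temporal_out = temporal_smooth(video)
```

Because the spatial path never sees the time axis, changing one frame leaves the spatial output of every other frame untouched, whereas the temporal path mixes information between neighboring frames.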
Imagen Video not only generates high-fidelity video but also offers a high degree of controllability and world knowledge, including the ability to generate diverse videos and text animations in a range of artistic styles and with a 3D understanding of objects.
Imagen Video builds on Google's Imagen, an image generation system comparable to DALL-E 2. DALL-E 2 was previously reported to have dropped its beta waiting list, so users can now start using it at any time.