
Uninterrupted diffusion with Imagen Video
The Google Brain team has unveiled Imagen Video, its new text-to-video solution: an AI system that generates video clips from a text prompt. The text-conditioned video diffusion model can generate videos at up to 1280×768 resolution and 24 frames per second.
Given a text prompt, Imagen Video produces high-definition video using a base video generation model followed by a cascade of interleaved spatial and temporal super-resolution video models.
Imagen Video is a so-called "diffusion" model, consisting of a frozen T5-XXL text encoder, a base video diffusion model, and interleaved spatial and temporal super-resolution diffusion models. A diffusion model generates new data (e.g., video) by learning to gradually "break down" training samples with noise and then "restore" them, so that at inference time it can turn pure noise into a clean sample.
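The "break down" half of that process can be sketched in a few lines. The snippet below implements the standard forward noising step of a diffusion model with an illustrative linear noise schedule; the schedule values and step count are assumptions for demonstration, not the parameters Imagen Video actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative linear noise schedule (assumed values, not Imagen Video's).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)  # cumulative signal-retention factors

def add_noise(x0, t):
    """Forward ("break down") step: blend clean data with Gaussian noise.

    At small t the output is close to x0; near t = T-1 it is almost pure
    noise. Training teaches a network to reverse ("restore") this process.
    """
    noise = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise
    return xt, noise

x0 = rng.standard_normal((4, 4))  # toy "data sample"
xt, eps = add_noise(x0, t=T - 1)  # heavily corrupted version of x0
```

Sampling runs this process in reverse: starting from noise, the model repeatedly predicts and removes the noise component to recover a clean sample.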
A notable design feature is the Video U-Net architecture: spatial operations are applied to each frame independently with shared parameters, by folding time into the batch dimension ((batch × time), height, width, channels), while temporal operations act on the full five-dimensional tensor (batch, time, height, width, channels).
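The reshaping trick behind this factorization can be shown with a small sketch. The per-frame and per-sequence operations below are hypothetical placeholders (a real Video U-Net would use shared-weight convolutions and temporal attention); the point is only the tensor bookkeeping.

```python
import numpy as np

def spatial_op(frames):
    # Placeholder per-frame operation; a real model would apply
    # shared-weight 2-D convolutions to each frame independently.
    return frames * 2.0

def temporal_op(video):
    # Placeholder operation along the time axis; a real model would use
    # temporal convolutions or attention to mix information across frames.
    return np.cumsum(video, axis=1)

def video_unet_block(video):
    b, t, h, w, c = video.shape
    # Spatial pass: fold time into batch so every frame is processed
    # independently with the same parameters.
    frames = video.reshape(b * t, h, w, c)
    frames = spatial_op(frames)
    video = frames.reshape(b, t, h, w, c)
    # Temporal pass: operate on the full 5-D tensor across the time axis.
    return temporal_op(video)

x = np.ones((2, 8, 16, 16, 3))  # (batch, time, height, width, channels)
y = video_unet_block(x)
print(y.shape)  # (2, 8, 16, 16, 3)
```

Folding time into the batch lets the model reuse ordinary image-model layers unchanged, while the separate temporal pass is the only place frames interact.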
Imagen Video not only generates video with high fidelity but also offers a high degree of controllability and world knowledge, including the ability to produce diverse videos and text animations in a variety of artistic styles and with a 3D understanding of objects.
Imagen Video builds on Google's Imagen, an image generation system comparable to DALL-E 2. Google was previously reported to have removed Imagen's beta waiting list, so users can now start using it at any time.