The open-source Pyramid Flow model demonstrates a new approach to video generation

Researchers from Peking University, Beijing University of Posts and Telecommunications, and Kuaishou Technology have shared their research into video generation using pyramidal flow matching as an alternative to traditional cascading methods. In a cascade, generating at full resolution is avoided by first producing the whole video at low resolution and then handing it off to a separate model for decompression (the super-resolution stage). While cascading reduces the computational demands of the system as a whole, relying on different models and sub-models for different stages of the generation process sacrifices flexibility, scalability, and knowledge sharing.

Leveraging pyramidal flow matching, the researchers further reduced computational demands by eliminating the need to train a separate model for each stage of the video generation process. Instead, the algorithm runs end-to-end in a single Diffusion Transformer (DiT). As an example, the authors state that their approach requires at most 15,360 tokens to generate a 10-second, 241-frame video, versus 119,040 tokens for cascading approaches.
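To see where savings of this kind come from, here is a minimal sketch of the arithmetic. It assumes, for illustration only, that each latent pixel of each frame contributes one token and that each pyramid level halves the height and width; the function name, dimensions, and level schedule are hypothetical and do not reproduce the paper's exact accounting.

```python
def tokens_at_level(height, width, frames, level):
    """Token count for `frames` frames at pyramid `level` (level 0 = full resolution).

    Assumes one token per latent pixel and a 2x spatial reduction per level.
    """
    scale = 2 ** level
    return (height // scale) * (width // scale) * frames

# Full-resolution processing of every frame in an illustrative latent grid:
full = tokens_at_level(48, 80, 16, level=0)    # 48 * 80 * 16 = 61440 tokens

# Running most steps two pyramid levels down costs 1/16 as many tokens:
coarse = tokens_at_level(48, 80, 16, level=2)  # 12 * 20 * 16 = 3840 tokens

print(full, coarse)  # 61440 3840
```

The quadratic dependence of token count on spatial resolution is why spending most denoising steps at reduced resolution dominates the overall budget.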

More importantly, by incorporating spatial and temporal pyramid representations, the researchers could generate high-quality videos on par with some of the most popular commercial and open-source offerings. First, the team handled spatial complexity (individual frame generation) by noting that the early stages of the process are noisy and uninformative, meaning they do not need to operate at full resolution. Thus, only the final stage of the frame generation process runs at full resolution, and operating the earlier stages at lower resolutions further reduces the computational requirements of the process.
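The coarse-to-fine structure described above can be sketched as follows. This is an assumed, simplified schedule, not the authors' implementation: the `denoise` step is a placeholder for a DiT pass, the level count and resolutions are illustrative, and real pipelines operate on multi-channel latents rather than a 2D array.

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbour upsampling of an (H, W) latent to hand off between stages.
    return x.repeat(2, axis=0).repeat(2, axis=1)

def denoise(latent, steps):
    # Stand-in for a DiT denoising pass at the latent's own resolution.
    for _ in range(steps):
        latent = latent * 0.9  # placeholder update, not a real sampler
    return latent

def pyramid_generate(full_hw=(64, 64), levels=3, seed=0):
    """Run early, noisy stages at reduced resolution; only the last at full."""
    rng = np.random.default_rng(seed)
    h, w = full_hw
    # Start from pure noise at the coarsest level (here 1/4 resolution).
    x = rng.standard_normal((h // 2 ** (levels - 1), w // 2 ** (levels - 1)))
    for level in range(levels - 1, -1, -1):
        x = denoise(x, steps=10)  # cheap while the latent is small
        if level > 0:
            x = upsample2x(x)     # hand off to the next, finer stage
    return x  # full resolution is reached only in the final stage

out = pyramid_generate()
print(out.shape)  # (64, 64)
```

Because the array is small during most of the loop, the bulk of the denoising work happens at a fraction of the full-resolution cost, matching the intuition that noisy early steps carry little fine detail.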

Similarly, when addressing the temporal continuity of the generated videos, the researchers noted that conditioning each new frame on a full-resolution history was also redundant, as the earlier frames in a video contribute high-level context (setting the stage for the story) rather than fine detail. Thus, each frame is generated conditioned on a compressed, lower-resolution history, which improves the autoregressive model's efficiency without compromising output quality.
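A minimal sketch of such compressed-history conditioning, under assumed details (the pooling scheme, the choice to keep only the most recent frame at full resolution, and the helper names are all hypothetical):

```python
import numpy as np

def downsample2x(frame):
    # 2x2 average pooling of an (H, W) frame.
    h, w = frame.shape
    return frame.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def build_history(frames):
    """Condition on the most recent frame at full resolution;
    keep older frames only in compressed, lower-resolution form."""
    if not frames:
        return []
    *older, latest = frames
    return [downsample2x(f) for f in older] + [latest]

frames = [np.zeros((32, 32)) for _ in range(4)]
history = build_history(frames)
print([h.shape for h in history])
# [(16, 16), (16, 16), (16, 16), (32, 32)]
```

The conditioning context therefore grows far more slowly than the raw frame history, which is what keeps long autoregressive rollouts affordable.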

Pyramid Flow, the model resulting from integrating this approach into a single unified DiT, is available under an MIT License. The code can be downloaded from GitHub and Hugging Face, the latter also hosting a model demo. Pyramid Flow's GitHub project also features quantitative and qualitative results.