Musika! Fast Infinite Waveform Music Generation

Fast, user-controllable music generation opens up new possibilities for composing and performing music. But today's music generation systems require large amounts of data and computing resources to train, and they are slow at inference, which makes them impractical for real-time interactive use.

Musika, the work of Marco Pasini and Jan Schlüter, is a music generation system that can be trained on hundreds of hours of music with a single consumer GPU, and that generates music of arbitrary length much faster than real time on a consumer CPU.

Autoregressive models can generate high-quality audio with long-range dependencies, but their sampling process is extremely slow and inefficient, which makes real-world, real-time interactive applications difficult to achieve. Given these shortcomings of autoregressive audio generation, Musika takes a GAN-based approach that allows fast unconditional and conditional audio generation of arbitrary length.
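To make the contrast concrete, here is a toy Python sketch (not from the paper) of why sample-by-sample autoregressive waveform generation is so much slower than emitting a whole clip in one parallel call; the simple recursive filter below merely stands in for a model call:

```python
# Toy illustration: an autoregressive sampler needs one sequential step per output
# sample (sample_rate * seconds of them), while a non-autoregressive generator such
# as a GAN can emit the entire clip in a single, parallelizable call.
import numpy as np

sample_rate, seconds = 44100, 10
rng = np.random.default_rng(0)

def autoregressive_generate(n_samples):
    audio = np.zeros(n_samples)
    for t in range(1, n_samples):  # 441,000 sequential "model calls" for 10 s of audio
        audio[t] = 0.99 * audio[t - 1] + 0.01 * rng.standard_normal()
    return audio

def parallel_generate(n_samples):
    return rng.standard_normal(n_samples)  # one call, length-independent latency in principle

slow = autoregressive_generate(sample_rate * seconds)
fast = parallel_generate(sample_rate * seconds)
```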

The authors achieve this by first learning a compact invertible representation of spectrogram magnitudes and phases with adversarial autoencoders, and then training a generative adversarial network (GAN) on this representation for a specific music domain.
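The paper does not prescribe a particular framework; purely as an illustration of this two-stage idea, here is a minimal PyTorch sketch in which an autoencoder compresses spectrogram magnitude and phase into a compact latent and a GAN generator is then trained to produce sequences of those latents (all module names, layer sizes, and shapes are assumptions, not the authors' architecture):

```python
# Illustrative sketch (not the official Musika code): stage 1 compresses spectrogram
# magnitude + phase into a compact latent; stage 2 trains a GAN on those latents.
import torch
import torch.nn as nn

class SpectrogramAutoencoder(nn.Module):
    """Stage 1: compress (magnitude, phase) frames into a low-dimensional latent."""
    def __init__(self, n_freq=512, latent_dim=64):
        super().__init__()
        # Magnitude and phase are stacked, giving 2 * n_freq values per time frame.
        self.encoder = nn.Sequential(
            nn.Linear(2 * n_freq, 1024), nn.LeakyReLU(0.2),
            nn.Linear(1024, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 1024), nn.LeakyReLU(0.2),
            nn.Linear(1024, 2 * n_freq),
        )

    def forward(self, spec_frames):            # (batch, time, 2 * n_freq)
        z = self.encoder(spec_frames)           # compact latent sequence
        recon = self.decoder(z)                 # reconstructed magnitude + phase
        return z, recon

class LatentGenerator(nn.Module):
    """Stage 2: GAN generator that produces latent sequences instead of raw audio."""
    def __init__(self, noise_dim=128, latent_dim=64, frames=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim, 1024), nn.LeakyReLU(0.2),
            nn.Linear(1024, frames * latent_dim),
        )
        self.frames, self.latent_dim = frames, latent_dim

    def forward(self, noise):
        z = self.net(noise).view(-1, self.frames, self.latent_dim)
        return z  # decoded to spectrograms (and then audio) by the frozen autoencoder
```

Generating in this small latent space, rather than at the waveform level, is what keeps both training and inference cheap enough for a single consumer GPU and CPU.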

A latent coordinate system enables parallel generation of arbitrarily long sequences of excerpts, while a global context vector allows the music to remain stylistically consistent over time. The researchers conduct quantitative evaluations to assess the quality of the generated samples and showcase options for user control in piano and techno music generation.
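A minimal sketch of how such a latent coordinate system and global context vector could enable parallel, arbitrary-length generation (again an illustrative assumption about the mechanism, not the authors' exact design):

```python
# Sketch: every excerpt is generated from (its coordinate, a shared global style vector),
# so all excerpts can be produced in one batched call and concatenated in time.
import torch
import torch.nn as nn

class CoordinateConditionedGenerator(nn.Module):
    def __init__(self, coord_dim=32, style_dim=128, latent_dim=64, frames_per_chunk=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(coord_dim + style_dim, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, frames_per_chunk * latent_dim),
        )
        self.frames, self.latent_dim = frames_per_chunk, latent_dim

    def forward(self, coords, style):
        # coords: (n_chunks, coord_dim) -- a position on a latent "timeline" per excerpt
        # style:  (style_dim,)          -- one global context vector shared by all excerpts
        style = style.unsqueeze(0).expand(coords.size(0), -1)
        z = self.net(torch.cat([coords, style], dim=-1))
        return z.view(-1, self.frames, self.latent_dim)

gen = CoordinateConditionedGenerator()
n_chunks = 100                                   # a longer piece just means more coordinates
coords = torch.randn(n_chunks, 32)               # would vary smoothly over time in practice
style = torch.randn(128)                         # keeps the whole piece stylistically coherent
latents = gen(coords, style).reshape(1, -1, 64)  # (1, n_chunks * frames_per_chunk, latent_dim)
```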

They also release the source code and pre-trained autoencoder weights, so that a GAN can be trained on a new music domain with a single GPU in a matter of hours.
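The released repository ships its own training scripts; the sketch below, which reuses the classes from the earlier sketches, only illustrates the general workflow of freezing the pre-trained autoencoder and fitting a small latent GAN on a new domain, and none of it reflects the repository's actual API:

```python
# Hypothetical workflow sketch -- not the Musika repository's training script.
# The stage-1 autoencoder weights are released, so only the small latent GAN is trained.
import torch

autoencoder = SpectrogramAutoencoder()            # in practice, load the released weights here
autoencoder.eval()

generator = LatentGenerator(noise_dim=128, latent_dim=64, frames=256)
discriminator = torch.nn.Sequential(
    torch.nn.Flatten(),
    torch.nn.Linear(256 * 64, 512), torch.nn.LeakyReLU(0.2),
    torch.nn.Linear(512, 1),
)
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
bce = torch.nn.BCEWithLogitsLoss()

# Dummy batch standing in for spectrogram frames of the new music domain.
real_spec = torch.randn(8, 256, 2 * 512)
with torch.no_grad():
    real_z, _ = autoencoder(real_spec)            # frozen encoder -> training targets

# One simplified GAN step carried out entirely in the latent space.
fake_z = generator(torch.randn(8, 128))
d_loss = bce(discriminator(real_z), torch.ones(8, 1)) + \
         bce(discriminator(fake_z.detach()), torch.zeros(8, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

g_loss = bce(discriminator(fake_z), torch.ones(8, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```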

Links to the paper and code are provided below.
