Musika! Fast Infinite Waveform Music Generation

Dmitry Spodarets

Fast, user-controlled music generation opens up new possibilities for composing and performing music. But today's music generation systems require large amounts of data and computing resources for training, and they are slow at generating output. This makes them impractical for real-time interactive use.

Musika, the work of Marco Pasini and Jan Schlüter, is a music generation system that can be trained on hundreds of hours of music using a single consumer GPU, and that generates music of arbitrary length much faster than real time on a consumer CPU.

Autoregressive models can generate high-quality audio with long-range dependencies, but their sampling process is extremely slow and inefficient, which makes real-world applications difficult and unconditional generation of long audio impractical. Given these shortcomings of existing autoregressive and non-autoregressive audio generation systems, Musika is designed as a GAN-based system that allows fast unconditional and conditional audio generation of arbitrary length.

The developers achieved this by first training a compact invertible representation of spectrogram magnitudes and phases using adversarial autoencoders, and then training a generative adversarial network (GAN) on this representation for a specific musical domain.
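To illustrate why a representation covering both magnitudes and phases can be inverted back to a waveform, here is a minimal NumPy sketch. It is not Musika's code: the function names `to_mag_phase` and `from_mag_phase` are hypothetical, it works on a single frame with a plain FFT, and it leaves out the learned autoencoder compression entirely.

```python
import numpy as np

def to_mag_phase(frame):
    """Split one audio frame into spectral magnitude and phase (hypothetical helper)."""
    spectrum = np.fft.rfft(frame)
    return np.abs(spectrum), np.angle(spectrum)

def from_mag_phase(magnitude, phase, n):
    """Rebuild the frame from magnitude and phase (hypothetical helper)."""
    spectrum = magnitude * np.exp(1j * phase)
    return np.fft.irfft(spectrum, n=n)

# Assumption: a random frame stands in for real audio.
rng = np.random.default_rng(0)
frame = rng.standard_normal(1024)

mag, phase = to_mag_phase(frame)
restored = from_mag_phase(mag, phase, n=frame.size)

# With both magnitude and phase available, the round trip is exact
# up to floating-point error.
max_error = float(np.max(np.abs(frame - restored)))
print(max_error)
```

If the phase were discarded, the waveform could only be estimated, which is the motivation for keeping both components in the representation.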

The latent coordinate system allows arbitrarily long sequences of excerpts to be generated in parallel, while the global context vector keeps the music stylistically consistent over time. The researchers conduct quantitative evaluations to assess the quality of the generated samples and demonstrate options for user control when generating piano and techno music.
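The idea of coordinates plus a shared context vector can be sketched with a toy example. Everything below is an assumption made for illustration: the names `coordinate_windows` and `toy_generator`, the dimensions, the scheme of neighbouring windows sharing an anchor vector, and the random linear map that stands in for a trained GAN generator. It only shows how windows can be produced independently, and therefore in parallel, while one style vector is reused for all of them.

```python
import numpy as np

def coordinate_windows(num_windows, coord_dim, rng):
    """Describe each window by two anchor vectors; neighbours share one anchor."""
    anchors = rng.standard_normal((num_windows + 1, coord_dim))
    return np.concatenate([anchors[:-1], anchors[1:]], axis=1)

def toy_generator(coords, style, weights):
    """Stand-in for a trained generator: one linear map followed by tanh."""
    style_tiled = np.broadcast_to(style, (coords.shape[0], style.size))
    inputs = np.concatenate([coords, style_tiled], axis=1)
    return np.tanh(inputs @ weights)

rng = np.random.default_rng(0)
coord_dim, style_dim, frames_per_window = 8, 4, 16  # hypothetical sizes
num_windows = 5  # can be any length

coords = coordinate_windows(num_windows, coord_dim, rng)
style = rng.standard_normal(style_dim)  # one global context vector for the piece
weights = rng.standard_normal((2 * coord_dim + style_dim, frames_per_window))

# All windows are generated in a single batch; no window depends on
# the output of another one.
latents = toy_generator(coords, style, weights)
sequence = latents.reshape(-1)  # windows laid end to end
print(latents.shape, sequence.shape)
```

Because each window depends only on its own coordinates and the shared style vector, making the piece longer just means sampling more anchors; in this sketch the cost grows linearly and the batch can be split across workers.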

They have also published the source code and pre-trained autoencoder weights, so that a GAN can be trained on a new music domain using a single GPU in a few hours.

Links to the work and code are left below.
