Stable Audio uses generative AI to create high-quality, downloadable music and sound-effect tracks through an easy-to-use web interface. Stability AI offers it in free and paid versions. The Verge reports a three-tier pricing system: the free tier lets users create twenty 45-second downloadable tracks per month, which may not be used commercially. The Professional tier costs $11.99 per month and lets users generate 500 90-second tracks that are cleared for commercial use. Finally, Enterprise pricing lets interested companies arrange usage limits and pricing to suit their needs.
The launch of Stable Audio has sparked significant interest: at the time of writing, the site remains inaccessible because of the high volume of traffic it is receiving. Judging from the quality of the example tracks and prompts featured in the official announcement, the attention is well deserved. The tracks also provide tangible evidence for some of the claims about what sets Stable Audio apart from other audio diffusion models.
According to the research behind Stable Audio, audio diffusion models are usually trained on fixed-length clips and, as a result, generate fixed-length tracks. A model trained on 30-second fragments of audio will only generate 30-second tracks. This is an obstacle for AI-generated songs, since songs vary in length, which would seem to require training on clips of varying length. In addition, fixed-length training clips are often cropped from longer material, so the model learns to generate arbitrary sections of a song with no notion of where the section sits in the original track. As a result, AI-generated songs tend to start or end in the middle of a musical phrase.
Stable Audio mitigates this effect because it is conditioned on text metadata as well as on each training clip’s start time and the duration of the audio file it came from. This gives better control over the content and duration of the generated song. The timing conditioning also enables the generation of tracks of varying length, up to the length of the training window. Additionally, Stable Audio works on a heavily downsampled representation of the audio rather than on raw audio. This speeds up generation considerably: the “flagship Stable Audio model is able to render 95 seconds of stereo audio at a 44.1 kHz sample rate in less than one second on an NVIDIA A100 GPU.”
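To make the timing conditioning more concrete, here is a minimal Python sketch of how a training crop could be paired with its timing values. It is an illustration under stated assumptions, not Stability AI's code: the names `TimedCrop`, `crop_with_timing`, `seconds_start`, `seconds_total`, and `stereo_sample_count` are hypothetical, and the sketch only covers the bookkeeping, not the model itself.

```python
import random
from dataclasses import dataclass


@dataclass
class TimedCrop:
    """A fixed-window crop plus the timing values a model could be conditioned on."""
    start_sample: int     # index of the first sample of the crop in the track
    num_samples: int      # number of real (non-padded) samples in the crop
    seconds_start: float  # where the crop begins in the original track
    seconds_total: float  # duration of the whole original track


def crop_with_timing(track_samples: int, sample_rate: int,
                     window_seconds: float, rng: random.Random) -> TimedCrop:
    """Pick a random training window and record where it sits in the track.

    Assumption: tracks shorter than the window are used whole, starting at 0;
    a real pipeline would presumably pad them up to the window length.
    """
    window = int(window_seconds * sample_rate)
    max_start = max(track_samples - window, 0)
    start = rng.randint(0, max_start)  # inclusive on both ends
    return TimedCrop(
        start_sample=start,
        num_samples=min(window, track_samples - start),
        seconds_start=start / sample_rate,
        seconds_total=track_samples / sample_rate,
    )


def stereo_sample_count(seconds: float, sample_rate: int = 44_100) -> int:
    """Number of individual samples in a stereo clip (two channels)."""
    return int(seconds * sample_rate) * 2


if __name__ == "__main__":
    rng = random.Random(0)
    # A hypothetical 3-minute track cropped to a 95-second window.
    print(crop_with_timing(180 * 44_100, 44_100, 95.0, rng))
    # Raw sample count of the 95-second stereo render quoted above.
    print(stereo_sample_count(95))
```

With values like these attached to every crop, a model can learn where in a track a given window falls. At generation time, a request that starts at zero seconds with a total shorter than the window would then correspond to a complete, shorter track. The sample count also shows why working on a compressed representation helps: 95 seconds of 44.1 kHz stereo audio is more than eight million raw samples.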
The model was trained using audio and metadata from AudioSparx, a library specializing “in licensing high-quality audio content to clients in film and TV production, game production, ad agencies and others needing world-class audio cues for their productions and projects.” The dataset used consists of over 800,000 audio files and corresponding metadata, spanning over 19,500 hours of audio.
While AI-assisted audio generation is not new, Stable Audio stands out because it is widely available as both a free and a subscription-based product. Competitors such as Google’s MusicLM and Meta’s AudioCraft are far less widely available.
The company is understandably thrilled about this landmark in its audio generation research. Harmonai, Stability AI’s generative audio research lab, already plans to release open-source models based on Stable Audio, along with training code, so that developers everywhere can train their own audio generation models. Emad Mostaque, CEO of Stability AI, stated, “Our hope is that Stable Audio will empower music enthusiasts and creative professionals to generate new content with the help of AI, and we look forward to the endless innovations it will inspire.”
Read Stability AI’s announcement in full, as well as the full paper on the research behind Stable Audio. Both pieces include plenty of tracks and prompts that will let you catch a glimpse of Stable Audio’s impressive capabilities.