Audio generation has seen rapid advances thanks to diffusion models, but controllability and speed remain challenges. Harmonai, Stability AI’s generative audio lab, has introduced Stable Audio, a new conditioned latent diffusion model that tackles both.
Stable Audio can generate high-fidelity music, instruments, and sound effects conditioned on text prompts, audio length, and start time. This timing conditioning allows precise control over the length and content of the AI-generated audio.
The model works by compressing audio into a compact latent space using a variational autoencoder (VAE), then conditioning a diffusion U-Net on text and timing embeddings as it denoises those latents. Stable Audio applies the same latent diffusion approach that powers Stability AI’s image generation model Stable Diffusion, rather than diffusing raw waveforms directly.
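To make the pipeline concrete, here is a minimal sketch of how the pieces fit together. Everything is hypothetical: the dimensions, the function names, and the stand-in encoder and U-Net are illustrative stubs, not the real model's architecture or sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- the article does not give the real model's sizes.
LATENT_CHANNELS = 64   # channels of the VAE latent
LATENT_FRAMES = 1024   # temporal frames in the latent sequence
COND_DIM = 768         # size of the combined text + timing conditioning vector

def encode_to_latent(audio: np.ndarray) -> np.ndarray:
    """Stand-in for the VAE encoder: maps raw audio to a compact latent."""
    # A real VAE learns this compression; here we just return the latent shape.
    return rng.standard_normal((LATENT_CHANNELS, LATENT_FRAMES))

def build_conditioning(text_emb: np.ndarray, timing_emb: np.ndarray) -> np.ndarray:
    """Concatenate text and timing embeddings into one conditioning vector."""
    return np.concatenate([text_emb, timing_emb])

def denoise_step(latent: np.ndarray, cond: np.ndarray, t: int) -> np.ndarray:
    """Stand-in for one conditioned U-Net denoising step."""
    # A real U-Net predicts and removes noise given `cond`; we just nudge
    # the latent toward zero to show the iterative structure.
    return latent - 0.1 * latent

audio = rng.standard_normal(44_100 * 10)   # 10 s of placeholder mono audio
latent = encode_to_latent(audio)
cond = build_conditioning(rng.standard_normal(512), rng.standard_normal(256))

for t in reversed(range(50)):              # toy 50-step reverse diffusion
    latent = denoise_step(latent, cond, t)

print(latent.shape, cond.shape)            # (64, 1024) (768,)
```

The point of the structure is that the expensive iterative denoising happens in the small latent space, and the VAE decoder (omitted here) only runs once at the end to turn the final latent back into audio.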
Timing conditioning is achieved by passing discrete learned embeddings of the target audio length and start second into the diffusion model. During training, each audio chunk is encoded together with the second at which it was cut from its source file and that file’s total duration, so the model learns how content relates to position and length.
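A minimal sketch of this timing conditioning, assuming one learned embedding per whole second (the table sizes, embedding width, and parameter names below are illustrative assumptions, not the model's real configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM = 128
MAX_SECONDS = 512  # hypothetical cap on the representable number of seconds

# Learned lookup tables, one row per whole second. They are initialized
# randomly here; in training they would be optimized with the rest of the model.
start_table = rng.standard_normal((MAX_SECONDS, EMB_DIM))
total_table = rng.standard_normal((MAX_SECONDS, EMB_DIM))

def timing_embedding(seconds_start: int, seconds_total: int) -> np.ndarray:
    """Look up and concatenate the start-time and total-length embeddings."""
    return np.concatenate([start_table[seconds_start], total_table[seconds_total]])

# Example: a chunk cut starting at second 14 of a 180-second track.
emb = timing_embedding(seconds_start=14, seconds_total=180)
print(emb.shape)  # (256,)
```

At inference time the user's requested length is fed through the same lookup, which is what lets a prompt like "30 seconds of drum solo" actually produce 30 seconds of audio rather than an arbitrary-length clip.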
Stable Audio also represents a major speed breakthrough. Using the latest diffusion sampling techniques, the model can render up to 95 seconds of 44.1 kHz stereo audio in under one second on an NVIDIA A100 GPU, orders of magnitude faster than generating raw waveforms sample by sample.
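Some back-of-the-envelope arithmetic shows why working in a latent space helps so much. The 64× temporal downsampling factor below is an illustrative assumption (the article does not state the real figure); the comparison of sequence lengths is the point.

```python
SAMPLE_RATE = 44_100  # CD-quality sample rate, in Hz
DURATION_S = 95       # clip length from the article
CHANNELS = 2          # stereo

# Hypothetical temporal downsampling factor of the VAE -- an assumption
# used only to illustrate the scale of the savings.
DOWNSAMPLE = 64

raw_samples = SAMPLE_RATE * DURATION_S * CHANNELS
latent_frames = (SAMPLE_RATE * DURATION_S) // DOWNSAMPLE

print(raw_samples)    # 8379000 raw sample values to generate
print(latent_frames)  # 65460 latent frames for the diffusion model to handle
```

Iterating a diffusion sampler over tens of thousands of latent frames instead of millions of raw samples is what makes sub-second generation plausible on a single GPU.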
The 907M parameter diffusion model was trained on a dataset of over 800,000 audio files spanning 19,500 hours. The variety of music, instruments, and sound effects allows diverse creative applications.
Harmonai continues to refine Stable Audio. On the roadmap are further quality improvements, longer output lengths, and open-sourced models and code. Community developers will be able to build their own conditioned audio generation systems.
Stable Audio points to a future where AI audio generation not only produces impressive results, but gives users precision control over timing, length, and content. As Harmonai rolls out more tools, we inch closer to truly controllable creative AI.
Rapid advances from labs like Harmonai ensure audio generation remains an exciting space. With models like Stable Audio pushing new frontiers in speed and conditioning, the generative audio field has a bright future indeed.