in the dynamic realm of audio synthesis, a groundbreaking advancement has emerged with the introduction of Stable Audio a cutting-edge generative model. This innovative technology significantly enhances our capacity to generate intricate, top-tier audio based on textual cues.
Unlike its predecessors, Stable Audio achieves the creation of extensive stereo music and variable-length sound effects that boast both high fidelity and versatility, addressing a longstanding challenge in this domain.
The key to Stable Audio’s method lies in its distinctive integration of a fully convolutional variational autoencoder and a diffusion model, both conditioned on text prompts and timing embeddings. This unique conditioning allows unparalleled control over the audio’s content and duration, empowering the generation of intricate audio narratives closely aligned with their textual descriptions.

The inclusion of timing embeddings is groundbreaking, facilitating the generation of audio with precise lengths—an elusive feature for prior models. Performance-wise, Stable Audio sets a new benchmark in efficiency and quality, capable of rendering up to 95 seconds of stereo audio at 44.1kHz in just eight seconds on an A100 GPU.
This leap in performance does not compromise quality; in fact, Stable Audio showcases superior fidelity and structure in the generated audio. This is achieved by harnessing a latent diffusion process within a highly compressed latent space, enabling swift generation without sacrificing detail or texture.
To thoroughly assess Stable Audio’s performance, the research team introduced novel metrics tailored for evaluating long-form, full-band stereo audio. These metrics gauge the plausibility of generated audio, the semantic alignment between audio and text prompts, and the extent to which the audio adheres to provided descriptions.
Stable Audio consistently outperforms existing models by these measures, showcasing its ability to generate realistic, high-quality audio that accurately captures the nuances of input text. Notably, Stable Audio excels in producing audio with a discernible structure, encompassing introductions, developments, and conclusions while preserving stereo integrity.
This capability signifies a significant advancement over previous models that struggled with coherent long-form content generation or maintaining stereo quality over extended durations.
In summary, Stable Audio marks a significant stride in audio synthesis, bridging the gap between textual prompts and high-fidelity, structured audio. Its innovative approach unlocks new possibilities for creative expression, multimedia production, and automated content creation, setting a new standard for text-to-audio synthesis.