Stability AI, a leading artificial intelligence developer committed to the open-source ethos, this week released Stable Audio 2, a new audio and music generator. It's the first major point release since Stable Audio debuted in September, and it brings a number of enhancements that ramp up competition with tools like Suno, Google's MusicFX, and Meta's AudioCraft.
"Stable Audio 2.0 enables high-quality, full tracks with coherent musical structure up to three minutes long at 44.1 kHz stereo from a single natural language prompt," Stability AI declared.
The announcement comes amid a rocky time for Stability, which had reportedly depleted its cash reserves before CEO Emad Mostaque resigned two weeks ago.
The firm nonetheless continues to push forward in the open-source AI space. In addition to Stable Audio, the company launched a new coding LLM named Stable Code Instruct 3B on March 25 and released an advanced open-source text-to-video generator called Stable Video Diffusion last year.
Stability AI is also set to release its most advanced image generator, Stable Diffusion 3, later this year.
Among open-source adherents, Stability AI plays a leading role alongside notable names like Mistral and Nous. Big tech companies are exploring the open-source space as well, with Meta and Microsoft making important contributions.
Inside Stable Audio
At its core, Stable Audio 2 is built on a diffusion transformer (DiT), the same architecture behind Stability AI's upcoming Stable Diffusion 3 image generator, and a shift away from the U-Net design the company used previously.
DiT and U-Net are both common neural network architectures used as the denoiser in diffusion models, which refine random noise into structured data step by step. DiT's transformer backbone is particularly effective at capturing long-range structure across a sequence, while the convolutional U-Net excels at local detail in short generations but is less capable of handling longer, more complex sequences.
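To make "refining noise incrementally" concrete, here is a minimal, illustrative sketch of the sampling loop that both architectures plug into; the tiny convolutional denoiser, the update rule, and the step count are toy stand-ins for exposition, not Stability AI's actual model or sampler.

```python
# Toy diffusion sampling loop: U-Net- and DiT-based models slot into the
# same "refine noise step by step" procedure; only the denoiser differs.
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Stand-in denoiser. A real model would be a U-Net or a transformer
    (DiT), conditioned on the timestep and a text-prompt embedding."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=5, padding=2), nn.GELU(),
            nn.Conv1d(channels, 1, kernel_size=5, padding=2),
        )

    def forward(self, x: torch.Tensor, t: int) -> torch.Tensor:
        return self.net(x)  # predict the noise present in x

@torch.no_grad()
def sample(denoiser: nn.Module, length: int = 44100, steps: int = 50) -> torch.Tensor:
    x = torch.randn(1, 1, length)        # start from pure random noise
    for t in reversed(range(steps)):     # incrementally remove predicted noise
        x = x - denoiser(x, t) / steps   # toy update; real samplers (DDPM/DDIM)
                                         # follow a learned noise schedule
    return x

audio = sample(TinyDenoiser())
print(audio.shape)  # torch.Size([1, 1, 44100]) -- one second at 44.1 kHz mono
```

In practice, models like Stable Audio also run this loop in a compressed latent space learned by an autoencoder rather than on raw samples, which is part of what makes multi-minute generations tractable.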
Among the major upgrades in Stable Audio 2 is audio-to-audio generation, a new feature that enables users to transform sound samples that they upload—akin to Stable Diffusion’s img2img for image modification.
"Users can now upload audio samples and, through natural language prompts, transform these samples into a wide array of sounds," the announcement explained. “This update also expands sound effect generation and style transfer, providing artists and musicians more flexibility, control, and an elevated creative process.”
In other words, rather than refining pure random noise, Stable Audio 2 starts from the uploaded file, partially noised, and denoises it toward the user's prompt. The result is a generation that follows the prompt but sounds similar to the reference audio.
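In the same toy terms as the sketch above, the only change is the starting point: blend the reference clip with noise instead of beginning from noise alone. The "strength" knob here is an assumption borrowed from img2img conventions, not a documented Stable Audio parameter.

```python
# Toy audio-to-audio: begin from a partially noised copy of the reference
# clip, then run the usual denoising loop. A higher `strength` adds more
# noise, so the output follows the prompt more and the reference less.
import torch

@torch.no_grad()
def audio_to_audio(denoiser, reference: torch.Tensor,
                   steps: int = 50, strength: float = 0.6) -> torch.Tensor:
    noise = torch.randn_like(reference)
    x = (1 - strength) * reference + strength * noise  # partially noised start
    for t in reversed(range(int(steps * strength))):   # fewer steps, since we
        x = x - denoiser(x, t) / steps                 # didn't begin from pure noise
    return x

# E.g., with a one-second whistled melody (placeholder tensor here) and the
# TinyDenoiser from the sketch above:
# remix = audio_to_audio(TinyDenoiser(), torch.randn(1, 1, 44100))
```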
The company touts that Stable Audio 2 was trained exclusively on a licensed dataset from the AudioSparx music library, and says all artists in the library were given the option to opt out of model training, honoring their rights and ensuring fair compensation.
Decrypt tested the model, and the results showed significant improvements compared to Stable Audio 1.0. The generated music tracks were more coherent, and the generations were longer—twice as long as the 90-second limit of version one.
The prompting style of Stable Audio 2 resembles that of Stable Diffusion 1.5, focusing heavily on tags or keywords. Natural language prompts do not yield good results.
The model seems best suited for inspiration or background music rather than replacing properly trained musicians for marquee songs. In many cases, generations suffered from multiple hallucinations and discordant sounds that diverged from the prompt. Still, it did often generate nice riffs that could be used later on.
Stable Audio 2 versus Suno 3
As impressive as Stable Audio 2 is—particularly in comparison to its predecessor—its capabilities quickly wither when compared to sounds and songs generated by Suno 3, an update to the leading audio generator released only a month ago. Many AI enthusiasts say Suno 3 is the best model in the AI music space, with Kevin Hutson from Futurepedia describing it as "mindblowing" and MatVidPro saying it's a "game changer."
While what makes a pleasant—or even simply good—music track is relative, Decrypt attempted a side-by-side comparison of Stable Audio 2 and Suno 3 using the same prompts. It's an imperfect approach given the differences in their optimal prompting styles—Stable Audio prefers keywords, and Suno 3 expects natural language.
We decided to use the Stability AI approach, even though it might disadvantage Suno. Fortunately, Suno 3 understood our keyword-style instructions well enough to provide a reasonable basis for comparing their output.
Still, the Stable Audio prompting style is not friendly to beginners: using only keywords and tags can limit the creativity and complexity of the output. A normal Suno prompt, for example, could be, "A pop rock song about Decrypt, a media site covering the AI space." A typical Stable Audio prompt would be something like, "Format: Band | Instruments: electric guitar, bass, keyboards, banjo | Genre: Country | Sub-genre: Country Rock."
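For anyone scripting prompts, the tag format is simple to assemble programmatically. The helper below is purely illustrative: the field names mirror the example above and are not an official Stable Audio schema.

```python
# Hypothetical helper that assembles a Stable Audio-style tag prompt.
# Field names are illustrative, mirroring the example in the article.
def tag_prompt(**fields: str) -> str:
    return " | ".join(
        f"{key.replace('_', '-').title()}: {value}" for key, value in fields.items()
    )

print(tag_prompt(
    format="Band",
    instruments="electric guitar, bass, keyboards, banjo",
    genre="Country",
    sub_genre="Country Rock",
))
# Format: Band | Instruments: electric guitar, bass, keyboards, banjo | Genre: Country | Sub-Genre: Country Rock
```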
Out of the gate, Suno 3 has one major advantage over the competition: in addition to accepting natural language prompts, it can integrate with a large language model (LLM) to generate lyrics.
Here is a comparison between Stable Audio 2 and Suno v3, both with and without lyrics. The prompt was: "Grand, planetary, sweeping, pensive, science fiction epic orchestral opening credits theme with lonely solo violin."
In terms of the quality of the generated audio, Stable Audio 2 falls short against Suno 3. While Stability AI said its tool can generate coherent music up to three minutes long, the tracks tend to be plainer, lacking the creativity and structural complexity of the audio generated by Suno 3.
Suno 3's generations typically include proper song structure with natural riffs, choruses, bridges, and variations, making the output feel more like a complete song rather than a background instrumental track.
Here is a comparison between the generations provided by Stable Audio 2 and Suno v3. The prompt was: "Format: Band | Instruments: drum, electric guitar, bass, keyboards | Genre: Rock | Sub-genre: Heavy Metal | Mood: Energetic, Epic | Tempo: Fast."
Moreover, the transitions between riffs in Stable Audio's music generations are often abrupt. This is in stark contrast to Suno 3, which generally transitions smoothly between different parts of the song, creating a more enjoyable listening experience.
Another notable difference between the two models is the speed of audio generation. Suno 3 generates audio much faster than Stable Audio 2. While this could be a server issue, it's still an important factor to consider, especially for users who need to generate audio quickly and efficiently.
But there is one thing that Stable Audio 2 does that Suno 3 cannot do: audio-to-audio generations.
With Stable Audio 2, you could whistle the melody of a song, for example, and Stable Audio would bring some life to your ideas. This is a level of control that Suno users do not yet have. While not a dealbreaker for us, this could definitely be important for many.
Both Stable Audio and Suno are powerful and worth trying, especially if you've got the music-making bug but lack musical skills. But Stable Audio may need to reach its third version before it comes within striking distance of what Suno generates today.
Edited by Ryan Ozawa.