New AI Image Generator Does More Than SDXL With Less

Stability AI has adopted a new architecture that outperforms SDXL and SD 1.5 in both speed and quality.

By Jose Antonio Lanz

Feb 14, 2024

Image: Stability AI

Stability AI, the company behind the wildly popular Stable Diffusion image generator, has just lobbed another grenade into the hotly competitive AI arena.

Stability’s brand new Stable Cascade, powered by the new, open-source Würstchen architecture, provides a highly efficient and modular approach to text-to-image generation, balancing quality, speed, and adaptability.

The model achieves a far higher compression factor than previous Stable Diffusion models, the company claims (Stability AI cites a factor of 42, encoding a 1024x1024 image into a 24x24 latent), and it is capable of producing results with greater resolution and detail, comparable to modern generators like SDXL or Midjourney (which typically work with 1024x1024 resolutions).

Würstchen ingredients

Stable Cascade adopts a three-stage process, as distinct from the traditional Stable Diffusion pipeline:

Stage A: The Image Compressor: This stage treats an image like an advanced puzzle. Employing a Vector-Quantized Generative Adversarial Network (VQGAN), it compresses the image into a compact grid of latents (256x256 for a 1024x1024 image), and each position in the grid receives a discrete "token" from a learned codebook. This step paves the way for much faster processing in the other stages.
Stage B: The Rebuilder (Latent Diffusion Model): This stage handles the reconstruction work, expanding a heavily compressed representation back into the latents that Stage A can decode. Think of it as a skilled building renovator working from detailed instructions and blueprints.
Stage C: The Text-Conditional Latent Generator: This stage focuses solely on processing text-based instructions and producing highly compressed latents. Because text conditioning is decoupled from image reconstruction, fine-tuning for specific use cases becomes far less complex and costly.

In other words, it does what its name suggests: at generation time, the work cascades through the stages in reverse order. A text-driven generator (Stage C) churns out a tiny compressed snapshot of the image, Stage B inflates it into a more detailed representation, and Stage A decodes that into the high-quality, full-resolution image you actually see.
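The order of operations can be sketched with a shape-only toy model that involves no neural networks at all. It is not Stability AI's code: the class and function names are hypothetical, and the channel counts (16 for Stage C, 4 for Stage B) and the 4x spatial factor of the VQGAN are assumptions used for illustration. Only the 24x24 and 1024x1024 figures come from Stability AI's own description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TensorShape:
    """Shape-only stand-in for a tensor: channels x height x width."""
    channels: int
    height: int
    width: int
    produced_by: str


def stage_c(prompt: str, image_size: int = 1024) -> TensorShape:
    # Text-conditional generator: works in the smallest latent space.
    # 24x24 for a 1024x1024 target; 16 channels is an assumption.
    side = image_size * 24 // 1024
    return TensorShape(16, side, side, "stage_c")


def stage_b(small: TensorShape, image_size: int = 1024) -> TensorShape:
    # Latent diffusion "rebuilder": expands the tiny latent into the
    # VQGAN latent space (assumed 4x smaller than the image per side).
    side = image_size // 4
    return TensorShape(4, side, side, "stage_b")


def stage_a(vq_latent: TensorShape) -> TensorShape:
    # VQGAN decoder: assumed 4x upsampling back to RGB pixels.
    return TensorShape(3, vq_latent.height * 4, vq_latent.width * 4, "stage_a")


def generate(prompt: str, image_size: int = 1024) -> list[TensorShape]:
    """Run the cascade C -> B -> A and return the shape at each step."""
    trace = [stage_c(prompt, image_size)]
    trace.append(stage_b(trace[-1], image_size))
    trace.append(stage_a(trace[-1]))
    return trace


if __name__ == "__main__":
    for step in generate("an astronaut riding a horse"):
        print(f"{step.produced_by}: {step.channels} x {step.height} x {step.width}")
```

Most of the expensive, text-aware work happens on the smallest grid, which is the intuition behind the efficiency claims.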

Modular advantages

The modular design of Stable Cascade brings several compelling advantages, according to its developers. First is extreme efficiency: thanks to the compressed latent space (the compact internal representation the model works in, as opposed to the pixel space that humans see) and the focused Stage C model, Stable Cascade achieves faster inference times, meaning it generates images more quickly. And it does so with significantly reduced hardware requirements compared to larger Stable Diffusion models like SDXL.
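A rough back-of-the-envelope calculation shows why a smaller latent space matters. The sketch below compares how many latent positions a model has to process for a 1024x1024 image. The 8x-per-side downsampling of SDXL's autoencoder and the 24x24 latent size Stability AI cites for Stable Cascade are background figures rather than numbers from this article, and the function name is hypothetical. Counting positions is only a proxy for cost and ignores differences in model size and channel count.

```python
def latent_positions(image_side: int, latent_side: int) -> tuple[float, int]:
    """Return (per-side compression factor, number of latent positions)."""
    return image_side / latent_side, latent_side * latent_side


# SDXL: 1024x1024 image -> 128x128 latent (8x downsampling per side).
sdxl_factor, sdxl_positions = latent_positions(1024, 128)

# Stable Cascade Stage C: 1024x1024 image -> 24x24 latent.
cascade_factor, cascade_positions = latent_positions(1024, 24)

print(f"SDXL:           {sdxl_factor:.1f}x per side, {sdxl_positions} positions")
print(f"Stable Cascade: {cascade_factor:.1f}x per side, {cascade_positions} positions")
print(f"Ratio of positions: {sdxl_positions / cascade_positions:.1f}x fewer")
```

Under these assumptions, Stage C works on a few hundred latent positions where SDXL works on more than sixteen thousand.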

Stability AI's internal tests demonstrated Stable Cascade's ability to consistently outperform comparable models like SDXL in terms of both image quality and aesthetic appeal. Further, the model achieves these results at very high speeds while demanding significantly fewer computational resources.

Another advantage that Stability AI claims is versatility. Many of the tools that Stable Diffusion artists now use to refine their work, such as ControlNets and LoRAs, are compatible. And, because of the model's efficiency, users can add more of these tools to their workflows without running out of GPU memory.

The model's lightweight architecture, smaller model footprint, and compatibility with less powerful computing hardware lower the barrier to entry, increasing the accessibility of advanced text-to-image generation techniques for casual users and researchers alike.

Doing more with less

Our tests have found that the model is accurate and detailed and does not show the washed-out, rubbery aesthetic of earlier fast models such as Stability AI’s SDXL Turbo or LCM-based models. Instead, it generates highly detailed images on par with fine-tuned SDXL models.

It can also render basic text within images, a capability that can be further enhanced with LoRAs that are already available in online repositories like Civitai.

Stability AI reports that despite having more parameters than Stable Diffusion XL, Stable Cascade still delivers faster inference times and excels at prompt alignment.

Fine-tuning Stable Cascade is also less resource-intensive than fine-tuning similar-sized Stable Diffusion models. Researchers and enthusiasts can potentially train the model on smaller datasets and with considerably less computing power, which makes it very cost-efficient.

Stable Cascade is released under a non-commercial research license and is available on Stability AI's GitHub repository. A community-maintained ComfyUI workflow is already available as well; it automatically downloads the models for easier setup.

Edited by Ryan Ozawa.
