This does outperform Stable Diffusion 2.1, but it uses a different architecture and requires more memory and compute. Stable Diffusion runs its denoising process in a compressed "latent space", which is why it is so compute-efficient compared to other diffusion models. It also uses the (relatively) small text encoder from OpenAI's CLIP model to encode user prompts. Both of these optimizations meant that it could run much faster than, say, DALL-E or Imagen, but it didn't follow complicated user prompts especially well and had trouble with things like counting and text rendering.
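To put a rough number on the "compressed latent space" point, here is a minimal sketch of the arithmetic. The defaults are assumptions on my part, not something stated above: a 512x512 RGB output, an autoencoder that downsamples by 8x per side, and 4 latent channels. The function names are just for illustration.

```python
def tensor_elements(height, width, channels):
    """Number of values in an image-shaped tensor."""
    return height * width * channels


def latent_compression_ratio(image_size=512, downsample=8,
                             latent_channels=4, rgb_channels=3):
    """How many times fewer values the denoiser touches in latent space.

    All defaults are assumptions for illustration (512x512 RGB output,
    8x spatial downsampling, 4 latent channels).
    """
    pixel = tensor_elements(image_size, image_size, rgb_channels)
    latent_side = image_size // downsample
    latent = tensor_elements(latent_side, latent_side, latent_channels)
    return pixel / latent


if __name__ == "__main__":
    print(latent_compression_ratio())  # 48.0
```

Under those assumptions the denoiser works on about 48x fewer values per step than it would at full pixel resolution. That only counts tensor sizes; it says nothing about network size or step count.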
DeepFloyd IF is based on Google's Imagen model, which has two key differences from Stable Diffusion: (1) it denoises in pixel space instead of a compressed latent space, and (2) it uses a 10x larger pretrained text encoder (T5-XXL-1.1) compared to SD's CLIP encoder. (1) allows it to better render high-frequency details and text, and (2) allows it to understand complex prompts much better. These improvements come at the cost of several times more memory and compute than SD, though.
In terms of "will it replace SD?"—in the short term I think yes. But I still think latent diffusion models are the future. For example, Stability is gearing up to release Stable Diffusion XL right now, a larger version of the original SD that does higher fidelity and higher resolution generations. I wouldn't be surprised if it takes the crown back from DeepFloyd when it releases, but I guess we'll have to see.
> For example, Stability is gearing up to release Stable Diffusion XL right now, a larger version of the original SD that does higher fidelity and higher resolution generations. I wouldn't be surprised if it takes the crown back from DeepFloyd when it releases, but I guess we'll have to see.
SDXL is available on StabilityAI’s hosted services already, so they can be compared head to head.
I believe the version available on DreamStudio is heavily RLHF-tuned, no? I'm mostly interested to see how the raw weights perform out of the box compared to IF, and we have to wait for the release for that.
The VRAM requirements are higher (14GB), so a lot of setups that can run SD won’t run this with the existing toolchain. But some of that gap is “aftermarket” SD optimization, and this could maybe see some of that, too.
But there are consumer cards with 14GB+ VRAM, so it's not out of reach of consumer hardware, even before optimization.
I’m not sure why “denoise in latent space at 64x64 and decode to pixel space at target resolution” is fundamentally better than “denoise in pixel space at 64x64, then upscale to pixel space at target resolution and denoise some more”.
The former seems likely to be lower compute-for-resolution, but that’s not the only consideration for “better”...
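As a rough sketch of the compute-for-resolution point, one can count how many values each approach denoises per step. Assumptions here: the latent is 64x64 with 4 channels, the pixel cascade's first stage is 64x64 RGB, and its second stage denoises at 256x256 RGB (that stage size is my assumption; it isn't given above). Element counts are only a crude proxy, since they ignore network size, step counts, and the cost of decoding the latent.

```python
def elements(side, channels):
    """Values in a square image-shaped tensor."""
    return side * side * channels


# Assumed shapes, for illustration only.
latent_pipeline = {
    "denoise 64x64x4 latent": elements(64, 4),
}
pixel_cascade = {
    "denoise 64x64 RGB": elements(64, 3),
    "upscale + denoise 256x256 RGB": elements(256, 3),
}


def total(stages):
    """Sum of per-step tensor sizes across all denoising stages."""
    return sum(stages.values())


if __name__ == "__main__":
    for name, stages in [("latent", latent_pipeline), ("pixel cascade", pixel_cascade)]:
        print(name, stages, "total:", total(stages))
    print("ratio:", total(pixel_cascade) / total(latent_pipeline))  # 12.75
```

So the first stages are comparable in size (the pixel one is actually a bit smaller), and the gap comes almost entirely from denoising again at the upscaled resolution.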
I didn't think about that but you're totally right, assuming they have those embeddings cached it would be super easy to retrain SD using them. 11B parameter count is rather unfortunate though tbh, I've never been the biggest fan of "scale is all you need" even though it seems to ring irritatingly true most of the time.
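For a sense of why 11B parameters is painful, here is a back-of-the-envelope sketch of the memory needed just to hold the weights, taking the 11B figure at face value. The dtype table and function name are mine, and this ignores activations and any framework overhead.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gib(num_params, dtype):
    """Memory for the weights alone, in GiB (no activations, no overhead)."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(dtype, round(weight_memory_gib(11e9, dtype), 1), "GiB")
```

Even at half precision that is roughly 20 GiB for the weights alone, which is why offloading or quantizing a model of that size matters on consumer cards.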