This does outperform Stable Diffusion 2.1, but uses a different architecture and...

dragonwriter · on April 28, 2023

> For example, Stability is gearing up to release Stable Diffusion XL right now, a larger version of the original SD that does higher fidelity and higher resolution generations. I wouldn't be surprised if it takes the crown back from DeepFloyd when it releases, but I guess we'll have to see.

SDXL is available on StabilityAI’s hosted services already, so they can be compared head to head.

mkaic · on April 28, 2023

I believe the version available on DreamStudio is heavily RLHF-tuned, no? I'm mostly interested to see how the raw weights perform out of the box compared to IF, which we have to wait for the release for.

causality0 · on April 28, 2023

Does the increased memory footprint mean it can't be run on a normal desktop like SD?

dragonwriter · on April 28, 2023

The VRAM requirements are higher (14GB) so lots of things that can do SD won’t do this with the existing toolchain. But some of that is “aftermarket” SD optimization, and this maybe could see some of that, too.

But there are consumer cards with 14GB+ VRAM, so it’s not, even before optimization, out of reach of consumer hardware.

causality0 · on April 28, 2023

Damn. That's basically just the 4090 and 4080.

mkaic · on April 29, 2023

The 3090 has 24GB so it's an option as well.

2bitencryption · on April 28, 2023

thanks for the explanation!

denoising in latent space certainly seems like the "correct" path. My (amateur) thinking is, the more you can do in latent space, the better.

dragonwriter · on April 28, 2023

I’m not sure why “denoise in latent space at 64x64 and decode to pixel space at target resolution” is fundamentally better than “denoise in pixel space at 64x64, then upscale to pixel space at target resolution and denoise some more”.

The former seems likely to be lower compute-for-resolution, but that’s not the only consideration for “better”...
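
To put rough numbers on the compute-for-resolution point, here is a back-of-the-envelope sketch that only counts how many values each denoising step has to process. The shapes are assumptions for illustration (a 4-channel 64x64 latent as in SD 1.x/2.x, a 512x512 RGB target for both routes), and element count is only a crude proxy for compute, not a benchmark of either model.

```python
# Rough element counts per denoising step; all shapes are illustrative assumptions.

def numel(height, width, channels):
    """Number of values in one image or latent tensor."""
    return height * width * channels

# Latent route: denoise a 64x64 latent (4 channels assumed),
# then a single decoder pass produces the 512x512 RGB image.
latent_step = numel(64, 64, 4)

# Cascaded pixel route: denoise 64x64 RGB, then upscale and denoise
# again at the target resolution (512x512 assumed, to match the latent route).
pixel_base_step = numel(64, 64, 3)
pixel_upscale_step = numel(512, 512, 3)

print(latent_step)                        # 16384
print(pixel_base_step)                    # 12288
print(pixel_upscale_step)                 # 786432
print(pixel_upscale_step // latent_step)  # 48
```

Under these assumptions the base stages are comparable in size, and the gap comes from the extra denoising passes at full resolution, which touch about 48x more values per step than a latent step does.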

mftb · on April 28, 2023

This was an excellent summary, ty.

lucidrains · on April 28, 2023

tldr: bigger text encoder is better. SD will catch up quickly, as conditioning on a new set of precomputed text embeddings is a trivial change

mkaic · on April 28, 2023

I didn't think about that but you're totally right, assuming they have those embeddings cached it would be super easy to retrain SD using them. 11B parameter count is rather unfortunate though tbh, I've never been the biggest fan of "scale is all you need" even though it seems to ring irritatingly true most of the time.
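
For a sense of scale on the 11B figure, a minimal sketch of the memory needed just to hold that many weights at common precisions. The helper name `weight_memory_gb` is made up for this example, 1 GB is taken as 1e9 bytes, and activations, caches and any other model components are ignored.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

NUM_PARAMS = 11e9  # the 11B parameter count mentioned above

for name, nbytes in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    print(name, weight_memory_gb(NUM_PARAMS, nbytes))
# fp32 44.0
# fp16 22.0
# int8 11.0
```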