Details on V2 architecture #20

Open

nicolas-dufour opened this issue Sep 18, 2023 · 2 comments
@nicolas-dufour

Hey @dome272,
Amazing work on the V2.
Looking at the code, I see that Stage C is not diffusing in the latent space of EffNet, since its shape is Bx16x24x24 and not 16x12x12 as stated in the paper. However, I see that the Stage B uncond shape is still 16x12x12, so I'm a bit confused about what is happening there.
Also, if I understand correctly, Stage B is not Paella-like anymore?

Will there be a V2 of the paper as well with all the changes?
Thanks!

@madebyollin

Based on the video, it sounds like Wuerstchen V2 started training with 512x512 images (12x12 latents) and then fine-tuned on 1024x1024 images (24x24 latents) to get the final checkpoint.


This approach is similar to how the SD team did their initial training on 256x256 images (32x32 latents), then fine-tuned on 512x512 images (64x64 latents), and (for SDXL) did further fine-tuning on 1024x1024-area images to get the final checkpoint.
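For concreteness, here is a quick sanity check of the latent shapes quoted above, assuming the standard factor-8 Stable Diffusion VAE with 4 latent channels (background context for illustration, not code from this repository):

```python
# Latent-shape arithmetic for the SD training schedule mentioned above,
# assuming the standard factor-8 SD VAE (an assumption for illustration).
def sd_latent_shape(resolution: int, vae_factor: int = 8, channels: int = 4):
    side = resolution // vae_factor
    return (channels, side, side)

print(sd_latent_shape(256))  # (4, 32, 32) -- SD initial training
print(sd_latent_shape(512))  # (4, 64, 64) -- SD fine-tuning
```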

@dome272
Owner

dome272 commented Sep 18, 2023

Hey there,
@madebyollin is fully right. We pretrained at 3x512x512 -> 16x12x12 and then, after 500k iterations, moved to 3x1024x1024 -> 16x24x24. Some great people are helping us right now to rewrite the paper and bring all the updates into an updated v2 paper. But this might still take a bit :c
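
To make the shape arithmetic explicit, here is a minimal sketch; the spatial compression factor 512/12 = 1024/24 ≈ 42.67 is inferred from the shapes quoted in this thread, and the helper name is hypothetical:

```python
# Minimal sketch of the EffNet latent-shape arithmetic described above.
# The compression factor ~42.67 is inferred from 512 -> 12 and 1024 -> 24;
# the channel count stays at 16 across both training phases.
def effnet_latent_shape(batch: int, height: int, width: int,
                        channels: int = 16, factor: float = 512 / 12):
    return (batch, channels, round(height / factor), round(width / factor))

print(effnet_latent_shape(1, 512, 512))    # (1, 16, 12, 12) -- pretraining
print(effnet_latent_shape(1, 1024, 1024))  # (1, 16, 24, 24) -- fine-tuning
```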

But yes, Stage B is a diffusion model now as well. We haven't done any comparison. It was just that Pablo was initially frustrated that the LDM Stage B would always crash, and he made it his goal to get it to work really well. After this was achieved, we just went on with it. It would be interesting, though, to make a fair comparison to the Paella architecture for Stage B. Another idea would be to discretize the Stage B latents and then learn a Paella model as Stage C. But we haven't done this yet.
