
What parameter changes would I need to make sure it runs on our dataset? #2

Open
Rushi117108 opened this issue Jan 1, 2024 · 12 comments

Comments

@Rushi117108

I am running this code on a set of images but getting this error:
"CUDA out of memory. Tried to allocate 150.06 GiB (GPU 0; 15.89 GiB total capacity; 720.18 MiB already allocated; 14.31 GiB free; 736.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation."
I have reduced the batch size and also resized the images to 224x224, but it is still giving me this CUDA error.

Can you please tell me what I should do?

Thanks
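The error message's own hint (max_split_size_mb) plus the two knobs discussed in this thread can be sketched as follows. This is a minimal sketch, not the repo's code: the key names im_size and batch_size are assumptions mirroring the thread's wording, and may not match the repo's exact yaml schema.

```python
import os

# Allocator hint taken directly from the error message above: caps the size of
# cached blocks so large allocations fragment less. This is an assumption about
# what helps here, not a guaranteed fix.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

# Hypothetical config fragment; key names follow the thread, not the repo.
train_config = {
    "im_size": 64,    # activation memory scales with H*W, so 224 -> 64 helps a lot
    "batch_size": 8,  # halve repeatedly until the OOM disappears
}

def activation_memory_ratio(old_size, new_size):
    """Rough factor by which per-image activation memory shrinks (H*W scaling)."""
    return (old_size / new_size) ** 2

print(activation_memory_ratio(224, 64))  # 12.25
```

Shrinking the spatial size from 224 to 64 alone cuts per-image activation memory by roughly 12x, which is why the owner's suggestion below resolves the OOM.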

@explainingai-code
Owner

Hello,

224x224 is still large for this model. Can you please try following the steps mentioned here and see if it works fine after that?

@Rushi117108
Author

Hi,
Thank you for the reply. It is running now. But if I have to run on 224x224 images, how can I do it? BTW I am using im_size = 64.

@explainingai-code
Owner

explainingai-code commented Jan 1, 2024

With 224x224 images it would be difficult using the current code version, but you could try the following:

  1. Reduce the number of channels and layers significantly until a single GPU's memory is enough (but chances are it would not give good results).
  2. Right now the code does not support multi-GPU training, but feel free to make changes to have it run on multiple GPUs.
  3. Use a VAE/VQVAE to map 224x224 images to 64x64 latents, then train the diffusion model on these 64x64 latents on a single GPU. During sampling, feed the generated 64x64 latent to the decoder of the VAE/VQVAE to get a 224x224 image. By the end of this month I will have a repo for Stable Diffusion that will allow you to do this.
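Step 3 above can be sketched as a two-stage pipeline. Everything here is a stand-in: the encode/decode/diffusion callables are placeholders (not the repo's API), and shapes are modeled as (H, W) tuples just to show the flow.

```python
# Hedged sketch of the latent-diffusion idea in step 3. The callables below are
# placeholders for a pretrained VAE/VQVAE encoder/decoder and a DDPM update;
# shapes are (H, W) tuples, not real tensors.

def train_latent_diffusion(images, encode, diffusion_step):
    """Train diffusion on latents instead of pixels."""
    for img in images:
        z = encode(img)       # 224x224 image -> 64x64 latent
        diffusion_step(z)     # usual DDPM noise-prediction update, applied to z

def sample_full_res(sample_latent, decode):
    """Generate a latent with the diffusion model, then decode to pixel space."""
    z = sample_latent()       # diffusion produces a 64x64 latent
    return decode(z)          # VAE/VQVAE decoder: 64x64 -> 224x224

# Toy usage with shape tuples only:
out = sample_full_res(lambda: (64, 64), lambda z: (224, 224))
print(out)  # (224, 224)
```

The point is that the diffusion model only ever sees 64x64 inputs, so its memory footprint matches the im_size = 64 setup that already fits on a single GPU.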

@Rushi117108
Author

Thank you for your response.

@Rushi117108
Author

Hi,

I trained the model on a medical dataset, and after sampling the results are not as expected. Am I missing something? Please shed some light.

@explainingai-code
Owner

When you say results are not as expected, do you mean the generated images are complete garbage, or are they just not of very high quality?
Was the generation output improving throughout the training epochs?
Also, is it possible to share the model config, a sample dataset image, and the generated output?

@Rushi117108
Author

Hi,
I am attaching the config settings, the output, and an input image.
[attachments: config, output, image1_0.png]

@Rushi117108
Author

The model is improving during training.

@explainingai-code
Owner

A couple of things that I can think of:
I see your images are grayscale; is there any specific reason to use 3 channels? Maybe try with im_channels : 1.
Based on these images, I suspect the model needs to be trained more (I used 40 epochs for MNIST itself); maybe train for 100/200 epochs.

Can you see if this helps?
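The two suggestions above, as a hedged sketch: the key names (im_channels, num_epochs) follow the thread's wording and may not match the repo's actual yaml keys.

```python
# Hypothetical config fragment illustrating the two suggestions above;
# key names are assumptions, not the repo's exact schema.
model_config = {
    "im_channels": 1,  # only if the images really are single-channel grayscale
    "im_size": 64,
}
train_config = {
    "num_epochs": 200,  # MNIST converged in ~40 epochs; a harder dataset needs more
}
```

Note the follow-up below: the images turned out to actually have 3 channels, so in that case im_channels stays at 3 and only the epoch count changes.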

@Rushi117108
Author

No, the images are not grayscale; they have 3 channels. But I will train for more epochs.

@xiaoxiao079

> Hi, I am attaching config setting, output and input image

Hi there, how did you do this? My dataset also has 3 channels and I made all the changes mentioned by @explainingai-code, but I got a size mismatch error.
[error screenshot]

@explainingai-code
Owner

Hi @xiaoxiao079, it looks from the error like the code is trying to load a checkpoint that was trained with a different configuration than the one you are currently using to train/infer.
If this error comes during training, there might already be a checkpoint with the same name but trained with a different configuration, which throws the error here: https://github.com/explainingai-code/DDPM-Pytorch/blob/main/tools/train_ddpm.py#L49
If this error comes during sampling, the config you are using for sampling might be incorrect here: https://github.com/explainingai-code/DDPM-Pytorch/blob/main/tools/sample_ddpm.py#L73
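One way to catch this kind of size mismatch up front is to compare parameter shapes before loading. A minimal sketch, with state dicts modeled as {name: shape} dicts for illustration; with real PyTorch you would compare tensor shapes from model.state_dict() against torch.load(checkpoint_path). The parameter name below is hypothetical.

```python
# Hedged sketch: a pre-load sanity check for the size-mismatch error above.
# State dicts are modeled as {param_name: shape}; this is illustration only,
# not the repo's code.

def find_shape_mismatches(model_sd, ckpt_sd):
    """Return (name, model_shape, ckpt_shape) for every conflicting parameter."""
    return [
        (name, shape, ckpt_sd[name])
        for name, shape in model_sd.items()
        if name in ckpt_sd and ckpt_sd[name] != shape
    ]

# Hypothetical example: current config expects 3 input channels, but a stale
# checkpoint on disk was trained with 1 input channel.
model_sd = {"conv_in.weight": (64, 3, 3, 3)}
ckpt_sd = {"conv_in.weight": (64, 1, 3, 3)}
print(find_shape_mismatches(model_sd, ckpt_sd))
# [('conv_in.weight', (64, 3, 3, 3), (64, 1, 3, 3))]
```

If the list is non-empty before training, rename or delete the stale checkpoint; if the mismatch appears at sampling time, point the sampler at the same config the checkpoint was trained with.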
