Update: I wonder if the repo could be updated to use torch.amp or mixed precision.
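For context, this is roughly what a mixed-precision training step looks like with torch.amp. This is only a generic sketch, not this repo's actual training loop: the tiny Linear model, SGD optimizer, and MSE loss are stand-ins so the snippet runs on its own.

```python
import torch

# Hypothetical stand-ins for the real CycleGAN generator/optimizer,
# just to keep the sketch self-contained and runnable.
model = torch.nn.Linear(16, 16)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# GradScaler guards against fp16 gradient underflow on CUDA;
# with enabled=False (CPU fallback here) it is a pass-through.
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

x = torch.randn(4, 16, device=device)
target = torch.randn(4, 16, device=device)

optimizer.zero_grad()
# autocast runs eligible ops in half precision, roughly halving
# activation memory -- the main benefit for an OOM like this one.
amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16
with torch.autocast(device_type=device, dtype=amp_dtype):
    loss = torch.nn.functional.mse_loss(model(x), target)

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```

At 1024x1024 the activation memory savings from autocast might be enough to fit batch size 1 (or more) on a 16 GiB V100.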
Hello,
I'd like to ask whether, given my dataset and machine, it is normal to see out-of-memory errors, or whether I might have a programming issue.
RuntimeError: CUDA out of memory. Tried to allocate 260.00 MiB (GPU 0; 15.78 GiB total capacity; 14.02 GiB already allocated; 198.19 MiB free; 14.13 GiB reserved in total by PyTorch)
I am using an AWS p3.8xlarge (4 Tesla V100s), trying to train a CycleGAN on 5,055 images at 1024x1024 resolution.
I checked that a dataset resized to 512x512 works with batch size 4, but at 1024x1024 even batch size 1 runs out of memory.
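For intuition, activation memory scales with pixel count, so going from 512x512 to 1024x1024 quadruples the per-image cost. A rough back-of-envelope (ignoring model weights and optimizer state) suggests batch size 1 at 1024x1024 costs about as much as the batch of 4 at 512x512 that already fit:

```python
# Rough back-of-envelope: activation memory grows with H*W,
# so doubling the resolution quadruples per-image cost.
def relative_activation_cost(height, width, batch_size, base=(512, 512)):
    """Cost relative to a single 512x512 image (unitless ratio)."""
    return batch_size * (height * width) / (base[0] * base[1])

print(relative_activation_cost(512, 512, 4))    # batch 4 at 512x512  -> 4.0
print(relative_activation_cost(1024, 1024, 1))  # batch 1 at 1024x1024 -> 4.0
```

Under this estimate both configurations land at the same ratio, which is consistent with batch 1 at 1024x1024 sitting right at the edge of the 16 GiB card.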
I think we need a p4d.24xlarge for this project, but it's hard to get that instance due to lack of zone capacity.
Possible things to try:
- reduce the size of the dataset (though I think 5,055 images is still small for training). My colleague thinks the model loads the whole dataset at once, and that's why we would need to reduce it.
- look for a memory leak?
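For what it's worth, a typical PyTorch image Dataset loads samples lazily in __getitem__, so the whole dataset is normally never held in memory at once; if the repo follows this pattern, shrinking the dataset would not reduce GPU memory use. A minimal sketch (the flat-folder-of-PNGs layout is a hypothetical assumption, and the real dataset class would decode and transform the image here):

```python
from pathlib import Path
from torch.utils.data import Dataset

class LazyImageDataset(Dataset):
    """Loads one file per __getitem__ call; nothing is preloaded."""

    def __init__(self, root):
        # Hypothetical layout: a flat folder of .png files.
        self.paths = sorted(Path(root).glob("*.png"))

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # Decoding and transforms happen here, one sample at a time,
        # on the CPU -- only the current batch ever reaches the GPU.
        return self.paths[idx].read_bytes()
```

Peak GPU memory is driven by batch size and image resolution, not by the number of images on disk.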
Any comments or hints are appreciated.
Below is the log for reference.