CUDA out of memory #26

Open
youjin-c opened this issue Oct 4, 2022 · 0 comments
youjin-c commented Oct 4, 2022

Update: I wonder whether the repo could be updated to support torch.amp (mixed precision).
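For reference, here is a minimal sketch of what mixed precision looks like in a generic PyTorch training step. It is not this repo's code: the tiny `net`, the L1 loss, and `train_step` are hypothetical stand-ins, and only the `torch.cuda.amp.autocast` / `torch.cuda.amp.GradScaler` pattern is the point. The sketch falls back to plain fp32 when no GPU is present.

```python
import torch
import torch.nn as nn

# Mixed precision only applies on CUDA; fall back to plain fp32 on CPU.
use_amp = torch.cuda.is_available()
device = torch.device("cuda" if use_amp else "cpu")

# Hypothetical stand-in for a generator; NOT the repo's InceptionGenerator.
net = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(8, 3, kernel_size=3, padding=1),
).to(device)
optimizer = torch.optim.Adam(net.parameters(), lr=2e-4, betas=(0.5, 0.999))

# GradScaler scales the loss so small fp16 gradients do not underflow.
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)


def train_step(x, target):
    """One optimizer step with the forward pass run under autocast."""
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=use_amp):
        loss = nn.functional.l1_loss(net(x), target)
    scaler.scale(loss).backward()  # backward on the scaled loss
    scaler.step(optimizer)         # unscales gradients, then steps
    scaler.update()                # adjusts the scale factor for next step
    return loss.item()


x = torch.randn(1, 3, 32, 32, device=device)
loss_value = train_step(x, torch.zeros_like(x))
```

In a CycleGAN setup the same pattern would presumably have to be applied to both the generator and discriminator updates. How much memory it saves depends on which ops actually run in half precision, so it may or may not be enough for 1024x1024 inputs on a 16 GB card.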

Hello,

I'd like to ask whether, with my dataset and machine, it is normal to see this out-of-memory error, or whether I might have a programming issue.

RuntimeError: CUDA out of memory. Tried to allocate 260.00 MiB (GPU 0; 15.78 GiB total capacity; 14.02 GiB already allocated; 198.19 MiB free; 14.13 GiB reserved in total by PyTorch)

I am using an AWS p3.8xlarge (4 Tesla V100s), trying to train a CycleGAN on 5055 images at 1024x1024 resolution.

I checked that a dataset resized to 512x512 works with batch size 4, but at 1024x1024 even batch size 1 fails.
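A rough way to compare the two configurations is to count input pixels per GPU, since activation memory in a convolutional network grows roughly in proportion to them. The sketch below assumes the batch is split along the batch dimension into chunks of `ceil(batch_size / n_gpus)` samples, which is how `nn.DataParallel` scatters its input; `per_gpu_pixels` is a hypothetical helper, not part of the repo.

```python
import math


def per_gpu_pixels(batch_size, n_gpus, height, width):
    """Largest number of input pixels any single GPU handles per step.

    Assumes the batch is split along dim 0 into chunks of
    ceil(batch_size / n_gpus) samples, as nn.DataParallel does.
    """
    samples_on_busiest_gpu = math.ceil(batch_size / n_gpus)
    return samples_on_busiest_gpu * height * width


# Configuration that works: batch 4 at 512x512, one sample per GPU.
works = per_gpu_pixels(batch_size=4, n_gpus=4, height=512, width=512)
# Configuration that fails: batch 1 at 1024x1024, all on one GPU.
fails = per_gpu_pixels(batch_size=1, n_gpus=4, height=1024, width=1024)

ratio = fails / works
print(ratio)  # 4.0
```

So the failing run asks a single 16 GB GPU for roughly four times the activation memory of the working run, and with batch size 1 the other three GPUs cannot share the load. The options dump further down shows `batch_size: 4`; in that case each GPU still receives one full 1024x1024 sample, so the per-GPU ratio is the same.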

I think we need a p4d.24xlarge for this project, but it's hard to get that instance because of limited zone capacity.

Possible things to try:
- Reduce the size of the dataset (though I think 5055 images is still small for training). My colleague thinks the model loads the whole dataset at once, and that this is why we would need to reduce it.
- Look for a memory leak?
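On the leak question: a leak normally shows up as memory that keeps growing from one iteration to the next, whereas the traceback below fails during the first forward pass (`total_iter: 1`). The log also reports only about 265 MiB allocated before that first step and `load_in_memory: False`, which does not look like a whole dataset sitting on the GPU. If a run does get past the first iterations, a simple check on per-iteration readings can separate the two cases. The helper below is a hypothetical sketch; the readings would come from something like `torch.cuda.memory_allocated() / 2**20`, sampled at the same point in every iteration.

```python
def grows_steadily(readings_mib, min_steps=5, tolerance_mib=1.0):
    """Return True if memory rose by more than the tolerance at every
    one of the last `min_steps` steps.

    readings_mib: one "memory allocated" value per iteration, in MiB.
    """
    if len(readings_mib) < min_steps + 1:
        return False  # not enough data to call it a trend
    recent = readings_mib[-(min_steps + 1):]
    return all(b - a > tolerance_mib for a, b in zip(recent, recent[1:]))


# Illustrative, made-up numbers (not measurements from this run).
steady = [9000.0, 9001.0, 9000.5, 9000.0, 9001.0, 9000.5]
leaking = [9000.0, 9050.0, 9100.0, 9150.0, 9200.0, 9250.0]

print(grows_steadily(steady))   # False
print(grows_steadily(leaking))  # True
```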

Any comments or hints are appreciated.

Below is the log for reference.

train.py --dataroot database/face2smile \
  --model cycle_gan \
  --log_dir logs/cycle_gan/face2smile/teacher_1080 \
  --netG inception_9blocks \
  --real_stat_A_path real_stat_1080/face2smile_A.npz \
  --real_stat_B_path real_stat_1080/face2smile_B.npz \
  --batch_size 1 \
  --num_threads 1 \
  --gpu_ids 0,1,2,3 \
  --norm_affine \
  --norm_affine_D \
  --channels_reduction_factor 6 \
  --kernel_sizes 1 3 5 \
  --save_latest_freq 10000 --save_epoch_freq 5 \
  --nepochs 1 --nepochs_decay 0 \
  --preprocess none
----------------- Options ---------------
                active_fn: nn.ReLU                       
              active_fn_D: nn.LeakyReLU                  
             aspect_ratio: 1.0                           
               batch_size: 4                             	[default: 1]
                    beta1: 0.5                           
                 channels: None                          
channels_reduction_factor: 6                             	[default: 1]
          cityscapes_path: database/cityscapes-origin    
                crop_size: 256, 256                      
                 dataroot: database/face2smile           	[default: None]
             dataset_mode: unaligned                     
                direction: AtoB                          
          display_winsize: 256                           
                 drn_path: drn-d-105_ms_cityscapes.pth   
             dropout_rate: 0                             
               epoch_base: 1                             
          eval_batch_size: 1                             
                 gan_mode: lsgan                         
                  gpu_ids: 0,1,2,3                       	[default: 0]
                init_gain: 0.02                          
                init_type: normal                        
                 input_nc: 3                             
                  isTrain: True                          	[default: None]
                iter_base: 1                             
             kernel_sizes: [1, 3, 5]                     	[default: [3, 5, 7]]
                 lambda_A: 10.0                          
                 lambda_B: 10.0                          
          lambda_identity: 0.5                           
           load_in_memory: False                         
                load_size: 286                           
                  log_dir: logs/cycle_gan/face2smile/teacher_1080	[default: logs]
                       lr: 0.0002                        
           lr_decay_iters: 50                            
                lr_policy: linear                        
         max_dataset_size: -1                            
                    model: cycle_gan                     	[default: pix2pix]
     moving_average_decay: 0.0                           
moving_average_decay_adjust: False                         
moving_average_decay_base_batch: 32                            
               n_layers_D: 3                             
                      ndf: 64                            
                  nepochs: 1                             	[default: 100]
            nepochs_decay: 0                             	[default: 100]
                     netD: n_layers                      
                     netG: inception_9blocks             
                      ngf: 64                            
                  no_flip: False                         
                     norm: instance                      
              norm_affine: True                          	[default: False]
            norm_affine_D: True                          	[default: False]
             norm_epsilon: 1e-05                         
            norm_momentum: 0.1                           
             norm_student: instance                      
 norm_track_running_stats: False                         
              num_threads: 32                            	[default: 4]
                output_nc: 3                             
             padding_type: reflect                       
                    phase: train                         
                pool_size: 50                            
               preprocess: none                          	[default: resize_and_crop]
               print_freq: 100                           
         real_stat_A_path: real_stat_1080/face2smile_A.npz	[default: None]
         real_stat_B_path: real_stat_1080/face2smile_B.npz	[default: None]
         restore_D_A_path: None                          
         restore_D_B_path: None                          
         restore_G_A_path: None                          
         restore_G_B_path: None                          
           restore_O_path: None                          
          save_epoch_freq: 5                             	[default: 20]
         save_latest_freq: 10000                         	[default: 20000]
                     seed: 233                           
           serial_batches: False                         
               table_path: datasets/table.txt            
          tensorboard_dir: None                          
----------------- End -------------------
train.py --dataroot database/face2smile --model cycle_gan --log_dir logs/cycle_gan/face2smile/teacher_1080 --netG inception_9blocks --real_stat_A_path real_stat_1080/face2smile_A.npz --real_stat_B_path real_stat_1080/face2smile_B.npz --batch_size 4 --num_threads 32 --gpu_ids 0,1,2,3 --norm_affine --norm_affine_D --channels_reduction_factor 6 --kernel_sizes 1 3 5 --save_latest_freq 10000 --save_epoch_freq 5 --nepochs 1 --nepochs_decay 0 --preprocess none
dataset [UnalignedDataset] was created
The number of training images = 5055
data shape is: channel=3, height=1024, width=1024.
initialize network with normal
initialize network with normal
initialize network with normal
initialize network with normal
dataset [SingleDataset] was created
dataset [SingleDataset] was created
/home/ubuntu/.local/lib/python3.9/site-packages/torchvision/models/inception.py:80: FutureWarning: The default weight initialization of inception_v3 will be changed in future releases of torchvision. If you wish to keep the old behavior (which leads to long initialization times due to scipy/scipy#11299), please set init_weights=True.
  warnings.warn('The default weight initialization of inception_v3 will be changed in future releases of '
model [CycleGANModel] was created
---------- Networks initialized -------------
DataParallel(
  (module): InceptionGenerator(
    (down_sampling): Sequential(
      (0): ReflectionPad2d((3, 3, 3, 3))
      (1): Conv2d(3, 64, kernel_size=(7, 7), stride=(1, 1))
      (2): InstanceNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
      (3): ReLU(inplace=True)
      (4): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
      (5): InstanceNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
      (6): ReLU(inplace=True)
      (7): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
      (8): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
      (9): ReLU(inplace=True)
    )
    (features): Sequential(
      (0): InvertedResidualChannels(256, 256, res_channels=[42, 42, 42], dw_channels=[42, 42, 42], res_kernel_sizes=[1, 3, 5], dw_kernel_sizes=[1, 3, 5])
      (1): InvertedResidualChannels(256, 256, res_channels=[42, 42, 42], dw_channels=[42, 42, 42], res_kernel_sizes=[1, 3, 5], dw_kernel_sizes=[1, 3, 5])
      (2): InvertedResidualChannels(256, 256, res_channels=[42, 42, 42], dw_channels=[42, 42, 42], res_kernel_sizes=[1, 3, 5], dw_kernel_sizes=[1, 3, 5])
      (3): InvertedResidualChannels(256, 256, res_channels=[42, 42, 42], dw_channels=[42, 42, 42], res_kernel_sizes=[1, 3, 5], dw_kernel_sizes=[1, 3, 5])
      (4): InvertedResidualChannels(256, 256, res_channels=[42, 42, 42], dw_channels=[42, 42, 42], res_kernel_sizes=[1, 3, 5], dw_kernel_sizes=[1, 3, 5])
      (5): InvertedResidualChannels(256, 256, res_channels=[42, 42, 42], dw_channels=[42, 42, 42], res_kernel_sizes=[1, 3, 5], dw_kernel_sizes=[1, 3, 5])
      (6): InvertedResidualChannels(256, 256, res_channels=[42, 42, 42], dw_channels=[42, 42, 42], res_kernel_sizes=[1, 3, 5], dw_kernel_sizes=[1, 3, 5])
      (7): InvertedResidualChannels(256, 256, res_channels=[42, 42, 42], dw_channels=[42, 42, 42], res_kernel_sizes=[1, 3, 5], dw_kernel_sizes=[1, 3, 5])
      (8): InvertedResidualChannels(256, 256, res_channels=[42, 42, 42], dw_channels=[42, 42, 42], res_kernel_sizes=[1, 3, 5], dw_kernel_sizes=[1, 3, 5])
    )
    (up_sampling): Sequential(
      (0): ConvTranspose2d(256, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), output_padding=(1, 1))
      (1): InstanceNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
      (2): ReLU(inplace=True)
      (3): ConvTranspose2d(128, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), output_padding=(1, 1))
      (4): InstanceNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
      (5): ReLU(inplace=True)
      (6): ReflectionPad2d((3, 3, 3, 3))
      (7): Conv2d(64, 3, kernel_size=(7, 7), stride=(1, 1))
      (8): Tanh()
    )
  )
)
[Network G_A] Total number of parameters : 8.154 M
DataParallel(
  (module): InceptionGenerator(
    (down_sampling): Sequential(
      (0): ReflectionPad2d((3, 3, 3, 3))
      (1): Conv2d(3, 64, kernel_size=(7, 7), stride=(1, 1))
      (2): InstanceNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
      (3): ReLU(inplace=True)
      (4): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
      (5): InstanceNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
      (6): ReLU(inplace=True)
      (7): Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
      (8): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
      (9): ReLU(inplace=True)
    )
    (features): Sequential(
      (0): InvertedResidualChannels(256, 256, res_channels=[42, 42, 42], dw_channels=[42, 42, 42], res_kernel_sizes=[1, 3, 5], dw_kernel_sizes=[1, 3, 5])
      (1): InvertedResidualChannels(256, 256, res_channels=[42, 42, 42], dw_channels=[42, 42, 42], res_kernel_sizes=[1, 3, 5], dw_kernel_sizes=[1, 3, 5])
      (2): InvertedResidualChannels(256, 256, res_channels=[42, 42, 42], dw_channels=[42, 42, 42], res_kernel_sizes=[1, 3, 5], dw_kernel_sizes=[1, 3, 5])
      (3): InvertedResidualChannels(256, 256, res_channels=[42, 42, 42], dw_channels=[42, 42, 42], res_kernel_sizes=[1, 3, 5], dw_kernel_sizes=[1, 3, 5])
      (4): InvertedResidualChannels(256, 256, res_channels=[42, 42, 42], dw_channels=[42, 42, 42], res_kernel_sizes=[1, 3, 5], dw_kernel_sizes=[1, 3, 5])
      (5): InvertedResidualChannels(256, 256, res_channels=[42, 42, 42], dw_channels=[42, 42, 42], res_kernel_sizes=[1, 3, 5], dw_kernel_sizes=[1, 3, 5])
      (6): InvertedResidualChannels(256, 256, res_channels=[42, 42, 42], dw_channels=[42, 42, 42], res_kernel_sizes=[1, 3, 5], dw_kernel_sizes=[1, 3, 5])
      (7): InvertedResidualChannels(256, 256, res_channels=[42, 42, 42], dw_channels=[42, 42, 42], res_kernel_sizes=[1, 3, 5], dw_kernel_sizes=[1, 3, 5])
      (8): InvertedResidualChannels(256, 256, res_channels=[42, 42, 42], dw_channels=[42, 42, 42], res_kernel_sizes=[1, 3, 5], dw_kernel_sizes=[1, 3, 5])
    )
    (up_sampling): Sequential(
      (0): ConvTranspose2d(256, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), output_padding=(1, 1))
      (1): InstanceNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
      (2): ReLU(inplace=True)
      (3): ConvTranspose2d(128, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), output_padding=(1, 1))
      (4): InstanceNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
      (5): ReLU(inplace=True)
      (6): ReflectionPad2d((3, 3, 3, 3))
      (7): Conv2d(64, 3, kernel_size=(7, 7), stride=(1, 1))
      (8): Tanh()
    )
  )
)
[Network G_B] Total number of parameters : 8.154 M
DataParallel(
  (module): NLayerDiscriminator(
    (model): Sequential(
      (0): Conv2d(3, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
      (1): LeakyReLU(negative_slope=0.2, inplace=True)
      (2): Conv2d(64, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
      (3): InstanceNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
      (4): LeakyReLU(negative_slope=0.2, inplace=True)
      (5): Conv2d(128, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
      (6): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
      (7): LeakyReLU(negative_slope=0.2, inplace=True)
      (8): Conv2d(256, 512, kernel_size=(4, 4), stride=(1, 1), padding=(1, 1))
      (9): InstanceNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
      (10): LeakyReLU(negative_slope=0.2, inplace=True)
      (11): Conv2d(512, 1, kernel_size=(4, 4), stride=(1, 1), padding=(1, 1))
    )
  )
)
[Network D_A] Total number of parameters : 2.767 M
DataParallel(
  (module): NLayerDiscriminator(
    (model): Sequential(
      (0): Conv2d(3, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
      (1): LeakyReLU(negative_slope=0.2, inplace=True)
      (2): Conv2d(64, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
      (3): InstanceNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
      (4): LeakyReLU(negative_slope=0.2, inplace=True)
      (5): Conv2d(128, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
      (6): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
      (7): LeakyReLU(negative_slope=0.2, inplace=True)
      (8): Conv2d(256, 512, kernel_size=(4, 4), stride=(1, 1), padding=(1, 1))
      (9): InstanceNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
      (10): LeakyReLU(negative_slope=0.2, inplace=True)
      (11): Conv2d(512, 1, kernel_size=(4, 4), stride=(1, 1), padding=(1, 1))
    )
  )
)
[Network D_B] Total number of parameters : 2.767 M
-----------------------------------------------
start_epoch: 1
end_epoch: 1
total_iter: 1
current memory allocated: 265.4296875
max memory allocated: 265.4296875
cached memory: 276.0
will set input data
Traceback (most recent call last):
  File "/data/CAT/train.py", line 14, in <module>
    trainer.start()
  File "/data/CAT/trainer.py", line 159, in start
    model.optimize_parameters(total_iter)
  File "/data/CAT/models/cycle_gan_model.py", line 295, in optimize_parameters
    self.forward()
  File "/data/CAT/models/cycle_gan_model.py", line 235, in forward
    self.rec_A = self.netG_B(self.fake_B)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 167, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/torch/nn/parallel/data_parallel.py", line 177, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/ubuntu/.local/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/ubuntu/.local/lib/python3.9/site-packages/torch/_utils.py", line 429, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.9/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/data/CAT/models/modules/inception_architecture/inception_generator.py", line 141, in forward
    res = self.up_sampling(res)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/torch/nn/modules/container.py", line 119, in forward
    input = module(input)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.9/site-packages/torch/nn/modules/padding.py", line 173, in forward
    return F.pad(input, self.padding, 'reflect')
  File "/home/ubuntu/.local/lib/python3.9/site-packages/torch/nn/functional.py", line 4014, in _pad
    return torch._C._nn.reflection_pad2d(input, pad)
RuntimeError: CUDA out of memory. Tried to allocate 260.00 MiB (GPU 0; 15.78 GiB total capacity; 14.02 GiB already allocated; 198.19 MiB free; 14.13 GiB reserved in total by PyTorch)
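The 260 MiB request in the traceback can be checked with a little arithmetic. The failing call is a `ReflectionPad2d((3, 3, 3, 3))` inside `up_sampling`, whose input at that point has 64 channels at full 1024x1024 resolution. The sketch below computes the size of the padded fp32 output for one sample. The rounding step is an assumption about how PyTorch's caching allocator sizes large blocks (multiples of 2 MiB), and `padded_feature_map_mib` is a hypothetical helper.

```python
import math

MIB = 2 ** 20


def padded_feature_map_mib(batch, channels, height, width, pad, bytes_per_elem=4):
    """Size in MiB of a feature map after padding `pad` pixels on all four sides."""
    elems = batch * channels * (height + 2 * pad) * (width + 2 * pad)
    return elems * bytes_per_elem / MIB


# One sample, 64 channels, 1024x1024, padded by 3 on each side, fp32.
raw = padded_feature_map_mib(batch=1, channels=64, height=1024, width=1024, pad=3)

# Assumption: large requests are rounded up to a multiple of 2 MiB.
rounded = math.ceil(raw / 2) * 2

print(round(raw, 2), rounded)  # 259.01 260
```

Under those assumptions the numbers line up with the error message: a single full-resolution 64-channel feature map already costs about 260 MiB, and the forward passes of both generators keep many such tensors alive for the backward pass. That is consistent with the model simply not fitting in 16 GB at this resolution.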
