
Training: CUDA: Out of Memory Optimizations #4
Open · raks097 opened this issue Jan 23, 2020 · 11 comments

raks097 commented Jan 23, 2020

Hi,
A wonderful paper, and thanks for providing the implementation so that one can reproduce the results.

I tried training the privileged agent using the script mentioned in the README:
python train_birdview.py --dataset_dir=../data/sample --log_dir=../logs/sample

I get a RuntimeError: Tried to allocate 144.00 MiB (GPU 0; 10.73 GiB total capacity; 9.77 GiB already allocated; 74.62 MiB free; 69.10 MiB cached), followed by a ConnectionResetError.

I tried tracing the error with nvidia-smi and found that memory usage builds up quickly (reaching the maximum) before training begins.

Any leads and suggestions are much appreciated.
Thanks

Attaching the full stack trace for further reference:
[attachment: stack trace]

dotchen (Owner) commented Jan 23, 2020

Thank you for your interest in our paper!

Can you make sure you installed the packages following the instructions in the README, and that the code is unchanged? Training the privileged agent with the default batch size will not take up 10 GB of memory, so also double-check that you don't have other programs running on that GPU.

raks097 (Author) commented Jan 24, 2020

> Thank you for your interest in our paper!
>
> Can you make sure you installed the packages following the instructions in the README, and that the code is unchanged? Training the privileged agent with the default batch size will not take up 10 GB of memory, so also double-check that you don't have other programs running on that GPU.

Hi @dianchen96, thank you for the quick response.
The only code change I made was fixing an import error (utils.train_utils instead of train_util), and I made sure my GPU has no other background processes running.

Apart from that, I'm using PyTorch 1.0.0 (build py3.5_cuda10.0.130_cudnn7.4.1_1) instead of the suggested py3.5_cuda8.0.61_cudnn7.1.2_1, since my GPU was raising warnings to use a newer version of CUDA.
I might be wrong, but ideally that should not be an issue.

The benchmark_agent.py script worked without any issues.

bradyz (Collaborator) commented Jan 24, 2020

Can you test with a smaller batch size, like 32 or 16, and see if those OOM?

raks097 (Author) commented Jan 24, 2020

> Can you test with a smaller batch size, like 32 or 16, and see if those OOM?

UPDATE: I have just tested it with a batch_size of 128, and that seems to do the trick.
Phase 0 training could only be done with a batch_size of 64.
Any suggestions on optimizations (if possible; I'm fairly new to the PyTorch framework) that I could make in order to replicate the original results with a batch size of 256?

Thanks
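One generic option, not something from this repository, is gradient accumulation: run smaller minibatches and only step the optimizer every few iterations, which approximates a larger effective batch at roughly the memory cost of the small one. A minimal, self-contained sketch with a hypothetical stand-in model and data (none of these names come from the repo):

import torch
import torch.nn as nn

# Hypothetical stand-ins for the real model and data; only the accumulation
# pattern itself is the point here.
model = nn.Linear(64, 5).cuda()
optim = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.MSELoss()
loader = [(torch.randn(128, 64), torch.randn(128, 5)) for _ in range(8)]

accumulation_steps = 2  # two minibatches of 128 -> an effective batch of 256

optim.zero_grad()
for i, (x, y) in enumerate(loader):
    x, y = x.cuda(), y.cuda()
    loss = criterion(model(x), y)
    (loss / accumulation_steps).backward()  # scale so accumulated gradients average correctly
    if (i + 1) % accumulation_steps == 0:
        optim.step()       # one update per `accumulation_steps` minibatches
        optim.zero_grad()

Note that batch-dependent statistics (e.g. batch norm) still only see the small minibatch, so this is not exactly equivalent to a true batch of 256.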

raks097 changed the title from "Training: CUDA: Out of Memory" to "Training: CUDA: Out of Memory Optimizations" on Jan 24, 2020
bradyz (Collaborator) commented Jan 24, 2020

In the train loop, try changing

            optim.zero_grad()
            loss_mean.backward()
            optim.step()

to

            loss_mean.backward()
            optim.step()
            optim.zero_grad()
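For context, here is how that reordering sits in a minimal loop; the model and data are placeholders, not this repo's code. On newer PyTorch versions (>= 1.7), optim.zero_grad(set_to_none=True) additionally frees the gradient tensors instead of zero-filling them, which can lower peak memory:

import torch
import torch.nn as nn

# Placeholder model and optimizer, just to show where the reordered calls go.
model = nn.Linear(10, 2).cuda()
optim = torch.optim.SGD(model.parameters(), lr=0.01)

for _ in range(5):
    x = torch.randn(32, 10, device='cuda')
    loss_mean = model(x).pow(2).mean()

    loss_mean.backward()  # accumulate gradients
    optim.step()          # apply the update
    optim.zero_grad()     # clear gradients at the end of the iteration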

raks097 (Author) commented Jan 28, 2020

> In the train loop, try changing
>
>     optim.zero_grad()
>     loss_mean.backward()
>     optim.step()
>
> to
>
>     loss_mean.backward()
>     optim.step()
>     optim.zero_grad()

Hi,
I tried the suggested change; however, it does not fix the issue.
The issue seems to occur during these statements:

location_preds = [location_pred(h) for location_pred in self.location_pred]
location_preds = torch.stack(location_preds, dim=1)

Stacking those location predictions seems to make my GPU run out of memory (an RTX 2080 with 11 GB of available memory).

Any suggestions regarding this?
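If it helps to confirm that the stack is where allocation spikes, it can be measured around those two lines with torch.cuda.memory_allocated(); the shapes and modules below are made up, not the repo's real ones:

import torch

# Made-up feature map and prediction heads, only to illustrate the measurement.
h = torch.randn(256, 64, 48, 48, device='cuda')
location_pred = [torch.nn.Conv2d(64, 5, 1).cuda() for _ in range(4)]

before = torch.cuda.memory_allocated()
location_preds = [pred(h) for pred in location_pred]
location_preds = torch.stack(location_preds, dim=1)
after = torch.cuda.memory_allocated()

print('allocated by preds + stack: %.1f MiB' % ((after - before) / 2 ** 20))
print('peak allocated so far:      %.1f MiB' % (torch.cuda.max_memory_allocated() / 2 ** 20))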

dotchen (Owner) commented Jan 28, 2020

Hmm, interesting; we have not experienced this, and we're not sure how much of it is due to the hardware or a CUDA/cuDNN mismatch. Does the OOM happen right after this operation, or during the backward pass?

raks097 (Author) commented Jan 29, 2020

> Hmm, interesting; we have not experienced this, and we're not sure how much of it is due to the hardware or a CUDA/cuDNN mismatch. Does the OOM happen right after this operation, or during the backward pass?

It happens right after the operation. I am currently using CUDA 10.0.130 with cuDNN 7.6.0.
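For completeness, the versions the installed PyTorch build actually links against can be printed like this, which helps when the system CUDA and the conda build disagree:

import torch

print('torch :', torch.__version__)
print('cuda  :', torch.version.cuda)
print('cudnn :', torch.backends.cudnn.version())
print('device:', torch.cuda.get_device_name(0))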

tomcur added a commit to tomcur/LearningByCheating that referenced this issue Apr 17, 2020
tomcur (Contributor) commented Apr 17, 2020

I had the same issue with training/train_birdview.py (I have not tested the other training phases yet).

I was able to run with a minibatch size of 128. Delving deeper into the problem, I noticed reserved memory doubling from the end of the first iteration to the end of the second.

Start of first iteration: 1210 MiB reserved
Start of second iteration (one complete iteration done): 3958 MiB reserved
Start of third iteration (two complete iterations done): 6574 MiB reserved
Start of fourth iteration (three complete iterations done): 6814 MiB reserved

I opened #19, which explicitly deletes device-converted tensors at the end of each iteration, to let Python/PyTorch know the memory is free again.

I'm not sure if this has always been necessary in the context of iterations, but in general the following pattern is quite memory-inefficient:

import torch

foo = torch.zeros(512, 512, 512).to(0)
# At this point 512 MiB is in use on the GPU.

# This line allocates a further 512 MiB before the old tensor is released,
# so 1024 MiB is needed at the peak.
foo = torch.zeros(512, 512, 512).to(0)
# The previous tensor is now freed, so 512 MiB is in use again (PyTorch still
# keeps the "freed" 512 MiB as cache unless it is explicitly released).
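A sketch of the workaround the PR describes, i.e. dropping the device-side tensors at the end of each iteration so the caching allocator can reuse that memory instead of reserving a second copy; the loader and tensor names here are placeholders, not the repo's real variables:

import torch

loader = [(torch.randn(128, 7, 192, 192), torch.randn(128, 5, 2)) for _ in range(4)]

for birdview, location in loader:
    birdview = birdview.to('cuda')
    location = location.to('cuda')

    # ... forward pass, backward pass, and optimizer step would go here ...

    # Drop the last references so the allocator can reuse this memory in the
    # next iteration rather than holding both the old and the new batch.
    del birdview, location

In the pattern above, this corresponds to deleting foo before the second allocation, which keeps the peak at 512 MiB instead of 1024 MiB.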

Kin-Zhang commented

I also ran into this error:
[screenshot: CUDA out of memory RuntimeError]

The solution is what bradyz said:

> Can you test with a smaller batch size, like 32 or 16, and see if those OOM?

Use a smaller batch_size in train_birdview.py:

parser.add_argument('--batch_size', type=int, default=128)
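Assuming the same invocation as in the README above, the smaller batch size can also be passed on the command line without editing the default:

python train_birdview.py --dataset_dir=../data/sample --log_dir=../logs/sample --batch_size=128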

tomcur (Contributor) commented Jun 13, 2021

@Kin-Zhang If you apply this PR, you can run with a 2x larger batch size: #19.
