
Training: CUDA: Out of Memory Optimizations #4
Open · raks097 opened this issue Jan 23, 2020 · 11 comments

raks097 commented Jan 23, 2020

Hi,
A wonderful paper, and thanks for providing the implementation so that one can reproduce the results.

I tried training the privileged agent using the script mentioned in the README:
python train_birdview.py --dataset_dir=../data/sample --log_dir=../logs/sample

I get a RuntimeError: Tried to allocate 144.00 MiB (GPU 0; 10.73 GiB total capacity; 9.77 GiB already allocated; 74.62 MiB free; 69.10 MiB cached), followed by a ConnectionResetError.

I tried tracing the error with nvidia-smi and found that memory usage builds up quickly (reaching the maximum) before training begins.

Any leads and suggestions are much appreciated.
Thanks

Attaching the full stack trace for further reference:
[attachment: stack trace]

dotchen (Owner) commented Jan 23, 2020

Thank you for your interest in our paper!

Can you make sure you installed the packages following the instructions in the README, and that the code is unchanged? Training the privileged agent with the default batch size will not take up 10 GB of memory, so also double-check that you don't have other programs running on that GPU.

raks097 (Author) commented Jan 24, 2020

> Thank you for your interest in our paper!
>
> Can you make sure you installed the packages following the instructions in the README, and that the code is unchanged? Training the privileged agent with the default batch size will not take up 10 GB of memory, so also double-check that you don't have other programs running on that GPU.

Hi @dianchen96, thank you for the quick response.
The only code change I made was fixing an import error (utils.train_utils instead of train_util), and I made sure my GPU has no other background processes running.

Apart from that, I'm using PyTorch 1.0.0 (build py3.5_cuda10.0.130_cudnn7.4.1_1) instead of the suggested py3.5_cuda8.0.61_cudnn7.1.2_1, since my GPU was raising warnings to use a newer version of CUDA.
I might be wrong, but ideally that should not be an issue.

The benchmark_agent.py script worked without any issues.

bradyz (Collaborator) commented Jan 24, 2020

Can you test with a smaller batch size, like 32 or 16, and see if those OOM?

raks097 (Author) commented Jan 24, 2020

> Can you test with a smaller batch size, like 32 or 16, and see if those OOM?

UPDATE: I have just tested it with a batch_size of 128, and that seems to do the trick.
Phase 0 training could only be done with a batch_size of 64.
Any suggestions on optimizations (if possible; I'm fairly new to the PyTorch framework) that I could make in order to replicate the original results with a batch size of 256?

Thanks
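One generic option, not something from this repository, is gradient accumulation: run smaller minibatches and only step the optimizer every few iterations, which approximates a larger effective batch at roughly the memory cost of the small one. A minimal, self-contained sketch with a hypothetical stand-in model and data (none of these names come from the repo):

import torch
import torch.nn as nn

# Hypothetical stand-ins for the real model and data; only the accumulation
# pattern itself is the point here.
model = nn.Linear(64, 5).cuda()
optim = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.MSELoss()
loader = [(torch.randn(128, 64), torch.randn(128, 5)) for _ in range(8)]

accumulation_steps = 2  # two minibatches of 128 -> an effective batch of 256

optim.zero_grad()
for i, (x, y) in enumerate(loader):
    x, y = x.cuda(), y.cuda()
    loss = criterion(model(x), y)
    (loss / accumulation_steps).backward()  # scale so accumulated gradients average correctly
    if (i + 1) % accumulation_steps == 0:
        optim.step()       # one update per `accumulation_steps` minibatches
        optim.zero_grad()

Note that batch-dependent statistics (e.g. batch norm) still only see the small minibatch, so this is not exactly equivalent to a true batch of 256.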

raks097 changed the title from "Training: CUDA: Out of Memory" to "Training: CUDA: Out of Memory Optimizations" on Jan 24, 2020
bradyz (Collaborator) commented Jan 24, 2020

In the train loop, try changing

            optim.zero_grad()
            loss_mean.backward()
            optim.step()

to

            loss_mean.backward()
            optim.step()
            optim.zero_grad()
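For context, here is how that reordering sits in a minimal loop; the model and data are placeholders, not this repo's code. On newer PyTorch versions (>= 1.7), optim.zero_grad(set_to_none=True) additionally frees the gradient tensors instead of zero-filling them, which can lower peak memory:

import torch
import torch.nn as nn

# Placeholder model and optimizer, just to show where the reordered calls go.
model = nn.Linear(10, 2).cuda()
optim = torch.optim.SGD(model.parameters(), lr=0.01)

for _ in range(5):
    x = torch.randn(32, 10, device='cuda')
    loss_mean = model(x).pow(2).mean()

    loss_mean.backward()  # accumulate gradients
    optim.step()          # apply the update
    optim.zero_grad()     # clear gradients at the end of the iteration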

raks097 (Author) commented Jan 28, 2020

> In the train loop, try changing
>
>     optim.zero_grad()
>     loss_mean.backward()
>     optim.step()
>
> to
>
>     loss_mean.backward()
>     optim.step()
>     optim.zero_grad()

Hi,
I tried the suggested change; however, it does not fix the issue.
The issue seems to occur during these statements:

location_preds = [location_pred(h) for location_pred in self.location_pred]
location_preds = torch.stack(location_preds, dim=1)

Stacking those location predictions seems to make my GPU run out of memory (an RTX 2080 with 11 GB of available memory).

Any suggestions regarding this?
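If it helps to confirm that the stack is where allocation spikes, it can be measured around those two lines with torch.cuda.memory_allocated(); the shapes and modules below are made up, not the repo's real ones:

import torch

# Made-up feature map and prediction heads, only to illustrate the measurement.
h = torch.randn(256, 64, 48, 48, device='cuda')
location_pred = [torch.nn.Conv2d(64, 5, 1).cuda() for _ in range(4)]

before = torch.cuda.memory_allocated()
location_preds = [pred(h) for pred in location_pred]
location_preds = torch.stack(location_preds, dim=1)
after = torch.cuda.memory_allocated()

print('allocated by preds + stack: %.1f MiB' % ((after - before) / 2 ** 20))
print('peak allocated so far:      %.1f MiB' % (torch.cuda.max_memory_allocated() / 2 ** 20))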

dotchen (Owner) commented Jan 28, 2020

Hmm, interesting; we have not experienced this, and we're not sure how much of it is due to the hardware or a CUDA/cuDNN mismatch. Does the OOM happen right after this operation, or during the backward pass?

raks097 (Author) commented Jan 29, 2020

> Hmm, interesting; we have not experienced this, and we're not sure how much of it is due to the hardware or a CUDA/cuDNN mismatch. Does the OOM happen right after this operation, or during the backward pass?

It happens right after the operation. I am currently using CUDA 10.0.130 with cuDNN 7.6.0.
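For completeness, the versions the installed PyTorch build actually links against can be printed like this, which helps when the system CUDA and the conda build disagree:

import torch

print('torch :', torch.__version__)
print('cuda  :', torch.version.cuda)
print('cudnn :', torch.backends.cudnn.version())
print('device:', torch.cuda.get_device_name(0))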

tomcur added a commit to tomcur/LearningByCheating that referenced this issue Apr 17, 2020
tomcur (Contributor) commented Apr 17, 2020

I had the same issue with training/train_birdview.py (I have not tested the other training phases yet).

I was able to run with a minibatch size of 128. Delving deeper into the problem, I noticed reserved memory doubling from the end of the first iteration to the end of the second.

Start of first iteration: 1210 MiB reserved
Start of second iteration (one complete iteration done): 3958 MiB reserved
Start of third iteration (two complete iterations done): 6574 MiB reserved
Start of fourth iteration (three complete iterations done): 6814 MiB reserved

I opened #19, which explicitly deletes device-converted tensors at the end of each iteration, to let Python/PyTorch know the memory is free again.

I'm not sure if this has always been necessary in the context of iterations, but in general the following pattern is quite memory-inefficient:

import torch

foo = torch.zeros(512, 512, 512).to(0)
# At this point 512 MiB is in use on the GPU.

# This line allocates a further 512 MiB before the old tensor is released,
# so 1024 MiB is needed at the peak.
foo = torch.zeros(512, 512, 512).to(0)
# The previous tensor is now freed, so 512 MiB is in use again (PyTorch still
# keeps the "freed" 512 MiB as cache unless it is explicitly released).
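A sketch of the workaround the PR describes, i.e. dropping the device-side tensors at the end of each iteration so the caching allocator can reuse that memory instead of reserving a second copy; the loader and tensor names here are placeholders, not the repo's real variables:

import torch

loader = [(torch.randn(128, 7, 192, 192), torch.randn(128, 5, 2)) for _ in range(4)]

for birdview, location in loader:
    birdview = birdview.to('cuda')
    location = location.to('cuda')

    # ... forward pass, backward pass, and optimizer step would go here ...

    # Drop the last references so the allocator can reuse this memory in the
    # next iteration rather than holding both the old and the new batch.
    del birdview, location

In the pattern above, this corresponds to deleting foo before the second allocation, which keeps the peak at 512 MiB instead of 1024 MiB.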

Kin-Zhang commented

I also ran into this error:
[screenshot: CUDA out of memory RuntimeError]

The solution is what bradyz said:

> Can you test with a smaller batch size, like 32 or 16, and see if those OOM?

Use a smaller batch_size in train_birdview.py:

parser.add_argument('--batch_size', type=int, default=128)
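Assuming the same invocation as in the README above, the smaller batch size can also be passed on the command line without editing the default:

python train_birdview.py --dataset_dir=../data/sample --log_dir=../logs/sample --batch_size=128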

tomcur (Contributor) commented Jun 13, 2021

@Kin-Zhang If you apply this PR, you can run with a 2x larger batch size: #19.
