GPU Memory question #21

Open
fangyuan-ksgk opened this issue Feb 29, 2024 · 1 comment
@fangyuan-ksgk

Hello! Thanks for the open-sourced code release.
I have been trying to run the fine-tuning with a phi-2 3B model on a 40GB A100 GPU. While running
accelerate launch spin/run_spin.py configs/config.yaml
I get GPU out-of-memory errors, which really confuses me, since I have set the batch size to 1 and the number of processes to 1. I cannot imagine what is consuming so much memory:

[INFO|trainer.py:571] 2024-02-29 14:04:29,359 >> Using auto half precision backend
[INFO|trainer.py:1721] 2024-02-29 14:04:32,728 >> ***** Running training *****
[INFO|trainer.py:1722] 2024-02-29 14:04:32,728 >> Num examples = 20
[INFO|trainer.py:1723] 2024-02-29 14:04:32,728 >> Num Epochs = 3
[INFO|trainer.py:1724] 2024-02-29 14:04:32,728 >> Instantaneous batch size per device = 1
[INFO|trainer.py:1727] 2024-02-29 14:04:32,728 >> Total train batch size (w. parallel, distributed & accumulation) = 1
[INFO|trainer.py:1728] 2024-02-29 14:04:32,728 >> Gradient Accumulation steps = 1
[INFO|trainer.py:1729] 2024-02-29 14:04:32,728 >> Total optimization steps = 60
[INFO|trainer.py:1730] 2024-02-29 14:04:32,729 >> Number of trainable parameters = 2,779,683,840
0% 0/60 [00:00<?, ?it/s]/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
[WARNING|modeling_utils.py:1126] 2024-02-29 14:04:34,067 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
Traceback (most recent call last):
File "/content/SPIN/spin/run_spin.py", line 206, in
main()
File "/content/SPIN/spin/run_spin.py", line 169, in main
train_result = spin_trainer.train()
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1539, in train
return inner_training_loop(
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1917, in _inner_training_loop
self.optimizer.step()
File "/usr/local/lib/python3.10/dist-packages/accelerate/optimizer.py", line 145, in step
self.optimizer.step(closure)
File "/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py", line 68, in wrapper
return wrapped(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 373, in wrapper
out = func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 76, in _use_grad
ret = func(self, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/optim/rmsprop.py", line 115, in step
self._init_group(group, params_with_grad, grads, square_avgs, momentum_buffer_list, grad_avgs)
File "/usr/local/lib/python3.10/dist-packages/torch/optim/rmsprop.py", line 72, in _init_group
state["square_avg"] = torch.zeros_like(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 100.00 MiB. GPU 0 has a total capacty of 39.56 GiB of which 64.81 MiB is free. Process 468145 has 39.49 GiB memory in use. Of the allocated memory 37.54 GiB is allocated by PyTorch, and 1.44 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
0% 0/60 [00:01<?, ?it/s]
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 986, in launch_command
simple_launcher(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 628, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'spin/run_spin.py', 'configs/config.yaml']' returned non-zero exit status 1.

@angelahzyuan
Collaborator

You might need to specify a DeepSpeed configuration. Check scripts/finetune.sh for the command. Let us know if you are still having problems.
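
For reference, scripts/finetune.sh launches run_spin.py through accelerate with a DeepSpeed config file rather than the bare command above. A minimal sketch of that kind of invocation follows (the config file name is an assumption, so check the script for the real paths and flags):

# sketch only; the DeepSpeed/ZeRO accelerate config name below is assumed
accelerate launch --config_file configs/deepspeed_zero3.yaml --num_processes=1 spin/run_spin.py configs/config.yaml

As a back-of-the-envelope check, full fine-tuning keeps the fp32 weights, gradients, and RMSprop square_avg state for all 2.78B parameters resident on the GPU, roughly 2.78B × 12 bytes ≈ 31 GiB before activations and buffers, so a single 40GB A100 can run out even at batch size 1 unless the parameters and optimizer state are offloaded or sharded via DeepSpeed.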
