GPU Memory question #21

Open
fangyuan-ksgk opened this issue Feb 29, 2024 · 1 comment
@fangyuan-ksgk

Hello! Thanks for the open-sourced code release.
I have been trying to run the fine-tuning with a phi-2 3B model on a 40GB A100 GPU. While running
accelerate launch spin/run_spin.py configs/config.yaml
I get GPU out-of-memory errors, which really confuses me, since I have set the batch size to 1 and the number of processes to 1. I cannot imagine what is consuming so much memory:

[INFO|trainer.py:571] 2024-02-29 14:04:29,359 >> Using auto half precision backend
[INFO|trainer.py:1721] 2024-02-29 14:04:32,728 >> ***** Running training *****
[INFO|trainer.py:1722] 2024-02-29 14:04:32,728 >> Num examples = 20
[INFO|trainer.py:1723] 2024-02-29 14:04:32,728 >> Num Epochs = 3
[INFO|trainer.py:1724] 2024-02-29 14:04:32,728 >> Instantaneous batch size per device = 1
[INFO|trainer.py:1727] 2024-02-29 14:04:32,728 >> Total train batch size (w. parallel, distributed & accumulation) = 1
[INFO|trainer.py:1728] 2024-02-29 14:04:32,728 >> Gradient Accumulation steps = 1
[INFO|trainer.py:1729] 2024-02-29 14:04:32,728 >> Total optimization steps = 60
[INFO|trainer.py:1730] 2024-02-29 14:04:32,729 >> Number of trainable parameters = 2,779,683,840
0% 0/60 [00:00<?, ?it/s]/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
[WARNING|modeling_utils.py:1126] 2024-02-29 14:04:34,067 >> Could not estimate the number of tokens of the input, floating-point operations will not be computed
Traceback (most recent call last):
File "/content/SPIN/spin/run_spin.py", line 206, in
main()
File "/content/SPIN/spin/run_spin.py", line 169, in main
train_result = spin_trainer.train()
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1539, in train
return inner_training_loop(
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1917, in _inner_training_loop
self.optimizer.step()
File "/usr/local/lib/python3.10/dist-packages/accelerate/optimizer.py", line 145, in step
self.optimizer.step(closure)
File "/usr/local/lib/python3.10/dist-packages/torch/optim/lr_scheduler.py", line 68, in wrapper
return wrapped(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 373, in wrapper
out = func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py", line 76, in _use_grad
ret = func(self, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/optim/rmsprop.py", line 115, in step
self._init_group(group, params_with_grad, grads, square_avgs, momentum_buffer_list, grad_avgs)
File "/usr/local/lib/python3.10/dist-packages/torch/optim/rmsprop.py", line 72, in _init_group
state["square_avg"] = torch.zeros_like(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 100.00 MiB. GPU 0 has a total capacty of 39.56 GiB of which 64.81 MiB is free. Process 468145 has 39.49 GiB memory in use. Of the allocated memory 37.54 GiB is allocated by PyTorch, and 1.44 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
0% 0/60 [00:01<?, ?it/s]
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 986, in launch_command
simple_launcher(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 628, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'spin/run_spin.py', 'configs/config.yaml']' returned non-zero exit status 1.

@angelahzyuan
Collaborator

You might need to specify a DeepSpeed configuration. Check scripts/finetune.sh for the command. Let us know if you are still having problems.
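
For reference, scripts/finetune.sh launches run_spin.py through accelerate with a DeepSpeed config file rather than the bare command above. A minimal sketch of that kind of invocation follows (the config file name is an assumption, so check the script for the real paths and flags):

# sketch only; the DeepSpeed/ZeRO accelerate config name below is assumed
accelerate launch --config_file configs/deepspeed_zero3.yaml --num_processes=1 spin/run_spin.py configs/config.yaml

As a back-of-the-envelope check, full fine-tuning keeps the fp32 weights, gradients, and RMSprop square_avg state for all 2.78B parameters resident on the GPU, roughly 2.78B × 12 bytes ≈ 31 GiB before activations and buffers, so a single 40GB A100 can run out even at batch size 1 unless the parameters and optimizer state are offloaded or sharded via DeepSpeed.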
