System Info
GPU: A100-80G; CUDA version: 12.1; Python: 3.8; PyTorch: 2.2.1

Who can help?
@1049451037

Reproduction
Fine-tuning CogAgent on a single A100-80G, with the batch size reduced to 1 and disable_untrainable_params in the script modified to:
def disable_untrainable_params(self):
    total_trainable = 0
    enable = []
    # enable = ["encoder"]
    # enable = ["encoder", "cross_attention", "linear_proj", 'mlp.vision', 'rotary.vision', 'eoi', 'boi', 'vit']
    if self.args.use_ptuning:
        enable.extend(["ptuning"])
    if self.args.use_lora or self.args.use_qlora:
        pass
        enable.extend(["matrix_A", "matrix_B"])
    out_file = open("named_parameters.txt", "w", encoding="utf-8")
    for n, p in self.named_parameters():
        out_file.write("named_parameters: " + n)
        flag = False
        # Fine-tune only the language-model part
        if n.lower().startswith("transformer.layers"):
            flag = "matrix_" in n.lower()
        elif n.lower().startswith("mixins.rotary.vision_"):
            flag = True
        if not flag:
            p.requires_grad_(False)
        else:
            total_trainable += p.numel()
            if "encoder" in n or "vit" in n:
                p.lr_scale = 0.1
            print_rank0(n)
        out_file.write(" enable: " + str(flag))
        out_file.write("\n")
    out_file.close()
    print_rank0("***** Total trainable parameters: " + str(total_trainable) + " *****")
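As a side note on debugging the modified function above: the crash surfaces inside FusedAdam's multi_tensor_applier, which steps over flat lists of parameter, gradient, and optimizer-state tensors, so an unexpected device or dtype among the newly enabled parameters is one thing worth ruling out. A minimal sketch of such a pre-flight check, assuming access to the model object before training starts (the helper name audit_trainable_params is hypothetical, not from the repo):

# Hypothetical pre-flight check (not part of the original script): report the
# count, devices, and dtypes of every parameter left trainable, to rule out an
# unexpected tensor in the optimizer's parameter groups.
def audit_trainable_params(model):
    total, devices, dtypes = 0, set(), set()
    for name, param in model.named_parameters():
        if param.requires_grad:
            total += param.numel()
            devices.add(str(param.device))
            dtypes.add(str(param.dtype))
    print(f"trainable: {total} params, devices={devices}, dtypes={dtypes}")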
Running the fine-tuning script, GPU memory usage only reaches 72 GB, and then it fails with RuntimeError: CUDA error: an illegal memory access was encountered.

Error log:
[2024-04-07 10:55:05,908] [INFO] [checkpointing.py:539:forward] Activation Checkpointing Information
[2024-04-07 10:55:05,909] [INFO] [checkpointing.py:540:forward] ----Partition Activations False, CPU CHECKPOINTING False
[2024-04-07 10:55:05,909] [INFO] [checkpointing.py:541:forward] ----contiguous Memory Checkpointing False with 6 total layers
[2024-04-07 10:55:05,909] [INFO] [checkpointing.py:543:forward] ----Synchronization False
[2024-04-07 10:55:05,909] [INFO] [checkpointing.py:544:forward] ----Profiling time in checkpointing False
logits: torch.Size([1, 400, 32000])
Traceback (most recent call last):
  File "finetune_cogagent_demo.py", line 400, in <module>
    model = training_main(
  File "/root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/sat/training/deepspeed_training.py", line 150, in training_main
    iteration, skipped = train(model, optimizer,
  File "/root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/sat/training/deepspeed_training.py", line 349, in train
    lm_loss, skipped_iter, metrics = train_step(train_data_iterator,
  File "/root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/sat/training/deepspeed_training.py", line 482, in train_step
    model.step()
  File "/root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 2169, in step
    self._take_model_step(lr_kwargs)
  File "/root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 2075, in _take_model_step
    self.optimizer.step()
  File "/root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1898, in step
    self._optimizer_step(i)
  File "/root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1805, in _optimizer_step
    self.optimizer.step()
  File "/root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/torch/optim/optimizer.py", line 385, in wrapper
    out = func(*args, **kwargs)
  File "/root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/deepspeed/ops/adam/fused_adam.py", line 191, in step
    multi_tensor_applier(self.multi_tensor_adam, self._dummy_overflow_buf, [g_32, p_32, m_32, v_32],
  File "/root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/deepspeed/ops/adam/multi_tensor_apply.py", line 17, in __call__
    return op(self.chunk_size, noop_flag_buffer, tensor_lists, *args)
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[rank0]:[E ProcessGroupNCCL.cpp:1182] [Rank 0] NCCL watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /opt/conda/conda-bld/pytorch_1708025829503/work/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f44740f0d87 in /root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f44740a175f in /root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f44741c28a8 in /root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x6c (0x7f44752859ec in /root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f4475289b08 in /root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x15a (0x7f447528d23a in /root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f447528de79 in /root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdbbf4 (0x7f44d1e3ebf4 in /root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #8: <unknown function> + 0x7ea7 (0x7f44da7b7ea7 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x3f (0x7f44da588a6f in /lib/x86_64-linux-gnu/libc.so.6)
[2024-04-07 10:55:20,255] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 2708
[2024-04-07 10:55:20,255] [ERROR] [launch.py:322:sigkill_handler] ['/root/miniconda3/envs/cog-agent/bin/python', '-u', 'finetune_cogagent_demo.py', '--local_rank=0', '--experiment-name', 'finetune-cogagent-chat', '--model-parallel-size', '1', '--mode', 'finetune', '--train-iters', '2000', '--resume-dataloader', '--from_pretrained', '../sat_models/cogagent-chat', '--max_length', '400', '--lora_rank', '50', '--use_lora', '--local_tokenizer', '../pretrained_models/lmsys/vicuna-7b-v1.5', '--version', 'chat', '--train-data', './archive_split/train', '--valid-data', './archive_split/valid', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--vit_checkpoint_activations', '--save-interval', '200', '--eval-interval', '200', '--save', './checkpoints', '--eval-iters', '10', '--eval-batch-size', '1', '--split', '1.', '--deepspeed_config', 'test_config_bf16.json', '--skip-init', '--seed', '2023', '--batch-size', '1'] exits with return code = -6
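Because CUDA kernel launches are asynchronous, the traceback above may blame optimizer.step() even when the faulting kernel was launched earlier in the iteration. A standard PyTorch debugging step, independent of this repo, is to force synchronous launches so the traceback lands on the kernel that actually faulted:

# Set before torch initializes CUDA, e.g. at the very top of
# finetune_cogagent_demo.py. Synchronous launches are slow, so use this
# only while hunting the illegal memory access.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # import only after the environment variable is set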
Expected behavior
The error does not reproduce.
This looks similar to the situation in issue #124.
Isn't your situation that the CUDA environment isn't set up correctly? The CUDA toolkit needs to be installed.
CUDA is installed. If I reduce the set of fine-tuned parameters, for example tuning only ["matrix_A", "matrix_B"], the training script runs normally.
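For reference, a minimal sketch of the reduced configuration mentioned above, freezing everything except the LoRA matrices (an illustration of the reportedly working setup, not code from the repository):

# Sketch: keep only the LoRA matrices (matrix_A / matrix_B) trainable,
# matching the configuration that reportedly runs without the error.
def disable_untrainable_params(self):
    for name, param in self.named_parameters():
        param.requires_grad_("matrix_" in name.lower())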