CUDA error: an illegal memory access was encountered #443

Open
1 of 2 tasks
zzh-www opened this issue Apr 7, 2024 · 3 comments

zzh-www commented Apr 7, 2024

System Info

GPU: A100-80G, CUDA: 12.1, Python: 3.8, PyTorch: 2.2.1

Who can help?

@1049451037

Information

  • The official example scripts
  • My own modified scripts

Reproduction

Fine-tuning CogAgent on a single A100-80G: I set the batch size to 1 and changed disable_untrainable_params in the script to:

def disable_untrainable_params(self):
    total_trainable = 0
    enable = []
    # enable = ["encoder"]
    # enable = ["encoder", "cross_attention", "linear_proj", 'mlp.vision', 'rotary.vision', 'eoi', 'boi', 'vit']
    if self.args.use_ptuning:
        enable.extend(["ptuning"])
    if self.args.use_lora or self.args.use_qlora:
        enable.extend(["matrix_A", "matrix_B"])
    out_file = open("named_parameters.txt", "w", encoding="utf-8")
    for n, p in self.named_parameters():
        out_file.write("named_parameters: " + n)
        flag = False
        # fine-tune only the language-model part
        if n.lower().startswith("transformer.layers"):
            flag = "matrix_" in n.lower()
        elif n.lower().startswith("mixins.rotary.vision_"):
            flag = True
        if not flag:
            p.requires_grad_(False)
        else:
            total_trainable += p.numel()
            if "encoder" in n or "vit" in n:
                p.lr_scale = 0.1
            print_rank0(n)
        out_file.write(" enable: " + str(flag))
        out_file.write("\n")
    out_file.close()
    print_rank0("***** Total trainable parameters: " + str(total_trainable) + " *****")
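The freezing rules above can be checked without a GPU by extracting them into a pure function. A minimal stand-alone sketch of the same prefix/substring matching (the parameter names below are hypothetical, for illustration only):

```python
def is_trainable(name: str) -> bool:
    """Mirror the filtering in disable_untrainable_params:
    - transformer.layers.* is trainable only for LoRA matrices (matrix_*)
    - mixins.rotary.vision_* is always trainable
    - everything else stays frozen
    """
    n = name.lower()
    if n.startswith("transformer.layers"):
        return "matrix_" in n
    if n.startswith("mixins.rotary.vision_"):
        return True
    return False

# Hypothetical parameter names, for illustration only
names = [
    "transformer.layers.0.attention.matrix_A",
    "transformer.layers.0.attention.dense.weight",
    "mixins.rotary.vision_rotary_emb.inv_freq",
    "transformer.word_embeddings.weight",
]
print([n for n in names if is_trainable(n)])
```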

When I run the fine-tuning script, GPU memory usage only reaches 72 GB, and then it fails with RuntimeError: CUDA error: an illegal memory access was encountered.

Error log:

[2024-04-07 10:55:05,908] [INFO] [checkpointing.py:539:forward] Activation Checkpointing Information
[2024-04-07 10:55:05,909] [INFO] [checkpointing.py:540:forward] ----Partition Activations False, CPU CHECKPOINTING False
[2024-04-07 10:55:05,909] [INFO] [checkpointing.py:541:forward] ----contiguous Memory Checkpointing False with 6 total layers
[2024-04-07 10:55:05,909] [INFO] [checkpointing.py:543:forward] ----Synchronization False
[2024-04-07 10:55:05,909] [INFO] [checkpointing.py:544:forward] ----Profiling time in checkpointing False
logits:  torch.Size([1, 400, 32000])
Traceback (most recent call last):
  File "finetune_cogagent_demo.py", line 400, in <module>
    model = training_main(
  File "/root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/sat/training/deepspeed_training.py", line 150, in training_main
    iteration, skipped = train(model, optimizer,
  File "/root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/sat/training/deepspeed_training.py", line 349, in train
    lm_loss, skipped_iter, metrics = train_step(train_data_iterator,
  File "/root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/sat/training/deepspeed_training.py", line 482, in train_step
    model.step()
  File "/root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 2169, in step
    self._take_model_step(lr_kwargs)
  File "/root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 2075, in _take_model_step
    self.optimizer.step()
  File "/root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1898, in step
    self._optimizer_step(i)
  File "/root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1805, in _optimizer_step
    self.optimizer.step()
  File "/root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/torch/optim/optimizer.py", line 385, in wrapper
    out = func(*args, **kwargs)
  File "/root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/deepspeed/ops/adam/fused_adam.py", line 191, in step
    multi_tensor_applier(self.multi_tensor_adam, self._dummy_overflow_buf, [g_32, p_32, m_32, v_32],
  File "/root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/deepspeed/ops/adam/multi_tensor_apply.py", line 17, in __call__
    return op(self.chunk_size, noop_flag_buffer, tensor_lists, *args)
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

[rank0]:[E ProcessGroupNCCL.cpp:1182] [Rank 0] NCCL watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /opt/conda/conda-bld/pytorch_1708025829503/work/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f44740f0d87 in /root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f44740a175f in /root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f44741c28a8 in /root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x6c (0x7f44752859ec in /root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f4475289b08 in /root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x15a (0x7f447528d23a in /root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f447528de79 in /root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdbbf4 (0x7f44d1e3ebf4 in /root/miniconda3/envs/cog-agent/lib/python3.8/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #8: <unknown function> + 0x7ea7 (0x7f44da7b7ea7 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x3f (0x7f44da588a6f in /lib/x86_64-linux-gnu/libc.so.6)

[2024-04-07 10:55:20,255] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 2708
[2024-04-07 10:55:20,255] [ERROR] [launch.py:322:sigkill_handler] ['/root/miniconda3/envs/cog-agent/bin/python', '-u', 'finetune_cogagent_demo.py', '--local_rank=0', '--experiment-name', 'finetune-cogagent-chat', '--model-parallel-size', '1', '--mode', 'finetune', '--train-iters', '2000', '--resume-dataloader', '--from_pretrained', '../sat_models/cogagent-chat', '--max_length', '400', '--lora_rank', '50', '--use_lora', '--local_tokenizer', '../pretrained_models/lmsys/vicuna-7b-v1.5', '--version', 'chat', '--train-data', './archive_split/train', '--valid-data', './archive_split/valid', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--vit_checkpoint_activations', '--save-interval', '200', '--eval-interval', '200', '--save', './checkpoints', '--eval-iters', '10', '--eval-batch-size', '1', '--split', '1.', '--deepspeed_config', 'test_config_bf16.json', '--skip-init', '--seed', '2023', '--batch-size', '1'] exits with return code = -6
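One caveat when reading the traceback above: CUDA reports illegal memory accesses asynchronously, so the frame it points at (here FusedAdam's multi_tensor_applier) is not necessarily where the fault happened. A debugging sketch, assuming the variable is set before any CUDA context is created:

```python
# Force synchronous kernel launches so the error surfaces at the
# launch site of the faulting kernel. Must be set before torch
# initializes CUDA; training runs noticeably slower, debugging only.
import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
# import torch  # import torch only after the variable is set
print(os.environ["CUDA_LAUNCH_BLOCKING"])
```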

Expected behavior

The error should no longer occur.


zzh-www commented Apr 7, 2024

This looks similar to issue #124.

@zRzRzRzRzRzRzR zRzRzRzRzRzRzR self-assigned this Apr 8, 2024
@zRzRzRzRzRzRzR (Collaborator)

This looks like CUDA is not installed/configured correctly on your machine; the CUDA toolkit needs to be installed.


zzh-www commented Apr 8, 2024

> This looks like CUDA is not installed/configured correctly on your machine; the CUDA toolkit needs to be installed.

CUDA is installed. If I shrink the set of fine-tuned parameters, e.g. tune only ["matrix_A", "matrix_B"], the training script runs fine.
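Since the script works when only the LoRA matrices are trainable, one measurable difference between the two configurations is the optimizer state the larger trainable set adds. A rough sizing sketch (all parameter sizes hypothetical, assuming bf16 training where ZeRO stage 1/2 FusedAdam keeps three FP32 tensors per trainable element: master weight, exp_avg, exp_avg_sq):

```python
# Rough sketch: optimizer-state memory implied by a set of trainable
# parameters, at 3 FP32 states x 4 bytes = 12 bytes per element.
def optimizer_state_gib(numels, fp32_states=3, bytes_per_elem=4):
    return sum(numels) * fp32_states * bytes_per_elem / 2**30

# Hypothetical sizes: rank-50 LoRA pairs across 32 layers,
# versus the same plus ~400M extra unfrozen elements.
lora_only = [4096 * 50, 50 * 4096] * 32
with_extra = lora_only + [400_000_000]
print(round(optimizer_state_gib(lora_only), 3))
print(round(optimizer_state_gib(with_extra), 3))
```

Comparing the two totals gives a quick sanity check on whether the enlarged trainable set plausibly pushes the run against the 80 GB limit.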
