
v1.8.3: LoRA training on Windows fails with "Distributed package doesn't have NCCL built in" #439

Open
WalkerMe opened this issue May 29, 2024 · 0 comments
The full error log follows:

number of images (including repeats) / 各bucketの画像枚数(繰り返し回数を含む)
bucket 0: resolution (512, 768), count: 20
mean ar error (without repeats): 0.0
preparing accelerator
[W socket.cpp:697] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - The requested address is not valid in its context.).
Traceback (most recent call last):
  File "D:\AI\lora-scripts\sd-scripts\train_network.py", line 996, in <module>
    trainer.train(args)
  File "D:\AI\lora-scripts\sd-scripts\train_network.py", line 226, in train
    accelerator = train_util.prepare_accelerator(args)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\AI\lora-scripts\sd-scripts\library\train_util.py", line 3896, in prepare_accelerator
    accelerator = Accelerator(
                  ^^^^^^^^^^^^
  File "D:\AI\Python-Lora\Lib\site-packages\accelerate\accelerator.py", line 371, in __init__
    self.state = AcceleratorState(
                 ^^^^^^^^^^^^^^^^^
  File "D:\AI\Python-Lora\Lib\site-packages\accelerate\state.py", line 758, in __init__
    PartialState(cpu, **kwargs)
  File "D:\AI\Python-Lora\Lib\site-packages\accelerate\state.py", line 217, in __init__
    torch.distributed.init_process_group(backend=self.backend, **kwargs)
  File "D:\AI\Python-Lora\Lib\site-packages\torch\distributed\c10d_logger.py", line 86, in wrapper
    func_return = func(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^
  File "D:\AI\Python-Lora\Lib\site-packages\torch\distributed\distributed_c10d.py", line 1184, in init_process_group
    default_pg, _ = _new_process_group_helper(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\AI\Python-Lora\Lib\site-packages\torch\distributed\distributed_c10d.py", line 1302, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL built in")
RuntimeError: Distributed package doesn't have NCCL built in
[2024-05-29 22:28:05,467] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 19252) of binary: D:\AI\Python-Lora\python.exe
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "D:\AI\Python-Lora\Lib\site-packages\accelerate\commands\launch.py", line 1027, in <module>
    main()
  File "D:\AI\Python-Lora\Lib\site-packages\accelerate\commands\launch.py", line 1023, in main
    launch_command(args)
  File "D:\AI\Python-Lora\Lib\site-packages\accelerate\commands\launch.py", line 1008, in launch_command
    multi_gpu_launcher(args)
  File "D:\AI\Python-Lora\Lib\site-packages\accelerate\commands\launch.py", line 666, in multi_gpu_launcher
    distrib_run.run(args)
  File "D:\AI\Python-Lora\Lib\site-packages\torch\distributed\run.py", line 803, in run
    elastic_launch(
  File "D:\AI\Python-Lora\Lib\site-packages\torch\distributed\launcher\api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\AI\Python-Lora\Lib\site-packages\torch\distributed\launcher\api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
./sd-scripts/train_network.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-29_22:28:05
  host      : LAPTOP-KI4AR1BE
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 19252)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
22:28:08-904154 ERROR    Training failed / 训练失败

My environment is Windows 10 with Python 3.11.
From searching, I've seen that NCCL doesn't support Windows. Is there a parameter that controls this?
