Distributed training doesn't work. #18

dimeldo · 2019-11-26T16:35:55Z

This happens at least with the XLNet model. With a high max_len, training crashes without printing any error. Training on 1 GPU works well. With a low max_len I get the error below. I'm using 4 Nvidia V100 GPUs.

Traceback (most recent call last):
  File "src/train.py", line 830, in <module>
    main()
  File "src/train.py", line 690, in main
    train_step(dummy_batch)
  File "src/train.py", line 566, in train_step
    loss, acc, ppl = forward_step(batch)
  File "src/train.py", line 556, in forward_step
    acc = reduce_tensor(acc)
  File "src/train.py", line 530, in reduce_tensor
    reduced = tensor.clone()
AttributeError: 'float' object has no attribute 'clone'
(The same traceback is printed by each of the 4 processes; the later copies are interleaved in the console output.)
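
The traceback shows that reduce_tensor receives a plain Python float for acc and then calls .clone() on it. A minimal sketch of a reduction helper that accepts both tensors and plain numbers is given below. It is not the repository's actual code or the fix in the referenced commit; the device parameter and the averaging behavior are assumptions made for illustration.

```python
import torch
import torch.distributed as dist


def reduce_tensor(value, device=None):
    """Average a scalar metric across all distributed workers.

    Accepts a torch.Tensor or a plain Python number, so a float metric
    no longer fails on .clone().
    `device` is a hypothetical parameter: with the NCCL backend the
    tensor has to live on the GPU of the current process.
    """
    if torch.is_tensor(value):
        reduced = value.detach().clone()
    else:
        # Wrap plain numbers (e.g. a float accuracy) in a tensor first.
        reduced = torch.tensor(float(value), device=device)

    # Single-process runs have nothing to reduce.
    if not (dist.is_available() and dist.is_initialized()):
        return reduced

    # Sum over all workers, then divide to get the mean.
    dist.all_reduce(reduced, op=dist.ReduceOp.SUM)
    return reduced / dist.get_world_size()
```

In a single-process run the helper returns the value unchanged as a tensor, so the same call works with and without torch.distributed.launch.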


dimeldo · 2019-12-13T11:43:38Z

Still not working... :(
Reproducible with python -m torch.distributed.launch --nproc_per_node=8 src/train.py --config src/configs/gpt2-dailydialog.json on an AWS p3dn.24xlarge (8 Volta V100 GPUs). The program just crashes... It works on 1 GPU, though.
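
For context on the launch command above: torch.distributed.launch starts one process per GPU and, in PyTorch versions of this period, passes a --local_rank argument to each of them. The sketch below shows one common way a script can consume that argument and bind each process to its own GPU. The function names are hypothetical and this is not taken from src/train.py; it only illustrates what could be checked when multi-GPU runs crash silently.

```python
import argparse

import torch
import torch.distributed as dist


def parse_local_rank(argv=None):
    """Read the --local_rank value that torch.distributed.launch passes in.

    Unknown arguments (such as --config) are ignored here so the sketch
    can sit next to a script's own argument parser.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    args, _ = parser.parse_known_args(argv)
    return args.local_rank


def init_distributed(local_rank):
    """Bind this process to one GPU and join the process group.

    Assumes the launcher has set MASTER_ADDR, MASTER_PORT, RANK and
    WORLD_SIZE in the environment, which torch.distributed.launch does.
    """
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl", init_method="env://")
```

If every process ends up on GPU 0 because the local rank is never applied, memory use on that device grows with the number of processes, which is one possible reason a run fails only in the multi-GPU setting.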

dimeldo · 2019-12-13T11:47:47Z

Using this environment: https://docs.nvidia.com/deeplearning/frameworks/tensorflow-release-notes/rel_19.10.html

Mrpatekful added a commit that referenced this issue Nov 26, 2019

#18 fix

de2dfa5
