Distributed training doesn't work. #18

Open
dimeldo opened this issue Nov 26, 2019 · 2 comments

dimeldo commented Nov 26, 2019

This happens at least with the xlnet model. With a high max_len, training crashes without printing any error. With a low max_len, I get the error below. Training on 1 GPU works well. I'm using 4 Nvidia V100 GPUs.

Traceback (most recent call last):
  File "src/train.py", line 830, in <module>
    main()
  File "src/train.py", line 690, in main
    train_step(dummy_batch)
  File "src/train.py", line 566, in train_step
    loss, acc, ppl = forward_step(batch)
  File "src/train.py", line 556, in forward_step
    acc = reduce_tensor(acc)
  File "src/train.py", line 530, in reduce_tensor
    reduced = tensor.clone()
AttributeError: 'float' object has no attribute 'clone'
(The same traceback is printed by each of the 4 processes, with the copies interleaved in the output.)
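
The traceback shows forward_step passing a plain Python float (acc) to reduce_tensor, which then calls .clone() on it. A minimal sketch of a more tolerant helper follows. The real body of reduce_tensor in src/train.py is not shown in this thread, so averaging across processes and the device parameter are assumptions for illustration only.

```python
import torch
import torch.distributed as dist


def reduce_tensor(value, device=None):
    """Average a scalar metric across processes (hypothetical rewrite).

    Accepts a tensor or a plain Python number. With the NCCL backend the
    tensor has to live on the GPU of the current process, so pass that
    device when calling this from a multi-GPU run (assumption).
    """
    if not torch.is_tensor(value):
        # Wrap floats/ints so that .clone() and all_reduce are available.
        value = torch.tensor(float(value), device=device)

    reduced = value.detach().clone()

    # Only communicate when a process group actually exists, so the same
    # code path also works in a single-GPU run.
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(reduced, op=dist.ReduceOp.SUM)
        reduced = reduced / dist.get_world_size()

    return reduced
```

In a single process, reduce_tensor(0.5) simply returns a tensor holding 0.5. Call .item() on the result if the training loop expects a float afterwards.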
Mrpatekful added a commit that referenced this issue Nov 26, 2019

dimeldo commented Dec 13, 2019

Still not working... :(
Reproducible on AWS p3dn.24xlarge (8 Volta V100 GPUs) with:
python -m torch.distributed.launch --nproc_per_node=8 src/train.py --config src/configs/gpt2-dailydialog.json
The program just crashes without printing an error. It works on 1 GPU, though.
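
Since the crash leaves no error message, one way to get more information is to turn on low-level diagnostics at the very top of the training script, before the process group is created. The sketch below is only a debugging aid, not a fix. The helper name enable_crash_diagnostics is hypothetical, and it assumes the launcher exports RANK into each process's environment.

```python
import faulthandler
import os


def enable_crash_diagnostics(env=None):
    """Turn on extra diagnostics for silent crashes (hypothetical helper).

    Returns a short label identifying the current process for log lines.
    """
    env = os.environ if env is None else env

    # Dump Python tracebacks of all threads on fatal signals (e.g. SIGSEGV).
    faulthandler.enable(all_threads=True)

    # Ask NCCL to log its setup and errors. This has to be set before the
    # process group is initialised. An existing value is left untouched.
    env.setdefault("NCCL_DEBUG", "INFO")

    # RANK is assumed to be provided by the launcher; fall back otherwise.
    return "rank " + env.get("RANK", "unknown")
```

Calling enable_crash_diagnostics() as the first statement in main() and prefixing log lines with the returned label makes it easier to tell which of the 8 processes dies first.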


dimeldo commented Dec 13, 2019
