
Execution error #4 (Open)

hxngiee opened this issue Jan 6, 2021 · 5 comments
hxngiee commented Jan 6, 2021

Thank you for your impressive work.

While running the SBERT-to-BigGAN, SBERT-to-BigBiGAN, and SBERT-to-AE (COCO) experiments, I received the following error:

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "translation.py", line 531, in
melk()
NameError: name 'melk' is not defined

I'd appreciate it if you could check.

rromb (Collaborator) commented Jan 7, 2021

Hi, thanks for checking out our code!
What you describe is most likely triggered by another error that occurs during the initialization of the script. Please check the full stack trace and if that doesn't help, post it here.
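
For context, here is a minimal, self-contained sketch of the pattern behind that secondary NameError; the names are illustrative rather than the exact translation.py code:

# The checkpoint helper is defined inside the try-block and called from the
# except-block. If setup raises before the helper is defined, the handler
# itself fails with NameError: name 'melk' is not defined, masking the
# original error.
try:
    raise RuntimeError("some setup error")  # stands in for the real failure during init

    def melk():
        print("dumping checkpoint before exiting")

except Exception:
    melk()   # NameError: 'melk' was never defined because setup failed first
    raise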

hxngiee (Author) commented Jan 8, 2021

Thanks for your reply.

Here is the full error I got.

Is this a CUDA-library-related error?

I'm currently using a GeForce RTX 2080 with CUDA 10.2.

2021-01-08 16:07:22.553627: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-01-08 16:07:22.553653: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Note: Conditioning network uses batch-normalization. Make sure to train with a sufficiently large batch size
Missing keys in state-dict: ['encoder.resnet.1.num_batches_tracked', 'encoder.resnet.4.0.bn1.num_batches_tracked', 'encoder.resnet.4.0.bn2.num_batches_tracked', 'encoder.resnet.4.0.bn3.num_batches_tracked', 'encoder.resnet.4.0.downsample.1.num_batches_tracked', 'encoder.resnet.4.1.bn1.num_batches_tracked', 'encoder.resnet.4.1.bn2.num_batches_tracked', 'encoder.resnet.4.1.bn3.num_batches_tracked', 'encoder.resnet.4.2.bn1.num_batches_tracked', 'encoder.resnet.4.2.bn2.num_batches_tracked', 'encoder.resnet.4.2.bn3.num_batches_tracked', 'encoder.resnet.5.0.bn1.num_batches_tracked', 'encoder.resnet.5.0.bn2.num_batches_tracked', 'encoder.resnet.5.0.bn3.num_batches_tracked', 'encoder.resnet.5.0.downsample.1.num_batches_tracked', 'encoder.resnet.5.1.bn1.num_batches_tracked', 'encoder.resnet.5.1.bn2.num_batches_tracked', 'encoder.resnet.5.1.bn3.num_batches_tracked', 'encoder.resnet.5.2.bn1.num_batches_tracked', 'encoder.resnet.5.2.bn2.num_batches_tracked', 'encoder.resnet.5.2.bn3.num_batches_tracked', 'encoder.resnet.5.3.bn1.num_batches_tracked', 'encoder.resnet.5.3.bn2.num_batches_tracked', 'encoder.resnet.5.3.bn3.num_batches_tracked', 'encoder.resnet.6.0.bn1.num_batches_tracked', 'encoder.resnet.6.0.bn2.num_batches_tracked', 'encoder.resnet.6.0.bn3.num_batches_tracked', 'encoder.resnet.6.0.downsample.1.num_batches_tracked', 'encoder.resnet.6.1.bn1.num_batches_tracked', 'encoder.resnet.6.1.bn2.num_batches_tracked', 'encoder.resnet.6.1.bn3.num_batches_tracked', 'encoder.resnet.6.2.bn1.num_batches_tracked', 'encoder.resnet.6.2.bn2.num_batches_tracked', 'encoder.resnet.6.2.bn3.num_batches_tracked', 'encoder.resnet.6.3.bn1.num_batches_tracked', 'encoder.resnet.6.3.bn2.num_batches_tracked', 'encoder.resnet.6.3.bn3.num_batches_tracked', 'encoder.resnet.6.4.bn1.num_batches_tracked', 'encoder.resnet.6.4.bn2.num_batches_tracked', 'encoder.resnet.6.4.bn3.num_batches_tracked', 'encoder.resnet.6.5.bn1.num_batches_tracked', 'encoder.resnet.6.5.bn2.num_batches_tracked', 'encoder.resnet.6.5.bn3.num_batches_tracked', 'encoder.resnet.6.6.bn1.num_batches_tracked', 'encoder.resnet.6.6.bn2.num_batches_tracked', 'encoder.resnet.6.6.bn3.num_batches_tracked', 'encoder.resnet.6.7.bn1.num_batches_tracked', 'encoder.resnet.6.7.bn2.num_batches_tracked', 'encoder.resnet.6.7.bn3.num_batches_tracked', 'encoder.resnet.6.8.bn1.num_batches_tracked', 'encoder.resnet.6.8.bn2.num_batches_tracked', 'encoder.resnet.6.8.bn3.num_batches_tracked', 'encoder.resnet.6.9.bn1.num_batches_tracked', 'encoder.resnet.6.9.bn2.num_batches_tracked', 'encoder.resnet.6.9.bn3.num_batches_tracked', 'encoder.resnet.6.10.bn1.num_batches_tracked', 'encoder.resnet.6.10.bn2.num_batches_tracked', 'encoder.resnet.6.10.bn3.num_batches_tracked', 'encoder.resnet.6.11.bn1.num_batches_tracked', 'encoder.resnet.6.11.bn2.num_batches_tracked', 'encoder.resnet.6.11.bn3.num_batches_tracked', 'encoder.resnet.6.12.bn1.num_batches_tracked', 'encoder.resnet.6.12.bn2.num_batches_tracked', 'encoder.resnet.6.12.bn3.num_batches_tracked', 'encoder.resnet.6.13.bn1.num_batches_tracked', 'encoder.resnet.6.13.bn2.num_batches_tracked', 'encoder.resnet.6.13.bn3.num_batches_tracked', 'encoder.resnet.6.14.bn1.num_batches_tracked', 'encoder.resnet.6.14.bn2.num_batches_tracked', 'encoder.resnet.6.14.bn3.num_batches_tracked', 'encoder.resnet.6.15.bn1.num_batches_tracked', 'encoder.resnet.6.15.bn2.num_batches_tracked', 'encoder.resnet.6.15.bn3.num_batches_tracked', 'encoder.resnet.6.16.bn1.num_batches_tracked', 'encoder.resnet.6.16.bn2.num_batches_tracked', 
'encoder.resnet.6.16.bn3.num_batches_tracked', 'encoder.resnet.6.17.bn1.num_batches_tracked', 'encoder.resnet.6.17.bn2.num_batches_tracked', 'encoder.resnet.6.17.bn3.num_batches_tracked', 'encoder.resnet.6.18.bn1.num_batches_tracked', 'encoder.resnet.6.18.bn2.num_batches_tracked', 'encoder.resnet.6.18.bn3.num_batches_tracked', 'encoder.resnet.6.19.bn1.num_batches_tracked', 'encoder.resnet.6.19.bn2.num_batches_tracked', 'encoder.resnet.6.19.bn3.num_batches_tracked', 'encoder.resnet.6.20.bn1.num_batches_tracked', 'encoder.resnet.6.20.bn2.num_batches_tracked', 'encoder.resnet.6.20.bn3.num_batches_tracked', 'encoder.resnet.6.21.bn1.num_batches_tracked', 'encoder.resnet.6.21.bn2.num_batches_tracked', 'encoder.resnet.6.21.bn3.num_batches_tracked', 'encoder.resnet.6.22.bn1.num_batches_tracked', 'encoder.resnet.6.22.bn2.num_batches_tracked', 'encoder.resnet.6.22.bn3.num_batches_tracked', 'encoder.resnet.7.0.bn1.num_batches_tracked', 'encoder.resnet.7.0.bn2.num_batches_tracked', 'encoder.resnet.7.0.bn3.num_batches_tracked', 'encoder.resnet.7.0.downsample.1.num_batches_tracked', 'encoder.resnet.7.1.bn1.num_batches_tracked', 'encoder.resnet.7.1.bn2.num_batches_tracked', 'encoder.resnet.7.1.bn3.num_batches_tracked', 'encoder.resnet.7.2.bn1.num_batches_tracked', 'encoder.resnet.7.2.bn2.num_batches_tracked', 'encoder.resnet.7.2.bn3.num_batches_tracked']
Traceback (most recent call last):
  File "translation.py", line 455, in <module>
    trainer_kwargs["checkpoint_callback"] = instantiate_from_config(modelckpt_cfg)
  File "translation.py", line 110, in instantiate_from_config
    return get_obj_from_str(config["target"])(**config.get("params", dict()))
  File "/home/hongiee/anaconda3/envs/Net2Net/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 190, in __init__
    self.__validate_init_configuration()
  File "/home/hongiee/anaconda3/envs/Net2Net/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 261, in __validate_init_configuration
    raise MisconfigurationException(
pytorch_lightning.utilities.exceptions.MisconfigurationException: ModelCheckpoint(save_top_k=3, monitor=None) is not a valid configuration. No quantity for top_k to track.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "translation.py", line 531, in <module>
    melk()
NameError: name 'melk' is not defined

rromb (Collaborator) commented Jan 11, 2021

Which version of pytorch-lightning are you using? This code still uses pl=0.9 and is not compatible with lightning versions >= 1.0.
Additionally, you can try to set save_top_k = 0, see

default_modelckpt_cfg = {
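
For reference, a minimal sketch of the two ways to satisfy that check in newer lightning versions; the metric name "val/loss" is only a placeholder, not necessarily what translation.py logs:

from pytorch_lightning.callbacks import ModelCheckpoint

# Newer lightning versions reject save_top_k > 0 without a monitored quantity,
# so either name a logged metric to rank checkpoints by...
checkpoint_cb = ModelCheckpoint(monitor="val/loss", save_top_k=3)
# ...or disable top-k tracking entirely, as suggested above:
checkpoint_cb = ModelCheckpoint(save_top_k=0)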

hxngiee (Author) commented Jan 14, 2021

Thanks,

I solved the problem by changing the pytorch-lightning version,

but there was another error while running SBERT-to-AE:

Traceback (most recent call last):
  File "translation.py", line 521, in <module>
    trainer.fit(model, data)
  File "/home/ubuntu/anaconda3/envs/net2net/lib/python3.7/site-packages/pytorch_lightning/trainer/states.py",                                                line 48, in wrapped_fn
    result = fn(self, *args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/net2net/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py",                                                line 1058, in fit
    results = self.accelerator_backend.spawn_ddp_children(model)
  File "/home/ubuntu/anaconda3/envs/net2net/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_bac                                               kend.py", line 123, in spawn_ddp_children
    results = self.ddp_train(local_rank, mp_queue=None, model=model, is_master=True)
  File "/home/ubuntu/anaconda3/envs/net2net/lib/python3.7/site-packages/pytorch_lightning/accelerators/ddp_bac                                               kend.py", line 224, in ddp_train
    results = self.trainer.run_pretrain_routine(model)
  File "/home/ubuntu/anaconda3/envs/net2net/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py",                                                line 1224, in run_pretrain_routine
    self._run_sanity_check(ref_model, model)
  File "/home/ubuntu/anaconda3/envs/net2net/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py",                                                line 1257, in _run_sanity_check
    eval_results = self._evaluate(model, self.val_dataloaders, max_batches, False)
  File "/home/ubuntu/anaconda3/envs/net2net/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_l                                               oop.py", line 333, in _evaluate
    output = self.evaluation_forward(model, batch, batch_idx, dataloader_idx, test_mode)
  File "/home/ubuntu/anaconda3/envs/net2net/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_l                                               oop.py", line 661, in evaluation_forward
    output = model(*args)
  File "/home/ubuntu/anaconda3/envs/net2net/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547,                                                in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/anaconda3/envs/net2net/lib/python3.7/site-packages/pytorch_lightning/overrides/data_paral                                               lel.py", line 174, in forward
    output = self.module.validation_step(*inputs[0], **kwargs[0])
  File "/home/ubuntu/hongiee/net2net/net2net/models/flows/flow.py", line 194, in validation_step
    loss, log_dict = self.shared_step(batch, batch_idx, split="val")
  File "/home/ubuntu/hongiee/net2net/net2net/models/flows/flow.py", line 181, in shared_step
    x = self.get_input(self.first_stage_key, batch)
  File "/home/ubuntu/hongiee/net2net/net2net/models/flows/flow.py", line 174, in get_input
    x = x.permute(0, 3, 1, 2).to(memory_format=torch.contiguous_format)
TypeError: to() received an invalid combination of arguments - got (memory_format=torch.memory_format, ), but                                                expected one of:
 * (torch.device device, torch.dtype dtype, bool non_blocking, bool copy)
 * (torch.dtype dtype, bool non_blocking, bool copy)
 * (Tensor tensor, bool non_blocking, bool copy)
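
For anyone hitting the same TypeError: the memory_format keyword of Tensor.to() only exists in more recent PyTorch releases, so an older torch install fails on that line. A minimal sketch of an equivalent call that works without the keyword (the tensor is a stand-in for the real batch):

import torch

x = torch.randn(4, 64, 64, 3)           # dummy NHWC image batch
x = x.permute(0, 3, 1, 2).contiguous()  # NCHW and contiguous, without the memory_format kwarg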

liting1018 commented
(quotes hxngiee's previous comment above, including the same TypeError traceback)

I also encountered this problem. Which version of pytorch-lightning are you using? Looking forward to your reply.
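
A quick way to check the installed versions against the pl=0.9 recommendation above:

import pytorch_lightning as pl
import torch

# The thread above suggests this code targets pytorch-lightning 0.9.x.
print("pytorch-lightning:", pl.__version__)
print("torch:", torch.__version__)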
