
RuntimeError with CUDA assertion failure when resuming model training from checkpoint #499

fancling opened this issue Mar 22, 2024 · 1 comment

@fancling
I encountered a RuntimeError with an internal assertion failure when trying to resume training of a custom model from a checkpoint:

RuntimeError: t == DeviceType::CUDA INTERNAL ASSERT FAILED at "../c10/cuda/impl/CUDAGuardImpl.h":24, please report a bug to PyTorch.

The error occurs inside the estimate_loss() function, which runs before actual training resumes on CUDA. It is triggered whenever the iteration number reaches the evaluation interval, i.e. a multiple of 2000.
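
For context, here is a minimal sketch of the loop shape I mean (assuming a nanoGPT-style training script; the names below are stand-ins rather than exact code from this repository):

```python
# Minimal sketch, assuming a nanoGPT-style training loop; not copied verbatim
# from the repository.
eval_interval = 2000      # evaluation runs every 2000 iterations
iter_num = 0              # restored from the checkpoint when resuming
best_val_loss = 1e9       # restored from the checkpoint when resuming
always_save_checkpoint = False

def estimate_loss():
    # placeholder: the real function evaluates the model on train/val batches
    return {"train": 0.0, "val": 0.0}

while iter_num < 6000:
    if iter_num % eval_interval == 0:
        losses = estimate_loss()          # the RuntimeError surfaces around here
        if losses["val"] < best_val_loss or always_save_checkpoint:
            best_val_loss = losses["val"]
            # ... save checkpoint ...
    # ... forward / backward / optimizer step ...
    iter_num += 1
```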

I am willing to assist in resolving this issue if I can be of any help.

@fancling (Author)
Update on the issue

After further investigation, I've identified the source of the assertion failure:
The comparison if losses["val"] < best_val_loss or always_save_checkpoint: fails because losses["val"] lives on the CPU, while best_val_loss is loaded from the checkpoint directly onto the CUDA device by checkpoint = torch.load(ckpt_path, map_location=device).
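
A minimal sketch of the workaround I'm using (ckpt_path below is a placeholder, and the names follow the training script): converting best_val_loss to a plain Python float right after loading the checkpoint keeps the later comparison independent of which device the loss tensor is on.

```python
import torch

device = "cuda"
ckpt_path = "out/ckpt.pt"  # placeholder path, not taken from the repository

# Load the checkpoint onto the training device as before ...
checkpoint = torch.load(ckpt_path, map_location=device)

# ... but keep best_val_loss as a plain Python float instead of a CUDA tensor,
# so comparing it against the CPU tensor returned by estimate_loss() no longer
# mixes devices. (Moving losses["val"] onto `device` before the comparison
# would work just as well.)
best_val_loss = float(checkpoint["best_val_loss"])
```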
