A Strange Problem: RuntimeError: CUDA error: an illegal memory access was encountered #13

Open
Friest-a11y opened this issue Oct 19, 2023 · 2 comments

@Friest-a11y

Hi, thanks for your great work. I keep hitting a strange problem when I run the code on my own dataset.
I have already converted my dataset to the VSPW dataset format, but I get the following error that I can't solve:
  File "./tools/train.py", line 188, in <module>
    main()
  File "./tools/train.py", line 177, in main
    train_segmentor(
  File "/root/fuzhouquan/VSS-CFFM-main/mmseg/apis/train.py", line 115, in train_segmentor
    runner.run(data_loaders, cfg.workflow)
  File "/root/anaconda3/envs/cffm/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 131, in run
    iter_runner(iter_loaders[i], **kwargs)
  File "/root/anaconda3/envs/cffm/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 60, in train
    outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
  File "/root/anaconda3/envs/cffm/lib/python3.8/site-packages/mmcv/parallel/distributed.py", line 51, in train_step
    output = self.module.train_step(*inputs[0], **kwargs[0])
  File "/root/fuzhouquan/VSS-CFFM-main/mmseg/models/segmentors/base.py", line 160, in train_step
    print('loss:', losses)
  File "/root/anaconda3/envs/cffm/lib/python3.8/site-packages/torch/tensor.py", line 193, in __repr__
    return torch._tensor_str._str(self)
  File "/root/anaconda3/envs/cffm/lib/python3.8/site-packages/torch/_tensor_str.py", line 383, in _str
    return _str_intern(self)
  File "/root/anaconda3/envs/cffm/lib/python3.8/site-packages/torch/_tensor_str.py", line 358, in _str_intern
    tensor_str = _tensor_str(self, indent)
  File "/root/anaconda3/envs/cffm/lib/python3.8/site-packages/torch/_tensor_str.py", line 242, in _tensor_str
    formatter = _Formatter(get_summarized_data(self) if summarize else self)
  File "/root/anaconda3/envs/cffm/lib/python3.8/site-packages/torch/_tensor_str.py", line 90, in __init__
    nonzero_finite_vals = torch.masked_select(tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0))
RuntimeError: CUDA error: an illegal memory access was encountered

I am confident that GPU memory is not the problem, because the original VSPW dataset trains fine on my 4 A800s.
With my own dataset, however, something always seems to go wrong when the loss is computed.

When I enable your debug lines in ./mmseg/models/segmentors/base.py (lines 155 to 157):
    print(type(data_batch))
    print(data_batch.keys())
    print(data_batch['img'].shape, data_batch['gt_semantic_seg'].shape)  # torch.Size([1, 3, 3, 480, 480]) torch.Size([1, 3, 1, 480, 480])
With my dataset, it prints:
<class 'dict'>
dict_keys(['img_metas', 'img', 'gt_semantic_seg'])
torch.Size([4, 4, 3, 480, 480]) torch.Size([4, 4, 1, 480, 480])
<class 'dict'>
dict_keys(['img_metas', 'img', 'gt_semantic_seg'])
torch.Size([4, 4, 3, 480, 480]) torch.Size([4, 4, 1, 480, 480])
<class 'dict'>
dict_keys(['img_metas', 'img', 'gt_semantic_seg'])
torch.Size([4, 4, 3, 480, 480]) torch.Size([4, 4, 1, 480, 480])
<class 'dict'>
dict_keys(['img_metas', 'img', 'gt_semantic_seg'])
torch.Size([4, 4, 3, 480, 480]) torch.Size([4, 4, 1, 480, 480])

With the VSPW dataset, it prints:
<class 'dict'>
dict_keys(['img_metas', 'img', 'gt_semantic_seg'])
torch.Size([2, 4, 3, 480, 480]) torch.Size([2, 4, 1, 480, 480])
<class 'dict'>
dict_keys(['img_metas', 'img', 'gt_semantic_seg'])
torch.Size([2, 4, 3, 480, 480]) torch.Size([2, 4, 1, 480, 480])
<class 'dict'>
dict_keys(['img_metas', 'img', 'gt_semantic_seg'])
torch.Size([2, 4, 3, 480, 480]) torch.Size([2, 4, 1, 480, 480])
<class 'dict'>
dict_keys(['img_metas', 'img', 'gt_semantic_seg'])
torch.Size([2, 4, 3, 480, 480]) torch.Size([2, 4, 1, 480, 480])

The first dimension really is different between the two runs (4 with my dataset, 2 with VSPW). Why is that, and how can I solve it?

Thanks so much! I'd appreciate it if you could help me out.

@GuoleiSun
Owner

Thanks for your interest. I assume you have already run our code on the VSPW dataset and now want to use it on your own dataset. If not, please try the VSPW dataset first.

I have not encountered this issue before, but I think you may have forgotten to adjust some settings that are specific to your own dataset, such as the number of classes. I would suggest checking the dimensions of your tensors. Please let me know if you have further questions.
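
One way to follow up on the number-of-classes suggestion is to check whether the ground-truth masks contain label ids that the loss cannot index. CUDA errors are reported asynchronously, so the print('loss:', losses) call in the traceback is probably only where the failure surfaced, not where it happened. The sketch below is a minimal, hypothetical helper, not part of this repository: the name find_invalid_labels, its parameters, and the default ignore_index of 255 are all assumptions for illustration.

```python
from typing import Iterable, List


def find_invalid_labels(label_values: Iterable[int], num_classes: int,
                        ignore_index: int = 255) -> List[int]:
    """Return the sorted label ids that are neither valid classes nor ignore_index.

    Hypothetical helper: ignore_index=255 is an assumed default, so set it
    to whatever value your own config uses.
    """
    invalid = set()
    for value in label_values:
        value = int(value)
        if value == ignore_index:
            continue  # ignored pixels never index into the class dimension
        if value < 0 or value >= num_classes:
            invalid.add(value)  # this id would index outside the class range
    return sorted(invalid)


# Made-up example: a 3-class dataset whose masks also contain ids 3 and 7.
print(find_invalid_labels([0, 1, 2, 255, 3, 7, 7], num_classes=3))  # prints [3, 7]
```

Next to the existing debug prints, the input could be something like data_batch['gt_semantic_seg'].unique().tolist(), with num_classes taken from your config. A non-empty result would point at the label encoding rather than at GPU memory. Running with the environment variable CUDA_LAUNCH_BLOCKING=1 should also make the traceback stop at the operation that actually fails.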

@Friest-a11y
Author

Friest-a11y commented Feb 3, 2024 via email
