About reproducing the paper #86

Open
ngoductuanlhp opened this issue Jun 10, 2024 · 2 comments

Comments


ngoductuanlhp commented Jun 10, 2024

Hi @nikitakaraevv,

Thank you for your excellent work.

I have a question regarding the training pipeline: I'm currently trying to reproduce the results in Table 3 of your paper. When I train the model from scratch on the Kubric dataset, the best evaluation result on the TAP-Vid DAVIS dataset is:

- "occlusion_accuracy": 0.8503666396802487
- "average_jaccard": 0.5575681919643163
- "average_pts_within_thresh": 0.7087581437592014

These results are significantly lower than those obtained with your provided checkpoint. I'm using Torch 2.1.0 with CUDA 12.3, and I trained the model on 8 A100 GPUs for 200,000 iterations with gradient accumulation of 4 to mimic your setting.
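For reference, this is roughly how I mimic the larger effective batch size with gradient accumulation (a minimal sketch with placeholder `model`/`optimizer`/`batches` names, not the actual CoTracker training loop):

```python
import torch

def train_step(model, optimizer, batches, accum_steps=4):
    """One optimizer step accumulated over `accum_steps` micro-batches,
    approximating a batch `accum_steps` times larger per update."""
    optimizer.zero_grad()
    for x, y in batches[:accum_steps]:
        loss = torch.nn.functional.mse_loss(model(x), y)  # placeholder loss
        # Scale so the summed gradients equal the average over micro-batches
        (loss / accum_steps).backward()
    optimizer.step()
```

The loss scaling by `1 / accum_steps` keeps the gradient magnitude comparable to a single large batch, so the same learning rate can be reused.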

Do you think the issue could be due to mismatched library versions, or might I be missing something else? I appreciate any guidance you can provide.

Thank you.

@nikitakaraevv (Contributor)

Hi @ngoductuanlhp, I don't think mismatched library versions could cause such a big gap.

We train it either on 32 GPUs for 50k iterations or on 8 GPUs for 200k. I obtained similar performance with both settings, though 32 GPUs is slightly better. Have you tried training the model on 8 GPUs for 200k iterations without gradient accumulation?
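A quick sanity check that the two schedules process the same number of samples overall (assuming the same per-GPU batch size, which I use a placeholder value for here):

```python
# Both schedules: GPUs x iterations x per-GPU batch size
per_gpu_batch = 1  # placeholder assumption, not a value from the paper

samples_32gpu = 32 * 50_000 * per_gpu_batch   # 32 GPUs, 50k iterations
samples_8gpu = 8 * 200_000 * per_gpu_batch    # 8 GPUs, 200k iterations
print(samples_32gpu, samples_8gpu)  # equal total sample counts
```

The difference between the two settings is therefore only the per-step batch size (and correspondingly the number of optimizer steps), not the total data seen.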

Also, how do you evaluate the model?

@ngoductuanlhp (Author)

I haven't tried training for 200k iterations without gradient accumulation. But I did train the model for 50k iterations on 8 GPUs with the same learning rate of 0.0005, and the results were not good.

I use your evaluation script to evaluate on TAP-Vid DAVIS first/strided and on the Dynamic Replica validation set.
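In case it helps rule out an evaluation mismatch, this is my understanding of the "points within threshold" metric from the TAP-Vid benchmark (a hedged sketch of the metric definition, not the repository's evaluation code; array names are illustrative):

```python
import numpy as np

def pts_within_thresh(pred, gt, visible, thresholds=(1, 2, 4, 8, 16)):
    """Fraction of visible points whose predicted position lies within each
    pixel threshold of the ground truth, averaged over the thresholds.
    pred, gt: (N, 2) arrays of xy positions; visible: (N,) boolean mask."""
    dists = np.linalg.norm(pred - gt, axis=-1)
    fracs = [((dists < t) & visible).sum() / visible.sum() for t in thresholds]
    return float(np.mean(fracs))
```

If the reported "average_pts_within_thresh" is computed this way, a low value would point at position accuracy rather than occlusion prediction.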
