Hi @nikitakaraevv,

Thank you for your excellent work.

I have a question regarding the training pipeline. I'm currently trying to reproduce the results in Table 3 of your paper. When I trained the model from scratch on the Kubric dataset, the best evaluation result on the TAP-Vid DAVIS dataset is as follows:
"occlusion_accuracy": 0.8503666396802487
"average_jaccard": 0.5575681919643163
"average_pts_within_thresh": 0.7087581437592014
These results are significantly lower than those obtained with your provided checkpoint. I'm using Torch 2.1.0 with CUDA 12.3, and I trained the model on 8 A100 GPUs for 200,000 iterations with gradient accumulation of 4 to mimic your setting.
Do you think the issue could be due to mismatched library versions, or might I be missing something else? I appreciate any guidance you can provide.
Thank you.
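For reference, the "gradient accumulation of 4" setup described above can be sketched in PyTorch roughly as follows. This is a minimal illustration with a generic model, optimizer, and loss, not the actual CoTracker training code; the function name and batch layout are my own:

```python
import torch

def train_with_accumulation(model, optimizer, loss_fn, batches, accum_steps=4):
    # Accumulate gradients over `accum_steps` micro-batches before each
    # optimizer step, so 8 GPUs with accumulation 4 approximate the
    # effective batch size of a 32-GPU run.
    optimizer.zero_grad()
    for i, (x, y) in enumerate(batches):
        # Scale each micro-batch loss so the summed gradients match the
        # gradient of the mean loss over the full effective batch.
        loss = loss_fn(model(x), y) / accum_steps
        loss.backward()
        if (i + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```

With a mean-reduced loss and equally sized micro-batches, one accumulated step is numerically equivalent (up to floating-point summation order) to a single step on the concatenated batch, which is why accumulation is commonly used to mimic a larger-GPU setup.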
Hi @ngoductuanlhp, I don't think there could be such a big gap due to mismatched library versions.
We either train it on 32 GPUs for 50k iterations or on 8 GPUs for 200k. I obtained similar performance with both settings, though 32 GPUs is slightly better. So, have you tried training the model on 8 GPUs for 200k iterations without gradient accumulation?
I haven't tried training the model for 200k iterations without gradient accumulation. But I did train the model for 50k iterations on 8 GPUs with the same learning rate of 0.0005, and the results are not good.
I used your evaluation script to evaluate on TAP-Vid DAVIS (first/strided) and on the Dynamic Replica validation set.