Pairwise postprocessing validation is very slow - possible optimizations #599
Hey, @korotaS, thank you for your interest in OML. A general note on post-processing: it's expected that validation takes longer than a training "epoch": […]
By the way, what is the dataset size and what task are you solving?
Agree - even without the post-processing context it would be a good optimization. But are you really sure that we spent 26h out of 63h on this operation? I'm very skeptical... You can make this optimization locally first and check whether the numbers really look like this. Anyway, it could be a good contribution, if you would like to contribute.
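For context, the optimization in question is caching the query/gallery id lists, which never change during training. A minimal sketch of the idea, using a hypothetical stand-in class rather than OML's actual `ImageQueryGalleryLabeledDataset`:

```python
from functools import cached_property


class QueryGalleryIdsSketch:
    """Hypothetical stand-in for a query/gallery dataset: the id lists are
    computed once on first access and then reused, instead of being rebuilt
    on every call to get_query_ids / get_gallery_ids."""

    def __init__(self, is_query_flags):
        # one bool per sample: True -> query, False -> gallery
        self.is_query_flags = is_query_flags

    @cached_property
    def query_ids(self):
        return [i for i, is_q in enumerate(self.is_query_flags) if is_q]

    @cached_property
    def gallery_ids(self):
        return [i for i, is_q in enumerate(self.is_query_flags) if not is_q]
```

Since membership never changes during a run, `cached_property` pays the scan cost once; every later lookup is just an attribute read.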
As for the second point: you are right, there is duplicated image loading when using Lightning. Basically, we need the first loader (built on […]). I would say there are a few root problems.
If you take a look at the pure Python examples, you will see that there is no duplicated image reading in the validation. (Despite the fact that we create […].) See the next comment for the possible solution.
Ideally, we should use […]. The update is needed in order not to lose the visualisation functionality which happens in validation. (When we used the image dataset we had […].)
If you wish, you can contribute. The plan is: […]
Thanks for the quick reply!
I have a trained embedder with cmc@1 at about 0.96, and I want to train a reranker on "hard" items to further improve the metrics. This train run was kind of a PoC (I didn't pick "hard" items; I used all of them just to see the training process); the query size was about 37k and the gallery size was about 110k.
I am not 100% sure, but I tend to believe the numbers that […] reports.
@korotaS got it, what is the domain? retail?
Yep, it should be quick and informative. See my other comments above, and have a nice day :)
The domain is indeed retail. In fact, we were in the same meeting with Epoch8 last Tuesday, and we discussed this very problem after your presentation of the new features of OML 😁
I thought so, nice to meet you here! Don't hesitate to join us, especially if you are going to contribute: https://t.me/+lqsKu2af8xcyMjEy
As for your proposed solution about the embeddings dataset - maybe I misunderstood something, but I think that it won't work that way. If we create an […]. We can use […].
Oh, you are right. So, the problem is that we want to use the same dataset, first, to deliver embeddings and, second, to be a base for […]
Your solution significantly changes the signatures, but we have another easy-to-start option. Our datasets support caching on reading images (so we cache raw bytes). Unfortunately, the cache is not parsed in the postprocessing config the way it is in the main training config; I can easily add it in a few lines of code. It's not a full solution (we still need some time to decode and stack images), but it should speed up the process a lot. PS: I implemented and merged the proposed changes here: #604. I will publish them to PyPI soon (after tests are green). Could you please check how much it helps?
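The bytes-level cache mentioned here can be sketched as follows (a hypothetical class, not OML's implementation; note that only the raw read is cached, so decoding and stacking still run every epoch):

```python
from functools import lru_cache


class BytesCachedReaderSketch:
    """Hypothetical reader that caches raw file bytes per path in an LRU cache."""

    def __init__(self, cache_size=100_000):
        self.disk_reads = 0  # counts real reads, just for illustration
        self._cached_read = lru_cache(maxsize=cache_size)(self._read_bytes)

    def _read_bytes(self, path):
        self.disk_reads += 1
        # stand-in for: open(path, "rb").read()
        return path.encode("utf-8")

    def read(self, path):
        raw = self._cached_read(path)
        # a real dataset would still decode + transform `raw` here, every call
        return raw
```

Repeated epochs over the same paths then hit the cache instead of the disk, as long as the cache is large enough to hold the whole dataset.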
As for the first issue, the solution is already in the main branch and will be released to PyPI soon. Thank you for the report!
I've fetched your updates and set […]. However, I looked in the profiling logs and I don't see […]
Correct me if I'm wrong, but a Least Recently Used cache will not help you even if its size is just a bit smaller than the dataset size. If your dataset's size is 1000 and the cache size is 999, then when you start the second epoch (which you do without shuffling), the very first element is the oldest one and has already been evicted from the cache. Our situation is a bit more complicated, but, anyway, could you try to set the cache size bigger than the dataset size?
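The eviction behaviour described here is easy to reproduce with Python's `functools.lru_cache` (a self-contained demo, not OML code): with a cache one slot smaller than the dataset and sequential, unshuffled access, every request in the second pass misses.

```python
from functools import lru_cache

N = 1000  # dataset size; the cache is one slot smaller

@lru_cache(maxsize=N - 1)
def load_image(idx):
    return idx  # stand-in for an expensive disk read + decode

# Two sequential passes without shuffling. By the time index 0 is requested
# again, it is the least recently used entry and has already been evicted,
# so the cache never produces a single hit.
for _ in range(2):
    for i in range(N):
        load_image(i)

print(load_image.cache_info())  # hits=0, misses=2000
```

So a cache that is even one element too small degrades to zero hits under this access pattern, which is why the cache size needs to exceed the dataset size.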
Anyway, @korotaS, what percentage of speedup are we talking about if we avoid loading images the second time? (You can replace […].)
When I run validation as is, the embeddings accumulation part (before […]) […]. I will now try increasing cache_size to be bigger than the val dataset size (if the val dataset size is 100, then in embeddings accumulation 100 images will be loaded, and then in […]).
@korotaS got it! (The profit from that wrong ids handling was much higher.) Yep, in that case a cache size of 101 seems enough.
@korotaS, what should we do with this issue? Have you tried a bigger cache size to speed up the process?
Agree, sounds like too many changes for 7%.
Hi! I've been trying to train a Reranking pairwise model (using this guide and OML version 3.1.0), and it seems to train OK, but the validation takes too much time: one epoch of training takes about 5 minutes, and one validation cycle takes about 4 hours (the dataset is pretty big). I've used `cProfile` to profile a part of training, and here are the top 100 slowest functions: train_rerank_cprofile.txt. The whole training took about 63 hours (229k seconds); validation runs every 10 epochs. There are 2 things that I've noticed, and they definitely can be optimized.

First - look at the line `...48866.403...oml/datasets/images.py:281(get_query_ids)` and, below it, `...48784.334...images.py:284(get_gallery_ids)` (line numbers might not match because I made some other minor changes). The two methods `get_query_ids` and `get_gallery_ids` that are called in `oml/retrieval/postprocessors/pairwise.py:121-122` take 97k seconds. The query_ids and gallery_ids don't change during training, so we can calculate them once inside `ImageQueryGalleryLabeledDataset.__init__` and save about 97k seconds of training in my case.

Second - it is not obvious from the `cProfile` txt file, but from the ClearML logs it seemed interesting that the PL validation step takes about 40 minutes, although the validation step looks like this:

open-metric-learning/oml/lightning/modules/pairwise_postprocessing.py, lines 84 to 86 in 05842c8

We don't use the images that are read from disk directly after this step (we use only embeddings); however, we load the images again in the PairwiseDataset:

open-metric-learning/oml/datasets/pairs.py, lines 34 to 35 in 05842c8

I think we can pass some kind of `load_images` parameter to the dataset `__init__` method and load images only when needed - for example, in the training dataset and in the pairs dataset, but not in the validation dataset (or, as an alternative, make a wrapper for the `__getitem__` method and pass this parameter directly there). It can save 40 minutes on each validation in my case.

There were some other issues during training, but they aren't related to optimization, so I will create another issue.
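The `load_images` idea from the second point could look roughly like this (a hypothetical sketch; the class, parameter names, and dataset shape are my own, not OML's actual API):

```python
class EmbeddingsOrImagesDatasetSketch:
    """Hypothetical dataset that can skip disk reads entirely when only
    precomputed embeddings are needed (e.g. in the validation step)."""

    def __init__(self, paths, embeddings, load_images=True):
        self.paths = paths
        self.embeddings = embeddings
        self.load_images = load_images

    def _read_image(self, path):
        # stand-in for the expensive read + decode step
        return f"decoded:{path}"

    def __getitem__(self, idx):
        item = {"embedding": self.embeddings[idx]}
        if self.load_images:
            item["image"] = self._read_image(self.paths[idx])
        return item
```

Validation would construct the dataset with `load_images=False`, while training and the pairs dataset would keep the default and still receive images.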