Using lhotse when training a hybrid fast conformer model fails #9462

Closed · dhoore123 opened this issue Jun 13, 2024 · 10 comments
Labels: bug (Something isn't working)

dhoore123 commented Jun 13, 2024

I am trying to use lhotse when training a hybrid fast conformer model. The error is:
File "/usr/local/lib/python3.10/dist-packages/nemo/core/optim/lr_scheduler.py", line 870, in prepare_lr_scheduler
num_samples = len(train_dataloader.dataset)
TypeError: object of type 'LhotseSpeechToTextBpeDataset' has no len()

My motivation is to use the dynamic batching and the ability to weight several languages equally when training my multilingual hybrid fast conformer model, features which the lhotse integration is advertised to provide.

I am using a Singularity container built from NVIDIA's nemo-24.01 Docker container. I also tried nemo-24.05 with the same result. Everything runs in a Slurm environment using multiple GPUs on a single node, on an on-premises grid.
I zipped and attached my YAML configuration file. When not using lhotse, the config works. "Not using lhotse" means setting use_lhotse to false and commenting out the following three lhotse-related lines in the trainer config:
use_distributed_sampler: false
limit_train_batches: 20000
val_check_interval: 20000

The error suggests that something may be missing in the code (is __len__ not implemented for LhotseSpeechToTextBpeDataset?), which would point to an incomplete lhotse integration for my use case. If instead something is missing or incorrect in my config, I would be happy to learn.
I am not in a position to share the data, or even parts of it, but I hope the error message rings a bell about what could be wrong here.
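
For reference, the lhotse-related parts of the config look roughly like this (only the keys mentioned above are shown; the nesting follows the usual NeMo ASR config layout, so treat the exact placement as approximate):

model:
  train_ds:
    use_lhotse: true   # set to false for the non-lhotse run that works

trainer:
  use_distributed_sampler: false   # lhotse does its own distributed sampling
  limit_train_batches: 20000
  val_check_interval: 20000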

=======================

FastConformer-Hybrid-Transducer-CTC-BPE-Streaming-multi-60-lhotse.zip

qmgzhao commented Jun 22, 2024

I have the same problem.

pzelasko (Collaborator) commented Jun 24, 2024

Can you also set max_steps to something other than -1, e.g. 100000? Let us know if this helps.
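
Roughly, in the trainer section of the config (placement assumed to match the standard NeMo configs; 100000 is just an example value):

trainer:
  max_steps: 100000   # any finite value; with -1, NeMo sizes the LR schedule via len(train_dataloader.dataset), which lhotse's iterable dataset does not provide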

dhoore123 (Author) commented
Setting max_steps as suggested seems to do the trick. Training now runs. Thanks!
I'll close the ticket once I see some epochs completing successfully.

dhoore123 (Author) commented
I finally got training running for a few (pseudo-)epochs now. Even though I am running on two 80 GB GPUs, I had to lower batch_duration to 750, with batch_size removed from the configuration; the GPUs ran out of memory with higher values. I did not expect this, as the example in the NVIDIA docs suggests a batch_duration of 1100 for a 32 GB GPU.
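
Concretely, the change was (values as described above; surrounding keys assumed to match the attached config):

model:
  train_ds:
    use_lhotse: true
    batch_duration: 750   # lowered; higher values ran out of GPU memory
    # batch_size removed, so batches are sized purely by total audio duration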

pzelasko (Collaborator) commented
> I had to tune down the batch_duration to 750, with batch_size removed from the configuration.

It seems that your actual batch sizes became larger after removing the batch_size constraint, which led to this outcome. This is a net benefit: despite decreasing batch_duration, you are still enjoying larger batch sizes.

> I did not expect this as the example in the nvidia docs suggests using a batch_duration of 1100 for a 32GB GPU.

The maximum possible batch_duration setting is determined by several factors:

  • available GPU RAM
  • model size
  • objective function
  • data duration distribution / max_duration / number of buckets / optional quadratic_duration penalty

The setting of 1100s was specific to FastConformer-L CTC+RNN-T trained on ASRSet 3. It is expected that with a different model, data, objective function, etc. you may need to tune it again. I am hoping to simplify the tuning process in the future.
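
For illustration, these knobs all live in the train_ds section (a sketch; the parameter names are the ones used by NeMo's lhotse dataloading config, but the values are placeholders, not recommendations):

model:
  train_ds:
    batch_duration: 750        # upper bound on total audio seconds per batch
    max_duration: 40.0         # filter out utterances longer than this (seconds)
    use_bucketing: true        # group cuts of similar duration to reduce padding
    num_buckets: 30            # more buckets = tighter duration grouping
    quadratic_duration: 30.0   # optional penalty so long utterances count extra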

dhoore123 (Author) commented
Thanks for your reply, pzelasko. It reassures me that this batch_duration value does not seem odd to you, and does not point to something I did wrong.
On a different note: the effective batch size is normally batch_size x accumulate_grad_batches (or fused_batch_size in the case of hybrid training?) x number_of_gpus, so the number of steps per epoch is a function of the number of GPUs.
When using lhotse, the number of steps in a "pseudo" epoch appears to be the same regardless of the number of GPUs. Does this mean that the amount of data seen in one "pseudo" epoch depends on the number of GPUs one uses, or is lhotse spreading the same amount of data over fewer effective batches per step when running on more GPUs?

pzelasko (Collaborator) commented
It means that if you keep the “pseudoepoch” size constant, the amount of data seen during a “pseudoepoch” is proportional to the number of GPUs. Generally I don’t encourage thinking in epochs in this flavor of data loading; the only thing that counts is the number of updates. And yes, the total batch duration per update is the product of the number of GPUs, batch_duration, and the grad accumulation factor.
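
As a purely illustrative calculation (accumulate_grad_batches=1 is an assumption; the other numbers are the ones from this thread): with 2 GPUs and batch_duration: 750, each optimizer update covers roughly 2 x 750 x 1 = 1500 seconds, i.e. about 25 minutes of audio, however many utterances that happens to be.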

dhoore123 (Author) commented
I figured out why I had to set batch_duration to a much lower value than expected. The parameter use_bucketing was not set and defaults to false. After setting it to true, it looks like I am getting the behavior I was aiming for. Note that this parameter is not mentioned on the documentation page about lhotse; I found it by inspecting the code itself.
In any case, thanks for the tips and suggestions, Piotr. NeMo and the tools it builds on can be complex at times, but what it can do is amazing.
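
For anyone hitting the same thing, a minimal sketch of the train_ds settings involved (use_lhotse, use_bucketing and batch_duration are the values actually discussed in this thread; num_buckets is an assumed example, and the nesting is the usual NeMo ASR layout):

model:
  train_ds:
    use_lhotse: true
    use_bucketing: true   # defaults to false; without it, short and long cuts get mixed and padding blows up memory
    batch_duration: 750   # total seconds of audio per batch, per GPU
    num_buckets: 30       # example value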

dhoore123 (Author) commented
Closing, as the reported problem is solved: max_steps should be set to some large finite value.
As a side remark: the use_bucketing parameter should be set to true (if you want dynamic bucketing with lhotse).

pzelasko (Collaborator) commented Jul 2, 2024

Thanks for your feedback. I’ll try to improve the documentation to be clearer about this. The option is there if you keep scrolling down, but it’s indeed missing from the code snippet.
