Optionally make data loading more deterministic #71

Open
isamu-isozaki opened this issue May 8, 2023 · 2 comments

@isamu-isozaki
Collaborator

isamu-isozaki commented May 8, 2023

We randomly resample the shards (with replacement) and sample examples from a buffer every time we start or resume a training run. This means our data loading is not deterministic. We also don't do epoch-based training; we only use epochs for bookkeeping and so that the same training loop can be reused with other datasets/loaders.

Optionally make this more deterministic for reproducibility.
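
One way this could look, as a minimal sketch rather than the project's actual loader: derive the random state from a fixed seed plus the bookkeeping epoch, then use that one generator both for resampling shards with replacement and for the shuffle buffer. All names here (`sample_shards`, `shuffle_buffer`, `deterministic_stream`, the `shard-XXXXX.tar` file names) are hypothetical, and shard names stand in for the examples a real loader would read from each shard.

```python
import random
from typing import Iterable, Iterator, List, Sequence


def make_rng(seed: int, epoch: int) -> random.Random:
    # Hypothetical helper: one RNG per (seed, epoch), so resuming at a
    # saved epoch replays the same random choices.
    return random.Random(seed * 1_000_003 + epoch)


def sample_shards(shards: Sequence[str], n: int, rng: random.Random) -> List[str]:
    # Sample n shards with replacement, driven only by the given RNG.
    return [rng.choice(shards) for _ in range(n)]


def shuffle_buffer(items: Iterable[str], buffer_size: int, rng: random.Random) -> Iterator[str]:
    # Fill a buffer, then emit a randomly chosen element each time it is full.
    buf: List[str] = []
    for item in items:
        buf.append(item)
        if len(buf) >= buffer_size:
            idx = rng.randrange(len(buf))
            buf[idx], buf[-1] = buf[-1], buf[idx]
            yield buf.pop()
    # Drain whatever is left in a seeded random order.
    rng.shuffle(buf)
    yield from buf


def deterministic_stream(
    shards: Sequence[str], n: int, seed: int, epoch: int, buffer_size: int
) -> Iterator[str]:
    # Assumption: the caller checkpoints `epoch` and passes it back on resume.
    rng = make_rng(seed, epoch)
    return shuffle_buffer(sample_shards(shards, n, rng), buffer_size, rng)


if __name__ == "__main__":
    shards = [f"shard-{i:05d}.tar" for i in range(8)]
    first = list(deterministic_stream(shards, 16, seed=0, epoch=3, buffer_size=4))
    again = list(deterministic_stream(shards, 16, seed=0, epoch=3, buffer_size=4))
    print(first == again)
```

This only makes the order reproducible at epoch granularity. Resuming in the middle of an epoch would also need the number of samples already consumed to be saved and skipped, and with multiple workers each worker would need its own seed derived from the worker index.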

@pcuenca
Member

pcuenca commented May 8, 2023

I recently came across Mosaic ML's dataset streaming library: https://github.com/mosaicml/streaming. Haven't used it yet, but it looks interesting.

@isamu-isozaki
Collaborator Author

@pcuenca Thanks for the link!
