Train/val split #478

Open
DavidHerel opened this issue Feb 6, 2024 · 0 comments


DavidHerel commented Feb 6, 2024

Hi,

I want to ask how to split a dataset into train/val splits. In tinystories.py, I don't quite understand this comment:

train/test split. let's use only shard 0 for test split, rest train

So how many tokens from the training data end up in the validation split?

It seems that @karpathy uses 10 shards. If only shard 0 is used as the test split, does that mean 1/10 of the data is used as the test set?
For example, if I have a dataset with 10B tokens, are 1B tokens used for the test/val set?
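
To make the arithmetic concrete, here is a minimal sketch of how I read that comment. It is not the repository's code: the function names, the file names, and the assumption of 10 equal-sized shards are all hypothetical and only there for illustration.

```python
def split_shards(shard_filenames, split):
    """Hold out shard 0 for test/val; every other shard goes to train."""
    shard_filenames = sorted(shard_filenames)
    return shard_filenames[1:] if split == "train" else shard_filenames[:1]


def held_out_fraction(tokens_per_shard):
    """Fraction of all tokens that fall in the held-out shard (index 0)."""
    return tokens_per_shard[0] / sum(tokens_per_shard)


# Hypothetical setup: 10 shards with 1B tokens each (10B tokens total).
shards = [f"data{i:02d}.bin" for i in range(10)]
train = split_shards(shards, "train")
val = split_shards(shards, "val")
fraction = held_out_fraction([1_000_000_000] * 10)

print(len(train), len(val), fraction)
```

If this reading is right, the held-out share is 1/N only when all N shards hold the same number of tokens. With uneven shards it is simply the token count of shard 0 divided by the total.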
