
Hierarchical universes and a tokenizer config #25

Open
nleroy917 opened this issue Jun 7, 2024 · 0 comments
Labels
brainstorming enhancement New feature or request

Comments

@nleroy917
Member

NLP/huggingface tokenizer vocabularies are often distributed as .json configuration files. The reason for this is that modern language tokenizers are configurable beyond their respective vocabularies (e.g. pre-processors, special tokens, post-processors, etc.).

Should we distribute gtokenizers the same way? Instead of a single BED file, it's a .yaml file that points to a BED file, in addition to other things like maybe a list of exclude_ranges, secondary universes (hierarchical tokenization), etc.

This could be a way to implement hierarchical universes in addition to enhancing the fragment tokenizers.
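As a rough sketch of the idea, such a config might look something like this (all key names here are hypothetical, just to illustrate the shape; nothing is implemented):

```yaml
# Hypothetical gtokenizer config sketch -- key names are illustrative only.
universe: universe.bed              # primary universe (BED file), as today
exclude_ranges: exclude_ranges.bed  # regions to drop before tokenization
special_tokens:
  unk: "<UNK>"
  pad: "<PAD>"
hierarchical_universes:             # secondary universes, tried in order when
  - tier2_universe.bed              # a fragment misses the primary universe
  - tier3_universe.bed
```

This mirrors how huggingface tokenizers bundle vocabulary plus pre/post-processing options in one distributable file, while keeping the BED files themselves unchanged.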
