
Implement a soft tokenizer #7

Open
nleroy917 opened this issue Dec 8, 2023 · 1 comment
@nleroy917 (Member)

It would be cool to implement a soft tokenizer so we can use it in some of our actual models. The soft tokenizer considers overlap between the query and the universe (vocab). Using this information, we can randomly sample (with replacement) using the overlap score as a probability distribution.

Here is a Rust crate that will let you sample from distributions: https://docs.rs/rand_distr/latest/rand_distr/
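The sampling step itself is straightforward weighted sampling with replacement; a minimal sketch in Python (the region names and overlap scores below are made up for illustration — on the Rust side, `rand`'s `WeightedIndex` plays the same role):

```python
import random

# Hypothetical overlap scores between a query region and universe regions:
# each score reflects how much the query overlaps that universe region.
overlaps = {
    "chr1:100-200": 0.6,
    "chr1:150-300": 0.3,
    "chr1:250-400": 0.1,
}

regions = list(overlaps.keys())
weights = list(overlaps.values())

# Sample with replacement, using the overlap scores as the
# (unnormalized) probability distribution.
tokens = random.choices(regions, weights=weights, k=5)
```

`random.choices` normalizes the weights internally, so the overlap scores don't need to sum to 1.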

I would use it in Python like this:

```python
tokenizer = SoftTokenizer("path/to/universe.bed")
rs = RegionSet("path/to/file.bed")

tokens = tokenizer.tokenize(rs)

x = torch.tensor(tokens.to_ids())

out = model(x)

print(out)
```
@nleroy917 nleroy917 added the enhancement New feature or request label Dec 8, 2023
@nleroy917 nleroy917 self-assigned this Dec 8, 2023
@nleroy917 (Member, Author)

From the meeting, it was noted that smaller regions would show up more often than large regions, since their overlap percentage would always be higher (because they are smaller).
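A quick way to see this bias: if the overlap score is computed as overlap (in bp) divided by the universe region's length, then a small universe region fully covered by the query always scores 1.0, while a much larger region with the same absolute overlap scores far lower. A minimal illustration (the scoring scheme and coordinates are assumptions, just to make the effect concrete):

```python
def overlap_score(query, region):
    """Overlap in bp divided by the universe region's length --
    the scheme that over-weights small regions."""
    q_start, q_end = query
    r_start, r_end = region
    overlap_bp = max(0, min(q_end, r_end) - max(q_start, r_start))
    return overlap_bp / (r_end - r_start)

query = (100, 200)    # a 100 bp query region
small = (120, 170)    # 50 bp universe region, fully inside the query
large = (100, 1100)   # 1000 bp universe region, 100 bp of overlap

print(overlap_score(query, small))  # -> 1.0
print(overlap_score(query, large))  # -> 0.1
```

A length-normalized score (e.g. dividing by the query length, or by the length of the smaller of the two regions) would dampen this effect.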
