Implement a return_tensors="pt" function for the Tokenizers #12

Open
nleroy917 opened this issue Apr 16, 2024 · 1 comment

Comments

nleroy917 commented Apr 16, 2024

I'm trying to optimize a few things so that BED files can be processed more efficiently through our tokenizers and models in actual production environments (like bedbase).

One bottleneck I encounter is creating tensors from lists of integers. I explain this in more detail in a PR over in geniml but, briefly, the current tokenizers can only return lists of integers for tokenized BED files. It could be more efficient to emit a tensor directly. I think that this is possible using some combination of the following Rust crates:

With this, the tokenizers could return a torch.Tensor object directly, with no need to convert between types, potentially saving time. Additionally, we could offer an option to return NumPy arrays (np.ndarray) with rust-numpy.
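
To make the proposed interface concrete, here is a minimal Python-side sketch of how a `return_tensors` option might dispatch between output types. It is not the Rust implementation discussed above: the helper name `format_token_ids` and the `"np"` value are assumptions made for illustration, and only `"pt"` comes from the issue title. The real gain would presumably come from building the array on the Rust side rather than from a Python list as done here.

```python
from typing import List, Optional


def format_token_ids(ids: List[int], return_tensors: Optional[str] = None):
    """Hypothetical helper: return token IDs as a list, NumPy array, or torch.Tensor."""
    if return_tensors is None:
        # Current behaviour described in the issue: a plain list of integers.
        return list(ids)

    # NumPy is imported lazily so the plain-list path needs no extra dependency.
    import numpy as np

    arr = np.asarray(ids, dtype=np.int64)
    if return_tensors == "np":
        return arr
    if return_tensors == "pt":
        import torch  # lazy import: only required for the "pt" path

        # torch.from_numpy shares memory with `arr` instead of copying it.
        return torch.from_numpy(arr)
    raise ValueError(f"unsupported return_tensors value: {return_tensors!r}")
```

Because `torch.from_numpy` reuses the array's buffer, a tokenizer that already produces a NumPy array could hand back a tensor without a second copy of the token IDs.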

@nleroy917

I've actually implemented a to_numpy() function, but to_tensor() might be a little complicated...
