Tokenize all the things - genomic omni-model #19

Open
nleroy917 opened this issue May 18, 2024 · 0 comments

@nleroy917
Member

OpenAI's GPT-4o is not open-source. I was recently reading a Reddit thread speculating on its architecture, which combines text, vision, and voice modalities in one model.

One user speculates:

I wonder if it's something closer to the original DALL-E where the image was decomposed into image tokens ... The embeddings of the image tokens and text tokens could share the same latent space, so that model was "natively" multimodal.

Another replies:

Yes, I think that's exactly it ... 'Just' train a encoder tokenizer for each modality, maybe define some of the extra 100k BPEs as modality-specific delimiters similar to delimiting prompts/end-of-text tokens - and then it's just 'tokenize all the things'

Which all got me thinking... what would it look like to "tokenize all the things" in a genomic context? We have modalities like scATAC-seq, scRNA-seq, and methylation, plus the textual metadata associated with these datasets. I've proposed two multi-modal architectures in the past: one being the scRNA-seq tokenizer (https://github.com/databio/geniml_dev/issues/123), and the other a CLIP-like model. But could we come up with ideas to "tokenize all the things" such that a model could take anything as input (scRNA-seq, scATAC-seq, methylation, or text) and then either 1) output an embedding, or 2) generate another modality?
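
To make the idea a bit more concrete, here is a rough sketch of what a shared token space with modality-specific delimiters might look like. It assumes each modality already has its own tokenizer that produces discrete tokens (region tokens for scATAC-seq, gene tokens for scRNA-seq, words for text). Everything here is hypothetical: the `SharedVocab` class, the `encode_sample` helper, the delimiter names, and the toy vocabularies are made up for illustration and are not part of any existing package. Methylation is left out only to keep the example short.

```python
# Hypothetical special tokens; the delimiter names are assumptions for this sketch.
SPECIAL_TOKENS = [
    "<pad>", "<bos>", "<eos>",
    "<atac>", "</atac>",
    "<rna>", "</rna>",
    "<text>", "</text>",
]


class SharedVocab:
    """One id space shared by every modality (special tokens come first)."""

    def __init__(self, modality_vocabs):
        self.token_to_id = {}
        for tok in SPECIAL_TOKENS:
            self.token_to_id[tok] = len(self.token_to_id)
        # Namespace tokens by modality so that, e.g., the gene "TP53" and the
        # word "TP53" would get different ids.
        for modality, tokens in modality_vocabs.items():
            for tok in tokens:
                self.token_to_id[f"{modality}:{tok}"] = len(self.token_to_id)

    def encode(self, modality, tokens):
        """Wrap one modality's tokens in its delimiters and map them to ids."""
        ids = [self.token_to_id[f"<{modality}>"]]
        ids += [self.token_to_id[f"{modality}:{t}"] for t in tokens]
        ids.append(self.token_to_id[f"</{modality}>"])
        return ids


def encode_sample(vocab, segments):
    """Concatenate several (modality, tokens) segments into one id sequence."""
    ids = [vocab.token_to_id["<bos>"]]
    for modality, tokens in segments:
        ids += vocab.encode(modality, tokens)
    ids.append(vocab.token_to_id["<eos>"])
    return ids


vocab = SharedVocab({
    "atac": ["chr1:1000-1500", "chr1:2000-2600"],  # toy region universe
    "rna": ["GATA1", "TP53"],                      # toy gene vocabulary
    "text": ["blood", "tumor"],                    # toy metadata words
})

sample = encode_sample(vocab, [
    ("atac", ["chr1:2000-2600"]),
    ("rna", ["TP53", "GATA1"]),
    ("text", ["blood"]),
])
print(sample)  # [1, 3, 10, 4, 5, 12, 11, 6, 7, 13, 8, 2]
```

The resulting sequence could in principle be fed to a single transformer with one embedding table, so all modalities would share a latent space. Generating another modality would then amount to prompting with an opening delimiter such as `<rna>` and decoding until the closing one, though whether that works in practice is exactly the open question.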

Of course at the end of the day... we need the datasets :/
