
Be able to cache embeddings and load them #946

Open
orionw opened this issue Jun 17, 2024 · 5 comments
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@orionw
Contributor

orionw commented Jun 17, 2024

For most users, being able to cache their embedded docs and/or provide a cached embedding file is probably overkill.

However, there are many situations where an option to cache them would be helpful. For example, experiments where you alter the query/document set for speedups (as I'm doing now), or where you're testing the effect of different prefixes/instructions on the same dataset.

I typically use pyserini to cache the index so that we can quickly search over it later, but that doesn't integrate nicely with mteb. I think it would be fairly straightforward to implement this: (1) take in a flag for whether to cache the embeddings, and write them to a file that corresponds to the dataset and model name, and (2) provide an option to read in a cached embedding file.

I don't have bandwidth for this right now, but if anyone does it would be an excellent addition.
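A minimal sketch of the file-naming idea from step (1) above: one cache file per (model, dataset) pair. The function name `embedding_cache_path` and the `cache_dir` default are illustrative only, not part of mteb's API.

```python
from pathlib import Path

def embedding_cache_path(model_name: str, dataset_name: str,
                         cache_dir: str = "embedding_cache") -> Path:
    # Hypothetical helper: build a cache filename from the model and dataset
    # names. Replace characters that are unsafe in filenames (e.g. "/" in
    # Hugging Face model ids like "intfloat/e5-base").
    safe_model = model_name.replace("/", "__")
    return Path(cache_dir) / f"{safe_model}__{dataset_name}.npy"
```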

@orionw orionw added good first issue Good for newcomers enhancement New feature or request labels Jun 17, 2024
@tenzu15

tenzu15 commented Jun 17, 2024

Hey @orionw ,

I would like to try this if possible!

@orionw
Contributor Author

orionw commented Jun 17, 2024

Awesome @tenzu15! It would be great to be able to pass two flags to the mteb.run command, something like `cache_embeddings: bool = True` and `cached_embedding_file: str`.

This would need to be changed in the RetrievalEvaluator class for now. If it's useful for other tasks, we can implement it there also. Also cc'ing @KennethEnevoldsen who may have opinions on where this should be added/what the names should be.

But feel free to start @tenzu15. If you have any questions feel free to make a draft PR and cc me!

@KennethEnevoldsen
Contributor

KennethEnevoldsen commented Jun 18, 2024

@orionw, wouldn't it be better to implement a more general model wrapping for this so that it works for all tasks?

```python
class ModelWrap:
    def __init__(self, model):
        self.model = model

    def encode(self, sentences, **kwargs):
        embeddings = self.model.encode(sentences, **kwargs)
        self.store_embeddings(sentences, embeddings)  # persist for later reuse
        return embeddings
```

@isaac-chung
Collaborator

There's some background discussion related to the topic from #354 (comment) as well.

@orionw
Contributor Author

orionw commented Jun 18, 2024

+1 @KennethEnevoldsen, I think a wrapper is a great idea and even simpler to implement.
