
Integrate InstructIR with MTEB #905

Open
henilp105 opened this issue Jun 12, 2024 · 10 comments

Comments

@henilp105 (Contributor) commented Jun 12, 2024

I am interested in integrating InstructIR into MTEB. Currently, the dataset for InstructIR is only available on GitHub (https://github.com/kaistAI/InstructIR) and not on Hugging Face. Could you advise on the best approach to integrating it directly from GitHub? Should we use dataset_transform to download and load it directly, or is there an alternative method that you recommend?

Thanks and Regards,
Henil

CC: @Muennighoff @KennethEnevoldsen

@KennethEnevoldsen (Contributor)

Hi @henilp105, this seems related to @orionw's work on FollowIR.

Generally the codebase does not allow datasets that are not available through Hugging Face. Hugging Face does, however, support dataset scripts, which could essentially just fetch from GitHub. But since the data is available on GitHub under an open license, I would simply upload it to HF.
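
A minimal sketch of that upload path, assuming a local clone of the GitHub repo; the data file path and target repo id are illustrative assumptions, not the actual layout:

```python
# Minimal sketch: load a raw InstructIR file from a local clone of the GitHub
# repo and push it to the Hugging Face Hub. The data file path and the target
# repo id below are assumptions for illustration.
from datasets import load_dataset

dataset = load_dataset(
    "json",
    data_files={"test": "InstructIR/data/queries.jsonl"},  # hypothetical path
)
dataset.push_to_hub("your-username/InstructIR")  # needs `huggingface-cli login` first
```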

@henilp105 (Contributor, Author)

Thanks @KennethEnevoldsen, I will refer to the FollowIR PR. I will upload the dataset to HF and start the implementation.

@Muennighoff (Contributor)

Amazing! 🙌 @hanseokOh, following up on kaistAI/InstructIR#3: have you already started on the integration, or could you maybe coordinate with @henilp105?

@orionw (Contributor) commented Jun 12, 2024

This is exciting, thanks @henilp105! I think it'd be a great addition -- I wanted to add it myself but never had the time. @hanseokOh if you already started doing it, let me know and we can adjust these details!

For the technical details to put this in MTEB:

  • I was thinking we could modify the InstructionRetrieval abstract task to handle multiple instructions per query (instead of the two it currently allows), as IIRC InstructIR has 10 per query.
  • We'd need to change the task definition to a new data structure for instructions that maps queries -> list of instructions. We'd remove the original and altered instructions and just keep that list (see the sketch after this list).
  • We'd change the eval to add InstructIR's Robustness metric over all instructions, while still reporting p-MRR from the first instruction to the others in a paired fashion when there are multiple sets of qrels (so we'll have to modify the qrels structure to support either a single set of qrels or multiple sets).
  • If there is only one set of qrels, we'd want to cache the corpus to avoid repeated embedding (a small change to the current caching logic in the abstract task).
  • I can change the format of the FollowIR datasets to match the new format.
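
For concreteness, a toy sketch of what that structure could look like (placeholder field names and data, not the final MTEB schema):

```python
# Toy illustration of the proposed schema (placeholder names, not final):
# each query id maps to a *list* of instructions instead of the current
# instruction_og / instruction_changed pair.
queries: dict[str, str] = {
    "q1": "what are the side effects of caffeine?",
}

instructions: dict[str, list[str]] = {
    "q1": [
        "I am a clinician; prefer patient-safety guidance.",
        "I am a researcher; prefer peer-reviewed studies.",
        # ... InstructIR would have ~10 entries per query
    ],
}

# Qrels as either a single shared set or one set per instruction; paired
# p-MRR would only apply when multiple sets exist.
qrels: list[dict[str, dict[str, int]]] = [
    {"q1": {"doc3": 1}},  # relevance under instruction 0
    {"q1": {"doc7": 1}},  # relevance under instruction 1
]
```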

Let me know if this makes sense @henilp105 and we can discuss further!

@henilp105 (Contributor, Author)

Thanks @orionw, these are great insights. I have uploaded the dataset to Hugging Face (henilp105/InstructIR). I'll keep this thread updated and reach out if I encounter any roadblocks.

> I was thinking we could modify the InstructionRetrieval abstract task to handle multiple instructions per query (instead of the two it currently allows), as IIRC InstructIR has 10 per query.

We would need a common naming scheme for instructions across datasets: in FollowIR we have instruction_og and instruction_changed, whereas in InstructIR we have {qid}_{instruction_number}. What would be a suitable format for all datasets, so that adding new ones is easier?

@orionw (Contributor) commented Jun 12, 2024

Great point @henilp105. I think something like {qid}_{instruction_number} is a good format, although perhaps with more "_" characters in the middle, since some qids already contain underscores. Maybe three underscores or dashes.
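
A small sketch of such an id scheme with a triple-underscore separator (helper names are hypothetical):

```python
# Hypothetical helpers for the id scheme discussed above: join qid and
# instruction number with a separator unlikely to occur inside a qid.
SEP = "___"  # three underscores, per the suggestion above

def make_instruction_id(qid: str, instruction_number: int) -> str:
    return f"{qid}{SEP}{instruction_number}"

def parse_instruction_id(instruction_id: str) -> tuple[str, int]:
    # rsplit from the right, so qids containing single underscores survive
    qid, number = instruction_id.rsplit(SEP, 1)
    return qid, int(number)

assert parse_instruction_id(make_instruction_id("msmarco_q_42", 7)) == ("msmarco_q_42", 7)
```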

@henilp105 (Contributor, Author) commented Jun 12, 2024

Thanks, I think three underscores would be good. Also, since datasets may contain various subsets that are not present in every dataset (like top_ranked or analysis_order_sensitivity), how should we evaluate those? Or should we only support the base subset of each dataset instead?

@orionw (Contributor) commented Jun 12, 2024

Good question. top_ranked is the collection of documents to be evaluated per query, so it should be optional: we check whether top_ranked exists, and if it doesn't, we fill it with the default, i.e. the entire corpus for each query.
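
A small sketch of that fallback (names are illustrative, not the actual MTEB internals):

```python
# Illustrative fallback (not the actual MTEB internals): if a dataset ships
# no top_ranked pool, default every query's candidate set to the full corpus.
def fill_top_ranked(
    top_ranked: dict[str, list[str]] | None,
    queries: dict[str, str],
    corpus: dict[str, str],
) -> dict[str, list[str]]:
    if top_ranked is not None:
        return top_ranked
    all_doc_ids = list(corpus)
    return {qid: all_doc_ids for qid in queries}
```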

analysis_order_sensitivity is from the InstructIR paper, correct? The ablation where they change the order of the query and instruction? Since we let models choose this in MTEB, I think we should leave that portion out and just keep the main data. But correct me if I'm wrong!

@hanseokOh

Sorry for being late 😂 Yes, as @Muennighoff and @orionw mentioned, I have been trying to merge the InstructIR dataset into the MTEB repository, and I also made an HF dataset repo for it!
But I have had little time for the integration recently, so it would be great if @henilp105 could help us 👍.

Also, I agree with what @orionw said: it would be good to integrate only the main part here (not the ablation details such as analysis_order_sensitivity).

@henilp105 (Contributor, Author)

Thanks @hanseokOh. I was unable to find the dataset repo link on GitHub. I would be happy to assist with the integration, and I also believe that integrating only the main part is the right approach. Please feel free to chime in on the PR with any suggestions; I would be happy to address them.
