
End-to-end integration / testing with leaderboard #932

Open

Muennighoff opened this issue Jun 15, 2024 · 9 comments

Comments

@Muennighoff
Contributor

There are a few things that could be improved re: the leaderboard & codebase integration.

cc @orionw, who had some great ideas on this & is a GitHub Actions wizard.

@KennethEnevoldsen
Contributor

Completely agree! I would love to merge the two repositories!

@orionw
Contributor

orionw commented Jun 15, 2024

+1 to this issue @Muennighoff!

I hope to take a look at the latter three of these bullet points at the end of next week: making it easier to add results, mirroring to GitHub, and calculating the leaderboard automatically without refreshes.

We currently cannot automatically test the effect of changes made here on the leaderboard.

I was wondering about this myself - I think adding tests is a great starting place. It is a little tricky, as the solution to the latter three involves setting the leaderboard up as a mirror on GitHub and doing automatic pushes, so it would draw from the main branch of wherever we store the results (mteb?). So anyone working on a branch of mteb won't be able to see the failure until it's already committed.

One potential solution to this is to add another test to mteb that runs some part of the leaderboard processing code. I think this could work, although it is not the cleanest solution (how do we keep that file in sync, so that updates to the leaderboard processing code are reflected in the mteb test and vice versa?). If others have suggestions I'd be very interested in hearing them!
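For concreteness, a minimal sketch of what such a test inside the mteb repo could look like, assuming the leaderboard processing code is importable as a package; the `mteb_leaderboard` package name and `build_table` function are hypothetical placeholders, not existing APIs:

```python
# tests/test_leaderboard_smoke.py -- sketch only; package and function names are hypothetical.
from pathlib import Path

import pytest

# Skip rather than fail if the leaderboard code isn't installed in this CI job.
leaderboard = pytest.importorskip("mteb_leaderboard")

# A small, checked-in folder of result files used only for this smoke test.
TOY_RESULTS = Path(__file__).parent / "toy_results"


def test_leaderboard_processing_runs_on_toy_results():
    # Only asserts that the processing code runs end to end without raising;
    # exact scores are not checked.
    table = leaderboard.build_table(results_folder=TOY_RESULTS)
    assert len(table) > 0
```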

@Muennighoff
Contributor Author

Muennighoff commented Jun 15, 2024

That's amazing! 🚀

I think ideally there'd be two tests, something like:

Test 1:
  1. Run a fast model on some datasets to get a result
  2. Add the result to a toy results folder
  3. Check if the leaderboard can fetch from that result folder

Test 2:
  1. Run a fast model on some datasets to get a result
  2. Turn the results into metadata
  3. Check if the leaderboard can fetch from the metadata

I think we only need to make sure the leaderboard code runs without erroring out, which could likely be done by just parametrizing it a bit; then we can feed in the results folder & metadata as parameters for the tests. Anyway, I think the best solution here will become clearer as we advance on the other issues.
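A rough sketch of what those two tests could look like, assuming a fast model and a single small task. The mteb calls follow the library's usual usage pattern; the `mteb_leaderboard` package and its helpers (`build_table`, `results_to_metadata`, `build_table_from_metadata`) are hypothetical names for whatever the parametrized leaderboard code ends up exposing:

```python
# tests/test_leaderboard_integration.py -- sketch; leaderboard-facing names are hypothetical.
from pathlib import Path

import mteb
import pytest
from sentence_transformers import SentenceTransformer

leaderboard = pytest.importorskip("mteb_leaderboard")  # hypothetical package

FAST_MODEL = "sentence-transformers/average_word_embeddings_komninos"
SMALL_TASKS = ["Banking77Classification"]


@pytest.fixture(scope="session")
def toy_results_folder(tmp_path_factory) -> Path:
    """Step 1 of both tests: run a fast model on a small task set."""
    out = tmp_path_factory.mktemp("toy_results")
    model = SentenceTransformer(FAST_MODEL)
    tasks = mteb.get_tasks(tasks=SMALL_TASKS)
    mteb.MTEB(tasks=tasks).run(model, output_folder=str(out))
    return out


def test_leaderboard_reads_results_folder(toy_results_folder):
    # Test 1: feed the toy results folder directly to the leaderboard code.
    table = leaderboard.build_table(results_folder=toy_results_folder)
    assert len(table) > 0


def test_leaderboard_reads_metadata(toy_results_folder):
    # Test 2: turn the results into metadata first, then build the table from it.
    metadata = leaderboard.results_to_metadata(toy_results_folder)
    table = leaderboard.build_table_from_metadata(metadata)
    assert len(table) > 0
```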

@KennethEnevoldsen
Contributor

Wherever we store the results (mteb?)

I would add the results to mteb.

To avoid influencing the existing leaderboard too much it might be ideal to keep the existing one as is for now and create a new leaderboard for development.

@orionw
Contributor

orionw commented Jul 3, 2024

To avoid influencing the existing leaderboard too much it might be ideal to keep the existing one as is for now and create a new leaderboard for development.

Agreed, as I will likely break it a few times before it's fixed, haha. I've created mteb/leaderboard-in-progress, which we can rename once it's syncing correctly.

@orionw
Contributor

orionw commented Jul 5, 2024

I've created an mteb/leaderboard GitHub repository that recalculates the leaderboard results daily via GitHub Actions (a full refresh) and syncs them to the Hugging Face Space mteb/leaderboard-in-progress. The roughly one-hour refresh of all models can happen in the background at night, while the Space runs virtually instantaneously using those cached files!
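For reference, a sketch of what the nightly refresh job might boil down to, assuming the tables are written to a local cache folder and pushed to the Space with `huggingface_hub`; the recomputation body, folder layout, and repo id are illustrative, and the daily scheduling itself would live in a GitHub Actions cron workflow:

```python
# refresh_leaderboard.py -- illustrative sketch of the nightly full refresh.
from pathlib import Path

from huggingface_hub import HfApi

CACHE_DIR = Path("boards_data")            # pre-computed tables served by the Space (assumed layout)
SPACE_ID = "mteb/leaderboard-in-progress"  # target Space, to be renamed after the switchover


def recompute_all_tables(out_dir: Path) -> None:
    """Placeholder for the ~1 hour full recomputation of every model's scores."""
    out_dir.mkdir(exist_ok=True)
    # ... fetch raw results, aggregate per task/benchmark, write CSV/JSON files ...


def main() -> None:
    recompute_all_tables(CACHE_DIR)
    # Push the cached files to the Space; the app then only reads these files
    # at startup, so the Space itself loads near-instantly.
    HfApi().upload_folder(
        folder_path=str(CACHE_DIR),
        path_in_repo=CACHE_DIR.name,
        repo_id=SPACE_ID,
        repo_type="space",
        commit_message="Nightly leaderboard refresh",
    )


if __name__ == "__main__":
    main()
```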

It'd be nice to monitor it for a day or two before making it work on the main Space. For those couple of days, is it okay to pause any new commits to the leaderboard Space? I had to make a large number of refactors, and it would be a pain to resolve new conflicts.

Does Saturday/Sunday work for the switchover @Muennighoff? I'm trying to find a time that will cause the least impact if it goes down for a few hours during the transition, and I'm not sure when the most active usage of the Space is.

NOTE: this doesn't use the new mteb/results GitHub repo -- what is the status of that? Is that just MMTEB results?

@Muennighoff
Contributor Author

That's amazing! Your suggestion sounds good to me, and we can hold off on committing anything for a few days (also cc @tomaarsen). I'm not sure it would even go down, but any date is fine, I think.

For the mteb/results GitHub repo - I think we can start using it? We just need to move over all the result files from the mteb/results HF repo and then keep it synced, I think?
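A one-off migration along those lines could be roughly the sketch below, assuming the HF repo is a dataset repo and a local clone of the mteb/results GitHub repo already exists; the paths, repo type, and target folder layout are all assumptions:

```python
# migrate_results.py -- sketch of copying result files from the HF repo into
# a local clone of the mteb/results GitHub repo. Paths and repo type are assumed.
import shutil
import subprocess
from pathlib import Path

from huggingface_hub import snapshot_download

GITHUB_CHECKOUT = Path("results")  # local clone of the mteb/results GitHub repo

# Download a local copy of the Hugging Face repo (assumed to be a dataset repo).
hf_copy = snapshot_download(repo_id="mteb/results", repo_type="dataset")

# Copy the result files into the Git checkout and commit them.
shutil.copytree(hf_copy, GITHUB_CHECKOUT / "results", dirs_exist_ok=True)
subprocess.run(["git", "-C", str(GITHUB_CHECKOUT), "add", "results"], check=True)
subprocess.run(
    ["git", "-C", str(GITHUB_CHECKOUT), "commit", "-m", "Import results from the HF mteb/results repo"],
    check=True,
)
```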

@KennethEnevoldsen
Contributor

Yeah, I do think we should start using mteb/results. Instead of having multiple switchovers, might it not be ideal to add the interface update to MTEB as well?

Should we also add any updates on how to add models etc.?

@tomaarsen
Member

Seems good! I'll abstain from commits on the HF Leaderboard Space in the next few days.

  • Tom Aarsen
