
End-to-end integration / testing with leaderboard #932

Open

Muennighoff opened this issue Jun 15, 2024 · 9 comments

Comments

@Muennighoff
Contributor

There are a few things that could be improved re: the leaderboard & codebase integration.

cc @orionw, who had some great ideas on this & is a GitHub Actions wizard.

@KennethEnevoldsen
Contributor

Completely agree! I would love to merge the two repositories!

@orionw
Contributor

orionw commented Jun 15, 2024

+1 to this issue @Muennighoff!

I hope to take a look at the latter three of these bullet points at the end of next week: making it easier to add results, mirroring to GitHub, and calculating the leaderboard automatically without refreshes.

We currently cannot automatically test the effect of changes made here on the leaderboard.

I was wondering about this myself - I think adding tests is a great starting place. It is a little tricky, as the solution to the latter three involves setting the leaderboard up as a mirror on GitHub and doing automatic pushes, so it would draw from the main branch of wherever we store the results (mteb?). So anyone working on a branch of mteb won't be able to see the failure until it's already committed.

One potential solution to this is to add another test to mteb that runs some part of the leaderboard processing code. I think this could work, although it is not the cleanest solution (how do we keep that file in sync, so that updates to the leaderboard processing code are reflected in the mteb test and vice versa?). If others have suggestions I'd be very interested in hearing them!
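For concreteness, a minimal sketch of what such a test inside the mteb repo could look like, assuming the leaderboard processing code is importable as a package; the `mteb_leaderboard` package name and `build_table` function are hypothetical placeholders, not existing APIs:

```python
# tests/test_leaderboard_smoke.py -- sketch only; package and function names are hypothetical.
from pathlib import Path

import pytest

# Skip rather than fail if the leaderboard code isn't installed in this CI job.
leaderboard = pytest.importorskip("mteb_leaderboard")

# A small, checked-in folder of result files used only for this smoke test.
TOY_RESULTS = Path(__file__).parent / "toy_results"


def test_leaderboard_processing_runs_on_toy_results():
    # Only asserts that the processing code runs end to end without raising;
    # exact scores are not checked.
    table = leaderboard.build_table(results_folder=TOY_RESULTS)
    assert len(table) > 0
```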

@Muennighoff
Contributor Author

Muennighoff commented Jun 15, 2024

That's amazing! 🚀

I think ideally there'd be two tests, something like:

Test 1:
  1. Run a fast model on some datasets to get a result
  2. Add the result to a toy results folder
  3. Check if the leaderboard can fetch from that result folder

Test 2:
  1. Run a fast model on some datasets to get a result
  2. Turn the results into metadata
  3. Check if the leaderboard can fetch from the metadata

I think we only need to make sure the leaderboard code runs without erroring out, which could likely be done by just parametrizing it a bit; then we can feed in the results folder & metadata as parameters for the tests. Anyway, I think the best solution here will become clearer as we advance on the other issues.
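A rough sketch of what those two tests could look like, assuming a fast model and a single small task. The mteb calls follow the library's usual usage pattern; the `mteb_leaderboard` package and its helpers (`build_table`, `results_to_metadata`, `build_table_from_metadata`) are hypothetical names for whatever the parametrized leaderboard code ends up exposing:

```python
# tests/test_leaderboard_integration.py -- sketch; leaderboard-facing names are hypothetical.
from pathlib import Path

import mteb
import pytest
from sentence_transformers import SentenceTransformer

leaderboard = pytest.importorskip("mteb_leaderboard")  # hypothetical package

FAST_MODEL = "sentence-transformers/average_word_embeddings_komninos"
SMALL_TASKS = ["Banking77Classification"]


@pytest.fixture(scope="session")
def toy_results_folder(tmp_path_factory) -> Path:
    """Step 1 of both tests: run a fast model on a small task set."""
    out = tmp_path_factory.mktemp("toy_results")
    model = SentenceTransformer(FAST_MODEL)
    tasks = mteb.get_tasks(tasks=SMALL_TASKS)
    mteb.MTEB(tasks=tasks).run(model, output_folder=str(out))
    return out


def test_leaderboard_reads_results_folder(toy_results_folder):
    # Test 1: feed the toy results folder directly to the leaderboard code.
    table = leaderboard.build_table(results_folder=toy_results_folder)
    assert len(table) > 0


def test_leaderboard_reads_metadata(toy_results_folder):
    # Test 2: turn the results into metadata first, then build the table from it.
    metadata = leaderboard.results_to_metadata(toy_results_folder)
    table = leaderboard.build_table_from_metadata(metadata)
    assert len(table) > 0
```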

@KennethEnevoldsen
Contributor

Wherever we store the results (mteb?)

I would add the results to mteb.

To avoid influencing the existing leaderboard too much it might be ideal to keep the existing one as is for now and create a new leaderboard for development.

@orionw
Contributor

orionw commented Jul 3, 2024

To avoid influencing the existing leaderboard too much it might be ideal to keep the existing one as is for now and create a new leaderboard for development.

Agreed, as I will likely break it a few times before it's fixed, haha. I've created mteb/leaderboard-in-progress, which we can rename once it's syncing correctly.

@orionw
Contributor

orionw commented Jul 5, 2024

I've created an mteb/leaderboard GitHub repository that recalculates the leaderboard results daily via GitHub Actions (a full refresh) and syncs them to the Hugging Face Space mteb/leaderboard-in-progress. The roughly one-hour refresh of all models can happen in the background at night, while the Space runs virtually instantaneously using those cached files!
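For reference, a sketch of what the nightly refresh job might boil down to, assuming the tables are written to a local cache folder and pushed to the Space with `huggingface_hub`; the recomputation body, folder layout, and repo id are illustrative, and the daily scheduling itself would live in a GitHub Actions cron workflow:

```python
# refresh_leaderboard.py -- illustrative sketch of the nightly full refresh.
from pathlib import Path

from huggingface_hub import HfApi

CACHE_DIR = Path("boards_data")            # pre-computed tables served by the Space (assumed layout)
SPACE_ID = "mteb/leaderboard-in-progress"  # target Space, to be renamed after the switchover


def recompute_all_tables(out_dir: Path) -> None:
    """Placeholder for the ~1 hour full recomputation of every model's scores."""
    out_dir.mkdir(exist_ok=True)
    # ... fetch raw results, aggregate per task/benchmark, write CSV/JSON files ...


def main() -> None:
    recompute_all_tables(CACHE_DIR)
    # Push the cached files to the Space; the app then only reads these files
    # at startup, so the Space itself loads near-instantly.
    HfApi().upload_folder(
        folder_path=str(CACHE_DIR),
        path_in_repo=CACHE_DIR.name,
        repo_id=SPACE_ID,
        repo_type="space",
        commit_message="Nightly leaderboard refresh",
    )


if __name__ == "__main__":
    main()
```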

It'd be nice to monitor it for a day or two before making it work on the main Space. For those couple of days, is it okay to pause any new commits to the leaderboard Space? I had to make a large number of refactors, and it would be a pain to resolve new conflicts.

Does Saturday/Sunday work for the switchover @Muennighoff? I'm trying to find a time that will cause the least impact if it goes down for a few hours during the transition, and I'm not sure when the most active usage of the Space is.

NOTE: this doesn't use the new mteb/results GitHub repo -- what is the status of that? Is that just MMTEB results?

@Muennighoff
Contributor Author

That's amazing! Your suggestion sounds good to me, and we can hold off on committing anything for a few days (also cc @tomaarsen). I'm not sure it would even go down, but any date is fine, I think.

For the mteb/results GitHub repo - I think we can start using it? We just need to move over all the result files from the mteb/results HF repo and then keep it synced, I think?
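A one-off migration along those lines could be roughly the sketch below, assuming the HF repo is a dataset repo and a local clone of the mteb/results GitHub repo already exists; the paths, repo type, and target folder layout are all assumptions:

```python
# migrate_results.py -- sketch of copying result files from the HF repo into
# a local clone of the mteb/results GitHub repo. Paths and repo type are assumed.
import shutil
import subprocess
from pathlib import Path

from huggingface_hub import snapshot_download

GITHUB_CHECKOUT = Path("results")  # local clone of the mteb/results GitHub repo

# Download a local copy of the Hugging Face repo (assumed to be a dataset repo).
hf_copy = snapshot_download(repo_id="mteb/results", repo_type="dataset")

# Copy the result files into the Git checkout and commit them.
shutil.copytree(hf_copy, GITHUB_CHECKOUT / "results", dirs_exist_ok=True)
subprocess.run(["git", "-C", str(GITHUB_CHECKOUT), "add", "results"], check=True)
subprocess.run(
    ["git", "-C", str(GITHUB_CHECKOUT), "commit", "-m", "Import results from the HF mteb/results repo"],
    check=True,
)
```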

@KennethEnevoldsen
Contributor

Yeah, I do think we should start using mteb/results. Instead of having multiple switchovers, might it not be ideal to add the interface update to MTEB as well?

Should we also add any updates on how to add models etc.?

@tomaarsen
Member

Seems good! I'll abstain from commits on the HF Leaderboard Space in the next few days.

  • Tom Aarsen
