
Add Voyage multilingual datasets #920

Open
Muennighoff opened this issue Jun 14, 2024 · 1 comment

@Muennighoff
Contributor

Lots of multilingual datasets are listed here: https://docs.google.com/spreadsheets/d/1qf0iYejG-9RgEEi13qB_SK_178-eNaeJDmSDNSj260A/edit?gid=1875159366#gid=1875159366 (from https://blog.voyageai.com/2024/06/10/voyage-multilingual-2-multilingual-embedding-model/). I imagine some of them are not in MTEB yet; would be great to have them 🙌

@KennethEnevoldsen
Contributor

KennethEnevoldsen commented Jun 17, 2024

I know some of these are already covered, and some of them I can't seem to find (dan_news_summ_test). Do we have more references on these?

For convenience, here is a list (I have not checked every entry):

FRENCH

(@imenelydiaker can you have a look at these)

GERMAN

  • german-court-decisions
  • germanrag
  • Biology_German_DHBW
  • german_legal_sentences
  • Dialogsum-german
  • german_seahorse_dataset_with_articles
  • health_care_german
  • German_Poems
  • German_Songs
  • job_listing_german_cleaned_bert
  • german_OpenOrca
  • german_press_releases_100

JAPANESE

  • mmarco-japanese-hard-negatives
  • japanese-conala
  • sakura_japanese_dataset
  • japan-law
  • llm-japanese-dataset_wikinews
  • fujiki_49k_japanese_dataset
  • JapaneseSummalization_task
  • japan_diet_q_and_a_sessions_20k

KOREAN

  • Korean-Human-Judgements
  • korean_dialog_summary
  • KoreanSummarizeAiHub
  • alpaca-korean
  • grade_school_math_korean
  • Korean_QA_gen_datasets
  • korean_rlhf_dataset
  • korean_law_open_data_precedents

SPANISH

  • mlsum-spanish-truncated-512
  • spanish-alpaca
  • colmbian_spanish_news
  • orca-math-word-problems-0_10002-spanish
  • guanaco-spanish-dataset
  • emotional_response_spanish_dataset
  • alpaca-spanish
  • Curated-Spanish

OTHER

  • bengali_summ_test
  • dan_news_summ_test
  • dutch_policy_qa_test
  • georgian_faq
  • greek_civics_qa
  • hungarian_summ
  • ilpost_test
  • norwegian_snl_summ_test
  • polish_summ_test
  • rojtvheadlines
  • russian_reviews
  • slovak_summ_test
  • swedish_swefaq
  • thai_summ_test
  • turkish_HistQuAD_test
  • urdu_summ_test
  • viet_quad_test
  • arabic_news
  • czech_summ
  • persian_qa
  • portuguese_xlsum_test

ENGLISH

  • OneSignal
  • PyTorch1024
  • 5GEdge
  • Cohere
  • Doordash
  • Healthforcalifornia
  • Langchain
  • Huffpostscience
  • Huffpostsports
  • gov_report
  • SummScreen
  • qasper
  • qasper_abstract_doc
  • LeetCodeCpp
  • LeetCodeJava
  • LeetCodePython
  • humaneval
  • mbpp
  • LegalQuAD
  • AILA_casedocs
  • AILA_statutes
  • rag-benchmark-finance-apple-10K-2022
  • financebench
  • TAT-QA
  • ConvFinQA
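
One way to triage a list like the one above against tasks that already exist is a fuzzy name comparison. The following is only a minimal sketch: `normalize`, `triage`, and the `existing_tasks` values are hypothetical placeholders, not real MTEB task names or part of any library API. A name match is also only a hint, since the same dataset may be published under a different name.

```python
import difflib
import re


def normalize(name: str) -> str:
    """Lowercase and drop non-alphanumerics so 'TAT-QA' and 'tat_qa' compare equal."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def triage(candidates, existing, cutoff=0.8):
    """Map each candidate name to its closest existing name, or None if nothing is close."""
    norm_existing = {normalize(e): e for e in existing}
    result = {}
    for cand in candidates:
        match = difflib.get_close_matches(
            normalize(cand), list(norm_existing), n=1, cutoff=cutoff
        )
        result[cand] = norm_existing[match[0]] if match else None
    return result


# Placeholder values for illustration only (assumption, not the real task registry).
existing_tasks = ["tat_qa", "Qasper", "ExampleTask"]
candidates = ["TAT-QA", "qasper", "dan_news_summ_test"]

report = triage(candidates, existing_tasks)
for name, hit in report.items():
    print(f"{name}: {'possibly covered by ' + hit if hit else 'no close match'}")
```

In practice the `existing_tasks` list would be filled from the benchmark's own task listing, and every "possibly covered" hit would still need a manual check of the underlying data source.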
