MSMARCOv2 is excessive large and takes too long to load #1022

Muennighoff · 2024-07-01T17:12:46Z

INFO:mteb.evaluation.MTEB:

Evaluating 1 tasks:

─────────────────────────────── Selected tasks ────────────────────────────────
Retrieval
- MSMARCOv2, s2p

INFO:mteb.evaluation.MTEB:

********************** Evaluating MSMARCOv2 **********************
INFO:mteb.evaluation.MTEB:Loading dataset for MSMARCOv2
INFO:mteb.abstasks.AbsTaskRetrieval:Loading Corpus...
INFO:mteb.abstasks.AbsTaskRetrieval:Loaded 138364198 TRAIN Documents.
INFO:mteb.abstasks.AbsTaskRetrieval:Doc Example: {'id': '00_0', 'title': '0-60 Times - 0-60 | 0 to 60 Times & 1/4 Mile Times | Zero to 60 Car Reviews', 'text': '0-60 Times - 0-60 | 0 to 60 Times & 1/4 Mile Times | Zero to 60 Car Reviews.'}
INFO:mteb.abstasks.AbsTaskRetrieval:Loading Queries...

Map: 0%| | 0/284212 [00:00<?, ? examples/s]
...
Map: 100%|██████████| 284212/284212 [00:05<00:00, 48068.59 examples/s]
INFO:mteb.abstasks.AbsTaskRetrieval:Loaded 277144 TRAIN Queries.
INFO:mteb.abstasks.AbsTaskRetrieval:Query Example: {'id': '121352', 'text': 'define extreme'}
INFO:mteb.abstasks.AbsTaskRetrieval:Loading Corpus...
INFO:mteb.abstasks.AbsTaskRetrieval:Loaded 138364198 DEV Documents.
INFO:mteb.abstasks.AbsTaskRetrieval:Doc Example: {'id': '00_0', 'title': '0-60 Times - 0-60 | 0 to 60 Times & 1/4 Mile Times | Zero to 60 Car Reviews', 'text': '0-60 Times - 0-60 | 0 to 60 Times & 1/4 Mile Times | Zero to 60 Car Reviews.'}
INFO:mteb.abstasks.AbsTaskRetrieval:Loading Queries...
Using the latest cached version of the dataset since mteb/msmarco-v2 couldn't be found on the Hugging Face Hub
WARNING:datasets.load:Using the latest cached version of the dataset since mteb/msmarco-v2 couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'queries' at /data/huggingface/datasets/mteb___msmarco-v2/queries/0.0.0/b1663124850d305ab7c470bb0548acf8e2e7ea43 (last modified on Sat Jun 29 22:47:03 2024).
WARNING:datasets.packaged_modules.cache.cache:Found the latest cached dataset configuration 'queries' at /data/huggingface/datasets/mteb___msmarco-v2/queries/0.0.0/b1663124850d305ab7c470bb0548acf8e2e7ea43 (last modified on Sat Jun 29 22:47:03 2024).
Using the latest cached version of the dataset since mteb/msmarco-v2 couldn't be found on the Hugging Face Hub
WARNING:datasets.load:Using the latest cached version of the dataset since mteb/msmarco-v2 couldn't be found on the Hugging Face Hub
ERROR:mteb.evaluation.MTEB:Error while evaluating MSMARCOv2: There are multiple 'mteb/msmarco-v2' configurations in the cache: corpus, queries, default
Please specify which configuration to reload from the cache, e.g.
load_dataset('mteb/msmarco-v2', 'corpus')
Traceback (most recent call last):
  File "/env/lib/conda/gritkto/bin/mteb", line 8, in <module>
    sys.exit(main())
  File "/data/niklas/mteb/mteb/cli.py", line 370, in main
    args.func(args)
  File "/data/niklas/mteb/mteb/cli.py", line 118, in run
    eval.run(
  File "/data/niklas/mteb/mteb/evaluation/MTEB.py", line 388, in run
    raise e
  File "/data/niklas/mteb/mteb/evaluation/MTEB.py", line 328, in run
    task.load_data(eval_splits=task_eval_splits, **kwargs)
  File "/data/niklas/mteb/mteb/abstasks/AbsTaskRetrieval.py", line 231, in load_data
    corpus, queries, qrels = HFDataLoader(
  File "/data/niklas/mteb/mteb/abstasks/AbsTaskRetrieval.py", line 96, in load
    self._load_qrels(split)
  File "/data/niklas/mteb/mteb/abstasks/AbsTaskRetrieval.py", line 175, in _load_qrels
    qrels_ds = load_dataset(
  File "/env/lib/conda/gritkto/lib/python3.10/site-packages/datasets/load.py", line 2592, in load_dataset
    builder_instance = load_dataset_builder(
  File "/env/lib/conda/gritkto/lib/python3.10/site-packages/datasets/load.py", line 2301, in load_dataset_builder
    builder_instance: DatasetBuilder = builder_cls(
  File "/env/lib/conda/gritkto/lib/python3.10/site-packages/datasets/packaged_modules/cache/cache.py", line 140, in __init__
    config_name, version, hash = _find_hash_in_cache(
  File "/env/lib/conda/gritkto/lib/python3.10/site-packages/datasets/packaged_modules/cache/cache.py", line 85, in _find_hash_in_cache
    raise ValueError(
ValueError: There are multiple 'mteb/msmarco-v2' configurations in the cache: corpus, queries, default
Please specify which configuration to reload from the cache, e.g.
load_dataset('mteb/msmarco-v2', 'corpus')
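
The `ValueError` above asks for an explicit configuration name. A minimal sketch of such a call follows, assuming the three configuration names listed in the error message and an installed `datasets` library. `load_config` and its `loader` parameter are hypothetical names introduced for illustration and are not part of mteb:

```python
from typing import Any, Callable, Optional

REPO_ID = "mteb/msmarco-v2"
# Configuration names copied from the error message above.
CACHED_CONFIGS = ("corpus", "queries", "default")


def load_config(config: str, loader: Optional[Callable[..., Any]] = None) -> Any:
    """Load one named configuration instead of letting `datasets` guess."""
    if config not in CACHED_CONFIGS:
        raise ValueError(
            f"unknown configuration {config!r}; expected one of {CACHED_CONFIGS}"
        )
    if loader is None:
        # Imported lazily so the helper can be exercised without `datasets`.
        from datasets import load_dataset

        loader = load_dataset
    # Passing the configuration name as the second positional argument is
    # the form the error message itself suggests.
    return loader(REPO_ID, config)
```

Calling `load_config("queries")` would then reload the cached `queries` configuration. Whether the qrels live under `default` is an assumption that the thread does not confirm.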

KennethEnevoldsen · 2024-07-01T17:14:44Z

Adding a reference to #1014 here.

KennethEnevoldsen · 2024-07-03T15:07:57Z

@Muennighoff I see that we have the following splits in MsMarcov2:

eval_splits=["train", "dev", "dev2"],

Is it worth reducing it to just "dev2"?
Related to #992
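
The log in the first comment shows the full corpus being loaded once for `train` and again for `dev`. Assuming that pattern holds for every entry in `eval_splits` (the run failed before reaching `dev2`, so this is an assumption), a rough count of the documents loaded per run can be sketched as follows. `documents_loaded` is a hypothetical helper, not an mteb function:

```python
from typing import Sequence

# Corpus size reported in the log ("Loaded 138364198 TRAIN Documents").
CORPUS_SIZE = 138_364_198


def documents_loaded(eval_splits: Sequence[str]) -> int:
    """Documents read if the whole corpus is reloaded once per split."""
    return CORPUS_SIZE * len(eval_splits)


print(documents_loaded(["train", "dev", "dev2"]))  # 415092594
print(documents_loaded(["dev2"]))                  # 138364198
```

Under that assumption, keeping only `dev2` cuts corpus loading to a third, but a single pass over roughly 138 million documents remains.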

KennethEnevoldsen · 2024-07-03T15:37:18Z

@Muennighoff I think the easiest solution for now is just to not include MsMarcov2 (it is just too fucking huge). WDYT?

Muennighoff · 2024-07-03T17:41:41Z

I'm fine with excluding it! We can also downsample it like some of the other retrieval datasets.
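
One way such downsampling could look, as a sketch and not necessarily the procedure used for the other retrieval datasets: sample a subset of judged queries, keep every document judged for them, and pad the corpus with random distractors. The function name, its parameters, and the dictionary layout (`corpus[doc_id]`, `queries[query_id]`, `qrels[query_id][doc_id]`) are assumptions made for illustration:

```python
import random
from typing import Dict, Tuple

Corpus = Dict[str, Dict[str, str]]
Queries = Dict[str, str]
Qrels = Dict[str, Dict[str, int]]


def downsample_retrieval(
    corpus: Corpus,
    queries: Queries,
    qrels: Qrels,
    n_queries: int,
    n_docs: int,
    seed: int = 42,
) -> Tuple[Corpus, Queries, Qrels]:
    rng = random.Random(seed)
    # Only queries with at least one judgment are useful for evaluation.
    judged = sorted(q for q in queries if qrels.get(q))
    kept_queries = rng.sample(judged, min(n_queries, len(judged)))
    # Always keep the documents judged for the sampled queries.
    kept_docs = {d for q in kept_queries for d in qrels[q] if d in corpus}
    # Pad with random distractors up to n_docs. If the judged documents
    # alone exceed n_docs, they are all kept and nothing is added.
    others = sorted(d for d in corpus if d not in kept_docs)
    n_extra = max(0, min(n_docs - len(kept_docs), len(others)))
    kept_docs.update(rng.sample(others, n_extra))
    return (
        {d: corpus[d] for d in kept_docs},
        {q: queries[q] for q in kept_queries},
        {
            q: {d: s for d, s in qrels[q].items() if d in kept_docs}
            for q in kept_queries
        },
    )
```

Sorting the identifiers keeps the sample reproducible for a given seed. At the scale of MSMARCOv2, this step would need a streaming variant instead of in-memory dictionaries.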

rich-junwang · 2024-07-04T15:59:45Z

I have the same issue with the Touche2020 dataset.

KennethEnevoldsen · 2024-07-05T07:20:24Z

@rich-junwang when running MSMARCOv2 I never got the issue above (though it was still too big). Have you tried resetting the cache? If it keeps being an issue, I would recommend opening a separate issue.
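
Resetting the cache for a single dataset can be sketched as below. This assumes the cache layout visible in the log above, where `mteb/msmarco-v2` is stored under a directory named `mteb___msmarco-v2`. `clear_cached_dataset` is a hypothetical helper, and the cache root must be adjusted to the local setup:

```python
import shutil
from pathlib import Path


def clear_cached_dataset(repo_id: str, cache_root: Path) -> bool:
    """Delete the cached copy of one dataset; return True if it existed."""
    # "namespace/name" is stored as "namespace___name", as in the log path
    # /data/huggingface/datasets/mteb___msmarco-v2/...
    cache_dir = cache_root / repo_id.replace("/", "___")
    if cache_dir.is_dir():
        shutil.rmtree(cache_dir)
        return True
    return False


# Example call (path taken from the log; adjust as needed):
# clear_cached_dataset("mteb/msmarco-v2", Path("/data/huggingface/datasets"))
```

Passing `download_mode="force_redownload"` to `load_dataset` is another way to bypass a stale cache, at the cost of downloading the data again.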

KennethEnevoldsen mentioned this issue Jul 3, 2024

Avoid try except when raise_errors = True #1035

Open

KennethEnevoldsen changed the title ~~MSMARCOv2 ds loading fails~~ MSMARCOv2 is excessive large and takes too long to load Jul 5, 2024
