MSMARCOv2 is excessively large and takes too long to load #1022
Comments
Adding a ref here to #1014.
@Muennighoff I see that we have the following splits in MsMarcov2:
Is it worth reducing it to just "dev2"?
@Muennighoff I think the easiest solution for now is just to not include MsMarcov2 (it is just too fucking huge). WDYT?
I'm fine with excluding it! We can also downsample it like some of the other retrieval datasets.
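For reference, a minimal sketch of what downsampling a retrieval dataset could look like. This is an illustration only, not how mteb or the other retrieval datasets were actually downsampled. The function name `downsample_retrieval`, its parameters, and the dict layouts are all assumptions made for the example. The idea is to keep a sample of judged queries, keep every document those queries are judged against, and pad the corpus with a fixed number of random distractors.

```python
import random


def downsample_retrieval(corpus, queries, qrels, n_queries, n_extra_docs, seed=42):
    """Hypothetical helper, not part of mteb.

    corpus:  {doc_id: doc}
    queries: {query_id: text}
    qrels:   {query_id: {doc_id: relevance}}
    """
    rng = random.Random(seed)

    # Only queries with at least one judgment are useful for evaluation.
    judged = sorted(qid for qid in queries if qrels.get(qid))
    kept_qids = rng.sample(judged, min(n_queries, len(judged)))

    small_queries = {qid: queries[qid] for qid in kept_qids}
    small_qrels = {qid: dict(qrels[qid]) for qid in kept_qids}

    # Every judged document of a kept query must stay in the corpus.
    relevant = {
        doc_id
        for rels in small_qrels.values()
        for doc_id in rels
        if doc_id in corpus
    }

    # Pad with random distractors so retrieval is not trivially easy.
    others = sorted(doc_id for doc_id in corpus if doc_id not in relevant)
    distractors = rng.sample(others, min(n_extra_docs, len(others)))

    kept_docs = relevant.union(distractors)
    small_corpus = {doc_id: corpus[doc_id] for doc_id in corpus if doc_id in kept_docs}
    return small_corpus, small_queries, small_qrels
```

Scores on a corpus reduced this way are not comparable to scores on the full corpus, so the reduced dataset would need to be treated as a separate task.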
I have the same issue with the dataset.
@rich-junwang when running Msmarcov2 I never got the issue above (though it was still too big). Have you tried resetting the cache? If it keeps being an issue, I would recommend opening a separate issue.
INFO:mteb.evaluation.MTEB:
Evaluating 1 tasks:
─────────────────────────────── Selected tasks ────────────────────────────────
Retrieval
- MSMARCOv2, s2p
INFO:mteb.evaluation.MTEB:
********************** Evaluating MSMARCOv2 **********************
INFO:mteb.evaluation.MTEB:Loading dataset for MSMARCOv2
INFO:mteb.abstasks.AbsTaskRetrieval:Loading Corpus...
INFO:mteb.abstasks.AbsTaskRetrieval:Loaded 138364198 TRAIN Documents.
INFO:mteb.abstasks.AbsTaskRetrieval:Doc Example: {'id': '00_0', 'title': '0-60 Times - 0-60 | 0 to 60 Times & 1/4 Mile Times | Zero to 60 Car Reviews', 'text': '0-60 Times - 0-60 | 0 to 60 Times & 1/4 Mile Times | Zero to 60 Car Reviews.'}
INFO:mteb.abstasks.AbsTaskRetrieval:Loading Queries...
Map: 100%|██████████| 284212/284212 [00:05<00:00, 48068.59 examples/s]
INFO:mteb.abstasks.AbsTaskRetrieval:Loaded 277144 TRAIN Queries.
INFO:mteb.abstasks.AbsTaskRetrieval:Query Example: {'id': '121352', 'text': 'define extreme'}
INFO:mteb.abstasks.AbsTaskRetrieval:Loading Corpus...
INFO:mteb.abstasks.AbsTaskRetrieval:Loaded 138364198 DEV Documents.
INFO:mteb.abstasks.AbsTaskRetrieval:Doc Example: {'id': '00_0', 'title': '0-60 Times - 0-60 | 0 to 60 Times & 1/4 Mile Times | Zero to 60 Car Reviews', 'text': '0-60 Times - 0-60 | 0 to 60 Times & 1/4 Mile Times | Zero to 60 Car Reviews.'}
INFO:mteb.abstasks.AbsTaskRetrieval:Loading Queries...
Using the latest cached version of the dataset since mteb/msmarco-v2 couldn't be found on the Hugging Face Hub
WARNING:datasets.load:Using the latest cached version of the dataset since mteb/msmarco-v2 couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'queries' at /data/huggingface/datasets/mteb___msmarco-v2/queries/0.0.0/b1663124850d305ab7c470bb0548acf8e2e7ea43 (last modified on Sat Jun 29 22:47:03 2024).
WARNING:datasets.packaged_modules.cache.cache:Found the latest cached dataset configuration 'queries' at /data/huggingface/datasets/mteb___msmarco-v2/queries/0.0.0/b1663124850d305ab7c470bb0548acf8e2e7ea43 (last modified on Sat Jun 29 22:47:03 2024).
Using the latest cached version of the dataset since mteb/msmarco-v2 couldn't be found on the Hugging Face Hub
WARNING:datasets.load:Using the latest cached version of the dataset since mteb/msmarco-v2 couldn't be found on the Hugging Face Hub
ERROR:mteb.evaluation.MTEB:Error while evaluating MSMARCOv2: There are multiple 'mteb/msmarco-v2' configurations in the cache: corpus, queries, default
Please specify which configuration to reload from the cache, e.g.
load_dataset('mteb/msmarco-v2', 'corpus')
Traceback (most recent call last):
File "/env/lib/conda/gritkto/bin/mteb", line 8, in <module>
sys.exit(main())
File "/data/niklas/mteb/mteb/cli.py", line 370, in main
args.func(args)
File "/data/niklas/mteb/mteb/cli.py", line 118, in run
eval.run(
File "/data/niklas/mteb/mteb/evaluation/MTEB.py", line 388, in run
raise e
File "/data/niklas/mteb/mteb/evaluation/MTEB.py", line 328, in run
task.load_data(eval_splits=task_eval_splits, **kwargs)
File "/data/niklas/mteb/mteb/abstasks/AbsTaskRetrieval.py", line 231, in load_data
corpus, queries, qrels = HFDataLoader(
File "/data/niklas/mteb/mteb/abstasks/AbsTaskRetrieval.py", line 96, in load
self._load_qrels(split)
File "/data/niklas/mteb/mteb/abstasks/AbsTaskRetrieval.py", line 175, in _load_qrels
qrels_ds = load_dataset(
File "/env/lib/conda/gritkto/lib/python3.10/site-packages/datasets/load.py", line 2592, in load_dataset
builder_instance = load_dataset_builder(
File "/env/lib/conda/gritkto/lib/python3.10/site-packages/datasets/load.py", line 2301, in load_dataset_builder
builder_instance: DatasetBuilder = builder_cls(
File "/env/lib/conda/gritkto/lib/python3.10/site-packages/datasets/packaged_modules/cache/cache.py", line 140, in __init__
config_name, version, hash = _find_hash_in_cache(
File "/env/lib/conda/gritkto/lib/python3.10/site-packages/datasets/packaged_modules/cache/cache.py", line 85, in _find_hash_in_cache
raise ValueError(
ValueError: There are multiple 'mteb/msmarco-v2' configurations in the cache: corpus, queries, default
Please specify which configuration to reload from the cache, e.g.
load_dataset('mteb/msmarco-v2', 'corpus')
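To check which configurations are sitting in the local cache before resetting it, something like the sketch below may help. It assumes the cache layout visible in the log above (`<cache>/mteb___msmarco-v2/<config>/<version>/<hash>`), which is an implementation detail of `datasets` and may differ between versions. The helper name `cached_configs` is made up for this example.

```python
from pathlib import Path


def cached_configs(datasets_cache, repo_id):
    """Hypothetical helper: list config directories cached for a Hub dataset.

    Assumes the layout seen in the log above, where 'mteb/msmarco-v2' is
    stored under '<datasets_cache>/mteb___msmarco-v2/<config>/...'.
    """
    dataset_dir = Path(datasets_cache) / repo_id.replace("/", "___")
    if not dataset_dir.is_dir():
        return []
    # Each immediate subdirectory corresponds to one cached configuration.
    return sorted(p.name for p in dataset_dir.iterdir() if p.is_dir())
```

With the cache from the log, `cached_configs("/data/huggingface/datasets", "mteb/msmarco-v2")` would be expected to list the configurations named in the error. When more than one is present and the Hub is unreachable, the error message itself suggests passing the configuration name explicitly, e.g. `load_dataset('mteb/msmarco-v2', 'corpus')`.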