MSMARCOv2 is excessive large and takes too long to load #1022

Open
Muennighoff opened this issue Jul 1, 2024 · 6 comments
Comments

@Muennighoff
Contributor

INFO:mteb.evaluation.MTEB:

Evaluating 1 tasks:

─────────────────────────────── Selected tasks ────────────────────────────────
Retrieval
- MSMARCOv2, s2p

INFO:mteb.evaluation.MTEB:

********************** Evaluating MSMARCOv2 **********************
INFO:mteb.evaluation.MTEB:Loading dataset for MSMARCOv2
INFO:mteb.abstasks.AbsTaskRetrieval:Loading Corpus...
INFO:mteb.abstasks.AbsTaskRetrieval:Loaded 138364198 TRAIN Documents.
INFO:mteb.abstasks.AbsTaskRetrieval:Doc Example: {'id': '00_0', 'title': '0-60 Times - 0-60 | 0 to 60 Times & 1/4 Mile Times | Zero to 60 Car Reviews', 'text': '0-60 Times - 0-60 | 0 to 60 Times & 1/4 Mile Times | Zero to 60 Car Reviews.'}
INFO:mteb.abstasks.AbsTaskRetrieval:Loading Queries...

Map: 100%|██████████| 284212/284212 [00:05<00:00, 48068.59 examples/s]
INFO:mteb.abstasks.AbsTaskRetrieval:Loaded 277144 TRAIN Queries.
INFO:mteb.abstasks.AbsTaskRetrieval:Query Example: {'id': '121352', 'text': 'define extreme'}
INFO:mteb.abstasks.AbsTaskRetrieval:Loading Corpus...
INFO:mteb.abstasks.AbsTaskRetrieval:Loaded 138364198 DEV Documents.
INFO:mteb.abstasks.AbsTaskRetrieval:Doc Example: {'id': '00_0', 'title': '0-60 Times - 0-60 | 0 to 60 Times & 1/4 Mile Times | Zero to 60 Car Reviews', 'text': '0-60 Times - 0-60 | 0 to 60 Times & 1/4 Mile Times | Zero to 60 Car Reviews.'}
INFO:mteb.abstasks.AbsTaskRetrieval:Loading Queries...
Using the latest cached version of the dataset since mteb/msmarco-v2 couldn't be found on the Hugging Face Hub
WARNING:datasets.load:Using the latest cached version of the dataset since mteb/msmarco-v2 couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'queries' at /data/huggingface/datasets/mteb___msmarco-v2/queries/0.0.0/b1663124850d305ab7c470bb0548acf8e2e7ea43 (last modified on Sat Jun 29 22:47:03 2024).
WARNING:datasets.packaged_modules.cache.cache:Found the latest cached dataset configuration 'queries' at /data/huggingface/datasets/mteb___msmarco-v2/queries/0.0.0/b1663124850d305ab7c470bb0548acf8e2e7ea43 (last modified on Sat Jun 29 22:47:03 2024).
Using the latest cached version of the dataset since mteb/msmarco-v2 couldn't be found on the Hugging Face Hub
WARNING:datasets.load:Using the latest cached version of the dataset since mteb/msmarco-v2 couldn't be found on the Hugging Face Hub
ERROR:mteb.evaluation.MTEB:Error while evaluating MSMARCOv2: There are multiple 'mteb/msmarco-v2' configurations in the cache: corpus, queries, default
Please specify which configuration to reload from the cache, e.g.
load_dataset('mteb/msmarco-v2', 'corpus')
Traceback (most recent call last):
  File "/env/lib/conda/gritkto/bin/mteb", line 8, in <module>
    sys.exit(main())
  File "/data/niklas/mteb/mteb/cli.py", line 370, in main
    args.func(args)
  File "/data/niklas/mteb/mteb/cli.py", line 118, in run
    eval.run(
  File "/data/niklas/mteb/mteb/evaluation/MTEB.py", line 388, in run
    raise e
  File "/data/niklas/mteb/mteb/evaluation/MTEB.py", line 328, in run
    task.load_data(eval_splits=task_eval_splits, **kwargs)
  File "/data/niklas/mteb/mteb/abstasks/AbsTaskRetrieval.py", line 231, in load_data
    corpus, queries, qrels = HFDataLoader(
  File "/data/niklas/mteb/mteb/abstasks/AbsTaskRetrieval.py", line 96, in load
    self._load_qrels(split)
  File "/data/niklas/mteb/mteb/abstasks/AbsTaskRetrieval.py", line 175, in _load_qrels
    qrels_ds = load_dataset(
  File "/env/lib/conda/gritkto/lib/python3.10/site-packages/datasets/load.py", line 2592, in load_dataset
    builder_instance = load_dataset_builder(
  File "/env/lib/conda/gritkto/lib/python3.10/site-packages/datasets/load.py", line 2301, in load_dataset_builder
    builder_instance: DatasetBuilder = builder_cls(
  File "/env/lib/conda/gritkto/lib/python3.10/site-packages/datasets/packaged_modules/cache/cache.py", line 140, in __init__
    config_name, version, hash = _find_hash_in_cache(
  File "/env/lib/conda/gritkto/lib/python3.10/site-packages/datasets/packaged_modules/cache/cache.py", line 85, in _find_hash_in_cache
    raise ValueError(
ValueError: There are multiple 'mteb/msmarco-v2' configurations in the cache: corpus, queries, default
Please specify which configuration to reload from the cache, e.g.
load_dataset('mteb/msmarco-v2', 'corpus')
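
The error message itself points at a possible workaround: name the configuration explicitly instead of letting `datasets` pick one from the cache. Below is a minimal sketch, assuming the three configuration names listed in the error (`corpus`, `queries`, `default`) and assuming the qrels live under `default`. `load_parts`, `CONFIG_FOR_PART`, and the `loader` parameter are hypothetical helpers for illustration, not part of mteb.

```python
from typing import Any, Callable, Dict, Optional

# Configuration names taken from the error message above. Mapping the qrels
# to "default" is an assumption; the log does not confirm it.
CONFIG_FOR_PART = {"corpus": "corpus", "queries": "queries", "qrels": "default"}


def load_parts(
    path: str = "mteb/msmarco-v2",
    loader: Optional[Callable[..., Any]] = None,
) -> Dict[str, Any]:
    """Load each part of the dataset with an explicit configuration name.

    `loader` is injectable so the call pattern can be checked without
    downloading anything; by default it is `datasets.load_dataset`.
    """
    if loader is None:
        from datasets import load_dataset  # imported lazily on purpose

        loader = load_dataset
    # Always pass the configuration name as the second positional argument.
    return {part: loader(path, config) for part, config in CONFIG_FOR_PART.items()}
```

Passing the name explicitly avoids the ambiguity when several configurations of the same dataset sit in the cache, which is the situation the `ValueError` describes.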

@KennethEnevoldsen
Contributor

Adding a reference here to #1014.

@KennethEnevoldsen
Contributor

KennethEnevoldsen commented Jul 3, 2024

@Muennighoff I see that we have the following splits in MSMARCOv2:

eval_splits=["train", "dev", "dev2"],

Is it worth reducing it to just "dev2"?
Related to #992

@KennethEnevoldsen
Contributor

@Muennighoff I think the easiest solution for now is just to not include MSMARCOv2 (it is just too fucking huge). WDYT?

@Muennighoff
Contributor Author

I'm fine with excluding it! We can also downsample it like some of the other retrieval datasets.
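
One common way to downsample a retrieval dataset is to keep every document that is judged for the evaluated queries and fill the rest of the budget with a random sample of unjudged documents. Below is a minimal sketch on plain dicts, assuming a BEIR-style layout (`corpus: doc_id -> doc`, `qrels: query_id -> {doc_id: score}`). `downsample_corpus` and its parameters are hypothetical, and this is not necessarily the procedure used for the other downsampled datasets in mteb.

```python
import random
from typing import Dict


def downsample_corpus(
    corpus: Dict[str, dict],
    qrels: Dict[str, Dict[str, int]],
    max_docs: int,
    seed: int = 42,
) -> Dict[str, dict]:
    """Keep all judged documents, then pad with random unjudged ones."""
    # Documents that appear in any relevance judgment must survive,
    # otherwise the retrieval scores are no longer meaningful.
    judged = {doc_id for rels in qrels.values() for doc_id in rels if doc_id in corpus}
    # Sort before sampling so the result is reproducible for a given seed.
    unjudged = sorted(set(corpus) - judged)
    n_extra = max(0, min(len(unjudged), max_docs - len(judged)))
    extra = random.Random(seed).sample(unjudged, n_extra)
    keep = judged | set(extra)
    # Preserve the original corpus order.
    return {doc_id: doc for doc_id, doc in corpus.items() if doc_id in keep}
```

Note that if there are more judged documents than `max_docs`, this sketch keeps all of them and exceeds the budget rather than dropping judged documents.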

@rich-junwang

I have the same issue with the Touche2020 dataset.

KennethEnevoldsen changed the title from "MSMARCOv2 ds loading fails" to "MSMARCOv2 is excessive large and takes too long to load" on Jul 5, 2024
@KennethEnevoldsen
Contributor

KennethEnevoldsen commented Jul 5, 2024

@rich-junwang when running MSMARCOv2 I never got the issue above (though it was still too big). Have you tried resetting the cache? If it keeps being an issue, I would recommend opening a separate issue.
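
Before resetting the cache, it can help to see which configurations are actually cached. Below is a minimal sketch that lists them, assuming the directory layout visible in the log above (`<cache>/<namespace>___<name>/<config>/<version>/<hash>`). `cached_configs` is a hypothetical helper, and the path in the usage comment is the one from the log.

```python
from pathlib import Path
from typing import List


def cached_configs(cache_dir: str, repo_id: str) -> List[str]:
    """Return the names of the configuration directories cached for `repo_id`."""
    # In the log, "mteb/msmarco-v2" is stored under "mteb___msmarco-v2".
    dataset_dir = Path(cache_dir) / repo_id.replace("/", "___")
    if not dataset_dir.is_dir():
        return []
    return sorted(p.name for p in dataset_dir.iterdir() if p.is_dir())


# Usage (path taken from the log; adjust to your own cache location):
# cached_configs("/data/huggingface/datasets", "mteb/msmarco-v2")
```

If several configurations show up, that matches the situation the `ValueError` above complains about. Removing that dataset directory should make `datasets` rebuild it on the next run.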
