
Fix train data script #156

Open · natolambert wants to merge 2 commits into main
Conversation

@natolambert (Contributor) commented Apr 29, 2024

Closes #153. This makes it so a token isn't needed (you can use huggingface-cli instead; see the sketch after the snippet below). I tested this while retraining OLMo 1.7.


echo "Downloading WizardLM dataset..."
# original data removed upstream:
# wget -P data/raw_train/wizardlm/ https://huggingface.co/datasets/WizardLM/WizardLM_evol_instruct_V2_196k/resolve/main/WizardLM_evol_instruct_V2_143k.json
wget -P data/raw_train/wizardlm/ https://huggingface.co/datasets/Leon-Leee/Wizardlm_Evol_Instruct_v2_196K_backuped/resolve/main/data/train-00000-of-00001-004cd1ba9dc05e6c.parquet
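
As a quick sanity check (a sketch, not part of the PR): after a one-time huggingface-cli login, the token is cached locally and the datasets library picks it up automatically, so no token argument or environment variable is needed:

# Minimal sketch: relies on the token cached by `huggingface-cli login`;
# no token= argument or HF_TOKEN environment variable is required.
from datasets import load_dataset

ds = load_dataset("allenai/tulu-v2-sft-mixture", split="train")
print(len(ds))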
Collaborator:
Have you verified this is the same data?

@hamishivi (Collaborator)

LGTM so long as you've verified the data is the same as our released mixture.

@natolambert (Contributor, Author)

Code for comparing the two datasets:

from datasets import load_dataset

def load_and_compare_datasets(dataset_name1, dataset_name2, split='train'):
    # Load both datasets from the Hugging Face Hub
    dataset1 = load_dataset(dataset_name1, split=split)
    dataset2 = load_dataset(dataset_name2, split=split)

    # Index each dataset's rows by their 'id' field
    dict1 = {row['id']: row for row in dataset1}
    dict2 = {row['id']: row for row in dataset2}

    # Find ids that appear in only one of the two datasets
    unique_ids_in_set1 = set(dict1.keys()) - set(dict2.keys())
    unique_ids_in_set2 = set(dict2.keys()) - set(dict1.keys())

    # Print the rows that have no counterpart in the other dataset
    print("Entries unique to dataset 1:")
    for row_id in unique_ids_in_set1:
        print(dict1[row_id])

    print("\nEntries unique to dataset 2:")
    for row_id in unique_ids_in_set2:
        print(dict2[row_id])

# Example usage with dataset names and optional split specification
load_and_compare_datasets('ai2-adapt-dev/tulu2-tmp', 'allenai/tulu-v2-sft-mixture', split='train')
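
A possible extension of the script above (a sketch, not in the original comment): rows present in both datasets can still differ in content, so a hypothetical helper like this could diff the overlapping ids as well:

# Hypothetical helper (not part of the PR): diff rows that share an id,
# since identical id sets can still hide content-level changes.
def compare_shared_rows(dict1, dict2):
    for row_id in sorted(set(dict1) & set(dict2)):
        if dict1[row_id] != dict2[row_id]:
            print(f"Row {row_id} differs between the two datasets")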

@natolambert (Contributor, Author)

There are minor differences, mostly from WizardLM taking the data down, and maybe slight changes to the dataset-processing script relative to the v2 version on Hugging Face. New backups have been made.

@hamishivi (Collaborator)

So now the script points to this backup WizardLM version, which yields some small differences when you run the train prep script?

@natolambert (Contributor, Author)

@hamishivi, yes.
The backup was taken from your NFS raw train files. I can get the date it was last modified, but I think multiple things have potentially changed since we uploaded the Tulu v2 SFT mixture.

@cbfcbf commented Jun 30, 2024

I found a typo at line 86 of reformat_dataset.py: "if num_few_shot_examples > 0" should be "if num_zero_shot_examples > 0".
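
For illustration, a minimal sketch of the corrected branch (the surrounding context in reformat_dataset.py is assumed, not quoted from the file):

# Corrected guard; the previous code mistakenly tested num_few_shot_examples here.
if num_zero_shot_examples > 0:
    ...  # handle the zero-shot examples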

Merging this pull request may close: WizardLM Data Gone (prep data script error) (#153)