Finetune script #11
Comments
Here's how I just did it. I'd be curious to see if there's anything model-specific I can use to make the training go faster. I ran into a bunch of issues, see comments. I had to use a single letter for the padding (N instead of [PAD]), and I had to split my FASTAs into pieces of 500 bp or the A100 would run out of memory. I'm using pieces of fish mitogenome FASTAs; not prokaryotes, but of prokaryotic origin :)
Load the model via from_pretrained('./finetuned.model'). It's still finetuning and a bit slow; it will take about 50 hours for 304,354 pieces of DNA of length 500 bp that come from 64,573 mitogenomes and 12S/16S pieces (total size: 140Mb)
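The splitting step described above can be sketched roughly as follows. This is not the script from the comment, only a minimal stdlib illustration; the function names `read_fasta` and `chunk_sequence` are made up for this example, and the last piece of each record is simply left shorter, on the assumption that the tokenizer pads it (with N as the pad token, as mentioned above).

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, parts = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new record starts; emit the previous one first.
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        elif line:
            parts.append(line)
    if header is not None:
        yield header, "".join(parts)


def chunk_sequence(seq, size=500):
    """Split a sequence into consecutive pieces of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    # The final piece may be shorter than `size`; padding is left to the tokenizer.
    return [seq[start:start + size] for start in range(0, len(seq), size)]
```

With an open FASTA file handle `fh`, something like `[piece for _, seq in read_fasta(fh) for piece in chunk_sequence(seq, 500)]` would give the list of 500 bp pieces to tokenize.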
@philippbayer Nice job! I also found that there are no special tokens in this tokenizer, but the paper says they used EOS tokens to split individual DNA sequences.
@philippbayer I tried bf16 and it seems to work. Another option might be freezing some layers. I am using human CDS sequences to finetune the last layer. I'm not sure whether Hyena behaves like Transformers, where finetuning only some layers can work. But this allows me to finetune:
```python
import os

import torch
import transformers
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, DataCollatorForLanguageModeling
from transformers import AutoConfig, TrainingArguments, Trainer

os.environ["CUDA_VISIBLE_DEVICES"] = "0"
os.environ["WANDB_DISABLED"] = "true"

model_name = 'togethercomputer/evo-1-8k-base'

model_config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)
model_config.use_cache = False

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    config=model_config,
    trust_remote_code=True,
    device_map={"": 0},
    torch_dtype=torch.float16
)

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = "X"

# freeze most parameters
for p in model.parameters():
    p.requires_grad = False

# unfreeze only the last block
for p in model.backbone.blocks[-1].parameters():
    p.requires_grad = True

dataset = load_dataset("gonzalobenegas/human-genome-cds")

# mlm=False: causal language modeling (labels are the input ids)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

def preprocess_function(sample):
    return tokenizer(sample['seq'], padding="longest", truncation=True, max_length=3000)

tokenized_ds = dataset.map(
    preprocess_function,
    batched=True,
    num_proc=12,
)

training_args = TrainingArguments(
    output_dir="./evo_results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    gradient_accumulation_steps=2,
    per_device_train_batch_size=4,
    warmup_steps=10,
    max_steps=100,  # only a demo
    logging_steps=10,
    eval_steps=100,  # note: only takes effect with evaluation_strategy="steps"
    bf16=True
    # fp16=True, # This didn't work.
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["test"],
    data_collator=data_collator,
)

trainer.train()
```
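To check what the freezing loop in the script above actually leaves trainable, a small helper like the following might be useful. It is only a sketch: `count_parameters` is a made-up name, and it relies on nothing beyond the standard `parameters()`, `numel()` and `requires_grad` attributes of a PyTorch module and its parameters, so it does not import torch itself.

```python
def count_parameters(model):
    """Return (trainable, total) parameter counts for a torch.nn.Module-like object."""
    trainable = 0
    total = 0
    for p in model.parameters():
        n = p.numel()  # number of scalar elements in this tensor
        total += n
        if p.requires_grad:
            trainable += n
    return trainable, total

# Hypothetical usage after the freezing loops:
#   trainable, total = count_parameters(model)
#   print(f"trainable: {trainable:,} of {total:,}")
```

If the trainable count comes back as zero, or equal to the total, the freezing did not do what was intended.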
Thanks!!! The freezing is great: it cut my finetuning time down from ~50 hours to ~14 hours, which makes sense considering I was retraining the entire thing :) It also lets me increase my … I did run out of space overnight, so don't forget to set your …
fp16 is very hardware-dependent. I just always turn it on, as it promises faster training and less memory usage (https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/); it might just be that your GPU doesn't support it.
Even with most parameters frozen, the 'full' sequences, which include entire mitogenomes, still make my A100 run out of memory. I haven't run any experiments on what the longest possible length is, but I can use 5000 bp pieces without crashing.
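One way to run the "longest possible length" experiment mentioned above would be a binary search over piece lengths. The sketch below is an assumption-laden illustration, not something anyone in this thread ran: `find_max_length` is a made-up helper, and `fits` stands for a hypothetical user-supplied function that tries one training step at a given length and returns False if the GPU runs out of memory. It also assumes that if a length fits, every shorter length fits too.

```python
def find_max_length(fits, low, high):
    """Largest length in [low, high] for which fits(length) is True, or None.

    Assumes monotonicity: if a length fits, every shorter length fits too.
    """
    if not fits(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if fits(mid):
            low = mid       # mid works, search the upper half
        else:
            high = mid - 1  # mid fails, search below it
    return low
```

Starting with `low=5000` (reported above to work) and `high` set to the longest full sequence in the dataset would need only a handful of trial steps.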
This is all great. In the future (if there is sufficient interest) we are also going to support finetuning of Evo on the Together API, which hopefully will make it a lot easier to perform full model finetunes at full context (131k and beyond). Support for the architecture in other open frameworks for LLM finetuning is also planned.
I'd definitely be interested in that!!
interested!
Interestingly, freezing everything but the last layer, as in @JinyuanSun's script above, doesn't work well with my data: as you can see, the training loss is identical. Doing it 'my' way works better, but takes about three times as long (so far...). I'm sure there's some optimal middle way where some, but not all, layers are finetuned.
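The "middle way" could be explored by making the number of unfrozen blocks a parameter. The helper below is a hypothetical sketch (`unfreeze_last_blocks` is not part of any library); it generalizes the freezing loop from @JinyuanSun's script and assumes the blocks are reachable as `model.backbone.blocks`, as in that script.

```python
def unfreeze_last_blocks(model, blocks, k):
    """Freeze every parameter of `model`, then unfreeze the last `k` entries of `blocks`.

    Returns the number of parameter tensors left trainable.
    """
    for p in model.parameters():
        p.requires_grad = False
    if k <= 0:
        # Guard: blocks[-0:] would select every block, not none.
        return 0
    unfrozen = 0
    for block in list(blocks)[-k:]:
        for p in block.parameters():
            p.requires_grad = True
            unfrozen += 1
    return unfrozen

# Hypothetical usage, assuming the layout from the script above:
#   unfreeze_last_blocks(model, model.backbone.blocks, k=4)
```

Trying a few values of `k` and comparing training loss against wall-clock time would show where the trade-off sits for a given dataset.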
Has anyone explored FSDP finetuning on multiple GPUs? I got the error "ValueError: Must flatten tensors with uniform dtype but got torch.bfloat16 and torch.float32". This seems to be because FSDP requires a uniform tensor dtype, but StripedHyena is a mixed-precision model?
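Before attempting a fix for the dtype error, it may help to see which parameters are stored in which dtype. The following is a diagnostic sketch only and does not resolve the FSDP error; the helper names are made up, and it relies only on the standard `named_parameters()` method and `dtype` attribute of PyTorch modules and tensors.

```python
from collections import Counter


def summarize_param_dtypes(model):
    """Count parameter tensors per dtype, e.g. to spot mixed bfloat16/float32 weights."""
    return Counter(str(p.dtype) for _, p in model.named_parameters())


def params_with_dtype(model, dtype_name):
    """Names of parameters whose dtype string equals dtype_name (e.g. 'torch.float32')."""
    return [name for name, p in model.named_parameters() if str(p.dtype) == dtype_name]
```

Calling `summarize_param_dtypes(model)` on the loaded model, followed by `params_with_dtype(model, "torch.float32")`, would show whether the float32 tensors are confined to a few parameter names.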
Could you provide a script/notebook demo to show how to finetune this model?