
[bug] cannot load cambrian-34b #12

Open · CSEEduanyu opened this issue Jun 28, 2024 · 16 comments
Labels: bug (Something isn't working)
@CSEEduanyu

```
in load_pretrained_model
    model = CambrianLlamaForCausalLM.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 3531, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 3958, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 812, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/modeling.py", line 348, in set_module_tensor_to_device
    raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([1024, 1152]) in "weight" (which has shape torch.Size([1024, 1024])), this look incorrect.
```
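For anyone debugging a mismatch like this, here is a minimal diagnostic sketch (not from this thread; the local path is a placeholder, and it assumes the checkpoint is sharded safetensors) that prints the on-disk shape of each projector tensor so the offending parameter can be located:

```python
# Print the stored shape of every mm_projector_aux_* tensor to find the
# parameter behind the 1024x1152-vs-1024x1024 mismatch.
import json
import os
from safetensors import safe_open

model_dir = "/path/to/cambrian-34b"  # placeholder local checkpoint dir

with open(os.path.join(model_dir, "model.safetensors.index.json")) as f:
    weight_map = json.load(f)["weight_map"]

for name, shard in sorted(weight_map.items()):
    if "mm_projector_aux" in name:
        with safe_open(os.path.join(model_dir, shard), framework="pt") as sf:
            print(name, tuple(sf.get_slice(name).get_shape()))
```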

@ellisbrown added the bug label Jun 28, 2024
@penghao-wu (Contributor)

Hi, could you please provide more information about your case (e.g., the device_map used for loading and the number of GPUs)? Also, could you try loading the 8B/13B model to see whether the same problem happens?

@CSEEduanyu (Author)

transformers in my env is 4.39. Why must transformers==4.37.0 be pinned in the dependencies?

@CSEEduanyu (Author)

All the dependencies are pinned with "==". I wonder if ">" is OK?
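For reference, a sketch of what a relaxed requirement line could look like (illustrative only; the maintainers only verified the pinned versions, as the next reply notes):

```
transformers==4.37.0        # exact pin, as shipped in the repo
transformers>=4.37.0,<4.40  # bounded range: allows newer, caps untested releases
```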

@penghao-wu (Contributor)

Our training and evaluation were mainly conducted with the specified versions, and we haven't extensively tested higher versions to ensure correctness. That said, I have tested running the 34B model with transformers==4.39.0 and it works fine. Could you provide the device_map you use for loading and the number of GPUs? Also, which version of accelerate do you have?

@CSEEduanyu (Author)

> Our training and evaluation were mainly conducted with the specified versions, and we haven't extensively tested higher versions to ensure correctness. That said, I have tested running the 34B model with transformers==4.39.0 and it works fine. Could you provide the device_map you use for loading and the number of GPUs? Also, which version of accelerate do you have?

A100*8

@CSEEduanyu (Author)

```
Loading checkpoint shards:  94%|█████████▍| 30/32 [00:19<00:01, 1.57it/s]
Traceback (most recent call last):
    model = CambrianLlamaForCausalLM.from_pretrained(
  File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3852, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4286, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 807, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 285, in set_module_tensor_to_device
    raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([1024, 1152]) in "weight" (which has shape torch.Size([1024, 1024])), this look incorrect.
```

@CSEEduanyu (Author)

When I add some logging, the failure occurs while loading "model.mm_projector_aux_0.0.weight".
@penghao-wu

@CSEEduanyu (Author)

Is it because I only kept the second one in mm_vision_tower_aux_list?

@penghao-wu (Contributor)

> Is it because I only kept the second one in mm_vision_tower_aux_list?

What do you mean by this? You don't need to modify the config if you want to load our trained model.

@CSEEduanyu (Author)

> > Is it because I only kept the second one in mm_vision_tower_aux_list?
>
> What do you mean by this? You don't need to modify the config if you want to load our trained model.

Because I can only load models from a local path. Can you list the Hugging Face download addresses for these four vision models?

"mm_vision_tower_aux_list": [
"siglip/CLIP-ViT-SO400M-14-384",
"openai/clip-vit-large-patch14-336",
"facebook/dinov2-giant-res378",
"clip-convnext-XXL-multi-stage"
],
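If the goal is purely local loading, one untested sketch is to rewrite these four entries in the checkpoint's config.json to local directories, keeping their order and count so each mm_projector_aux_{i} still matches its tower's feature width. All paths below are placeholders, and as the reply further down notes, each encoder's loading code may also need adjusting:

```python
# Point the aux vision towers at local copies; order and count must be preserved.
import json

cfg_path = "/path/to/cambrian-34b/config.json"  # placeholder
local_paths = [  # placeholder local copies of the four encoders
    "/models/siglip/CLIP-ViT-SO400M-14-384",
    "/models/openai/clip-vit-large-patch14-336",
    "/models/facebook/dinov2-giant-res378",
    "/models/clip-convnext-XXL-multi-stage",
]

with open(cfg_path) as f:
    cfg = json.load(f)
cfg["mm_vision_tower_aux_list"] = local_paths

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)
```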

@CSEEduanyu (Author)

For example, CLIP-ViT-SO400M-14-384 seems to have many versions, and I can't find clip-convnext-XXL-multi-stage on Hugging Face.

@penghao-wu (Contributor)

CLIP-ViT-SO400M-14-384 should be hf-hub:timm/ViT-SO400M-14-SigLIP-384, and clip-convnext-XXL-multi-stage should be hf-hub:laion/CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-soup. If you use local paths, you might need to look into the loading code for each of the vision encoders in the cambrian/model/multimodal_encoder folder to ensure correctness.
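To fetch local copies of the two repos named above, a sketch using huggingface_hub (the hf-hub: prefix is an open_clip convention and is dropped here; the local_dir values are placeholders):

```python
# Download the two encoder repos to local directories for offline loading.
from huggingface_hub import snapshot_download

snapshot_download("timm/ViT-SO400M-14-SigLIP-384",
                  local_dir="/models/siglip/CLIP-ViT-SO400M-14-384")
snapshot_download("laion/CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-soup",
                  local_dir="/models/clip-convnext-XXL-multi-stage")
```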

@dionren commented Jun 30, 2024

Hi, how can I set this up on two 48G GPUs?

```
2024-06-30 15:21:12 PID=57 __init__.py:49 setup_logging() INFO → 'standard' logger initialized.
2024-06-30 15:21:13 PID=57 model_worker.py:274 <module>() INFO → args: Namespace(host='0.0.0.0', port=40000, worker_address='http://localhost:40000', controller_address='http://localhost:10000', model_path='/mnt/cpn-pod/models/nyu-visionx/cambrian-34b', model_base=None, model_name=None, device='cuda', multi_modal=False, limit_model_concurrency=5, stream_interval=1, no_register=False, load_8bit=False, load_4bit=False)
2024-06-30 15:21:13 PID=57 model_worker.py:66 __init__() INFO → Loading the model cambrian-34b on worker b48646 ...
2024-06-30 15:21:13 PID=57 builder.py:119 load_pretrained_model() INFO → Loading Cambrian from /mnt/cpn-pod/models/nyu-visionx/cambrian-34b
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/root/cambrian/cambrian/serve/model_worker.py", line 279, in <module>
    worker = ModelWorker(args.controller_address,
  File "/root/cambrian/cambrian/serve/model_worker.py", line 67, in __init__
    self.tokenizer, self.model, self.image_processor, self.context_len = load_pretrained_model(
  File "/root/cambrian/cambrian/model/builder.py", line 120, in load_pretrained_model
    tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py", line 814, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 2029, in from_pretrained
    return cls._from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 2261, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/tokenization_llama.py", line 178, in __init__
    self.sp_model = self.get_spm_processor(kwargs.pop("from_slow", False))
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/tokenization_llama.py", line 203, in get_spm_processor
    tokenizer.Load(self.vocab_file)
  File "/usr/local/lib/python3.10/dist-packages/sentencepiece/__init__.py", line 905, in Load
    return self.LoadFromFile(model_file)
  File "/usr/local/lib/python3.10/dist-packages/sentencepiece/__init__.py", line 310, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
TypeError: not a string
```

@penghao-wu (Contributor)

> return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
> TypeError: not a string

This error does not seem related to multiple GPUs. Make sure that all model files were downloaded correctly (e.g., tokenizer.model).
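A quick way to check that (illustrative sketch; the path matches the log above): confirm tokenizer.model exists and parses, since a missing or partial file can surface as exactly this "not a string" TypeError when the tokenizer's vocab_file ends up as None:

```python
# Verify tokenizer.model is present and loadable with sentencepiece.
import os
import sentencepiece as spm

path = "/mnt/cpn-pod/models/nyu-visionx/cambrian-34b/tokenizer.model"
print("exists:", os.path.exists(path))

sp = spm.SentencePieceProcessor()
sp.Load(path)  # raises if the file is missing or corrupt
print("vocab size:", sp.GetPieceSize())
```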

@penghao-wu (Contributor)

@dionren Some of the vision encoders are not from transformers and do not support device_map, so there are problems when setting device_map="auto" across multiple GPUs. We are still working on converting the vision encoders to support this.

But I have a workaround for your case with two 48G GPUs. It involves the following modifications:

1. Modify the beginning of cambrian/model/builder.py:

   ```python
   from accelerate import infer_auto_device_map, dispatch_model

   def load_pretrained_model(model_path, model_base, model_name, load_8bit=False, load_4bit=False, device_map="auto", device="cuda", **kwargs):
       device_map = 'sequential'
       kwargs = {"device_map": device_map, "max_memory": {0: "30GIB", 1: "49GIB"}, **kwargs}
   ```

2. Change

   ```python
   cur_latent_query_with_newline = torch.cat([cur_latent_query, cur_newline_embd], 2).flatten(1,2)
   ```

   to

   ```python
   cur_latent_query_with_newline = torch.cat([cur_latent_query, cur_newline_embd.to(cur_latent_query.device)], 2).flatten(1,2)
   ```
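After loading, the resulting placement can be inspected through the model's hf_device_map (set by transformers whenever a device_map is used; `model` here is the second value returned by load_pretrained_model). The 30GIB cap on GPU 0 presumably leaves headroom there for the vision encoders that sit outside the device map. A rough sketch:

```python
# Inspect how 'sequential' + max_memory spread the modules over the two GPUs.
from collections import Counter

print(Counter(model.hf_device_map.values()))  # module count per device
for name, dev in model.hf_device_map.items():
    if "mm_projector" in name:
        print(name, "->", dev)
```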

@dionren commented Jun 30, 2024

> @dionren Some of the vision encoders are not from transformers and do not support device_map [...] But I have a workaround for your case with two 48G GPUs.

I'm gonna try it out. Thanks a ton for your help and the awesome work you've done. It's truly impressive.
