Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RuntimeError: shape '[1, 512, 1, 32, 2]' is invalid for input of size 16448 #21

Open
beniz opened this issue Jun 14, 2024 · 5 comments
Open

Comments

@beniz
Copy link

beniz commented Jun 14, 2024

Hi, thanks for the interesting work.

I'm playing a bit with the code on a simple single-class dataset of 256x256 images, and I've modified basic things (imagenet hardcoded numbers, etc...).

I'm hitting the error above on the rope embedding:

freqs_cis = freqs_cis.view(1, xshaped.size(1), 1, xshaped.size(3), 2) # (1, seq_len, 1, head_dim//2, 2)

Went chasing the issue, and it seems this is due to a mismatch between the precomputed freqs_cis and the reshaping of the attention vectors. This mismatch appears to mostly be due to the number of augmentations (I went from 10 to 2 during debug).

If this error rings a bell, I'd appreciate any hint :) I see how to fix it with a hack (reducing aug to none), but I believe something else is wrong, otherwise it wouldn't work at all.

Thanks!

@PeizeSun
Copy link
Collaborator

Hi~
Can you provide more descriptions, like

  • what is the running common you use?
  • What basic things do you change?

@beniz
Copy link
Author

beniz commented Jun 17, 2024

Hi @PeizeSun thanks, and apologies for not having provided more details.

I've pre-computed the codes with (removing the ten_crop flag for debug):

bash scripts/autoregressive/extract_codes_c2i.sh --vq-ckpt /path/to/models/vq_ds16_c2i.pt --data-path /path/to/butterflies/ --code-path /path/to/butterflies/codes_256/ --image-size 256

The toy dataset is single class, available from https://www.joligen.com/datasets/butterflies.tar

The generated codes in the codes_256 dir seems to be OK:

ls -l codes_256/
total 44
drwxrwxr-x 2 b b 20480 Jun 14 17:19 imagenet256_codes
drwxrwxr-x 2 b b 20480 Jun 14 17:19 imagenet256_labels

From printing the shapes, features are (correctly afaik) of shape [1,2,256], and labels of shape [1].

My diff on the training code is below, I've downsized the training to a single GPU for debug purposes.

diff --git a/autoregressive/train/train_c2i.py b/autoregressive/train/train_c2i.py
index 3b43aa5..031b868 100644
--- a/autoregressive/train/train_c2i.py
+++ b/autoregressive/train/train_c2i.py
@@ -15,6 +15,8 @@ import time
 import inspect
 import argparse

+import sys
+sys.path.append(os.path.join(os.path.dirname(__file__), '../..'))
 from utils.logger import create_logger
 from utils.distributed import init_distributed_mode
 from utils.ema import update_ema, requires_grad
diff --git a/dataset/imagenet.py b/dataset/imagenet.py
index c07f6cb..6d0e185 100644
--- a/dataset/imagenet.py
+++ b/dataset/imagenet.py
@@ -23,8 +23,8 @@ class CustomDataset(Dataset):
         # self.feature_files = sorted(os.listdir(feature_dir))
         # self.label_files = sorted(os.listdir(label_dir))
         # TODO: make it configurable
-        self.feature_files = [f"{i}.npy" for i in range(1281167)]
-        self.label_files = [f"{i}.npy" for i in range(1281167)]
+        self.feature_files = [f"{i}.npy" for i in range(951)]
+        self.label_files = [f"{i}.npy" for i in range(951)]

     def __len__(self):
         assert len(self.feature_files) == len(self.label_files), \
@@ -58,4 +58,4 @@ def build_imagenet_code(args):
     label_dir = f"{args.code_path}/imagenet{args.image_size}_labels"
     assert os.path.exists(feature_dir) and os.path.exists(label_dir), \
         f"please first run: bash scripts/autoregressive/extract_codes_c2i.sh ..."
-    return CustomDataset(feature_dir, label_dir)
\ No newline at end of file
+    return CustomDataset(feature_dir, label_dir)
diff --git a/scripts/autoregressive/train_c2i.sh b/scripts/autoregressive/train_c2i.sh
index ecc6a98..5638ebb 100644
--- a/scripts/autoregressive/train_c2i.sh
+++ b/scripts/autoregressive/train_c2i.sh
@@ -1,6 +1,12 @@
 # !/bin/bash
 set -x

+nnodes=1
+nproc_per_node=1
+node_rank=0
+master_addr=127.0.0.1
+master_port=29500
+
 torchrun \
 --nnodes=$nnodes --nproc_per_node=$nproc_per_node --node_rank=$node_rank \
 --master_addr=$master_addr --master_port=$master_port \

I run training with

bash scripts/autoregressive/train_c2i.sh --cloud-save-path /path/to/models/gpt_b/ --code-path /data1/path/to/butterflies/codes_256/ --image-size 256 --global-batch-size 256 --gpt-model GPT-B --num-classes 1  --no-compile

The full error is below:

Traceback (most recent call last):
  File "/path/to/apps/LlamaGen/autoregressive/train/train_c2i.py", line 296, in <module>
    main(args)
  File "/path/to/apps/LlamaGen/autoregressive/train/train_c2i.py", line 196, in main
    _, loss = model(cond_idx=c_indices, idx=z_indices[:,:-1], targets=z_indices)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1523, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1359, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/path/to/apps/LlamaGen/autoregressive/train/../../autoregressive/models/gpt.py", line 364, in forward
    h = layer(h, freqs_cis, input_pos, mask)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/path/to/apps/LlamaGen/autoregressive/train/../../autoregressive/models/gpt.py", line 255, in forward
    h = x + self.drop_path(self.attention(self.attention_norm(x), freqs_cis, start_pos, mask))
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/path/to/apps/LlamaGen/autoregressive/train/../../autoregressive/models/gpt.py", line 220, in forward
    xq = apply_rotary_emb(xq, freqs_cis)
  File "/path/to/apps/LlamaGen/autoregressive/train/../../autoregressive/models/gpt.py", line 424, in apply_rotary_emb
    freqs_cis = freqs_cis.view(1, xshaped.size(1), 1, xshaped.size(3), 2) # (1, seq_len, 1, head_dim//2, 2)
RuntimeError: shape '[1, 512, 1, 32, 2]' is invalid for input of size 16448
[2024-06-17 11:04:43,428] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 263727) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
autoregressive/train/train_c2i.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-06-17_11:04:43
  host      : neptune10
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 263727)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

I may have missed something, regarding the rope embedding and replicating across the number of crops.

@Baijiong-Lin
Copy link

the same issue

@Baijiong-Lin
Copy link

I first run the following command to generate codes on the imagenet dataset
torchrun --nproc_per_node 2 autoregressive/train/extract_codes_c2i.py --vq-model VQ-16 --vq-ckpt ./vq_ds16_c2i.pt --data-path xxx --code-path xxx --image-size 256

and then run the following command to train
torchrun --nproc_per_node 8 autoregressive/train/train_c2i.py --code-path xxx --results-dir xxx --no-compile --image-size 256

it raises an error in the apply_rotary_emb function in autoregressive/models/gpt.py

freqs_cis = freqs_cis.view(1, xshaped.size(1), 1, xshaped.size(3), 2) # (1, seq_len, 1, head_dim//2, 2)
RuntimeError: shape '[1, 512, 1, 32, 2]' is invalid for input of size 16448

I have printed the size of variables before line xq = apply_rotary_emb(xq, freqs_cis) and found that the size of xq is torch.Size([128, 512, 12, 64]) and the size of freqs_cis is torch.Size([257, 32, 2])

if i comment out xq = apply_rotary_emb(xq, freqs_cis) and xk = apply_rotary_emb(xk, freqs_cis), it can train normally.


@PeizeSun could you help to solve this problem? thanks.

@Menoly-xin
Copy link

I have met the same issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants