RuntimeError: shape '[1, 512, 1, 32, 2]' is invalid for input of size 16448 #21

beniz · 2024-06-14T17:35:28Z

Hi, thanks for the interesting work.

I'm playing a bit with the code on a simple single-class dataset of 256x256 images, and I've modified a few basic things (hardcoded ImageNet numbers, etc.).

I'm hitting the error above on the rope embedding:

freqs_cis = freqs_cis.view(1, xshaped.size(1), 1, xshaped.size(3), 2) # (1, seq_len, 1, head_dim//2, 2)

I went chasing the issue, and it seems to be due to a mismatch between the precomputed freqs_cis and the reshaped attention vectors. The mismatch appears to come mostly from the number of augmentations (I went from 10 to 2 while debugging).
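As a quick sanity check on the numbers in the error message, the arithmetic can be sketched as follows. It uses only the requested shape and the reported input size. Reading 257 as 256 image tokens plus one conditioning token is an assumption, not something confirmed by the code shown here.

```python
from math import prod

# Values taken from the error message and the shape comment in apply_rotary_emb.
requested_shape = (1, 512, 1, 32, 2)  # (1, seq_len, 1, head_dim // 2, 2)
available_elements = 16448            # number of elements in the precomputed freqs_cis

# Elements the view() call asks for.
requested_elements = prod(requested_shape)

# Number of positions the precomputed table actually covers.
half_head_dim, pair = requested_shape[3], requested_shape[4]
positions_available = available_elements // (half_head_dim * pair)

print(requested_elements)   # 32768
print(positions_available)  # 257 (assumed: 256 image tokens + 1 conditioning token)
```

So the table covers 257 positions while the attention input has 512, which is consistent with the sequence being about twice as long as the model expects.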

If this error rings a bell, I'd appreciate any hint :) I see how to fix it with a hack (reducing aug to none), but I believe something else is wrong, otherwise it wouldn't work at all.

Thanks!


PeizeSun · 2024-06-16T15:39:16Z

Hi~
Can you provide more details, such as:

What command are you running?
What basic things did you change?

beniz · 2024-06-17T11:09:35Z

Hi @PeizeSun thanks, and apologies for not having provided more details.

I've pre-computed the codes with (removing the ten_crop flag for debug):

bash scripts/autoregressive/extract_codes_c2i.sh --vq-ckpt /path/to/models/vq_ds16_c2i.pt --data-path /path/to/butterflies/ --code-path /path/to/butterflies/codes_256/ --image-size 256

The toy dataset is single class, available from https://www.joligen.com/datasets/butterflies.tar

The generated codes in the codes_256 dir seem to be OK:

ls -l codes_256/
total 44
drwxrwxr-x 2 b b 20480 Jun 14 17:19 imagenet256_codes
drwxrwxr-x 2 b b 20480 Jun 14 17:19 imagenet256_labels

From printing the shapes, features are (correctly afaik) of shape [1,2,256], and labels of shape [1].
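One possible reading of these shapes, offered as a sketch rather than a confirmed diagnosis: if the loader returns both augmented views and the training loop flattens everything after the batch dimension (an assumption about the training code), each sample contributes 2 x 256 = 512 tokens instead of 256. The helper name pick_one_augmentation is hypothetical, and NumPy stands in for the real data loader.

```python
import numpy as np

def pick_one_augmentation(features, rng):
    """Hypothetical helper: keep a single augmented view per sample.

    features: array of shape (1, num_aug, num_tokens), as printed above.
    Returns an array of shape (1, num_tokens).
    """
    aug_idx = rng.integers(low=0, high=features.shape[1])
    return features[:, aug_idx]

rng = np.random.default_rng(0)
features = np.zeros((1, 2, 256), dtype=np.int64)  # stand-in for one loaded .npy file

# Flattening every view together doubles the sequence length.
flattened = features.reshape(-1)

# Selecting one view first keeps the expected 256 tokens.
single = pick_one_augmentation(features, rng).reshape(-1)

print(flattened.shape)  # (512,)
print(single.shape)     # (256,)
```

If this reading is right, selecting one view per sample would keep the sequence length in line with the precomputed freqs_cis without disabling augmentation entirely.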

My diff on the training code is below; I've downsized the training to a single GPU for debugging purposes.

diff --git a/autoregressive/train/train_c2i.py b/autoregressive/train/train_c2i.py
index 3b43aa5..031b868 100644
--- a/autoregressive/train/train_c2i.py
+++ b/autoregressive/train/train_c2i.py
@@ -15,6 +15,8 @@ import time
 import inspect
 import argparse

+import sys
+sys.path.append(os.path.join(os.path.dirname(__file__), '../..'))
 from utils.logger import create_logger
 from utils.distributed import init_distributed_mode
 from utils.ema import update_ema, requires_grad
diff --git a/dataset/imagenet.py b/dataset/imagenet.py
index c07f6cb..6d0e185 100644
--- a/dataset/imagenet.py
+++ b/dataset/imagenet.py
@@ -23,8 +23,8 @@ class CustomDataset(Dataset):
         # self.feature_files = sorted(os.listdir(feature_dir))
         # self.label_files = sorted(os.listdir(label_dir))
         # TODO: make it configurable
-        self.feature_files = [f"{i}.npy" for i in range(1281167)]
-        self.label_files = [f"{i}.npy" for i in range(1281167)]
+        self.feature_files = [f"{i}.npy" for i in range(951)]
+        self.label_files = [f"{i}.npy" for i in range(951)]

     def __len__(self):
         assert len(self.feature_files) == len(self.label_files), \
@@ -58,4 +58,4 @@ def build_imagenet_code(args):
     label_dir = f"{args.code_path}/imagenet{args.image_size}_labels"
     assert os.path.exists(feature_dir) and os.path.exists(label_dir), \
         f"please first run: bash scripts/autoregressive/extract_codes_c2i.sh ..."
-    return CustomDataset(feature_dir, label_dir)
\ No newline at end of file
+    return CustomDataset(feature_dir, label_dir)
diff --git a/scripts/autoregressive/train_c2i.sh b/scripts/autoregressive/train_c2i.sh
index ecc6a98..5638ebb 100644
--- a/scripts/autoregressive/train_c2i.sh
+++ b/scripts/autoregressive/train_c2i.sh
@@ -1,6 +1,12 @@
 # !/bin/bash
 set -x

+nnodes=1
+nproc_per_node=1
+node_rank=0
+master_addr=127.0.0.1
+master_port=29500
+
 torchrun \
 --nnodes=$nnodes --nproc_per_node=$nproc_per_node --node_rank=$node_rank \
 --master_addr=$master_addr --master_port=$master_port \

I run training with

bash scripts/autoregressive/train_c2i.sh --cloud-save-path /path/to/models/gpt_b/ --code-path /data1/path/to/butterflies/codes_256/ --image-size 256 --global-batch-size 256 --gpt-model GPT-B --num-classes 1  --no-compile

The full error is below:

Traceback (most recent call last):
  File "/path/to/apps/LlamaGen/autoregressive/train/train_c2i.py", line 296, in <module>
    main(args)
  File "/path/to/apps/LlamaGen/autoregressive/train/train_c2i.py", line 196, in main
    _, loss = model(cond_idx=c_indices, idx=z_indices[:,:-1], targets=z_indices)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1523, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 1359, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/path/to/apps/LlamaGen/autoregressive/train/../../autoregressive/models/gpt.py", line 364, in forward
    h = layer(h, freqs_cis, input_pos, mask)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/path/to/apps/LlamaGen/autoregressive/train/../../autoregressive/models/gpt.py", line 255, in forward
    h = x + self.drop_path(self.attention(self.attention_norm(x), freqs_cis, start_pos, mask))
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/path/to/apps/LlamaGen/autoregressive/train/../../autoregressive/models/gpt.py", line 220, in forward
    xq = apply_rotary_emb(xq, freqs_cis)
  File "/path/to/apps/LlamaGen/autoregressive/train/../../autoregressive/models/gpt.py", line 424, in apply_rotary_emb
    freqs_cis = freqs_cis.view(1, xshaped.size(1), 1, xshaped.size(3), 2) # (1, seq_len, 1, head_dim//2, 2)
RuntimeError: shape '[1, 512, 1, 32, 2]' is invalid for input of size 16448
[2024-06-17 11:04:43,428] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 263727) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
autoregressive/train/train_c2i.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-06-17_11:04:43
  host      : neptune10
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 263727)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html


I may have missed something regarding the RoPE embedding and how it is replicated across the number of crops.

Baijiong-Lin · 2024-06-19T11:05:36Z

I have the same issue.

Baijiong-Lin · 2024-07-01T07:48:14Z

I first ran the following command to generate codes on the ImageNet dataset:
torchrun --nproc_per_node 2 autoregressive/train/extract_codes_c2i.py --vq-model VQ-16 --vq-ckpt ./vq_ds16_c2i.pt --data-path xxx --code-path xxx --image-size 256

and then ran the following command to train:
torchrun --nproc_per_node 8 autoregressive/train/train_c2i.py --code-path xxx --results-dir xxx --no-compile --image-size 256

It raises an error in the apply_rotary_emb function in autoregressive/models/gpt.py:

freqs_cis = freqs_cis.view(1, xshaped.size(1), 1, xshaped.size(3), 2) # (1, seq_len, 1, head_dim//2, 2)
RuntimeError: shape '[1, 512, 1, 32, 2]' is invalid for input of size 16448

I printed the sizes of the variables just before the line xq = apply_rotary_emb(xq, freqs_cis) and found that xq has size torch.Size([128, 512, 12, 64]) and freqs_cis has size torch.Size([257, 32, 2]).
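The mismatch between these two sizes can be reproduced without the model. The sketch below uses NumPy as a stand-in for torch and a hypothetical guard function, reshape_freqs_for_broadcast, that checks the position count before reshaping. It is only an illustration of why the reshape fails, not the repository's code.

```python
import numpy as np

def reshape_freqs_for_broadcast(freqs_cis, seq_len, half_head_dim):
    """Hypothetical guard around the failing reshape: check sizes first."""
    if freqs_cis.shape[0] != seq_len:
        raise ValueError(
            f"freqs_cis covers {freqs_cis.shape[0]} positions, "
            f"but the attention input has seq_len={seq_len}"
        )
    return freqs_cis.reshape(1, seq_len, 1, half_head_dim, 2)

freqs_cis = np.zeros((257, 32, 2), dtype=np.float32)  # shape reported above

# xq was reported as (128, 512, 12, 64), i.e. seq_len=512 and head_dim=64.
try:
    reshape_freqs_for_broadcast(freqs_cis, seq_len=512, half_head_dim=32)
except ValueError as err:
    print(err)

# With a 257-position input the same reshape succeeds.
ok = reshape_freqs_for_broadcast(freqs_cis, seq_len=257, half_head_dim=32)
print(ok.shape)  # (1, 257, 1, 32, 2)
```

A check of this kind points at the sequence length of the input (512) rather than at the rotary embedding itself, which only has entries for 257 positions.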

If I comment out xq = apply_rotary_emb(xq, freqs_cis) and xk = apply_rotary_emb(xk, freqs_cis), training runs normally.

@PeizeSun could you help solve this problem? Thanks.

Menoly-xin · 2024-07-02T04:51:18Z

I have met the same issue.
