Nicer error message for undefined symbol #1339

dakinggg · 2024-07-04T01:32:38Z

Adds a nicer error message for the most common case of the flash attention install getting messed up.

Before:

ImportError: /usr/lib/python3/dist-packages/flash_attn_2_cuda.cpython-311-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda9SetDeviceEi

After:

ImportError: /usr/lib/python3/dist-packages/flash_attn_2_cuda.cpython-311-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda9SetDeviceEi

The above exception was the direct cause of the following exception:

╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /workspace/llm-foundry/scripts/train/train.py:25 in <module>                 │
│                                                                              │
│    22 from omegaconf import DictConfig                                       │
│    23 from omegaconf import OmegaConf as om                                  │
│    24                                                                        │
│ ❱  25 from llmfoundry.callbacks import AsyncEval, HuggingFaceCheckpointer    │
│    26 from llmfoundry.data.dataloader import build_dataloader                │
│    27 from llmfoundry.eval.metrics.nlp import InContextLearningMetric        │
│    28 from llmfoundry.layers_registry import ffns_with_megablocks            │
│                                                                              │
│ /workspace/llm-foundry/llmfoundry/__init__.py:17 in <module>                 │
│                                                                              │
│   14 │   del flash_attn_func                                                 │
│   15 except ImportError as e:                                                │
│   16 │   if "undefined symbol" in str(e):                                    │
│ ❱ 17 │   │   raise ImportError(                                              │
│   18 │   │   │   "The flash_attn package is not installed correctly. Usually │
│   19 │   │   │   " of PyTorch is different from the version that flash_attn  │
│   20 │   │   │   " workflow has resulted in PyTorch being reinstalled. This  │
╰──────────────────────────────────────────────────────────────────────────────╯
ImportError: The flash_attn package is not installed correctly. Usually this
means that your runtime version of PyTorch is different from the version that
flash_attn was installed with, which can occur when your workflow has resulted
in PyTorch being reinstalled. This probably happened because you are using an
old docker image with the latest version of LLM Foundry. Check that the PyTorch
version in your Docker image matches the PyTorch version in LLM Foundry setup.py
and update accordingly. The latest Docker image can be found in the README.
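
For readers who want the pattern in isolation, here is a minimal sketch of what the traceback above suggests: catch the `ImportError`, check for the "undefined symbol" substring, and re-raise a friendlier error chained to the original. The helper name `import_with_hint`, its `importer` parameter, and the shortened `HINT` text are hypothetical and for illustration only; the actual change lives inline in `llmfoundry/__init__.py` and may differ in detail.

```python
import importlib
from types import ModuleType
from typing import Callable

# Shortened, illustrative hint text; not the exact message from the PR.
HINT = (
    "The flash_attn package is not installed correctly. Usually this means "
    "that your runtime version of PyTorch is different from the version "
    "that flash_attn was installed with."
)


def import_with_hint(
    name: str,
    importer: Callable[[str], ModuleType] = importlib.import_module,
) -> ModuleType:
    """Import a module, adding a hint when the failure is an undefined symbol.

    `importer` is a hypothetical injection point so the behavior can be
    exercised without a broken flash_attn install.
    """
    try:
        return importer(name)
    except ImportError as e:
        if "undefined symbol" in str(e):
            # `from e` keeps the original linker error visible as the cause.
            raise ImportError(HINT) from e
        # Any other import failure propagates unchanged.
        raise
```

Chaining with `from e` is what produces the "The above exception was the direct cause of the following exception" line in the output, so the raw linker error stays available for debugging.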

snarayan21

hella useful, ty

llmfoundry/__init__.py

dakinggg added 3 commits July 3, 2024 18:08

nice

6bac701

nice

2ea01c7

move up

603e214

dakinggg requested a review from a team as a code owner July 4, 2024 01:32

dakinggg requested a review from mvpatel2000 July 4, 2024 01:33

dakinggg enabled auto-merge (squash) July 4, 2024 01:33

snarayan21 approved these changes Jul 4, 2024

dakinggg added 2 commits July 3, 2024 19:27

oops

f4186ca

fix

dc3b4d1

dakinggg commented Jul 4, 2024

dakinggg added 2 commits July 3, 2024 22:48

Update llmfoundry/__init__.py

86ed0a1

Update llmfoundry/__init__.py

b3df3f2

dakinggg merged commit 22e243a into mosaicml:main Jul 4, 2024
9 checks passed
