
Segmentation fault when fine-tuning Ambernet #9561

Closed
Oscaarjs opened this issue Jun 28, 2024 · 3 comments
Labels
bug Something isn't working

Comments

Oscaarjs commented Jun 28, 2024

Describe the bug

I'm trying to fine-tune the Ambernet model on Cantonese and Chinese, but I'm getting this segmentation fault:

Sanity Checking DataLoader 0:   0%|                                                               | 0/2 [00:00<?, ?it/s]
Segmentation fault

full_output.txt

Steps/Code to reproduce bug

The config I'm using (converted to JSON so it could be uploaded here):
ambernet_config.json

My manifests look like this:

{"audio_filepath": "/home/<masked>/finetune-lid/data/cv-corpus-18.0-2024-06-14/yue/clips/common_voice_yue_32329425.mp3", "duration": 3.78, "label": "yue"}
{"audio_filepath": "/home/<masked>/finetune-lid/data/cv-corpus-18.0-2024-06-14/yue/clips/common_voice_yue_38973572.mp3", "duration": 3.276, "label": "yue"}
{"audio_filepath": "/home/<masked>/finetune-lid/data/cv-corpus-18.0-2024-06-14/yue/clips/common_voice_yue_38956269.mp3", "duration": 3.168, "label": "yue"}
{"audio_filepath": "/home/<masked>/finetune-lid/data/cv-corpus-18.0-2024-06-14/zh-CN/clips/common_voice_zh-CN_32666586.mp3", "duration": 2.88, "label": "zh-CN"}
{"audio_filepath": "/home/<masked>/finetune-lid/data/cv-corpus-18.0-2024-06-14/zh-CN/clips/common_voice_zh-CN_33111477.mp3", "duration": 5.796, "label": "zh-CN"}

I run this script:

!python speech_to_label.py --config-path="configs/" --config-name="ambernet_config" \
    model.train_ds.manifest_filepath="manifests/train_manifest.json" \
    model.validation_ds.manifest_filepath="manifests/val_manifest.json" \
    model.test_ds.manifest_filepath="manifests/test_manifest.json" \
    model.decoder.num_classes=2 \
    trainer.devices=1 \
    trainer.max_epochs=40 \
    trainer.accelerator="gpu" \
    exp_manager.create_wandb_logger=False \
    exp_manager.wandb_logger_kwargs.name="titanet" \
    exp_manager.wandb_logger_kwargs.project="langid" \
    +exp_manager.checkpoint_callback_params.monitor="val_acc_macro" \
    +exp_manager.checkpoint_callback_params.mode="max" \
    +trainer.precision=16
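
To check whether the crash comes from audio decoding rather than from the model or CUDA, the clips referenced by a manifest can be decoded outside the training script (a minimal sketch, assuming librosa, which nemo_toolkit['asr'] pulls in, can decode the mp3 files):

import json
import librosa

# Try to decode every clip listed in the training manifest; if this loop
# also crashes, the problem is in audio I/O (libsndfile/ffmpeg) rather than
# in the model, Lightning, or CUDA.
with open("manifests/train_manifest.json", encoding="utf-8") as f:
    for line in f:
        entry = json.loads(line)
        audio, sr = librosa.load(entry["audio_filepath"], sr=16000)
        print(entry["label"], entry["audio_filepath"], round(len(audio) / sr, 2))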

Environment overview

  • Environment location: GCE VM

  • Method of NeMo install: pip install, using the following commands:

apt-get update && apt-get install -y libsndfile1 ffmpeg
pip install Cython packaging
pip install nemo_toolkit['asr']
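
To confirm what that install actually produced (a quick sketch that only prints versions):

import nemo
import torch

# Print the installed NeMo and PyTorch versions for the bug report.
print("NeMo:", nemo.__version__)
print("PyTorch:", torch.__version__)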

Environment details

  • OS version
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
  • PyTorch version: 2.3.1+cu121
  • NeMo version: 2.0.0rc0 (also reproduced on 1.23.0)
  • Python version: 3.10.14

Additional context

GPU model:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       On  |   00000000:00:04.0 Off |                    0 |
| N/A   36C    P8              9W /   70W |       1MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jul_11_02:20:44_PDT_2023
Cuda compilation tools, release 12.2, V12.2.128
Build cuda_12.2.r12.2/compiler.33053471_0
Oscaarjs added the bug label Jun 28, 2024
Oscaarjs commented

@fayejf maybe you have an idea? Thanks in advance

Oscaarjs changed the title from "TypeError when fine-tuning Ambernet" to "Segmentation fault when fine-tuning Ambernet" Jun 28, 2024

Oscaarjs commented Jul 1, 2024

Updating from CUDA 12.2 to CUDA 12.3 seems to have solved the issue.
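
For reference, a quick way to see both the CUDA build PyTorch ships with and the system toolkit version (a sketch; nvcc must be on PATH):

import subprocess
import torch

# CUDA version the PyTorch wheel was built against (12.1 for 2.3.1+cu121),
# plus the cuDNN version it loads.
print("torch:", torch.__version__, "| built with CUDA", torch.version.cuda,
      "| cuDNN", torch.backends.cudnn.version())

# System CUDA toolkit version (this is what was upgraded from 12.2 to 12.3).
print(subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout)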


Oscaarjs commented Jul 1, 2024

Closing

Oscaarjs closed this as completed Jul 1, 2024