
Segmentation fault when fine-tuning Ambernet #9561

Closed
Oscaarjs opened this issue Jun 28, 2024 · 3 comments
Labels
bug Something isn't working

Comments

Oscaarjs commented Jun 28, 2024

Describe the bug

I'm trying to fine-tune the Ambernet model on Cantonese and Chinese, but I'm getting this segmentation fault:

Sanity Checking DataLoader 0:   0%|                                                               | 0/2 [00:00<?, ?it/s]
Segmentation fault

full_output.txt

Steps/Code to reproduce bug

The config I'm using (converted to JSON so it could be uploaded here):
ambernet_config.json

My manifests look like this:

{"audio_filepath": "/home/<masked>/finetune-lid/data/cv-corpus-18.0-2024-06-14/yue/clips/common_voice_yue_32329425.mp3", "duration": 3.78, "label": "yue"}
{"audio_filepath": "/home/<masked>/finetune-lid/data/cv-corpus-18.0-2024-06-14/yue/clips/common_voice_yue_38973572.mp3", "duration": 3.276, "label": "yue"}
{"audio_filepath": "/home/<masked>/finetune-lid/data/cv-corpus-18.0-2024-06-14/yue/clips/common_voice_yue_38956269.mp3", "duration": 3.168, "label": "yue"}
{"audio_filepath": "/home/<masked>/finetune-lid/data/cv-corpus-18.0-2024-06-14/zh-CN/clips/common_voice_zh-CN_32666586.mp3", "duration": 2.88, "label": "zh-CN"}
{"audio_filepath": "/home/<masked>/finetune-lid/data/cv-corpus-18.0-2024-06-14/zh-CN/clips/common_voice_zh-CN_33111477.mp3", "duration": 5.796, "label": "zh-CN"}

I run this script:

!python speech_to_label.py --config-path="configs/" --config-name="ambernet_config" \
    model.train_ds.manifest_filepath="manifests/train_manifest.json" \
    model.validation_ds.manifest_filepath="manifests/val_manifest.json" \
    model.test_ds.manifest_filepath="manifests/test_manifest.json" \
    model.decoder.num_classes=2 \
    trainer.devices=1 \
    trainer.max_epochs=40 \
    trainer.accelerator="gpu" \
    exp_manager.create_wandb_logger=False \
    exp_manager.wandb_logger_kwargs.name="titanet" \
    exp_manager.wandb_logger_kwargs.project="langid" \
    +exp_manager.checkpoint_callback_params.monitor="val_acc_macro" \
    +exp_manager.checkpoint_callback_params.mode="max" \
    +trainer.precision=16
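
To check whether the crash comes from audio decoding rather than from the model or CUDA, the clips referenced by a manifest can be decoded outside the training script (a minimal sketch, assuming librosa, which nemo_toolkit['asr'] pulls in, can decode the mp3 files):

import json
import librosa

# Try to decode every clip listed in the training manifest; if this loop
# also crashes, the problem is in audio I/O (libsndfile/ffmpeg) rather than
# in the model, Lightning, or CUDA.
with open("manifests/train_manifest.json", encoding="utf-8") as f:
    for line in f:
        entry = json.loads(line)
        audio, sr = librosa.load(entry["audio_filepath"], sr=16000)
        print(entry["label"], entry["audio_filepath"], round(len(audio) / sr, 2))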

Environment overview

  • Environment location: GCE VM

  • Method of NeMo install: pip install, using the following commands:

apt-get update && apt-get install -y libsndfile1 ffmpeg
pip install Cython packaging
pip install nemo_toolkit['asr']
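
To confirm what that install actually produced (a quick sketch that only prints versions):

import nemo
import torch

# Print the installed NeMo and PyTorch versions for the bug report.
print("NeMo:", nemo.__version__)
print("PyTorch:", torch.__version__)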

Environment details

  • OS version
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
  • PyTorch version: 2.3.1+cu121
  • NeMo version: 2.0.0rc0 (also reproduced on 1.23.0)
  • Python version: 3.10.14

Additional context

GPU model:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       On  |   00000000:00:04.0 Off |                    0 |
| N/A   36C    P8              9W /   70W |       1MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jul_11_02:20:44_PDT_2023
Cuda compilation tools, release 12.2, V12.2.128
Build cuda_12.2.r12.2/compiler.33053471_0
Oscaarjs added the bug label Jun 28, 2024
Oscaarjs commented

@fayejf maybe you have an idea? Thanks in advance

Oscaarjs changed the title from "TypeError when fine-tuning Ambernet" to "Segmentation fault when fine-tuning Ambernet" Jun 28, 2024

Oscaarjs commented Jul 1, 2024

Updating from CUDA 12.2 to CUDA 12.3 seems to have solved the issue.
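
For reference, a quick way to see both the CUDA build PyTorch ships with and the system toolkit version (a sketch; nvcc must be on PATH):

import subprocess
import torch

# CUDA version the PyTorch wheel was built against (12.1 for 2.3.1+cu121),
# plus the cuDNN version it loads.
print("torch:", torch.__version__, "| built with CUDA", torch.version.cuda,
      "| cuDNN", torch.backends.cudnn.version())

# System CUDA toolkit version (this is what was upgraded from 12.2 to 12.3).
print(subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout)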


Oscaarjs commented Jul 1, 2024

Closing

Oscaarjs closed this as completed Jul 1, 2024