main.py is Killed after running for 11 hours #151

merliongolden · 2022-04-20T15:52:30Z

Title

main.py is Killed after running for 11 hours

Description

python ./bin/main.py model=ds2 train=ds2_train train.dataset_path=$DATASET_PATH

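When I run the command above on an i5 desktop, it is Killed after about 8 hours.
Rerunning it gets Killed after about the same amount of time.
On an i7 Surface Pro 8 (11th gen), it is Killed after about 11 hours.
I start the container with:
docker run -it DeepSpeech2Korean bash
and then:
cd /home/user/kospeech
followed by the python ./bin/main.py … command shown above.
After "0 Epoch started",
at step 380/38750,
Killed
[cursor]
is what I get. It looks like it might be running out of memory, so I am rerunning it on both the desktop and the Surface Pro 8.
Memory check with free:
Memory: 8GB
Swap: 1GB
It seems to have been Killed at this limit, so I ran it again:
C:\> docker run --memory=32g --memory-swap=-1 -it Deepspeech:latest bash
cd /home/user/kospeech
python ./bin/main.py … &
Checking with free, the memory and swap sizes are unchanged. I wonder whether the capacity only grows as the process runs, so I am monitoring the run and will check again at the office tomorrow.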
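As a rough cross-check of what `free` reports inside the container, the kernel's figures can be read directly from `/proc/meminfo`. The sketch below is only an illustration: the helper names `parse_meminfo` and `summarize_memory` are made up for this example and are not part of kospeech.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: value}.

    Most fields are reported in kB; the unit suffix is dropped here.
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def summarize_memory(info):
    """Return (MemTotal, MemAvailable, SwapTotal) converted from kB to GiB."""
    kib_per_gib = 1024 * 1024
    return tuple(
        info.get(key, 0) / kib_per_gib
        for key in ("MemTotal", "MemAvailable", "SwapTotal")
    )
```

Inside the container, passing the contents of `/proc/meminfo` to `parse_meminfo` and then to `summarize_memory` gives the totals the training process can actually see. If `MemTotal` still shows about 8 GiB, the larger limit has probably not taken effect.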
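I spent about two weeks building the dataset and converting PCM to WAV.
The split is train 96%, test 2%, val 2%, done with a script that moves files using a random function.
I created transcript.txt, ai_all_vocab.csv (?), and so on,
but after the prerequisite shell steps, running main.py keeps getting Killed.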
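For reference, a 96% / 2% / 2% random split like the one described can be sketched as follows. This is not the script actually used; `split_dataset` and its parameters are hypothetical names, and it only partitions a list of items (for example file names) without moving any files.

```python
import random


def split_dataset(items, train_pct=96, test_pct=2, seed=0):
    """Shuffle items and split them into (train, test, val) lists.

    Percentages are integers to avoid float rounding; whatever is left
    after the train and test portions goes to validation.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = len(shuffled) * train_pct // 100
    n_test = len(shuffled) * test_pct // 100
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    val = shuffled[n_train + n_test:]
    return train, test, val
```

Fixing the seed means that rerunning the split on the same file list produces the same three subsets, which helps when a training run has to be restarted.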
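Just in case, I am in the process of buying a server:
i9 12th gen, 32GB DDR5, 1TB SSD (for the OS), RTX 3080 Ti
This is not easy, and the data processing takes an enormous amount of time.
Thank you as always.
I have heard that one epoch usually takes about 14 hours. With training repeated for 70 epochs, that comes to nearly 40 days. Is that right?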
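The arithmetic behind that estimate, using the 14 hours per epoch and 70 epochs quoted above (the function name is just for illustration):

```python
def estimate_training_time(hours_per_epoch, num_epochs):
    """Return (total_hours, total_days) for a fixed per-epoch duration."""
    total_hours = hours_per_epoch * num_epochs
    return total_hours, total_hours / 24


total_hours, total_days = estimate_training_time(14, 70)
print(f"{total_hours} hours = {total_days:.1f} days")  # 980 hours = 40.8 days
```

So roughly 40 days follows directly from the two figures, assuming the per-epoch time stays constant and the run is never interrupted.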
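AI computing on aihub.or.kr looks like it cannot be applied for until May. I am in contact with didim365 and am also considering Google Colab.
If this should be posted in English, I will post it again. Thank you.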

Linked Issues

resolved #

merlionfish · 2022-04-22T01:45:02Z

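Increasing the memory setting from 8GB to 16GB in the %UserProfile%\.wslconfig file does make it work, but it takes an extremely long time.
Even with the settings shown below, checking inside the running Docker container shows memory going from 8GB with 2GB swap to 16GB with 4GB swap.
A swap value of -1 means unlimited.
In config/train/ds2_train.yaml:
use_cuda: true -> use_cuda: false (I will change this back and check the results after buying the NVIDIA RTX 3080 Ti; the current machine is an i5 with UHD Graphics 630)
save_result_every: 1000 -> 10
checkpoint_every: 5000 -> 300
resume: false -> resume: true. I am testing on two devices, and the Surface Pro 8 gets Killed after 5 or 11 hours of running, so I changed this to resume: true. The existing checkpoint.py seemed to have a bug, so I modified it.
After editing kospeech/checkpoint/checkpoint.py, I confirmed that resume: true works correctly. Resuming works well on the Surface Pro 8.
Today I will post how long dataset training takes on an i9 12th gen, 64GB, ASUS, 1TB SSD, RTX 3080 Ti machine.
At the same time, I will also try to test how long it takes on a didim365 server (four Tesla V100 graphics cards), if possible.
STT -> TTS takes a great deal of time.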

[omitted]
def get_latest_checkpoint(self):
    """
    returns the path to the last saved checkpoint's subdirectory.
    Precondition: at least one checkpoint has been made (i.e., latest checkpoint subdirectory exists).
    """
    # Newest top-level entry under LOAD_PATH (names sorted in descending order).
    checkpoints_path = sorted(os.listdir(self.LOAD_PATH), reverse=True)[0]
    # Entries inside that directory, newest name first.
    sorted_listdir = sorted(os.listdir(os.path.join(self.LOAD_PATH, checkpoints_path)), reverse=True)
    print("sorted_listdir[0]= ", sorted_listdir[0])
    print("sorted_listdir[1]= ", sorted_listdir[1])
    # print("sorted_listdir[2]= ", sorted_listdir[2])
    # Note: the second entry is used, not the first.
    checkpoints_path = os.path.join(checkpoints_path, sorted_listdir[1])
    print("checkpoints_path: ", checkpoints_path)
    # Prefix LOAD_PATH so the returned path includes the load directory.
    checkpoints_path = os.path.join(self.LOAD_PATH, checkpoints_path)
    print("Checkpoints_path: ", checkpoints_path)
    return checkpoints_path

C:\Users\username>type .wslconfig
[wsl2]
memory=32GB
swap=-1
