OS.ERROR suddenly raised while training on 4.88 million images #453

Open
2575044704 opened this issue Jun 24, 2024 · 4 comments

@2575044704

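I was training on 4.88 million images on a machine with two H100 80G GPUs. About two hours after training started, an error was suddenly raised and the run failed.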

I spent more than ten hours checking every scraped image, and none of them are corrupted, so I can't make sense of this sudden error.
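For anyone repeating the dataset check: the log below also warns that 16580 images have no caption file. Here is a minimal sketch for listing them with only the standard library. It assumes a caption shares the image's file name with the extension replaced by `.txt` (the `caption_extension` shown in the log). The `find_uncaptioned` helper and the `IMAGE_EXTS` set are hypothetical names for illustration, not part of the training scripts, and this check is not claimed to be related to the error itself.

```python
import os

# Assumed set of image extensions; adjust to match the dataset.
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp", ".bmp"}


def find_uncaptioned(image_dir, caption_ext=".txt"):
    """Return sorted names of images in image_dir with no sibling caption file.

    Assumption: the caption for "10060.webp" is "10060" + caption_ext.
    """
    # Read the directory once and keep the names in a set, so each
    # caption lookup is a set membership test rather than a stat call.
    with os.scandir(image_dir) as entries:
        names = {entry.name for entry in entries if entry.is_file()}

    missing = []
    for name in names:
        stem, ext = os.path.splitext(name)
        if ext.lower() in IMAGE_EXTS and stem + caption_ext not in names:
            missing.append(name)
    return sorted(missing)
```

Calling `find_uncaptioned("/train3/1_data")` would return the file names to compare against the ones listed in the warning.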

Log output:

2024-06-24 20:19:52 INFO     found directory /train3/1_data contains 4880036 image files                                                         train_util.py:1519
2024-06-24 20:19:52 INFO     found directory /train3/1_data contains 4880036 image files                                                         train_util.py:1519
2024-06-24 20:21:41 WARNING  No caption file found for 16580 images. Training will continue without captions for these images. If class token    train_util.py:1550
                             exists, it will be used. /                                                                                                            
                             16580枚の画像にキャプションファイルが見つかりませんでした。これらの画像についてはキャプションなしで学習を続行します                   
                             。class tokenが存在する場合はそれを使います。                                                                                         
                    WARNING  /train3/1_data/10060.webp                                                                                           train_util.py:1557
                    WARNING  /train3/1_data/10067.webp                                                                                           train_util.py:1557
                    WARNING  /train3/1_data/10068.webp                                                                                           train_util.py:1557
                    WARNING  /train3/1_data/10069.webp                                                                                           train_util.py:1557
                    WARNING  /train3/1_data/10075.webp                                                                                           train_util.py:1557
                    WARNING  /train3/1_data/10090.webp... and 16575 more                                                                         train_util.py:1555
2024-06-24 20:21:41 WARNING  No caption file found for 16580 images. Training will continue without captions for these images. If class token    train_util.py:1550
                             exists, it will be used. /                                                                                                            
                             16580枚の画像にキャプションファイルが見つかりませんでした。これらの画像についてはキャプションなしで学習を続行します                   
                             。class tokenが存在する場合はそれを使います。                                                                                         
                    WARNING  /train3/1_data/10060.webp                                                                                           train_util.py:1557
                    WARNING  /train3/1_data/10067.webp                                                                                           train_util.py:1557
                    WARNING  /train3/1_data/10068.webp                                                                                           train_util.py:1557
                    WARNING  /train3/1_data/10069.webp                                                                                           train_util.py:1557
                    WARNING  /train3/1_data/10075.webp                                                                                           train_util.py:1557
                    WARNING  /train3/1_data/10090.webp... and 16575 more                                                                         train_util.py:1555
2024-06-24 20:21:55 INFO     4880036 train images with repeating.                                                                                train_util.py:1613
                    INFO     0 reg images.                                                                                                       train_util.py:1616
                    WARNING  no regularization images / 正則化画像が見つかりませんでした                                                         train_util.py:1621
2024-06-24 20:21:55 INFO     4880036 train images with repeating.                                                                                train_util.py:1613
                    INFO     0 reg images.                                                                                                       train_util.py:1616
                    WARNING  no regularization images / 正則化画像が見つかりませんでした                                                         train_util.py:1621
                    INFO     [Dataset 0]                                                                                                         config_util.py:565
                               batch_size: 16                                                                                                                      
                               resolution: (1024, 1024)                                                                                                            
                               enable_bucket: True                                                                                                                 
                               network_multiplier: 1.0                                                                                                             
                               min_bucket_reso: 64                                                                                                                 
                               max_bucket_reso: 2048                                                                                                               
                               bucket_reso_steps: 64                                                                                                               
                               bucket_no_upscale: False                                                                                                            
                                                                                                                                                                   
                               [Subset 0 of Dataset 0]                                                                                                             
                                 image_dir: "/train3/1_data"                                                                                                       
                                 image_count: 4880036                                                                                                              
                                 num_repeats: 1                                                                                                                    
                                 shuffle_caption: True                                                                                                             
                                 keep_tokens: 0                                                                                                                    
                                 keep_tokens_separator: |||                                                                                                        
                                 secondary_separator: None                                                                                                         
                                 enable_wildcard: False                                                                                                            
                                 caption_dropout_rate: 0.0                                                                                                         
                                 caption_dropout_every_n_epoches: 0                                                                                                
                                 caption_tag_dropout_rate: 0.1                                                                                                     
                                 caption_prefix: None                                                                                                              
                                 caption_suffix: None                                                                                                              
                                 color_aug: False                                                                                                                  
                                 flip_aug: False                                                                                                                   
                                 face_crop_aug_range: None                                                                                                         
                                 random_crop: False                                                                                                                
                                 token_warmup_min: 1,                                                                                                              
                                 token_warmup_step: 0,                                                                                                             
                                 is_reg: False                                                                                                                     
                                 class_tokens: data                                                                                                                
                                 caption_extension: .txt                                                                                                           
                                                                                                                                                                   
                                                                                                                                                                   
                    INFO     [Dataset 0]                                                                                                         config_util.py:571
                    INFO     loading image sizes.                                                                                                 train_util.py:853
                    INFO     [Dataset 0]                                                                                                         config_util.py:565
                               batch_size: 16                                                                                                                      
                               resolution: (1024, 1024)                                                                                                            
                               enable_bucket: True                                                                                                                 
                               network_multiplier: 1.0                                                                                                             
                               min_bucket_reso: 64                                                                                                                 
                               max_bucket_reso: 2048                                                                                                               
                               bucket_reso_steps: 64                                                                                                               
                               bucket_no_upscale: False                                                                                                            
                                                                                                                                                                   
                               [Subset 0 of Dataset 0]                                                                                                             
                                 image_dir: "/train3/1_data"                                                                                                       
                                 image_count: 4880036                                                                                                              
                                 num_repeats: 1                                                                                                                    
                                 shuffle_caption: True                                                                                                             
                                 keep_tokens: 0                                                                                                                    
                                 keep_tokens_separator: |||                                                                                                        
                                 secondary_separator: None                                                                                                         
                                 enable_wildcard: False                                                                                                            
                                 caption_dropout_rate: 0.0                                                                                                         
                                 caption_dropout_every_n_epoches: 0                                                                                                
                                 caption_tag_dropout_rate: 0.1                                                                                                     
                                 caption_prefix: None                                                                                                              
                                 caption_suffix: None                                                                                                              
                                 color_aug: False                                                                                                                  
                                 flip_aug: False                                                                                                                   
                                 face_crop_aug_range: None                                                                                                         
                                 random_crop: False                                                                                                                
                                 token_warmup_min: 1,                                                                                                              
                                 token_warmup_step: 0,                                                                                                             
                                 is_reg: False                                                                                                                     
                                 class_tokens: data                                                                                                                
                                 caption_extension: .txt                                                                                                           
                                                                                                                                                                   
                                                                                                                                                                   
                    INFO     [Dataset 0]                                                                                                         config_util.py:571
                    INFO     loading image sizes.                                                                                                 train_util.py:853

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4880036/4880036 [00:47<00:00, 103732.16it/s]
2024-06-24 20:22:42 INFO     make buckets                                                                                                         train_util.py:859

2024-06-24 20:22:42 INFO     make buckets                                                                                                         train_util.py:859
2024-06-24 20:23:02 INFO     number of images (including repeats) / 各bucketの画像枚数(繰り返し回数を含む)                                      train_util.py:905
                    INFO     bucket 0: resolution (64, 2048), count: 432                                                                          train_util.py:910
                    INFO     bucket 1: resolution (128, 2048), count: 877                                                                         train_util.py:910
                    INFO     bucket 2: resolution (192, 2048), count: 1172                                                                        train_util.py:910
                    INFO     bucket 3: resolution (256, 2048), count: 1240                                                                        train_util.py:910
                    INFO     bucket 4: resolution (320, 2048), count: 1414                                                                        train_util.py:910
                    INFO     bucket 5: resolution (384, 2048), count: 1588                                                                        train_util.py:910
                    INFO     bucket 6: resolution (448, 2048), count: 2826                                                                        train_util.py:910
                    INFO     bucket 7: resolution (512, 1856), count: 2316                                                                        train_util.py:910
                    INFO     bucket 8: resolution (512, 1920), count: 628                                                                         train_util.py:910
                    INFO     bucket 9: resolution (512, 1984), count: 509                                                                         train_util.py:910
                    INFO     bucket 10: resolution (512, 2048), count: 1526                                                                       train_util.py:910
                    INFO     bucket 11: resolution (576, 1664), count: 16267                                                                      train_util.py:910
                    INFO     bucket 12: resolution (576, 1728), count: 12123                                                                      train_util.py:910
                    INFO     bucket 13: resolution (576, 1792), count: 13783                                                                      train_util.py:910
                    INFO     bucket 14: resolution (640, 1536), count: 17673                                                                      train_util.py:910
                    INFO     bucket 15: resolution (640, 1600), count: 13667                                                                      train_util.py:910
                    INFO     bucket 16: resolution (704, 1408), count: 60986                                                                      train_util.py:910
                    INFO     bucket 17: resolution (704, 1472), count: 30709                                                                      train_util.py:910
                    INFO     bucket 18: resolution (768, 1280), count: 228754                                                                     train_util.py:910
                    INFO     bucket 19: resolution (768, 1344), count: 137415                                                                     train_util.py:910
                    INFO     bucket 20: resolution (832, 1216), count: 1792153                                                                    train_util.py:910
                    INFO     bucket 21: resolution (896, 1152), count: 732682                                                                     train_util.py:910
                    INFO     bucket 22: resolution (960, 1088), count: 307066                                                                     train_util.py:910
                    INFO     bucket 23: resolution (1024, 1024), count: 417711                                                                    train_util.py:910
                    INFO     bucket 24: resolution (1088, 960), count: 160702                                                                     train_util.py:910
                    INFO     bucket 25: resolution (1152, 896), count: 315880                                                                     train_util.py:910
                    INFO     bucket 26: resolution (1216, 832), count: 347554                                                                     train_util.py:910
                    INFO     bucket 27: resolution (1280, 768), count: 81520                                                                      train_util.py:910
                    INFO     bucket 28: resolution (1344, 768), count: 125354                                                                     train_util.py:910
                    INFO     bucket 29: resolution (1408, 704), count: 23818                                                                      train_util.py:910
                    INFO     bucket 30: resolution (1472, 704), count: 10988                                                                      train_util.py:910
2024-06-24 20:23:03 INFO     bucket 31: resolution (1536, 640), count: 7141                                                                       train_util.py:910
                    INFO     bucket 32: resolution (1600, 640), count: 3933                                                                       train_util.py:910
                    INFO     bucket 33: resolution (1664, 576), count: 2466                                                                       train_util.py:910
                    INFO     bucket 34: resolution (1728, 576), count: 1323                                                                       train_util.py:910
                    INFO     bucket 35: resolution (1792, 576), count: 1158                                                                       train_util.py:910
                    INFO     bucket 36: resolution (1856, 512), count: 734                                                                        train_util.py:910
                    INFO     bucket 37: resolution (1920, 512), count: 197                                                                        train_util.py:910
                    INFO     bucket 38: resolution (1984, 512), count: 153                                                                        train_util.py:910
                    INFO     bucket 39: resolution (2048, 64), count: 31                                                                          train_util.py:910
                    INFO     bucket 40: resolution (2048, 128), count: 64                                                                         train_util.py:910
                    INFO     bucket 41: resolution (2048, 192), count: 87                                                                         train_util.py:910
                    INFO     bucket 42: resolution (2048, 256), count: 127                                                                        train_util.py:910
                    INFO     bucket 43: resolution (2048, 320), count: 186                                                                        train_util.py:910
                    INFO     bucket 44: resolution (2048, 384), count: 278                                                                        train_util.py:910
                    INFO     bucket 45: resolution (2048, 448), count: 437                                                                        train_util.py:910
                    INFO     bucket 46: resolution (2048, 512), count: 388                                                                        train_util.py:910
2024-06-24 20:23:03 INFO     number of images (including repeats) / 各bucketの画像枚数(繰り返し回数を含む)                                      train_util.py:905
                    INFO     bucket 0: resolution (64, 2048), count: 432                                                                          train_util.py:910
                    INFO     mean ar error (without repeats): 0.02633931813702877                                                                 train_util.py:915
                    INFO     bucket 1: resolution (128, 2048), count: 877                                                                         train_util.py:910
                    INFO     bucket 2: resolution (192, 2048), count: 1172                                                                        train_util.py:910
                    INFO     bucket 3: resolution (256, 2048), count: 1240                                                                        train_util.py:910
                    INFO     bucket 4: resolution (320, 2048), count: 1414                                                                        train_util.py:910
                    INFO     bucket 5: resolution (384, 2048), count: 1588                                                                        train_util.py:910
                    INFO     bucket 6: resolution (448, 2048), count: 2826                                                                        train_util.py:910
                    INFO     bucket 7: resolution (512, 1856), count: 2316                                                                        train_util.py:910
                    INFO     bucket 8: resolution (512, 1920), count: 628                                                                         train_util.py:910
                    INFO     bucket 9: resolution (512, 1984), count: 509                                                                         train_util.py:910
                    INFO     bucket 10: resolution (512, 2048), count: 1526                                                                       train_util.py:910
                    INFO     bucket 11: resolution (576, 1664), count: 16267                                                                      train_util.py:910
                    INFO     bucket 12: resolution (576, 1728), count: 12123                                                                      train_util.py:910
                    INFO     bucket 13: resolution (576, 1792), count: 13783                                                                      train_util.py:910
                    INFO     bucket 14: resolution (640, 1536), count: 17673                                                                      train_util.py:910
                    INFO     bucket 15: resolution (640, 1600), count: 13667                                                                      train_util.py:910
                    INFO     bucket 16: resolution (704, 1408), count: 60986                                                                      train_util.py:910
                    INFO     bucket 17: resolution (704, 1472), count: 30709                                                                      train_util.py:910
                    INFO     bucket 18: resolution (768, 1280), count: 228754                                                                     train_util.py:910
                    INFO     bucket 19: resolution (768, 1344), count: 137415                                                                     train_util.py:910
                    INFO     bucket 20: resolution (832, 1216), count: 1792153                                                                    train_util.py:910
                    INFO     bucket 21: resolution (896, 1152), count: 732682                                                                     train_util.py:910
                    INFO     bucket 22: resolution (960, 1088), count: 307066                                                                     train_util.py:910
                    INFO     bucket 23: resolution (1024, 1024), count: 417711                                                                    train_util.py:910
                    INFO     bucket 24: resolution (1088, 960), count: 160702                                                                     train_util.py:910
                    INFO     bucket 25: resolution (1152, 896), count: 315880                                                                     train_util.py:910
                    INFO     bucket 26: resolution (1216, 832), count: 347554                                                                     train_util.py:910
                    INFO     bucket 27: resolution (1280, 768), count: 81520                                                                      train_util.py:910
                    INFO     bucket 28: resolution (1344, 768), count: 125354                                                                     train_util.py:910
                    INFO     bucket 29: resolution (1408, 704), count: 23818                                                                      train_util.py:910
                    INFO     bucket 30: resolution (1472, 704), count: 10988                                                                      train_util.py:910
                    INFO     bucket 31: resolution (1536, 640), count: 7141                                                                       train_util.py:910
                    INFO     bucket 32: resolution (1600, 640), count: 3933                                                                       train_util.py:910
                    INFO     bucket 33: resolution (1664, 576), count: 2466                                                                       train_util.py:910
                    INFO     bucket 34: resolution (1728, 576), count: 1323                                                                       train_util.py:910
                    INFO     bucket 35: resolution (1792, 576), count: 1158                                                                       train_util.py:910
                    INFO     bucket 36: resolution (1856, 512), count: 734                                                                        train_util.py:910
                    INFO     bucket 37: resolution (1920, 512), count: 197                                                                        train_util.py:910
                    INFO     bucket 38: resolution (1984, 512), count: 153                                                                        train_util.py:910
                    INFO     bucket 39: resolution (2048, 64), count: 31                                                                          train_util.py:910
                    INFO     bucket 40: resolution (2048, 128), count: 64                                                                         train_util.py:910
                    INFO     bucket 41: resolution (2048, 192), count: 87                                                                         train_util.py:910
                    INFO     bucket 42: resolution (2048, 256), count: 127                                                                        train_util.py:910
                    INFO     bucket 43: resolution (2048, 320), count: 186                                                                        train_util.py:910
                    INFO     bucket 44: resolution (2048, 384), count: 278                                                                        train_util.py:910
                    INFO     bucket 45: resolution (2048, 448), count: 437                                                                        train_util.py:910
                    INFO     bucket 46: resolution (2048, 512), count: 388                                                                        train_util.py:910
                    INFO     mean ar error (without repeats): 0.02633931813702877                                                                 train_util.py:915
2024-06-24 20:23:06 INFO     preparing accelerator                                                                                             train_network.py:225
2024-06-24 20:23:07 INFO     preparing accelerator                                                                                             train_network.py:225
accelerator device: cuda:0
                    INFO     loading model for process 0/2                                                                                    sdxl_train_util.py:30
                    INFO     load StableDiffusion checkpoint: ./train.safetensors                                                             sdxl_train_util.py:70
accelerator device: cuda:1
                    INFO     building U-Net                                                                                                  sdxl_model_util.py:192
                    INFO     loading U-Net from checkpoint                                                                                   sdxl_model_util.py:196
                    INFO     U-Net: <All keys matched successfully>                                                                          sdxl_model_util.py:202
                    INFO     building text encoders                                                                                          sdxl_model_util.py:205
                    INFO     loading text encoders from checkpoint                                                                           sdxl_model_util.py:258
                    INFO     text encoder 1: <All keys matched successfully>                                                                 sdxl_model_util.py:272
2024-06-24 20:23:08 INFO     text encoder 2: <All keys matched successfully>                                                                 sdxl_model_util.py:276
                    INFO     building VAE                                                                                                    sdxl_model_util.py:279
                    INFO     loading VAE from checkpoint                                                                                     sdxl_model_util.py:284
                    INFO     VAE: <All keys matched successfully>                                                                            sdxl_model_util.py:287
2024-06-24 20:23:10 INFO     loading model for process 1/2                                                                                    sdxl_train_util.py:30
                    INFO     load StableDiffusion checkpoint: ./train.safetensors                                                             sdxl_train_util.py:70
                    INFO     building U-Net                                                                                                  sdxl_model_util.py:192
                    INFO     loading U-Net from checkpoint                                                                                   sdxl_model_util.py:196
                    INFO     U-Net: <All keys matched successfully>                                                                          sdxl_model_util.py:202
                    INFO     building text encoders                                                                                          sdxl_model_util.py:205
                    INFO     loading text encoders from checkpoint                                                                           sdxl_model_util.py:258
                    INFO     text encoder 1: <All keys matched successfully>                                                                 sdxl_model_util.py:272
                    INFO     text encoder 2: <All keys matched successfully>                                                                 sdxl_model_util.py:276
                    INFO     building VAE                                                                                                    sdxl_model_util.py:279
                    INFO     loading VAE from checkpoint                                                                                     sdxl_model_util.py:284
                    INFO     VAE: <All keys matched successfully>                                                                            sdxl_model_util.py:287
2024-06-24 20:23:11 INFO     Enable xformers for U-Net                                                                                           train_util.py:2660
2024-06-24 20:23:11 INFO     Enable xformers for U-Net                                                                                           train_util.py:2660
import network module: lycoris.kohya
2024-06-24 20:23:12|[LyCORIS]-INFO: Using rank adaptation algo: lokr
2024-06-24 20:23:12|[LyCORIS]-INFO: Use Dropout value: 0.0
2024-06-24 20:23:12|[LyCORIS]-INFO: Create LyCORIS Module
2024-06-24 20:23:12|[LyCORIS]-INFO: Using rank adaptation algo: lokr
2024-06-24 20:23:12|[LyCORIS]-INFO: Use Dropout value: 0.0
2024-06-24 20:23:12|[LyCORIS]-INFO: Create LyCORIS Module
2024-06-24 20:23:12|[LyCORIS]-INFO: Create LyCORIS Module
2024-06-24 20:23:12|[LyCORIS]-INFO: Create LyCORIS Module
2024-06-24 20:23:12|[LyCORIS]-INFO: create LyCORIS for Text Encoder: 264 modules.
2024-06-24 20:23:12|[LyCORIS]-INFO: Create LyCORIS Module
2024-06-24 20:23:13|[LyCORIS]-INFO: create LyCORIS for Text Encoder: 264 modules.
2024-06-24 20:23:13|[LyCORIS]-INFO: Create LyCORIS Module
2024-06-24 20:23:14|[LyCORIS]-INFO: create LyCORIS for U-Net: 1050 modules.
2024-06-24 20:23:14|[LyCORIS]-INFO: module type table: {'LokrModule': 1058, 'NormModule': 256}
2024-06-24 20:23:14|[LyCORIS]-INFO: enable LyCORIS for text encoder
2024-06-24 20:23:14|[LyCORIS]-INFO: enable LyCORIS for U-Net
2024-06-24 20:23:14 INFO     use Lion optimizer | {'weight_decay': 0.1, 'betas': (0.9, 0.95)}                                                    train_util.py:3878
2024-06-24 20:23:15|[LyCORIS]-INFO: create LyCORIS for U-Net: 1050 modules.
2024-06-24 20:23:15|[LyCORIS]-INFO: module type table: {'LokrModule': 1058, 'NormModule': 256}
2024-06-24 20:23:15|[LyCORIS]-INFO: enable LyCORIS for text encoder
2024-06-24 20:23:15|[LyCORIS]-INFO: enable LyCORIS for U-Net
prepare optimizer, data loader etc.
2024-06-24 20:23:15 INFO     use Lion optimizer | {'weight_decay': 0.1, 'betas': (0.9, 0.95)}                                                    train_util.py:3878
override steps. steps for 10 epochs is / 指定エポックまでのステップ数: 381280
enable full fp16 training.
fatal: not a git repository (or any of the parent directories): .git
running training / 学習開始
  num train images * repeats / 学習画像の数×繰り返し回数: 4880036
  num reg images / 正則化画像の数: 0
  num batches per epoch / 1epochのバッチ数: 152512
  num epochs / epoch数: 10
  batch size per device / バッチサイズ: 16
  gradient accumulation steps / 勾配を合計するステップ数 = 4
  total optimization steps / 学習ステップ数: 381280
fatal: not a git repository (or any of the parent directories): .git

steps:   0%|                                         | 0/381280 [00:00<?, ?it/s]
epoch 1/10
steps:   0%|                                                                                         | 373/381280 [1:58:19<2014:00:30, 19.03s/it, avr_loss=0.0848]
steps:   0%|                                                                                         | 374/381280 [1:58:25<2010:03:41, 19.00s/it, avr_loss=0.0848]
steps:   0%|                                                                                         | 374/381280 [1:58:25<2010:03:41, 19.00s/it, avr_loss=0.0848]
steps:   0%|                                                                                         | 374/381280 [1:58:30<2011:31:38, 19.01s/it, avr_loss=0.0848]
steps:   0%|                                                                                         | 374/381280 [1:58:35<2012:59:34, 19.03s/it, avr_loss=0.0848]
[rank1]: Traceback (most recent call last):
[rank1]:   File "/sd-scripts/sdxl_train_network.py", line 185, in <module>
[rank1]:     trainer.train(args)
[rank1]:   File "/sd-scripts/train_network.py", line 806, in train
[rank1]:     for step, batch in enumerate(train_dataloader):
[rank1]:   File "/root/.conda/envs/lora/lib/python3.10/site-packages/accelerate/data_loader.py", line 458, in __iter__
[rank1]:     next_batch = next(dataloader_iter)
[rank1]:   File "/root/.conda/envs/lora/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
[rank1]:     data = self._next_data()
[rank1]:   File "/root/.conda/envs/lora/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
[rank1]:     return self._process_data(data)
[rank1]:   File "/root/.conda/envs/lora/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
[rank1]:     data.reraise()
[rank1]:   File "/root/.conda/envs/lora/lib/python3.10/site-packages/torch/_utils.py", line 705, in reraise
[rank1]:     raise exception
[rank1]: OSError: Caught OSError in DataLoader worker process 4.
[rank1]: Original Traceback (most recent call last):
[rank1]:   File "/root/.conda/envs/lora/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
[rank1]:     data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
[rank1]:   File "/root/.conda/envs/lora/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
[rank1]:     data = [self.dataset[idx] for idx in possibly_batched_index]
[rank1]:   File "/root/.conda/envs/lora/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
[rank1]:     data = [self.dataset[idx] for idx in possibly_batched_index]
[rank1]:   File "/root/.conda/envs/lora/lib/python3.10/site-packages/torch/utils/data/dataset.py", line 348, in __getitem__
[rank1]:     return self.datasets[dataset_idx][sample_idx]
[rank1]:   File "/sd-scripts/library/train_util.py", line 1207, in __getitem__
[rank1]:     img, face_cx, face_cy, face_w, face_h = self.load_image_with_face_info(subset, image_info.absolute_path)
[rank1]:   File "/sd-scripts/library/train_util.py", line 1092, in load_image_with_face_info
[rank1]:     img = load_image(image_path)
[rank1]:   File "/sd-scripts/library/train_util.py", line 2352, in load_image
[rank1]:     img = np.array(image, np.uint8)
[rank1]:   File "/root/.conda/envs/lora/lib/python3.10/site-packages/PIL/Image.py", line 696, in __array_interface__
[rank1]:     new["data"] = self.tobytes()
[rank1]:   File "/root/.conda/envs/lora/lib/python3.10/site-packages/PIL/Image.py", line 755, in tobytes
[rank1]:     self.load()
[rank1]:   File "/root/.conda/envs/lora/lib/python3.10/site-packages/PIL/WebPImagePlugin.py", line 160, in load
[rank1]:     data, timestamp, duration = self._get_next()
[rank1]:   File "/root/.conda/envs/lora/lib/python3.10/site-packages/PIL/WebPImagePlugin.py", line 127, in _get_next
[rank1]:     ret = self._decoder.get_next()
[rank1]: OSError: failed to read next frame


steps:   0%|                                                                                         | 374/381280 [1:58:40<2014:26:18, 19.04s/it, avr_loss=0.0848]
W0624 22:22:13.858000 140247365268672 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 75699 closing signal SIGTERM
E0624 22:22:14.275000 140247365268672 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 1 (pid: 75700) of binary: /root/.conda/envs/lora/bin/python3
Traceback (most recent call last):
  File "/root/.conda/envs/lora/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/.conda/envs/lora/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/root/.conda/envs/lora/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1027, in <module>
    main()
  File "/root/.conda/envs/lora/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1023, in main
    launch_command(args)
  File "/root/.conda/envs/lora/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1008, in launch_command
    multi_gpu_launcher(args)
  File "/root/.conda/envs/lora/lib/python3.10/site-packages/accelerate/commands/launch.py", line 666, in multi_gpu_launcher
    distrib_run.run(args)
  File "/root/.conda/envs/lora/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/root/.conda/envs/lora/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/.conda/envs/lora/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=============

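Could the author please help me solve this? This is the only chance in China to approach or surpass the nai3 model. Everything rides on this (crying)

Please, please, I'm begging you!!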

Rnglg2 commented Jun 30, 2024

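Take a look at your log.
It tells you outright that some images have no captions,
and you just kept on training anyway.
Then the error log tells you "failed to read next frame",
which means there is a problem with your dataset,
possibly caused by corrupted images.
Run a script to check the images: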

from PIL import Image

try:
    with Image.open(image_file_path) as img:
        img.verify()  # integrity check; does not decode the pixel data
except (IOError, SyntaxError) as e:
    print(f"Corrupted image file: {image_file_path}, error: {e}")
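One caveat on the check above: per the Pillow documentation, `Image.verify()` does not actually decode the image data, while the traceback in this issue fails inside `load()` (reached through `np.array(image, np.uint8)` in `load_image`). So a file could plausibly pass `verify()` and still fail during training. Below is a minimal sketch that forces a full decode of every file instead. The function name `find_undecodable_images` and the extension list are hypothetical, not part of sd-scripts, and a scan like this over 4.88 million files will take a while.

```python
from pathlib import Path

from PIL import Image


def find_undecodable_images(root, extensions=(".webp", ".png", ".jpg", ".jpeg")):
    """Return a list of (path, error message) for files that fail a full decode.

    `root` and `extensions` are illustrative parameters (assumptions).
    """
    bad = []
    for path in sorted(Path(root).rglob("*")):
        if path.suffix.lower() not in extensions:
            continue
        try:
            with Image.open(path) as img:
                # load() decodes the pixel data, which is the step that
                # raised "failed to read next frame" in the traceback.
                img.load()
        except (OSError, SyntaxError, ValueError) as e:
            bad.append((path, str(e)))
    return bad
```

Usage would be along the lines of `for path, err in find_undecodable_images("/train3/1_data"): print(path, err)`, then removing or re-downloading whatever it reports.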

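Also, for a project this large I recommend using kohya-sd-script.
秋叶's spoon-fed package is only good for small training runs.
With kohya-sd-script, turn on saving the training state every N steps,
so you don't have to worry about the run blowing up halfway through.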

Rnglg2 commented Jun 30, 2024

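If you want to skip the time-consuming latent cache check,
you can modify the is_disk_cached_latents_is_expected function in sd-scripts/library/train_util.py
so that it simply returns True.
Good luck with your training.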
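The same effect could presumably be had without editing the file, by swapping the function out at runtime. This is a minimal sketch, assuming the check is looked up as a module-level name when it is called (not verified against the current sd-scripts source). `skip_latent_cache_check` is a hypothetical helper, and the stand-in module at the bottom only illustrates the call. With the check short-circuited, a stale or incomplete cache file would no longer be caught, so this trades safety for startup time.

```python
import types


def skip_latent_cache_check(train_util_module):
    """Replace is_disk_cached_latents_is_expected with a stub returning True.

    Returns the original function so it can be restored later.
    """
    original = train_util_module.is_disk_cached_latents_is_expected

    def always_expected(*args, **kwargs):
        # Accept any arguments, since the real signature is not assumed here.
        return True

    train_util_module.is_disk_cached_latents_is_expected = always_expected
    return original


# Illustrative stand-in for library.train_util (an assumption, not the real module).
fake_train_util = types.SimpleNamespace(
    is_disk_cached_latents_is_expected=lambda *args, **kwargs: False
)
original_check = skip_latent_cache_check(fake_train_util)
```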

yuno779 commented Jun 30, 2024

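The images have no captions and the dataset has problems. You can skip the check if you like; see @Rnglg2's direct modification. All I can say is good luck.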

poi6poi6 commented Jun 30, 2024

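Your script gave you a chance to troubleshoot this on your own, once in each of two languages.
But you didn't take it.
-----_-----
If you're sure no images are corrupted, write a script to check whether any images are missing their token (caption) files.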
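A check like that could look roughly like the sketch below. It assumes each caption lives next to its image with the same base name and a `.txt` extension. The extension is an assumption and should match whatever caption extension the training config uses, and `find_images_without_captions` is a hypothetical name. The log above already reports 16580 images without captions, so this would mainly serve to get the full list.

```python
from pathlib import Path


def find_images_without_captions(
    root,
    image_exts=(".webp", ".png", ".jpg", ".jpeg"),
    caption_ext=".txt",  # assumption: adjust to the caption extension in use
):
    """Return image paths under `root` that have no sibling caption file."""
    missing = []
    for path in sorted(Path(root).rglob("*")):
        if path.suffix.lower() not in image_exts:
            continue
        # Same directory, same base name, caption extension instead of image one.
        if not path.with_suffix(caption_ext).is_file():
            missing.append(path)
    return missing
```

Calling `find_images_without_captions("/train3/1_data")` returns the image paths that lack a caption, which can then be captioned or moved out of the training directory.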