converted models producing noise #446

Open · ssube opened this issue Dec 23, 2023 · 1 comment
Labels: scope/convert, status/progress (issues that are in progress and have a branch), type/bug (broken features)

ssube (Owner) commented Dec 23, 2023

Some recently-converted models are producing random noise, like:

[image: example output showing random noise]

This is:

  • happening with pytorch 2.x, including 2.0 and 2.1
    • even with low_cpu_mem_usage monkeypatches on most/all UNetCondition2DModel ctors (see the sketch after this list)
  • only happening when ONNX_WEB_CONVERT_EXTRACT=FALSE
    • which is necessary for converting some newer models
  • happening with both fp16 and fp32
  • happening with SD v1.5 models
    • does not appear to happen with SDXL
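
A minimal sketch of the kind of low_cpu_mem_usage monkeypatch mentioned above, assuming the goal is simply to force low_cpu_mem_usage=False whenever diffusers loads the UNet; the hook point and class name (diffusers' UNet2DConditionModel) are illustrative rather than the exact onnx-web patch:

```python
# Hypothetical monkeypatch: force low_cpu_mem_usage=False on UNet loads.
# The real onnx-web patch may hook different constructors or classes.
from diffusers import UNet2DConditionModel

_original_from_pretrained = UNet2DConditionModel.from_pretrained.__func__


def _patched_from_pretrained(cls, *args, **kwargs):
    # Override whatever the caller passed so weights are fully materialized.
    kwargs["low_cpu_mem_usage"] = False
    return _original_from_pretrained(cls, *args, **kwargs)


UNet2DConditionModel.from_pretrained = classmethod(_patched_from_pretrained)
```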

Running a diff between a good and a bad copy of the same model shows that most or all of the weights differ:

INFO:__main__:raw data differs for onnx::Mul_9546: -0.12585449
INFO:__main__:raw data differs for onnx::Add_9547: 0.10827637
INFO:__main__:raw data differs for onnx::MatMul_9548: 0.25146484
INFO:__main__:raw data differs for onnx::MatMul_9549: 0.34448242
INFO:__main__:raw data differs for onnx::MatMul_9550: 0.16455078
INFO:__main__:raw data differs for onnx::MatMul_9557: 0.19995117
INFO:__main__:raw data differs for onnx::MatMul_9558: 0.15942383
INFO:__main__:raw data differs for onnx::MatMul_9559: 0.21325684
INFO:__main__:raw data differs for onnx::MatMul_9560: 0.13708496
INFO:__main__:raw data differs for onnx::MatMul_9567: 0.23217773
INFO:__main__:raw data differs for onnx::MatMul_9568: 0.19250488
INFO:__main__:raw data differs for onnx::MatMul_9569: 0.18237305
INFO:__main__:raw data differs for onnx::Mul_9570: -0.03488159
INFO:__main__:raw data differs for onnx::Add_9571: 2.65625
WARNING:__main__:models have 686 differences

This is true for at least the UNet and VAEs.
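
For reference, a rough sketch of the kind of comparison script that produces the log above, assuming both copies are plain .onnx files with embedded weights and that only initializer tensors are compared; the actual script used here may differ:

```python
# Rough sketch of an ONNX weight diff: compare initializer tensors by name.
import logging
import sys

import numpy as np
import onnx
from onnx import numpy_helper

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def diff_models(good_path: str, bad_path: str) -> int:
    good = {i.name: numpy_helper.to_array(i) for i in onnx.load(good_path).graph.initializer}
    bad = {i.name: numpy_helper.to_array(i) for i in onnx.load(bad_path).graph.initializer}

    differences = 0
    for name, good_tensor in good.items():
        bad_tensor = bad.get(name)
        if bad_tensor is None or good_tensor.shape != bad_tensor.shape:
            logger.warning("missing or mismatched tensor: %s", name)
            differences += 1
        elif not np.allclose(good_tensor, bad_tensor):
            # report the largest signed difference, as in the log above
            delta = (good_tensor.astype(np.float32) - bad_tensor.astype(np.float32)).flatten()
            logger.info("raw data differs for %s: %s", name, delta[np.argmax(np.abs(delta))])
            differences += 1

    logger.warning("models have %s differences", differences)
    return differences


if __name__ == "__main__":
    diff_models(sys.argv[1], sys.argv[2])
```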

ssube added the status/confirmed, type/bug, and scope/convert labels on Dec 23, 2023
ssube added this to the v0.11 milestone on Dec 23, 2023
ssube self-assigned this on Dec 23, 2023

ssube (Owner, Author) commented Dec 24, 2023

When this occurs on Windows, it appears to cause a crash rather than noise:

[2023-12-23 20:16:06,568] ERROR: onnx-web worker: directml MainThread onnx_web.chain.pipeline: error while running stage pipeline, 1 retries left
Traceback (most recent call last):
  File "onnx_web\chain\pipeline.py", line 227, in __call__
  File "onnx_web\chain\source_txt2img.py", line 144, in run
  File "diffusers\pipelines\stable_diffusion\pipeline_onnx_stable_diffusion.py", line 433, in __call__
  File "diffusers\pipelines\stable_diffusion\pipeline_onnx_stable_diffusion.py", line 433, in <listcomp>
  File "onnx_web\diffusers\patches\vae.py", line 79, in __call__
  File "diffusers\pipelines\onnx_utils.py", line 60, in __call__
  File "onnxruntime\capi\onnxruntime_inference_collection.py", line 220, in run
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running Add node. Name:'/decoder/mid_block/attentions.0/Add_1' Status Message: D:\a\_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\MLOperatorAuthorImpl.cpp(2759)\onnxruntime_pybind11_state.pyd!00007FF863B5DDF2: (caller: 00007FF863B5DB05) Exception(4) tid(5d98) 80070057 The parameter is incorrect.

I believe this is the same issue, and the different error message is due to DirectML.

I've written a new SD converter that uses the same optimum.main_export call as the SDXL converter, which seems to work on most models (a sketch of that export call follows the list below). Currently testing on the models included in the pre-converted set:

  • Cetus
    • fails on both v4 and Whalefall
  • Dreamshaper
    • works on v8
  • Elegant Entropy
    • works on v1.4
  • Faetastic
    • fails on v2
  • Juggernaut
    • not set up yet, not tested
  • ReV Animated
    • works on v1.2.2-EOL
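
As referenced above, a minimal sketch of the main_export-based conversion, assuming a Stable Diffusion 1.5 checkpoint that diffusers can already load; the model path, output directory, and options shown are illustrative rather than the exact onnx-web converter code:

```python
# Illustrative use of optimum's ONNX exporter for an SD 1.5 pipeline;
# the real converter wraps this call with onnx-web's own options.
from optimum.exporters.onnx import main_export

main_export(
    model_name_or_path="runwayml/stable-diffusion-v1-5",  # or a local diffusers directory
    output="./models/stable-diffusion-v1-5-onnx",
    # task is auto-detected for diffusers models; it can also be passed
    # explicitly (the task name varies across optimum versions).
)
```
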
Examples:

[images: sample outputs from the converted models]

Since some models fail to convert with both methods, the problem may be in the models themselves or somewhere upstream. All of the failing models (Cetus and Faetastic) convert correctly when using pipeline: txt2img-legacy and ONNX_WEB_CONVERT_EXTRACT=TRUE (which is the default again).

ssube added the status/progress label and removed the status/confirmed label on Dec 24, 2023
ssube modified the milestones: v0.11 → v0.12 on Dec 24, 2023