[Performance] Whisper model inference results incorrect after Transformer Optimizer #21150
Labels
- ep:DML — issues related to the DirectML execution provider
- platform:windows — issues related to the Windows platform
- quantization — issues related to quantization
Describe the issue
I exported the Whisper models directly to ONNX from the whisper module and wrote an inference script; the results are correct.
To reduce the runtime, I then ran the models through the BART transformer optimizer. The number of heads and the hidden size I passed are correct, because I used the values given in the Whisper paper. With the same inference script, the optimized model produces different results and generation does not terminate correctly. I suspect the attention subgraphs in the Whisper model are not correctly connected after optimization; there may be a bug in the fusion pass.
To reproduce
Whisper medium model
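A sketch of the optimization step described above, using the `onnxruntime.transformers.optimizer` CLI with `--model_type bart` and the Whisper-medium dimensions from the paper (16 heads, hidden size 1024). The file names are placeholders; this assumes the encoder/decoder ONNX files were already exported from the whisper module.

```shell
# Optimize the exported Whisper-medium decoder with the BART fusion rules.
# Input/output paths below are placeholders for the actual exported models.
python -m onnxruntime.transformers.optimizer \
  --input whisper_medium_decoder.onnx \
  --output whisper_medium_decoder_opt.onnx \
  --model_type bart \
  --num_heads 16 \
  --hidden_size 1024
```

Running the same inference script against `whisper_medium_decoder_opt.onnx` instead of the unoptimized model reproduces the incorrect output.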
Urgency
Yes
Platform
Windows
OS Version
11
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
latest version
ONNX Runtime API
Python
Architecture
X64
Execution Provider
DirectML
Execution Provider Library Version
No response
Model File
No response
Is this a quantized model?
Yes