
[Performance] Whisper model inference results incorrect after Transformer Optimizer #21150

Open
XciciciX opened this issue Jun 24, 2024 · 2 comments
Labels
ep:DML (issues related to the DirectML execution provider) · platform:windows (issues related to the Windows platform) · quantization (issues related to quantization)

Comments

@XciciciX

Describe the issue

I exported the Whisper encoder and decoder to ONNX directly from the whisper module and wrote an inference script; the results are correct.
To reduce the runtime, I ran the Transformers optimizer with the bart model type. The number of heads and the hidden size are correct because I followed the parameters given in the Whisper paper. After optimization, the same inference script produces different results and the decoding does not terminate correctly. I suspect the attention nodes in the Whisper model are not connected correctly after fusion; there may be a bug.

To reproduce

Whisper medium model

Urgency

Yes

Platform

Windows

OS Version

11

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

latest version

ONNX Runtime API

Python

Architecture

X64

Execution Provider

DirectML

Execution Provider Library Version

No response

Model File

No response

Is this a quantized model?

Yes

@github-actions github-actions bot added the ep:DML, platform:windows, and quantization labels Jun 24, 2024
@tianleiwu
Contributor

@XciciciX, could you share the detailed steps to reproduce the issue?

For example, the command lines used to export the ONNX model, the command used to optimize it, and the test script. Or share the optimized ONNX model. You can also check the operator spec if you suspect some attention node is not fused correctly: https://github.com/microsoft/onnxruntime/blob/main/docs/ContribOperators.md
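
For example, one quick way to see which attention operators ended up in the optimized graph is to count operator types with the onnx Python package (a minimal sketch; adjust the path to your optimized model):

import onnx
from collections import Counter

# Load the optimized decoder and count operator types in the graph.
model = onnx.load("./whisper-medium-onnx-test/decoder__mha.onnx")
op_counts = Counter(node.op_type for node in model.graph.node)

# If fusion worked, MultiHeadAttention (or Attention) nodes should appear here.
for op_type in ("Attention", "MultiHeadAttention", "LayerNormalization", "SkipLayerNormalization"):
    print(op_type, op_counts.get(op_type, 0))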

@XciciciX
Author

Thank you for your response. @tianleiwu

Here is part of the code related to model export.

import torch
import whisper

# Load the PyTorch Whisper model and compute the log-mel features
# (compute_features is a helper defined elsewhere in my script).
model = whisper.load_model("medium")
x_mel = compute_features("./data/test.mp3")

# Run the encoder once to obtain audio features for exporting the decoder.
x_audio = model.encoder(x_mel)
torch.onnx.export(
    model.encoder,
    (x_mel,),
    "./models/encoder.onnx",
    input_names=["x"],
    output_names=["out"],
    dynamic_axes={
        "x": {0: "batch"},
        "out": {0: "batch"},
    },
)

# x_tokens holds the prompt token ids (prepared elsewhere in the script).
torch.onnx.export(
    model.decoder,
    (x_tokens, x_audio),
    "./models/decoder.onnx",
    input_names=["tokens", "audio"],
    output_names=["out"],
    dynamic_axes={
        "tokens": {0: "batch", 1: "seq"},
        "audio": {0: "batch"},
        "out": {0: "batch", 1: "seq"},
    },
)

Then they are optimized with:

python -m onnxruntime.transformers.optimizer --input ./whisper-medium-onnx/decoder.onnx --output ./whisper-medium-onnx-test/decoder__mha.onnx --float16 --model_type bart --num_heads 16 --hidden_size 1024 --use_multi_head_attention
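
Since --float16 is passed, the optimized models may expect float16 inputs. For reference, a minimal sketch (using the standard ONNX Runtime Python API, with the path from the command above) to confirm what element types the optimized decoder expects:

import onnxruntime

# Inspect the optimized decoder's expected input names, element types, and shapes.
sess = onnxruntime.InferenceSession(
    "./whisper-medium-onnx-test/decoder__mha.onnx",
    providers=["CPUExecutionProvider"],
)
for inp in sess.get_inputs():
    # inp.type is a string such as 'tensor(float)' or 'tensor(float16)'.
    print(inp.name, inp.type, inp.shape)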

Here are the exported models
https://drive.google.com/drive/folders/16tbQ46OB91hQtIC4XJJvwNVnl5YaVU60?usp=drive_link

encoder.onnx and decoder.onnx are not optimized. The ones with _mha are optimized.

Here is the test script. The original models run correctly; the optimized models also run, but the results are wrong.

import time

import numpy as np
import onnxruntime
import torch

# tokenizer, max_tokens, and compute_features are defined earlier in the script.
sess_encoder = onnxruntime.InferenceSession("./models/encoder.onnx", providers=["CUDAExecutionProvider"])
sess_decoder = onnxruntime.InferenceSession("./models/decoder.onnx", providers=["CUDAExecutionProvider"])

start = time.time()

x_mel_fp32 = compute_features("./data/test.mp3")
# x_mel_fp16 is used when the --float16 optimized models are swapped in.
x_mel_fp16 = x_mel_fp32.to(dtype=torch.float16)

# Run the encoder once to get the audio features.
out_encoder, = sess_encoder.run(["out"], {"x": x_mel_fp32.numpy()})

# Greedy decoding loop.
tokens = list(tokenizer.sot_sequence_including_notimestamps)
next_token = tokenizer.sot

while len(tokens) <= max_tokens and next_token != tokenizer.eot:
    out_decoder, = sess_decoder.run(
        ["out"],
        {
            "tokens": np.asarray([tokens], dtype="int64"),
            "audio": out_encoder,
        },
    )
    next_token = out_decoder[0, -1].argmax()
    tokens.append(next_token)

print("took", time.time() - start, "seconds")

print(tokenizer.decode(tokens))
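
To help localize where the outputs diverge, a minimal sketch (my own check, not part of the original script) that runs the un-optimized and optimized decoders on identical inputs and reports the largest difference. It reuses out_encoder and tokenizer from the script above, and assumes the optimized decoder takes float16 audio features:

import numpy as np
import onnxruntime

# Reference (un-optimized) decoder on CPU, optimized decoder on DirectML.
sess_ref = onnxruntime.InferenceSession("./models/decoder.onnx", providers=["CPUExecutionProvider"])
sess_opt = onnxruntime.InferenceSession(
    "./whisper-medium-onnx-test/decoder__mha.onnx",
    providers=["DmlExecutionProvider"],
)

tokens_np = np.asarray([list(tokenizer.sot_sequence_including_notimestamps)], dtype="int64")

ref_out, = sess_ref.run(["out"], {"tokens": tokens_np, "audio": out_encoder})
# Cast the audio features to fp16 for the --float16 optimized model (assumption; check get_inputs()).
opt_out, = sess_opt.run(["out"], {"tokens": tokens_np, "audio": out_encoder.astype(np.float16)})

print("max abs diff:", np.max(np.abs(ref_out - opt_out.astype(np.float32))))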
