int8 performance #44

Open
csukuangfj opened this issue Dec 28, 2022 · 4 comments
csukuangfj commented Dec 28, 2022

I just tested the performance of int8 quantization on an Android phone. The phone has 8 CPU cores, and we used 8 threads for testing.

The following table lists the processing time for decoding a wave file that is 5.1 seconds long.

| | first run | second run | third run |
| --- | --- | --- | --- |
| cpu, fp16 | 5.33 s | 5.44 s | 5.28 s |
| cpu, int8 | 4.34 s | 4.69 s | 4.55 s |

(Screenshots of `About phone` omitted.)
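For context, these timings can be turned into a real-time factor (RTF = processing time / audio duration); RTF below 1 means decoding is faster than real time. A quick sketch using the first-run numbers from the table above (the 5.1 s duration is the test wave's length):

```shell
# Real-time factor for the first-run timings above, against the
# 5.1-second test wave. RTF < 1 means faster than real time.
for label_time in "fp16:5.33" "int8:4.34"; do
  label=${label_time%%:*}
  t=${label_time##*:}
  awk -v l="$label" -v t="$t" 'BEGIN { printf "%s RTF: %.2f\n", l, t / 5.1 }'
done
```

On this phone, fp16 decoding is slightly slower than real time, while int8 is comfortably faster.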
Output of `cat /proc/cpuinfo` (processors 1–7 report identical values, so only processor 0 is shown):

```
Processor	: AArch64 Processor rev 4 (aarch64)
processor	: 0
BogoMIPS	: 3.84
Features	: fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer	: 0x41
CPU architecture: 8
CPU variant	: 0x0
CPU part	: 0xd03
CPU revision	: 4
```
csukuangfj (Collaborator, Author) commented:

(Screenshot omitted.)

Please see
https://www.gsmarena.com/honor_5c-8074.php
for hardware information about the phone.

csukuangfj (Collaborator, Author) commented:

Test commands:

run.sh:

```shell
export LD_LIBRARY_PATH=$PWD

./sherpa-ncnn \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-06/tokens.txt \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.param \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.bin \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.param \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.bin \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.param \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.bin \
  "$@"
```

Invocation:

```shell
time ./run.sh ./sherpa-ncnn-conv-emformer-transducer-2022-12-06/test_wavs/1.wav 8
```

run-8bit.sh:

```shell
export LD_LIBRARY_PATH=$PWD

./sherpa-ncnn \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-06/tokens.txt \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.int8.param \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-06/encoder_jit_trace-pnnx.ncnn.int8.bin \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.param \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-06/decoder_jit_trace-pnnx.ncnn.bin \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.int8.param \
  ./sherpa-ncnn-conv-emformer-transducer-2022-12-06/joiner_jit_trace-pnnx.ncnn.int8.bin \
  "$@"
```

Invocation:

```shell
time ./run-8bit.sh ./sherpa-ncnn-conv-emformer-transducer-2022-12-06/test_wavs/1.wav 8
```

You can download the models from
https://huggingface.co/csukuangfj/sherpa-ncnn-conv-emformer-transducer-2022-12-06
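A minimal driver sketch (assuming `run.sh` and `run-8bit.sh` above, plus the model directory, sit in the current working directory) that reproduces the three-runs-per-configuration pattern used in the tables in this thread:

```shell
# Time each configuration three times, as in the benchmark tables.
# Assumes run.sh, run-8bit.sh, and the model directory from above
# are in the current working directory.
for script in ./run.sh ./run-8bit.sh; do
  for i in 1 2 3; do
    echo "=== $script, run $i ==="
    time "$script" ./sherpa-ncnn-conv-emformer-transducer-2022-12-06/test_wavs/1.wav 8
  done
done
```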

csukuangfj commented Dec 28, 2022

Here are the results on a Xiaomi 11 Ultra:
https://www.gsmarena.com/xiaomi_mi_11_ultra-10737.php

(Screenshot omitted.)

The following table lists the processing time for decoding the same 5.1-second wave file.

| | first run | second run | third run |
| --- | --- | --- | --- |
| cpu, fp16 | 2.33 s | 2.44 s | 2.39 s |
| cpu, int8 | 2.21 s | 2.21 s | 2.18 s |
| gpu (vulkan), fp16 | 9.66 s | 9.49 s | 9.54 s |
| gpu (vulkan), int8 | 2.80 s | 2.76 s | 2.73 s |
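Worth noting: int8 barely changes CPU timings here, but the Vulkan GPU path benefits dramatically. A quick check of the GPU speedup, using the first-run numbers from the table above:

```shell
# Speedup of int8 over fp16 on the gpu (vulkan) path:
# 9.66 s (fp16) vs 2.80 s (int8), first-run numbers from the table.
awk 'BEGIN { printf "gpu int8 speedup: %.2fx\n", 9.66 / 2.80 }'
```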

(Test screenshots omitted: GPU fp16, GPU int8, CPU fp16, CPU int8.)

csukuangfj (Collaborator, Author) commented:

Here are the benchmark results on a Xiaomi 9.

(Screenshots of the benchmark run and `About phone` omitted.)

Time (in seconds) for decoding a 5.1-second wave file:

| configuration | first run | second run | third run |
| --- | --- | --- | --- |
| cpu, fp16, 1 thread | 2.38 | 2.36 | 2.35 |
| cpu, fp16, 2 threads | 1.73 | 1.73 | 1.74 |
| cpu, fp16, 3 threads | 1.60 | 1.60 | 1.60 |
| cpu, fp16, 4 threads | 1.55 | 1.56 | 1.53 |
| cpu, fp16, 5 threads | 2.50 | 2.47 | 2.45 |
| cpu, fp16, 6 threads | 2.40 | 2.40 | 2.83 |
| cpu, fp16, 7 threads | 2.34 | 2.32 | 2.05 |
| cpu, fp16, 8 threads | 2.38 | 2.37 | 2.45 |
| cpu, int8, 1 thread | 2.37 | 2.37 | 2.37 |
| cpu, int8, 2 threads | 1.65 | 1.67 | 1.64 |
| cpu, int8, 3 threads | 1.34 | 1.34 | 1.34 |
| cpu, int8, 4 threads | 1.24 | 1.32 | 1.20 | ← best |
| cpu, int8, 5 threads | 2.63 | 2.97 | 2.35 |
| cpu, int8, 6 threads | 2.31 | 2.81 | 2.64 |
| cpu, int8, 7 threads | 2.24 | 2.22 | 2.25 |
| cpu, int8, 8 threads | 2.32 | 2.29 | 2.33 |
| gpu, int8, 1 thread | 3.56 | 3.57 | 3.55 |
| gpu, int8, 2 threads | 2.34 | 2.35 | 2.34 |
| gpu, int8, 3 threads | 1.97 | 1.96 | 2.06 |
| gpu, int8, 4 threads | 1.96 | 1.85 | 2.01 |
| gpu, int8, 5 threads | 3.52 | 3.55 | 3.47 |
| gpu, int8, 6 threads | 3.35 | 3.28 | 3.30 |
| gpu, int8, 7 threads | 3.14 | 3.29 | 3.31 |
| gpu, int8, 8 threads | 3.07 | 3.13 | 2.97 |
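When comparing rows, averaging the three runs smooths out run-to-run noise. For example, for the best configuration in the table above (cpu, int8, 4 threads):

```shell
# Average of the three runs for the best row (cpu, int8, 4 threads):
# 1.24 s, 1.32 s, 1.20 s, numbers taken from the table above.
awk 'BEGIN { printf "avg: %.2f s\n", (1.24 + 1.32 + 1.20) / 3 }'
```

Note that timings degrade sharply beyond 4 threads in every configuration, so 4 threads appears to be the sweet spot on this phone.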
