Performance on Xeon Scalable #284

Open
regmibijay opened this issue Jun 5, 2024 · 1 comment
@regmibijay

Hello everyone, we are seeing slower than expected inference times on one of our CPU nodes, an Intel(R) Xeon(R) Platinum 8362 CPU @ 2.80GHz, with the following instruction sets:

fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves wbnoinvd arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg rdpid fsrm md_clear flush_l1d arch_capabilities

With the latest versions of neuralchat_server and neural-speed, in combination with intel-extension-for-transformers, using the following config:

host: "0.0.0.0"
port: 8000
model_name_or_path: "/root/Intel/neural-chat-7b-v3-3"
device: cpu
tasks_list: ["textchat"]

optimization:
  use_neural_speed: true
  optimization_type: weight_only
  compute_dtype: fp32
  weight_dtype: int8
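
For reference, the server is launched with the executor API along these lines (assuming the config above is saved as neuralchat.yaml; file paths are illustrative):

from intel_extension_for_transformers.neural_chat import NeuralChatServerExecutor

# Start neuralchat_server with the YAML config shown above
server_executor = NeuralChatServerExecutor()
server_executor(config_file="./neuralchat.yaml", log_file="./neuralchat.log")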

We are seeing extremely slow time-to-first-token with example prompts such as "Tell me about Intel Xeon Scalable Processors."

The measured times are:

Weight Precision | Max Tokens | Response Time
Int8             | unset      | 73 s
Int8             | 128        | 69 s
Int4             | unset      | 73 s
Int4             | 128        | 65 s

Without neural-speed compression of the same model, inference times were only around 20 s.
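
(For reference, the uncompressed baseline can be timed with plain transformers roughly as follows; the prompt and max_new_tokens mirror the table above, everything else is left at defaults:)

import time
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/root/Intel/neural-chat-7b-v3-3"  # same local checkpoint as in the config
prompt = "Tell me about Intel Xeon Scalable Processors."

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)  # plain fp32 weights on CPU

inputs = tokenizer(prompt, return_tensors="pt")
start = time.time()
outputs = model.generate(**inputs, max_new_tokens=128)
print(f"generation took {time.time() - start:.1f}s")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))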

Is there any misconfiguration on our part?

I would love to hear your feedback and appreciate any help.

@luoyu-intel
Contributor

Could you try neural-speed alone with this model? It may not be an issue of neural-speed itself.
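
A minimal standalone check, following the neural-speed README example (the weight/compute dtypes here mirror your config and the local checkpoint path is taken from it; adjust as needed):

from transformers import AutoTokenizer, TextStreamer
from neural_speed import Model

model_name = "/root/Intel/neural-chat-7b-v3-3"  # local checkpoint from the config above
prompt = "Tell me about Intel Xeon Scalable Processors."

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

# Quantize/load and generate directly through neural-speed, bypassing neuralchat_server
model = Model()
model.init(model_name, weight_dtype="int8", compute_dtype="fp32")
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=128)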
