Is batch size > 1 supported? #269

QuPengfei · 2024-05-28T02:16:09Z

Hi all,

Is inference with batch size (bs) > 1 supported? I found the following check in the code:

    if (batch_size > 1)
      MODEL_ASSERT(
          ("llama arch only supports continuous batching inference when giving multi prompts.", lctx.cont_batching));

Thanks.

zhentaoyu · 2024-05-31T05:36:02Z

Yes, it is supported, but only for a few model architectures. Please refer to https://github.com/intel/neural-speed/blob/main/docs/continuous_batching.md
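
For readers skimming the thread, here is a rough sketch of what the quoted check appears to enforce: a single prompt always passes, while multiple prompts pass only when continuous batching is enabled. `ContextSketch`, `batch_request_allowed`, and `require_batch_supported` are hypothetical names made up for illustration and are not part of the project's API; only `batch_size` and `cont_batching` come from the snippet above, and the real behavior of `MODEL_ASSERT` is not shown in this thread.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for the runtime context. Only the flag seen in the
// quoted snippet (cont_batching) is modeled; it defaults to off here.
struct ContextSketch {
  bool cont_batching = false;
};

// Returns true when a request of the given batch size would pass the quoted
// check: one prompt always passes, several prompts need cont_batching.
bool batch_request_allowed(int batch_size, const ContextSketch& ctx) {
  return batch_size <= 1 || ctx.cont_batching;
}

// Throwing variant used here in place of MODEL_ASSERT (an assumption made
// so the sketch is self-contained).
void require_batch_supported(int batch_size, const ContextSketch& ctx) {
  if (!batch_request_allowed(batch_size, ctx)) {
    throw std::invalid_argument(
        "multi-prompt inference requires continuous batching");
  }
}

// Small usage example covering the three interesting cases.
void example_usage() {
  ContextSketch off;           // continuous batching disabled
  ContextSketch on;
  on.cont_batching = true;     // continuous batching enabled

  assert(batch_request_allowed(1, off));   // single prompt: always fine
  assert(!batch_request_allowed(4, off));  // multi prompt, flag off: rejected
  assert(batch_request_allowed(4, on));    // multi prompt, flag on: accepted
}
```

In other words, the assertion in the question does not mean batching is unsupported; it means a multi-prompt request is expected to go through the continuous batching path described in the linked document.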

zhentaoyu · 2024-06-05T07:41:11Z

Hi, @QuPengfei, if you have no other questions, we will close this issue. Thanks.

kevinintel assigned zhentaoyu Jun 5, 2024
