
Garbled characters with beam search #215

Open · jiafuzha opened this issue Apr 12, 2024 · 16 comments

@jiafuzha commented Apr 12, 2024

```python
from transformers import AutoTokenizer
from neural_speed import Model

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = Model()
model.init(model_name, use_quant=True, weight_dtype="int4", compute_dtype="int8")

tokens = tokenizer("What's your favorite animal?", return_tensors='pt').input_ids

outputs = model.generate(tokens, num_beams=2, do_sample=False, max_new_tokens=10)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)
```
With the above code, I get the garbled characters below:

"What's your favorite animal? ���������"

If I generate without beam search (`outputs = model.generate(tokens)`), I get the expected result:

"What's your favorite animal?
everybody has a favorite animal, and it's a"

@a32543254 (Contributor)

We have fixed it in this PR: #202. Please try the newest branch.

@jiafuzha (Author)

@a32543254 It does get fixed in a single generate call. But with continuous batching in ModelServer, the issue still exists. Here is the log after running test_model_server.py.

=======REFERENCE RESULTS FOR COMPARISON=========
=======FOR LOOP GREEDY SEARCH GENERATION RESULTS WITH MHA==========
ARCH_REQ_XCOMP_PERM XTILE_DATA successful.
AVX:1 AVX2:1 AVX512F:1 AVX_VNNI:1 AVX512_VNNI:1 AMX_INT8:1 AMX_BF16:1 AVX512_BF16:1 AVX512_FP16:1
beam_size: 1, do_sample: 0, top_k: 40, top_p: 0.950, continuous_batching: 1, max_request_num: 1, early_stopping: 0, scratch_size_ratio: 1.000
model.cpp: loading model from runtime_outs/ne_llama_q_int4_bestla_cint8_g32.bin
Loading the bin file with NE format...
load_ne_hparams 0.hparams.n_vocab = 32000
load_ne_hparams 1.hparams.n_embd = 4096
load_ne_hparams 2.hparams.n_mult = 256
load_ne_hparams 3.hparams.n_head = 32
load_ne_hparams 4.hparams.n_head_kv = 32
load_ne_hparams 5.hparams.n_layer = 32
load_ne_hparams 6.hparams.n_rot = 128
load_ne_hparams 7.hparams.ftype = 0
load_ne_hparams 8.hparams.max_seq_len = 0
load_ne_hparams 9.hparams.alibi_bias_max = 0.000
load_ne_hparams 10.hparams.clip_qkv = 0.000
load_ne_hparams 11.hparams.par_res = 0
load_ne_hparams 12.hparams.word_embed_proj_dim = 0
load_ne_hparams 13.hparams.do_layer_norm_before = 0
load_ne_hparams 14.hparams.multi_query_group_num = 0
load_ne_hparams 15.hparams.ffn_hidden_size = 11008
load_ne_hparams 16.hparams.inner_hidden_size = 0
load_ne_hparams 17.hparams.n_experts = 0
load_ne_hparams 18.hparams.n_experts_used = 0
load_ne_hparams 19.hparams.n_embd_head_k = 0
load_ne_hparams 20.hparams.norm_eps = 0.000010
load_ne_hparams 21.hparams.freq_base = 10000.000
load_ne_hparams 22.hparams.freq_scale = 1.000
load_ne_hparams 23.hparams.rope_scaling_factor = 0.000
load_ne_hparams 24.hparams.original_max_position_embeddings = 0
load_ne_hparams 25.hparams.use_yarn = 0
load_ne_vocab 26.vocab.bos_token_id = 1
load_ne_vocab 27.vocab.eos_token_id = 2
load_ne_vocab 28.vocab.pad_token_id = 2
load_ne_vocab 29.vocab.sep_token_id = -1
init: n_vocab = 32000
init: n_ctx = 0
init: n_embd = 4096
init: n_mult = 256
init: n_head = 32
init: n_head_kv = 32
init: n_layer = 32
init: n_rot = 128
init: n_ff = 11008
init: n_parts = 1
load: ctx size = 4427.43 MB
load: scratch0 = 4096.00 MB
load: scratch1 = 2048.00 MB
load: scratch2 = 4096.00 MB
load: mem required = 14667.43 MB (+ memory per state)
...................................................................................................
model_init_from_file: support_bestla_kv = 1
model_init_from_file: kv self size = 552.00 MB
ARCH_REQ_XCOMP_PERM XTILE_DATA successful.
What's your favorite animal?
Unterscheidung between different types of animals is difficult, as different people may have different preferences and cultural backgrounds can also play a role in shaping one's preferences. However, some animals are generally considered to be popular or iconic, and these are often the ones that people mention as their favorites.

Some of the most popular animals that people tend to mention as their favorites include:

  1. Dogs: Many people consider dogs to be their favorite animals, and it's not hard to see why. Dogs are known for their loyalty, affection, and playful nature, making them
================================
=======FOR LOOP BEAM SEARCH GENERATION RESULTS WITH MHA==========
Will start to reinit model from bin due to different max request num.
beam_size: 4, do_sample: 0, top_k: 40, top_p: 0.950, continuous_batching: 1, max_request_num: 1, early_stopping: 1, scratch_size_ratio: 1.000
model.cpp: loading model from runtime_outs/ne_llama_q_int4_bestla_cint8_g32.bin
Loading the bin file with NE format...
load_ne_hparams 0.hparams.n_vocab = 32000
load_ne_hparams 1.hparams.n_embd = 4096
load_ne_hparams 2.hparams.n_mult = 256
load_ne_hparams 3.hparams.n_head = 32
load_ne_hparams 4.hparams.n_head_kv = 32
load_ne_hparams 5.hparams.n_layer = 32
load_ne_hparams 6.hparams.n_rot = 128
load_ne_hparams 7.hparams.ftype = 0
load_ne_hparams 8.hparams.max_seq_len = 0
load_ne_hparams 9.hparams.alibi_bias_max = 0.000
load_ne_hparams 10.hparams.clip_qkv = 0.000
load_ne_hparams 11.hparams.par_res = 0
load_ne_hparams 12.hparams.word_embed_proj_dim = 0
load_ne_hparams 13.hparams.do_layer_norm_before = 0
load_ne_hparams 14.hparams.multi_query_group_num = 0
load_ne_hparams 15.hparams.ffn_hidden_size = 11008
load_ne_hparams 16.hparams.inner_hidden_size = 0
load_ne_hparams 17.hparams.n_experts = 0
load_ne_hparams 18.hparams.n_experts_used = 0
load_ne_hparams 19.hparams.n_embd_head_k = 0
load_ne_hparams 20.hparams.norm_eps = 0.000010
load_ne_hparams 21.hparams.freq_base = 10000.000
load_ne_hparams 22.hparams.freq_scale = 1.000
load_ne_hparams 23.hparams.rope_scaling_factor = 0.000
load_ne_hparams 24.hparams.original_max_position_embeddings = 0
load_ne_hparams 25.hparams.use_yarn = 0
load_ne_vocab 26.vocab.bos_token_id = 1
load_ne_vocab 27.vocab.eos_token_id = 2
load_ne_vocab 28.vocab.pad_token_id = 2
load_ne_vocab 29.vocab.sep_token_id = -1
init: n_vocab = 32000
init: n_ctx = 0
init: n_embd = 4096
init: n_mult = 256
init: n_head = 32
init: n_head_kv = 32
init: n_layer = 32
init: n_rot = 128
init: n_ff = 11008
init: n_parts = 1
load: ctx size = 4427.43 MB
load: scratch0 = 16384.00 MB
load: scratch1 = 8192.00 MB
load: scratch2 = 16384.00 MB
load: mem required = 45387.43 MB (+ memory per state)
...................................................................................................
model_init_from_file: support_bestla_kv = 1
model_init_from_file: kv self size = 2208.00 MB
    What's your favorite animal? �������������������������������������������������������������������������������������������������������������������������������

@zhentaoyu (Contributor) commented Apr 15, 2024

Hi, @jiafuzha, sorry for the late response.

  1. The garbled output in your test_model_server.py script is not related to cont-batching or ModelServer. It just uses a different num_beams (4 instead of 2) compared to your first "single generate call"; in fact, it is still a single generate call.

  2. What does the � mean?
     I reproduced your issue with num_beams=4, do_sample=False, max_new_tokens=10. The generated tokens (with prompt) are [[1, 1724, 29915, 29879, 596, 25448, 13019, 29973, 29871, 243, 162, 147, 179, 243, 162, 147, 185, 243]]. Let's pick the last token, 243; it maps to (from the llama2 tokenizer.json):
     [image: tokenizer.json entry for token 243]
     It seems to be a hexadecimal (byte-level) representation, but I'm not very familiar with it, so I don't know why these byte tokens are generated (see the decoding sketch at the end of this comment).

  3. Is it caused by our C++ beam search, model_eval, or just the model itself?

    • Yes, our C++ beam_search is not exactly the same as transformers', but the results should not differ much since we refer to their Python implementation. For example, you can compare the beam search results between PyTorch FP32 and NS FP32:
      env: INTEL(R) XEON(R) PLATINUM 8580, latest NS and ITREX (both built from source). Remember to clean up the runtime_outs folder when you change quant-related args.
      PyTorch:
      from intel_extension_for_transformers.transformers import AutoModelForCausalLM
      itrex_model = AutoModelForCausalLM.from_pretrained(model_name, use_neural_speed=False, trust_remote_code=True).eval()
      generate_ids = itrex_model.generate(tokens, num_beams=4, do_sample=False, max_new_tokens=10)[0]
      print(generate_ids)
      print(tokenizer.decode(generate_ids, skip_special_tokens=True))

     And the output is:
    tensor([ 1, 1724, 29915, 29879, 596, 25448, 13019, 29973, 29871, 243, 162, 147, 185, 243, 162, 147, 180, 243])
    What's your favorite animal? ���������
    NS:

    model.init(model_name, use_quant=False)
    ....same code as above

     And the output is:
    [[1, 1724, 29915, 29879, 596, 25448, 13019, 29973, 29871, 243, 162, 147, 185, 243, 162, 147, 180, 243]]
    What's your favorite animal? ���������
     They are the same! And this is just what the FP32 model outputs (maybe llama2 hallucinates when it meets your prompt...).

    • Use the ITREX RTN algorithm instead of NS to quantize the model and generate with transformers. You can refer to this example for how to quantize and save a low-bit model with ITREX. The quant cmd is: python run_generation.py --model xxx --woq --woq_algo Rtn --bits 4 --weight_dtype int4_clip --compute_dtype int8 --group_size 32 --benchmark. Once it finishes, you will see the low-bit model in the saved_results folder.
      After running:
      from intel_extension_for_transformers.transformers import AutoModelForCausalLM
      itrex_model = AutoModelForCausalLM.from_pretrained(model_name, use_neural_speed=False, trust_remote_code=True).eval()
      generate_ids = itrex_model.generate(tokens, num_beams=4, do_sample=False, max_new_tokens=10)[0]
      print(tokenizer.decode(generate_ids, skip_special_tokens=True))

    You will see:
    What's your favorite animal? ���������

    • Change the RTN quant args. Let's use per-channel quantization this time. The Python call is: model.init(model_name, use_quant=True, weight_dtype="int4", compute_dtype="int8", group_size=-1). And the output is: What's your favorite animal? Why? (Submitted 10:. The result seems a bit more reasonable.

So, I think this issue is more of a model-related problem (RTN quantization, hallucination, etc.). If you still see this generation problem after trying more models or more quant algorithms (GPTQ, AWQ, AutoRound), please let me know. Thanks.
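For reference, here is a minimal decoding sketch (not from the scripts above; it assumes these ids are Llama's byte-fallback tokens and only uses the standard transformers tokenizer API):

```python
# Sketch: inspect the ids that render as '�'. Assumes the same Llama-2 tokenizer
# as above; convert_ids_to_tokens is the standard Hugging Face tokenizer call.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

ids = [243, 162, 147, 179]                    # one repeating group from the output above
toks = tokenizer.convert_ids_to_tokens(ids)   # byte-fallback tokens such as '<0xF0>'
print(toks)

# Each <0xNN> token carries one UTF-8 byte; a complete group decodes to a single character.
raw = bytes(int(t[1:-1], 16) for t in toks)   # assumes every token has the '<0xNN>' form
print(raw.decode("utf-8", errors="replace"))
```

If the decoded bytes form emoji, the '�' are just an incomplete or badly merged UTF-8 byte sequence rather than random garbage.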

@jiafuzha (Author)

@zhentaoyu Thanks for the detailed response. I have some new findings to share with you.

  1. I am able to get the correct result after changing max_new_tokens from 10 to 50, with both vanilla transformers and ITREX.

"What's your favorite animal? 🐰🐶🐱🐷

My favorite animal is the penguin! 🐧 I think they're so cute and funny, and they're great"

tokens:
tensor([ 1, 1724, 29915, 29879, 596, 25448, 13019, 29973, 29871, 243,
162, 147, 179, 243, 162, 147, 185, 243, 162, 147,
180, 243, 162, 147, 186, 13, 13, 3421, 25448, 13019,
338, 278, 282, 19636, 262, 29991, 29871, 243, 162, 147,
170, 306, 1348, 896, 29915, 276, 577, 274, 1082, 322,
2090, 1460, 29892, 322, 896, 29915, 276, 2107])

  2. With Neural Speed, however, I still get garbled characters. After checking the token IDs, I found that most of the tokens just repeat themselves. Do you think it's related to the lack of a repetition penalty in NS? (A transformers-side test sketch follows below.)

[1, 1724, 29915, 29879, 596, 25448, 13019, 29973, 29871, 243, 162, 147, 179, 243, 162, 147, 185, 243, 162, 147, 180, 243, 162, 147, 186, 243, 162, 147, 183, 243, 162, 147, 184, 243, 162, 147, 185, 243, 162, 147, 180, 243, 162, 147, 186, 243, 162, 147, 183, 243, 162, 147, 184, 243, 162, 147, 185, 243]
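A quick way to test that hypothesis on the transformers side would be something like this (a sketch I have not run on this setup; repetition_penalty and no_repeat_ngram_size are standard generate() kwargs):

```python
# Sketch only: see whether penalizing repeats on the transformers side removes
# the repeating byte-token groups with beam search.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
hf_model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True).eval()

tokens = tokenizer("What's your favorite animal?", return_tensors="pt").input_ids
outputs = hf_model.generate(
    tokens,
    num_beams=4,
    do_sample=False,
    max_new_tokens=50,
    repetition_penalty=1.2,   # >1.0 discourages tokens that were already generated
    no_repeat_ngram_size=3,   # forbid repeating any exact 3-gram
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```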

@jiafuzha (Author)

By the way, another case of garbled characters occurs with the prompt "What's your favorite food?".
ns:
[1, 1724, 29915, 29879, 596, 25448, 9687, 29973, 29871, 243, 162, 144, 151, 243, 162, 144, 162, 243, 162, 168, 167, 243, 162, 143, 177, 243, 162, 144, 152, 243, 162, 168, 171, 243, 162, 143, 177, 243, 162, 144, 151, 243, 162, 144, 162, 243, 162, 168, 167, 243, 162, 143, 177, 243, 162, 144, 152, 243]
What's your favorite food? �������������������������������������������������

vanilla transformers:
tensor([ 1, 1724, 29915, 29879, 596, 25448, 9687, 29973, 13, 13,
3421, 25448, 9687, 338, 282, 24990, 29889, 306, 5360, 278,
10296, 310, 278, 2181, 275, 2272, 2181, 504, 29892, 18806,
29891, 6454, 1219, 12507, 346, 29892, 322, 286, 2152, 287,
286, 2112, 29920, 598, 13520, 923, 968, 29889, 739, 29915,
29879, 278, 4922, 13016, 9687, 29889, 13, 13])
What's your favorite food?

My favorite food is pizza. I love the combination of the crispy crust, tangy tomato sauce, and melted mozzarella cheese. It's the perfect comfort food.

@zhentaoyu (Contributor)

  1. Are the NS results from RTN quant or FP32? The RTN-quantized model may have poor chat quality.
  2. Beam search in NS has no repetition penalty; it only has length_penalty (to prefer longer or shorter sequence results). See the scoring sketch below.
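For context, length_penalty only rescales a finished beam's score, roughly as in the sketch below (the usual normalization used by transformers' beam search; NS follows the same idea, though details may differ). It does nothing to discourage repeated tokens.

```python
# Sketch of the usual length-penalty normalization for a finished beam hypothesis
# (as in transformers' BeamHypotheses); note it never penalizes repeated tokens.
def final_beam_score(sum_logprob: float, hyp_length: int, length_penalty: float = 1.0) -> float:
    # length_penalty > 1.0 favors longer hypotheses, < 1.0 favors shorter ones
    return sum_logprob / (hyp_length ** length_penalty)
```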

@jiafuzha (Author)

> 1. Are the NS results from RTN quant or FP32? The RTN-quantized model may have poor chat quality.
> 2. Beam search in NS has no repetition penalty; it only has length_penalty (to prefer longer or shorter sequence results).

The NS result is from `model.init(model_name, use_quant=True, weight_dtype="int4", compute_dtype="int8")`.

@zhentaoyu (Contributor)

I see. You can use `model.init(model_name, use_quant=False)` to compare with your vanilla transformers results.
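For example, something like this sketch (reusing the script from the first comment with quantization disabled; the `from neural_speed import Model` import is assumed to match your setup):

```python
# Sketch: same flow as the original repro, but FP32 in NS (use_quant=False), so any
# remaining difference vs. vanilla transformers comes from the beam search itself.
from transformers import AutoTokenizer
from neural_speed import Model

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

model = Model()
model.init(model_name, use_quant=False)

tokens = tokenizer("What's your favorite animal?", return_tensors="pt").input_ids
outputs = model.generate(tokens, num_beams=4, do_sample=False, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```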

@jiafuzha (Author)

Yes, with FP32 I can get the correct result from NS.

I also tried the code below from https://huggingface.co/docs/transformers/main/en/quantization. It also looks like weight-only quantization, and it gives me the correct result.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, QuantoConfig

model_id = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
quantization_config = QuantoConfig(weights="int8")
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda:0", quantization_config=quantization_config)
```

@zhentaoyu (Contributor)

> Yes, with FP32 I can get the correct result from NS.
>
> I also tried the code below from https://huggingface.co/docs/transformers/main/en/quantization. It also looks like weight-only quantization, and it gives me the correct result.
>
> `from transformers import AutoModelForCausalLM, AutoTokenizer, QuantoConfig`
> `model_id = "facebook/opt-125m"; tokenizer = AutoTokenizer.from_pretrained(model_id); quantization_config = QuantoConfig(weights="int8"); quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda:0", quantization_config=quantization_config)`

Hi, @jiafuzha, that's a different model_id and weight dtype.

@a32543254 Does NS have some difference in RTN quant compared to ITREX? I found that the pipeline ITREX RTN QUANT -> NS LOAD -> NS BEAM SEARCH gives more reasonable results.
ITREX RTN QUANT follows this example. With max_new_tokens=50, the result is like: What's your favorite animal? 🐰🐶🐱🐷 everybody loves animals, and there are so many amazing creatures to choose from! 😍 whether you're a cat person, a

@jiafuzha (Author)

> > Yes, with FP32 I can get the correct result from NS.
> > I also tried the code below from https://huggingface.co/docs/transformers/main/en/quantization. It also looks like weight-only quantization, and it gives me the correct result.
> > `from transformers import AutoModelForCausalLM, AutoTokenizer, QuantoConfig; model_id = "facebook/opt-125m"; tokenizer = AutoTokenizer.from_pretrained(model_id); quantization_config = QuantoConfig(weights="int8"); quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda:0", quantization_config=quantization_config)`
>
> Hi, @jiafuzha, that's a different model_id and weight dtype.
>
> @a32543254 Does NS have some difference in RTN quant compared to ITREX? I found that the pipeline ITREX RTN QUANT -> NS LOAD -> NS BEAM SEARCH gives more reasonable results. ITREX RTN QUANT follows this example. With max_new_tokens=50, the result is like: What's your favorite animal? 🐰🐶🐱🐷 everybody loves animals, and there are so many amazing creatures to choose from! 😍 whether you're a cat person, a

Sorry, I copied the wrong code. I was actually using the following (a fuller sketch appears after the output):

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
p = "What's your favorite food?"
quantization_config = QuantoConfig(weights="int4")
....
...

I got
"tensor([ 1, 1724, 29915, 29879, 596, 25448, 9687, 29973, 26833, 338,
282, 24990, 29991, 29871, 243, 162, 144, 152, 243, 162,
148, 143, 396, 1181, 397, 347, 396, 29886, 24990, 396,
29891, 398, 2])
What's your favorite food? Mine is pizza! 🍕👌 #foodie #pizza #yum"
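(The elided lines follow the quanto example from the transformers docs; roughly, the full flow was along these lines, with num_beams=4 and max_new_tokens=50 assumed rather than copied:)

```python
# Reconstructed sketch of the quanto int4 run; the elided lines are filled in from the
# transformers quantization docs, and num_beams/max_new_tokens are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer, QuantoConfig

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
p = "What's your favorite food?"

quantization_config = QuantoConfig(weights="int4")
quantized_model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=quantization_config)

tokens = tokenizer(p, return_tensors="pt").input_ids
generate_ids = quantized_model.generate(tokens, num_beams=4, do_sample=False, max_new_tokens=50)[0]
print(generate_ids)
print(tokenizer.decode(generate_ids, skip_special_tokens=True))
```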

@jiafuzha (Author)

@zhentaoyu @a32543254 any more comments?

@zhentaoyu (Contributor) commented Apr 19, 2024

Hi, @jiafuzha, our NS RTN quant has some regressions that need to be fixed and aligned (for example, we quantize lm_head and token_embedding for llama). We will let you know when we fix it. Thanks.

@jiafuzha (Author) commented May 8, 2024

any update on this?

@zhentaoyu (Contributor)

Hi, @jiafuzha, sorry for the late response. We are tied up with other things recently. We will dig into it and let you know if we have any findings. Thanks a lot.

@jiafuzha (Author)

> Hi, @jiafuzha, sorry for the late response. We are tied up with other things recently. We will dig into it and let you know if we have any findings. Thanks a lot.

no worries, looking forward to your fix.
