Discrepancy on WER benchmark result in Tedlium dataset #135

Open
MLMonkATGY opened this issue Jun 4, 2024 · 1 comment

Comments

@MLMonkATGY

Hi.

I am unable to reproduce the paper's benchmark result for the test split of distil-whisper/tedlium with the model distil-whisper/distil-large-v2 when using run_eval.py. On every other dataset reported in the paper, my results are reasonably close to the published numbers (< 1% difference). Any idea what could be causing this discrepancy?

I followed the suggestion in issue 131 to use EnglishTextNormalizer instead of BasicTextNormalizer.

Reported WER from paper: 9.6%
Achieved WER: 12.69%
Difference: 3.09 percentage points (absolute)
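
For scale, the gap can be expressed both in absolute points and relative to the paper's figure. A quick check using only the two numbers above (the variable names are just for illustration):

```python
reported_wer = 9.6    # % WER reported in the paper
achieved_wer = 12.69  # % WER obtained with run_eval.py

# Absolute gap in percentage points.
absolute_gap = achieved_wer - reported_wer

# Gap relative to the paper's reported WER, in percent.
relative_gap = absolute_gap / reported_wer * 100.0

print(f"absolute gap: {absolute_gap:.2f} points")
print(f"relative gap: {relative_gap:.1f}%")
```

So the run is roughly a third worse than the published number in relative terms, which is well outside the < 1% spread seen on the other datasets.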

Command:

python run_eval.py \
  --model_name_or_path "distil-whisper/distil-large-v2" \
  --dataset_name "distil-whisper/tedlium" \
  --dataset_config_name "release3" \
  --dataset_split_name "test" \
  --text_column_name "text" \
  --batch_size 64 \
  --dtype "bfloat16" \
  --generation_max_length 256 \
  --language "en" \
  --attn_implementation "flash_attention_2" 

Modification: used EnglishTextNormalizer as the text normalizer
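
To illustrate why the choice of normalizer matters for the final number, here is a minimal, self-contained sketch of a WER computation with and without normalization. It is not the code from run_eval.py: `simple_normalize` is a hypothetical stand-in (lowercasing and punctuation stripping only) and is far simpler than EnglishTextNormalizer, and the example sentences are made up.

```python
import re


def simple_normalize(text: str) -> str:
    """Hypothetical stand-in normalizer: lowercase, drop punctuation
    (keeping apostrophes), and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)
    return " ".join(text.split())


def word_errors(reference: str, hypothesis: str) -> int:
    """Word-level Levenshtein distance (substitutions + insertions + deletions)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(references, hypotheses, normalize=None) -> float:
    """Corpus-level WER in percent: total word errors / total reference words."""
    errors = words = 0
    for ref, hyp in zip(references, hypotheses):
        if normalize is not None:
            ref, hyp = normalize(ref), normalize(hyp)
        errors += word_errors(ref, hyp)
        words += len(ref.split())
    return 100.0 * errors / words


# Made-up examples: the hypotheses differ from the references only in
# casing and punctuation.
refs = ["it's a great talk", "thank you very much"]
hyps = ["It's a great talk.", "Thank you very much!"]

print(f"WER without normalization: {wer(refs, hyps):.1f}%")
print(f"WER with normalization:    {wer(refs, hyps, simple_normalize):.1f}%")
```

Differences that are purely about formatting count as word errors unless both sides go through the same normalizer, so a mismatch between the normalization used here and the one used for the paper could move the score by several points.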

Thanks in advance.

@bryanyzhu

I'm facing the same issue; only Tedlium shows this discrepancy.
