
Outputs of separation module are clipping #1729

Open
faroit opened this issue Jun 19, 2024 · 4 comments · May be fixed by #1730

Comments


faroit commented Jun 19, 2024

Tested versions

  • 3.3

System information

macOS, M1

Issue description

Hi @hbredin, @joonaskalda, thanks for this great release!

I tried some examples with the new PixIT pipeline, and the outputs of the separation module seem to clip heavily. Is this to be expected from the way it was trained with scale-invariant losses?

The input was a downsampled 16 kHz mono WAV file from the YouTube excerpt linked below.

[image: waveform of the separated output, showing heavy clipping]

Minimal reproduction example (MRE)

https://www.youtube.com/watch?v=CGUpPyA48jE&t=182s

# instantiate the pipeline
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speech-separation-ami-1.0",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")

# run the pipeline on an audio file
diarization, sources = pipeline("audio.wav")

# dump the diarization output to disk using RTTM format
with open("audio.rttm", "w") as rttm:
    diarization.write_rttm(rttm)

# dump sources to disk as SPEAKER_XX.wav files
# (sources.data has shape (num_samples, num_speakers))
import scipy.io.wavfile

for s, speaker in enumerate(diarization.labels()):
    scipy.io.wavfile.write(f"{speaker}.wav", 16000, sources.data[:, s])
joonaskalda (Contributor) commented

Hi @faroit, thank you for your interest in PixIT! I suspect the issue is that the current version is trained only on the AMI meeting dataset. On the AMI test set this hasn’t been an issue. Finetuning on domain-specific audio would likely improve the separation performance.


faroit commented Jun 20, 2024

@joonaskalda thanks for your reply. I am not sure fine-tuning would really fix this.
I dug a bit deeper and saw that the maximum output after separation is about 81.0 in that example. Interestingly, the signal also drifts in terms of DC bias. Here is the peak-normalized output of speaker 1:

[image: peak-normalized waveform of speaker 1, showing DC drift]

Was the model trained on zero-mean, unit variance data?
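For reference, a minimal sketch of this check, reusing the diarization and sources objects from the MRE above (names as in that snippet, nothing else assumed):

import numpy as np

# print peak level and mean (DC offset) for each separated source
for s, speaker in enumerate(diarization.labels()):
    src = sources.data[:, s]
    print(f"{speaker}: peak={np.abs(src).max():.1f}, mean={src.mean():.4f}")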

joonaskalda (Contributor) commented

Thanks for investigating. I checked and the separated sources are (massively) scaled up for AMI data too. I never noticed because I’ve peak-normalized them before use. The scale-invariant loss is indeed the likely culprit.

The training data was not normalized to zero mean and unit variance.
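As a stop-gap, a user-side peak normalization before writing the sources to disk could look like this (a sketch only, assuming the objects from the MRE above; not part of the pipeline itself):

import numpy as np
import scipy.io.wavfile

for s, speaker in enumerate(diarization.labels()):
    src = sources.data[:, s]
    src = src - src.mean()            # remove the DC drift noted above
    peak = np.abs(src).max()
    if peak > 0:
        src = src / peak              # scale peaks to [-1, 1]
    scipy.io.wavfile.write(f"{speaker}.wav", 16000, src.astype(np.float32))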


faroit commented Jun 21, 2024

@joonaskalda thanks for the update. Maybe you could add a normalization step to the pipeline, so that users who aren't familiar with models trained on scale-invariant (SI-SDR) losses aren't surprised.

joonaskalda linked a pull request Jun 21, 2024 that will close this issue