
Subband vs Duration #3

Open · zaptrem opened this issue May 21, 2024 · 17 comments
zaptrem commented May 21, 2024

Hi, can you explain the difference between the subband and duration experiments and share which you've found to perform better? Also, what have you discovered about the use of classifier-free guidance?

bfs18 (Owner) commented May 28, 2024

> Hi, can you explain the difference between the subband and duration experiments and share which you've found to perform better? Also, what have you discovered about the use of classifier-free guidance?

Thank you for your interest. The term 'duration' pertains to text-to-speech applications and is unrelated to reconstructing audio waveforms from Mel-spectrograms or EnCodec tokens.

Regarding the number of subbands, our preliminary experiments indicated that 16 subbands yield better results than 4, and that 4 subbands outperform full-band processing. Employing more subbands increases the amount of computation while maintaining a similar parameter count and using the same dataset. This can enhance performance according to the empirical scaling law, although it's worth noting that the scaling law has mainly been established on Transformer models.

I've attached a figure to illustrate classifier-free guidance and STFT loss. While CFG improves objective metrics in the vocoder experiments, it does not lead to a corresponding increase in listening-test scores; when reconstructing waveforms from EnCodec tokens, however, CFG substantially improves the listening experience. This may be because the abundant and deterministic information in the Mel-spectrogram renders CFG unnecessary.
*(attached figure: classifier-free guidance and STFT loss comparison)*
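
For readers unfamiliar with classifier-free guidance: at sampling time it amounts to running the model with and without the condition and extrapolating between the two predictions. A minimal sketch, assuming a velocity-prediction network with a hypothetical `model(x_t, t, cond)` signature (not RFWave's actual API):

```python
import torch

def cfg_velocity(model, x_t, t, cond, guidance_scale=2.0):
    """One guided velocity estimate for a sampling step (generic sketch).

    Using `None` as the dropped condition is an assumption for
    illustration; real implementations often use a learned null embedding.
    """
    v_cond = model(x_t, t, cond)    # conditional prediction
    v_uncond = model(x_t, t, None)  # unconditional prediction
    # Extrapolate away from the unconditional estimate.
    return v_uncond + guidance_scale * (v_cond - v_uncond)
```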

zaptrem (Author) commented May 31, 2024

Thanks! A few more questions:

  1. You mention 4 vs 16 subbands, but I think the paper and code use 8 subbands. Is there a reason you're discussing 4 vs 16 here?
  2. I see you're using STFT loss in the above, which is off by default in the code. Did you find phase loss and overlap loss to produce worse results? Is energy-balanced loss on by default?
  3. Have you tried applying your multi-band approach to Vocos (i.e., no diffusion)?
  4. Have you noticed that models with CFG turned on are significantly louder than those with it off?
  5. I noticed some buzzing when the audio is quiet/silent, even with time-balanced loss turned on. Does this go away as training continues?
  6. I'm testing out your models but have been getting unusual loss curves. Did you see these as well?
     *(attached image: loss curves)*

bfs18 (Owner) commented May 31, 2024

Hi @zaptrem

  1. Apologies for the confusion earlier. In our initial experiments, we actually tested performance with 4 vs 8 subbands, not 16 subbands.
  2. The use of STFT loss has demonstrated an advantage in mitigating water-like noise when background noise is present in our experiments. We did not observe any significant improvement from incorporating phase loss. Moreover, overlap loss does not compromise performance; rather, it ensures coherence among the individually modeled subbands. We found a weight coefficient of 0.01 to be sufficient for both STFT loss and overlap loss in our setup (a minimal sketch of this weighting follows at the end of this comment). Energy-balanced loss is discussed under question 5 below.
     The following is an example of the effect of STFT loss: there are vertical patterns in the spectrogram of waveforms generated by a model without it.
     *(attachment: background.zip)*

Ground truth
*(spectrogram: gt)*

With STFT loss
*(spectrogram: stft)*

Without STFT loss
*(spectrogram: wo stft)*

  3. We have not yet applied the multi-band approach to the Vocos system. However, I believe it could be beneficial for Vocos as well, given that the multi-band approach increases the amount of computation (which, as noted above, tends to improve performance).
  4. I noticed that. Training input audio is normalized to a range between -1 and -6 dB, so normalizing audio to this volume level during testing should yield more consistent results.
  5. Energy-balanced loss (called time-balanced loss in the code) is designed for this issue. The buzzing will disappear as training continues.
  6. This is my loss curve on the LJSpeech dataset (22.05 kHz).
     *(attached image: LJSpeech loss curve)*

On the Opencpop dataset (44.1 kHz):
*(attached image: Opencpop loss curve)*
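
As a rough illustration of the weighting described in point 2, here is a minimal sketch of an auxiliary STFT magnitude loss combined with the main objective at a 0.01 coefficient. The function names and STFT parameters are illustrative choices, not the repo's actual code:

```python
import torch

def stft_mag_loss(pred_wave, target_wave, n_fft=1024, hop=256):
    """L1 distance between STFT magnitudes of predicted and target waveforms."""
    window = torch.hann_window(n_fft, device=pred_wave.device)
    mag = lambda w: torch.stft(w, n_fft, hop, window=window,
                               return_complex=True).abs()
    return (mag(pred_wave) - mag(target_wave)).abs().mean()

# Hypothetical composition: small coefficients keep the auxiliary terms
# from dominating the main rectified-flow objective.
# loss = flow_loss + 0.01 * stft_mag_loss(pred, target) + 0.01 * overlap_loss
```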

zaptrem (Author) commented Jun 1, 2024

Thanks! It looks like you probably have a similar staircase effect between 0 and 30k, but I'm not certain since it's zoomed out. Also, I noticed your PQMF filter is hard-coded to 8 bands, 124 taps, and cutoff 0.071. If using 16 bands, would it be better to set these to 16 bands, 248 taps, and cutoff 0.0355 (following the trend you set going from 4 to 8)? Also, similar to CFG scale, have you noticed any other changes that disproportionately help with generating waveforms for EnCodec tokens?

bfs18 (Owner) commented Jun 2, 2024

Hi @zaptrem,
The PQMF is only used for waveform equalization, and it uses 4 subbands: these subbands are equalized and then merged back into an equalized waveform. The model splits the complex spectrogram into 8 subbands by selecting the appropriate frequency dimensions; this has nothing to do with the PQMF.
I haven't experimented with splitting the complex spectrogram into 16 subbands, and I haven't observed any other factors that disproportionately enhance the generation of waveforms for EnCodec tokens.

zaptrem (Author) commented Jun 4, 2024

Thanks! For clarification, the original paper used 4 vs 8 (model, not PQMF) subbands, but you have since moved to 16 and that is how you got the results in this screenshot? Also, when you switched between 4/8/16 model subbands did you need to adjust the left_overlap and right_overlap parameters (which are both set to 8 by default) in RectifiedFlow?

bfs18 (Owner) commented Jun 5, 2024

> Thanks! For clarification, the original paper used 4 vs 8 (model, not PQMF) subbands, but you have since moved to 16 and that is how you got the results in this screenshot? Also, when you switched between 4/8/16 model subbands did you need to adjust the left_overlap and right_overlap parameters (which are both set to 8 by default) in RectifiedFlow?

Hi,

  1. The image link didn't work out. I can't see it. Could you resend it?
  2. The 8-dimensional overlap is fine for varying subbands. Adjustments to left_overlap and right_overlap aren't typically needed when changing subbands.
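
For intuition, splitting a complex spectrogram into contiguous frequency subbands with a fixed left/right overlap might look like the sketch below. The contiguous-band assumption, zero-padded edges, and exact indexing are illustrative, not necessarily how the repo's code does it:

```python
import torch

def split_subbands(spec, n_bands=8, left_overlap=8, right_overlap=8):
    """Split a spectrogram (batch, freq, time) into overlapping subbands.

    Assumes freq is divisible by n_bands; edge bands are zero-padded so
    every band has the same width. Works for real or complex tensors.
    """
    b, f, t = spec.shape
    width = f // n_bands
    pad_l = spec.new_zeros(b, left_overlap, t)
    pad_r = spec.new_zeros(b, right_overlap, t)
    padded = torch.cat([pad_l, spec, pad_r], dim=1)
    bands = [padded[:, i * width : (i + 1) * width + left_overlap + right_overlap]
             for i in range(n_bands)]
    return torch.stack(bands, dim=1)  # (batch, n_bands, width + overlaps, time)
```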

zaptrem (Author) commented Jun 5, 2024

> > Thanks! For clarification, the original paper used 4 vs 8 (model, not PQMF) subbands, but you have since moved to 16 and that is how you got the results in this screenshot? Also, when you switched between 4/8/16 model subbands did you need to adjust the left_overlap and right_overlap parameters (which are both set to 8 by default) in RectifiedFlow?
>
> Hi,
>
> 1. The image link didn't work out. I can't see it. Could you resend it?
> 2. The 8-dimensional overlap is fine for varying subbands. Adjustments to left_overlap and right_overlap aren't typically needed when changing subbands.

  1. Thanks!
  2. This one:
     *(attached image: loss curves, resent)*

Were these with 16 bands? If not, did 16 bands improve it beyond these results?

bfs18 (Owner) commented Jun 5, 2024

> Were these with 16 bands? If not, did 16 bands improve it beyond these results?

In this table, RFWave utilizes 8 subbands, a detail noted in the paper's text but omitted from the table's title.

zaptrem (Author) commented Jun 28, 2024

*(attached image: results table from the paper)*

Doesn't this table (and yours above) show overlap loss actually making the results worse (PESQ, V/UV, Periodicity) or no better (ViSQOL) compared to no overlap loss?

bfs18 (Owner) commented Jun 28, 2024

> *(attached image: results table from the paper)*
>
> Doesn't this table (and yours above) show overlap loss actually making the results worse (PESQ, V/UV, Periodicity) or no better (ViSQOL) compared to no overlap loss?

In initial trials, I occasionally noticed horizontal striations across subbands in the spectrograms. Implementing overlap loss as a countermeasure effectively mitigated this issue, leading to its adoption as the default setting. I did not expect an improvement in objective metrics from overlap loss; my primary goal was to ensure robust performance across a diverse range of configurations.

zaptrem (Author) commented Jun 28, 2024

Thanks. Your model uses a lot of bands/compute on frequencies the human ear doesn't really care about, so I've been trying to fix that. Here's what 16 bands looks like with your current approach:

*(attached image: standard_spectrogram)*

I thought of using an STFT with a higher n_fft (4096) for the lower frequencies (so more bands/compute are spent on the parts we care about) and a lower n_fft for the higher frequencies (1024, which still satisfies COLA since the hop size is 512).

*(attached image: stft_comparison_50_iterations)*
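
A minimal sketch of that two-resolution analysis, assuming a shared hop of 512 so the frames of both transforms line up (illustrative only; the gist linked below has my actual code):

```python
import torch

def dual_res_stft(wave, hop=512, n_fft_low=4096, n_fft_high=1024, split_bin=256):
    """Long-window STFT for the low frequencies, short-window for the highs.

    `split_bin` (counted in low-FFT bins) is an arbitrary handover point for
    illustration. Frame counts match because both transforms share `hop`.
    """
    win_low = torch.hann_window(n_fft_low, device=wave.device)
    win_high = torch.hann_window(n_fft_high, device=wave.device)
    low = torch.stft(wave, n_fft_low, hop, window=win_low, return_complex=True)
    high = torch.stft(wave, n_fft_high, hop, window=win_high, return_complex=True)
    ratio = n_fft_low // n_fft_high  # 4 low-FFT bins per high-FFT bin
    return low[:, :split_bin], high[:, split_bin // ratio :]
```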

The problem I ran into is that my multi-resolution STFT implementation is not as perfectly invertible as a normal spectrogram (though it's close), and I'm not quite sure why. When you run it back and forth hundreds of times, some ringing/artifacts appear on the border between the two resolutions. When you train on these (with overlap turned off, since it doesn't make sense across resolution boundaries, and wave=true; I haven't tried false), the model seems to have a much harder time wrapping its head around anything, and after one night it still hasn't figured out phase.

Have you tried anything like this?

My code in case you're interested: https://gist.github.com/zaptrem/94d10c5d76d2f601841e9f8e8bf4859a

Also, I'm not quite sure I understand the motivation for doing `pred = self.stft(self.istft(pred))` all the time (e.g., in `compute_stft_loss` and when taking inference steps) when `wave` is true. Why do you do it?

And why did you stop using the trick Vocos uses for better phase outputs (since phase is periodic)?

Also (sorry for so many questions, haha), with `wave=false` did you find `feature_loss` to improve things or make them worse?

bfs18 (Owner) commented Jul 1, 2024

Hi @zaptrem

  1. I've tested your code and noticed a horizontal line between the two subbands. Additionally, there seems to be an error in the code regarding the subband dimensions; they should be 1024 and 257 instead of 1024 and 256. However, even after correcting this, artifacts are still present, and I have not yet determined a solution. One potential approach to mitigating the error accumulation could be to set `wave=False` and conduct the modeling directly in the frequency domain. This would necessitate only a single inverse STFT operation after sampling, which may reduce the severity of the artifacts.
     *(attached image: 20240701-144528)*
  2. `stft(istft(pred))` follows the STFT and ISTFT operations in Figure 1 of the paper.
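
Put differently, the round trip can be read as a consistency projection: an arbitrary tensor of STFT coefficients generally does not correspond to any real waveform, and istft followed by stft maps it to one that does. A minimal sketch with illustrative parameters (not the repo's exact settings):

```python
import torch

def enforce_consistency(spec, n_fft=1024, hop=256, length=None):
    """Map a complex spectrogram to a 'consistent' one via istft -> stft."""
    window = torch.hann_window(n_fft, device=spec.device)
    wave = torch.istft(spec, n_fft, hop, window=window, length=length)
    return torch.stft(wave, n_fft, hop, window=window, return_complex=True)
```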

zaptrem (Author) commented Jul 6, 2024

> Additionally, there seems to be an error in the code regarding the subband dimensions; they should be 1024 and 257 instead of 1024 and 256

Thanks! Can you tell me more about where this is happening and how you fixed it? One of my suspicions for why this was performing significantly worse was that the shape/location of the real/imaginary channels was shifting between bands, but I could never determine whether that was actually happening. Also noting here that I tried some CQT/wavelet-based transforms and got similarly mediocre results, I think because they're much more periodic than a real/imag STFT.

Also, I've noticed the waterfall effect seems to continue in high-noise regions of the spectrogram even with STFT loss enabled, and it becomes more prominent as you get further into the training run. Could it be caused by errors being amplified between bands via the stft/istft operation at each step?

zaptrem (Author) commented Jul 10, 2024

*(attached image: spectrogram showing the waterfall effect)*

Also, I noticed the waterfall effect is still quite prevalent when there's noise but not complete silence, even at the end of training with STFT loss enabled.

bfs18 (Owner) commented Jul 10, 2024

> *(attached image: spectrogram showing the waterfall effect)*
>
> Also, I noticed the waterfall effect is still quite prevalent when there's noise but not complete silence, even at the end of training with STFT loss enabled.

Hi @zaptrem
I also don't know how to fix the mismatch when shifting between bands of different FFT sizes.
Regarding the waterfall noise, I believe it occurs because the model attempts to reconstruct a phase even for background noise, where no meaningful phase is present. Increasing the STFT loss weight might resolve this issue by making the model place slightly more emphasis on the magnitude information.

zaptrem (Author) commented Jul 10, 2024

> I also don't know how to fix the mismatch when shifting between bands of different FFT sizes.
>
> Regarding the waterfall noise, I believe it occurs because the model attempts to reconstruct a phase even for background noise, where no meaningful phase is present. Increasing the STFT loss weight might resolve this issue by making the model place slightly more emphasis on the magnitude information.

I figured out you can use a PQMF to get clean cuts that you can apply different STFT settings to, but I haven't tried a model with it yet because I don't know how the aliasing cancellation will work when neither top-level band is aware of the other.
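
For reference, a standard cosine-modulated PQMF bank is built from a single prototype lowpass; here is a sketch using the textbook construction, with the 124-tap/0.071-cutoff numbers from earlier in the thread as defaults (cutoff normalization conventions vary between implementations, so treat the values as illustrative):

```python
import numpy as np
from scipy.signal import firwin

def pqmf_bank(n_bands=8, taps=124, cutoff=0.071, beta=9.0):
    """Cosine-modulated analysis/synthesis filters (standard PQMF).

    Aliasing from each band's decimation cancels only when the synthesis
    bank recombines all bands together, which is the crux of the concern
    above about processing the two halves with mutually unaware models.
    """
    proto = firwin(taps + 1, cutoff, window=("kaiser", beta))  # prototype lowpass
    k = np.arange(n_bands)[:, None]   # band index
    n = np.arange(taps + 1)[None, :]  # tap index
    arg = (2 * k + 1) * np.pi / (2 * n_bands) * (n - taps / 2)
    phase = (-1) ** k * np.pi / 4
    analysis = 2 * proto * np.cos(arg + phase)
    synthesis = 2 * proto * np.cos(arg - phase)
    return analysis, synthesis  # each of shape (n_bands, taps + 1)
```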

Something I noticed (though I haven't rigorously confirmed it) is that the waterfall effect is actually less prominent earlier in training. Also notice how the lines span between bands. Could small errors be getting amplified by the overlap between bands, or possibly some sort of spectral bleeding during sampling? Did you see the waterfall effect in your experiments from before you started using stft(istft()) at each step during sampling?

Edit: The waterfalling goes away entirely when I disable the stft(istft()) at inference time. However, the quality otherwise becomes worse:
*(attached image: three spectrograms compared)*

Top: istft/stft turned off. Middle: turned on. Bottom: ground truth.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants