Subband vs Duration #3
Hi @zaptrem,
Thanks! It looks like you probably have a similar staircase effect between 0 and 30k, but I'm not certain since it's zoomed out. Also, I noticed your PQMF filter is hard-coded to 8 bands, 124 taps, and a cutoff of 0.071. If using 16 bands, would it be better to set these (following the trend you set going from 4 to 8) to 16 bands, 248 taps, and a cutoff of 0.0355? Also, similar to the CFG scale, have you noticed any other changes that disproportionately help with generating waveforms for Encodec tokens?
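The scaling implied here can be written as a small helper. Note this is a hypothetical sketch: the rule that taps double and the cutoff halves when the band count doubles is an assumption extrapolated from the 4-to-8-band trend, not something confirmed in the repo.

```python
def scaled_pqmf_params(bands, base_bands=8, base_taps=124, base_cutoff=0.071):
    """Hypothetical helper: extrapolate PQMF settings from the 8-band defaults,
    assuming taps scale up, and the cutoff down, linearly with band count."""
    factor = bands / base_bands
    return bands, int(base_taps * factor), base_cutoff / factor
```

Under this assumption, `scaled_pqmf_params(16)` suggests 16 bands, 248 taps, and a cutoff of 0.0355.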
Hi @zaptrem,
Thanks! For clarification, the original paper used 4 vs. 8 (model, not PQMF) subbands, but you have since moved to 16, and that is how you got the results in this screenshot? Also, when you switched between 4/8/16 model subbands, did you need to adjust the
Hi,

Were these with 16 bands? If not, did 16 bands improve it beyond these results?
In this table, RFWave utilizes 8 subbands, a detail noted in the paper's text but omitted from the table's title.
In initial trials, I occasionally noticed horizontal striations across subbands in the spectrograms. Implementing an overlap loss as a countermeasure effectively mitigated this issue, leading to its adoption as the default setting. I did not expect an improvement in objective metrics from the overlap loss; my primary goal was to ensure robust performance across a diverse range of configurations.
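An overlap loss of the kind described could look like the sketch below. This is a guess at the mechanism rather than RFWave's actual implementation: it assumes adjacent subband predictions share `overlap` frequency bins and penalizes disagreement in the shared region.

```python
import numpy as np

def overlap_loss(band_preds, overlap):
    """Illustrative sketch (not RFWave's actual loss). band_preds is a list of
    per-subband spectrogram slices of shape (frames, bins), where the last
    `overlap` bins of band i are assumed to cover the same frequencies as the
    first `overlap` bins of band i + 1. Penalize L1 disagreement there."""
    pairs = zip(band_preds[:-1], band_preds[1:])
    return float(np.mean([np.abs(lo[:, -overlap:] - hi[:, :overlap]).mean()
                          for lo, hi in pairs]))
```

Driving this toward zero encourages adjacent bands to agree where they meet, which is one plausible way to suppress striations at subband boundaries.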
Thanks. Your model uses a lot of bands/compute on frequencies the human ear doesn't really care about, so I've been trying to fix that. Here's what 16 bands looks like with your current approach:

I thought of using an STFT with a higher n_fft (4096) for the lower frequencies (so more bands/compute is spent on the parts we care about) and a lower n_fft (1024) on the higher frequencies, still satisfying COLA since the hop size is 512. The problem I ran into is that my multi-resolution STFT implementation is not as perfectly invertible as a normal spectrogram (though it's close), and I'm not quite sure why. When you run it back and forth hundreds of times, ringing/artifacts appear on the border between the two resolutions. When you train on these (with overlap turned off, since it doesn't make sense across resolution boundaries, and wave=true; I haven't tried false), the model seems to have a much harder time learning anything and after one night still hasn't figured out phase. Have you tried anything like this? My code in case you're interested: https://gist.github.com/zaptrem/94d10c5d76d2f601841e9f8e8bf4859a

Also, I'm not quite sure I understand the motivation for doing this. And why did you stop using the trick Vocos uses for better phase outputs (since phase is periodic)? Also (sorry for so many questions haha), with wave=false did you find the feature_loss to improve things or make them worse?
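For reference, a single-resolution STFT with least-squares (weighted overlap-add) inversion stays numerically stable under the repeated round trips described here. The hand-rolled sketch below (n_fft and hop are illustrative, not the settings above) is the kind of baseline the multi-resolution version can be compared against.

```python
import numpy as np

def stft(x, n_fft, hop):
    win = np.hanning(n_fft)
    idx = range(0, len(x) - n_fft + 1, hop)
    return np.fft.rfft(np.stack([win * x[i:i + n_fft] for i in idx]), axis=-1)

def istft(S, n_fft, hop):
    # Least-squares (Griffin & Lim style) weighted overlap-add inversion.
    win = np.hanning(n_fft)
    frames = np.fft.irfft(S, n=n_fft, axis=-1) * win
    x = np.zeros(hop * (len(frames) - 1) + n_fft)
    norm = np.zeros_like(x)
    for i, f in enumerate(frames):
        x[i * hop:i * hop + n_fft] += f
        norm[i * hop:i * hop + n_fft] += win ** 2
    return x / np.maximum(norm, 1e-12)

# Repeated round trips should not accumulate ringing or drift.
rng = np.random.default_rng(0)
n_fft, hop = 1024, 256
x = rng.standard_normal(hop * 7 + n_fft)
y = x.copy()
for _ in range(100):
    y = istft(stft(y, n_fft, hop), n_fft, hop)
# Endpoints carry zero window weight, so compare the interior only.
drift = np.max(np.abs(y[1:-1] - x[1:-1]))
```

If a multi-resolution variant shows growing error under the same loop, the inconsistency is concentrated at the resolution boundary rather than in the per-resolution transforms themselves.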
Hi @zaptrem,

Thanks! Can you tell me more about where this is happening and how you fixed it? One of my suspicions for why this was performing significantly worse was that the shape/location of the real/imaginary channels was shifting between bands, but I could never determine whether that was actually happening. Also noting here that I tried some CQT/wavelet-based transforms and got similarly mediocre results, I think because they're much more periodic than real/imag STFT. Also, I've noticed the waterfall effect seems to continue even with STFT loss enabled in high-noise regions of the spectrogram, and it becomes more prominent as you get further into the training run. Could it be caused by errors being amplified between bands via the stft/istft operation at each step?
I figured out you can use a PQMF to get clean cuts that you can apply different STFT settings to, but I haven't tried a model with it yet because I don't know how the aliasing cancellation will work when neither top-level band is aware of the other.

Something I noticed (though I haven't rigorously confirmed it) is that the waterfall effect is actually less prominent earlier in training. Also notice how the lines span between bands. Could small errors be getting amplified by the overlap between bands, or possibly some sort of spectral bleeding during sampling? Did you see the waterfall effect in your experiments from before you started using STFT(ISTFT()) each step during sampling?

Edit: The waterfalling goes away entirely when I disable the stft(istft()) at inference time. However, the quality otherwise becomes worse: Top: istft/stft turned off. Middle: turned on. Bottom: ground truth.
Hi, can you explain the difference between the subband and duration experiments and share which you've found to perform better? Also, what have you discovered about the use of classifier-free guidance?