
Where to find datasets for wake word detection training? #9

Open
farooqkz opened this issue Nov 2, 2023 · 2 comments

Comments

farooqkz commented Nov 2, 2023

Hello. I think it would be nice to include some links or hints in the README about this.

GiviMAD (Owner) commented Nov 3, 2023

The problem is that I have no idea where to find those. I was recommended these: https://github.com/Picovoice/wake-word-benchmark/tree/master/audio, but you will still need to collect your own recordings without the wake word.

For the wake word I built, I collected a few minutes of recordings of a podcast playing through the microphone, using a for loop in a bash script that ran `rustpotter-cli records --ms 2000 $i.wav`. There were still many false positives in my live tests, so I collected a bunch of them using `rustpotter-cli spot -t 0.8 --record-path ./noises trained.rpw` (the 'noises' folder needs to exist). I also added some more positive detections, recorded the same way, to balance the numbers. At that point the medium and large model sizes started to be confident.
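A minimal sketch of that collection loop, assuming a bash shell and that `rustpotter-cli` is on the PATH. The subcommand and flag names below are quoted from the comment above; check `rustpotter-cli --help` for the exact names in your version.

```shell
#!/usr/bin/env bash
# Sketch of the data-collection loop described above (assumes rustpotter-cli
# is installed; subcommand/flag names are as quoted in the comment).
set -euo pipefail

mkdir -p records noises

if command -v rustpotter-cli >/dev/null 2>&1; then
  # Step 1: capture fifty 2-second clips while a podcast plays nearby.
  for i in $(seq 1 50); do
    rustpotter-cli records --ms 2000 "records/$i.wav"
  done
  # Step 2: later, log false activations of the trained model into ./noises
  # (the folder must already exist) and feed them back into training:
  # rustpotter-cli spot -t 0.8 --record-path ./noises trained.rpw
else
  echo "rustpotter-cli not found; install it first" >&2
fi
```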

I also noticed that audio quality matters. I collected all the initial recordings with my MacBook microphone, which captures very clean sound. When I switched to running the model with my Jabra speaker on a Raspberry Pi, I had to make some recordings there and add them to the dataset to achieve similar performance, because that setup captures the audio with some minor echo and background noise. That is another thing that stopped me from trying to collect and share a dataset; in the end it seems like a task that requires a group of people with several devices.

Currently I'm using a medium-size model with threshold 0.93, min counter 15, and the gain normalizer filter, and I'm having a pretty good experience: detection works most of the time, even when I'm watching TV.

In case you are interested in my setup: I'm using it with openHAB and a whisper.cpp add-on I'm working on (for voice generation I'm still using a cloud service), and it gives me an acceptable experience. My server runs on an Orange Pi 5; I will get a Raspberry Pi 5 in a couple of weeks, which I think can be overclocked to 3.0 GHz, so I hope it runs a little faster there. For transcription I'm using a small fine-tuned Whisper model for Spanish that I found on Hugging Face. As a speaker I'm using a Jabra Speak2 40 connected to a Raspberry Pi Zero 2 W (previously I used an older Jabra speaker, but the sound was not as good, as mentioned above).

video_2023-11-03_12-55-24-2.mp4

The setup is summarized here: https://community.openhab.org/t/dialog-processing-with-the-pulseaudiobinding/148191 — though I don't think anyone else has tried it and succeeded yet.

farooqkz (Author) commented Nov 21, 2023

There is Mozilla Common Voice. And individual word clips are provided by MSWC (the Multilingual Spoken Words Corpus). I could open a PR.
