This demo is a simple voice-to-text and translation application. It uses the Whisper model to transcribe speech and then translates the text into the target language.
This demo can run on an older GPU such as an NVIDIA GeForce GTX 960, but we still introduce it on AWS G5 as a standard baseline. Choose whichever setup you like.
First, make sure you've installed the NVIDIA driver and CUDA Toolkit according to the Prepare the CUDA environment in AWS G5 instances under Ubuntu 24.04 article.
sudo apt install portaudio19-dev virtualenv
git clone https://github.com/hardenedlinux/hard-voice.git
cd hard-voice
virtualenv .local
source .local/bin/activate
pip install -r requirements.txt
You can configure the model size by modifying these lines:
# Select from the following models: "tiny", "base", "small", "medium", "large"
model = whisper.load_model("small")
In our test, "small" is good enough.
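For a sense of the trade-off behind that choice, the approximate parameter counts below are taken from the openai-whisper README; larger models are more accurate but slower and need more VRAM:

```python
# Approximate parameter counts per Whisper model size, as listed in the
# openai-whisper README (larger = more accurate, but slower and more VRAM).
WHISPER_SIZES = {
    "tiny": "39M",
    "base": "74M",
    "small": "244M",
    "medium": "769M",
    "large": "1550M",
}

# "small" sits in the middle: a reasonable accuracy/speed balance.
print(WHISPER_SIZES["small"])  # 244M
```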
Note that you don't need to specify the transcription language: Whisper detects it automatically. In other words, Whisper knows what language you are speaking. Amazing, huh?
options = {"fp16": False, "language": None, "task": "translate"}
If you set language to None, Whisper auto-detects the spoken language; and since the task is "translate", the output is English by default.
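As a sketch of how these options reach the model, assuming run.py passes them straight to model.transcribe (the usual openai-whisper pattern; the helper name here is an illustration, not run.py's actual API):

```python
# Same options as in run.py: no half precision, auto-detect the language,
# and translate the result into English.
options = {"fp16": False, "language": None, "task": "translate"}

def transcribe_file(path, model_name="small"):
    """Hypothetical helper: load a Whisper model and apply the options above.
    whisper is imported lazily so the config can be inspected without it."""
    import whisper  # pip install openai-whisper
    model = whisper.load_model(model_name)
    result = model.transcribe(path, **options)
    return result["text"]
```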
python run.py
Then open your browser and visit http://localhost:7860.
First, click the record button and say something. You can speak various languages (Chinese, Japanese, etc.), and Whisper will detect and translate them automatically.
Then, click the transcribe button to get the text you've just said.
Finally, click the translate button to get the translation.
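Under the hood, the transcribe and translate buttons map onto Whisper's two built-in tasks. A minimal sketch, assuming direct calls to model.transcribe (these function names are illustrative, not run.py's actual API):

```python
def transcribe(model, audio_path):
    # The "transcribe" task keeps the text in the spoken language.
    return model.transcribe(audio_path, task="transcribe", fp16=False)["text"]

def translate(model, audio_path):
    # The "translate" task produces English regardless of the input language.
    return model.transcribe(audio_path, task="translate", fp16=False)["text"]
```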