It only works on short audio clips, about 5 seconds or so. (We should have documented this better; I just put in a PR adding it to the documentation.)
However, you can use voice activity detection (VAD), for example webrtcvad from PyPI, to chop long audio into smaller chunks that it can digest.
Maybe we should just put VAD in the client and have this occur automatically?
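For anyone who wants to try the chopping step by hand in the meantime, here is a rough sketch of how it could look. It is not what the client does. It assumes 16 kHz, 16-bit mono PCM and takes the speech detector as a plain callable, so the grouping logic can be tried without webrtcvad installed. The names `frames` and `split_on_silence`, the 300 ms silence threshold, and the 5 second cap are all made up for illustration.

```python
from typing import Callable, Iterator, List, Tuple

SAMPLE_RATE = 16000      # assumption: 16 kHz input
FRAME_MS = 30            # webrtcvad accepts 10, 20 or 30 ms frames
BYTES_PER_SAMPLE = 2     # assumption: 16-bit mono PCM


def frames(pcm: bytes, sample_rate: int = SAMPLE_RATE,
           frame_ms: int = FRAME_MS) -> Iterator[bytes]:
    """Yield fixed-size frames; a trailing partial frame is dropped."""
    size = sample_rate * frame_ms // 1000 * BYTES_PER_SAMPLE
    for start in range(0, len(pcm) - size + 1, size):
        yield pcm[start:start + size]


def split_on_silence(pcm: bytes,
                     is_speech: Callable[[bytes, int], bool],
                     sample_rate: int = SAMPLE_RATE,
                     frame_ms: int = FRAME_MS,
                     max_silence_frames: int = 10,
                     max_segment_ms: int = 5000
                     ) -> List[Tuple[int, int, bytes]]:
    """Return (start_ms, end_ms, audio) chunks of voiced audio.

    A chunk ends after `max_silence_frames` non-speech frames in a row,
    or when it reaches `max_segment_ms`, whichever comes first.
    """
    max_frames = max_segment_ms // frame_ms
    segments = []
    current = []        # frames of the chunk being built
    start_idx = 0       # frame index where the current chunk began
    silence_run = 0     # consecutive non-speech frames at the chunk's tail

    for idx, frame in enumerate(frames(pcm, sample_rate, frame_ms)):
        speech = is_speech(frame, sample_rate)
        if not current:
            if speech:  # a new chunk starts on the first voiced frame
                current, start_idx, silence_run = [frame], idx, 0
            continue
        current.append(frame)
        silence_run = 0 if speech else silence_run + 1
        if silence_run >= max_silence_frames:
            del current[-silence_run:]  # trim the trailing silence
            end_idx = start_idx + len(current)
            segments.append((start_idx * frame_ms, end_idx * frame_ms,
                             b"".join(current)))
            current, silence_run = [], 0
        elif len(current) >= max_frames:
            segments.append((start_idx * frame_ms, (idx + 1) * frame_ms,
                             b"".join(current)))
            current, silence_run = [], 0

    if current:  # flush whatever is left at end of input
        end_idx = start_idx + len(current)
        segments.append((start_idx * frame_ms, end_idx * frame_ms,
                         b"".join(current)))
    return segments


# With webrtcvad installed, the detector would be plugged in like this:
#   import webrtcvad
#   vad = webrtcvad.Vad(2)   # aggressiveness from 0 to 3
#   chunks = split_on_silence(pcm, vad.is_speech)
```

Each chunk keeps its start and end offset in milliseconds, so the transcribed text can be lined up with the original recording afterwards. The hard cut at 5 seconds can land in the middle of a word; a real tool would probably look for the nearest pause instead.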
Personally, I'd love to see that as part of the client.
Just this week I started looking into how I could generate transcripts for a bunch of videos. Even if the transcripts aren't perfect, they help with tagging and with searching large video collections for certain keywords.
Sadly, I didn't have any luck with local solutions. I managed to generate a few transcripts using GCP's Cloud Speech API with minimal hassle, but I'd much prefer to do it locally.
I was planning on trying this out later today, and had already downloaded the Common Voice corpus. Having to add another step to break up the input into smaller chunks probably isn't a huge deal, but I wouldn't have known what tool to use in order to achieve that.
Do you know of any comparisons between various speech-to-text tools? I've avoided commercial tools so far because I'm hesitant to drop $250+ just for playing around, but I'd be interested in seeing if they're truly superior to existing open alternatives.
Added an enhancement request, issue 1064[1], on GitHub, asking for the clients to support longer audio clips.
I can't promise when we'll get to it, as from now until the new year is a bit of a wash.
I don't know of any detailed comparisons of commercial solutions. However, with respect to pure word error rate, the article[2] compares several engines as they stood circa 2015.
Thanks, didn’t know about the 5 seconds. If this chopping tool can generate a map back to the original audio, that means TED-style subtitles are going to be possible.
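To make the subtitle idea concrete: if the chopping step hands back a start and end offset for every chunk, writing a SubRip (.srt) file is mostly string formatting. A minimal sketch, assuming each cue is a (start_ms, end_ms, text) tuple with offsets in milliseconds; the function names `srt_timestamp` and `to_srt` are hypothetical, and the text would come from whatever engine transcribed the chunk.

```python
from typing import Iterable, Tuple


def srt_timestamp(ms: int) -> str:
    """Format a millisecond offset as an SRT timestamp, HH:MM:SS,mmm."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def to_srt(cues: Iterable[Tuple[int, int, str]]) -> str:
    """Turn (start_ms, end_ms, text) cues into the body of an .srt file."""
    blocks = []
    for number, (start_ms, end_ms, text) in enumerate(cues, start=1):
        blocks.append(
            f"{number}\n"
            f"{srt_timestamp(start_ms)} --> {srt_timestamp(end_ms)}\n"
            f"{text}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(blocks)


print(to_srt([(150, 750, "hello"), (1200, 1500, "world")]))
```

The offsets only stay accurate if the chunker reports them relative to the untrimmed original file, not relative to the concatenated speech.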
Out of interest, do you also work on the reverse problem, text-to-speech? Most open source engines sadly still can't compete with commercial alternatives.
Thank you so much for this link; that is the best text-to-speech with an open architecture I've heard so far. Under https://github.com/keithito/tacotron you can find a pre-trained model based on this paper, although it doesn't match that quality yet. Maybe I can get some cluster time to train a new model using multiple datasets.