I ran into some minor glitches trying to install and use DeepSpeech couple of da...

kdavis · on Dec 3, 2017

It only works on short audio clips, about 5 seconds or so. (We should have documented this better, but I just put in a PR adding this to the documentation.)

However, you can use voice activity detection (VAD), for example webrtcvad from PyPI, to chop long audio into smaller pieces that it can digest.
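
A minimal sketch of that chopping step, under stated assumptions: the audio is 16 kHz, 16-bit mono PCM, and the speech detector is a plain energy threshold standing in for a real VAD. It is not webrtcvad; `energy_is_speech`, `split_on_silence`, and all thresholds are hypothetical names and values chosen for illustration. A real detector such as webrtcvad's `Vad.is_speech(frame, sample_rate)` could be passed in as `is_speech` instead.

```python
import struct
from typing import Callable, List, Tuple

SAMPLE_RATE = 16000  # Hz (assumed); samples are 16-bit little-endian mono PCM
FRAME_MS = 30        # frame length in ms (webrtcvad accepts 10, 20 or 30)
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 2 bytes per sample


def energy_is_speech(frame: bytes, threshold: float = 500.0) -> bool:
    """Stand-in detector: mean absolute amplitude above an arbitrary threshold."""
    samples = struct.unpack("<%dh" % (len(frame) // 2), frame)
    return sum(abs(s) for s in samples) / len(samples) > threshold


def split_on_silence(
    pcm: bytes,
    is_speech: Callable[[bytes], bool] = energy_is_speech,
    max_silence_frames: int = 10,
    max_chunk_s: float = 5.0,
) -> List[Tuple[float, float, bytes]]:
    """Return (start_s, end_s, chunk_bytes) for each voiced stretch.

    A chunk ends after `max_silence_frames` consecutive unvoiced frames,
    or once it reaches `max_chunk_s` seconds, whichever comes first.
    """
    max_frames = int(max_chunk_s * 1000 / FRAME_MS)
    n_frames = len(pcm) // FRAME_BYTES  # a trailing partial frame is ignored
    chunks = []
    start = None  # frame index where the current chunk began
    silence = 0   # consecutive unvoiced frames inside the current chunk

    def emit(first: int, last: int) -> None:
        # Frame indices double as offsets into the original audio.
        chunks.append((first * FRAME_MS / 1000, last * FRAME_MS / 1000,
                       pcm[first * FRAME_BYTES:last * FRAME_BYTES]))

    for i in range(n_frames):
        voiced = is_speech(pcm[i * FRAME_BYTES:(i + 1) * FRAME_BYTES])
        if start is None:
            if voiced:
                start, silence = i, 0
            continue
        silence = 0 if voiced else silence + 1
        if silence >= max_silence_frames or i + 1 - start >= max_frames:
            emit(start, i + 1 - silence)  # drop the trailing silence
            start, silence = None, 0
    if start is not None:
        emit(start, n_frames - silence)
    return chunks
```

Each returned chunk carries its start and end time in the original recording, so the pieces can be transcribed one by one and still be lined up with the source audio afterwards.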

Maybe we should just put VAD in the client and have this occur automatically?

TheAceOfHearts · on Dec 3, 2017

Personally, I'd love to see that as part of the client.

Just this week I started looking into how I could generate transcripts for a bunch of videos. Even if the transcripts aren't perfect, it helps with tagging and searching through large video collections that include certain keywords.

Sadly, I didn't have any luck with local solutions. I managed to generate a few transcripts using GCP's Cloud Speech API with minimal hassle, but I'd much prefer to do it locally.

I was planning on trying this out later today, and had already downloaded the Common Voice corpus. Having to add another step to break up the input into smaller chunks probably isn't a huge deal, but I wouldn't have known what tool to use in order to achieve that.

Do you know of any comparisons between various speech-to-text tools? I've avoided commercial tools so far because I'm hesitant to drop $250+ just for playing around, but I'd be interested in seeing if they're truly superior to existing open alternatives.

kdavis · on Dec 3, 2017

Added an enhancement request, issue 1064[1], on GitHub, asking for the clients to support longer audio clips.

I can't promise when we'll get to it, as from now until new year is a bit of a wash.

I don't know of any detailed comparisons of commercial solutions. However, with respect to pure word error rate, the article[2] compares several engines, circa 2015.

[1] https://github.com/mozilla/DeepSpeech/issues/1064

[2] https://arxiv.org/abs/1412.5567

ssttoo · on Dec 3, 2017

Thanks, didn’t know about the 5 seconds. If this chopping tool can generate a map to the original audio, that means subtitles TED-style are going to be possible.
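
As a rough illustration of the subtitle idea, assuming the chopping step reports a start and end time (in seconds) for each piece and each piece has already been transcribed, the segments could be written out in SubRip (SRT) form. The helper names `srt_timestamp` and `to_srt` are hypothetical, and the sample text is made up.

```python
from typing import Iterable, Tuple


def srt_timestamp(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def to_srt(segments: Iterable[Tuple[float, float, str]]) -> str:
    """Turn (start_s, end_s, text) triples into numbered SRT cues."""
    cues = []
    for n, (start, end, text) in enumerate(segments, start=1):
        cues.append(f"{n}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n")
    return "\n".join(cues)  # a blank line separates consecutive cues


if __name__ == "__main__":
    # Made-up segments purely for demonstration.
    print(to_srt([(0.6, 1.2, "hello"), (1.8, 2.4, "world")]))
```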

And thanks for your hard work, Mozilla peeps!

Barrin92 · on Dec 3, 2017

Out of interest, do you also work on a reverse solution, text-to-speech? Most open source engines sadly still can't compete with commercial alternatives.

egnehots · on Dec 3, 2017

Maybe Tacotron will interest you? It's an end-to-end model that's reasonably close to the state of the art:

https://google.github.io/tacotron/publications/tacotron/inde...

There are some open source implementations.

throwmenow_0140 · on Dec 4, 2017

Thank you so much for this link, that is the best text-to-speech with an open architecture I've ever heard 'til now. Under https://github.com/keithito/tacotron you can find a pre-trained model based on this paper, although it isn't matching the quality yet. Maybe I can get some cluster time to train a new model using multiple datasets.

Edit: Another interesting one: http://research.baidu.com/deep-voice-3-2000-speaker-neural-t...

Barrin92 · on Dec 3, 2017

that does look pretty good, thanks!

timc3 · on Dec 3, 2017

Any plans on making it work with longer audio?

kdavis · on Dec 3, 2017

We're working on it! :-)

I don't have an ETA, but it's in the works.