Open Source Transcription | Notes In The Margin

Today I started looking for a command line, open source, audio-file-to-text conversion tool.

Given that CMU Sphinx was available in the 1990s, it feels like, by now, there should be something of this nature. So I went to Google and started looking for open source speech-to-text conversion tools.

There’s certainly no shortage of projects.

Spoiler/Conclusion

It certainly appears that all of these systems are designed by researchers, for researchers. They do not expect mere humans to be able to use them yet, and so the documentation is heavy on technical detail, and light on … y’know … actually using it.

Vosk appears to be the best, but that’s only because it’s literally the only one that I could get working. The output seems to be JSON, and I am not yet sure how I would actually use it in real life, for an actual recording.

Julius looks promising, in that it’s actually packaged for Fedora, but the documentation is not particularly helpful, and the command line help, while copious, is completely unhelpful, because it assumes that you are already an expert and know all the jargon. I kind of feel like Julius would be the best option, if I had someone to show me how to get started.

DeepSpeech2 (Now actually called PaddlePaddleSpeech) also looks like it would be exactly what I want, if the actual software did anything like what the documentation suggests. Unfortunately, I was not able to get through the installation.

It is notable that all of these tools refer to themselves as toolkits, rather than applications, or anything else that would indicate that it could be used to actually convert speech to text in a daily workflow.

Sources:

https://fosspost.org/open-source-speech-recognition/

https://softwarerecs.stackexchange.com/questions/34740/open-source-speech-to-text-software-for-audio-files-in-english

https://en.wikipedia.org/wiki/List_of_speech_recognition_software

Trying stuff …

Kaldi

http://kaldi-asr.org/doc/index.html

Not in Fedora
Not … anywhere. Have to install directly from GitHub, which is … not awesome, but trying it …
… yeah, gonna move this to the bottom of the list. The INSTALL file is very off-putting.

Might come back to this later if everything else fails.

Julius

https://github.com/julius-speech/julius

It’s in Fedora, so that’s a good start
Easy to install, and extensive help
Seems to assume that you know a lot about voice recog, but …

First try:

julius -input file -filelist donna_audio_only.wav ERROR: m_chkparam: you should specify at least one LM to run Julius!

Ok, what’s an LM?

Last update do the docs … 3 years. Uh oh …

Readme, at https://github.com/julius-speech/julius , is completely inscrutable. sigh Ok, moving along.

Wav2Letter

https://github.com/flashlight/wav2letter

Not in Fedora.
Github is pretty accurate, but … the project appears to have redirected to a new home

Um …
https://github.com/flashlight/flashlight

I have literally no idea where to start. Moving along.

DeepSpeech2

This one also is apparently now a different project, called PaddlePaddleSpeech

https://github.com/PaddlePaddle/PaddleSpeech

The Readme suggests that this is exactly what I want:

paddlespeech asr --lang zh --input input_16k.wav

So, here goes…

Of course, it’s not in Fedora, but it’s in Python, so it should probably Just Work, right?

git clone and there is not, of course, any actual executable or script named paddlespeech, so apparently the docs are oversimplifying something.

https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md looks like a promising place to look …

conda install -y -c conda-forge sox libsndfile bzip2

Um … ok. Installing conda

Next, I need a C++ compiler … so the initial note that this is in Python was in error.

And once Conda is installed, I get:

NoBaseEnvironmentError: This conda installation has no default base environment. Use 'conda create' to create new environments and 'conda activate' to activate environments.

So I’m already deeper in this yak stack than I really want to be.

There’s also a docker option, but this has literally never actually worked for me, so … let’s try the next thing.

This is very disappointing, because this looked the most promising so far.

Vosk

I followed the instructions at https://alphacephei.com/vosk/install

pip3 install vosk

And then ran the “Usage example”

git clone https://github.com/alphacep/vosk-api

cd vosk-api/python/example

wget https://alphacephei.com/kaldi/models/vosk-model-small-en-us-0.15.zip

unzip vosk-model-small-en-us-0.15.zip

mv vosk-model-small-en-us-0.15 model

python3 ./test_simple.py test.wav

Amazingly, this worked. However, the output is awful:

A million of these:

{ "text" : "come out be happy to israeli set up another time for that i'd be happy to rich thanks for the chat and i have fun putting that together think you are a bank so i only look forward to talking to region by it out" }

I’m not sure how you’re actually supposed to use it in the real world.

The output is close-ish to the actual transcription, but is in a format that will make it hard to use. But it’s promising, and I’ll come back to it if nothing else works better.

Athena

Not in Fedora, of course.

https://github.com/athena-team/athena

Installation instructions includes 7 sections:

3.1) Creating a virtual environment [Optional]
3.2) Install tensorflow backend
3.3) Install sph2pipe, spm, kenlm, sclite for ASR Tasks [Optional]
3.4) Install horovod for multiple-device training [Optional]
3.5) Install pydecoder for WFST decoding [Optional]
3.6) Install athena package
3.7) Test your installation

That’s a no from me. Too many moving parts. I’ll come back to it if all else fails.

Conclusion

I think I’m going to try to use Vosk to script up something that gives me a starting point, and work from there. It’s discouraging that 30+ years on from CMU Sphinx, this is still the state of the art, and that people are generally split between “Outsource it to these folks, they do good work” and “You should use Dragon Naturally Speaking.”

FWIW, Dragon, also isn’t awesome, and is particularly bad when there’s more than one voice on the recording.