Speech Recognition DeepSpeech Python* Demo

This demo shows Automatic Speech Recognition (ASR) with a pretrained Mozilla* DeepSpeech 0.8.2 model.

The demo also works with version 0.6.1, and should work with other models trained with Mozilla DeepSpeech 0.6.x/0.7.x/0.8.x/0.9.x that use ASCII alphabets.

How It Works

The application accepts

  • a Mozilla* DeepSpeech 0.8.2 neural network in Intermediate Representation (IR) format,

  • an n-gram language model file in KenLM quantized binary format, and

  • an audio file in PCM WAV 16 kHz mono format.

The application has two modes:

  • Normal (offline) mode, the default. Audio data is streamed in 10-second chunks into a streaming pipeline that computes audio features, runs the neural network to get per-frame character probabilities, and performs CTC decoding. After processing the whole file, the demo prints the decoded text and the time spent.

  • Simulated real-time mode. The app simulates speech recognition of a live recording by feeding audio data from the input file and displaying the current partial result as a creeping line in the console output. Data is fed at real-time speed by introducing the necessary delays. Audio data is fed in 0.32-second chunks (the size is controlled by the --block-size option) into the same streaming pipeline. In this mode the pipeline provides an updated recognition result after each data chunk.
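
The two chunk sizes above can be related with a little arithmetic. The following is a minimal sketch, not the demo's own code: it assumes 16 kHz audio, infers a frame stride of 320 samples from "0.32 sec = 16 frame strides", and uses hypothetical helper names (iter_blocks, feed_in_real_time) to show how samples could be split into blocks and paced at real-time speed.

```python
import time

SAMPLE_RATE = 16000   # Hz, the only sample rate the demo accepts
FRAME_STRIDE = 320    # samples; inferred from 0.32 s / 16 strides at 16 kHz

OFFLINE_BLOCK = 10 * SAMPLE_RATE   # 10 s of audio -> 160000 samples
ONLINE_BLOCK = 16 * FRAME_STRIDE   # 16 frame strides -> 5120 samples (0.32 s)


def iter_blocks(samples, block_size):
    """Yield consecutive slices of `samples`; the last one may be shorter."""
    for start in range(0, len(samples), block_size):
        yield samples[start:start + block_size]


def feed_in_real_time(samples, block_size, process, sample_rate=SAMPLE_RATE,
                      sleep=time.sleep, clock=time.monotonic):
    """Call `process(block)` for each block, but not before the moment the
    block would have finished recording in a live session."""
    start = clock()
    fed = 0
    for block in iter_blocks(samples, block_size):
        fed += len(block)
        # Time at which this block's last sample would exist in a live stream.
        delay = start + fed / sample_rate - clock()
        if delay > 0:
            sleep(delay)
        process(block)
```

Passing sleep and clock as parameters is a design choice of this sketch: it lets the pacing logic be exercised without actually waiting.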

Preparing to Run

The models supported by the demo are listed in the <omz_dir>/demos/speech_recognition_deepspeech_demo/python/models.lst file. This file can be passed to Model Downloader and Converter to download and, if necessary, convert the models to OpenVINO IR format (*.xml + *.bin). Don’t forget to configure Model Optimizer, which is a requirement for Model Downloader, as described in its documentation.

An example of using the Model Downloader:

omz_downloader --list models.lst

An example of using the Model Converter:

omz_converter --list models.lst

Please pay attention to the model license, Mozilla Public License 2.0.

Prerequisites

The demo depends on the ctcdecode_numpy Python extension module, which implements CTC decoding in C++ for speed. Please refer to Open Model Zoo demos for instructions on how to build the extension module and prepare the environment for running the demo. Alternatively, instead of using cmake, you can run python -m pip install . inside the ctcdecode-numpy directory to build and install ctcdecode-numpy.

Supported Models

  • mozilla-deepspeech-0.6.1

  • mozilla-deepspeech-0.8.2

NOTE: Refer to the tables Intel’s Pre-Trained Models Device Support and Public Pre-Trained Models Device Support for details on model inference support on different devices.

Running Demo

Run the application with the -h option to see the help message. Here are the available command-line options:

usage: speech_recognition_deepspeech_demo.py [-h] -i FILENAME [-d DEVICE] -m
                                             FILENAME [-L FILENAME] -p NAME
                                             [-b N] [-c N] [--realtime]
                                             [--block-size BLOCK_SIZE]
                                             [--realtime-window REALTIME_WINDOW]

Speech recognition DeepSpeech demo

optional arguments:
  -h, --help            show this help message and exit
  -i FILENAME, --input FILENAME
                        Required. Path to an audio file in WAV PCM 16 kHz mono format
  -d DEVICE, --device DEVICE
                        Optional. Specify the target device to infer on, for
                        example: CPU or GPU or HETERO. The
                        demo will look for a suitable OpenVINO Runtime plugin for this
                        device. (default is CPU)
  -m FILENAME, --model FILENAME
                        Required. Path to an .xml file with a trained model
  -L FILENAME, --lm FILENAME
                        Optional. Path to language model file
  -p NAME, --profile NAME
                        Required. Choose pre/post-processing profile: mds06x_en
                        for Mozilla DeepSpeech v0.6.x,
                        mds07x_en/mds08x_en/mds09x_en for Mozilla DeepSpeech
                        v0.7.x/v0.8.x/v0.9.x(English), other: filename of a
                        YAML file
  -b N, --beam-width N  Beam width for beam search in CTC decoder (default
                        500)
  -c N, --max-candidates N
                        Show top N (or less) candidates (default 1)
  --realtime            Simulated real-time mode: slow down data feeding to
                        real time and show partial transcription during
                        recognition
  --block-size BLOCK_SIZE
                        Block size in audio samples for streaming into ASR
                        pipeline (defaults to samples in 10 sec for offline;
                        samples in 16 frame strides for online)
  --realtime-window REALTIME_WINDOW
                        In simulated real-time mode, show this many characters
                        on screen (default 79)
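
The --realtime-window option limits how much of the partial transcription is visible at once. One way such a creeping line could be drawn is sketched below. The helper name creeping_line and the choice to keep the tail of the text are assumptions for illustration, not the demo's actual code.

```python
def creeping_line(text, width=79):
    """Return the last `width` characters of `text`, padded with spaces so
    that redrawing overwrites any leftovers from a previous, longer line."""
    tail = text[max(0, len(text) - width):]
    return tail.ljust(width)


def show_partial(text, width=79):
    """Redraw the current partial result on a single console line."""
    # "\r" moves the cursor back to the start of the line without a newline.
    print("\r" + creeping_line(text, width), end="", flush=True)
```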

The typical command line for offline mode is:

python3 speech_recognition_deepspeech_demo.py \
    -p mds08x_en \
    -m <path_to_model>/mozilla-deepspeech-0.8.2.xml \
    -L <path_to_file>/deepspeech-0.8.2-models.kenlm \
    -i <path_to_audio>/audio.wav

For version 0.6.1 it is:

python3 speech_recognition_deepspeech_demo.py \
    -p mds06x_en \
    -m <path_to_model>/mozilla-deepspeech-0.6.1.xml \
    -L <path_to_file>/lm.binary \
    -i <path_to_audio>/audio.wav

To run in simulated real-time mode, add the --realtime command-line option.

NOTE: Only 16-bit, 16 kHz, mono-channel WAVE audio files are supported.
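
An input file can be checked against these requirements before running the demo. The sketch below uses Python's standard wave module; the function name check_wav_format is hypothetical and is not part of the demo.

```python
import wave


def check_wav_format(path):
    """Return a list of format problems; an empty list means the file is
    16-bit, 16 kHz, mono PCM. wave.open raises wave.Error for files it
    cannot parse, including non-PCM WAV files."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1:
            problems.append(f"expected 1 channel, got {wav.getnchannels()}")
        if wav.getsampwidth() != 2:  # sample width is reported in bytes
            problems.append(f"expected 16-bit samples, got {8 * wav.getsampwidth()}-bit")
        if wav.getframerate() != 16000:
            problems.append(f"expected 16000 Hz, got {wav.getframerate()} Hz")
    return problems
```

For example, check_wav_format("audio.wav") returns an empty list for a suitable file and one message per mismatch otherwise.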

The optional (but highly recommended) language model files, deepspeech-0.8.2-models.kenlm or lm.binary, are part of the corresponding model's downloaded content and are located in the Model Downloader output folder after the model is downloaded and converted. An example audio file is available at https://storage.openvinotoolkit.org/models_contrib/speech/2021.2/librispeech_s5/how_are_you_doing_today.wav.

Demo Output

The application shows the time taken by the initialization and processing stages, and the decoded text for the audio file. In real-time mode the current recognition result is also shown while the app is running. In offline mode the demo reports:

  • Latency: total processing time required to process input data (from reading the data to displaying the results).