Speech Library and Speech Recognition Demos

The automatic speech recognition sample distributed with the OpenVINO™ package demonstrates acoustic model inference based on Kaldi* neural networks. However, the sample works with Kaldi ARK files only, so by itself it cannot cover the natural end-to-end speech recognition scenario (speech to text). That scenario requires additional preprocessing (feature extraction) to get feature vectors from the speech signal, as well as postprocessing (decoding) to produce text from scores:

speech_sample.png

Starting with the 2020.1 release, we provide a set of libraries and demos that demonstrate end-to-end speech recognition, together with new acoustic and language models that work with these demos. The libraries and tools distributed with OpenVINO™ therefore cover the full end-to-end speech processing scenario:

new_speech_demos.png

All new content is available at the following location: <INSTALL_DIR>/data_processing/audio/speech_recognition.

NOTE: This content is installed only if the Inference Engine Runtime for Intel® Gaussian & Neural Accelerator component is selected during installation. However, the speech library and speech recognition demos do not need a GNA accelerator to work (see the "Hardware support" section below).

Package content

The package includes the following components:

Additionally, new acoustic and language models are available on download.01.org for use with the new demos.

How to run speech recognition demos using pre-trained models

A single batch file or shell script is provided to download the pre-trained models and build all dependencies:

On Linux*: <INSTALL_DIR>/deployment_tools/demo/demo_speech_recognition.sh

On Windows*: <INSTALL_DIR>\deployment_tools\demo\demo_speech_recognition.bat

This script:

If you are behind a proxy, set the following environment variables in the console session before running the script:

On Linux* and macOS*:

export http_proxy=http://{proxyHost}:{proxyPort}
export https_proxy=https://{proxyHost}:{proxyPort}

On Windows* OS:

set http_proxy=http://{proxyHost}:{proxyPort}
set https_proxy=https://{proxyHost}:{proxyPort}
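
On Linux* and macOS*, the two proxy variables can also be set from a single host and port pair. The following is a minimal sketch, not part of the OpenVINO™ package: `set_demo_proxy` is a hypothetical helper name, and `proxy.example.com` and `8080` are placeholder values to replace with your own. It uses the same URL forms as the commands above.

```shell
#!/usr/bin/env bash
# Sketch only: set_demo_proxy is a hypothetical helper, not an OpenVINO tool.
# It exports http_proxy and https_proxy in the same form as the commands above.
set_demo_proxy() {
    local host="$1" port="$2"
    # Refuse to export half-configured values.
    if [ -z "$host" ] || [ -z "$port" ]; then
        echo "usage: set_demo_proxy <proxyHost> <proxyPort>" >&2
        return 1
    fi
    export http_proxy="http://${host}:${port}"
    export https_proxy="https://${host}:${port}"
}

# Placeholder values (assumption): replace with your proxy host and port.
set_demo_proxy proxy.example.com 8080

# Show what the demo script will see in this console session.
echo "http_proxy=$http_proxy"
echo "https_proxy=$https_proxy"
```

Because the variables are exported, they are only visible to the demo script if it is started from the same console session.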

Hardware support

The provided acoustic models have been tested on CPU, GPU, and GNA, and you can switch between these targets in the offline and live speech recognition demos. Intel® Gaussian & Neural Accelerator (Intel® GNA) is a low-power co-processor (available on the Intel® Core™ i3-8121U processor, the Intel® Core™ i7-1065G7 processor, and some others) designed to offload some workloads, thus saving power and CPU resources. If you are using one of the processors that support GNA, you may notice that CPU load is much lower when GNA is selected. If GNA is selected as the inference device and your processor does not support GNA, execution is performed in emulation mode (on CPU), because the GNA_AUTO configuration option is used. Check the GNA plugin documentation for more information.

The speech library provides a highly optimized implementation of preprocessing and postprocessing (feature extraction and decoding) on CPU only.

What is required to use custom models

In order to run demonstration applications with custom models one needs to:

After these steps, the new models can be used by the live speech recognition demo. To perform speech recognition using a new model with the command-line application, provide the path to the new configuration file as an input argument to the application.

Conversion of acoustic model using OpenVINO™ Model Optimizer for Kaldi*

In order to convert acoustic models, the following Kaldi files are required:

For conversion steps, follow the OpenVINO™ documentation. The path to the converted model (XML file) must be set in the configuration file.

Conversion of language model using the provided converter

In order to convert a language model, the following Kaldi files are required:

All files are required to create resources for demo applications.

Model conversion from Kaldi requires the following steps:

  1. Save the HCLG WFST as an OpenFst const-type FST: $KALDI_ROOT/tools/openfst/bin/fstconvert --fst_type=const HCLG.fst HCLG.const_fst
  2. Generate transition ID information using phones.txt and final.mdl: $KALDI_ROOT/src/bin/show-transitions phones.txt final.mdl > transitions.txt
  3. Convert HCLG WFST using resource conversion executable: kaldi_slm_conversion_tool HCLG.const_fst transitions.txt words.txt cl.fst labels.bin
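
The three steps above can be chained in one script. The following is a minimal sketch under stated assumptions: `KALDI_ROOT` points to a built Kaldi tree (the `/opt/kaldi` default is only a placeholder), `kaldi_slm_conversion_tool` is on `PATH`, and the input files are in the current directory. The `DRY_RUN` switch and the `run` and `convert_language_model` names are hypothetical additions for illustration; with `DRY_RUN=1` (the default here) the commands are only printed.

```shell
#!/usr/bin/env bash
# Sketch only: chains the three documented conversion steps.
# Assumptions: KALDI_ROOT is a built Kaldi tree (placeholder default below),
# kaldi_slm_conversion_tool is on PATH, inputs are in the current directory.
KALDI_ROOT="${KALDI_ROOT:-/opt/kaldi}"
# Hypothetical switch: 1 prints the commands, 0 executes them.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

convert_language_model() {
    # Step 1: save the HCLG WFST as an OpenFst const-type FST.
    run "$KALDI_ROOT/tools/openfst/bin/fstconvert" \
        --fst_type=const HCLG.fst HCLG.const_fst || return 1

    # Step 2: generate transition ID information.
    # Handled separately because the output is redirected to a file.
    if [ "$DRY_RUN" = "1" ]; then
        echo "$KALDI_ROOT/src/bin/show-transitions phones.txt final.mdl > transitions.txt"
    else
        "$KALDI_ROOT/src/bin/show-transitions" phones.txt final.mdl \
            > transitions.txt || return 1
    fi

    # Step 3: convert the HCLG WFST into cl.fst and labels.bin.
    run kaldi_slm_conversion_tool HCLG.const_fst transitions.txt \
        words.txt cl.fst labels.bin || return 1
}

convert_language_model
```

To perform the actual conversion, set `DRY_RUN=0` and `KALDI_ROOT` in the environment before running the script. Each step returns early on failure so that a later step never runs on a missing intermediate file.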

The paths to the cl.fst and labels.bin files must be set in the configuration file so that they can be used with the Live Speech Recognition Demo application.

Refer to the offline speech recognition demo documentation to learn about the configuration file format.

You can find here more information on the conversion tool.