The automatic speech recognition sample distributed as part of the OpenVINO™ package is a good demonstration of acoustic model inference based on Kaldi* neural networks. However, this sample works with Kaldi ARK files only, so it cannot, by itself, cover the natural end-to-end speech recognition scenario (speech to text). It requires additional preprocessing (feature extraction) to produce a feature vector from the speech signal, as well as postprocessing (decoding) to produce text from scores.
Starting with the 2020.1 release, we provide a set of libraries and demos that demonstrate end-to-end speech recognition, as well as new acoustic and language models that work with these demos. The full end-to-end speech processing scenario is therefore covered and demonstrated by the libraries and tools distributed with OpenVINO™.
All new content is available at the following location: <INSTALL_DIR>/data_processing/audio/speech_recognition.
NOTE: This content is installed only if the Inference Engine Runtime for Intel® Gaussian & Neural Accelerator component is selected during installation. However, the speech library and the speech recognition demos do not require the GNA accelerator to work (see the "Hardware support" section below).
The package includes the following components:
- the speech library, which implements feature extraction (preprocessing) and decoding (postprocessing)
- an offline speech recognition demo
- a live speech recognition demo
- the Kaldi statistical language model conversion tool (kaldi_slm_conversion_tool)
Additionally, new acoustic and language models are published on download.01.org for use with the new demos.
In order to download pre-trained models and build all dependencies, a single batch file / shell script is provided:
On Linux*: <INSTALL_DIR>/deployment_tools/demo/demo_speech_recognition.sh
On Windows*: <INSTALL_DIR>\deployment_tools\demo\demo_speech_recognition.bat
This script downloads the pre-trained acoustic and language models and builds the speech library and the demo applications together with their dependencies.
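For example, on Linux* the script can be launched from a terminal as follows (on Windows*, run the .bat file instead):
cd <INSTALL_DIR>/deployment_tools/demo
./demo_speech_recognition.sh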
If you are behind a proxy, set the following environment variables in the console session before running the script (replace the placeholder proxy host and port with the values for your proxy):
On Linux* and macOS*:
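export http_proxy=http://{proxyHost}:{proxyPort}
export https_proxy=https://{proxyHost}:{proxyPort}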
On Windows* OS:
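set http_proxy=http://{proxyHost}:{proxyPort}
set https_proxy=https://{proxyHost}:{proxyPort}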
The provided acoustic models have been tested on CPU, GPU, and GNA, and you can switch between these targets in the offline and live speech recognition demos. Note that Intel® Gaussian & Neural Accelerator (Intel® GNA) is a specific low-power co-processor (available on the Intel® Core™ i3-8121U processor, the Intel® Core™ i7-1065G7 processor, and some others) designed to offload certain workloads, thus saving power and CPU resources. If you are using one of the processors that supports GNA, you may notice that the CPU load is much lower when GNA is selected. If GNA is selected as the inference device but your processor does not support GNA, execution is performed in emulation mode (on the CPU), because the GNA_AUTO configuration option is used. Please check the GNA plugin documentation for more information.
The speech library provides a highly optimized implementation of preprocessing and postprocessing (feature extraction and decoding) on CPU only.
In order to run the demonstration applications with custom models, you need to:
1. Build the speech library and the demo applications by running the demo_speech_recognition.sh/.bat file mentioned in the "How to run..." section.
2. Copy the new model files and their configuration file to {OpenVINO build folder}/data_processing/audio/speech_recognition/models/{LANG}. The demo models are trained for US English, so en-us is used as the {LANG} folder name.

After these steps, the new models can be used by the live speech recognition demo. To perform speech recognition using a new model with the command-line application, provide the path to the new configuration file as an input argument of the application.
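For example, an invocation of the offline demo could look like the following; note that the executable and argument names here are assumptions for illustration only, so check the offline speech recognition demo documentation or the application's help output for the exact usage:
offline_speech_recognition_app -wave=<path_to_wav_file> -c=<path_to_configuration_file>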
In order to convert acoustic models, the following Kaldi files are required:
- final.nnet (RAW neural network without topology information)
- pdf.counts (if used)
- final.feature_transform (if used)

For conversion steps, please follow the OpenVINO™ documentation. The path to the converted model (xml file) shall be set in the configuration file.
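For reference, a Model Optimizer invocation for a Kaldi acoustic model typically looks like the following sketch; the exact set of options depends on your model, so check the Model Optimizer documentation for converting Kaldi models (the output directory is a placeholder):
python3 <INSTALL_DIR>/deployment_tools/model_optimizer/mo.py --framework kaldi --input_model final.nnet --counts pdf.counts --remove_output_softmax --output_dir <output_directory>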
In order to convert a language model, the following Kaldi files are required:
- final.mdl
- HCLG.wfst
- words.txt

All files are required to create resources for the demo applications.
Model conversion from Kaldi requires the following steps:
1. Convert the HCLG WFST to a constant-type FST:
$KALDI_ROOT/tools/openfst/bin/fstconvert --fst_type=const HCLG.fst HCLG.const_fst
2. Print the HMM transition information from final.mdl to a text file:
$KALDI_ROOT/src/bin/show-transitions phones.txt final.mdl > transitions.txt
3. Run the conversion tool to produce the demo resources (cl.fst and labels.bin):
kaldi_slm_conversion_tool HCLG.const_fst transitions.txt words.txt cl.fst labels.bin
The paths to the cl.fst and labels.bin files must be put in the configuration file in order to be used with the Live Speech Recognition Demo Application.
Please refer to the offline speech recognition demo documentation to learn about the configuration file format.
You can find more information on the conversion tool here.