This topic shows how to run the speech sample application, which demonstrates acoustic model inference based on Kaldi* neural networks and speech feature vectors.
Running the application with the -h option yields the following usage message:
Running the application with no options yields the usage message given above and an error message.
NOTE: Before running the sample with a trained model, make sure the model is converted to the Inference Engine format (*.xml + *.bin) using the Model Optimizer tool.
You can use the following Model Optimizer command to convert a Kaldi nnet1 or nnet2 neural network to Intel IR format:
Assuming that the Model Optimizer (mo.py), the Kaldi-trained neural network, wsj_dnn5b_smbr.nnet, and the Kaldi class counts file, wsj_dnn5b_smbr.counts, are in the working directory, this produces the Intel IR network consisting of wsj_dnn5b_smbr.xml and wsj_dnn5b_smbr.bin.
The following pretrained models are available:
All of them can be downloaded from https://download.01.org/openvinotoolkit/2018_R3/models_contrib/GNA/.
Once the IR is created, you can use the following command to run inference on Intel® Processors with the GNA co-processor (or emulation library):
Here, the floating point Kaldi-generated reference neural network scores (wsj_dnn5b_smbr_dev93_scores_10.ark) corresponding to the input feature file (wsj_dnn5b_smbr_dev93_10.ark) are assumed to be available for comparison.
The acoustic log likelihood sequences for all utterances are stored in the Kaldi ARK file, scores.ark. If the -r option is used, a report on the statistical score error, such as the following, is generated for each utterance:
Upon start-up, the speech_sample application reads command line parameters and loads a Kaldi-trained neural network along with a Kaldi ARK speech feature vector file to the Inference Engine plugin. It then performs inference on all speech utterances stored in the input ARK file. Context-windowed speech frames are processed in batches of 1 to 8 frames according to the -bs parameter. Batching across utterances is not supported by this sample. When inference is done, the application creates an output ARK file. If the -r option is given, error statistics are provided for each speech utterance as shown above.
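The batching behaviour described above can be sketched as follows. This is a minimal illustration in Python with NumPy, not the sample's actual code: the function name batch_frames, the dictionary layout for utterances, and the feature dimension of 440 are all assumptions made for the example. How the sample treats a final partial batch is not stated here; in this sketch it is simply shorter.

```python
import numpy as np

def batch_frames(utterance, batch_size):
    """Yield consecutive groups of at most batch_size frames from one utterance.

    utterance is a 2-D array of shape (num_frames, feature_dim).
    The last batch may hold fewer than batch_size frames.
    """
    if not 1 <= batch_size <= 8:
        raise ValueError("batch size must be between 1 and 8")
    for start in range(0, utterance.shape[0], batch_size):
        yield utterance[start:start + batch_size]

# Hypothetical input: utterance name -> feature matrix.
utterances = {
    "utt1": np.zeros((10, 440), dtype=np.float32),
    "utt2": np.zeros((3, 440), dtype=np.float32),
}

# Batches never span two utterances: each utterance is split on its own.
batch_sizes = {
    name: [len(batch) for batch in batch_frames(feats, batch_size=8)]
    for name, feats in utterances.items()
}
print(batch_sizes)
```

With a batch size of 8, the 10-frame utterance is split into batches of 8 and 2 frames, and the 3-frame utterance forms a single batch of 3; frames from the second utterance are never used to fill the short batch of the first.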
If the GNA device is selected (for example, using the -d GNA flag), the GNA Inference Engine plugin quantizes the model and input feature vector sequence to integer representation before performing inference. Several parameters control neural network quantization. The -q flag determines the quantization mode. Three modes are supported: static, dynamic, and user-defined. In static quantization mode, the first utterance in the input ARK file is scanned for dynamic range. The scale factor (floating point scalar multiplier) required to scale the maximum input value of the first utterance to 16384 (15 bits) is used for all subsequent inputs. The neural network is quantized to accommodate the scaled input dynamic range. In user-defined quantization mode, the user may specify a scale factor via the -sf flag that will be used for static quantization. In dynamic quantization mode, the scale factor for each input batch is computed just before inference on that batch. The input and network are (re)quantized on the fly using an efficient procedure.
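The difference between static and dynamic scale factors can be illustrated with a small sketch. It is written in Python with NumPy and is not the plugin's implementation: the helper names scale_factor and quantize, the 16-bit signed integer input type, the rounding and saturation behaviour, and the fallback for all-zero input are assumptions. Only the target value of 16384 comes from the description above.

```python
import numpy as np

TARGET_MAX = 16384.0  # target magnitude for the largest input value (from the text)

def scale_factor(frames):
    """Return the multiplier that maps the largest |value| in frames to TARGET_MAX."""
    peak = float(np.max(np.abs(frames)))
    if peak == 0.0:
        return 1.0  # assumption: fall back to 1.0 for all-zero input
    return TARGET_MAX / peak

def quantize(frames, sf):
    """Scale, round, and saturate to the 16-bit signed range (assumed input type)."""
    scaled = np.rint(frames * sf)
    return np.clip(scaled, -32768, 32767).astype(np.int16)

first_utt = np.array([[0.5, -2.0], [1.0, 0.25]], dtype=np.float32)
later_utt = np.array([[4.0, -1.0]], dtype=np.float32)

# Static mode: one scale factor from the first utterance, reused afterwards.
static_sf = scale_factor(first_utt)        # 16384 / 2.0 = 8192.0
static_q = quantize(later_utt, static_sf)  # 4.0 * 8192 = 32768, saturates to 32767

# Dynamic mode: a fresh scale factor computed for each batch.
dynamic_sf = scale_factor(later_utt)       # 16384 / 4.0 = 4096.0
dynamic_q = quantize(later_utt, dynamic_sf)

print(static_sf, static_q.tolist())
print(dynamic_sf, dynamic_q.tolist())
```

In this toy case the later utterance has a larger dynamic range than the first one, so the statically chosen scale factor saturates its largest value, while the per-batch factor does not. A user-defined scale factor would take the place of static_sf in the same way.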
The -qb flag provides a hint to the GNA plugin regarding the preferred target weight resolution for all layers. For example, when -qb 8 is specified, the plugin will use 8-bit weights wherever possible in the network. Note that it is not always possible to use 8-bit weights due to GNA hardware limitations. For example, convolutional layers always use 16-bit weights (GNA hardware versions 1 and 2). This limitation will be removed in GNA hardware version 3 and higher.
Several execution modes are supported via the -d flag. If the device is set to CPU and the GNA plugin is selected, the GNA device is emulated in fast-but-not-bit-exact mode. If the device is set to GNA_AUTO, then the GNA hardware is used if available and the driver is installed. Otherwise, the GNA device is emulated in fast-but-not-bit-exact mode. If the device is set to GNA_HW, then the GNA hardware is used if available and the driver is installed. Otherwise, an error will occur. If the device is set to GNA_SW, the GNA device is emulated in fast-but-not-bit-exact mode. Finally, if the device is set to GNA_SW_EXACT, the GNA device is emulated in bit-exact mode.
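The selection rules for the GNA_* device values can be summarized as a small decision function. This is only a Python sketch of the rules as described above, not code from the plugin: the function name resolve_gna_mode, the hardware_available argument, and the returned strings are hypothetical. The CPU case is left out because it also depends on whether the GNA plugin is selected.

```python
HARDWARE = "GNA hardware"
EMULATED_FAST = "software emulation (fast, not bit-exact)"
EMULATED_EXACT = "software emulation (bit-exact)"

def resolve_gna_mode(device, hardware_available):
    """Return how inference would run for a given -d value.

    hardware_available is True when the GNA hardware is present
    and its driver is installed.
    """
    if device == "GNA_AUTO":
        # Prefer hardware, fall back to fast emulation.
        return HARDWARE if hardware_available else EMULATED_FAST
    if device == "GNA_HW":
        # Hardware is mandatory; there is no fallback.
        if not hardware_available:
            raise RuntimeError("GNA_HW requested but GNA hardware or driver is missing")
        return HARDWARE
    if device == "GNA_SW":
        return EMULATED_FAST
    if device == "GNA_SW_EXACT":
        return EMULATED_EXACT
    raise ValueError(f"unsupported device: {device}")

for dev in ("GNA_AUTO", "GNA_SW", "GNA_SW_EXACT"):
    print(dev, "->", resolve_gna_mode(dev, hardware_available=False))
```

On a machine without GNA hardware, GNA_AUTO and GNA_SW both resolve to fast emulation, GNA_SW_EXACT resolves to bit-exact emulation, and GNA_HW raises an error.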
The GNA plugin supports loading and saving of the GNA-optimized model (non-IR) via the -rg and -wg flags. This makes it possible to avoid the cost of full model quantization at run time. The GNA plugin also supports export of firmware-compatible embedded model images for the Intel® Speech Enabling Developer Kit and Amazon Alexa* Premium Far-Field Voice Development Kit via the -we flag (save only).
In addition to performing inference directly from a GNA model file, these options make it possible to:
- Convert a model from IR format to a GNA format model file (-m, -wg)
- Convert a model from IR format to an embedded format model file (-m, -we)
- Convert a model from GNA format to an embedded format model file (-rg, -we)
The Wall Street Journal DNN model used in this example was prepared using the Kaldi s5 recipe and the Kaldi Nnet (nnet1) framework. It is possible to recognize speech by substituting speech_sample for Kaldi's nnet-forward command. Because speech_sample does not yet use pipes, it is necessary to use temporary files for the speaker-transformed feature vectors and scores when running the Kaldi speech recognition pipeline. The following operations assume that feature extraction was already performed according to the s5 recipe and that the working directory within the Kaldi source tree is egs/wsj/s5.
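1. Prepare a speaker-transformed feature set, given the feature transform specified in final.feature_transform and the feature files specified in feats.scp:
2. Score the feature set using speech_sample:
3. Run the Kaldi decoder to select the most likely text, given the WFST (HCLG.fst), vocabulary (words.txt), and TID/PID mapping (final.mdl):
4. Run the word error rate tool to check accuracy, given the vocabulary (words.txt) and reference transcript (test_filt.txt):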