Offline Speech Recognition Demo

This demo provides a command line interface for automatic speech recognition using OpenVINO™. Components used by this executable:

How It Works

The application transcribes speech from a given WAV file and outputs the text to the console.


The application requires two command line parameters: one points to the audio file with the speech to be transcribed, and the other points to the configuration file describing the resources to be used for transcription.

Parameters for executable

-wave - Filepath to the input WAV file to be processed. The WAV file needs to be in the following format: RIFF WAVE PCM 16bit, 16kHz, 1 channel, with header.
-c, --config - Filepath to the configuration file containing paths to resources and other parameters.
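The format requirement above can be checked before the demo is launched. The following is a minimal sketch using Python's standard `wave` module; the helper name `check_wav_format` is hypothetical and is not part of the demo.

```python
import wave

# Hypothetical helper, not part of the demo: reports why a file does not
# match the required input format (PCM, 16-bit, 16 kHz, 1 channel).
def check_wav_format(path):
    problems = []
    try:
        with wave.open(path, "rb") as wav:
            if wav.getsampwidth() != 2:
                problems.append("sample width is %d bytes, expected 2 (16 bit)" % wav.getsampwidth())
            if wav.getframerate() != 16000:
                problems.append("sample rate is %d Hz, expected 16000" % wav.getframerate())
            if wav.getnchannels() != 1:
                problems.append("channel count is %d, expected 1" % wav.getnchannels())
    except (wave.Error, EOFError) as err:
        # Raised when the RIFF/WAVE header is missing or the encoding is
        # not one the wave module can read.
        problems.append("not a readable RIFF WAVE PCM file: %s" % err)
    return problems
```

An empty list means the header matches the stated requirements; each entry in a non-empty list names one mismatch.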

Example usage:

offline_speech_recognition_app.exe -wave="<path_to_audio>/inputAudio.wav" -c="<path_to_config>/configFile.cfg"
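The invocation above can also be scripted. This is a minimal sketch, assuming the executable is reachable under the name shown in the example; the function names `build_command` and `transcribe` are hypothetical, and the paths passed in are placeholders supplied by the caller.

```python
import subprocess

# Hypothetical helper: assembles the argument list using the two flags
# documented above. No shell quoting is needed in list form.
def build_command(executable, wav_path, config_path):
    return [executable, "-wave=" + wav_path, "-c=" + config_path]

# Hypothetical helper: runs the demo and returns whatever it printed
# to the console. Raises CalledProcessError on a non-zero exit code.
def transcribe(executable, wav_path, config_path):
    result = subprocess.run(
        build_command(executable, wav_path, config_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```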

Description of configuration file

The configuration file is a simple ASCII text file where:

Description of parameters

| Parameter | Description | Value used for demo |
| -fe:rt:numCeps | Number of MFCC cepstrums | 13 |
| -fe:rt:contextLeft | Number of past frames that will be stacked to form the input vector for neural network inference | 5 |
| -fe:rt:contextRight | Number of future frames that will be stacked to form the input vector for neural network inference | 5 |
| -fe:rt:hpfBeta | High pass filter beta coefficient, where 0.0f means no filtering | 0.0f |
| -fe:rt:inputDataType | Feature extraction input format description | INT16_16KHZ |
| -fe:rt:cepstralLifter | Lifting factor | 22.0f |
| -fe:rt:noDct | Flag: use DCT as final step or not | 0 |
| -fe:rt:featureTransform | Kaldi feature transform file that normalizes stacked features for neural network inference | |
| -dec:wfst:acousticModelFName | Full path to the acoustic model file, e.g. open_vino_ir.xml | |
| -dec:wfst:acousticScaleFactor | The acoustic log likelihood scaling factor | 0.1f |
| -dec:wfst:beamWidth | Viterbi search beam width | 14.0f |
| -dec:wfst:latticeWidth | Lattice beam width (extends the beam width) | 0.0f |
| -dec:wfst:nbest | Number of n-best hypotheses to be generated | 1 |
| -dec:wfst:confidenceAcousticScaleFactor | Scaling parameter to factor in acoustic scores in confidence computations | 1.0f |
| -dec:wfst:confidenceLMScaleFactor | Scaling parameter to factor in language model in confidence computations | 1.0f |
| -dec:wfst:hmmModelFName | Full path to HMM model | |
| -dec:wfst:fsmFName | Full path to pronunciation model or full statically composed LM, if static composition is used | |
| -dec:wfstotf:gramFsmFName | Full path to grammar model | |
| -dec:wfst:outSymsFName | Full path to the output symbols (lexicon) filename | |
| -dec:wfst:tokenBufferSize | Token pool size expressed in number of DWORDs | 150000 |
| -dec:wfstotf:traceBackLogSize | Number of entries in traceback expressed as log2(N) | 19 |
| -dec:wfstotf:minStableFrames | The time, expressed in frames, after which the winning hypothesis is recognized as stable and the final result can be printed | 45 |
| -dec:wfst:maxCumulativeTokenSize | Maximum fill rate of token buffer before token beam is adjusted to keep token buffer fill constant, expressed as a factor of buffer size in the range (0.0, 1.0] | 0.2f |
| -dec:wfst:maxTokenBufferFill | Active token count that triggers beam tightening, expressed as a factor of buffer size | 0.6f |
| -dec:wfst:maxAvgTokenBufferFill | Average active token count for the utterance that triggers beam tightening when exceeded, expressed as a factor of buffer size | 1.0f |
| -dec:wfst:tokenBufferMinFill | Minimum fill rate of token buffer | 0.1f |
| -dec:wfst:pruningTighteningDelta | Beam tightening value when token pool usage reaches the pool capacity | 1.0f |
| -dec:wfst:pruningRelaxationDelta | Beam relaxation value when token pool is not meeting minimum fill ratio criterion | 0.5f |
| -dec:wfst:useScoreTrendForEndpointing | Extend end pointing with acoustic feedback | 1 |
| -dec:wfstotf:cacheLogSize | Number of entries in LM cache expressed as log2(N) | 16 |
| -eng:output:format | Format of the speech recognition output | text |
| -inference:contextLeft | IE: Additional stacking option, independent from feature extraction | 0 |
| -inference:contextRight | IE: Additional stacking option, independent from feature extraction | 0 |
| -inference:device | IE: Device used for neural computations | CPU |
| -inference:numThreads | IE: Number of threads used by GNA in SW mode | 1 |
| -inference:scaleFactor | IE: Scale factor used for static quantization | 3000.0 |
| -inference:quantizationBits | IE: Quantization resolution in bits | 16 or 8 |
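Several of the demo values above are easier to interpret once converted into concrete quantities. The sketch below does that arithmetic for the demo settings. It rests on two assumptions that the table does not confirm: the current frame is stacked together with its left and right context, and each frame contributes exactly numCeps values. All function names are hypothetical.

```python
# Demo values copied from the parameter table.
NUM_CEPS = 13
CONTEXT_LEFT = 5
CONTEXT_RIGHT = 5
TOKEN_BUFFER_SIZE = 150000   # expressed in DWORDs
MAX_TOKEN_BUFFER_FILL = 0.6
TOKEN_BUFFER_MIN_FILL = 0.1
TRACEBACK_LOG_SIZE = 19
CACHE_LOG_SIZE = 16

def stacked_input_size(num_ceps, context_left, context_right):
    # Assumption: past frames + current frame + future frames are stacked,
    # and every frame holds num_ceps coefficients.
    return (context_left + 1 + context_right) * num_ceps

def entries_from_log2(log_size):
    # Sizes given as log2(N) correspond to N = 2 ** log_size entries.
    return 1 << log_size

def fill_threshold(buffer_size, factor):
    # Thresholds given as a factor of buffer size, converted to an
    # absolute count in the same units as the buffer size.
    return round(buffer_size * factor)

print(stacked_input_size(NUM_CEPS, CONTEXT_LEFT, CONTEXT_RIGHT))   # 143
print(entries_from_log2(TRACEBACK_LOG_SIZE))                       # 524288
print(entries_from_log2(CACHE_LOG_SIZE))                           # 65536
print(fill_threshold(TOKEN_BUFFER_SIZE, MAX_TOKEN_BUFFER_FILL))    # 90000
print(fill_threshold(TOKEN_BUFFER_SIZE, TOKEN_BUFFER_MIN_FILL))    # 15000
```

Under these assumptions, the demo settings give a 143-element network input vector, a traceback of 524288 entries, an LM cache of 65536 entries, and beam tightening and relaxation thresholds at 90000 and 15000 respectively.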

Demo Output

The resulting transcription for the sample audio file is:

[ INFO ] Using feature transformation
[ INFO ] InferenceEngine API
[ INFO ] Device info:
[ INFO ] Batch size: 1
[ INFO ] Model loading time: 61.01 ms
Recognition result: