# Offline Speech Recognition Demo
This demo provides a command-line interface for automatic speech recognition using OpenVINO. Components used by this executable:

- `lspeech_s5_ext` model - Example pre-trained LibriSpeech DNN
- `speech_library.dll` (`.so`) - Open source speech recognition library that uses the OpenVINO Inference Engine, Intel Speech Feature Extraction and Intel Speech Decoder libraries
## How It Works
The application transcribes speech from a given WAV file and outputs the text to the console.
## Running
The application requires two command-line parameters, which point to an audio file with speech to transcribe and a configuration file describing the resources to use for transcription.
### Parameters for Executable
- `-wave` - Path to the input WAV file to process. The WAV file must be in the following format: RIFF WAVE PCM, 16-bit, 16 kHz, 1 channel, with header.
- `-c`, `--config` - Path to the configuration file with paths to resources and other parameters.
Example usage:

```sh
offline_speech_recognition_app.exe -wave="<path_to_audio>/inputAudio.wav" -c="<path_to_config>/configFile.cfg"
```
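The `-wave` input must match the format above. A quick sanity check is possible with Python's standard `wave` module (a minimal sketch, not part of the demo itself; `matches_demo_format` is a name chosen here for illustration):

```python
import wave

def matches_demo_format(path):
    """Check that a WAV file is PCM 16-bit, 16 kHz, mono,
    as required by the -wave input of this demo."""
    with wave.open(path, "rb") as w:
        return (w.getnchannels() == 1          # 1 channel (mono)
                and w.getsampwidth() == 2      # 2 bytes -> 16-bit samples
                and w.getframerate() == 16000) # 16 kHz sample rate
```

Note that `wave.open` raises `wave.Error` for non-RIFF or non-PCM inputs, so either a `False` return or an exception means the file needs conversion before use.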
## Configuration File Description
The configuration file is an ASCII text file where:

- a parameter name and its value are separated with the space character
- a parameter-value pair ends with the end-of-line character
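For illustration only, a file in this layout looks like the following; the parameter names shown here are placeholders, since the actual names depend on the demo release (see the table below for the parameters' meanings):

```
someParameterName 13
anotherParameterName 0.1f
pathParameterName <path_to_model>/model.file
```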
### Parameter Description

| Parameter | Description | Value used for demo |
|---|---|---|
| | Number of MFCC cepstrums | 13 |
| | Number of past frames that are stacked to form the input vector for neural network inference | 5 |
| | Number of future frames that are stacked to form the input vector for neural network inference | 5 |
| | High-pass filter beta coefficient, where 0.0f means no filtering | 0.0f |
| | Feature extraction input format description | INT16_16KHZ |
| | Lifting factor | 22.0f |
| | Flag: use DCT as the final step or not | 0 |
| | Kaldi feature transform file that normalizes stacked features for neural network inference | |
| | Full path to the acoustic model file | |
| | Acoustic log likelihood scaling factor | 0.1f |
| | Viterbi search beam width | 14.0f |
| | Lattice beam width (extends the beam width) | 0.0f |
| | Number of n-best hypotheses to be generated | 1 |
| | Scaling parameter to factor in acoustic scores in confidence computations | 1.0f |
| | Scaling parameter to factor in the language model in confidence computations | 1.0f |
| | Full path to the HMM model | |
| | Full path to the pronunciation model, or the full statically composed LM if static composition is used | |
| | Full path to the grammar model | |
| | Full path to the output symbols (lexicon) file | |
| | Token pool size, expressed in number of DWORDs | 150000 |
| | Number of entries in the traceback, expressed as log2(N) | 19 |
| | Time, expressed in frames, after which the winning hypothesis is recognized as stable and the final result can be printed | 45 |
| | Maximum fill rate of the token buffer before the token beam is adjusted to keep the token buffer fill constant; expressed as a factor of buffer size in (0.0, 1.0) | 0.2f |
| | Active token count that triggers beam tightening, expressed as a factor of buffer size | 0.6f |
| | Average active token count per utterance that triggers beam tightening when exceeded, expressed as a factor of buffer size | 1.0f |
| | Minimum fill rate of the token buffer | 0.1f |
| | Beam tightening value applied when token pool usage reaches the pool capacity | 1.0f |
| | Beam relaxation value applied when the token pool does not meet the minimum fill ratio criterion | 0.5f |
| | Extend end pointing with acoustic feedback | 1 |
| | Number of entries in the LM cache, expressed as log2(N) | 16 |
| | Format of the speech recognition output | text |
| | IE: Additional stacking option, independent of feature extraction | 0 |
| | IE: Additional stacking option, independent of feature extraction | 0 |
| | IE: Device used for neural computations | CPU |
| | IE: Number of threads used by GNA in SW mode | 1 |
| | IE: Scale factor used for static quantization | 3000.0 |
| | IE: Quantization resolution in bits | 16 or 8 |
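Because every line of the configuration file is simply a name and a value separated by a single space, such a file can be loaded with a few lines of code. A minimal sketch in Python (not part of the demo, which is a native executable; `parse_speech_config` is a name chosen here for illustration):

```python
def parse_speech_config(path):
    """Read a space-separated 'name value' configuration file
    into a dict mapping parameter names to string values."""
    params = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # Split on the first space only: name on the left,
            # the rest of the line is the value.
            name, _, value = line.partition(" ")
            params[name] = value
    return params
```

Values are kept as strings here; the demo itself interprets them as integers, floats (the `f` suffix), or paths depending on the parameter.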
## Demo Output
The resulting transcription for the sample audio file:

```
[ INFO ] Using feature transformation
[ INFO ] InferenceEngine API
[ INFO ] Device info:
[ INFO ] CPU: MKLDNNPlugin
[ INFO ] Batch size: 1
[ INFO ] Model loading time: 61.01 ms

Recognition result:
HOW ARE YOU DOING
```