This demo provides a command line interface for automatic speech recognition using OpenVINO™. Components used by this executable:
The application transcribes speech from a given WAV file and outputs the text to the console.
The application requires two command line parameters: one points to the audio file containing the speech to be transcribed, and the other points to the configuration file describing the resources to be used for transcription.

- `-wave` - Filepath to the input WAV file to be processed. The WAV file needs to be in the following format: RIFF WAVE PCM 16bit, 16kHz, 1 channel, with header.
- `-c`, `--config` - Filepath to the configuration file containing paths to resources and other parameters.

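
Because the demo accepts only one WAV layout, it can help to check a file before passing it to the application. The following is a minimal sketch using Python's standard `wave` module; it is not part of the demo, and the helper name `check_wav_format` is hypothetical.

```python
import wave


def check_wav_format(path):
    """Return a list of format problems; an empty list means the file
    matches the layout the demo expects (PCM, 16 bit, 16 kHz, mono)."""
    try:
        # wave.open rejects files without a RIFF/WAVE header
        # and files that do not contain PCM data.
        with wave.open(path, "rb") as wav:
            channels = wav.getnchannels()
            sample_width = wav.getsampwidth()  # bytes per sample
            sample_rate = wav.getframerate()   # samples per second
    except (wave.Error, EOFError) as exc:
        return [f"not a readable PCM WAV file: {exc}"]

    problems = []
    if sample_width != 2:
        problems.append(f"expected 16 bit samples, got {sample_width * 8} bit")
    if sample_rate != 16000:
        problems.append(f"expected 16000 Hz sample rate, got {sample_rate} Hz")
    if channels != 1:
        problems.append(f"expected 1 channel, got {channels}")
    return problems
```

Calling `check_wav_format("speech.wav")` on a conforming file returns an empty list; any other result lists what would need to be converted first.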
Example usage:
The configuration file is a simple ASCII text file where:
Parameter | Description | Value used for demo |
---|---|---|
-fe:rt:numCeps | Number of MFCC cepstrums | 13 |
-fe:rt:contextLeft | Number of past frames that will be stacked to form the input vector for neural network inference | 5 |
-fe:rt:contextRight | Number of future frames that will be stacked to form the input vector for neural network inference | 5 |
-fe:rt:hpfBeta | High pass filter beta coefficient, where 0.0f means no filtering | 0.0f |
-fe:rt:inputDataType | Feature extraction input format description | INT16_16KHZ |
-fe:rt:cepstralLifter | Lifting factor | 22.0f |
-fe:rt:noDct | Flag: use DCT as final step or not | 0 |
-fe:rt:featureTransform | Kaldi feature transform file that normalizes stacked features for neural network inference | |
-dec:wfst:acousticModelFName | Full path to the acoustic model file e.g. open_vino_ir.xml | |
-dec:wfst:acousticScaleFactor | The acoustic log likelihood scaling factor | 0.1f |
-dec:wfst:beamWidth | Viterbi search beam width | 14.0f |
-dec:wfst:latticeWidth | Lattice beam width (extends the beam width) | 0.0f |
-dec:wfst:nbest | Number of n-best hypotheses to be generated | 1 |
-dec:wfst:confidenceAcousticScaleFactor | Scaling parameter to factor in acoustic scores in confidence computations | 1.0f |
-dec:wfst:confidenceLMScaleFactor | Scaling parameter to factor in language model in confidence computations | 1.0f |
-dec:wfst:hmmModelFName | Full path to HMM model | |
-dec:wfst:fsmFName | Full path to pronunciation model or full statically composed LM, if static composition is used | |
-dec:wfstotf:gramFsmFName | Full path to grammar model | |
-dec:wfst:outSymsFName | Full path to the output symbols (lexicon) filename | |
-dec:wfst:tokenBufferSize | Token pool size expressed in number of DWORDs | 150000 |
-dec:wfstotf:traceBackLogSize | Number of entries in traceback expressed as log2(N) | 19 |
-dec:wfstotf:minStableFrames | The time, expressed in frames, after which the winning hypothesis is recognized as stable and the final result can be printed | 45 |
-dec:wfst:maxCumulativeTokenSize | Maximum fill rate of token buffer before token beam is adjusted to keep token buffer fill constant. Expressed as factor of buffer size in the range (0.0, 1.0] | 0.2f |
-dec:wfst:maxTokenBufferFill | Active token count number triggering beam tightening expressed as factor of buffer size | 0.6f |
-dec:wfst:maxAvgTokenBufferFill | Average active token count for an utterance, which triggers beam tightening when exceeded. Expressed as factor of buffer size | 1.0f |
-dec:wfst:tokenBufferMinFill | Minimum fill rate of token buffer | 0.1f |
-dec:wfst:pruningTighteningDelta | Beam tightening value when token pool usage reaches the pool capacity | 1.0f |
-dec:wfst:pruningRelaxationDelta | Beam relaxation value when token pool is not meeting minimum fill ratio criterion | 0.5f |
-dec:wfst:useScoreTrendForEndpointing | Extend end pointing with acoustic feedback | 1 |
-dec:wfstotf:cacheLogSize | Number of entries in LM cache expressed as log2(N) | 16 |
-eng:output:format | Format of the speech recognition output | text |
-inference:contextLeft | IE: Additional stacking option - independent from feature extraction | 0 |
-inference:contextRight | IE: Additional stacking option - independent from feature extraction | 0 |
-inference:device | IE: Device used for neural computations | CPU |
-inference:numThreads | IE: Number of threads used by GNA in SW mode | 1 |
-inference:scaleFactor | IE: Scale factor used for static quantization | 3000.0 |
-inference:quantizationBits | IE: Quantization resolution in bits | 16 or 8 |
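
To illustrate what the `-fe:rt:contextLeft` and `-fe:rt:contextRight` parameters describe, the sketch below stacks each frame with its neighbours to form one input vector. With the demo values (13 cepstra, 5 past and 5 future frames) each vector has (5 + 1 + 5) × 13 = 143 elements. This is an illustration only, not the demo's implementation: the function name `stack_frames` is hypothetical, and repeating the first and last frame at the utterance edges is an assumption.

```python
def stack_frames(frames, context_left=5, context_right=5):
    """Stack each frame with its left and right context.

    frames: list of per-frame feature vectors (e.g. 13 MFCC values each).
    Returns one stacked vector per input frame, ordered from the oldest
    context frame to the newest.
    """
    if not frames:
        return []

    last = len(frames) - 1
    stacked = []
    for t in range(len(frames)):
        vector = []
        for offset in range(-context_left, context_right + 1):
            # Assumption: clamp to the first/last frame at the edges,
            # i.e. pad by repeating the boundary frame.
            idx = min(max(t + offset, 0), last)
            vector.extend(frames[idx])
        stacked.append(vector)
    return stacked
```

The number of output vectors equals the number of input frames; only the length of each vector grows with the context size.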
The resulting transcription for the sample audio file is: