Converting a Kaldi ASpIRE Chain Time Delay Neural Network (TDNN) Model
Warning
Note that OpenVINO support for Kaldi is currently being deprecated and will be removed entirely in the future.
To begin, download a pre-trained ASpIRE Chain Time Delay Neural Network (TDNN) model from the official Kaldi project website.
Converting an ASpIRE Chain TDNN Model to IR
Generate the Intermediate Representation of the model by running model conversion with the following parameters:
mo --input_model exp/chain/tdnn_7b/final.mdl --output output
The IR will have two inputs: input for data, and ivector for ivectors.
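To confirm that the converted model exposes both inputs, the IR .xml file can be inspected directly. The following is a minimal sketch, assuming the IR follows the usual layout in which each model input is a layer element with type="Parameter". The helper name list_ir_inputs and the trimmed inline sample are illustrative and are not part of OpenVINO.

```python
import xml.etree.ElementTree as ET


def list_ir_inputs(ir_xml_text):
    """Return the names of all Parameter (input) layers in an IR XML string."""
    root = ET.fromstring(ir_xml_text)
    # Assumption: model inputs are stored as <layer type="Parameter"> elements.
    return [layer.get("name") for layer in root.iter("layer")
            if layer.get("type") == "Parameter"]


# Hypothetical, heavily trimmed IR used only to demonstrate the helper.
SAMPLE_IR = """
<net name="final" version="11">
  <layers>
    <layer id="0" name="input" type="Parameter" version="opset1"/>
    <layer id="1" name="ivector" type="Parameter" version="opset1"/>
    <layer id="2" name="output" type="Result" version="opset1"/>
  </layers>
</net>
"""

print(list_ir_inputs(SAMPLE_IR))
```

For a real model, pass the contents of the generated .xml file to the helper instead of the sample string.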
Example: Running ASpIRE Chain TDNN Model with the Speech Recognition Sample
Note
Before you continue with this part of the article, get familiar with the Speech Recognition sample.
In this example, the input data contains one utterance from one speaker.
To run the ASpIRE Chain TDNN model with the Speech Recognition sample, first prepare the environment by following the steps below:
Download the Kaldi repository.
Build it by following the instructions in README.md from the repository.
Download the model archive from the Kaldi website.
Extract the downloaded model archive to the egs/aspire/s5 folder of the Kaldi repository.
Once everything has been prepared, you can start a proper run:
Prepare the model for decoding. Refer to the README.txt file from the downloaded model archive for instructions.
Convert data and ivectors to .ark format. Refer to the corresponding sections below for instructions.
Preparing Data
If you have a .wav data file, convert it to the .ark format using the following command:
<path_to_kaldi_repo>/src/featbin/compute-mfcc-feats --config=<path_to_kaldi_repo>/egs/aspire/s5/conf/mfcc_hires.conf scp:./wav.scp ark,scp:feats.ark,feats.scp
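The command reads the list of recordings from wav.scp, a Kaldi script file in which each line pairs an utterance ID with the path to a .wav file. A minimal sketch for generating such a file is shown below. Using the file name without its extension as the utterance ID is an assumption made here for illustration, and build_wav_scp is a hypothetical helper.

```python
from pathlib import Path


def build_wav_scp(wav_paths):
    """Return wav.scp content: one '<utterance_id> <absolute_wav_path>' line per file."""
    lines = []
    # Sort by utterance ID so the output order is stable.
    for wav in sorted((Path(p) for p in wav_paths), key=lambda p: p.stem):
        # Assumption: the utterance ID is the file name without its extension.
        lines.append(f"{wav.stem} {wav.resolve()}")
    return "\n".join(lines) + "\n"


# Hypothetical file name; write the result to ./wav.scp before running the command.
print(build_wav_scp(["utterance_1.wav"]), end="")
```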
Add the feats.ark absolute path to feats.scp to avoid errors in later commands.
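Each line of the feats.scp file written by compute-mfcc-feats maps an utterance ID to a location of the form <archive path>:<byte offset>. One way to make the archive path absolute is sketched below. The helper name absolutize_scp and the sample entry are illustrative assumptions.

```python
import os


def absolutize_scp(scp_text, base_dir):
    """Rewrite relative archive paths in feats.scp content as absolute paths."""
    base_dir = os.path.abspath(base_dir)
    out = []
    for line in scp_text.splitlines():
        line = line.strip()
        if not line:
            continue
        utt_id, location = line.split(maxsplit=1)
        # Split off the trailing byte offset (":<digits>") if there is one.
        path, sep, offset = location.rpartition(":")
        if not (sep and offset.isdigit()):
            path, sep, offset = location, "", ""
        # os.path.join leaves a path that is already absolute unchanged.
        path = os.path.normpath(os.path.join(base_dir, path))
        out.append(f"{utt_id} {path}{sep}{offset}")
    return "\n".join(out) + "\n"


# Made-up entry; in practice read feats.scp, rewrite it, and save it back.
print(absolutize_scp("utt1 feats.ark:5\n", os.getcwd()), end="")
```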
Preparing Ivectors
Prepare ivectors for the Speech Recognition sample:
Copy the feats.scp file to the egs/aspire/s5/ directory of the built Kaldi repository and navigate there:
cp feats.scp <path_to_kaldi_repo>/egs/aspire/s5/
cd <path_to_kaldi_repo>/egs/aspire/s5/
Extract ivectors from the data:
./steps/online/nnet2/extract_ivectors_online.sh --nj 1 --ivector_period <max_frame_count_in_utterance> <data folder> exp/tdnn_7b_chain_online/ivector_extractor <ivector folder>
You can simplify the preparation of ivectors for the Speech Recognition sample by passing the maximum number of frames in the utterances as the value of --ivector_period, which yields only one ivector per utterance.
To get the maximum number of frames in the utterances, use the following command line:
../../../src/featbin/feat-to-len scp:feats.scp ark,t: | cut -d' ' -f 2 - | sort -rn | head -1
As a result, you will find the ivector_online.1.ark file in <ivector folder>.
Go to the <ivector folder>:
cd <ivector folder>
Convert the ivector_online.1.ark file to text format, using the copy-feats tool. Run the following command:
<path_to_kaldi_repo>/src/featbin/copy-feats --binary=False ark:ivector_online.1.ark ark,t:ivector_online.1.ark.txt
For the Speech Recognition sample, the .ark file must contain an ivector for each frame. Copy the ivector frame_count times by running the script below in the Python command prompt:
import subprocess

# Write "<utterance_id> <frame_count>" for the utterance to feats_length.txt.
subprocess.run(["<path_to_kaldi_repo>/src/featbin/feat-to-len", "scp:<path_to_kaldi_repo>/egs/aspire/s5/feats.scp", "ark,t:feats_length.txt"], check=True)

with open("ivector_online.1.ark.txt", "r") as f, \
     open("ivector_online_ie.ark.txt", "w") as g, \
     open("feats_length.txt", "r") as length_file:
    for line in f:
        if "[" in line:
            # Header line ("<utterance_id>  ["): copy it and read the frame count.
            g.write(line)
            frame_count = int(length_file.read().split(" ")[1])
        else:
            # Ivector line: drop the closing bracket and repeat it once per frame.
            line = line.replace("]", " ")
            for i in range(frame_count):
                g.write(line)
    # Close the matrix after the last copied row.
    g.write("]")
Create an .ark file from .txt:
<path_to_kaldi_repo>/src/featbin/copy-feats --binary=True ark,t:ivector_online_ie.ark.txt ark:ivector_online_ie.ark
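Two small checks can help with the steps above: computing the value for --ivector_period from the text output of feat-to-len, and confirming that the expanded text archive holds one ivector row per frame before it is converted to binary. The following is a minimal sketch, assuming feat-to-len prints one "<utterance_id> <frame_count>" pair per line and that the text archive uses the bracketed matrix layout written by copy-feats. The helper names and sample values are illustrative.

```python
def max_frame_count(feat_to_len_text):
    """Return the largest frame count from '<utterance_id> <frame_count>' lines."""
    counts = [int(line.split()[1])
              for line in feat_to_len_text.splitlines() if line.strip()]
    if not counts:
        raise ValueError("no utterance lengths found")
    return max(counts)


def count_ark_rows(ark_text):
    """Count matrix rows per utterance in a text-format .ark string."""
    rows = {}
    current = None
    for line in ark_text.splitlines():
        if "[" in line:
            # Header line: '<utterance_id>  [' opens a new matrix.
            current = line.split()[0]
            rows[current] = 0
            line = line.split("[", 1)[1]
        if current is None:
            continue
        if line.replace("]", " ").strip():
            rows[current] += 1  # the line carries numeric values
        if "]" in line:
            current = None  # a closing bracket ends the matrix
    return rows


# Made-up sample data, not taken from a real model run.
lengths = "utt1 3\n"
expanded = "utt1  [\n  0.1 0.2  \n  0.1 0.2  \n  0.1 0.2  \n]"
frames = max_frame_count(lengths)
print(frames, count_ark_rows(expanded))
```

If the row count for an utterance differs from its frame count, the expanded file was most likely produced with the wrong length file.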
Running the Speech Recognition Sample
Run the Speech Recognition sample with the created ivector .ark file:
speech_sample -i feats.ark,ivector_online_ie.ark -m final.xml -d CPU -o prediction.ark -cw_l 17 -cw_r 12
Results can be decoded as described in “Use of Sample in Kaldi Speech Recognition Pipeline” in the Speech Recognition Sample description article.