[DEPRECATED] Automatic Speech Recognition Sample

Note

This sample is now deprecated and will be removed with OpenVINO 2024.0. The sample was mainly designed to demonstrate the features of the GNA plugin and the use of models produced by the Kaldi framework. OpenVINO support for these components is now deprecated and will be discontinued, making the sample redundant.

This sample demonstrates how to run synchronous inference of an acoustic model based on Kaldi neural models and speech feature vectors.

The sample works with Kaldi ARK or uncompressed NumPy NPZ files, so it does not cover an end-to-end speech recognition scenario (speech to text). Such a scenario requires additional preprocessing (feature extraction) to get a feature vector from a speech signal, as well as postprocessing (decoding) to produce text from scores. Before using the sample, refer to the following requirements:

  • The sample accepts any file format supported by core.read_model.

  • The sample has been validated with an acoustic model based on Kaldi neural models (see the Model Preparation section).

  • To build the sample, follow the instructions in the Build the Sample Applications section of the “Get Started with Samples” guide.
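
As a rough illustration of the NumPy input format mentioned above, the sketch below writes two made-up utterances to an uncompressed NPZ archive and reads them back. It assumes one 2D array per utterance, shaped (frames, features) and keyed by an utterance name. The names `utt_0` and `utt_1`, the feature size of 4, and the in-memory buffer are illustrative choices, not something the sample prescribes.

```python
from io import BytesIO

import numpy as np

# Hypothetical utterances: one 2D float32 array per utterance, (frames, features).
utterances = {
    'utt_0': np.zeros((3, 4), dtype=np.float32),
    'utt_1': np.ones((5, 4), dtype=np.float32),
}

# np.savez writes an uncompressed archive (np.savez_compressed would compress it).
buffer = BytesIO()  # stands in for a file on disk
np.savez(buffer, **utterances)
buffer.seek(0)

# Read the archive back into a plain dict of name -> matrix.
with np.load(buffer) as archive:
    loaded = {name: archive[name] for name in archive.files}

for name, matrix in loaded.items():
    print(f'{name}: {matrix.shape[0]} frames, {matrix.shape[1]} features per frame')
```

Replacing the buffer with a file path such as `features.npz` (a hypothetical name) would produce a file on disk with the same layout.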

How It Works

At startup, the sample application reads command-line parameters, loads the specified model and input data to the OpenVINO™ Runtime plugin, and performs synchronous inference on all speech utterances stored in the input file, logging each step to the standard output stream.
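
The core of the per-utterance loop can be sketched without OpenVINO. The example below is a simplified illustration, not the sample's code: it feeds an utterance to a stand-in inference function in fixed-size batches and pads the last batch with zero rows, ignoring context windows. The names `infer_in_batches` and `fake_infer` are hypothetical.

```python
import numpy as np


def infer_in_batches(frames, batch_size, infer):
    """Run `infer` over `frames` (num_frames x features) in fixed-size batches."""
    num_frames = frames.shape[0]
    scores = []
    for start in range(0, num_frames, batch_size):
        batch = frames[start:start + batch_size]
        valid = batch.shape[0]
        # Pad the last batch with zero rows so every call sees the same shape.
        batch = np.pad(batch, [(0, batch_size - valid), (0, 0)])
        # Keep only the rows that correspond to real frames.
        scores.append(infer(batch)[:valid])
    return np.concatenate(scores)


def fake_infer(batch):
    # Stand-in for a real inference call: doubles every feature.
    return batch * 2.0


utterance = np.arange(10, dtype=np.float32).reshape(5, 2)  # 5 frames, 2 features
result = infer_in_batches(utterance, batch_size=2, infer=fake_infer)
print(result.shape)  # (5, 2)
```

In a real run, the stand-in function would be replaced by a call on an inference request, and the padded rows would be discarded in the same way.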

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Copyright (C) 2018-2023 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

import sys
from io import BytesIO
from timeit import default_timer
from typing import Dict

import numpy as np
import openvino as ov

from arg_parser import parse_args
from file_options import read_utterance_file, write_utterance_file
from utils import (GNA_ATOM_FREQUENCY, GNA_CORE_FREQUENCY,
                   calculate_scale_factor, compare_with_reference,
                   get_input_layouts, get_sorted_scale_factors, log,
                   set_scale_factors)


def do_inference(data: Dict[str, np.ndarray], infer_request: ov.InferRequest, cw_l: int = 0, cw_r: int = 0) -> Dict[str, np.ndarray]:
    """Do a synchronous matrix inference."""
    frames_to_infer = {}
    result = {}

    batch_size = infer_request.model_inputs[0].shape[0]
    num_of_frames = next(iter(data.values())).shape[0]

    for output in infer_request.model_outputs:
        result[output.any_name] = np.ndarray((num_of_frames, np.prod(tuple(output.shape)[1:])))

    for i in range(-cw_l, num_of_frames + cw_r, batch_size):
        if i < 0:
            index = 0
        elif i >= num_of_frames:
            index = num_of_frames - 1
        else:
            index = i

        for _input in infer_request.model_inputs:
            frames_to_infer[_input.any_name] = data[_input.any_name][index:index + batch_size]
            num_of_frames_to_infer = len(frames_to_infer[_input.any_name])

            # Add [batch_size - num_of_frames_to_infer] zero rows to 2d numpy array
            # Used to infer fewer frames than the batch size
            frames_to_infer[_input.any_name] = np.pad(
                frames_to_infer[_input.any_name],
                [(0, batch_size - num_of_frames_to_infer), (0, 0)],
            )

            frames_to_infer[_input.any_name] = frames_to_infer[_input.any_name].reshape(_input.tensor.shape)

        frame_results = infer_request.infer(frames_to_infer)

        if i - cw_r < 0:
            continue

        for output in frame_results.keys():
            vector_result = frame_results[output].reshape((batch_size, result[output.any_name].shape[1]))
            result[output.any_name][i - cw_r:i - cw_r + batch_size] = vector_result[:num_of_frames_to_infer]

    return result


def main():
    args = parse_args()

# --------------------------- Step 1. Initialize OpenVINO Runtime Core ------------------------------------------------
    log.info('Creating OpenVINO Runtime Core')
    core = ov.Core()

# --------------------------- Step 2. Read a model --------------------------------------------------------------------
    if args.model:
        log.info(f'Reading the model: {args.model}')
        # (.xml and .bin files) or (.onnx file)
        model = core.read_model(args.model)

# --------------------------- Step 3. Apply preprocessing -------------------------------------------------------------
        model.add_outputs(args.output[0] + args.reference[0])

        if args.layout:
            layouts = get_input_layouts(args.layout, model.inputs)

        ppp = ov.preprocess.PrePostProcessor(model)

        for i in range(len(model.inputs)):
            ppp.input(i).tensor().set_element_type(ov.Type.f32)

            input_name = model.input(i).get_any_name()

            if args.layout and input_name in layouts.keys():
                ppp.input(i).tensor().set_layout(ov.Layout(layouts[input_name]))
                ppp.input(i).model().set_layout(ov.Layout(layouts[input_name]))

        for i in range(len(model.outputs)):
            ppp.output(i).tensor().set_element_type(ov.Type.f32)

        model = ppp.build()

        if args.batch_size:
            batch_size = args.batch_size if args.context_window_left == args.context_window_right == 0 else 1

            if any((not _input.node.layout.empty for _input in model.inputs)):
                ov.set_batch(model, batch_size)
            else:
                log.warning('Layout is not set for any input, so custom batch size is not set')

# --------------------------- Step 4. Configure plugin ----------------------------------------------------------------
    devices = args.device.replace('HETERO:', '').split(',')
    plugin_config = {}

    if 'GNA' in args.device:
        gna_device_mode = devices[0] if '_' in devices[0] else 'GNA_AUTO'
        devices[0] = 'GNA'

        plugin_config['GNA_DEVICE_MODE'] = gna_device_mode
        plugin_config['GNA_PRECISION'] = f'I{args.quantization_bits}'
        plugin_config['GNA_EXEC_TARGET'] = args.exec_target
        plugin_config['GNA_PWL_MAX_ERROR_PERCENT'] = str(args.pwl_me)

        # Set a GNA scale factor
        if args.import_gna_model:
            if args.scale_factor[1]:
                log.error(f'Custom scale factor can not be set for imported gna model: {args.import_gna_model}')
                return 1
            else:
                log.info(f'Using scale factor from provided imported gna model: {args.import_gna_model}')
        else:
            if args.scale_factor[1]:
                scale_factors = get_sorted_scale_factors(args.scale_factor, model.inputs)
            else:
                scale_factors = []

                for file_name in args.input[1]:
                    _, utterances = read_utterance_file(file_name)
                    scale_factor = calculate_scale_factor(utterances[0])
                    log.info('Using scale factor(s) calculated from first utterance')
                    scale_factors.append(str(scale_factor))

            set_scale_factors(plugin_config, scale_factors, model.inputs)

        if args.export_embedded_gna_model:
            plugin_config['GNA_FIRMWARE_MODEL_IMAGE'] = args.export_embedded_gna_model
            plugin_config['GNA_FIRMWARE_MODEL_IMAGE_GENERATION'] = args.embedded_gna_configuration

        if args.performance_counter:
            plugin_config['PERF_COUNT'] = 'YES'

    device_str = f'HETERO:{",".join(devices)}' if 'HETERO' in args.device else devices[0]

# --------------------------- Step 5. Loading model to the device -----------------------------------------------------
    log.info('Loading the model to the plugin')
    if args.model:
        compiled_model = core.compile_model(model, device_str, plugin_config)
    else:
        with open(args.import_gna_model, 'rb') as f:
            buf = BytesIO(f.read())
            compiled_model = core.import_model(buf, device_str, plugin_config)

# --------------------------- Exporting GNA model using OpenVINO API --------------------------------------------------
    if args.export_gna_model:
        log.info(f'Writing GNA Model to {args.export_gna_model}')
        user_stream = compiled_model.export_model()
        with open(args.export_gna_model, 'wb') as f:
            f.write(user_stream)
        return 0

    if args.export_embedded_gna_model:
        log.info(f'Exported GNA embedded model to file {args.export_embedded_gna_model}')
        log.info(f'GNA embedded model export done for GNA generation {args.embedded_gna_configuration}')
        return 0

# --------------------------- Step 6. Set up input --------------------------------------------------------------------
    input_layer_names = args.input[0] if args.input[0] else [_input.any_name for _input in compiled_model.inputs]
    input_file_names = args.input[1]

    if len(input_layer_names) != len(input_file_names):
        log.error(f'Number of model inputs ({len(input_layer_names)}) is not equal '
                  f'to number of input files ({len(input_file_names)})')
        return 3

    input_file_data = [read_utterance_file(file_name) for file_name in input_file_names]

    infer_data = [
        {
            input_layer_names[j]: input_file_data[j].utterances[i]
            for j in range(len(input_file_data))
        }
        for i in range(len(input_file_data[0].utterances))
    ]

    output_layer_names = args.output[0] if args.output[0] else [compiled_model.outputs[0].any_name]
    output_file_names = args.output[1]

    reference_layer_names = args.reference[0] if args.reference[0] else [compiled_model.outputs[0].any_name]
    reference_file_names = args.reference[1]

    reference_file_data = [read_utterance_file(file_name) for file_name in reference_file_names]

    references = [
        {
            reference_layer_names[j]: reference_file_data[j].utterances[i]
            for j in range(len(reference_file_data))
        }
        for i in range(len(input_file_data[0].utterances))
    ]

# --------------------------- Step 7. Create infer request ------------------------------------------------------------
    infer_request = compiled_model.create_infer_request()

# --------------------------- Step 8. Do inference --------------------------------------------------------------------
    log.info('Starting inference in synchronous mode')
    results = []
    total_infer_time = 0

    for i in range(len(infer_data)):
        start_infer_time = default_timer()

        # Reset states between utterance inferences to remove a memory impact
        infer_request.reset_state()

        results.append(do_inference(
            infer_data[i],
            infer_request,
            args.context_window_left,
            args.context_window_right,
        ))

        infer_time = default_timer() - start_infer_time
        total_infer_time += infer_time
        num_of_frames = infer_data[i][input_layer_names[0]].shape[0]
        avg_infer_time_per_frame = infer_time / num_of_frames

# --------------------------- Step 9. Process output ------------------------------------------------------------------
        log.info('')
        log.info(f'Utterance {i}:')
        log.info(f'Total time in Infer (HW and SW): {infer_time * 1000:.2f}ms')
        log.info(f'Frames in utterance: {num_of_frames}')
        log.info(f'Average Infer time per frame: {avg_infer_time_per_frame * 1000:.2f}ms')

        for name in set(reference_layer_names + output_layer_names):
            log.info('')
            log.info(f'Output layer name: {name}')
            log.info(f'Number scores per frame: {results[i][name].shape[1]}')

            if name in references[i].keys():
                log.info('')
                compare_with_reference(results[i][name], references[i][name])

        if args.performance_counter:
            if 'GNA' in args.device:
                total_cycles = infer_request.profiling_info[0].real_time.total_seconds()
                stall_cycles = infer_request.profiling_info[1].real_time.total_seconds()
                active_cycles = total_cycles - stall_cycles
                frequency = 10**6
                if args.arch == 'CORE':
                    frequency *= GNA_CORE_FREQUENCY
                else:
                    frequency *= GNA_ATOM_FREQUENCY
                total_inference_time = total_cycles / frequency
                active_time = active_cycles / frequency
                stall_time = stall_cycles / frequency
                log.info('')
                log.info('Performance Statistics of GNA Hardware')
                log.info(f'   Total Inference Time: {(total_inference_time * 1000):.4f} ms')
                log.info(f'   Active Time: {(active_time * 1000):.4f} ms')
                log.info(f'   Stall Time:  {(stall_time * 1000):.4f} ms')

    log.info('')
    log.info(f'Total sample time: {total_infer_time * 1000:.2f}ms')

    for i in range(len(output_file_names)):
        log.info(f'Saving results from "{output_layer_names[i]}" layer to {output_file_names[i]}')
        data = [results[j][output_layer_names[i]] for j in range(len(input_file_data[0].utterances))]
        write_utterance_file(output_file_names[i], input_file_data[0].keys, data)

# ----------------------------------------------------------------------------------------------------------------------
    log.info('This sample is an API example, '
             'for any performance measurements please use the dedicated benchmark_app tool\n')
    return 0


if __name__ == '__main__':
    sys.exit(main())
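
The device handling in Step 4 of the Python listing above can be isolated into a small function for experimentation. This is a sketch that mirrors the string operations shown in the listing; `split_device` is a hypothetical helper name and is not part of the sample.

```python
def split_device(device):
    """Return (device string for the runtime, GNA device mode or None)."""
    devices = device.replace('HETERO:', '').split(',')
    gna_mode = None
    if 'GNA' in device:
        # 'GNA_SW_EXACT' carries a mode; a bare 'GNA' falls back to 'GNA_AUTO'.
        gna_mode = devices[0] if '_' in devices[0] else 'GNA_AUTO'
        devices[0] = 'GNA'
    if 'HETERO' in device:
        return f'HETERO:{",".join(devices)}', gna_mode
    return devices[0], gna_mode


for value in ('CPU', 'GNA_SW_EXACT', 'HETERO:GNA,CPU'):
    print(value, '->', split_device(value))
```

As in the listing, the mode suffix is read only from the first device in the list, so the GNA device is expected to come first in a HETERO configuration.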
// Copyright (C) 2018-2023 Intel Corporation
// SPDX-License-Identifier: Apache-2.0
//
#include <time.h>

#include <chrono>
#include <fstream>
#include <functional>
#include <iomanip>
#include <iostream>
#include <limits>
#include <map>
#include <memory>
#include <random>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// clang-format off
#include <openvino/openvino.hpp>
#include <openvino/runtime/intel_gna/properties.hpp>

#include <samples/args_helper.hpp>
#include <samples/slog.hpp>

#include "fileutils.hpp"
#include "speech_sample.hpp"
#include "utils.hpp"
// clang-format on

using namespace ov::preprocess;

/**
 * @brief The entry point for OpenVINO Runtime automatic speech recognition sample
 * @file speech_sample/main.cpp
 * @example speech_sample/main.cpp
 */
int main(int argc, char* argv[]) {
    try {
        // ------------------------------ Get OpenVINO Runtime version ----------------------------------------------
        slog::info << "OpenVINO runtime: " << ov::get_openvino_version() << slog::endl;

        // ------------------------------ Parsing and validation of input arguments ---------------------------------
        if (!parse_and_check_command_line(argc, argv)) {
            return 0;
        }
        BaseFile* file;
        BaseFile* fileOutput;
        ArkFile arkFile;
        NumpyFile numpyFile;
        std::pair<std::string, std::vector<std::string>> input_data;
        if (!FLAGS_i.empty())
            input_data = parse_parameters(FLAGS_i);
        auto extInputFile = fileExt(input_data.first);
        if (extInputFile == "ark") {
            file = &arkFile;
        } else if (extInputFile == "npz") {
            file = &numpyFile;
        } else {
            throw std::logic_error("Invalid input file");
        }
        std::vector<std::string> inputFiles;
        std::vector<uint32_t> numBytesThisUtterance;
        uint32_t numUtterances(0);
        if (!input_data.first.empty()) {
            std::string outStr;
            std::istringstream stream(input_data.first);
            uint32_t currentNumUtterances(0), currentNumBytesThisUtterance(0);
            while (getline(stream, outStr, ',')) {
                std::string filename(fileNameNoExt(outStr) + "." + extInputFile);
                inputFiles.push_back(filename);
                file->get_file_info(filename.c_str(), 0, &currentNumUtterances, &currentNumBytesThisUtterance);
                if (numUtterances == 0) {
                    numUtterances = currentNumUtterances;
                } else if (currentNumUtterances != numUtterances) {
                    throw std::logic_error(
                        "Incorrect input files. Number of utterances must be the same for all input files");
                }
                numBytesThisUtterance.push_back(currentNumBytesThisUtterance);
            }
        }
        size_t numInputFiles(inputFiles.size());

        // --------------------------- Step 1. Initialize OpenVINO Runtime core and read model
        // -------------------------------------
        ov::Core core;
        try {
            const auto& gnaLibraryVersion = core.get_property("GNA", ov::intel_gna::library_full_version);
            slog::info << "Detected GNA Library: " << gnaLibraryVersion << slog::endl;
        } catch (std::exception& e) {
            slog::info << "Cannot detect GNA Library version, exception: " << e.what() << slog::endl;
        }
        slog::info << "Loading model files:" << slog::endl << FLAGS_m << slog::endl;
        uint32_t batchSize = (FLAGS_cw_r > 0 || FLAGS_cw_l > 0 || !FLAGS_bs) ? 1 : (uint32_t)FLAGS_bs;
        std::shared_ptr<ov::Model> model;
        // --------------------------- Processing custom outputs ---------------------------------------------
        const auto output_data = parse_parameters(FLAGS_o);
        const auto reference_data = parse_parameters(FLAGS_r);

        const auto outputs = get_first_non_empty(output_data.second, reference_data.second);

        // ------------------------------ Preprocessing ------------------------------------------------------
        // Preprocessing can be applied only to a model read from a file; it is not applicable to an imported
        // (already compiled) model
        if (!FLAGS_m.empty()) {
            const auto outputs_with_ports = parse_to_extract_port(outputs);
            model = core.read_model(FLAGS_m);
            for (const auto& output_with_port : outputs_with_ports) {
                auto output = model->add_output(output_with_port.first, output_with_port.second);
                output.set_names({output_with_port.first + ":" + std::to_string(output_with_port.second)});
            }
            check_number_of_inputs(model->inputs().size(), numInputFiles);
            ov::preprocess::PrePostProcessor proc(model);
            const auto& inputs = model->inputs();
            std::map<std::string, std::string> custom_layouts;
            if (!FLAGS_layout.empty()) {
                custom_layouts = parse_input_layouts(FLAGS_layout, inputs);
            }
            for (const auto& input : inputs) {
                const auto& item_name = input.get_any_name();
                auto& in = proc.input(item_name);
                in.tensor().set_element_type(ov::element::f32);
                // Explicitly set inputs layout
                if (custom_layouts.count(item_name) > 0) {
                    in.model().set_layout(ov::Layout(custom_layouts.at(item_name)));
                }
            }
            for (size_t i = 0; i < model->outputs().size(); i++) {
                proc.output(i).tensor().set_element_type(ov::element::f32);
            }
            model = proc.build();
            if (FLAGS_bs) {
                if (FLAGS_layout.empty() &&
                    std::any_of(inputs.begin(), inputs.end(), [](const ov::Output<ov::Node>& i) {
                        return ov::layout::get_layout(i).empty();
                    })) {
                    throw std::logic_error(
                        "-bs option is set to " + std::to_string(FLAGS_bs) +
                        " but model does not contain layout information for any input. Please "
                        "specify it explicitly using -layout option. For example, input1[NCHW], input2[NC] or [NC]");
                } else {
                    ov::set_batch(model, batchSize);
                }
            }
        }
        // ------------------------------ Get Available Devices ------------------------------------------------------
        auto isFeature = [&](const std::string xFeature) {
            return FLAGS_d.find(xFeature) != std::string::npos;
        };
        bool useGna = isFeature("GNA");
        bool useHetero = isFeature("HETERO");
        std::string deviceStr = useHetero && useGna ? "HETERO:GNA,CPU" : FLAGS_d.substr(0, (FLAGS_d.find("_")));
        // -----------------------------------------------------------------------------------------------------
        // --------------------------- Set parameters and scale factors -------------------------------------
        /** Setting parameter for per layer metrics **/
        ov::AnyMap gnaPluginConfig;
        ov::AnyMap genericPluginConfig;
        if (useGna) {
            std::string gnaDevice =
                useHetero ? FLAGS_d.substr(FLAGS_d.find("GNA"), FLAGS_d.find(",") - FLAGS_d.find("GNA")) : FLAGS_d;
            auto parse_gna_device = [&](const std::string& device) -> ov::intel_gna::ExecutionMode {
                ov::intel_gna::ExecutionMode mode;
                std::stringstream ss(device);
                ss >> mode;
                return mode;
            };
            gnaPluginConfig[ov::intel_gna::execution_mode.name()] = gnaDevice.find("_") == std::string::npos
                                                                        ? ov::intel_gna::ExecutionMode::AUTO
                                                                        : parse_gna_device(gnaDevice);
        }
        if (FLAGS_pc) {
            genericPluginConfig.emplace(ov::enable_profiling(true));
        }
        if (FLAGS_q.compare("user") == 0) {
            if (!FLAGS_rg.empty()) {
                std::string errMessage("Custom scale factor can not be set for imported gna model: " + FLAGS_rg);
                throw std::logic_error(errMessage);
            } else {
                auto scale_factors_per_input = parse_scale_factors(model->inputs(), FLAGS_sf);
                if (numInputFiles != scale_factors_per_input.size()) {
                    std::string errMessage("Incorrect command line for multiple inputs: " +
                                           std::to_string(scale_factors_per_input.size()) +
                                           " scale factors provided for " + std::to_string(numInputFiles) +
                                           " input files.");
                    throw std::logic_error(errMessage);
                }
                for (auto&& sf : scale_factors_per_input) {
                    slog::info << "For input " << sf.first << " using scale factor of " << sf.second << slog::endl;
                }
                gnaPluginConfig[ov::intel_gna::scale_factors_per_input.name()] = scale_factors_per_input;
            }
        } else {
            // "static" quantization with calculated scale factor
            if (!FLAGS_rg.empty()) {
                slog::info << "Using scale factor from provided imported gna model: " << FLAGS_rg << slog::endl;
            } else {
                std::map<std::string, float> scale_factors_per_input;
                for (size_t i = 0; i < numInputFiles; i++) {
                    auto inputFileName = inputFiles[i].c_str();
                    std::string name;
                    std::vector<uint8_t> ptrFeatures;
                    uint32_t numArrays(0), numBytes(0), numFrames(0), numFrameElements(0), numBytesPerElement(0);
                    file->get_file_info(inputFileName, 0, &numArrays, &numBytes);
                    ptrFeatures.resize(numBytes);
                    file->load_file(inputFileName,
                                    0,
                                    name,
                                    ptrFeatures,
                                    &numFrames,
                                    &numFrameElements,
                                    &numBytesPerElement);
                    auto floatScaleFactor = scale_factor_for_quantization(ptrFeatures.data(),
                                                                          MAX_VAL_2B_FEAT,
                                                                          numFrames * numFrameElements);
                    slog::info << "Using scale factor of " << floatScaleFactor << " calculated from first utterance."
                               << slog::endl;
                    scale_factors_per_input[strip_name(model->input(i).get_any_name())] = floatScaleFactor;
                }
                gnaPluginConfig[ov::intel_gna::scale_factors_per_input.name()] = scale_factors_per_input;
            }
        }
        gnaPluginConfig[ov::hint::inference_precision.name()] = (FLAGS_qb == 8) ? ov::element::i8 : ov::element::i16;
        const std::unordered_map<std::string, ov::intel_gna::HWGeneration> StringHWGenerationMap{
            {"GNA_TARGET_1_0", ov::intel_gna::HWGeneration::GNA_1_0},
            {"GNA_TARGET_1_0_E", ov::intel_gna::HWGeneration::GNA_1_0_E},
            {"GNA_TARGET_2_0", ov::intel_gna::HWGeneration::GNA_2_0},
            {"GNA_TARGET_3_0", ov::intel_gna::HWGeneration::GNA_3_0},
            {"GNA_TARGET_3_1", ov::intel_gna::HWGeneration::GNA_3_1},
            {"GNA_TARGET_3_5", ov::intel_gna::HWGeneration::GNA_3_5},
            {"GNA_TARGET_3_5_E", ov::intel_gna::HWGeneration::GNA_3_5_E},
            {"GNA_TARGET_3_6", ov::intel_gna::HWGeneration::GNA_3_6},
            {"GNA_TARGET_4_0", ov::intel_gna::HWGeneration::GNA_4_0}};
        auto parse_target = [&](const std::string& target) -> ov::intel_gna::HWGeneration {
            auto hw_target = ov::intel_gna::HWGeneration::UNDEFINED;
            const auto key_iter = StringHWGenerationMap.find(target);
            if (key_iter != StringHWGenerationMap.end()) {
                hw_target = key_iter->second;
            } else if (!target.empty()) {
                slog::warn << "Unsupported target: " << target << slog::endl;
            }
            return hw_target;
        };

        gnaPluginConfig[ov::intel_gna::execution_target.name()] = parse_target(FLAGS_exec_target);
        gnaPluginConfig[ov::intel_gna::compile_target.name()] = parse_target(FLAGS_compile_target);
        gnaPluginConfig[ov::intel_gna::memory_reuse.name()] = !FLAGS_memory_reuse_off;
        gnaPluginConfig[ov::intel_gna::pwl_max_error_percent.name()] = FLAGS_pwl_me;
        gnaPluginConfig[ov::log::level.name()] = FLAGS_log;
        // -----------------------------------------------------------------------------------------------------
        // --------------------------- Write model to file --------------------------------------------------
        // Embedded GNA model dumping (for Intel(R) Speech Enabling Developer Kit)
        if (!FLAGS_we.empty()) {
            gnaPluginConfig[ov::intel_gna::firmware_model_image_path.name()] = FLAGS_we;
        }
        // -----------------------------------------------------------------------------------------------------
        // --------------------------- Step 2. Loading model to the device ------------------------------------------
        if (useGna) {
            if (useHetero) {
                genericPluginConfig.insert(ov::device::properties("GNA", gnaPluginConfig));
            } else {
                genericPluginConfig.insert(std::begin(gnaPluginConfig), std::end(gnaPluginConfig));
            }
        }
        ov::CompiledModel executableNet;
        auto t0 = Time::now();
        if (!FLAGS_m.empty()) {
            slog::info << "Loading model to the device " << FLAGS_d << slog::endl;
            executableNet = core.compile_model(model, deviceStr, genericPluginConfig);
        } else {
            slog::info << "Importing model to the device" << slog::endl;
            std::ifstream streamrq(FLAGS_rg, std::ios_base::binary | std::ios_base::in);
            if (!streamrq.is_open()) {
                throw std::runtime_error("Cannot open model file " + FLAGS_rg);
            }
            executableNet = core.import_model(streamrq, deviceStr, genericPluginConfig);
            // loading batch from exported model
            const auto& imported_inputs = executableNet.inputs();
            if (std::any_of(imported_inputs.begin(), imported_inputs.end(), [](const ov::Output<const ov::Node>& i) {
                    return ov::layout::get_layout(i).empty();
                })) {
                slog::warn << "No batch dimension was found at any input, assuming batch to be 1." << slog::endl;
                batchSize = 1;
            } else {
                for (auto& info : imported_inputs) {
                    auto imported_layout = ov::layout::get_layout(info);
                    if (ov::layout::has_batch(imported_layout)) {
                        batchSize = (uint32_t)info.get_shape()[ov::layout::batch_idx(imported_layout)];
                        break;
                    }
                }
            }
        }
        // Measure the loading time only after the model has been compiled or imported
        ms loadTime = std::chrono::duration_cast<ms>(Time::now() - t0);
        slog::info << "Model loading time " << loadTime.count() << " ms" << slog::endl;
        // --------------------------- Exporting GNA model using OpenVINO API ---------------------
        if (!FLAGS_wg.empty()) {
            slog::info << "Writing GNA Model to file " << FLAGS_wg << slog::endl;
            t0 = Time::now();
            std::ofstream streamwq(FLAGS_wg, std::ios_base::binary | std::ios::out);
            executableNet.export_model(streamwq);
            ms exportTime = std::chrono::duration_cast<ms>(Time::now() - t0);
            slog::info << "Exporting time " << exportTime.count() << " ms" << slog::endl;
            return 0;
        }
        if (!FLAGS_we.empty()) {
            slog::info << "Exported GNA embedded model to file " << FLAGS_we << slog::endl;
            if (!FLAGS_compile_target.empty()) {
                slog::info << "GNA embedded model target: " << FLAGS_compile_target << slog::endl;
            }
            return 0;
        }
        // ---------------------------------------------------------------------------------------------------------
        // --------------------------- Step 3. Create infer request
        // --------------------------------------------------
        std::vector<InferRequestStruct> inferRequests(1);

        for (auto& inferRequest : inferRequests) {
            inferRequest = {executableNet.create_infer_request(), -1, batchSize};
        }
        // --------------------------- Step 4. Configure input & output
        // --------------------------------------------------
        std::vector<ov::Tensor> ptrInputBlobs;
        auto cInputInfo = executableNet.inputs();
        check_number_of_inputs(cInputInfo.size(), numInputFiles);
        if (!input_data.second.empty()) {
            std::vector<std::string> inputNameBlobs = input_data.second;
            if (inputNameBlobs.size() != cInputInfo.size()) {
                std::string errMessage(std::string("Number of network inputs ( ") + std::to_string(cInputInfo.size()) +
                                       " ) is not equal to the number of inputs entered in the -i argument ( " +
                                       std::to_string(inputNameBlobs.size()) + " ).");
                throw std::logic_error(errMessage);
            }
            for (const auto& input : inputNameBlobs) {
                ov::Tensor blob = inferRequests.begin()->inferRequest.get_tensor(input);
                if (!blob) {
                    std::string errMessage("No blob with name : " + input);
                    throw std::logic_error(errMessage);
                }
                ptrInputBlobs.push_back(blob);
            }
        } else {
            for (const auto& input : cInputInfo) {
                ptrInputBlobs.push_back(inferRequests.begin()->inferRequest.get_tensor(input));
            }
        }
        std::vector<std::string> output_name_files;
        std::vector<std::string> reference_name_files;
        size_t count_file = 1;
        if (!output_data.first.empty()) {
            output_name_files = convert_str_to_vector(output_data.first);
            if (output_name_files.size() != outputs.size() && outputs.size()) {
                throw std::logic_error("The number of output files is not equal to the number of network outputs.");
            }
            count_file = output_name_files.size();
            if (executableNet.outputs().size() > 1 && output_data.second.empty() && count_file == 1) {
                throw std::logic_error("-o is ambiguous: the model has multiple outputs but only one file provided "
                                       "without output name specification");
            }
        }
        if (!reference_data.first.empty()) {
            reference_name_files = convert_str_to_vector(reference_data.first);
            if (reference_name_files.size() != outputs.size() && outputs.size()) {
                throw std::logic_error("The number of reference files is not equal to the number of network outputs.");
            }
            count_file = reference_name_files.size();
            if (executableNet.outputs().size() > 1 && reference_data.second.empty() && count_file == 1) {
                throw std::logic_error("-r is ambiguous: the model has multiple outputs but only one file provided "
                                       "without output name specification");
            }
        }
        if (count_file > executableNet.outputs().size()) {
            throw std::logic_error(
                "The number of output/reference files is not equal to the number of network outputs.");
        }
        // -----------------------------------------------------------------------------------------------------
        // --------------------------- Step 5. Do inference --------------------------------------------------------
        std::vector<std::vector<uint8_t>> ptrUtterances;
        const auto effective_outputs_size = outputs.size() ? outputs.size() : executableNet.outputs().size();
        std::vector<std::vector<uint8_t>> vectorPtrScores(effective_outputs_size);
        std::vector<uint16_t> numScoresPerOutput(effective_outputs_size);

        std::vector<std::vector<uint8_t>> vectorPtrReferenceScores(reference_name_files.size());
        std::vector<ScoreErrorT> vectorFrameError(reference_name_files.size()),
            vectorTotalError(reference_name_files.size());
        ptrUtterances.resize(inputFiles.size());
        // initialize memory state before starting
        for (auto&& state : inferRequests.begin()->inferRequest.query_state()) {
            state.reset();
        }
        /** Work with each utterance **/
        for (uint32_t utteranceIndex = 0; utteranceIndex < numUtterances; ++utteranceIndex) {
            std::map<std::string, ov::ProfilingInfo> utterancePerfMap;
            uint64_t totalNumberOfRunsOnHw = 0;
            std::string uttName;
            uint32_t numFrames(0), n(0);
            std::vector<uint32_t> numFrameElementsInput;
            std::vector<uint32_t> numFramesReference(reference_name_files.size()),
                numFrameElementsReference(reference_name_files.size()),
                numBytesPerElementReference(reference_name_files.size()),
                numBytesReferenceScoreThisUtterance(reference_name_files.size());

            /** Get information from input file for current utterance **/
            numFrameElementsInput.resize(numInputFiles);
            for (size_t i = 0; i < inputFiles.size(); i++) {
                std::vector<uint8_t> ptrUtterance;
                auto inputFilename = inputFiles[i].c_str();
                uint32_t currentNumFrames(0), currentNumFrameElementsInput(0), currentNumBytesPerElementInput(0);
                file->get_file_info(inputFilename, utteranceIndex, &n, &numBytesThisUtterance[i]);
                ptrUtterance.resize(numBytesThisUtterance[i]);
                file->load_file(inputFilename,
                                utteranceIndex,
                                uttName,
                                ptrUtterance,
                                &currentNumFrames,
                                &currentNumFrameElementsInput,
                                &currentNumBytesPerElementInput);
                if (numFrames == 0) {
                    numFrames = currentNumFrames;
                } else if (numFrames != currentNumFrames) {
                    std::string errMessage("Number of frames in input files is different: " +
                                           std::to_string(numFrames) + " and " + std::to_string(currentNumFrames));
                    throw std::logic_error(errMessage);
                }
                ptrUtterances[i] = ptrUtterance;
                numFrameElementsInput[i] = currentNumFrameElementsInput;
            }
            int i = 0;
            for (auto& ptrInputBlob : ptrInputBlobs) {
                if (ptrInputBlob.get_size() != numFrameElementsInput[i++] * batchSize) {
                    throw std::logic_error("network input size(" + std::to_string(ptrInputBlob.get_size()) +
                                           ") mismatch to input file size (" +
                                           std::to_string(numFrameElementsInput[i - 1] * batchSize) + ")");
                }
            }

            double totalTime = 0.0;

            for (size_t errorIndex = 0; errorIndex < vectorFrameError.size(); errorIndex++) {
                clear_score_error(&vectorTotalError[errorIndex]);
                vectorTotalError[errorIndex].threshold = vectorFrameError[errorIndex].threshold = MAX_SCORE_DIFFERENCE;
            }

            std::vector<uint8_t*> inputFrame;
            for (auto& ut : ptrUtterances) {
                inputFrame.push_back(&ut.front());
            }
            std::map<std::string, ov::ProfilingInfo> callPerfMap;
            size_t frameIndex = 0;
            uint32_t numFramesFile = numFrames;
            numFrames += FLAGS_cw_l + FLAGS_cw_r;
            uint32_t numFramesThisBatch{batchSize};
            auto t0 = Time::now();
            auto t1 = t0;

            BaseFile* fileReferenceScores = nullptr;
            std::string refUtteranceName;

            if (!reference_data.first.empty()) {
                /** Read file with reference scores **/
                auto exReferenceScoresFile = fileExt(reference_data.first);
                if (exReferenceScoresFile == "ark") {
                    fileReferenceScores = &arkFile;
                } else if (exReferenceScoresFile == "npz") {
                    fileReferenceScores = &numpyFile;
                } else {
                    throw std::logic_error("Invalid Reference Scores file");
                }
                for (size_t next_output = 0; next_output < count_file; next_output++) {
                    if (fileReferenceScores != nullptr) {
                        fileReferenceScores->get_file_info(reference_name_files[next_output].c_str(),
                                                           utteranceIndex,
                                                           &n,
                                                           &numBytesReferenceScoreThisUtterance[next_output]);
                        vectorPtrReferenceScores[next_output].resize(numBytesReferenceScoreThisUtterance[next_output]);
                        fileReferenceScores->load_file(reference_name_files[next_output].c_str(),
                                                       utteranceIndex,
                                                       refUtteranceName,
                                                       vectorPtrReferenceScores[next_output],
                                                       &numFramesReference[next_output],
                                                       &numFrameElementsReference[next_output],
                                                       &numBytesPerElementReference[next_output]);
                    }
                }
            }

            while (frameIndex <= numFrames) {
                if (frameIndex == numFrames) {
                    if (std::find_if(inferRequests.begin(), inferRequests.end(), [&](InferRequestStruct x) {
                            return (x.frameIndex != -1);
                        }) == inferRequests.end()) {
                        break;
                    }
                }
                bool inferRequestFetched = false;
                /** Start inference loop **/
                for (auto& inferRequest : inferRequests) {
                    if (frameIndex == numFrames) {
                        numFramesThisBatch = 1;
                    } else {
                        numFramesThisBatch =
                            (numFrames - frameIndex < batchSize) ? (numFrames - frameIndex) : batchSize;
                    }

                    /* waits until inference result becomes available */
                    if (inferRequest.frameIndex != -1) {
                        inferRequest.inferRequest.wait();
                        if (inferRequest.frameIndex >= 0)
                            for (size_t next_output = 0; next_output < count_file; next_output++) {
                                const auto output_name = outputs.size() > next_output
                                                             ? outputs[next_output]
                                                             : executableNet.output(next_output).get_any_name();
                                auto dims = executableNet.output(output_name).get_shape();
                                numScoresPerOutput[next_output] = std::accumulate(std::begin(dims),
                                                                                  std::end(dims),
                                                                                  size_t{1},
                                                                                  std::multiplies<size_t>());

                                vectorPtrScores[next_output].resize(numFramesFile * numScoresPerOutput[next_output] *
                                                                    sizeof(float));

                                if (!FLAGS_o.empty()) {
                                    /* Prepare output data for save to file in future */
                                    auto outputFrame = &vectorPtrScores[next_output].front() +
                                                       numScoresPerOutput[next_output] * sizeof(float) *
                                                           (inferRequest.frameIndex) / batchSize;

                                    ov::Tensor outputBlob =
                                        inferRequest.inferRequest.get_tensor(executableNet.output(output_name));
                                    // locked memory holder should be alive all time while access to its buffer happens
                                    auto byteSize = numScoresPerOutput[next_output] * sizeof(float);
                                    std::memcpy(outputFrame, outputBlob.data<float>(), byteSize);
                                }
                                if (!FLAGS_r.empty()) {
                                    /** Compare output data with reference scores **/
                                    ov::Tensor outputBlob =
                                        inferRequest.inferRequest.get_tensor(executableNet.output(output_name));

                                    if (numScoresPerOutput[next_output] / numFrameElementsReference[next_output] ==
                                        batchSize) {
                                        compare_scores(
                                            outputBlob.data<float>(),
                                            &vectorPtrReferenceScores[next_output]
                                                                     [inferRequest.frameIndex *
                                                                      numFrameElementsReference[next_output] *
                                                                      numBytesPerElementReference[next_output]],
                                            &vectorFrameError[next_output],
                                            inferRequest.numFramesThisBatch,
                                            numFrameElementsReference[next_output]);
                                        update_score_error(&vectorFrameError[next_output],
                                                           &vectorTotalError[next_output]);
                                    } else {
                                        throw std::logic_error("Number of output and reference frames does not match.");
                                    }
                                }
                                if (FLAGS_pc) {
                                    // retrieve new counters
                                    get_performance_counters(inferRequest.inferRequest, callPerfMap);
                                    // summarize retrieved counters with all previous
                                    sum_performance_counters(callPerfMap, utterancePerfMap, totalNumberOfRunsOnHw);
                                }
                            }
                        // -----------------------------------------------------------------------------------------------------
                    }
                    if (frameIndex == numFrames) {
                        inferRequest.frameIndex = -1;
                        continue;
                    }
                    ptrInputBlobs.clear();
                    if (input_data.second.empty()) {
                        for (auto& input : cInputInfo) {
                            ptrInputBlobs.push_back(inferRequest.inferRequest.get_tensor(input));
                        }
                    } else {
                        std::vector<std::string> inputNameBlobs = input_data.second;
                        for (const auto& input : inputNameBlobs) {
                            ov::Tensor blob = inferRequests.begin()->inferRequest.get_tensor(input);
                            if (!blob) {
                                std::string errMessage("No blob with name : " + input);
                                throw std::logic_error(errMessage);
                            }
                            ptrInputBlobs.push_back(blob);
                        }
                    }

                    /** Iterate over all the input blobs **/
                    for (size_t i = 0; i < numInputFiles; ++i) {
                        ov::Tensor minput = ptrInputBlobs[i];
                        if (!minput) {
                            std::string errMessage("We expect ptrInputBlobs[" + std::to_string(i) +
                                                   "] to be inherited from Tensor, " +
                                                   "but in fact we were not able to cast input to Tensor");
                            throw std::logic_error(errMessage);
                        }
                        memcpy(minput.data(),
                               inputFrame[i],
                               numFramesThisBatch * numFrameElementsInput[i] * sizeof(float));
                        // Zero-pad the rest of the tensor when fewer frames than the batch size remain
                        if (batchSize != numFramesThisBatch) {
                            memset(minput.data<float>() + numFramesThisBatch * numFrameElementsInput[i],
                                   0,
                                   (batchSize - numFramesThisBatch) * numFrameElementsInput[i] * sizeof(float));
                        }
                    }
                    // -----------------------------------------------------------------------------------------------------
                    int index = static_cast<int>(frameIndex) - (FLAGS_cw_l + FLAGS_cw_r);
                    /* Starting inference in asynchronous mode*/
                    inferRequest.inferRequest.start_async();
                    inferRequest.frameIndex = index < 0 ? -2 : index;
                    inferRequest.numFramesThisBatch = numFramesThisBatch;
                    frameIndex += numFramesThisBatch;
                    for (size_t j = 0; j < inputFiles.size(); j++) {
                        if (FLAGS_cw_l > 0 || FLAGS_cw_r > 0) {
                            int idx = frameIndex - FLAGS_cw_l;
                            if (idx > 0 && idx < static_cast<int>(numFramesFile)) {
                                inputFrame[j] += sizeof(float) * numFrameElementsInput[j] * numFramesThisBatch;
                            } else if (idx >= static_cast<int>(numFramesFile)) {
                                inputFrame[j] = &ptrUtterances[j].front() + (numFramesFile - 1) * sizeof(float) *
                                                                                numFrameElementsInput[j] *
                                                                                numFramesThisBatch;
                            } else if (idx <= 0) {
                                inputFrame[j] = &ptrUtterances[j].front();
                            }
                        } else {
                            inputFrame[j] += sizeof(float) * numFrameElementsInput[j] * numFramesThisBatch;
                        }
                    }
                    inferRequestFetched |= true;
                }
                /** Inference was finished for current frame **/
                if (!inferRequestFetched) {
                    std::this_thread::sleep_for(std::chrono::milliseconds(1));
                    continue;
                }
            }
            t1 = Time::now();
            fsec fs = t1 - t0;
            ms d = std::chrono::duration_cast<ms>(fs);
            totalTime += d.count();
            // resetting state between utterances
            for (auto&& state : inferRequests.begin()->inferRequest.query_state()) {
                state.reset();
            }
            // -----------------------------------------------------------------------------------------------------

            // --------------------------- Step 6. Process output
            // -------------------------------------------------------

            /** Show performance results **/
            std::cout << "Utterance " << utteranceIndex << ": " << std::endl;
            std::cout << "Total time in Infer (HW and SW):\t" << totalTime << " ms" << std::endl;
            std::cout << "Frames in utterance:\t\t\t" << numFrames << " frames" << std::endl;
            std::cout << "Average Infer time per frame:\t\t" << totalTime / static_cast<double>(numFrames) << " ms\n"
                      << std::endl;

            if (FLAGS_pc) {
                // print performance results
                print_performance_counters(utterancePerfMap,
                                           frameIndex,
                                           std::cout,
                                           getFullDeviceName(core, FLAGS_d),
                                           totalNumberOfRunsOnHw,
                                           FLAGS_d);
            }

            for (size_t next_output = 0; next_output < count_file; next_output++) {
                if (!FLAGS_o.empty()) {
                    auto exOutputScoresFile = fileExt(output_data.first);
                    if (exOutputScoresFile == "ark") {
                        fileOutput = &arkFile;
                    } else if (exOutputScoresFile == "npz") {
                        fileOutput = &numpyFile;
                    } else {
                        throw std::logic_error("Invalid Output Scores file");
                    }
                    /* Save output data to file */
                    bool shouldAppend = (utteranceIndex != 0);
                    fileOutput->save_file(output_name_files[next_output].c_str(),
                                          shouldAppend,
                                          uttName,
                                          &vectorPtrScores[next_output].front(),
                                          numFramesFile,
                                          numScoresPerOutput[next_output] / batchSize);
                }
                if (!FLAGS_r.empty()) {
                    // print statistical score error
                    const auto output_name = outputs.size() > next_output
                                                 ? outputs[next_output]
                                                 : executableNet.output(next_output).get_any_name();
                    std::cout << "Output name: " << output_name << std::endl;
                    std::cout << "Number scores per frame: " << numScoresPerOutput[next_output] / batchSize << std::endl
                              << std::endl;
                    print_reference_compare_results(vectorTotalError[next_output], numFrames, std::cout);
                }
            }
        }
    } catch (const std::exception& error) {
        slog::err << error.what() << slog::endl;
        return 1;
    } catch (...) {
        slog::err << "Unknown/internal exception happened" << slog::endl;
        return 1;
    }
    slog::info << "Execution successful" << slog::endl;
    return 0;
}

You can see an explicit description of each sample step in the Integration Steps section of the “Integrate OpenVINO™ Runtime with Your Application” guide.
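
The frame batching in Step 5 of the listing above can be sketched in a few lines of Python. This is a simplified illustration, not the sample's code: the helper name `iter_batches` is hypothetical, and the sketch leaves out context windows (-cw_l, -cw_r) and the infer request bookkeeping. It only shows how an utterance is cut into batches of at most the batch size, with the last batch zero-padded.

```python
import numpy as np


def iter_batches(utterance: np.ndarray, batch_size: int):
    """Yield (frame_index, batch, frames_in_batch) for one utterance.

    `utterance` has shape (num_frames, num_elements). Every yielded batch
    has shape (batch_size, num_elements); unused rows stay zero.
    """
    num_frames, num_elements = utterance.shape
    frame_index = 0
    while frame_index < num_frames:
        # The last batch may hold fewer frames than batch_size.
        frames_in_batch = min(batch_size, num_frames - frame_index)
        batch = np.zeros((batch_size, num_elements), dtype=np.float32)
        batch[:frames_in_batch] = utterance[frame_index:frame_index + frames_in_batch]
        yield frame_index, batch, frames_in_batch
        frame_index += frames_in_batch


# Hypothetical utterance: 10 frames of 3 features each, batch size 4.
utterance = np.arange(10 * 3, dtype=np.float32).reshape(10, 3)
batches = list(iter_batches(utterance, batch_size=4))
```

With 10 frames and a batch size of 4, the sketch produces three batches that start at frames 0, 4, and 8; the last one carries two real frames and two rows of zeros.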

GNA-specific details

Quantization

If the GNA device is selected (for example, using the -d GNA flag), the GNA OpenVINO™ Runtime plugin quantizes the model and input feature vector sequence to integer representation before performing inference.

The following neural model quantization modes are available:

  • static - The first utterance in the input file is scanned for dynamic range. The scale factor (floating point scalar multiplier) required to scale the maximum input value of the first utterance to 16384 (15 bits) is used for all subsequent inputs. The model is quantized to accommodate the scaled input dynamic range.

  • user-defined - The user may specify a scale factor via the -sf flag that will be used for static quantization.
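
The arithmetic behind the static mode can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the fallback for an all-zero utterance is a choice made for this sketch, and the 16-bit integer target is assumed for illustration. The actual quantization happens inside the GNA plugin.

```python
import numpy as np

TARGET_MAX = 16384  # value the largest input magnitude is scaled to


def static_scale_factor(first_utterance: np.ndarray, target_max: float = TARGET_MAX) -> float:
    """Scale factor that maps the largest absolute input value to target_max."""
    max_abs = float(np.max(np.abs(first_utterance)))
    if max_abs == 0.0:
        return 1.0  # assumption: no scaling for an all-zero utterance
    return target_max / max_abs


def quantize_int16(features: np.ndarray, scale_factor: float) -> np.ndarray:
    """Scale, round, and clip features to the int16 range (assumed target type)."""
    scaled = np.round(features * scale_factor)
    return np.clip(scaled, -32768, 32767).astype(np.int16)


# Hypothetical first utterance: two frames with two features each.
first_utterance = np.array([[0.5, -2.0], [1.0, 0.25]], dtype=np.float32)
sf = static_scale_factor(first_utterance)
quantized = quantize_int16(first_utterance, sf)
```

Here the largest magnitude is 2.0, so the scale factor is 16384 / 2.0 = 8192. In user-defined mode, the value passed with -sf would take the place of the computed factor.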

The -qb flag provides a hint to the GNA plugin regarding the preferred target weight resolution for all layers. For example, when -qb 8 is specified, the plugin will use 8-bit weights wherever possible in the model.

Note

It is not always possible to use 8-bit weights due to GNA hardware limitations. For example, convolutional layers always use 16-bit weights (GNA hardware version 1 and 2). This limitation will be removed in GNA hardware version 3 and higher.

Execution Modes

Several execution modes are supported via the -d flag:

  • CPU - All calculations are performed on the CPU device using the CPU plugin.

  • GPU - All calculations are performed on the GPU device using the GPU plugin.

  • NPU - All calculations are performed on the NPU device using the NPU plugin.

  • GNA_AUTO - GNA hardware is used if available and the driver is installed. Otherwise, the GNA device is emulated in fast-but-not-bit-exact mode.

  • GNA_HW - GNA hardware is used if available and the driver is installed. Otherwise, an error will occur.

  • GNA_SW - Deprecated. The GNA device is emulated in fast-but-not-bit-exact mode.

  • GNA_SW_FP32 - Substitutes parameters and calculations from low precision to floating point (FP32).

  • GNA_SW_EXACT - GNA device is emulated in bit-exact mode.
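
One way to picture how a -d value relates to a plugin and a GNA execution mode is the small helper below. It is a hypothetical sketch, not the sample's implementation: the function name is invented, treating a bare GNA value as GNA_AUTO is an assumption, and any other value (including HETERO strings) is passed through unchanged.

```python
from typing import Optional, Tuple

# GNA execution modes listed above.
GNA_MODES = {"GNA_AUTO", "GNA_HW", "GNA_SW", "GNA_SW_FP32", "GNA_SW_EXACT"}


def split_device_flag(device: str) -> Tuple[str, Optional[str]]:
    """Return (plugin_name, gna_execution_mode) for a -d value."""
    if device in GNA_MODES:
        return "GNA", device
    if device == "GNA":
        return "GNA", "GNA_AUTO"  # assumption: bare GNA behaves like GNA_AUTO
    return device, None  # CPU, GPU, NPU, HETERO:... are passed through


cpu = split_device_flag("CPU")
exact = split_device_flag("GNA_SW_EXACT")
```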

Loading and Saving Models

The GNA plugin supports loading and saving of the GNA-optimized model (non-IR) via the -rg and -wg flags. Thereby, it is possible to avoid the cost of full model quantization at run time. The GNA plugin also supports export of firmware-compatible embedded model images for the Intel® Speech Enabling Developer Kit and Amazon Alexa Premium Far-Field Voice Development Kit via the -we flag (save only).

In addition to performing inference directly from a GNA model file, these options make it possible to:

  • Convert from IR format to GNA format model file (-m, -wg)

  • Convert from IR format to embedded format model file (-m, -we)

  • Convert from GNA format to embedded format model file (-rg, -we)
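
The flag combinations above can be illustrated with a small `argparse` sketch. It mirrors only the -m, -rg, -wg, and -we flags (the real sample also requires -i and accepts many more options), and `describe_conversion` and the file names are hypothetical. The model source is a required, mutually exclusive choice between -m and -rg, matching the usage message.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="speech_sample_sketch")
    # Exactly one model source must be given: IR (-m) or GNA model (-rg).
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("-m", "--model")
    source.add_argument("-rg", "--import_gna_model")
    parser.add_argument("-wg", "--export_gna_model")
    parser.add_argument("-we", "--export_embedded_gna_model")
    return parser


def describe_conversion(args: argparse.Namespace) -> str:
    """Summarize which conversion a flag combination requests."""
    source = "IR" if args.model else "GNA"
    if args.export_gna_model:
        return f"{source} -> GNA"
    if args.export_embedded_gna_model:
        return f"{source} -> embedded"
    return f"inference from {source} model"


parser = build_parser()
ir_to_gna = describe_conversion(parser.parse_args(["-m", "model.xml", "-wg", "model.gna"]))
gna_to_embedded = describe_conversion(parser.parse_args(["-rg", "model.gna", "-we", "model.bin"]))
```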

Running

Run the application with the -h option to see the usage message:

python speech_sample.py -h

Usage message:

usage: speech_sample.py [-h] (-m MODEL | -rg IMPORT_GNA_MODEL) -i INPUT [-o OUTPUT] [-r REFERENCE] [-d DEVICE] [-bs [1-8]]
                        [-layout LAYOUT] [-qb [8, 16]] [-sf SCALE_FACTOR] [-wg EXPORT_GNA_MODEL]
                        [-we EXPORT_EMBEDDED_GNA_MODEL] [-we_gen [GNA1, GNA3]]
                        [--exec_target [GNA_TARGET_2_0, GNA_TARGET_3_0]] [-pc] [-a [CORE, ATOM]] [-iname INPUT_LAYERS]
                        [-oname OUTPUT_LAYERS] [-cw_l CONTEXT_WINDOW_LEFT] [-cw_r CONTEXT_WINDOW_RIGHT] [-pwl_me PWL_ME]

optional arguments:
  -m MODEL, --model MODEL
                        Path to an .xml file with a trained model (required if -rg is missing).
  -rg IMPORT_GNA_MODEL, --import_gna_model IMPORT_GNA_MODEL
                        Read GNA model from file using path/filename provided (required if -m is missing).

Options:
  -h, --help            Show this help message and exit.
  -i INPUT, --input INPUT
                        Required. Path(s) to input file(s).
                        Usage for a single file/layer: <input_file.ark> or <input_file.npz>.
                        Example of usage for several files/layers: <layer1>:<port_num1>=<input_file1.ark>,<layer2>:<port_num2>=<input_file2.ark>.
  -o OUTPUT, --output OUTPUT
                        Optional. Output file name(s) to save scores (inference results).
                        Usage for a single file/layer: <output_file.ark> or <output_file.npz>.
                        Example of usage for several files/layers: <layer1>:<port_num1>=<output_file1.ark>,<layer2>:<port_num2>=<output_file2.ark>.
  -r REFERENCE, --reference REFERENCE
                        Read reference score file(s) and compare inference results with reference scores.
                        Usage for a single file/layer: <reference_file.ark> or <reference_file.npz>.
                        Example of usage for several files/layers: <layer1>:<port_num1>=<reference_file1.ark>,<layer2>:<port_num2>=<reference_file2.ark>.
  -d DEVICE, --device DEVICE
                        Optional. Specify a target device to infer on. CPU, GPU, NPU, GNA_AUTO, GNA_HW, GNA_SW_FP32,
                        GNA_SW_EXACT and HETERO with combination of GNA as the primary device and CPU as a secondary (e.g.
                        HETERO:GNA,CPU) are supported. The sample will look for a suitable plugin for device specified.
                        Default value is CPU.
  -bs [1-8], --batch_size [1-8]
                        Optional. Batch size 1-8.
  -layout LAYOUT        Optional. Custom layout in format: "input0[value0],input1[value1]" or "[value]" (applied to all
                        inputs)
  -qb [8, 16], --quantization_bits [8, 16]
                        Optional. Weight resolution in bits for GNA quantization: 8 or 16 (default 16).
  -sf SCALE_FACTOR, --scale_factor SCALE_FACTOR
                        Optional. User-specified input scale factor for GNA quantization.
                        If the model contains multiple inputs, provide scale factors by separating them with commas.
                        For example: <layer1>:<sf1>,<layer2>:<sf2> or just <sf> to be applied to all inputs.
  -wg EXPORT_GNA_MODEL, --export_gna_model EXPORT_GNA_MODEL
                        Optional. Write GNA model to file using path/filename provided.
  -we EXPORT_EMBEDDED_GNA_MODEL, --export_embedded_gna_model EXPORT_EMBEDDED_GNA_MODEL
                        Optional. Write GNA embedded model to file using path/filename provided.
  -we_gen [GNA1, GNA3], --embedded_gna_configuration [GNA1, GNA3]
                        Optional. GNA generation configuration string for embedded export. Can be GNA1 (default) or GNA3.
  --exec_target [GNA_TARGET_2_0, GNA_TARGET_3_0]
                        Optional. Specify GNA execution target generation. By default, generation corresponds to the GNA HW
                        available in the system or the latest fully supported generation by the software. See the GNA
                        Plugin's GNA_EXEC_TARGET config option description.
  -pc, --performance_counter
                        Optional. Enables performance report (specify -a to ensure arch accurate results).
  -a [CORE, ATOM], --arch [CORE, ATOM]
                        Optional. Specify architecture. CORE, ATOM with the combination of -pc.
  -cw_l CONTEXT_WINDOW_LEFT, --context_window_left CONTEXT_WINDOW_LEFT
                        Optional. Number of frames for left context windows (default is 0). Works only with context window
                        models. If you use the cw_l or cw_r flag, then batch size argument is ignored.
  -cw_r CONTEXT_WINDOW_RIGHT, --context_window_right CONTEXT_WINDOW_RIGHT
                        Optional. Number of frames for right context windows (default is 0). Works only with context window
                        models. If you use the cw_l or cw_r flag, then batch size argument is ignored.
  -pwl_me PWL_ME        Optional. The maximum percent of error for PWL function. The value must be in <0, 100> range. The
                        default value is 1.0.
speech_sample -h

Usage message:

[ INFO ] OpenVINO Runtime version ......... <version>
[ INFO ] Build ........... <build>
[ INFO ]
[ INFO ] Parsing input parameters

speech_sample [OPTION]
Options:

    -h                         Print a usage message.
    -i "<path>"                Required. Path(s) to input file(s). Usage for a single file/layer: <input_file.ark> or <input_file.npz>. Example of usage for several files/layers: <layer1>:<port_num1>=<input_file1.ark>,<layer2>:<port_num2>=<input_file2.ark>.
    -m "<path>"                Required. Path to an .xml file with a trained model (required if -rg is missing).
    -o "<path>"                Optional. Output file name(s) to save scores (inference results). Example of usage for a single file/layer: <output_file.ark> or <output_file.npz>. Example of usage for several files/layers: <layer1>:<port_num1>=<output_file1.ark>,<layer2>:<port_num2>=<output_file2.ark>.
    -d "<device>"              Optional. Specify a target device to infer on. CPU, GPU, NPU, GNA_AUTO, GNA_HW, GNA_HW_WITH_SW_FBACK, GNA_SW_FP32, GNA_SW_EXACT and HETERO with combination of GNA as the primary device and CPU as a secondary (e.g. HETERO:GNA,CPU) are supported. The sample will look for a suitable plugin for device specified.
    -pc                        Optional. Enables per-layer performance report.
    -q "<mode>"                Optional. Input quantization mode for GNA: static (default) or user defined (use with -sf).
    -qb "<integer>"            Optional. Weight resolution in bits for GNA quantization: 8 or 16 (default)
    -sf "<double>"             Optional. User-specified input scale factor for GNA quantization (use with -q user). If the model contains multiple inputs, provide scale factors by separating them with commas. For example: <layer1>:<sf1>,<layer2>:<sf2> or just <sf> to be applied to all inputs.
    -bs "<integer>"            Optional. Batch size 1-8 (default 1)
    -r "<path>"                Optional. Read reference score file(s) and compare inference results with reference scores. Usage for a single file/layer: <reference.ark> or <reference.npz>. Example of usage for several files/layers: <layer1>:<port_num1>=<reference_file1.ark>,<layer2>:<port_num2>=<reference_file2.ark>.
    -rg "<path>"               Read GNA model from file using path/filename provided (required if -m is missing).
    -wg "<path>"               Optional. Write GNA model to file using path/filename provided.
    -we "<path>"               Optional. Write GNA embedded model to file using path/filename provided.
    -cw_l "<integer>"          Optional. Number of frames for left context windows (default is 0). Works only with context window networks. If you use the cw_l or cw_r flag, then batch size argument is ignored.
    -cw_r "<integer>"          Optional. Number of frames for right context windows (default is 0). Works only with context window networks. If you use the cw_r or cw_l flag, then batch size argument is ignored.
    -layout "<string>"         Optional. Prompts how network layouts should be treated by application. For example, "input1[NCHW],input2[NC]" or "[NCHW]" in case of one input size.
    -pwl_me "<double>"         Optional. The maximum percent of error for PWL function.The value must be in <0, 100> range. The default value is 1.0.
    -exec_target "<string>"    Optional. Specify GNA execution target generation. May be one of GNA_TARGET_2_0, GNA_TARGET_3_0. By default, generation corresponds to the GNA HW available in the system or the latest fully supported generation by the software. See the GNA Plugin's GNA_EXEC_TARGET config option description.
    -compile_target "<string>" Optional. Specify GNA compile target generation. May be one of GNA_TARGET_2_0, GNA_TARGET_3_0. By default, generation corresponds to the GNA HW available in the system or the latest fully supported generation by the software. See the GNA Plugin's GNA_COMPILE_TARGET config option description.
    -memory_reuse_off          Optional. Disables memory optimizations for compiled model.

Available target devices:  CPU  GNA  GPU  NPU
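
The multi-file syntax accepted by -i, -o and -r (`<layer>:<port_num>=<file>` items separated by commas) and the per-input syntax accepted by -sf (`<layer>:<sf>` items, or a single `<sf>`) can be illustrated with a small parser. This is only a sketch of how such strings might be split; the helper names `parse_file_spec` and `parse_scale_factors` are hypothetical and are not taken from the sample's own argument handling.

```python
from typing import Dict, Optional


def parse_file_spec(spec: str) -> Dict[Optional[str], str]:
    """Map '<layer>:<port_num>' names to file paths.

    A bare path (no '=') is stored under the key None, meaning
    "the only input or output of the model".
    """
    mapping: Dict[Optional[str], str] = {}
    for item in spec.split(','):
        if '=' in item:
            # Split on the first '=' only, so the path itself may contain '='.
            name, path = item.split('=', 1)
            mapping[name] = path
        else:
            mapping[None] = item
    return mapping


def parse_scale_factors(spec: str) -> Dict[Optional[str], float]:
    """Map layer names to scale factors; a bare number is stored under None."""
    mapping: Dict[Optional[str], float] = {}
    for item in spec.split(','):
        # rpartition splits on the last ':', so the number is always the tail.
        name, sep, value = item.rpartition(':')
        mapping[name if sep else None] = float(value)
    return mapping


print(parse_file_spec("layer1:0=input_file1.ark,layer2:0=input_file2.ark"))
print(parse_scale_factors("2175.43"))
```

Storing the single-file and single-factor cases under `None` is a design choice of this sketch: it lets the caller apply that value to every input without knowing the layer names in advance.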

Model Preparation

You can use the following model conversion command to convert a Kaldi nnet1 or nnet2 model to OpenVINO Intermediate Representation (IR) format:

mo --framework kaldi --input_model wsj_dnn5b.nnet --counts wsj_dnn5b.counts --remove_output_softmax --output_dir <OUTPUT_MODEL_DIR>

The following pre-trained models are available:

  • rm_cnn4a_smbr

  • rm_lstm4f

  • wsj_dnn5b_smbr

All of them can be downloaded from the storage.

Speech Inference

Once the IR has been created, you can do inference on Intel® Processors with the GNA co-processor (or emulation library):

python speech_sample.py -m wsj_dnn5b.xml -i dev93_10.ark -r dev93_scores_10.ark -d GNA_AUTO -o result.npz
speech_sample -m wsj_dnn5b.xml -i dev93_10.ark -r dev93_scores_10.ark -d GNA_AUTO -o result.ark

Here, the floating point Kaldi-generated reference neural network scores (dev93_scores_10.ark) corresponding to the input feature file (dev93_10.ark) are assumed to be available for comparison.
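
Scores saved with -o result.npz can be inspected with NumPy. The sketch below assumes that the file holds one two-dimensional array per utterance (frames by scores); that layout is an assumption and is not stated on this page. The helper name `summarize_scores` is hypothetical, and a small stand-in file is generated so the example runs without the sample.

```python
import os
import tempfile

import numpy as np


def summarize_scores(npz_path: str) -> dict:
    """Return {array_name: shape} for every array stored in an .npz file."""
    with np.load(npz_path) as data:
        return {name: data[name].shape for name in data.files}


# Stand-in for a real result file: two fake utterances, 3425 scores per frame.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "result.npz")
np.savez(
    demo_path,
    utt0=np.zeros((4, 3425), dtype=np.float32),
    utt1=np.zeros((2, 3425), dtype=np.float32),
)

for name, shape in summarize_scores(demo_path).items():
    print(f"{name}: {shape[0]} frames x {shape[1]} scores")
```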

Note

  • Before running the sample with a trained model, make sure the model is converted to the Intermediate Representation (IR) format (*.xml + *.bin) using the model conversion API.

  • The sample supports input and output in the NumPy file format (.npz).

  • When you specify single options multiple times, only the last value will be used. For example, the -m flag:

    python speech_sample.py -m model.xml -m model2.xml
    
    ./speech_sample -m model.xml -m model2.xml
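
The "last value wins" behavior matches what Python's argparse does with its default `store` action. The snippet below is a minimal illustration under the assumption that the option is declared in that way; the sample's own arg_parser module is not reproduced here.

```python
import argparse

# A parser with a single string option, declared with the default 'store' action.
parser = argparse.ArgumentParser(add_help=False)
parser.add_argument('-m', '--model', type=str)

# Passing the same option twice overwrites the first value with the second.
args = parser.parse_args(['-m', 'model.xml', '-m', 'model2.xml'])
print(args.model)  # model2.xml
```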
    

Sample Output

The Python sample application logs each step in a standard output stream.

[ INFO ] Creating OpenVINO Runtime Core
[ INFO ] Reading the model: /models/wsj_dnn5b_smbr_fp32.xml
[ INFO ] Using scale factor(s) calculated from first utterance
[ INFO ] For input 0 using scale factor of 2175.4322418
[ INFO ] Loading the model to the plugin
[ INFO ] Starting inference in synchronous mode
[ INFO ]
[ INFO ] Utterance 0:
[ INFO ] Total time in Infer (HW and SW): 6326.06ms
[ INFO ] Frames in utterance: 1294
[ INFO ] Average Infer time per frame: 4.89ms
[ INFO ]
[ INFO ] Output blob name: affinetransform14
[ INFO ] Number scores per frame: 3425
[ INFO ]
[ INFO ] max error: 0.7051840
[ INFO ] avg error: 0.0448388
[ INFO ] avg rms error: 0.0582387
[ INFO ] stdev error: 0.0371650
[ INFO ]
[ INFO ] Utterance 1:
[ INFO ] Total time in Infer (HW and SW): 4526.57ms
[ INFO ] Frames in utterance: 1005
[ INFO ] Average Infer time per frame: 4.50ms
[ INFO ]
[ INFO ] Output blob name: affinetransform14
[ INFO ] Number scores per frame: 3425
[ INFO ]
[ INFO ] max error: 0.7575974
[ INFO ] avg error: 0.0452166
[ INFO ] avg rms error: 0.0586013
[ INFO ] stdev error: 0.0372769
[ INFO ]
[ INFO ] Utterance 2:
[ INFO ] Total time in Infer (HW and SW): 6636.56ms
[ INFO ] Frames in utterance: 1471
[ INFO ] Average Infer time per frame: 4.51ms
[ INFO ]
[ INFO ] Output blob name: affinetransform14
[ INFO ] Number scores per frame: 3425
[ INFO ]
[ INFO ] max error: 0.7191710
[ INFO ] avg error: 0.0472226
[ INFO ] avg rms error: 0.0612991
[ INFO ] stdev error: 0.0390846
[ INFO ]
[ INFO ] Utterance 3:
[ INFO ] Total time in Infer (HW and SW): 3927.01ms
[ INFO ] Frames in utterance: 845
[ INFO ] Average Infer time per frame: 4.65ms
[ INFO ]
[ INFO ] Output blob name: affinetransform14
[ INFO ] Number scores per frame: 3425
[ INFO ]
[ INFO ] max error: 0.7436461
[ INFO ] avg error: 0.0477581
[ INFO ] avg rms error: 0.0621334
[ INFO ] stdev error: 0.0397457
[ INFO ]
[ INFO ] Utterance 4:
[ INFO ] Total time in Infer (HW and SW): 3891.49ms
[ INFO ] Frames in utterance: 855
[ INFO ] Average Infer time per frame: 4.55ms
[ INFO ]
[ INFO ] Output blob name: affinetransform14
[ INFO ] Number scores per frame: 3425
[ INFO ]
[ INFO ] max error: 0.7071600
[ INFO ] avg error: 0.0449147
[ INFO ] avg rms error: 0.0585048
[ INFO ] stdev error: 0.0374897
[ INFO ]
[ INFO ] Utterance 5:
[ INFO ] Total time in Infer (HW and SW): 3378.61ms
[ INFO ] Frames in utterance: 699
[ INFO ] Average Infer time per frame: 4.83ms
[ INFO ]
[ INFO ] Output blob name: affinetransform14
[ INFO ] Number scores per frame: 3425
[ INFO ]
[ INFO ] max error: 0.8870468
[ INFO ] avg error: 0.0479243
[ INFO ] avg rms error: 0.0625490
[ INFO ] stdev error: 0.0401951
[ INFO ]
[ INFO ] Utterance 6:
[ INFO ] Total time in Infer (HW and SW): 4034.31ms
[ INFO ] Frames in utterance: 790
[ INFO ] Average Infer time per frame: 5.11ms
[ INFO ]
[ INFO ] Output blob name: affinetransform14
[ INFO ] Number scores per frame: 3425
[ INFO ]
[ INFO ] max error: 0.7648273
[ INFO ] avg error: 0.0482702
[ INFO ] avg rms error: 0.0629734
[ INFO ] stdev error: 0.0404429
[ INFO ]
[ INFO ] Utterance 7:
[ INFO ] Total time in Infer (HW and SW): 2854.04ms
[ INFO ] Frames in utterance: 622
[ INFO ] Average Infer time per frame: 4.59ms
[ INFO ]
[ INFO ] Output blob name: affinetransform14
[ INFO ] Number scores per frame: 3425
[ INFO ]
[ INFO ] max error: 0.7389560
[ INFO ] avg error: 0.0465543
[ INFO ] avg rms error: 0.0604941
[ INFO ] stdev error: 0.0386294
[ INFO ]
[ INFO ] Utterance 8:
[ INFO ] Total time in Infer (HW and SW): 2493.28ms
[ INFO ] Frames in utterance: 548
[ INFO ] Average Infer time per frame: 4.55ms
[ INFO ]
[ INFO ] Output blob name: affinetransform14
[ INFO ] Number scores per frame: 3425
[ INFO ]
[ INFO ] max error: 0.6680136
[ INFO ] avg error: 0.0439341
[ INFO ] avg rms error: 0.0574614
[ INFO ] stdev error: 0.0370353
[ INFO ]
[ INFO ] Utterance 9:
[ INFO ] Total time in Infer (HW and SW): 1654.67ms
[ INFO ] Frames in utterance: 368
[ INFO ] Average Infer time per frame: 4.50ms
[ INFO ]
[ INFO ] Output blob name: affinetransform14
[ INFO ] Number scores per frame: 3425
[ INFO ]
[ INFO ] max error: 0.6550579
[ INFO ] avg error: 0.0467643
[ INFO ] avg rms error: 0.0605045
[ INFO ] stdev error: 0.0383914
[ INFO ]
[ INFO ] Total sample time: 39722.60ms
[ INFO ] File result.npz was created!
[ INFO ] This sample is an API example, for any performance measurements please use the dedicated benchmark_app tool
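
The four error figures reported per utterance in the log above can be read as statistics of the absolute difference between the inferred scores and the reference scores. The following sketch shows one set of definitions that fits that reading; the function name `error_stats` is hypothetical and the formulas are an assumption rather than a copy of the sample's compare_with_reference helper. The Utterance 0 values in the log above are at least consistent with the relation stdev² ≈ rms² − avg² used here.

```python
import numpy as np


def error_stats(scores, reference) -> dict:
    """Summarize the element-wise absolute error between two score matrices."""
    err = np.abs(np.asarray(scores, dtype=np.float64)
                 - np.asarray(reference, dtype=np.float64))
    avg = err.mean()
    mean_sq = np.mean(err ** 2)
    return {
        'max error': float(err.max()),
        'avg error': float(avg),
        'avg rms error': float(np.sqrt(mean_sq)),
        # Variance as E[x^2] - E[x]^2, clamped at zero against rounding noise.
        'stdev error': float(np.sqrt(max(mean_sq - avg ** 2, 0.0))),
    }


stats = error_stats([[1.0, 2.0], [3.0, 4.0]], [[1.0, 1.0], [3.0, 2.0]])
for name, value in stats.items():
    print(f"{name}: {value:.7f}")
```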

The C++ sample application logs each step in a standard output stream.

[ INFO ] OpenVINO runtime: OpenVINO Runtime version ......... 2022.1.0
[ INFO ] Build ........... 2022.1.0-6311-a90bb1ff017
[ INFO ]
[ INFO ] Parsing input parameters
[ INFO ] Loading model files:
[ INFO ] \test_data\models\wsj_dnn5b_smbr_fp32\wsj_dnn5b_smbr_fp32.xml
[ INFO ] Using scale factor of 2175.43 calculated from first utterance.
[ INFO ] Model loading time 0.0034 ms
[ INFO ] Loading model to the device GNA_AUTO
[ INFO ] Loading model to the device
[ INFO ] Number scores per frame : 3425
Utterance 0:
Total time in Infer (HW and SW):        5687.53 ms
Frames in utterance:                    1294 frames
Average Infer time per frame:           4.39531 ms
         max error: 0.705184
         avg error: 0.0448388
     avg rms error: 0.0574098
       stdev error: 0.0371649


End of Utterance 0

[ INFO ] Number scores per frame : 3425
Utterance 1:
Total time in Infer (HW and SW):        4341.34 ms
Frames in utterance:                    1005 frames
Average Infer time per frame:           4.31974 ms
         max error: 0.757597
         avg error: 0.0452166
     avg rms error: 0.0578436
       stdev error: 0.0372769


End of Utterance 1

...
End of Utterance X

[ INFO ] Execution successful
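
Both logs report an input scale factor calculated from the first utterance. One common way to derive such a factor for integer quantization is to map the largest absolute feature value onto a fixed target magnitude. The sketch below follows that idea; the function names and the target value of 16384 are assumptions made for illustration and are not confirmed by this page.

```python
import numpy as np


def scale_factor_from_utterance(features: np.ndarray,
                                target_max: float = 16384.0) -> float:
    """Scale factor that maps the largest |value| onto target_max (assumed)."""
    max_abs = float(np.max(np.abs(features)))
    if max_abs == 0.0:
        return 1.0  # an all-zero utterance needs no scaling
    return target_max / max_abs


def quantize_int16(features: np.ndarray, scale_factor: float) -> np.ndarray:
    """Scale, round and saturate float features to 16-bit integers."""
    scaled = np.round(features * scale_factor)
    return np.clip(scaled, -32768, 32767).astype(np.int16)


utterance = np.array([[0.5, -2.0], [1.0, 0.25]], dtype=np.float32)
sf = scale_factor_from_utterance(utterance)
print(sf)
print(quantize_int16(utterance, sf))
```

Deriving the factor from a single utterance is cheap, but later utterances with larger values would saturate; the -q user and -sf options listed earlier let the user supply a factor instead.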

Use of C++ Sample in Kaldi Speech Recognition Pipeline

The Wall Street Journal DNN model used in this example was prepared using the Kaldi s5 recipe and the Kaldi Nnet (nnet1) framework. It is possible to recognize speech by substituting the speech_sample for Kaldi’s nnet-forward command. Since the speech_sample does not yet use pipes, it is necessary to use temporary files for speaker-transformed feature vectors and scores when running the Kaldi speech recognition pipeline. The following operations assume that feature extraction was already performed according to the s5 recipe and that the working directory within the Kaldi source tree is egs/wsj/s5.

  1. Prepare a speaker-transformed feature set, given that the feature transform is specified in final.feature_transform and the feature files are specified in feats.scp:

    nnet-forward --use-gpu=no final.feature_transform "ark,s,cs:copy-feats scp:feats.scp ark:- |" ark:feat.ark
    
  2. Score the feature set, using the speech_sample:

    ./speech_sample -d GNA_AUTO -bs 8 -i feat.ark -m wsj_dnn5b.xml -o scores.ark
    

    The OpenVINO™ toolkit Intermediate Representation file wsj_dnn5b.xml was generated in the Model Preparation section above.

  3. Run the Kaldi decoder to produce n-best text hypotheses and select most likely text, given that the WFST (HCLG.fst), vocabulary (words.txt), and TID/PID mapping (final.mdl) are specified:

    latgen-faster-mapped --max-active=7000 --max-mem=50000000 --beam=13.0 --lattice-beam=6.0 --acoustic-scale=0.0833 --allow-partial=true --word-symbol-table=words.txt final.mdl HCLG.fst ark:scores.ark ark:- | lattice-scale --inv-acoustic-scale=13 ark:- ark:- | lattice-best-path --word-symbol-table=words.txt ark:- ark,t:- > out.txt &
    
  4. Run the word error rate tool to check accuracy, given that the vocabulary (words.txt) and reference transcript (test_filt.txt) are specified:

    cat out.txt | utils/int2sym.pl -f 2- words.txt | sed s:\<UNK\>::g | compute-wer --text --mode=present ark:test_filt.txt ark,p:-
    

    All of the files can be downloaded from the storage.
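
For readers unfamiliar with the metric computed in step 4, word error rate is the minimum number of word substitutions, deletions and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The sketch below illustrates that standard definition only; it is not Kaldi's compute-wer implementation, and the function names are hypothetical.

```python
from typing import Sequence


def word_errors(reference: Sequence[str], hypothesis: Sequence[str]) -> int:
    """Word-level Levenshtein distance between reference and hypothesis."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate of a hypothesis against a non-empty reference."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference transcript is empty")
    return word_errors(ref_words, hypothesis.split()) / len(ref_words)


# One deleted word out of six reference words.
print(wer("the cat sat on the mat", "the cat sat on mat"))
```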