GNA Device

The Intel® Gaussian & Neural Accelerator (GNA) is a low-power neural coprocessor for continuous inference at the edge.

Intel® GNA is not intended to replace typical inference devices such as the CPU and GPU. It is designed for offloading continuous inference workloads including but not limited to noise reduction or speech recognition to save power and free CPU resources.

The GNA plugin provides a way to run inference on Intel® GNA, as well as in the software execution mode on CPU.

For more details on how to configure a machine to use GNA plugin, see the GNA configuration page.

Intel® GNA Generational Differences

The first (1.0) and second (2.0) versions of Intel® GNA found in 10th and 11th generation Intel® Core™ Processors may be considered functionally equivalent. Intel® GNA 2.0 provided performance improvement with respect to Intel® GNA 1.0. Starting with 12th Generation Intel® Core™ Processors (formerly codenamed Alder Lake), support for Intel® GNA 3.0 features is being added.

In this documentation, “GNA 2.0” refers to Intel® GNA hardware delivered on 10th and 11th generation Intel® Core™ processors, and the term “GNA 3.0” refers to GNA hardware delivered on 12th generation Intel® Core™ processors.

Intel® GNA Forward and Backward Compatibility

When a model is run, using the GNA plugin, it is compiled internally for the specific hardware target. It is possible to export a compiled model, using Import/Export functionality to use it later. In general, there is no guarantee that a model compiled and exported for GNA 2.0 runs on GNA 3.0 or vice versa.

Interoperability of compile target and hardware target

Hardware

Compile target 2.0

Compile target 3.0

GNA 2.0

Supported

Not supported (incompatible layers emulated on CPU)

GNA 3.0

Partially supported

Supported

Note

In most cases, a network compiled for GNA 2.0 runs as expected on GNA 3.0. However, the performance may be worse compared to when a network is compiled specifically for the latter. The exception is a network with convolutions with the number of filters greater than 8192 (see the Models and Operations Limitations section).

For optimal work with POT quantized models, which include 2D convolutions on GNA 3.0 hardware, the following requirements should be satisfied.

Choose a compile target with priority on: cross-platform execution, performance, memory, or power optimization.

Use the following properties to check interoperability in your application: ov::intel_gna::execution_target and ov::intel_gna::compile_target.

Speech C++ Sample can be used for experiments (see the -exec_target and -compile_target command line options).

Software Emulation Mode

Software emulation mode is used by default on platforms without GNA hardware support. Therefore, model runs even if there is no GNA HW within your platform. GNA plugin enables switching the execution between software emulation mode and hardware execution mode once the model has been loaded. For details, see a description of the ov::intel_gna::execution_mode property.

Recovery from Interruption by High-Priority Windows Audio Processes

GNA is designed for real-time workloads i.e., noise reduction. For such workloads, processing should be time constrained. Otherwise, extra delays may cause undesired effects such as audio glitches. The GNA driver provides a Quality of Service (QoS) mechanism to ensure that processing can satisfy real-time requirements. The mechanism interrupts requests that might cause high-priority Windows audio processes to miss the schedule. As a result, long running GNA tasks terminate early.

To prepare the applications correctly, use Automatic QoS Feature described below.

Automatic QoS Feature on Windows

Starting with the 2021.4.1 release of OpenVINO™ and the 03.00.00.1363 version of Windows GNA driver, a new execution mode of ov::intel_gna::ExecutionMode::HW_WITH_SW_FBACK has been available to ensure that workloads satisfy real-time execution. In this mode, the GNA driver automatically falls back on CPU for a particular infer request if the HW queue is not empty. Therefore, there is no need for explicitly switching between GNA and CPU.

#include <openvino/openvino.hpp>
#include <openvino/runtime/intel_gna/properties.hpp>
from openvino.runtime import Core
core = Core()
model = core.read_model(model=model_path)
compiled_model = core.compile_model(model, device_name="GNA",
    config={ 'GNA_DEVICE_MODE' : 'GNA_HW_WITH_SW_FBACK'})

Note

Due to the “first come - first served” nature of GNA driver and the QoS feature, this mode may lead to increased CPU consumption

if there are several clients using GNA simultaneously. Even a lightweight competing infer request, not cleared at the time when the user’s GNA client process makes its request, can cause the user’s request to be executed on CPU, unnecessarily increasing CPU utilization and power.

Supported Inference Data Types

Intel® GNA essentially operates in the low-precision mode which represents a mix of 8-bit (i8), 16-bit (i16), and 32-bit (i32) integer computations.

GNA plugin users are encouraged to use the Post-Training Optimization Tool to get a model with quantization hints based on statistics for the provided dataset.

Unlike other plugins supporting low-precision execution, the GNA plugin can calculate quantization factors at the model loading time. Therefore, a model can be run without calibration. However, this mode may not provide satisfactory accuracy because the internal quantization algorithm is based on heuristics, the efficiency of which depends on the model and dynamic range of input data. This mode is going to be deprecated soon.

GNA plugin supports the i16 and i8 quantized data types as inference precision of internal primitives.

Hello Query Device C++ Sample can be used to print out supported data types for all detected devices.

POT API Usage sample for GNA demonstrates how a model can be quantized for GNA, using POT API in two modes:

  • Accuracy (i16 weights)

  • Performance (i8 weights)

For POT quantized model, the ov::inference_precision property has no effect except cases described in Support for 2D Convolutions using POT.

Supported Features

The plugin supports the features listed below:

Models Caching

Due to import/export functionality support (see below), cache for GNA plugin may be enabled via common ov::cache_dir property of OpenVINO™.

For more details, see the Model caching overview.

Import/Export

The GNA plugin supports import/export capability, which helps decrease first inference time significantly. The model compile target is the same as the execution target by default. If there is no GNA HW in the system, the default value for the execution target corresponds to available hardware or latest hardware version, supported by the plugin (i.e., GNA 3.0).

To export a model for a specific version of GNA HW, use the ov::intel_gna::compile_target property and then export the model:

std::ofstream ofs(blob_path, std::ios_base::binary | std::ios::out);
compiled_model.export_model(ofs);
user_stream = compiled_model.export_model()
with open(blob_path, 'wb') as f:
    f.write(user_stream)

Import model:

std::ifstream ifs(blob_path, std::ios_base::binary | std::ios_base::in);
auto compiled_model = core.import_model(ifs, "GNA");
with open(blob_path, 'rb') as f:
    buf = BytesIO(f.read())
    compiled_model = core.import_model(buf, device_name="GNA")

To compile a model, use either compile Tool or Speech C++ Sample.

Stateful Models

GNA plugin natively supports stateful models. For more details on such models, refer to the Stateful models.

Note

The GNA is typically used in streaming scenarios when minimizing latency is important. Taking into account that POT does not support the TensorIterator operation, the recommendation is to use the --transform option of the Model Optimizer to apply LowLatency2 transformation when converting an original model.

Profiling

The GNA plugin allows turning on profiling, using the ov::enable_profiling property. With the following methods, you can collect profiling information with various performance data about execution on GNA:

ov::InferRequest::get_profiling_info

openvino.runtime.InferRequest.get_profiling_info

The current GNA implementation calculates counters for the whole utterance scoring and does not provide per-layer information. The API enables you to retrieve counter units in cycles. You can convert cycles to seconds as follows:

seconds = cycles / frequency

Refer to the table below to learn about the frequency of Intel® GNA inside a particular processor:

Frequency of Intel® GNA inside a particular processor

Processor

Frequency of Intel® GNA, MHz

Intel® Core™ processors

400

Intel® processors formerly codenamed Elkhart Lake

200

Intel® processors formerly codenamed Gemini Lake

200

Inference request performance counters provided for the time being:

  • The number of total cycles spent on scoring in hardware, including compute and memory stall cycles

  • The number of stall cycles spent in hardware

Limitations

Model and Operation Limitations

Due to the specification of hardware architecture, Intel® GNA supports a limited set of operations (including their kinds and combinations). For example, GNA Plugin should not be expected to run computer vision models because the plugin does not fully support 2D convolutions. The exception are the models specifically adapted for the GNA Plugin.

Limitations include:

  • Prior to GNA 3.0, only 1D convolutions are natively supported on the HW; 2D convolutions have specific limitations (see the table below).

  • The number of output channels for convolutions must be a multiple of 4.

  • The maximum number of filters is 65532 for GNA 2.0 and 8192 for GNA 3.0.

  • Transpose layer support is limited to the cases where no data reordering is needed or when reordering is happening for two dimensions, at least one of which is not greater than 8.

  • Splits and concatenations are supported for continuous portions of memory (e.g., split of 1,2,3,4 to 1,1,3,4 and 1,1,3,4 or concats of 1,2,3,4 and 1,2,3,5 to 2,2,3,4).

  • For Multiply, Add and Subtract layers, auto broadcasting is only supported for constant inputs.

Support for 2D Convolutions

The Intel® GNA 1.0 and 2.0 hardware natively supports only 1D convolutions. However, 2D convolutions can be mapped to 1D when a convolution kernel moves in a single direction.

Initially, a limited subset of Intel® GNA 3.0 features are added to the previous feature set including the following:

  • 2D VALID Convolution With Small 2D Kernels: Two-dimensional convolutions with the following kernel dimensions [H, W] are supported: [1,1], [2,2], [3,3], [2,1], [3,1], [4,1], [5,1], [6,1], [7,1], [1,2], or [1,3]. Input tensor dimensions are limited to [1,8,16,16] <= [N, C, H, W] <= [1,120,384,240]. Up to 384 C channels may be used with a subset of kernel sizes (see the table below). Up to 256 kernels (output channels) are supported. Pooling is limited to pool shapes of [1,1], [2,2], or [3,3]. Not all combinations of kernel shape and input tensor shape are supported (see the tables below for exact limitations).

The tables below show that the exact limitation on the input tensor width W depends on the number of input channels C (indicated as Ci below) and the kernel shape. There is much more freedom to choose the input tensor height and number of output channels.

The following tables provide a more explicit representation of the Intel(R) GNA 3.0 2D convolution operations initially supported. The limits depend strongly on number of input tensor channels (Ci) and the input tensor width (W). Other factors are kernel height (KH), kernel width (KW), pool height (PH), pool width (PW), horizontal pool step (SH), and vertical pool step (PW). For example, the first table shows that for a 3x3 kernel with max pooling, only square pools are supported, and W is limited to 87 when there are 64 input channels.

Table of Maximum Input Tensor Widths (W) vs. Rest of Parameters (Input and Kernel Precision: i16)

Table of Maximum Input Tensor Widths (W) vs. Rest of Parameters (Input and Kernel Precision: i8)

Note

The above limitations only apply to the new hardware 2D convolution operation. When possible, the Intel® GNA plugin graph compiler flattens 2D convolutions so that the second generation Intel® GNA 1D convolution operations (without these limitations) may be used. The plugin will also flatten 2D convolutions regardless of the sizes if GNA 2.0 compilation target is selected (see below).

Support for 2D Convolutions using POT

For POT to successfully work with the models including GNA3.0 2D convolutions, the following requirements must be met:

  • All convolution parameters are natively supported by HW (see tables above).

  • The runtime precision is explicitly set by the ov::inference_precision property as i8 for the models produced by the performance mode of POT, and as i16 for the models produced by the accuracy mode of POT.

Batch Size Limitation

Intel® GNA plugin supports the processing of context-windowed speech frames in batches of 1-8 frames.

Refer to the Layout API overview to determine batch dimension.

To set layout of model inputs in runtime, use the Optimize Preprocessing guide:

#include <openvino/openvino.hpp>
ov::preprocess::PrePostProcessor ppp(model);
for (const auto& input : model->inputs()) {
    auto& in = ppp.input(input.get_any_name());
    in.model().set_layout(ov::Layout("N?"));
}
model = ppp.build();
from openvino.runtime import Core, set_batch
from openvino.preprocess import PrePostProcessor
ppp = PrePostProcessor(model)
for i in range(len(model.inputs)):
    input_name = model.input(i).get_any_name()
    ppp.input(i).model().set_layout("N?")
model = ppp.build()

then set batch size:

ov::set_batch(model, batch_size);
set_batch(model, batch_size)

Increasing batch size only improves efficiency of MatMul layers.

Note

For models with Convolution, LSTMCell, GRUCell, or ReadValue / Assign operations, the only supported batch size is 1.

Compatibility with Heterogeneous mode

Heterogeneous execution is currently not supported by GNA plugin.