GNA Device¶
The Intel® Gaussian & Neural Accelerator (GNA) is a low-power neural coprocessor for continuous inference at the edge.
Intel® GNA is not intended to replace typical inference devices such as the CPU or GPU. It is designed to offload continuous inference workloads, such as noise reduction or speech recognition, to save power and free CPU resources. Inference can run either on the Intel® GNA hardware or on the CPU in software execution mode. For more details on how to configure a system to use GNA, see the GNA configuration page.
Note
Intel’s GNA is being discontinued and Intel® Core™ Ultra (formerly known as Meteor Lake) will be the last generation of hardware to include it. For this reason, the GNA plugin will soon be discontinued as well. Consider Intel’s new Neural Processing Unit (NPU) as a low-power solution for offloading neural network computation on processors that offer the technology.
Intel® GNA Generational Differences¶
The first (1.0) and second (2.0) versions of Intel® GNA, found in 10th and 11th generation Intel® Core™ processors, may be considered functionally equivalent; Intel® GNA 2.0 provided a performance improvement over Intel® GNA 1.0.
Intel CPU generation | GNA HW Version
---|---
10th, 11th | GNA 2.0
12th, 13th | GNA 3.0
14th | GNA 3.5
In this documentation, “GNA 2.0” refers to the Intel® GNA hardware delivered with 10th and 11th generation Intel® Core™ processors, “GNA 3.0” to the hardware delivered with 12th and 13th generation processors, and “GNA 3.5” to the hardware delivered with 14th generation processors.
Intel® GNA Forward and Backward Compatibility¶
When a model is run using the GNA plugin, it is compiled internally for the specific hardware target. The compiled model can be exported with the Import/Export functionality and reused later. In general, there is no guarantee that a model compiled and exported for GNA 2.0 runs on GNA 3.0, or vice versa.
Hardware | Compile target 2.0 | Compile target 3.0 | Compile target 3.5
---|---|---|---
GNA 2.0 | Supported | Not supported (incompatible layers emulated on CPU) | Not supported (incompatible layers emulated on CPU)
GNA 3.0 | Partially supported | Supported | Not supported (incompatible layers emulated on CPU)
GNA 3.5 | Partially supported | Partially supported | Supported
Note
In most cases, a network compiled for GNA 2.0 runs as expected on GNA 3.0, although performance may be worse than when the network is compiled specifically for the latter. The exception is a network with convolutions whose number of filters exceeds 8192 (see the Model and Operation Limitations section).
For optimal work with POT-quantized models that include 2D convolutions on GNA 3.0/3.5 hardware, the requirements listed in the Support for 2D Convolutions using POT section should be satisfied.
Choose a compile target with priority on: cross-platform execution, performance, memory, or power optimization.
To check interoperability in your application, use the ov::intel_gna::execution_target and ov::intel_gna::compile_target properties (a sketch follows below).
The Speech C++ Sample can be used for experiments (see the -exec_target and -compile_target command line options).
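Below is a minimal Python sketch of setting and reading these targets. The string keys ("GNA_EXEC_TARGET", "GNA_COMPILE_TARGET") and values ("GNA_TARGET_2_0", "GNA_TARGET_3_0") are assumed to match the plugin configuration strings, and model_path is a placeholder; treat this as an illustration rather than a definitive reference.
import openvino as ov

core = ov.Core()
model = core.read_model(model_path)  # model_path is a placeholder

# Compile for GNA 2.0 (so the exported blob stays usable across platforms)
# while executing on GNA 3.0 hardware. Key/value strings are assumptions.
compiled_model = core.compile_model(
    model,
    device_name="GNA",
    config={
        "GNA_EXEC_TARGET": "GNA_TARGET_3_0",
        "GNA_COMPILE_TARGET": "GNA_TARGET_2_0",
    },
)

# Read the properties back to verify what the plugin actually selected.
print(compiled_model.get_property("GNA_EXEC_TARGET"))
print(compiled_model.get_property("GNA_COMPILE_TARGET"))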
Software Emulation Mode¶
Software emulation mode is used by default on platforms without GNA hardware support, so the model runs even if there is no GNA HW on your platform. The GNA plugin also enables switching between software emulation mode and hardware execution mode after the model has been loaded. For details, see the description of the ov::intel_gna::execution_mode property.
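As an illustration, here is a hedged Python sketch of switching a loaded model between hardware execution and bit-exact software emulation. The string key "GNA_DEVICE_MODE" and the values "GNA_SW_EXACT"/"GNA_HW" are assumed to match the plugin configuration strings, and model_path is a placeholder.
import openvino as ov

core = ov.Core()
model = core.read_model(model_path)               # model_path is a placeholder
compiled_model = core.compile_model(model, "GNA")

# Switch the already-loaded model to bit-exact software emulation on the CPU ...
compiled_model.set_property({"GNA_DEVICE_MODE": "GNA_SW_EXACT"})

# ... and back to hardware execution.
compiled_model.set_property({"GNA_DEVICE_MODE": "GNA_HW"})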
Recovery from Interruption by High-Priority Windows Audio Processes¶
GNA is designed for real-time workloads such as noise reduction. For such workloads, processing should be time-constrained; otherwise, extra delays may cause undesired effects such as audio glitches. The GNA driver provides a Quality of Service (QoS) mechanism to ensure that processing can satisfy real-time requirements. The mechanism interrupts requests that might cause high-priority Windows audio processes to miss their schedule. As a result, long-running GNA tasks terminate early.
To prepare your application for this behavior, use the Automatic QoS feature described below.
Automatic QoS Feature on Windows¶
Starting with the 2021.4.1 release of OpenVINO™ and the 03.00.00.1363 version of the Windows GNA driver, the ov::intel_gna::ExecutionMode::HW_WITH_SW_FBACK execution mode is available to ensure that workloads satisfy real-time execution requirements. In this mode, the GNA driver automatically falls back on the CPU for a particular infer request if the HW queue is not empty, so there is no need to switch between GNA and CPU explicitly.
import openvino as ov

core = ov.Core()
model = core.read_model(model_path)  # model_path is assumed to be defined elsewhere

# Enable hardware execution with automatic software fallback (QoS).
compiled_model = core.compile_model(
    model, device_name="GNA", config={"GNA_DEVICE_MODE": "GNA_HW_WITH_SW_FBACK"}
)
#include <openvino/openvino.hpp>
#include <openvino/runtime/intel_gna/properties.hpp>
ov::Core core;
auto model = core.read_model(model_path);
auto compiled_model = core.compile_model(model, "GNA",
ov::intel_gna::execution_mode(ov::intel_gna::ExecutionMode::HW_WITH_SW_FBACK));
Note
Due to the “first come, first served” nature of the GNA driver and the QoS feature, this mode may lead to increased CPU consumption if several clients use GNA simultaneously. Even a lightweight competing infer request, not cleared by the time the user’s GNA client process makes its request, can cause the user’s request to be executed on the CPU, unnecessarily increasing CPU utilization and power.
Supported Inference Data Types¶
Intel® GNA essentially operates in a low-precision mode that represents a mix of 8-bit (i8), 16-bit (i16), and 32-bit (i32) integer computations. Unlike other OpenVINO devices supporting low-precision execution, it can calculate quantization factors at model loading time, so a model can be run without calibration. However, this mode may not provide satisfactory accuracy because the internal quantization algorithm is based on heuristics whose efficiency depends on the model and the dynamic range of input data. This mode is going to be deprecated soon. GNA supports the i16 and i8 quantized data types as inference precision of internal primitives.
GNA users are encouraged to use the Post-Training Optimization Tool (POT) to get a model with quantization hints based on statistics for the provided dataset.
Hello Query Device C++ Sample can be used to print out supported data types for all detected devices.
The POT API Usage sample for GNA demonstrates how a model can be quantized for GNA using the POT API in two modes:
Accuracy (i16 weights)
Performance (i8 weights)
For POT-quantized models, the ov::hint::inference_precision property has no effect, except in the cases described in the Model and Operation Limitations section.
Supported Features¶
Model Caching¶
Thanks to the import/export functionality support (see below), the cache for the GNA plugin may be enabled via the common ov::cache_dir property of OpenVINO™. For more details, see the Model caching overview.
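For illustration, a minimal Python sketch of enabling the cache through the string form of ov::cache_dir ("CACHE_DIR"); the cache directory name and model_path are placeholders.
import openvino as ov

core = ov.Core()
core.set_property({"CACHE_DIR": "gna_cache"})   # placeholder cache directory
model = core.read_model(model_path)             # model_path is a placeholder

# The first compilation populates the cache; later runs import the cached blob.
compiled_model = core.compile_model(model, "GNA")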
Import/Export¶
The GNA plugin supports import/export capability, which helps decrease first inference time significantly. By default, the model compile target is the same as the execution target, and the execution target corresponds to the available hardware; if there is no GNA HW in the system, it defaults to the latest hardware version supported by the plugin (i.e., GNA 3.0).
To export a model for a specific version of GNA HW, use the ov::intel_gna::compile_target property and then export the model:
user_stream = compiled_model.export_model()

with open(blob_path, "wb") as f:
    f.write(user_stream)
std::ofstream ofs(blob_path, std::ios_base::binary | std::ios::out);
compiled_model.export_model(ofs);
Import model:
from io import BytesIO

with open(blob_path, "rb") as f:
    buf = BytesIO(f.read())

compiled_model = core.import_model(buf, device_name="GNA")
std::ifstream ifs(blob_path, std::ios_base::binary | std::ios_base::in);
auto compiled_model = core.import_model(ifs, "GNA");
To compile a model, use either the Compile Tool or the Speech C++ Sample.
Stateful Models¶
The GNA plugin natively supports stateful models. For more details on such models, refer to the Stateful models documentation.
Note
GNA is typically used in streaming scenarios where minimizing latency is important. Since POT does not support the TensorIterator operation, the recommendation is to use the transform option of the model conversion API to apply the LowLatency2 transformation when converting the original model.
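A hedged sketch of what that conversion step could look like with the Python model conversion API, assuming the openvino.tools.mo.convert_model entry point and its transform parameter; the model file name is a placeholder.
import openvino as ov
from openvino.tools.mo import convert_model

# Apply the LowLatency2 transformation while converting the original model.
# "model.onnx" is a placeholder; transform takes the transformation name.
ov_model = convert_model("model.onnx", transform="LowLatency2")

core = ov.Core()
compiled_model = core.compile_model(ov_model, "GNA")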
Profiling¶
The GNA plugin allows turning on profiling using the ov::enable_profiling property.
With the following methods, you can collect profiling information with various performance data about execution on GNA:
openvino.InferRequest.get_profiling_info
ov::InferRequest::get_profiling_info
The current GNA implementation calculates counters for the whole utterance scoring and does not provide per-layer information. The API enables you to retrieve counter units in cycles. You can convert cycles to seconds as follows:
seconds = cycles / frequency
Refer to the table below for the frequency of Intel® GNA inside particular processors:
Processor | Frequency of Intel® GNA, MHz
---|---
Intel® Core™ processors | 400
Intel® processors formerly codenamed Elkhart Lake | 200
Intel® processors formerly codenamed Gemini Lake | 200
The inference request performance counters currently provided are:
The number of total cycles spent on scoring in hardware, including compute and memory stall cycles
The number of stall cycles spent in hardware
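To illustrate the conversion, here is a hedged Python sketch. It assumes the string form of ov::enable_profiling ("PERF_COUNT"), the 400 MHz GNA frequency of Intel® Core™ processors (see the table above), an example cycle count, and a placeholder model_path; the exact way raw cycle counters are surfaced may differ between API versions.
import openvino as ov

core = ov.Core()
model = core.read_model(model_path)                    # model_path is a placeholder
# ov::enable_profiling maps to the "PERF_COUNT" configuration key.
compiled_model = core.compile_model(model, "GNA", config={"PERF_COUNT": "YES"})

infer_request = compiled_model.create_infer_request()
infer_request.infer()
counters = infer_request.get_profiling_info()          # whole-utterance counters for GNA

# Converting a raw cycle count to seconds on a 400 MHz GNA:
gna_frequency_hz = 400e6
total_cycles = 4_000_000                               # example counter value
print(total_cycles / gna_frequency_hz)                 # 0.01 s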
Supported Properties¶
Read-write Properties¶
In order to take effect, the following parameters must be set before model compilation or passed as additional arguments to ov::Core::compile_model():
ov::cache_dir
ov::enable_profiling
ov::hint::inference_precision
ov::hint::num_requests
ov::intel_gna::compile_target
ov::intel_gna::firmware_model_image_path
ov::intel_gna::execution_target
ov::intel_gna::pwl_design_algorithm
ov::intel_gna::pwl_max_error_percent
ov::intel_gna::scale_factors_per_input
These parameters can be changed after model compilation using ov::CompiledModel::set_property (see the sketch after this list):
ov::hint::performance_mode
ov::intel_gna::execution_mode
ov::log::level
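A hedged Python sketch combining the two groups above; the string keys ("PERFORMANCE_HINT", "PERF_COUNT", "LOG_LEVEL") and their values are assumed counterparts of the listed properties, and model_path is a placeholder.
import openvino as ov

core = ov.Core()
model = core.read_model(model_path)        # model_path is a placeholder

# Properties from the first group must be supplied at compile time.
compiled_model = core.compile_model(
    model,
    "GNA",
    config={"PERFORMANCE_HINT": "LATENCY", "PERF_COUNT": "YES"},
)

# Properties from the second group can still be changed on the compiled model.
compiled_model.set_property({"LOG_LEVEL": "LOG_WARNING"})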
Read-only Properties¶
ov::available_devices
ov::device::capabilities
ov::device::full_name
ov::intel_gna::library_full_version
ov::optimal_number_of_infer_requests
ov::range_for_async_infer_requests
ov::supported_properties
Limitations¶
Model and Operation Limitations¶
Due to the specifics of the hardware architecture, Intel® GNA supports a limited set of operations (including their kinds and combinations). For example, the GNA plugin should not be expected to run computer vision models because it does not fully support 2D convolutions. The exceptions are models specifically adapted for the GNA plugin.
Limitations include:
Prior to GNA 3.0, only 1D convolutions are natively supported on the HW; 2D convolutions have specific limitations (see the table below).
The number of output channels for convolutions must be a multiple of 4.
The maximum number of filters is 65532 for GNA 2.0 and 8192 for GNA 3.0.
Starting with Intel® GNA 3.5, support for int8 convolution weights has been added. Int8 weights can be used in models quantized by POT.
Transpose layer support is limited to the cases where no data reordering is needed or when reordering is happening for two dimensions, at least one of which is not greater than 8.
Splits and concatenations are supported for continuous portions of memory (e.g., split of 1,2,3,4 to 1,1,3,4 and 1,1,3,4 or concats of 1,2,3,4 and 1,2,3,5 to 2,2,3,4).
For Multiply, Add and Subtract layers, auto broadcasting is only supported for constant inputs.
Support for 2D Convolutions up to GNA 3.0¶
The Intel® GNA 1.0 and 2.0 hardware natively supports only 1D convolutions. However, 2D convolutions can be mapped to 1D when a convolution kernel moves in a single direction. Initially, a limited subset of Intel® GNA 3.0 features was added to the previous feature set, including:
2D VALID convolution with small 2D kernels: two-dimensional convolutions with the following kernel dimensions [H, W] are supported: [1,1], [2,2], [3,3], [2,1], [3,1], [4,1], [5,1], [6,1], [7,1], [1,2], or [1,3]. Input tensor dimensions are limited to [1,8,16,16] <= [N, C, H, W] <= [1,120,384,240]. Up to 384 C channels may be used with a subset of kernel sizes (see the table below). Up to 256 kernels (output channels) are supported. Pooling is limited to pool shapes of [1,1], [2,2], or [3,3]. Not all combinations of kernel shape and input tensor shape are supported (see the tables below for exact limitations).
The tables below show that the exact limitation on the input tensor width W depends on the number of input channels C (indicated as Ci below) and the kernel shape. There is much more freedom to choose the input tensor height and number of output channels.
The following tables provide a more explicit representation of the Intel® GNA 3.0 2D convolution operations initially supported. The limits depend strongly on the number of input tensor channels (Ci) and the input tensor width (W). Other factors are kernel height (KH), kernel width (KW), pool height (PH), pool width (PW), horizontal pool step (SH), and vertical pool step (SV). For example, the first table shows that for a 3x3 kernel with max pooling, only square pools are supported, and W is limited to 87 when there are 64 input channels.
Table of Maximum Input Tensor Widths (W) vs. Rest of Parameters (Input and Kernel Precision: i16)
Table of Maximum Input Tensor Widths (W) vs. Rest of Parameters (Input and Kernel Precision: i8)
Note
The above limitations only apply to the new hardware 2D convolution operation. For GNA 3.0, when possible, the Intel® GNA plugin graph compiler flattens 2D convolutions so that the second generation Intel® GNA 1D convolution operations (without these limitations) may be used. The plugin will also flatten 2D convolutions regardless of the sizes if GNA 2.0 compilation target is selected (see below).
Support for Convolutions since GNA 3.5¶
Starting from Intel® GNA 3.5, 1D convolutions are handled in a different way than in GNA 3.0. Convolutions have the following limitations:
Limitation | Convolution 1D | Convolution 2D
---|---|---
Input height | 1 | 1-65535
Input width | 1-65535 | 1-65535
Input channel number | 1 | 1-1024
Kernel number | 1-8192 | 1-8192
Kernel height | 1 | 1-255
Kernel width | 1-2048 | 1-256
Stride height | 1 | 1-255
Stride width | 1-2048 | 1-256
Dilation height | 1 | 1
Dilation width | 1 | 1
Pooling window height | 1 | 1-255
Pooling window width | 1-255 | 1-255
Pooling stride height | 1 | 1-255
Pooling stride width | 1-255 | 1-255
The GNA 3.5 limitations above refer to each dimension separately; the full range of parameters is not always supported in combination. For example, although a 2D convolution kernel can have height 255 and width 256, a kernel of shape 255x256 may not work.
Support for 2D Convolutions using POT¶
For POT to successfully work with models that include GNA 3.0 2D convolutions, the following requirements must be met:
All convolution parameters are natively supported by the HW (see the tables above).
The runtime precision is explicitly set by the ov::hint::inference_precision property: i8 for models produced by the performance mode of POT, and i16 for models produced by the accuracy mode of POT (see the sketch below).
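A hedged Python sketch of setting that precision explicitly, assuming the openvino.properties.hint module and ov.Type values as the Python counterparts of ov::hint::inference_precision; model_path is a placeholder.
import openvino as ov
import openvino.properties.hint as hints

core = ov.Core()
model = core.read_model(model_path)    # a POT-quantized model; model_path is a placeholder

# i16 for accuracy-mode POT models, i8 for performance-mode ones.
compiled_model = core.compile_model(model, "GNA", {hints.inference_precision: ov.Type.i16})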
Batch Size Limitation¶
The Intel® GNA plugin supports processing of context-windowed speech frames in batches of 1-8 frames. Refer to the Layout API overview to determine the batch dimension. To set the layout of model inputs at runtime, follow the Optimize Preprocessing guide:
import openvino as ov

ppp = ov.preprocess.PrePostProcessor(model)
for i in range(len(model.inputs)):
    input_name = model.input(i).get_any_name()
    # Mark the first dimension as batch ("N"); the remaining dimensions stay unknown ("?").
    ppp.input(input_name).model().set_layout(ov.Layout("N?"))
model = ppp.build()
#include <openvino/openvino.hpp>

ov::preprocess::PrePostProcessor ppp(model);
for (const auto& input : model->inputs()) {
    auto& in = ppp.input(input.get_any_name());
    in.model().set_layout(ov::Layout("N?"));
}
model = ppp.build();
Then set the batch size:
ov.set_batch(model, batch_size)
ov::set_batch(model, batch_size);
Increasing the batch size only improves efficiency of MatMul layers.
Note
For models with Convolution, LSTMCell, GRUCell, or ReadValue/Assign operations, the only supported batch size is 1.
Compatibility with Heterogeneous mode¶
Heterogeneous execution is currently not supported by the GNA plugin.