The Inference Engine workflow involves the creation of custom kernels and either custom or existing layers.
A Layer is a convolutional neural network (CNN) building block implemented in the training framework, for example, Convolution
in Caffe*. A Kernel is defined as the corresponding implementation in Inference Engine.
Please refer to the Custom Layers in the Model Optimizer section for the details of how a mapping between framework layers and Inference Engine kernels is registered.
In short, you can plug your own kernel implementations into the Inference Engine and map them to the layers in the original framework.
The rest of the section covers custom kernels and how to integrate them into the Inference Engine.
Every sample uses the Inference Engine API to load custom kernels depending on the device type. Specifically, for the CPU, it is a shared library that exports a certain interface for registering the kernels. For GPU or MYRIAD, it is an .xml
file that lists the kernels along with the parameters that the kernels accept and how these map to specific Intermediate Representation (IR) values.
You can find examples of CPU-targeted kernels in the <INSTALL_DIR>/deployment_tools/inference_engine/src/extension
directory. You can also use the global GPU kernels delivered with the OpenVINO toolkit as an example.
Several GPU-targeted kernels are also added to the binaries when the samples are compiled, so that the sample applications can easily load them. Refer to the cldnn_global_custom_kernels
folder in the GPU plugin installation directory.
The GPU codepath abstracts many details of OpenCL™. You need to provide the kernel code in OpenCL C and a configuration file that connects the kernel and its parameters to the parameters of the layer.
There are two options for using a custom layer configuration file:

- Include a section with your kernels into the global cldnn_global_custom_kernels/cldnn_global_custom_kernels.xml file (hosted in the <INSTALL_DIR>/deployment_tools/inference_engine/bin/intel64/{Debug/Release} folder), which is loaded automatically.
- Call the InferenceEngine::Core::SetConfig() method from your application with the InferenceEngine::PluginConfigParams::KEY_CONFIG_FILE key and the configuration file name as a value before loading the network that uses custom layers to the plugin, as in the sketch below.

All Inference Engine samples (except the trivial hello_classification) feature a dedicated command-line option -c for loading custom kernels, so, for example, the classification sample accepts the path to your configuration file through -c.
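A minimal sketch of the second option follows; the configuration file name custom_layers.xml and the model path are placeholders for your own files:

```cpp
#include <inference_engine.hpp>

void loadNetworkWithCustomGpuKernels() {
    InferenceEngine::Core core;
    // Register the custom kernel configuration with the GPU plugin
    // before the network is loaded.
    core.SetConfig({{InferenceEngine::PluginConfigParams::KEY_CONFIG_FILE,
                     "custom_layers.xml"}}, "GPU");
    auto network    = core.ReadNetwork("model.xml");     // placeholder model path
    auto executable = core.LoadNetwork(network, "GPU");  // custom kernels are now available
}
```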
The configuration file is expected to follow the .xml
file structure with a node of type CustomLayer
for every custom layer you provide.
The following definitions use these notations:

- (0/1) — Zero or one instance of this node or attribute is allowed
- (1) — Exactly one instance of this node or attribute is required
- (0+) — Any number of instances of this node or attribute is allowed
- (1+) — One or more instances of this node or attribute are required
CustomLayer Node and Sub-node Structure
The CustomLayer node contains the entire configuration for a single custom layer.
Attribute Name | # | Description |
---|---|---|
name | (1) | The name of the layer type to be used. This name should be identical to the type used in the IR. |
type | (1) | Must be SimpleGPU |
version | (1) | Must be 1 |

Sub-nodes: Kernel (1), Buffers (1), CompilerOptions (0+), WorkSizes (0/1)
Kernel Node and Sub-node Structure
The Kernel node contains all kernel source code configuration. It has no attributes of its own.

Sub-nodes: Source (1+), Define (0+)
Source Node and Sub-node Structure
The Source node points to a single OpenCL source file.

Attribute Name | # | Description |
---|---|---|
filename | (1) | Name of the file containing OpenCL source code. Notice that the path is relative to your executable. Multiple source nodes will have their sources concatenated in order. |

Sub-nodes: None
Define Node and Sub-node Structure
The Define node configures a single #define instruction to be added to the sources during compilation (JIT).

Attribute Name | # | Description |
---|---|---|
name | (1) | The name of the defined JIT. For static constants, this can include the value as well (taken as a string). |
param | (0/1) | This parameter value will be used as the value of this JIT definition. |
type | (0/1) | The parameter type. Accepted values: int, float, and int[], float[] for arrays. |
default | (0/1) | The default value to be used if the specified parameter is missing from the layer in the IR. |

Sub-nodes: None

The resulting JIT will be of the form: #define [name] [type] [value/default].
Buffers Node and Sub-node Structure
The Buffers node configures all input/output buffers for the OpenCL entry function. It has no attributes of its own.

Sub-nodes: Data (0+), Tensor (1+)
Data Node and Sub-node Structure
The Data node configures a single input with static data (for example, weights or biases).

Attribute Name | # | Description |
---|---|---|
name | (1) | Name of a blob attached to a layer in the IR |
arg-index | (1) | 0-based index in the entry function arguments to be bound to |

Sub-nodes: None
Tensor Node and Sub-node Structure
The Tensor node configures a single input or output tensor.

Attribute Name | # | Description |
---|---|---|
arg-index | (1) | 0-based index in the entry function arguments to be bound to |
type | (1) | input or output |
port-index | (1) | 0-based index in the layer's input/output ports in the IR |
format | (0/1) | Data layout declaration for the tensor. Accepted values: BFYX, BYXF, YXFB, FYXB (also in all lowercase). Default value: BFYX |
CompilerOptions Node and Sub-node Structure
The CompilerOptions node configures the compilation flags for the OpenCL sources.

Attribute Name | # | Description |
---|---|---|
options | (1) | Options string to be passed to the OpenCL compiler |

Sub-nodes: None
WorkSizes Node and Sub-node Structure
The WorkSizes node configures the global/local work sizes to be used when queuing the OpenCL program for execution.

Attribute Name | # | Description |
---|---|---|
global, local | (0/1), (0/1) | An array of up to 3 integers (or formulas) defining the OpenCL work sizes to be used during execution. The formulas can use the values of the B, F, Y, X dimensions and the operators +, -, /, *, % (all evaluated in integer arithmetic). Default values: global="B*F*Y*X", local="" |
dim | (0/1) | A tensor to take the work size from. Accepted values: input N, output, where N is an index of an input tensor starting with 0. Default value: output |

Sub-nodes: None
The following code sample provides an example configuration file (in .xml
format). For information on configuration file structure, see Configuration File Format.
The following table includes definitions that are attached before the user sources, where <TENSOR> stands for an actual input or output name (for example, INPUT0 or OUTPUT0). For an example, see Example Kernel.
Name | Value |
---|---|
NUM_INPUTS | Number of the input tensors bound to this kernel |
GLOBAL_WORKSIZE | An array of global work sizes used to execute this kernel |
GLOBAL_WORKSIZE_SIZE | The size of the GLOBAL_WORKSIZE array |
LOCAL_WORKSIZE | An array of local work sizes used to execute this kernel |
LOCAL_WORKSIZE_SIZE | The size of the LOCAL_WORKSIZE array |
<TENSOR>_DIMS | An array of the tensor dimension sizes. Always ordered as BFYX |
<TENSOR>_DIMS_SIZE | The size of the <TENSOR>_DIMS array |
<TENSOR>_TYPE | The data type of the tensor: float, half, or char |
<TENSOR>_FORMAT_ | The format of the tensor: BFYX, BYXF, YXFB, FYXB, or ANY. The format is concatenated to the defined name. You can use the tensor format to define codepaths in your code with #ifdef/#endif. |
<TENSOR>_LOWER_PADDING | An array of padding elements used for the tensor dimensions before they start. Always ordered as BFYX |
<TENSOR>_LOWER_PADDING_SIZE | The size of the <TENSOR>_LOWER_PADDING array |
<TENSOR>_UPPER_PADDING | An array of padding elements used for the tensor dimensions after they end. Always ordered as BFYX |
<TENSOR>_UPPER_PADDING_SIZE | The size of the <TENSOR>_UPPER_PADDING array |
<TENSOR>_PITCHES | The number of elements between adjacent elements in each dimension. Always ordered as BFYX |
<TENSOR>_PITCHES_SIZE | The size of the <TENSOR>_PITCHES array |
<TENSOR>_OFFSET | The number of elements from the start of the tensor to the first valid element (bypassing the lower padding) |
All <TENSOR> values are automatically defined for every tensor bound to this layer (INPUT0, INPUT1, OUTPUT0, and so on), as shown in the following example:
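The concrete values below are hypothetical and depend on the layer in your IR; they only illustrate the shape of the generated definitions for a single input tensor INPUT0:

```c
// Hypothetical values for illustration only -- the Inference Engine generates
// these definitions from the actual layer in your IR.
#define NUM_INPUTS 1
#define INPUT0_TYPE float
#define INPUT0_FORMAT_BFYX
#define INPUT0_DIMS_SIZE 4
#define INPUT0_DIMS (int []){ 1, 96, 55, 55 }
#define INPUT0_PITCHES_SIZE 4
#define INPUT0_PITCHES (int []){ 290400, 3025, 55, 1 }
#define INPUT0_OFFSET 0
```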
NOTE: As described in the previous section, definitions such as INPUT0_TYPE are actually passed to the OpenCL (pre-)compiler as inputs by the Inference Engine for efficiency reasons. See Debugging Tips below for information on debugging the results.
Debugging Tips

- Dump the resulting kernels. When the Inference Engine compiles the kernels for a specific network, it can also output the resulting code for the custom kernels. In the directory of your executable, you can then find files such as clDNN_program0.cl, clDNN_program1.cl. There are as many files as there are distinct sets of parameters for your custom kernel (different input tensor sizes and kernel parameters).
- Use printf in your kernels to debug specific values. However, be careful: do not output excessively, as it generates too much data. The printf output buffer is limited in size, so your output can be truncated to fit it. Also, because of buffering, you get an entire buffer of output only when the execution ends.

Since the primary vehicle for the performance of the CPU codepath in the Inference Engine is the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN), new CPU kernels extend the Inference Engine plugin for Intel MKL-DNN. Implementing the InferenceEngine::ILayerImplFactory
defines a general CPU-side extension. There are no Intel MKL-DNN specifics in the way you need to implement a kernel.
All Inference Engine samples (except the trivial hello_classification) feature a dedicated command-line option -l for loading custom CPU kernels. When running the Classification Sample with custom CPU kernels, pass the path to your extension shared library through the -l option.
Consider a simple CustomLayerFactory class that registers an example kernel that multiplies its input data by two but does not change the dimensions (a condensed sketch of the corresponding classes follows this list):

1. Create your custom layer factory CustomLayerFactory class and inherit it from the abstract class InferenceEngine::ILayerImplFactory.
2. Overload and implement the abstract methods getShapes and getImplementations of the InferenceEngine::ILayerImplFactory class.
3. Create your custom layer implementation CustomLayerImpl class. Because the layer uses the execute method to change data, inherit it from the abstract class InferenceEngine::ILayerExecImpl, then overload and implement the abstract methods of this class:
   - Implement the getSupportedConfigurations virtual method, which returns all supported configuration formats (input/output tensor layouts) for your implementation. To specify data formats, use InferenceEngine::TensorDesc. Refer to the Memory Primitives section for instructions on how to do it.
   - Implement the init method to get a runtime-selected configuration from the vector populated in the previous step and to check the parameters.
   - Implement the execute method, which accepts and processes the actual tensors as input/output blobs.
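The code listings from the original walkthrough are not reproduced here. The following condensed sketch illustrates what such a factory and implementation might look like, assuming the pre-2020 extensibility headers (ie_iextension.h, ie_layers.h) and the method signatures named above; treat it as an illustration of the approach rather than a drop-in implementation.

```cpp
#include <ie_iextension.h>
#include <ie_layers.h>
#include <memory>
#include <vector>

using namespace InferenceEngine;

// The kernel itself: multiplies the input by two, keeping the input dimensions.
class CustomLayerImpl : public ILayerExecImpl {
public:
    explicit CustomLayerImpl(const CNNLayer* layer) : cnnLayer(*layer) {}

    StatusCode getSupportedConfigurations(std::vector<LayerConfig>& conf,
                                          ResponseDesc* resp) noexcept override {
        // Report a single supported configuration: the same TensorDesc on input and output.
        LayerConfig config;
        DataConfig inData, outData;
        inData.desc  = cnnLayer.insData[0].lock()->getTensorDesc();
        outData.desc = cnnLayer.outData[0]->getTensorDesc();
        config.inConfs.push_back(inData);
        config.outConfs.push_back(outData);
        conf.push_back(config);
        return OK;
    }

    StatusCode init(LayerConfig& config, ResponseDesc* resp) noexcept override {
        return OK;  // validate the runtime-selected configuration here
    }

    StatusCode execute(std::vector<Blob::Ptr>& inputs, std::vector<Blob::Ptr>& outputs,
                       ResponseDesc* resp) noexcept override {
        const float* src = inputs[0]->cbuffer().as<const float*>();
        float* dst = outputs[0]->buffer().as<float*>();
        for (size_t i = 0; i < inputs[0]->size(); ++i) {
            dst[i] = src[i] * 2.0f;  // the "multiply by two" example kernel
        }
        return OK;
    }

private:
    CNNLayer cnnLayer;  // a copy of the layer description from the IR
};

// The factory that reports output shapes and hands out implementations.
class CustomLayerFactory : public ILayerImplFactory {
public:
    explicit CustomLayerFactory(const CNNLayer* layer) : cnnLayer(*layer) {}

    StatusCode getShapes(const std::vector<TensorDesc>& inShapes,
                         std::vector<TensorDesc>& outShapes,
                         ResponseDesc* resp) noexcept override {
        outShapes = inShapes;  // the layer does not change the dimensions
        return OK;
    }

    StatusCode getImplementations(std::vector<ILayerImpl::Ptr>& impls,
                                  ResponseDesc* resp) noexcept override {
        impls.push_back(std::make_shared<CustomLayerImpl>(&cnnLayer));
        return OK;
    }

private:
    CNNLayer cnnLayer;
};
```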
Next, create a factory for your own primitives, inherited from the abstract class InferenceEngine::IExtension, which defines the functions that you need to implement, including the utility methods Unload, Release, SetLogCallback, and GetVersion. Finally, use the AddExtension method of the general plugin interface to load your primitives before loading the network.

NOTE: OpenCL™ custom layer support is available in preview mode.
NOTE: This section assumes you are familiar with developing kernels using OpenCL™.
To customize your topology with an OpenCL™ layer, follow the steps below:

1. Write and compile your OpenCL code with the standalone offline OpenCL compiler (clc).
2. Write a configuration file that binds the kernel to the topology file (.xml) of the model IR.

NOTE: The OpenCL compiler, targeting Intel® Neural Compute Stick 2 for the SHAVE* processor only, is redistributed with OpenVINO.
OpenCL support is provided by ComputeAorta*, and is distributed under a license agreement between Intel and Codeplay Software Ltd.
Since the OpenCL™ toolchain for Intel® Neural Compute Stick 2 supports offline compilation only, you should first compile your OpenCL C code using the standalone clc compiler. You can find the compiler binary at <INSTALL_DIR>/deployment_tools/tools/cl_compiler.

NOTE: By design, custom OpenCL layers support any OpenCL kernels written against the OpenCL C 1.2 standard. The half float extension is also supported and is the type to optimize for, since it is a native type for the Movidius VPU.

Before compiling, set the following environment variables:

- SHAVE_MA2X8XLIBS_DIR=<INSTALL_DIR>/deployment_tools/tools/cl_compiler/lib/
- SHAVE_LDSCRIPT_DIR=<INSTALL_DIR>/deployment_tools/tools/cl_compiler/ldscripts/
- SHAVE_MYRIAD_LD_DIR=<INSTALL_DIR>/deployment_tools/tools/cl_compiler/bin/
- SHAVE_MOVIASM_DIR=<INSTALL_DIR>/deployment_tools/tools/cl_compiler/bin/

Then run the compilation with the --strip-binary-header option to produce an OpenCL runtime-agnostic binary that the Inference Engine can run.

To tie the topology IR to a layer you customize, you need to prepare a configuration file. Its main purpose is to tell the Inference Engine where to find the parameters for your kernel and to describe the execution work grid. For example, given the following OpenCL kernel signature:
The configuration file for this kernel binds each of its arguments to values from the IR. Each custom layer is described with a CustomLayer node, which has the following nodes and attributes:
- The root node CustomLayer contains the following attributes:
  - name — (Required) The name of the Inference Engine layer to bind the kernel with.
  - type and version — (Required) Reserved for future use. Set them to MVCL and 1, respectively.
  - max-shaves — (Optional) The maximum number of SHAVE cores that should be dedicated to the layer. This is useful for debugging concurrency issues, or for saving resources if a memory-bound kernel does not scale well with the number of cores, so that more resources can be left for the rest of the topology.
- The sub-node Kernel must contain the following attribute:
  - entry — The name of your kernel function as you defined it in the source file (in the example above, it is reorg_nhwc).
- The sub-node Source must contain the following attribute:
  - filename — A path to a compiled binary, relative to the .xml binding file.
- The sub-node Parameters describes the parameter bindings. For more information, see the description below.
- The sub-node WorkSizes describes the local and global work group sizes and the source for dimension deduction as a direction,port pair. In the example above, the work group is described relative to the dimensions of the input tensor that comes through port 0 in the IR. The global and local work group configurations support any simple math expressions with +, -, *, /, and () over B (batch), Y (height), X (width), and F (channels).
- The sub-node Where allows customizing bindings with a key="value" attribute. For example, to substitute only 3x3 convolutions, write <Where kernel="3,3"/> in the binding xml.

The Parameters description supports Tensor nodes of one of the tensor types input, output, input_buffer, output_buffer, or data, as well as Scalar nodes, described below.
Each Tensor node that has type input or output must contain the following attributes:

- arg-name — The name of a kernel parameter in the kernel signature.
- type — input or output, as in the IR.
- port-index — A number of the input/output port, as in the IR.
- format — Specifies the channel order in the tensor. Optional conversion layers are generated if the custom layer format is not compatible with the formats of neighboring layers. The BFXY, BYXF, and ANY formats are currently supported.

Each Tensor node that has type input_buffer or output_buffer must contain the following attributes:

- arg-name — The name of a kernel parameter in the kernel signature.
- type — input_buffer or output_buffer. Use the appropriate type to bind multiple kernels that correspond to different stages of the same layer.
- port-index — The unique identifier to bind by.
- dim — The dim source, in the same direction,port format used for WorkSizes bindings.
- size — The amount of bytes needed. The current expression syntax supports only expressions over the dimensions of the selected input/output tensor or constants, and might be extended in the future.

A multi-stage MVN layer binding is a typical example of this pattern.
Each Tensor node that has type data must contain the following attributes:

- source — A name of the blob as it appears in the IR (a typical example is weights for convolution).
- format — Specifies the channel order in the tensor. Optional conversion layers are generated if the custom layer format is not compatible with the formats of neighboring layers.

Each Scalar node must contain the following attributes:

- arg-name — The name of a kernel parameter in the kernel signature.
- type — int or float. It is used for correct argument extraction from IR parameters.
- source — Contains the name of the parameter in the IR file, or an input/output (I/O, In/On, where n is a port number) followed by a dimension B (batch), Y (height), X (width), or F (channels).

NOTE: If both native and custom layer implementations are present, the custom kernel has a priority over the native one.
Before loading the network that features the custom layers, provide a separate configuration file and load it using the InferenceEngine::Core::SetConfig()
method with the PluginConfigParams::KEY_CONFIG_FILE
key and the configuration file name as a value:
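A minimal sketch, assuming the binding file is named customLayers.xml:

```cpp
InferenceEngine::Core core;
core.SetConfig({{InferenceEngine::PluginConfigParams::KEY_CONFIG_FILE,
                 "customLayers.xml"}}, "MYRIAD");
// ... then read and load the network on the MYRIAD device as usual.
```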
Optionally, you can set a path to a custom layers description with a pair of VPU_CUSTOM_LAYERS
and /path/to/your/customLayers.xml
as a network configuration:
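A sketch of this alternative, using the key and path named above (network is a CNNNetwork you have already read):

```cpp
InferenceEngine::Core core;
auto execNetwork = core.LoadNetwork(network, "MYRIAD",
    {{"VPU_CUSTOM_LAYERS", "/path/to/your/customLayers.xml"}});
```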
This section provides optimization guidelines for writing custom layers with OpenCL for VPU devices. Knowledge of the general OpenCL programming model and the OpenCL kernel language is assumed and is not a subject of this section. The mapping of the OpenCL model to VPU is described in the table below.
OpenCL Model | VPU Mapping |
---|---|
Device code | Executed on SHAVE cores |
Private memory | Mapped to CMX internal memory, limited to 100KB per work group, valid only while work group is executed |
Local memory | Mapped to CMX internal memory, limited to 100KB per work group, valid only while work group is executed |
Global memory | Mapped to DDR, used to pass execution preserved parameters for inputs, outputs and blobs |
Work group | Executed on a single SHAVE core iterating over multiple work items |
Note that by the OpenCL specification, the work group execution order is not specified. This means that it is your responsibility to ensure that race conditions among work groups are not introduced. The custom layer runtime splits the work grid evenly among the available compute resources and executes them in an arbitrary order. This static scheduling approach works best if the load is spread evenly across work groups, which is a typical case for deep learning kernels. The following guidelines are recommended for work group partitioning:
- Limit the number of SHAVE cores dedicated to the layer if possible (the max-shaves attribute of the CustomLayer node). This keeps more resources for the rest of the topology. It is also useful if the kernel's scalability has reached its limit, which may happen while optimizing memory-bound kernels or kernels with poor parallelization.
- Consider an alternative data layout (BFXY/BYXF) for the kernel if it improves work group partitioning or data access patterns. Consider full topology performance (not just the specific layer boost), since data conversion layers are automatically inserted as appropriate.

The offline OpenCL compiler (clc) features automatic vectorization over get_global_id(0) usage if uniform access is detected. For example, the kernel below could be automatically vectorized:
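The original sample kernel is not reproduced here; the following illustrative kernel (not from the toolkit) shows the uniform get_global_id(0) access pattern, without vector types, that the work-group vectorizer can handle:

```c
/* Illustrative sketch: every access is indexed uniformly by get_global_id(0)
   and no vector types are used, so the clc work-group vectorizer stays enabled. */
#pragma OPENCL EXTENSION cl_khr_fp16 : enable

__kernel void scale_shift(__global const half* restrict src,
                          __global       half* restrict dst,
                          float scale,
                          float shift)
{
    int idx = get_global_id(0);
    dst[idx] = src[idx] * (half)scale + (half)shift;
}
```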
However, this work-group based vectorizer (WGV) conflicts with the default LLVM vectorizer based on superword-level parallelism (SLP) for the current compiler version. Manual vectorization is recommended to provide the best performance for non-uniform code patterns. WGV works if and only if vector types aren't used in the code.
Here is a short list of optimization tips:

- Use restrict where possible, as in the ocl_grn kernel from the example below.
- Use restrict markers for kernels with manually vectorized code. In the ocl_grn kernel below, the unrolled version without restrict is up to 20% slower than the most optimal one, which combines unrolling and restrict.
- Unroll loops by adding #pragma unroll N to the loop header. Since the compiler does not trigger unrolling by default, it is your responsibility to annotate the code with pragmas as appropriate. The ocl_grn version with #pragma unroll 4 is up to 50% faster, most of which comes from unrolling the first loop, because LLVM is generally better at scheduling 3-stage loops (load-compute-store), while the first loop variance += (float)(src_data[c*H*W + y*W + x] * src_data[c*H*W + y*W + x]); has only two stages (load-compute). Pay attention to unrolling such cases first. The unrolling factor is loop-dependent: choose the smallest number that still improves performance, as an optimum between kernel size and execution speed. For this specific kernel, changing the unroll factor from 4 to 6 results in the same performance, so an unrolling factor of 4 is the optimum. For Intel® Neural Compute Stick 2, unrolling is conjugated with automatic software pipelining of the load, store, and compute stages.
- Use the reqd_work_group_size kernel attribute to ask the compiler to unroll the code up to the local size of the work group. Note that if the kernel is actually executed with a different work group configuration, the result is undefined.
- Prefer half computations if they keep reasonable accuracy. 16-bit float is a native type for Intel® Neural Compute Stick 2, and most of the half_* functions map to a single hardware instruction. Use the standard native_* functions for the remaining types.
- Prefer the convert_half function over vstore_half if conversion to 32-bit float is required; convert_half is mapped to a single hardware instruction. For the cvtf32f16 kernel above, the variant using vstore_half is about 8 times slower than the line outImage[idx] = convert_half(inImage[idx]*scale+bias);.
- Mind early exits: they can be extremely costly for the current version of the clc compiler due to conflicts with the auto-vectorizer. The general advice is to set the local size in the x dimension equal to the input and/or output width. If it is impossible to define a work grid that exactly matches the inputs and/or outputs in order to eliminate checks (for example, if (get_global_id(0) >= width) return), use a line-wise kernel variant with manual vectorization.

The kernel example below demonstrates the impact of early exits on kernel performance.
This reorg kernel is auto-vectorizable, but the input for the Yolo v2 topology is NCHW=<1,64,26,26>, which is not a multiple of the vector width (8 for the half data type). As a result, the Inference Engine does not select the auto-vectorized kernel. To compare the performance of the auto-vectorized and scalar versions of the kernel, change the input size to NCHW=<1,64,26,32>. This allows the auto-vectorized version to be selected by the Inference Engine and can give you about a 30% uplift. Since the auto-vectorized version is faster, it makes sense to enable it for the Yolo v2 topology input size by setting the local size to a multiple of the vector width (for example, 32) and adjusting the global sizes accordingly. As a result, the execution work grid exceeds the actual input dimensions, so out-of-bound checks must be inserted. See the updated kernel version below.

This updated code performs the same as the initial (scalar) kernel above due to branching overhead. If you replace the min/max expression w = min(w, W-1); with if (w >= W) return;, the runtime increases by up to 2x compared to the code without branching (the initial version).
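As an illustration of the two bounds-handling patterns discussed here (a simplified sketch, not the documented reorg kernel):

```c
/* Illustrative sketch only: when the work grid is rounded up past the real width,
   clamp the index instead of returning early. The early-exit form conflicts with
   the auto-vectorizer and roughly doubles the runtime. */
#pragma OPENCL EXTENSION cl_khr_fp16 : enable

__kernel void copy_row(__global const half* restrict src,
                       __global       half* restrict dst,
                       int W)
{
    int w = get_global_id(0);
    // if (w >= W) return;   // <- the costly early exit
    w = min(w, W - 1);       // <- branch-free clamp keeps the kernel vectorizable
    dst[w] = src[w];
}
```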
If branching is inevitable for your element-based kernel, it's recommended to change the scheme to line-based. See the kernel variant below:
This decreases the execution time up to 40% against the best performing vectorized kernel without early exits (initial version).
A few more recommendations:

- Reuse computations among work items, for example by sharing intermediate values through __local memory.
- A further optimization of the reorg kernel is to unroll it by stride so that the src data is loaded only once. As a result, the cycle count drops by up to 45% against the line-wise version.
- Copy data from __global to __local or __private memory if it is accessed more than once. Access to __global memory is orders of magnitude slower than access to __local/__private memory due to the statically scheduled pipeline, which stalls completely on a memory access without any prefetch. The same recommendation applies to scalar loads/stores from/to a __global pointer, since work-group copying can be done in a vector fashion.
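A minimal illustrative sketch of this staging pattern (not from the toolkit; LINE_SIZE and the kernel itself are hypothetical, and the local work size in X is assumed to equal the row width):

```c
/* Stage a row of data in __local memory when it is read more than once,
   instead of re-reading slow __global memory. */
#pragma OPENCL EXTENSION cl_khr_fp16 : enable
#define LINE_SIZE 256  /* hypothetical upper bound on the row width */

__kernel void smooth_row(__global const half* restrict src,
                         __global       half* restrict dst,
                         int W)
{
    __local half line[LINE_SIZE];
    int x   = get_local_id(0);
    int row = get_group_id(0);

    line[x] = src[row * W + x];          // one read from slow __global memory
    barrier(CLK_LOCAL_MEM_FENCE);

    half left  = line[max(x - 1, 0)];    // repeated reads hit fast __local memory
    half right = line[min(x + 1, W - 1)];
    dst[row * W + x] = (left + line[x] + right) / (half)3.0f;
}
```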