NOTE: OpenCL™ custom layer support is available in the preview mode.
NOTE: This section assumes you are familiar with developing kernels using OpenCL™.
To customize your topology with an OpenCL™ layer, follow the steps below:
1. Write and compile your OpenCL code with the standalone offline OpenCL compiler (clc).
2. Write a configuration file that binds the kernel to the topology file (.xml) of the model IR.
3. Pass the configuration file to the Inference Engine together with the model IR (see below).

NOTE: The OpenCL compiler, targeting Intel® Neural Compute Stick 2 for the SHAVE* processor only, is redistributed with OpenVINO. OpenCL support is provided by ComputeAorta* and is distributed under a license agreement between Intel® and Codeplay* Software Ltd.
The OpenCL™ toolchain for the Intel® Neural Compute Stick 2 supports offline compilation only, so first compile your OpenCL C code using the standalone clc compiler. You can find the compiler binary at <INSTALL_DIR>/deployment_tools/tools/cl_compiler.
NOTE: By design, custom OpenCL layers support any OpenCL kernels written with OpenCL version 1.2 in mind. The support also covers the half float extension and is optimized for this type, because it is a native type for Intel® Movidius™ VPUs.
Prior to running a compilation, make sure that the following environment variables are set:

SHAVE_MA2X8XLIBS_DIR=<INSTALL_DIR>/deployment_tools/tools/cl_compiler/lib/
SHAVE_LDSCRIPT_DIR=<INSTALL_DIR>/deployment_tools/tools/cl_compiler/ldscripts/
SHAVE_MYRIAD_LD_DIR=<INSTALL_DIR>/deployment_tools/tools/cl_compiler/bin/
SHAVE_MOVIASM_DIR=<INSTALL_DIR>/deployment_tools/tools/cl_compiler/bin/
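With these variables set, a compilation command might look like the following. The kernel file name is illustrative, and the -o output option is an assumption; the only flag taken from this document is --strip-binary-header:

```sh
cd <INSTALL_DIR>/deployment_tools/tools/cl_compiler/bin
# Compile the kernel to a runtime-agnostic binary (file names are illustrative)
./clc --strip-binary-header custom_layer.cl -o custom_layer.bin
```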
Use the --strip-binary-header option to make an OpenCL runtime-agnostic binary that can be run with the Inference Engine.

To tie the topology IR to a layer you customize, prepare a configuration file so that the Inference Engine can find the parameters for your kernel and the description of the execution work grid. For example, consider the following OpenCL kernel signature.
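The exact signature is not preserved in this text; a plausible reconstruction, consistent with the reorg_nhwc entry point and the bindings described below, is:

```cl
__kernel void reorg_nhwc(__global const half *src,  // input tensor
                         __global       half *out,  // output tensor
                         int w, int h, int c,       // input width, height, channels
                         int stride);               // reorg stride
```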
A configuration file for this kernel might look like the following.
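A reconstructed sketch is shown below; the layer name, binary file name, formats, scalar sources, and work sizes are illustrative rather than copied from the original example:

```xml
<CustomLayer name="ReorgYolo" type="MVCL" version="1">
    <Kernel entry="reorg_nhwc">
        <Source filename="reorg.bin"/>
    </Kernel>
    <Parameters>
        <Tensor arg-name="src"    type="input"  port-index="0" format="BYXF"/>
        <Tensor arg-name="out"    type="output" port-index="0" format="BYXF"/>
        <Scalar arg-name="w"      type="int"    source="I.X"/>
        <Scalar arg-name="h"      type="int"    source="I.Y"/>
        <Scalar arg-name="c"      type="int"    source="I.F"/>
        <Scalar arg-name="stride" type="int"    source="stride"/>
    </Parameters>
    <WorkSizes dim="input,0" global="X,Y,1" local="X,1,1"/>
</CustomLayer>
```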
Each custom layer is described with the CustomLayer node. It has the following nodes and attributes:

- CustomLayer contains the following attributes:
  - name – (Required) The name of the Inference Engine layer to bind the kernel with.
  - type and version – (Required) Reserved for future use. Set them to MVCL and 1 respectively.
  - max-shaves – (Optional) The maximum number of SHAVE cores that should be dedicated to the layer. It is useful for debugging concurrency issues, or for saving resources when a memory-bound kernel does not scale well with the number of cores, so that more resources can be left for the rest of the topology.
- Kernel must contain the following attribute:
  - entry – The name of your kernel function as you defined it in the source file (in the example above, it is reorg_nhwc).
- Source must contain the following attribute:
  - filename – A path to the compiled binary, relative to the .xml binding file.
- Parameters – Describes parameter bindings. For more information, see the description below.
- WorkSizes – Describes local and global work group sizes and the source for dimension deduction as a direction,port pair. In the example above, the work group is described relative to the dimensions of the input tensor that comes through port 0 in the IR. The global and local work group configurations support simple math expressions with +, -, *, /, and () over B (batch), Y (height), X (width), and F (channels).
- Where – Allows customizing the bindings with a key="value" attribute. For example, to substitute only 3x3 convolutions, write <Where kernel="3,3"/> in the binding XML.

The parameter description supports Tensor nodes of one of the tensor types input, output, input_buffer, output_buffer, or data, as well as Scalar and Data nodes, and has the following format.
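A sketch of this format with illustrative argument names (the exact snippet from the original text is not reproduced):

```xml
<Tensor arg-name="src" type="input"      port-index="0" format="BYXF"/>
<Tensor arg-name="out" type="output"     port-index="0" format="BYXF"/>
<Scalar arg-name="C"   type="int"        source="I.F"/>
<Data   arg-name="tmp" type="local_data" dim="input,0"  size="X*F*2"/>
```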
A Tensor node of input or output type must contain the following attributes:

- arg-name – The name of a kernel parameter in the kernel signature.
- type – Node type: input or output as in the IR.
- port-index – The number of the input/output port as in the IR.
- format – The channel order in the tensor. Optional conversion layers are generated if the custom layer format is not compatible with the formats of neighboring layers. The BFXY, BYXF, and ANY formats are currently supported.

Each Tensor node of input_buffer or output_buffer type must contain the following attributes:

- arg-name – The name of a kernel parameter in the kernel signature.
- type – Node type: input_buffer or output_buffer. Use the appropriate type to bind multiple kernels that correspond to different stages of the same layer.
- port-index – The unique identifier to bind by.
- dim – The dimension source, in the same direction,port format used for WorkSizes bindings.
- size – The amount of bytes needed. The current expression syntax supports only expressions over the dimensions of the selected input/output tensor or constants, and might be extended in the future.

Here is an example of multi-stage MVN layer binding.
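The original example is not reproduced here; the sketch below shows only the buffer bindings between two hypothetical stages of an MVN layer (argument names, sizes, and the surrounding CustomLayer/Kernel structure are assumptions):

```xml
<!-- Stage 1: writes per-channel statistics into an intermediate buffer -->
<Tensor arg-name="src"   type="input"         port-index="0" format="BFXY"/>
<Tensor arg-name="stats" type="output_buffer" port-index="0" dim="output,0" size="F*4"/>

<!-- Stage 2: reads the buffer back (bound by the matching port-index) and normalizes -->
<Tensor arg-name="stats" type="input_buffer"  port-index="0" dim="output,0" size="F*4"/>
<Tensor arg-name="dst"   type="output"        port-index="0" format="BFXY"/>
```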
A Tensor node that has the type data must contain the following attributes:

- source – The name of the blob as it appears in the IR (a typical example is weights for a convolution).
- format – Specifies the channel order in the tensor. Optional conversion layers are generated if the custom layer format is not compatible with the formats of neighboring layers.

A Scalar node must contain the following attributes:

- arg-name – The name of a kernel parameter in the kernel signature.
- type – int or float value. It is used for correct argument extraction from IR parameters.
- source – Contains the name of the parameter in the IR file, or an input/output reference (I/O, In/On, where n is a port number) followed by a dimension: B (batch), Y (height), X (width), or F (channels).

A Data node must contain the following attributes:

- arg-name – The name of a kernel parameter in the kernel signature.
- type – Node type. Currently, local_data is the only supported value, which defines a buffer allocated in fast local on-chip memory. It is limited to 100K for all __local and __private arrays defined inside the kernel as well as all __local parameters passed to the kernel. Note that the manual-DMA extension requires double buffering. If the custom layer is detected to run out of local memory, the inference fails.
- dim – The dimension source, in the same direction,port format used for WorkSizes bindings.
- size – The amount of bytes needed. The current expression syntax supports only expressions over the dimensions of the selected input/output tensor or constants, and may be extended in the future.

The example binding below illustrates a kernel with two local buffers passed to the kernel.

NOTE: If both native and custom layer implementations are present, the custom kernel has priority over the native one.
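A sketch of such a binding, with illustrative argument names and sizes:

```xml
<Parameters>
    <Tensor arg-name="src"     type="input"      port-index="0" format="BFXY"/>
    <Tensor arg-name="dst"     type="output"     port-index="0" format="BFXY"/>
    <Data   arg-name="tmp_in"  type="local_data" dim="input,0"  size="X*F*2"/>
    <Data   arg-name="tmp_out" type="local_data" dim="output,0" size="X*F*2"/>
</Parameters>
```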
Before loading the network that features the custom layers, provide a separate configuration file and load it using the InferenceEngine::Core::SetConfig() method with the PluginConfigParams::KEY_CONFIG_FILE key and the configuration file name as its value.
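A minimal sketch of this call; the device name and the path are illustrative:

```cpp
#include <inference_engine.hpp>

InferenceEngine::Core core;
// Point the MYRIAD plugin at the custom layer configuration file.
core.SetConfig({{InferenceEngine::PluginConfigParams::KEY_CONFIG_FILE,
                 "/path/to/your/customLayers.xml"}}, "MYRIAD");
```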
Optionally, set a path to the custom layers description by passing the pair VPU_CUSTOM_LAYERS and /path/to/your/customLayers.xml as a network configuration.
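A sketch, assuming core and network were created as usual; the key string is taken from the text above and the path is illustrative:

```cpp
// Pass the custom layer description as part of the network configuration.
std::map<std::string, std::string> networkConfig = {
    {"VPU_CUSTOM_LAYERS", "/path/to/your/customLayers.xml"}
};
auto execNetwork = core.LoadNetwork(network, "MYRIAD", networkConfig);
```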
This section provides optimization guidelines for writing custom layers in OpenCL for VPU devices. Knowledge of the general OpenCL programming model and the OpenCL kernel language is assumed and is not a subject of this section. The mapping of the OpenCL model to the VPU is described in the table below.
OpenCL Model | VPU Mapping |
---|---|
Device code | Executed on SHAVE cores |
Private memory | Mapped to CMX internal memory, limited to 100KB per work group, valid only while the work group is executed |
Local memory | Mapped to CMX internal memory, limited to 100KB per work group, valid only while the work group is executed |
Global memory | Mapped to DDR, used to pass execution preserved parameters for inputs, outputs, and blobs |
Work group | Executed on a single SHAVE core iterating over multiple work items |
Note that by the OpenCL specification, the work group execution order is not specified. This means that it is your responsibility to ensure that race conditions among work groups are not introduced. The custom layer runtime splits the work grid evenly among the available compute resources and executes the work groups in an arbitrary order. This static scheduling approach works best if the load is spread evenly across work groups, which is a typical case for deep learning kernels. The following guidelines are recommended for work group partitioning:
- Limit the number of SHAVE cores dedicated to a layer (using the max-shaves attribute of the CustomLayer node). This keeps more resources for the rest of the topology. It is also useful if the kernel's scalability has reached its limits, which may happen while optimizing memory-bound kernels or kernels with poor parallelization.
- Consider an alternative data layout (BFXY/BYXF) for the kernel if it improves work group partitioning or data access patterns. Consider full topology performance (not just a specific layer boost), since data conversion layers are automatically inserted as appropriate.

The offline OpenCL compiler (clc) features automatic vectorization over get_global_id(0) usage if uniform access is detected. For example, the kernel below could be automatically vectorized.
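A sketch of such a kernel, reconstructed to be consistent with the cvtf32f16 kernel and the convert_half line quoted later in this section (the exact original may differ):

```cl
__kernel void cvtf32f16(const __global float* restrict inImage,
                              __global half*  restrict outImage,
                        float scale, float bais)
{
    // Uniform access over get_global_id(0) lets the work-group vectorizer kick in.
    int idx = get_global_id(0)
            + get_global_id(1) * get_global_size(0)
            + get_global_id(2) * get_global_size(0) * get_global_size(1);

    outImage[idx] = convert_half(inImage[idx]*scale+bais);
}
```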
However, this work-group based vectorizer (WGV) conflicts with the default LLVM vectorizer based on superword level parallelism (SLP) for the current compiler version. Manual vectorization is recommended to provide the best performance for non-uniform code patterns. WGV works if and only if vector types are not used in the code.
Here is a short list of optimization tips:

- Use the restrict qualifier where possible; it is especially important for kernels with unrolled code, such as ocl_grn from the example below.
- Keep restrict markers for kernels with manually vectorized code. In the ocl_grn kernel below, the unrolled version without restrict is up to 20% slower than the most optimal one, which combines unrolling and restrict.
- To enable loop unrolling, add #pragma unroll N to your loop header. Since the compiler does not trigger unrolling by default, it is your responsibility to annotate the code with pragmas as appropriate. The ocl_grn version with #pragma unroll 4 is up to 50% faster, most of which comes from unrolling the first loop: LLVM is, in general, better at scheduling 3-stage loops (load-compute-store), while the first loop variance += (float)(src_data[c*H*W + y*W + x] * src_data[c*H*W + y*W + x]); is only 2-stage (load-compute). Pay attention to unrolling such cases first. The unrolling factor is loop-dependent; choose the smallest number that still improves performance, as an optimum between kernel size and execution speed. For this specific kernel, changing the unroll factor from 4 to 6 results in the same performance, so an unrolling factor of 4 is the optimum. For the Intel® Neural Compute Stick 2, unrolling is combined with automatic software pipelining for the load, store, and compute stages, as illustrated by the sketch below.
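A sketch of the ocl_grn kernel discussed above, reconstructed around the variance line quoted in the text, with restrict and #pragma unroll 4 applied (the exact original may differ):

```cl
__kernel void ocl_grn(__global const half* restrict src_data,
                      __global       half* restrict dst_data,
                      int C, float bias)
{
    int x = get_global_id(0);
    int W = get_global_size(0);
    int y = get_global_id(1);
    int H = get_global_size(1);

    float variance = bias + 1e-9f;

    // The first loop is load-compute only; unrolling it gives most of the speedup.
    #pragma unroll 4
    for (int c = 0; c < C; c++)
        variance += (float)(src_data[c*H*W + y*W + x] * src_data[c*H*W + y*W + x]);

    variance = 1.f / native_sqrt(variance);

    #pragma unroll 4
    for (int c = 0; c < C; c++)
        dst_data[c*H*W + y*W + x] = (half)((float)src_data[c*H*W + y*W + x] * variance);
}
```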
- Use the reqd_work_group_size kernel attribute to ask the compiler to unroll the code up to the local size of the work group. Note that if the kernel is actually executed with a different work group configuration, the result is undefined.
- Prefer half compute if it keeps reasonable accuracy: 16-bit float is a native type for the Intel® Neural Compute Stick 2, and most of the half_* functions are mapped to a single hardware instruction. Use the standard native_* functions for the rest of the types.
- Prefer the convert_half function over vstore_half if a conversion from 32-bit float is required; convert_half is mapped to a single hardware instruction. For the cvtf32f16 kernel above, the line outImage[idx] = convert_half(inImage[idx]*scale+bais); is 8 times faster than the same code written with vstore_half.
- Mind early exits: they can be extremely costly for the current clc compiler due to conflicts with the auto-vectorizer. The generic advice is to set the local size along the x dimension equal to the input and/or output width. If it is impossible to define a work grid that exactly matches the inputs and/or outputs, so that checks such as if (get_global_id(0) >= width) return cannot be eliminated, use a line-wise kernel variant with manual vectorization. The kernel example below demonstrates the impact of early exits on kernel performance. The reorg kernel is auto-vectorizable, but an input for the YOLO v2 topology is NCHW=<1,64,26,26>, which is not a multiple of the vector width (8 for the half data type). As a result, the Inference Engine does not select the auto-vectorized kernel. To compare the performance of the auto-vectorized and scalar versions of the kernel, change the input size to NCHW=<1,64,26,32>. This allows the auto-vectorized version to be selected by the Inference Engine and can give you about a 30% uplift. Since the auto-vectorized version is faster, it makes sense to enable it for the YOLO v2 input size by setting the local size to a multiple of the vector width (for example, 32) and adjusting the global sizes accordingly. As a result, the execution work grid exceeds the actual input dimensions, so out-of-bound checks should be inserted. See the updated kernel version below: if you replace w = min(w, W-1); with if (w >= W) return;, runtime increases up to 2x against the code without branching (the initial version).
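The updated reorg kernel is not reproduced here; the minimal sketch below (hypothetical names) only illustrates the clamping pattern discussed above for a work grid that may exceed the input width:

```cl
__kernel void copy_line(__global const half* restrict src,
                        __global       half* restrict dst,
                        int W)
{
    int w = get_global_id(0);
    // The grid may be rounded up to a multiple of the vector width,
    // so clamp the index instead of returning early: an early
    // "if (w >= W) return;" defeats auto-vectorization and costs up to 2x.
    w = min(w, W - 1);
    dst[w] = src[w];
}
```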
- Reuse computations among work items, for example by sharing values through __local memory.
- Consider the version of the reorg kernel unrolled by stride: the src data in this case is loaded only once, and, as a result, the cycle count drops by up to 45% against the line-wise version.
- Copy data from __global to __local or __private memory if the data is accessed more than once. Access to __global memory is orders of magnitude slower than access to __local/__private memory due to the statically scheduled pipeline, which stalls completely on memory access without any prefetch. The same recommendation applies to scalar load/store from/to a __global pointer, since work-group copying can be done in a vector fashion.
- Use the manual-DMA extension, which provides the __dma_preload and __dma_postwrite intrinsics. This means that instead of one kernel, a group of three kernels should be implemented: kernelName, __dma_preload_kernelName, and __dma_postwrite_kernelName. __dma_preload_kernelName for a particular work group n is guaranteed to be executed before the n-th work group itself, while __dma_postwrite_kernelName is guaranteed to be executed after the corresponding work group. You can define either of these functions; they are intended to copy data between __global and __local memory. The syntax requires an exact function signature match. The example below illustrates how to prepare your kernel for manual DMA.
- You can also use async_work_group_copy, which is also mapped to a DMA call.

Here is the list of supported functions.
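The standard OpenCL 1.2 signatures of these built-ins are shown below for reference; whether the toolchain adds further variants (for example, 2D copies) should be checked separately:

```cl
event_t async_work_group_copy(__local T *dst, const __global T *src,
                              size_t num_elements, event_t event);
event_t async_work_group_copy(__global T *dst, const __local T *src,
                              size_t num_elements, event_t event);
event_t async_work_group_strided_copy(__local T *dst, const __global T *src,
                                      size_t num_elements, size_t src_stride, event_t event);
event_t async_work_group_strided_copy(__global T *dst, const __local T *src,
                                      size_t num_elements, size_t dst_stride, event_t event);
void    wait_group_events(int num_events, event_t *event_list);
```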
where T can be uchar, char, short, ushort, int, uint, long, ulong, half, or float.
A modified version of the GRN kernel could be the following.
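The original code is not reproduced here; the following is a simplified sketch of the three-kernel group under assumptions stated in the text (one work group per row, local buffers bound as local_data in the XML, and plain async_work_group_copy instead of any vendor-specific copy variants):

```cl
// Preload: guaranteed to run before the corresponding work group.
__kernel void __dma_preload_grn_NCHW(__global const half* restrict src,
                                     __global       half* restrict dst,
                                     __local        half* restrict local_src,
                                     __local        half* restrict local_dst,
                                     int C, float bias)
{
    // Bring one line (W elements) of every channel for the current row into CMX.
    const int W = get_global_size(0);
    const int H = get_global_size(1);
    const int y = get_group_id(1);
    event_t e = 0;
    for (int c = 0; c < C; c++)
        e = async_work_group_copy(local_src + c * W, src + c * H * W + y * W, W, e);
    wait_group_events(1, &e);
}

// Postwrite: guaranteed to run after the corresponding work group.
__kernel void __dma_postwrite_grn_NCHW(__global const half* restrict src,
                                       __global       half* restrict dst,
                                       __local        half* restrict local_src,
                                       __local        half* restrict local_dst,
                                       int C, float bias)
{
    // Write the processed line of every channel back to DDR.
    const int W = get_global_size(0);
    const int H = get_global_size(1);
    const int y = get_group_id(1);
    event_t e = 0;
    for (int c = 0; c < C; c++)
        e = async_work_group_copy(dst + c * H * W + y * W, local_dst + c * W, W, e);
    wait_group_events(1, &e);
}

// Main kernel: same signature, but indexes the CMX copies via local IDs.
__kernel void grn_NCHW(__global const half* restrict src,
                       __global       half* restrict dst,
                       __local        half* restrict local_src,
                       __local        half* restrict local_dst,
                       int C, float bias)
{
    const int W = get_local_size(0);
    const int x = get_local_id(0);

    float variance = bias + 1e-9f;
    for (int c = 0; c < C; c++)
        variance += (float)(local_src[c * W + x] * local_src[c * W + x]);

    variance = 1.f / native_sqrt(variance);
    for (int c = 0; c < C; c++)
        local_dst[c * W + x] = (half)((float)local_src[c * W + x] * variance);
}
```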
Please note the get_local_size and get_local_id usage inside the kernel. A 21x speedup is expected for this kernel on the enet-curbs setup, since it was completely limited by memory usage.
An alternative method of using DMA is the work item copy extension. These functions are executed inside a kernel and require work groups equal to a single work item.
Here is the list of supported work item functions, where T can be uchar, char, short, ushort, int, uint, long, ulong, half, or float.