Inference Engine Kernels Extensibility

The Inference Engine workflow involves the creation of custom kernels and either custom or existing layers.

A Layer is a convolutional neural network (CNN) building block implemented in the training framework, for example, Convolution in Caffe*. A Kernel is defined as the corresponding implementation in Inference Engine.

Please refer to the Custom Layers in the Model Optimizer section for the details of how a mapping between framework layers and Inference Engine kernels is registered.

In short, you can plug your own kernel implementations into the Inference Engine and map them to the layers in the original framework.

The rest of this section covers custom kernels and how to integrate them into the Inference Engine.

Example of Custom Kernels Support in the Samples

Every sample uses the Inference Engine API to load custom kernels depending on the device type. Specifically, for the CPU, it is a shared library that exports a certain interface that registers the kernels. For GPU or MYRIAD, it is an .xml file that lists the kernels along with the parameters that the kernels accept and how these map to the specific Intermediate Representation (IR) values.
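In host code, both mechanisms come down to a couple of Inference Engine API calls. The sketch below uses placeholder file names; the exact calls are shown later in this document:

// Minimal host-side sketch with placeholder paths
InferenceEngine::Core core;
// CPU: a shared library that registers the kernel implementations
auto cpuExtension = InferenceEngine::make_so_pointer<InferenceEngine::IExtension>("libmy_cpu_extension.so");
core.AddExtension(cpuExtension, "CPU");
// GPU (similarly MYRIAD): an .xml file that describes the kernels
core.SetConfig({ { InferenceEngine::PluginConfigParams::KEY_CONFIG_FILE, "custom_layer_example.xml" } }, "GPU");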

Example Custom Kernels

You can find the examples of CPU-targeted kernels in the <INSTALL_DIR>/deployment_tools/inference_engine/src/extension directory. You can also use the global GPU kernels delivered with the OpenVINO toolkit as an example.

Several GPU-targeted kernels are also added to the binaries upon sample compilation so that the sample applications can easily load them. Refer to the cldnn_global_custom_kernels folder in the GPU plugin installation directory.

How to Implement Custom GPU Layers

The GPU codepath abstracts many details of OpenCL™. You need to provide the kernel code in OpenCL C and a configuration file that connects the kernel and its parameters to the parameters of the layer.

There are two options for using the custom layer configuration file: add your kernels to the global cldnn_global_custom_kernels configuration that the GPU plugin loads automatically (see the folder mentioned above), or provide a separate configuration file and pass it to the plugin explicitly (for the samples, via the -c command-line option; in your own application, via the KEY_CONFIG_FILE plugin configuration key shown later in this document).

All Inference Engine samples (except the trivial hello_classification) feature a dedicated command-line option -c to load custom kernels. For example, to load custom layers for the classification sample:

$ ./classification_sample -m <path_to_model>/bvlc_alexnet_fp16.xml -i ./validation_set/daily/227x227/apron.bmp -d GPU
-c <absolute_path_to_config>/custom_layer_example.xml

Configuration File Format

The configuration file is expected to follow the .xml file structure with a node of type CustomLayer for every custom layer you provide.

The following definitions use these notations for the number of times a node or attribute can appear: (1) — exactly one occurrence (required); (0/1) — zero or one occurrence (optional); (0+) — zero or more occurrences; (1+) — one or more occurrences.

CustomLayer Node and Sub-node Structure

CustomLayer node contains the entire configuration for a single custom layer.

Attribute Name # Description
name (1) The name of the layer type to be used. This name should be identical to the type used in the IR.
type (1) Must be SimpleGPU
version (1) Must be 1

Sub-nodes: Kernel (1), Buffers (1), CompilerOptions (0+), WorkSizes (0/1)

Kernel Node and Sub-node Structure

Kernel node contains all kernel source code configuration. Its entry attribute specifies the name of the kernel entry function in the OpenCL source (see the example configuration file below).

Sub-nodes: Source (1+), Define (0+)

Source Node and Sub-node Structure

Source node points to a single OpenCL source file.

Attribute Name # Description
filename (1) Name of the file containing the OpenCL source code. Note that the path is relative to your executable. Multiple Source nodes will have their sources concatenated in order.

Sub-nodes: None

Define Node and Sub-node Structure

Define node configures a single #define instruction to be added to the sources during compilation (JIT).

Attribute Name # Description
name (1) The name of the defined JIT. For static constants, this can include the value as well (taken as a string).
param (0/1) The name of the layer parameter in the IR whose value will be used as the value of this JIT definition.
type (0/1) The parameter type. Accepted values: int, float, and int[], float[] for arrays.
default (0/1) The default value to be used if the specified parameter is missing from the layer in the IR.

Sub-nodes: None

The resulting JIT will be of the form: #define [name] [type] [value/default].

Buffers Node and Sub-node Structure

Buffers node configures all input/output buffers for the OpenCL entry function. The node itself has no attributes.

Sub-nodes: Data (0+), Tensor (1+)

Data Node and Sub-node Structure

Data node configures a single input with static data (for example, weights or biases).

Attribute Name # Description
name (1) Name of a blob attached to a layer in the IR
arg-index (1) 0-based index in the entry function arguments to be bound to

Sub-nodes: None

Tensor Node and Sub-node Structure

Tensor node configures a single input or output tensor.

Attribute Name # Description
arg-index (1) 0-based index in the entry function arguments to be bound to.
type (1) input or output
port-index (1) 0-based index in the layer’s input/output ports in the IR
format (0/1) Data layout declaration for the tensor. Accepted values: BFYX, BYXF, YXFB, FYXB (also in all lowercase). Default value: BFYX

CompilerOptions Node and Sub-node Structure

CompilerOptions node configures the compilation flags for the OpenCL sources.

Attribute Name # Description
options (1) Options string to be passed to the OpenCL compiler

Sub-nodes: None

WorkSizes Node and Sub-node Structure

WorkSizes node configures the global/local work sizes to be used when queuing the OpenCL program for execution.

Attribute Name # Description
global (0/1)
local (0/1)
An array of up to 3 integers or formulas defining the OpenCL work sizes to be used during execution. The formulas can use the values of the B, F, Y, X dimensions and the operators +, -, /, *, % (all evaluated in integer arithmetic). Default value: global="B*F*Y*X", local=""
dim (0/1) A tensor to take the work size from. Accepted values: input N, output, where N is the index of an input tensor, starting with 0. Default value: output

Sub-nodes: None

Example Configuration file

The following code sample provides an example configuration file (in .xml format). For information on configuration file structure, see Configuration File Format.

<CustomLayer name="ReLU" type="SimpleGPU" version="1">
<Kernel entry="example_relu_kernel">
<Source filename="custom_layer_kernel.cl"/>
<Define name="neg_slope" type="float" param="negative_slope" default="0.0"/>
</Kernel>
<Buffers>
<Tensor arg-index="0" type="input" port-index="0" format="BFYX"/>
<Tensor arg-index="1" type="output" port-index="0" format="BFYX"/>
</Buffers>
<CompilerOptions options="-cl-mad-enable"/>
<WorkSizes global="X,Y,B*F"/>
</CustomLayer>

Built-In Defines for Custom Layers

The following table includes definitions that will be attached before the user sources, where <TENSOR> stands for the actual tensor name (for example, INPUT0 or OUTPUT0).

For an example, see Example Kernel.

Name Value
NUM_INPUTS Number of the input tensors bound to this kernel
GLOBAL_WORKSIZE An array of global work sizes used to execute this kernel
GLOBAL_WORKSIZE_SIZE The size of the GLOBAL_WORKSIZE array
LOCAL_WORKSIZE An array of local work sizes used to execute this kernel
LOCAL_WORKSIZE_SIZE The size of the LOCAL_WORKSIZE array
<TENSOR>_DIMS An array of the tensor dimension sizes. Always ordered as BFYX
<TENSOR>_DIMS_SIZE The size of the <TENSOR>_DIMS array.
<TENSOR>_TYPE The data-type of the tensor: float, half or char
<TENSOR>_FORMAT_ The format of the tensor: BFYX, BYXF, YXFB, FYXB, or ANY. The format is concatenated to the defined name. You can use the tensor format to define codepaths in your code with #ifdef/#endif.
<TENSOR>_LOWER_PADDING An array of padding elements used for the tensor dimensions before they start. Always ordered as BFYX.
<TENSOR>_LOWER_PADDING_SIZE The size of the <TENSOR>_LOWER_PADDING array
<TENSOR>_UPPER_PADDING An array of padding elements used for the tensor dimensions after they end. Always ordered as BFYX.
<TENSOR>_UPPER_PADDING_SIZE The size of the <TENSOR>_UPPER_PADDING array
<TENSOR>_PITCHES The number of elements between adjacent elements in each dimension. Always ordered as BFYX.
<TENSOR>_PITCHES_SIZE The size of the <TENSOR>_PITCHES array
<TENSOR>_OFFSET The number of elements from the start of the tensor to the first valid element (bypassing the lower padding)

All <TENSOR> values will be automatically defined for every tensor bound to this layer (INPUT0, INPUT1, OUTPUT0, and so on), as shown in the following example:

#define INPUT0_DIMS_SIZE 4
#define INPUT0_DIMS (int []){ 1,96,55,55, }
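Because the tensor format is concatenated to the <TENSOR>_FORMAT_ define (see the table above), a kernel can branch on the data layout at compile time. The fragment below only illustrates that pattern:

#ifdef INPUT0_FORMAT_BFYX
// Code path specialized for the BFYX layout of INPUT0
#else
// Generic fallback for the other layouts
#endif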

Example Kernel

#pragma OPENCL EXTENSION cl_khr_fp16 : enable
__kernel void example_relu_kernel(
const __global INPUT0_TYPE* input0,
__global OUTPUT0_TYPE* output)
{
const uint idx = get_global_id(0);
const uint idy = get_global_id(1);
const uint idbf = get_global_id(2);//batches*features, as OpenCL supports 3D nd-ranges only
const uint feature = idbf%OUTPUT0_DIMS[1];
const uint batch = idbf/OUTPUT0_DIMS[1];
//notice that pitches are in elements, not in bytes!
const uint in_id = batch*INPUT0_PITCHES[0] + feature*INPUT0_PITCHES[1] + idy*INPUT0_PITCHES[2] + idx*INPUT0_PITCHES[3] + INPUT0_OFFSET;
const uint out_id = batch*OUTPUT0_PITCHES[0] + feature*OUTPUT0_PITCHES[1] + idy*OUTPUT0_PITCHES[2] + idx*OUTPUT0_PITCHES[3] + OUTPUT0_OFFSET;
INPUT0_TYPE value = input0[in_id];
//neg_slope (which is non-zero for leaky ReLU) is put automatically as #define, refer to the config xml
output[out_id] = value < 0 ? value * neg_slope : value;
}

NOTE: As described in the previous section, definitions such as INPUT0_TYPE are actually defined as OpenCL (pre-)compiler inputs by the Inference Engine for efficiency reasons. See Debugging Tips for information on debugging the results.

Debugging Tips

How to Implement Custom CPU Layers

Since the primary vehicle for the performance of the CPU codepath in the Inference Engine is the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN), new CPU kernels extend the Inference Engine plugin for the Intel MKL-DNN. Implementing the InferenceEngine::ILayerImplFactory defines a general CPU-side extension. There are no Intel MKL-DNN specifics in the way you need to implement a kernel.

All Inference Engine samples (except the trivial hello_classification) feature a dedicated command-line option -l to load custom CPU kernels. Use the following command-line code to execute the Classification Sample with custom CPU kernels:

$ ./classification_sample -m <path_to_model>/CustomAlexNet.xml -i <path_to_image>/inputImage.bmp -d CPU
-l <absolute_path_to_library>/libmy_sample_extension.so

Consider a simple CustomLayerFactory class that registers an example kernel that squares its input data (raises each element to the power of two) and does not change the dimensions:

  1. Create your custom layer factory CustomLayerFactory class:
    // custom_layer.h
    // A CustomLayerFactory class is an example layer factory for a kernel that squares the input and doesn't change dimensions
    class CustomLayerFactory {
    };
  2. Inherit it from the abstract class: InferenceEngine::ILayerImplFactory
    // custom_layer.h
    class CustomLayerFactory: public InferenceEngine::ILayerImplFactory {
    };
  3. Create a constructor, a virtual destructor, and a data member to keep the layer info:
    // custom_layer.h
    class CustomLayerFactory: public InferenceEngine::ILayerImplFactory {
    public:
    explicit CustomLayerFactory(const CNNLayer *layer): cnnLayer(*layer) {}
    virtual ~CustomLayerFactory() = default;
    private:
    CNNLayer cnnLayer;
    };
  4. Overload and implement the abstract methods getShapes and getImplementations of the InferenceEngine::ILayerImplFactory class:
    // custom_layer.h
    class CustomLayerFactory: public InferenceEngine::ILayerImplFactory {
    public:
    // ... constructor and destructor
    StatusCode getShapes(const std::vector<TensorDesc>& inShapes, std::vector<TensorDesc>& outShapes, ResponseDesc *resp) noexcept override {
    if (inShapes.size() != 1) {
    std::string errorMsg = "Incorrect input shapes!";
    errorMsg.copy(resp->msg, sizeof(resp->msg) - 1);
    return GENERAL_ERROR;
    }
    // The layer does not change dimensions, so the output shape matches the input shape
    outShapes.clear();
    outShapes.emplace_back(inShapes[0]);
    return OK;
    }
    StatusCode getImplementations(std::vector<ILayerImpl::Ptr>& impls, ResponseDesc *resp) noexcept override {
    // You can add cnnLayer to implementation if it is necessary
    impls.push_back(ILayerImpl::Ptr(new CustomLayerImpl()));
    return OK;
    }
    };
  5. Create your custom layer implementation CustomLayerImpl class:
    // custom_layer.h
    // A CustomLayerImpl class is an example implementation
    class CustomLayerImpl {
    };
  6. Because the layer uses execute method to change data, inherit it from the abstract class InferenceEngine::ILayerExecImpl, overload and implement the abstract methods of this class:
    // custom_layer.h
    // A CustomLayerImpl class is an example implementation
    class CustomLayerImpl: public ILayerExecImpl {
    public:
    explicit CustomLayerImpl(const CNNLayer *layer): cnnLayer(*layer) {}
    StatusCode getSupportedConfigurations(std::vector<LayerConfig>& conf, ResponseDesc *resp) noexcept override;
    StatusCode init(LayerConfig& config, ResponseDesc *resp) noexcept override;
    StatusCode execute(std::vector<Blob::Ptr>& inputs, std::vector<Blob::Ptr>& outputs, ResponseDesc *resp) noexcept override;
    private:
    CNNLayer cnnLayer;
    };
  7. Implement the getSupportedConfigurations virtual method, which returns all supported configuration formats (input/output tensor layouts) for your implementation. To specify formats of data, use InferenceEngine::TensorDesc. Refer to the Memory Primitives section for instructions on how to do it.
    // custom_layer.cpp
    StatusCode CustomLayerImpl::getSupportedConfigurations(std::vector<LayerConfig>& conf, ResponseDesc *resp) noexcept {
    try {
    // This layer can be in-place but not constant
    if (cnnLayer.insData.size() != 1 || cnnLayer.outData.empty())
    THROW_IE_EXCEPTION << "Incorrect number of input/output edges";
    LayerConfig config;
    DataPtr dataPtr = cnnLayer.insData[0].lock();
    if (!dataPtr)
    THROW_IE_EXCEPTION << "Cannot get input data";
    DataConfig dataConfig;
    dataConfig.inPlace = -1;
    dataConfig.constant = false;
    SizeVector order;
    for (size_t i = 0; i < dataPtr->getTensorDesc().getDims().size(); i++) {
    order.push_back(i);
    }
    // Planar formats for N dimensions
    dataConfig.desc = TensorDesc(dataPtr->getTensorDesc().getPrecision(),
    dataPtr->getTensorDesc().getDims(),
    {dataPtr->getTensorDesc().getDims(), order});
    config.inConfs.push_back(dataConfig);
    DataConfig outConfig;
    outConfig.constant = false;
    outConfig.inPlace = 0;
    order.clear();
    for (size_t i = 0; i < cnnLayer.outData[0]->getTensorDesc().getDims().size(); i++) {
    order.push_back(i);
    }
    outConfig.desc = TensorDesc(cnnLayer.outData[0]->getTensorDesc().getPrecision(),
    cnnLayer.outData[0]->getDims(),
    {cnnLayer.outData[0]->getDims(), order});
    config.outConfs.push_back(outConfig);
    config.dynBatchSupport = 0;
    conf.push_back(config);
    return OK;
    } catch (InferenceEngine::details::InferenceEngineException& ex) {
    std::string errorMsg = ex.what();
    errorMsg.copy(resp->msg, sizeof(resp->msg) - 1);
    return GENERAL_ERROR;
    }
    }
  8. Implement the init method, which gets a runtime-selected configuration from the vector populated in the previous step and checks its parameters:
    // custom_layer.cpp
    StatusCode CustomLayerImpl::init(LayerConfig& config, ResponseDesc *resp) noexcept {
    StatusCode rc = OK;
    if (config.dynBatchSupport) {
    config.dynBatchSupport = 0;
    rc = NOT_IMPLEMENTED;
    }
    for (auto& input : config.inConfs) {
    if (input.inPlace >= 0) {
    input.inPlace = -1;
    rc = NOT_IMPLEMENTED;
    }
    for (auto& offset : input.desc.getBlockingDesc().getOffsetPaddingToData()) {
    if (offset) {
    return GENERAL_ERROR; // our simplified implementation does not support data offsets
    }
    }
    if (input.desc.getBlockingDesc().getOffsetPadding()) {
    return GENERAL_ERROR; // our simplified implementation does not support padding
    }
    for (size_t i = 0; i < input.desc.getBlockingDesc().getOrder().size(); i++) {
    if (input.desc.getBlockingDesc().getOrder()[i] != i) {
    // our simplified tensors support only 4D dimensions with regular order
    if (i != 4 || input.desc.getBlockingDesc().getOrder()[i] != 1)
    return GENERAL_ERROR;
    }
    }
    }
    for (auto& output : config.outConfs) {
    if (output.inPlace < 0) {
    // this simplified implementation requires the output to be in-place with the input
    return GENERAL_ERROR;
    }
    for (auto& offset : output.desc.getBlockingDesc().getOffsetPaddingToData()) {
    if (offset) {
    return GENERAL_ERROR;
    }
    }
    if (output.desc.getBlockingDesc().getOffsetPadding()) {
    return GENERAL_ERROR;
    }
    for (size_t i = 0; i < output.desc.getBlockingDesc().getOrder().size(); i++) {
    if (output.desc.getBlockingDesc().getOrder()[i] != i) {
    if (i != 4 || output.desc.getBlockingDesc().getOrder()[i] != 1)
    return GENERAL_ERROR;
    }
    }
    }
    return rc;
    }
  9. Implement the execute method, which accepts and processes the actual tensors as input/output blobs:
    // custom_layer.cpp
    StatusCode CustomLayerImpl::execute(std::vector<Blob::Ptr>& inputs, std::vector<Blob::Ptr>& outputs, ResponseDesc *resp) noexcept {
    if (inputs.size() != 1 || outputs.empty()) {
    std::string errorMsg = "Incorrect number of input or output edges!";
    errorMsg.copy(resp->msg, sizeof(resp->msg) - 1);
    return GENERAL_ERROR;
    }
    const float* src_data = inputs[0]->buffer();
    float* dst_data = outputs[0]->buffer();
    for (size_t o = 0; o < outputs[0]->size(); o++) {
    if (dst_data == src_data) {
    dst_data[o] *= dst_data[o];
    } else {
    dst_data[o] = src_data[o] * src_data[o];
    }
    }
    return OK;
    }
  10. Pack the kernels into a shared library:
    1. Create a factory for your own primitives inherited from the abstract class InferenceEngine::IExtension, which defines the functions that you need to implement:
      // custom_extension.h
      class CustomExtention : public InferenceEngine::IExtension {
      };
    2. Implement the utility methods Unload, Release, SetLogCallback:
      // custom_extension.h
      class CustomExtention : public InferenceEngine::IExtension {
      public:
      // cleans up resources, in this case, does nothing
      void Unload() noexcept override {
      }
      // is used when destruction happens
      void Release() noexcept override {
      delete this;
      }
      // logging is used to track what is going on inside
      void SetLogCallback(InferenceEngine::IErrorListener &listener) noexcept override {}
      };
    3. Implement the utility method GetVersion:
      // custom_extension.h
      class CustomExtention : public InferenceEngine::IExtension {
      private:
      inline static const InferenceEngine::Version ExtensionDescription = {
      {1, 0}, // extension API version
      "1.0",
      "CustomExtention" // extension description message
      };
      public:
      // gets extension version information
      void GetVersion(const InferenceEngine::Version *& versionInfo) const noexcept override {
      versionInfo = &ExtensionDescription;
      }
      };
    4. Implement main extension methods:
      // custom_extension.h
      class CustomExtention : public InferenceEngine::IExtension {
      public:
      // ... utility methods
      // returns the list of supported kernels/layers
      StatusCode getPrimitiveTypes(char**& types, unsigned int& size, ResponseDesc* resp) noexcept override {
      std::string type_name = "CustomLayer";
      types = new char *[1];
      size = 1;
      types[0] = new char[type_name.size() + 1];
      std::copy(type_name.begin(), type_name.end(), types[0]);
      types[0][type_name.size()] = '\0';
      return OK;
      }
      // main function
      StatusCode getFactoryFor(ILayerImplFactory *&factory, const CNNLayer *cnnLayer, ResponseDesc *resp) noexcept override {
      if (cnnLayer->type != "CustomLayer") {
      std::string errorMsg = std::string("Factory for ") + cnnLayer->type + " wasn't found!";
      errorMsg.copy(resp->msg, sizeof(resp->msg) - 1);
      return NOT_FOUND;
      }
      factory = new CustomLayerFactory(cnnLayer);
      return OK;
      }
      };
  11. To use your custom layers, you need to compile the code as a shared library. After that, use the AddExtension method of the InferenceEngine::Core object to load your primitives:
    // Load CPU extension as a shared library
    auto extension_ptr = make_so_pointer<InferenceEngine::IExtension>("<shared lib path>");
    // Add extension to the CPU device
    core.AddExtension(extension_ptr, "CPU");
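    A minimal end-to-end usage sketch is shown below. It assumes a CNNNetwork object named network has already been read from the IR; the library path is a placeholder:
    // Hypothetical usage sketch: `network` is a CNNNetwork read from the IR,
    // and the library path below is a placeholder for your compiled extension
    InferenceEngine::Core core;
    auto extension_ptr = make_so_pointer<InferenceEngine::IExtension>("libmy_sample_extension.so");
    core.AddExtension(extension_ptr, "CPU");
    // Layers of type "CustomLayer" are now resolved through the registered factory
    auto execNetwork = core.LoadNetwork(network, "CPU");
    auto inferRequest = execNetwork.CreateInferRequest();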

How to Implement Custom Layers for VPU (Intel® Neural Compute Stick 2)

NOTE: OpenCL™ custom layer support is available in the preview mode.

NOTE: This section assumes you are familiar with developing kernels using OpenCL™.

To customize your topology with an OpenCL™ layer, follow the steps below:

  1. Write and compile your OpenCL™ code with the standalone offline OpenCL™ compiler (clc).
  2. Write a configuration file to bind the OpenCL™ kernel to the topology file (.xml) of the model IR.
  3. Pass the configuration file to the Inference Engine together with the model IR.

Compile OpenCL™ code for VPU (Intel® Neural Compute Stick 2)

NOTE: The OpenCL compiler, targeting the Intel® Neural Compute Stick 2 SHAVE* processors only, is redistributed with OpenVINO.

OpenCL support is provided by ComputeAorta*, and is distributed under a license agreement between Intel and Codeplay Software Ltd.

Since the OpenCL™ toolchain for Intel® Neural Compute Stick 2 supports offline compilation only, you should first compile your OpenCL C code using the standalone clc compiler. You can find the compiler binary at <INSTALL_DIR>/deployment_tools/tools/cl_compiler.

NOTE: By design, custom OpenCL layers support any OpenCL kernels written against the OpenCL 1.2 specification. The half float extension is also supported and is the preferred compute type, since half is the native type for the Movidius VPU.

  1. Prior to running compilation, make sure that the following variables are set:
    • SHAVE_MA2X8XLIBS_DIR=<INSTALL_DIR>/deployment_tools/tools/cl_compiler/lib/
    • SHAVE_LDSCRIPT_DIR=<INSTALL_DIR>/deployment_tools/tools/cl_compiler/ldscripts/
    • SHAVE_MYRIAD_LD_DIR=<INSTALL_DIR>/deployment_tools/tools/cl_compiler/bin/
    • SHAVE_MOVIASM_DIR=<INSTALL_DIR>/deployment_tools/tools/cl_compiler/bin/
  2. Run the compilation with the command below. Use --strip-binary-header to make an OpenCL runtime-agnostic binary runnable with the Inference Engine.
    cd <INSTALL_DIR>/deployment_tools/tools/cl_compiler/bin
    ./clc --strip-binary-header custom_layer.cl -o custom_layer.bin

Write a configuration file

To bind a layer you customize to the topology IR, you need to prepare a configuration file. Its main purpose is to tell the Inference Engine where to find the parameters for your kernel and to describe the execution work grid. For example, consider the following OpenCL kernel signature:

__kernel void reorg_nhwc(__global const half *src, __global half *out, int w, int h, int c, int stride);

The configuration file for this kernel might look as follows:

<CustomLayer name="ReorgYolo" type="MVCL" version="1">
<Kernel entry="reorg_nhwc">
<Source filename="reorg.bin"/>
</Kernel>
<Parameters>
<Tensor arg-name="src" type="input" port-index="0" format="BYXF"/>
<Tensor arg-name="out" type="output" port-index="0" format="BYXF"/>
<Scalar arg-name="w" type="int" port-index="0" source="I.X" />
<Scalar arg-name="h" type="int" port-index="0" source="I.Y" />
<Scalar arg-name="c" type="int" port-index="0" source="I.F" />
<Scalar arg-name="stride" type="int" source="stride" />
</Parameters>
<WorkSizes dim="input,0" global="(Y+7)/8*8,1,1" local="8,1,1"/>
</CustomLayer>

Each custom layer is described with a CustomLayer node. As the example above shows, it contains a Kernel node that points to the compiled kernel binary through its Source sub-node, a Parameters node that binds the kernel arguments to IR values, and a WorkSizes node that defines the execution work grid.

The Parameters node supports Tensor nodes of types such as input, output, input_buffer, output_buffer, or data, and Scalar nodes, whose source attribute refers either to a tensor dimension (for example, I.X) or to a layer parameter from the IR (for example, stride).

Pass the configuration file to the inference runtime

NOTE: If both native and custom layer implementations are present, the custom kernel has priority over the native one.

Before loading the network that features the custom layers, provide a separate configuration file and load it using the InferenceEngine::Core::SetConfig() method with the PluginConfigParams::KEY_CONFIG_FILE key and the configuration file name as a value:

// Load custom layers
core.SetConfig({ { InferenceEngine::PluginConfigParams::KEY_CONFIG_FILE, "<path to the xml file>" } }, "MYRIAD");

Optionally, you can set the path to the custom layers description by passing the VPU_CUSTOM_LAYERS key with /path/to/your/customLayers.xml as its value in the network configuration:

std::map<std::string, std::string> networkConfig;
config["VPU_CUSTOM_LAYERS"] = "/path/to/your/customLayers.xml";
// Load custom layers in network config
auto exeNetwork = core.LoadNetwork(cnnNetwork, "MYRIAD", networkConfig);

Optimizing kernels with OpenCL™ for VPU (Intel® Neural Compute Stick 2)

This section provides optimization guidelines for writing custom layers with OpenCL for VPU devices. Knowledge of the general OpenCL programming model and the OpenCL kernel language is assumed and is not a subject of this section. The mapping of the OpenCL model to the VPU is described in the table below.

OpenCL Model VPU Mapping
Device code Executed on SHAVE cores
Private memory Mapped to CMX internal memory, limited to 100KB per work group, valid only while work group is executed
Local memory Mapped to CMX internal memory, limited to 100KB per work group, valid only while work group is executed
Global memory Mapped to DDR, used to pass execution preserved parameters for inputs, outputs and blobs
Work group Executed on a single SHAVE core iterating over multiple work items

Note that by the OpenCL specification, the work group execution order is not specified. This means that it is your responsibility to ensure that race conditions among work groups are not introduced. The custom layer runtime splits the work grid evenly among the available compute resources and executes the work groups in an arbitrary order. This static scheduling approach works best if the load is spread evenly across work groups, which is a typical case for deep learning kernels. The following guidelines are recommended for work group partitioning:

  1. Split work evenly across work groups.
  2. Adjust work group granularity to maintain an equal workload for all compute cores.
  3. Set the maximum number of cores using the max-shaves attribute for the CustomLayer node. This keeps more resources for the rest of the topology. It is also useful if the kernel scalability has reached its limit, which may happen while optimizing memory-bound kernels or kernels with poor parallelization.
  4. Try an alternative data layout (BFYX/BYXF) for the kernel if it improves work group partitioning or data access patterns. Consider full topology performance (not just the boost for a specific layer), since data conversion layers would be inserted automatically as appropriate.

The offline OpenCL compiler (clc) features automatic vectorization over get_global_id(0) if uniform access is detected. For example, the kernel below can be automatically vectorized:

__kernel void cvtf32f16(__global float* restrict inImage, __global half* restrict outImage,
float scale, float bias)
{
int idx = get_global_id(0) + get_global_id(1) * get_global_size(0) + get_global_id(2) * get_global_size(0) * get_global_size(1);
outImage[idx] = convert_half(inImage[idx]*scale+bias);
}

However, this work-group based vectorizer (WGV) conflicts with the default LLVM vectorizer based on superword-level parallelism (SLP) for the current compiler version. Manual vectorization is recommended to provide the best performance for non-uniform code patterns. WGV works if and only if vector types aren't used in the code.

Here is a short list of optimization tips:

  1. Help the auto-vectorizer by ensuring non-aliasing pointers for kernel parameters: add restrict wherever possible.
    • This may give a performance boost, especially for kernels with unrolling, like ocl_grn from the example below.
    • It is important to place restrict markers for kernels with manually vectorized code as well. In the ocl_grn kernel below, the unrolled version without restrict is up to 20% slower than the most optimal one, which combines unrolling and restrict.
  2. Put #pragma unroll N on your loop header. The compiler does not trigger unrolling by default, so it is your responsibility to annotate the code with pragmas as appropriate. The ocl_grn version with #pragma unroll 4 is up to 50% faster, most of which comes from unrolling the first loop. The reason is that LLVM is generally better at scheduling 3-stage loops (load-compute-store), while the first loop, variance += (float)(src_data[c*H*W + y*W + x] * src_data[c*H*W + y*W + x]);, is only 2-stage (load-compute), so pay attention to unrolling such cases first. The unrolling factor is loop-dependent: choose the smallest number that still improves performance, as an optimum between kernel size and execution speed. For this specific kernel, changing the unroll factor from 4 to 6 gives the same performance, so an unrolling factor of 4 is the optimum. For Intel® Neural Compute Stick 2, unrolling is combined with automatic software pipelining of the load, compute, and store stages:
    __kernel void ocl_grn(__global const half* restrict src_data, __global half* restrict dst_data, int C, float bias)
    {
    int x = get_global_id(0);
    int W = get_global_size(0);
    int y = get_global_id(1);
    int H = get_global_size(1);
    float variance = bias + 1e-9f;
    #pragma unroll 4
    for (int c = 0; c < C; c++)
    variance += (float)(src_data[c*H*W + y*W + x] * src_data[c*H*W + y*W + x]);
    variance = 1.f / native_sqrt(variance);
    #pragma unroll 4
    for (int c = 0; c < C; c++)
    dst_data[c*H*W + y*W + x] = (half)((float)src_data[c*H*W + y*W + x] * variance);
    }
    To check the efficiency of WGV, you can compare performance of the kernel above with the kernel below, which is manually vectorized over width:
    __kernel void ocl_grn_line(__global const half* restrict src_data, __global half* restrict dst_data, int C, int W, float bias)
    {
    int y = get_global_id(1);
    int H = get_global_size(1);
    for (int x = 0; x < W/8; x++)
    {
    float8 variance = (float8)(bias+1e-9f);
    #pragma unroll 4
    for (int c = 0; c < C; c++)
    {
    __global const half8* restrict src_line = ((__global const half8 * restrict)(src_data + c*H*W + y*W));
    half8 sh = src_line[x];
    variance += convert_float8(sh*sh);
    }
    variance = 1.f/native_sqrt(variance);
    #pragma unroll 4
    for (int c = 0; c < C; c++)
    {
    __global const half8* restrict src_line = ((__global const half8 * restrict)(src_data + c*H*W + y*W));
    __global half8* restrict dst_line = ((__global half8 * restrict)(dst_data + c*H*W + y*W));
    dst_line[x] = convert_half8(convert_float8(src_line[x])*variance);
    }
    }
    for (int x = W/8*8; x < W; x++)
    {
    float variance = bias+1e-9f;
    #pragma unroll 4
    for (int c = 0; c < C; c++)
    variance += (float)(src_data[c*H*W + y*W + x]*src_data[c*H*W + y*W + x]);
    variance = 1.f/native_sqrt(variance);
    #pragma unroll 4
    for (int c = 0; c < C; c++)
    dst_data[c*H*W + y*W + x] = (float)src_data[c*H*W + y*W + x]*variance;
    }
    }
    Both versions perform the same, but the second one has more complex code.
  3. If it is easy to predict the work group size, you can also use the reqd_work_group_size kernel attribute to ask the compiler to unroll the code up to the local size of the work group (see the snippet after this list). Note that if the kernel is actually executed with a different work group configuration, the result is undefined.
  4. Prefer half-precision compute if it keeps reasonable accuracy. 16-bit float is a native type for Intel® Neural Compute Stick 2, and most of the half_* functions are mapped to a single hardware instruction. Use the standard native_* functions for the rest of the types.
  5. Prefer the convert_half function over vstore_half if conversion to 32-bit float is required: convert_half is mapped to a single hardware instruction. For the cvtf32f16 kernel above, the same line written with vstore_half is about eight times slower than the convert_half version shown.
  6. Mind early exits. An early exit may be extremely costly for the current version of the clc compiler due to conflicts with the auto-vectorizer. The generic advice is to set up the local size in the x dimension equal to the input and/or output width. If it is impossible to define a work grid that exactly matches the inputs and/or outputs and thus eliminate the checks (for example, if (get_global_id(0) >= width) return;), use a line-wise kernel variant with manual vectorization.
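The snippet below illustrates tip 3. The attribute syntax is standard OpenCL; the kernel name and body are hypothetical placeholders:

// Hypothetical example for tip 3: the attribute promises the compiler that the
// kernel is always enqueued with a local work group size of 8x1x1
#pragma OPENCL EXTENSION cl_khr_fp16 : enable
__kernel __attribute__((reqd_work_group_size(8, 1, 1)))
void copy_line(__global const half* restrict src, __global half* restrict dst)
{
    const int idx = get_global_id(0);
    dst[idx] = src[idx];
}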

The kernel example below demonstrates the impact of early exits on kernel performance.

// Initial version
__kernel void reorg(const __global half* restrict src, __global half* restrict out, int stride)
{
int w = get_global_id(0);
int W = get_global_size(0);
int h = get_global_id(1);
int H = get_global_size(1);
int c = get_global_id(2);
int C = get_global_size(2);
int C2 = C/(stride*stride);
int offset = c / C2;
int c2 = c - C2 * offset;
int H2 = H*stride;
int W2 = W*stride;
int h2 = h*stride + offset / stride;
int w2 = w*stride + offset - stride * (offset / stride);
out[W*H*c + W*h + w] = src[W2*H2*c2 + W2*h2 + w2];
}

This reorg kernel is auto-vectorizable, but an input for the Yolo v2 topology is NCHW=<1,64,26,26>, which is not a multiple of the vector width (8 for the half data type). As a result, the Inference Engine does not select the auto-vectorized kernel. To compare the performance of the auto-vectorized and scalar versions of the kernel, change the input size to NCHW=<1,64,26,32>. This allows the auto-vectorized version to be selected by the Inference Engine and can give you about a 30% uplift. Since the auto-vectorized version is faster, it makes sense to enable it for the Yolo v2 topology input size by setting the local size to a multiple of the vector width (for example, 32) and adjusting the global sizes accordingly. As a result, the execution work grid exceeds the actual input dimensions, so out-of-bounds checks must be inserted. See the updated kernel version below:

// Version with out-of-bound checks added
__kernel void reorg(const __global half* restrict src, __global half* restrict out, int W, int stride)
{
int w = get_global_id(0);
w = min(w, W-1);
int h = get_global_id(1);
int H = get_global_size(1);
int c = get_global_id(2);
int C = get_global_size(2);
int C2 = C/(stride*stride);
int offset = c / C2;
int c2 = c - C2 * offset;
int H2 = H*stride;
int W2 = W*stride;
int h2 = h*stride + offset / stride;
int w2 = w*stride + offset - stride * (offset / stride);
out[W*H*c + W*h + w] = src[W2*H2*c2 + W2*h2 + w2];
}

This code performs the same as the initial (scalar) kernel above due to the branching overhead. If you replace the min/max expression w = min(w, W-1); with if (w >= W) return;, the runtime increases up to 2x compared to the code without branching (initial version).

If branching is inevitable for your element-based kernel, it's recommended to change the scheme to line-based. See the kernel variant below:

// Line-wise version
__kernel void reorg(const __global half* restrict src, __global half* restrict out, int H, int W, int stride)
{
int h = min((int)get_global_id(0), H-1);
int c = get_global_id(1);
int C = get_global_size(1);
int C2 = C/(stride*stride);
int offset = c / C2;
int c2 = c - C2 * offset;
int H2 = H*stride;
int W2 = W*stride;
for (int w = 0; w < W; ++w)
{
int h2 = h*stride + offset / stride;
int w2 = w*stride + offset - stride * (offset / stride);
out[W*H*c + W*h + w] = src[W2*H2*c2 + W2*h2 + w2];
}
}

This decreases the execution time by up to 40% compared to the best-performing vectorized kernel without early exits (initial version).

  7. Reuse computations among work items by using line-based kernels or by sharing values through __local memory.
  8. Improve data access locality. Most custom kernels are memory-bound, while convolution and fully connected layers are hardware-implemented. The code below demonstrates a further optimized version of the reorg kernel unrolled by stride:
    // Unrolled line-wise version
    __kernel void reorg_unrolled_by_stride(const __global half* restrict src, __global half* restrict dst,
    int H, int W, int stride)
    {
    int h = min((int)get_global_id(0), H-1);
    int c2 = get_global_id(1);
    int C2 = get_global_size(1);
    int C = C2*stride*stride;
    int H2 = H*stride;
    int W2 = W*stride;
    for (int stride_y = 0; stride_y < stride; stride_y++)
    for (int stride_x = 0; stride_x < stride; stride_x++)
    for (int w2 = 0, w = 0; w < W; w2 += stride, w++)
    dst[W*H*C2*(stride_y*stride+stride_x) + W*H*c2 + W*h + w] = src[W2*H2*c2 + W2*h*stride + W2*stride_y + w2 + stride_x];
    }
    The src data in this case is loaded only once. As a result, the cycle count drops by up to 45% compared to the line-wise version.
  9. Copy data from __global to __local or __private memory if the data is accessed more than once (see the sketch after this list). Access to __global memory is orders of magnitude slower than access to __local/__private memory due to the statically scheduled pipeline, which stalls completely on a memory access without any prefetch. The same recommendation applies to scalar loads/stores from/to a __global pointer, since work-group copying can be done in a vector fashion.
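The sketch below illustrates tip 9 under stated assumptions: the kernel name, the LINE_LEN constant, and the simple 3-tap averaging are hypothetical and only show the copy-then-process pattern.

#pragma OPENCL EXTENSION cl_khr_fp16 : enable
// Hypothetical sketch for tip 9: stage a row of data in __local memory once,
// then let the work items read it repeatedly from fast memory
#define LINE_LEN 256 // assumed maximum line length for this illustration
__kernel void average_line(__global const half* restrict src, __global half* restrict dst, int W)
{
    __local half line[LINE_LEN];
    const int row = get_group_id(0);
    // Cooperative copy from __global to __local, one stride per work item
    for (int i = get_local_id(0); i < W; i += get_local_size(0))
        line[i] = src[row * W + i];
    barrier(CLK_LOCAL_MEM_FENCE);
    // Each work item now reads its neighbors from __local memory instead of __global
    for (int i = get_local_id(0); i < W; i += get_local_size(0)) {
        half left  = (i > 0)     ? line[i - 1] : line[i];
        half right = (i < W - 1) ? line[i + 1] : line[i];
        dst[row * W + i] = (left + line[i] + right) / (half)3.0f;
    }
}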

See Also