Optimization Guide


The purpose of this document is to give you performance-related insights to every step of the network deployment process.

For information on the general workflow, refer to the documentation in See Also. For an example Inference Engine API snippet, see Request-Based API and “GetBlob” Idiom.

Deep Learning Inference Engine Overview

Deep Learning Inference Engine is a part of Intel® Deep Learning Deployment Toolkit (Intel® DL Deployment Toolkit) and OpenVINO™ toolkit. Inference Engine facilitates deployment of deep learning solutions by delivering a unified, device-agnostic API.

Below, there are the three main steps of the deployment process:

  1. Conversion
    Trained models are converted from a specific framework (like Caffe* or TensorFlow*) to a framework-agnostic Intermediate Representation (IR) format.
    • Performance flow: This is an offline step where general topology-level optimizations happen automatically (see Model Optimizer Knobs Related to Performance).
    • Tools: Intel DL Deployment Toolkit features the Model Optimizer that enables automatic and seamless transition from the training environment to the deployment environment.
  2. Model Inference/Execution
    After conversion, Inference Engine consumes the IR to perform inference. While Inference Engine API itself is target-agnostic, internally, it has a notion of plugins, which are device-specific libraries facilitating the hardware-assisted acceleration.
    • Performance flow: Upon conversion to IR, the execution starts with existing Inference Engine samples to measure and tweak the performance of the network on different devices.
      > NOTE: While consuming the same IR, each plugin performs additional device-specific optimizations at load time, so the resulting accuracy might differ. Also, enabling and optimizing custom kernels is error-prone (see Optimizing Custom Kernels).
    • Tools: Beyond inference performance that samples report (see Latency vs. Throughput), you can get further device- and kernel-level timing with the Inference Engine performance counters and Intel® VTune™.
  3. Integration to the product
    After model inference is verified with the samples, the Inference Engine code is typically integrated into a real application or pipeline.
    • Performance flow: The most important point is to preserve the sustained performance achieved with the stand-alone model execution. Take precautions when combining with other APIs and be careful testing the performance of every integration step.
    • Tools: Beyond tracking the actual wall-clock time of your application, see Intel® VTune™ Examples for application-level and system-level information.

Gathering the Performance Numbers

Performance data comes in a variety of forms. For example, one of the the most common performance metrics is latency, which represents the time required to complete a unit of work (for instance, inference time for a single image). In the following sections, you will see important recommendations for measuring the performance.

Measure the Proper Set of Operations

When evaluating performance of your model with the Inference Engine, you must measure the proper set of operations. To do so, consider the following tips:

NOTE: Some image pre-processing can be baked into the IR and accelerated. For more information, refer to Model Optimizer Knobs Related to Performance.

Latency vs. Throughput

In the asynchronous case (see Request-Based API and “GetBlob” Idiom), no direct way exists to measure the performance of an individual infer request. Instead, you typically execute multiple requests asynchronously and measure the throughput in images per second by dividing the number of images that were processed by the processing time. For example, see Image Classification Sample Async or other Async Inference Engine demos.

In contrast, for the latency-oriented tasks, the time to a single frame is more important. This is how the Image Classification Sample and other getting started examples measure the performance.

NOTE: Most samples also support batching (automatically packing multiple input images into a single request). However, high batch size results in a latency penalty. So for more real-time oriented usages, lower batch sizes (as low as a single input) are usually used. In contrast, batch with many (potentially tens of) input images might be required to achieve optimal throughput.

Refer to the Benchmark App sample, which allows latency vs. throughput measuring.

Comparing Performance with Native/Framework Code

When comparing the Inference Engine performance with the framework or another reference code, make sure that both versions are as similar as possible:

Getting Credible Performance Numbers

You need to build your performance conclusions on reproducible data. Do the performance measurements with a large number of invocations of the same routine. Since the first iteration is almost always significantly slower than the subsequent ones, you can use an aggregated value for the execution time for final projections:

Refer to the Inference Engine Samples for code examples for the performance measurements. Almost every sample, except interactive demos, has a -ni option to specify the number of iterations.

Model Optimizer Knobs Related to Performance

Networks training is typically done on high-end data centers, using popular training frameworks like Caffe*, TensorFlow*, and MXNet*. Model Optimizer converts the trained model in original proprietary formats to IR that describes the topology. IR is accompanied by a binary file with weights. These files in turn are consumed by the Inference Engine and used for scoring.


As described in the Model Optimizer Guide, there are a number of device-agnostic optimizations the tool performs. For example, certain primitives like linear operations (BatchNorm and ScaleShift), are automatically fused into convolutions. Generally, these layers should not be manifested in the resulting IR:


The picture above shows Caffe* Resnet269* topology. The left model is the original model, and the one on the right (after conversion) is the resulting model that the Model Optimizer produces, with BatchNorm and ScaleShift layers fused into the convolution weights rather than constituting separate layers.

If you still see these operations, inspect the Model Optimizer output carefully while searching for warnings, such as on the tool being unable to fuse. For example, non-linear operations (like activations) in between convolutions and linear operations might prevent the fusing. If performance is of concern, try to change (and potentially re-train) the topology. Refer to the Model Optimizer Guide for more optimizations.

Notice that the activation (_relu) is not touched by the Model Optimizer, and while it can be merged into convolution as well, this is rather a device-specific optimization, covered by Inference Engine during the model loading time. You are encouraged to inspect performance counters from plugins that should indicate that these particular layers are not executed (“Optimized out”). For more information, refer to Internal Inference Performance Counters.


Device-Specific Optimizations

The Inference Engine supports several target devices (CPU, GPU, Intel® Movidius™ Myriad™ 2 VPU, Intel® Movidius™ Myriad™ X VPU, and FPGA), and each of them has a corresponding plugin. If you want to optimize a specific device, you must keep in mind the following tips to increase the performance.

CPU Checklist

CPU plugin completely relies on the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) for major primitives acceleration, for example, Convolutions or FullyConnected.

The only hint from the Inference Engine execution is how the major primitives are accelerated (and you cannot change this). For example, on the Core machines, you should see variations of the jit_avx2 when inspecting the internal inference performance counters. If you are an advanced user, you can further trace the CPU execution..

Internally, the Inference Engine CPU plugin has a threading abstraction level, which allows for compiling the open source version with either Intel® Threading Building Blocks (Intel® TBB) or OpenMP* as a parallelism solution. This is particularly important for aligning threading model with the rest of your application (and any third-party libraries that you use) to avoid oversubscription. For more information, see Note on the App-Level Threading section.

The OpenVINO™ toolkit currently comes pre-compiled with OpenMP*, so all general and Intel-specific OpenMP* environment settings like OMP_NUM_THREADS apply. To align configuration capabilities of the Inference Engine compiled with TBB, use general CPU configuration options rather than OpenMP-specific settings. Note that it also affects the rest of the program beyond Inference Engine itself.

Other general recommendations:

Throughput Mode for CPU

Unlike most accelerators, CPU is perceived as an inherently latency-oriented device. With the 2018 R5 release, the OpenVINO introduced the "throughput" mode for the CPU, which allows the Inference Engine to efficiently run multiple inference requests on the CPU simultaneously, greatly improving the throughput.

Internally, the execution resources are split/pinned into execution "streams". This feature provides much better performance for the networks that are not originally scaled well with a number of threads (for example, lightweight topologies). This is especially pronounced for the many-core server machines.

Try the Benchmark App sample and play with number of infer requests running in parallel. The rule of thumb is tying up to a number of CPU cores on your machine. For example, on an 8-core CPU, compare the -nireq 1 (which is a legacy scenario) to the 2, 4, and 8 requests.

In addition to the number of the inference requests in flight (and hence, the CPU execution streams), you can also play with the batch size to find the best throughput.

If your application is hard or impossible to change in accordance with the multiple-requests logic, consider the "multiple-instance" trick to improve the throughput:

GPU Checklist

Inference Engine relies on the Compute Library for Deep Neural Networks (clDNN) for Convolutional Neural Networks acceleration on Intel® GPUs. Internally, clDNN uses OpenCL™ to implement the kernels. Thus, many general tips apply:

Intel® Movidius™ Myriad™ X Visual Processing Unit

Since Intel® Movidius™ Myriad™ X Visual Processing Unit (Intel® Movidius™ Myriad™ 2 VPU) communicates with the host over USB, minimum two infer requests in flight are recommended to hide the data transfer costs. See Request-Based API and “GetBlob” Idiom and Image Classification Sample Async for more information.


Below are listed the most important tips for the efficient usage of the FPGA:


Heterogeneous execution (constituted by the dedicated Inference Engine “Hetero” plugin) enables to schedule a network inference to the multiple devices.

Typical Heterogeneous Scenarios of Concern

The primary points for executing a network in heterogeneous mode are as follows:

Heterogeneous Flow

The execution through heterogeneous plugin has three distinct steps:

  1. Applying affinity setting for the layers, that is, binding them to the devices.
    • This can be done automatically using fallback priorities, or on the per-layer basis.
    • The affinity setting is made before loading the network to the (heterogeneous) plugin, so this is always a static setup with respect to execution.
  2. Loading a network to the heterogeneous plugin, which internally splits the network into subgraphs.
    You can check the decisions the plugin makes, see Analysing the Heterogeneous Execution.
  3. Executing the infer requests. From user’s side, this looks identical to a single-device case, while internally, the subgraphs are executed by actual plugins/devices.

Performance benefits of the heterogeneous execution depend heavily on the communications granularity between devices. If transmitting/converting data from one part device to another takes more time than the execution, the heterogeneous approach makes little or no sense. Using Intel® VTune™ helps to visualize the execution flow on a timeline (see Intel® VTune™ Examples).

Similarly, if there are too much subgraphs, the synchronization and data transfers might eat the entire performance. In some cases, you can define the (coarser) affinity manually to avoid sending data back and forth many times during one inference.

The general affinity “rule of thumb” is to keep computationally-intensive kernels on the accelerator, and auxiliary, or secondary, kernels on the CPU. Notice that this includes the granularity considerations. For example, running some custom activation (that comes after every accelerator-equipped convolution) on the CPU might result in performance degradation due to too much data type and/or layout conversions, even though the activation itself can be extremely fast. In this case, it might make sense to consider implementing the kernel for the accelerator (see Optimizing Custom Kernels). The conversions typically manifest themselves as outstanding (comparing to CPU-only execution) Reorders (see Internal Inference Performance Counters).

For general details on the heterogeneous plugin, refer to the corresponding section in the Inference Engine Developer Guide.

Trying the Heterogeneous Plugin with Inference Engine Samples

Every Inference Engine sample that supports the -d (device) option now accepts the new HETERO option.

For example, here is a command to run an Object Detection Sample SSD Sample:

./object_detection_sample_ssd -m <path_to_model>/ModelSSD.xml -i <path_to_pictures>/picture.jpg -d HETERO:FPGA,CPU


You can point more than two devices: -d HETERO:FPGA,GPU,CPU.

Heterogeneous Scenarios with FPGA

As FPGA is considered as an inference accelerator, most performance issues are related to the fact that due to the fallback, the CPU can be still used quite heavily. In this case, it minimizes the busy wait time when OpenMP threads loop in between parallel regions:

NOTE: General threading tips (see Note on the App-Level Threading) apply well, even when the entire topology fits the FPGA, because there is still a host-side code for data pre- and post-processing.

General Tips on GPU/CPU Execution

The following tips are provided to give general guidance on optimizing execution on GPU/CPU devices.

Analyzing Heterogeneous Execution

There is a dedicated configuration option that enables dumping the visualization of the subgraphs created by the heterogeneous plugin:

enginePtr = dispatcher.getPluginByDevice("HETERO:FPGA,CPU");
InferencePlugin plugin(enginePtr);
plugin.SetConfig({ {KEY_HETERO_DUMP_GRAPH_DOT, YES} });

After enabling the configuration key, the heterogeneous plugin generates two files:

You can use GraphViz* utility or *.dot converters (for example, to .png or .pdf), like xdot*, available on Linux* OS with sudo apt-get install xdot. Below is an example of the output trimmed to the two last layers (one executed on the FPGA and another on the CPU):


You can also use performance data (in samples, it is an option -pc) to get performance data on each subgraph. Refer to Internal Inference Performance Counters for more information.

Optimizing Custom Kernels

Few Initial Performance Considerations

Today, the Inference Engine supports only CPU and GPU custom kernels. Typically, custom kernels are used to quickly implement missing layers for new topologies. You should not override standard layers implementation, especially on the critical path, for example, Convolutions. Also, overriding existing layers can disable some existing performance optimizations, such as fusing.

It is usually easier to start with the CPU extension and switch to the GPU after debugging with the CPU path. Sometimes, when the custom layers are at the very end of your pipeline, it is easier to implement them as regular post-processing in your application without wrapping them as kernels. This is particularly true for the kernels that do not fit the GPU well, for example, output bounding boxes sorting. In many cases, you can do such post-processing on the CPU.

There are many cases when sequence of the custom kernels can be implemented as a “super” kernel allowing to save on data accesses.

Finally, with the heterogeneous execution, it is possible to execute the vast majority of intensive computations with the accelerator and keep the custom pieces on the CPU. The tradeoff is granularity/costs of communication between different devices.

For more details on the API of the custom layers, see Custom Layers Support in Inference Engine

Understanding Performance Contribution of Your Custom Kernels

In most cases, before actually implementing a full-blown code for the kernel, you can estimate the final performance by doing a simple stub kernel that does nothing (and thus is "infinitely" fast) just to let the topology execute end-to-end. Of course, the estimation is valid only if the kernel output does not affect the performance, for instance, if its output is not driving any branches or loops.

Other than that, when implementing the kernels, you can try the methods from the previous chapter to understand actual contribution and, if any custom kernel is in the hotspots, optimize that.

Few Device-Specific Tips

Plugging Inference Engine to Applications

Note on the App-Level Threading

Letting the Inference Engine Accelerate Image Pre-processing/Conversion

In many cases, a network expects a pre-processed image, so make sure you do not perform unnecessary steps in your code:

Notice that in many cases, you can directly share the (input) data with the Inference Engine.

Basic Interoperability with Other APIs

The general approach for sharing data between Inference Engine and media/graphics APIs like Intel® Media Server Studio (Intel® MSS) is based on sharing the system memory. That is, in your code, you should map or copy the data from the API to the CPU address space first.

For Intel MSS, it is recommended to perform a viable pre-processing, for example, crop/resize, and then convert to RGB again with the Video Processing Procedures (VPP). Then lock the result and create an Inference Engine blob on top of that. The resulting pointer can be used for the SetBlob:

//Lock Intel MSS surface
mfxFrameSurface1 *frame_in; //Input MSS surface.
mfxFrameAllocator* pAlloc = &m_mfxCore.FrameAllocator();
pAlloc->Lock(pAlloc->pthis, frame_in->Data.MemId, &frame_in->Data);
//Inference Engine code

WARNING: The InferenceEngine::NHWC layout natively supported only by Intel Movidius Myriad™ 2 VPU device today, internal conversion might happen for the rest of plugins.

1 /* batch, N*/,
(size_t) frame_in->Info.Height /* Height */,
(size_t) frame_in->Info.Width /* Width */,
3 /*Channels,*/,
TensorDesc desc(InferenceEngine::Precision::U8, dims_src, InferenceEngine::NHWC);
/* wrapping the surface data, as RGB is interleaved, need to pass only ptr to the R, notice that this wouldn’t work with planar formats as these are 3 separate planes/pointers*/
InferenceEngine::TBlob<uint8_t>::Ptr p = InferenceEngine::make_shared_blob<uint8_t>( desc, (uint8_t*) frame_in->Data.R);
inferRequest.SetBlob(“input”, p);
//Make sure to unlock the surface upon inference completion, to return the ownership back to the Intel MSS
pAlloc->Unlock(pAlloc->pthis, frame_in->Data.MemId, &frame_in->Data);

Alternatively, you can use RGBP (planar RGB) output from Intel MSS. This allows to wrap the (locked) result as regular NCHW which is generally friendly for most plugins (unlike NHWC). Then you can use it with SetBlob just like in previous example:

1 /* batch, N*/,
3 /*Channels,*/,
(size_t) frame_in->Info.Height /* Height */,
(size_t) frame_in->Info.Width /* Width */,
TensorDesc desc(InferenceEngine::Precision::U8, dims_src, InferenceEngine::NCHW);
/* wrapping the RGBP surface data*/
InferenceEngine::TBlob<uint8_t>::Ptr p = InferenceEngine::make_shared_blob<uint8_t>( desc, (uint8_t*) frame_in->Data.R);
inferRequest.SetBlob("input", p);

The only downside of this approach is that VPP conversion to RGBP is not hardware accelerated (and performed on the GPU EUs).

OpenCV* Interoperability Example

Unlike APIs that use dedicated address space and/or special data layouts (for instance, compressed OpenGL* textures), regular OpenCV data objects like cv::Mat reside in the conventional system memory. That is, the memory can be actually shared with the Inference Engine and only data ownership to be transferred.

Again, if the OpenCV and Inference Engine layouts match, the data can be wrapped as Inference Engine (input/output) blob. Notice that by default, Inference Engine accepts the planar and not interleaved inputs in NCHW, so the NHWC (which is exactly the interleaved layout) should be specified explicitly:

WARNING: The InferenceEngine::NHWC layout natively supported only by Intel Movidius Myriad™ 2 VPU device today, internal conversion might happen for the rest of plugins.

cv::Mat frame = …;//regular cv_8uc3 image, interleaved
// creating blob that wraps the OpenCV’s Mat
// (the data it points should persists until the blob is released):
1 /* batch, N*/,
(size_t)frame.rows /* Height */,
(size_t)frame.cols /* Width */,
(size_t)frame.channels() /*Channels,*/,
TensorDesc desc(InferenceEngine::Precision::U8, dims_src, InferenceEngine::NHWC);
InferenceEngine::TBlob<uint8_t>::Ptr p = InferenceEngine::make_shared_blob<uint8_t>( desc, (uint8_t*)frame.data, frame.step[0] * frame.rows);
inferRequest.SetBlob(“input”, p);
// similarly, you can wrap the output tensor (let’s assume it is FP32)
// notice that it should be alos explicitly stated as NHWC with setLayout
const float* output_data = output_blob->buffer().
cv::Mat res (rows, cols, CV_32FC3, output_data, CV_AUTOSTEP);

Notice that original cv::Mat/blobs cannot be used simultaneously by the application and the Inference Engine. Alternatively, the data that the pointer references to can be copied to unlock the original data and return ownership to the original API.

Request-Based API and “GetBlob” Idiom

Infer Request based API offers two types of request: Sync and Async. The Sync is considered below. The Async splits (synchronous) Infer into StartAsync and Wait (see Inference Engine Async API).

More importantly, an infer request encapsulates the reference to the “executable” network and actual inputs/outputs. Now, when you load the network to the plugin, you get a reference to the executable network (you may consider that as a queue). Actual infer requests are created by the executable network:

CNNNetReader network_reader;
auto network = network_reader.getNetwork();
InferenceEngine::InputsDataMap input_info(network.getInputsInfo());
InferenceEnginePluginPtr engine_ptr = PluginDispatcher(pluginDirs).getSuitablePlugin(TargetDevice::eGPU);
InferencePlugin plugin(engine_ptr);
auto executable_network = plugin.LoadNetwork(network, {/*opt config*/});
auto infer_request = executable_network.CreateInferRequest();
for (auto & item : inputInfo) {
std::string input_name = item->first;
auto input = infer_request.GetBlob(input_name);
/** Lock/Fill input tensor with data **/
unsigned char* data =
infer_request->Infer();//blocking call

GetBlob is a recommend way to communicate with the network, as it internally allocates the data with right padding/alignment for the device. For example, the GPU inputs/outputs blobs are mapped to the host (which is fast) if the GetBlob is used. But if you called the SetBlob, the copy (from/to the blob you have set) into the internal GPU plugin structures will happen.

Performance Aspects of Running Multiple CPU Requests Simultaneously

By default, the Inference Engine is compiled with the OpenMP as a parallelism solution. Default OpenMP behavior is that every new thread that calls the CPU plugin implicitly spawns a full set of OpenMP threads (as this new thread becomes an OpenMP master for them). This happens for every InferenceEngine::ExecutableNetwork, as every executable networkgets its own dedicated thread automatically.

If your application simultaneously executes infer requests from the multiple threads (or networks) on the CPU, make sure you do not oversubscribe the machine:

Inference Engine Async API

Inference Engine Async API can improve overall frame rate of the application. While accelerator is busy with the inference, the application can continue doing things on the host rather than wait for the inference to complete.

In the example below, inference is applied to the results of the video decoding. So it is possible to keep two parallel infer requests, and while the current is processed, the input frame for the next is being captured. This essentially hides the latency of capturing, so that the overall frame rate is rather determined only by the slowest part of the pipeline (decoding IR inference) and not by the sum of the stages.

You can compare the pseudo-codes for the regular and async-based approaches:

There are important performance caveats though: for example, the tasks that run in parallel should try to avoid oversubscribing the shared compute resources. If the inference is performed on the FPGA and the CPU is essentially idle, it makes sense to do things on the CPU in parallel. However, multiple infer requests can oversubscribe that. Notice that heterogeneous execution can implicitly use the CPU, refer to Heterogeniouty .

Also, if the inference is performed on the graphics processing unit (GPU), it can take little gain to do the encoding, for instance, of the resulting video, on the same GPU in parallel, because the device is already busy.

Refer to the Object Detection SSD Demo (latency-oriented Async API showcase) and Image Classification Sample Async (throughput-oriented) for complete examples of the Async API in action.

Using Tools

Whether you are tuning for the first time or doing advanced performance optimization, you need a a tool that provides accurate insights. Intel® VTune™ Amplifier gives you the tool to mine it and interpret the profiling data.

Alternatively, you can gather the raw profiling data that samples report, the second chapter provides example of how to interpret these.

Intel® VTune™ Examples

All major performance calls of the Inference Engine are instrumented with Instrumentation and Tracing Technology APIs. This allows viewing the Inference Engine calls on the Intel® VTune™ timelines and aggregations plus correlating them to the underlying APIs, like OpenCL. In turn, this enables careful per-layer execution breakdown.

When choosing the Analysis type in Intel® VTune™ Amplifier, make sure to select the Analyze user tasks, events, and counters option:


See the corresponding section in the Intel® VTune™ Amplifier 2018 User's Guide for details.

Example of Inference Engine calls:

Similarly, you can use any GPU analysis in the Intel VTune Amplifier and get general correlation with Inference Engine API as well as the execution breakdown for OpenCL kernels.

Just like with regular native application, further drill down in the counters is possible, however, this is mostly useful for optimizing custom kernels. Finally, with the Intel VTune Amplifier, the profiling is not limited to your user-level code (see the corresponding section in the Intel® VTune™ Amplifier 2018 User's Guide).

Internal Inference Performance Counters

Almost every sample (inspect command-line options for a specific sample with -h) supports a -pc command that outputs internal execution breakdown. Refer to the samples code for the actual Inference Engine API behind that.

Below is example of CPU plugin output for a network (since the device is CPU, the layers wall clock realTime and the cpu time are the same):

conv1 EXECUTED layerType: Convolution realTime: 706 cpu: 706 execType: jit_avx2
conv2_1_x1 EXECUTED layerType: Convolution realTime: 137 cpu: 137 execType: jit_avx2_1x1
fc6 EXECUTED layerType: Convolution realTime: 233 cpu: 233 execType: jit_avx2_1x1
fc6_nChw8c_nchw EXECUTED layerType: Reorder realTime: 20 cpu: 20 execType: reorder
out_fc6 EXECUTED layerType: Output realTime: 3 cpu: 3 execType: unknown
relu5_9_x2 OPTIMIZED_OUT layerType: ReLU realTime: 0 cpu: 0 execType: undef

This contains layers name (as seen in IR), layers type and execution statistics. Notice the OPTIMIZED_OUT, which indicates that the particular activation was fused into adjacent convolution. Also, the unknown stays for the Inference Engine specific CPU (helper) primitives that are not part of the Intel MKL-DNN.

Notice that there are some helper layers in the CPU execution breakdown, which were not presented in the original topology. These are automatically added by the plugin. For example, the Reorder re-packs the Intel MKL-DNN internal (blocked) layout to the regular plain NCHW (that the user expects as the output). As explained in the Few Device-Specific Tips, if your custom kernels introduces a lot of outstanding/expensive Reorders, consider blocked implementation for the kernels.

Notice that in the heterogeneous cases, there will be additional information on which subgraph the statistics is about (the first subgraph is GPU, so its cpu/host time is really small compared to the actual realTime):

subgraph1: squeeze1x1 EXECUTED layerType: Convolution realTime: 227 cpu:3 execType: GPU
subgraph2: detection_out EXECUTED layerType: DetectionOutput realTime: 121 cpu:121 execType: unknown

As mentioned earlier, unknown here means CPU kernel with unknown (for example, not AVX2 or AVX512) acceleration path. Since FPGA execution does not separate individual kernels, only bulk execution/data transfer statistics is available:

subgraph1: 1. input preprocessing (mean data/FPGA):EXECUTED layerType: preprocessing realTime: 129 cpu: 129
subgraph1: 2. input transfer to DDR:EXECUTED layerType: realTime: 201 cpu: 0
subgraph1: 3. FPGA execute time:EXECUTED layerType: realTime: 3808 cpu: 0 subgraph1: 4. output transfer from DDR:EXECUTED layerType: realTime: 55 cpu: 0
subgraph1: 5. FPGA output postprocessing:EXECUTED layerType: realTime: 7 cpu: 7
subgraph1: 6. softmax/copy: EXECUTED layerType: realTime: 2 cpu: 2
subgraph2: out_prob: NOT_RUN layerType: Output realTime: 0 cpu: 0
subgraph2: prob: EXECUTED layerType: SoftMax realTime: 10 cpu: 10
Total time: 4212 microseconds

The softmax/copy is a glue layer that connects the FPGA subgraph to the CPU subgraph (and copies the data).

See Also