Getting Performance Numbers

Tip 1. Measure the Proper Set of Operations

When evaluating the performance of your model with the OpenVINO Runtime, you must measure the proper set of operations. To do so, consider the following tips:

  • Avoid including one-time costs like model loading.

  • Track separately the operations that happen outside the OpenVINO Runtime, like video decoding.

Note

Some image pre-processing can be baked into the IR and accelerated accordingly. For more information, refer to Embedding the Preprocessing. Also consider Runtime Optimizations of the Preprocessing.
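As a minimal illustration of these tips, the sketch below (Python, assuming a recent openvino package that provides "import openvino as ov"; the model path and input shape are illustrative) times only the inference call and keeps the one-time model compilation, as well as any decoding or user-side pre-processing, outside the measured region:

import time
import numpy as np
import openvino as ov

core = ov.Core()
# One-time cost: reading and compiling the model is excluded from the measurement.
compiled_model = core.compile_model("model.xml", "CPU")
request = compiled_model.create_infer_request()

# Video decoding or user-side pre-processing would be timed separately from this point.
data = np.random.rand(1, 3, 224, 224).astype(np.float32)  # illustrative input

start = time.perf_counter()
request.infer({0: data})  # only the inference execution is inside the timed region
latency_ms = (time.perf_counter() - start) * 1000
print(f"Single-inference latency: {latency_ms:.2f} ms")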

Tip 2. Getting Credible Performance Numbers

Base your performance conclusions on reproducible data. Perform the measurements with a large number of invocations of the same routine. Since the first iteration is almost always significantly slower than the subsequent ones, use an aggregated value of the execution time for final projections (a minimal aggregation sketch follows this list):

  • If the warm-up run does not help or the execution time still varies, run a large number of iterations and report the average (mean) of the results.

  • For time values that vary widely, consider the geometric mean.

  • Beware of throttling and other power oddities. A device can exist in one of several different power states. For better reproducibility of performance data when optimizing your model, consider fixing the device frequency. However, end-to-end (application) benchmarking should also be performed under real operational conditions.
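As a minimal sketch of such aggregation (plain Python; the per-iteration latencies are assumed to be collected after a warm-up run, for example with a timing loop like the one in Tip 1):

import statistics

# Illustrative per-iteration latencies (ms), collected after a warm-up run.
times_ms = [4.1, 3.9, 4.0, 4.2, 3.8]

mean_ms = statistics.mean(times_ms)
median_ms = statistics.median(times_ms)
geomean_ms = statistics.geometric_mean(times_ms)  # Python 3.8+

print(f"mean={mean_ms:.2f} ms, median={median_ms:.2f} ms, geomean={geomean_ms:.2f} ms")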

Tip 3. Measure Reference Performance Numbers with OpenVINO’s benchmark_app

To get performance numbers, use the dedicated Benchmark App sample, which is the best way to produce a performance reference. It has a lot of device-specific knobs, but the primary usage is as simple as:

$ ./benchmark_app -d GPU -m <model> -i <input>

to measure the performance of the model on the GPU, or:

$ ./benchmark_app -d CPU -m <model> -i <input>

to execute on the CPU instead.

Each of the OpenVINO-supported devices offers performance settings that have command-line equivalents in the Benchmark App. While these settings provide low-level control and can help reach the optimal model performance on a specific device, we suggest always starting the performance evaluation with the OpenVINO High-Level Performance Hints (an API-level equivalent is sketched after the list):

  • benchmark_app -hint tput -d 'device' -m 'path to your model'

  • benchmark_app -hint latency -d 'device' -m 'path to your model'
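The same hints can also be set programmatically when compiling a model. Below is a minimal sketch with the Python API; the property is passed here as a plain string key/value pair, which is one accepted form, and the model path is illustrative:

import openvino as ov

core = ov.Core()
# Let the device configure itself for throughput; use "LATENCY" for the latency hint.
compiled_model = core.compile_model("model.xml", "GPU", {"PERFORMANCE_HINT": "THROUGHPUT"})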

Comparing Performance with Native/Framework Code

When comparing the performance of the OpenVINO Runtime with that of the original framework or other reference code, make sure that both versions are as similar as possible:

  • Wrap the timing around exactly the inference execution (refer to the Benchmark App for examples).

  • Do not include model loading time.

  • Ensure the inputs are identical for the OpenVINO Runtime and the framework. For example, beware of random values used to populate the inputs (see the sketch after this list).

  • Consider Image Pre-processing and Conversion, while any user-side pre-processing should be tracked separately.

  • When applicable, leverage the Dynamic Shapes support.

  • If possible, demand the same accuracy. For example, TensorFlow allows FP16 execution, so when comparing to that, make sure to test the OpenVINO Runtime with FP16 as well.
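As a minimal sketch of keeping the inputs identical (plain Python with NumPy; the file name and shape are illustrative), generate the input once with a fixed seed, save it, and feed exactly the same array to both the reference framework and the OpenVINO Runtime:

import numpy as np

# Generate the test input once, with a fixed seed, and reuse it in both pipelines.
rng = np.random.default_rng(seed=42)
data = rng.random((1, 3, 224, 224), dtype=np.float32)
np.save("test_input.npy", data)

# Later, in either pipeline:
data = np.load("test_input.npy")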

Internal Inference Performance Counters and Execution Graphs

Finer-grained insights into the inference performance breakdown can be achieved with device-specific performance counters and/or execution graphs. Both the C++ and Python versions of the benchmark_app support the -pc command-line parameter, which outputs the internal execution breakdown.

For example, below is part of the performance counters for a quantized TensorFlow* implementation of the ResNet-50 model, inferred on the CPU Plugin. Notice that since the device is the CPU, the layers' wall-clock realTime and cpuTime are the same. Information about layer precision is also stored in the performance counters.

layerName                                                  execStatus  layerType     execType           realTime (ms)  cpuTime (ms)
resnet_model/batch_normalization_15/FusedBatchNorm/Add    EXECUTED    Convolution   jit_avx512_1x1_I8  0.377          0.377
resnet_model/conv2d_16/Conv2D/fq_input_0                   NOT_RUN     FakeQuantize  undef              0              0
resnet_model/batch_normalization_16/FusedBatchNorm/Add    EXECUTED    Convolution   jit_avx512_I8      0.499          0.499
resnet_model/conv2d_17/Conv2D/fq_input_0                   NOT_RUN     FakeQuantize  undef              0              0
resnet_model/batch_normalization_17/FusedBatchNorm/Add    EXECUTED    Convolution   jit_avx512_1x1_I8  0.399          0.399
resnet_model/add_4/fq_input_0                               NOT_RUN     FakeQuantize  undef              0              0
resnet_model/add_4                                          NOT_RUN     Eltwise       undef              0              0
resnet_model/add_5/fq_input_1                               NOT_RUN     FakeQuantize  undef              0              0

The execStatus column of the table includes the following possible values:

  • EXECUTED - the layer was executed by a standalone primitive,

  • NOT_RUN - the layer was not executed by a standalone primitive or was fused with another operation and executed in another layer's primitive.

The execType column of the table includes the inference primitives with specific suffixes. The layers are marked as follows:

  • Suffix I8 for layers that had 8-bit data type input and were computed in 8-bit precision

  • Suffix FP32 for layers computed in 32-bit precision

All Convolution layers are executed in int8 precision. The rest of the layers are fused into Convolutions using the post-operations optimization technique described in Internal CPU Plugin Optimizations. The output contains the layer names (as seen in the IR), layer types, and execution statistics.
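The same per-layer counters can also be retrieved programmatically. Below is a minimal sketch with the Python API (assuming the PERF_COUNT property and the profiling_info attribute of the inference request, as exposed in recent releases; the model path and input are illustrative):

import numpy as np
import openvino as ov

core = ov.Core()
# Enable collection of per-layer performance counters.
compiled_model = core.compile_model("model.xml", "CPU", {"PERF_COUNT": "YES"})
request = compiled_model.create_infer_request()
request.infer({0: np.random.rand(1, 3, 224, 224).astype(np.float32)})

for info in request.profiling_info:
    print(info.node_name, info.status, info.node_type, info.exec_type,
          info.real_time, info.cpu_time)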

Both benchmark_app versions also support the exec_graph_path command-line option, which instructs OpenVINO to output the same per-layer execution statistics, but in the form of a plugin-specific, Netron-viewable graph written to the specified file.
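For example, a run along the following lines (the output file name is illustrative) writes the execution graph alongside the usual report:

$ ./benchmark_app -m <model> -d CPU -exec_graph_path exec_graph.xml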

Notice that on some devices, the execution graphs and counters may add noticeable overhead. Also, especially when performance-debugging latency, keep in mind that the counters do not reflect the time spent in the plugin/device/driver/etc. queues. If the sum of the counters differs too much from the latency of an inference request, consider testing with fewer inference requests. For example, running a single OpenVINO stream with multiple requests would produce nearly identical counters to running a single inference request, yet the actual latency can be quite different.

Finally, the performance statistics in both the performance counters and the execution graphs are averaged, so such data for dynamically-shaped inputs should be measured carefully, ideally by isolating a specific shape and executing it multiple times in a loop to gather reliable data (see the example below).
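For instance, with the Benchmark App a specific shape can be pinned via the -shape option (the shape value below is illustrative), so that the reported counters correspond to that shape only:

$ ./benchmark_app -m <model> -d CPU -shape [1,3,224,224] -pc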

OpenVINO in general, and individual plugins in particular, are heavily instrumented with the Intel® Instrumentation and Tracing Technology (ITT). So, another option is to compile OpenVINO from the source code with ITT enabled and use tools like Intel® VTune™ Profiler to get a detailed inference performance breakdown and additional insights into the application-level performance on the timeline view.