Getting Performance Numbers

This guide explains how to use the benchmark_app to obtain performance numbers, how those numbers are reflected in internal inference performance counters and execution graphs, and how to use ITT and Intel® VTune™ Profiler to gain deeper performance insights.

Test performance with the benchmark_app

The benchmark_app is available in both C++ and Python versions, although the two differ slightly: the Python version ships with the OpenVINO Runtime installation, while the C++ version is provided as a code sample. For a detailed description, see: benchmark_app.

Make sure to install the latest release package with support for frameworks of the models you want to test. For the most reliable performance benchmarks, prepare the model for use with OpenVINO.

Running the benchmark application

The benchmark_app includes many device-specific options, but its basic usage is as simple as:

benchmark_app -m <model> -d <device> -i <input>

Each OpenVINO-supported device offers performance settings that have command-line equivalents in the benchmark_app.

While these settings provide low-level control to tune model performance for a specific device, it is recommended to start performance evaluation with the OpenVINO High-Level Performance Hints, like so:

# for throughput prioritization
benchmark_app -hint tput -m <model> -d <device>
# for latency prioritization
benchmark_app -hint latency -m <model> -d <device>
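
The same hints can also be evaluated programmatically when compiling the model. The snippet below is a minimal Python sketch: the model path and device name are placeholders, and the string-key form of the configuration is only one of several ways to pass the hint.

import openvino as ov

core = ov.Core()
model = core.read_model("model.xml")  # placeholder path to an OpenVINO IR model

# Compile with a high-level performance hint (THROUGHPUT or LATENCY) and let
# the plugin choose the low-level, device-specific settings automatically.
compiled_tput = core.compile_model(model, "CPU", {"PERFORMANCE_HINT": "THROUGHPUT"})
compiled_latency = core.compile_model(model, "CPU", {"PERFORMANCE_HINT": "LATENCY"})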

Additional benchmarking considerations

1 - Select a Proper Set of Operations to Measure

When evaluating the performance of a model with OpenVINO Runtime, measure a proper set of operations:

  • Avoid including one-time costs such as model loading (a sketch illustrating this follows the note below).

  • Track operations that occur outside OpenVINO Runtime (such as video decoding) separately.

Note

Some image pre-processing can be baked into OpenVINO IR and accelerated accordingly. For more information, refer to Embedding Pre-processing and General Runtime Optimizations.
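
As a minimal sketch of the points above (assuming an already converted IR model and a static input shape; the model path and input data are placeholders), the timed region covers only the inference call, while model reading and compilation stay outside of it:

import time
import numpy as np
import openvino as ov

core = ov.Core()
model = core.read_model("model.xml")         # one-time cost, not measured
compiled = core.compile_model(model, "CPU")  # one-time cost, not measured

# Synthetic input; assumes the model has a single static-shaped input.
data = np.zeros(list(compiled.input(0).shape), dtype=np.float32)

start = time.perf_counter()
result = compiled([data])                    # only the inference call is timed
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"Inference time: {elapsed_ms:.2f} ms")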

2 - Try to Get Credible Data

Performance conclusions should be built on reproducible data. Measurements should be taken over a large number of invocations of the same routine. Since the first iteration is almost always significantly slower than the subsequent ones, an aggregated value of the execution time can be used for final projections:

  • If the warm-up run does not help or the execution time still varies, try running a large number of iterations and then averaging the results (see the sketch after this list).

  • If the time values vary too much, consider using the geometric mean instead.

  • Be aware of throttling and other power-related anomalies. A device can exist in one of several power states. When optimizing your model, consider fixing the device frequency for better reproducibility of performance data. However, end-to-end (application) benchmarking should also be performed under real operational conditions.
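
A minimal sketch of this approach (placeholder model path, synthetic input, 100 iterations chosen arbitrarily) discards a warm-up run and then aggregates the remaining measurements:

import statistics
import time
import numpy as np
import openvino as ov

core = ov.Core()
compiled = core.compile_model(core.read_model("model.xml"), "CPU")
data = np.zeros(list(compiled.input(0).shape), dtype=np.float32)

compiled([data])  # warm-up run, excluded from the statistics

times_ms = []
for _ in range(100):  # a large number of invocations of the same routine
    start = time.perf_counter()
    compiled([data])
    times_ms.append((time.perf_counter() - start) * 1000)

print(f"mean:    {statistics.mean(times_ms):.2f} ms")
print(f"median:  {statistics.median(times_ms):.2f} ms")
print(f"geomean: {statistics.geometric_mean(times_ms):.2f} ms")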

3 - Compare Performance with Native/Framework Code

When comparing OpenVINO Runtime performance with that of the framework or other reference code, make sure that both versions are as similar as possible:

  • Wrap the exact inference execution (for an example, see the Benchmark app).

  • Do not include model loading time.

  • Ensure that the inputs are identical for OpenVINO Runtime and the framework. For example, watch out for random values used to populate the inputs (see the sketch after this list).

  • If any user-side pre-processing, such as image pre-processing and conversion, is involved, track it separately.

  • When applicable, leverage the Dynamic Shapes support.

  • If possible, demand the same accuracy. For example, TensorFlow allows FP16 execution, so when comparing to it, make sure to test OpenVINO Runtime with FP16 as well.
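
For the input-parity point in the list above, one simple precaution is to generate the test input once, with a fixed seed, and feed exactly the same array to both OpenVINO Runtime and the reference framework. A minimal sketch (the reference-framework call is a placeholder, and the input shape is only an example):

import numpy as np
import openvino as ov

rng = np.random.default_rng(seed=0)  # fixed seed so both runs see identical data
data = rng.random((1, 3, 224, 224), dtype=np.float32)  # example shape, adjust to your model

core = ov.Core()
compiled = core.compile_model(core.read_model("model.xml"), "CPU")
ov_result = compiled([data])[compiled.output(0)]

# reference_result = run_reference_framework(data)  # hypothetical call, fed the same "data" array
# np.testing.assert_allclose(ov_result, reference_result, rtol=1e-3, atol=1e-3)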

Internal Inference Performance Counters and Execution Graphs

More detailed insights into the inference performance breakdown can be achieved with device-specific performance counters and/or execution graphs. Both the C++ and Python versions of the benchmark_app support the -pc command-line parameter, which outputs an internal execution breakdown.

For example, the table below shows part of the performance counters for a quantized TensorFlow implementation of the ResNet-50 model, run on the CPU plugin. Keep in mind that since the device is CPU, the realTime wall-clock time and the cpuTime of the layers are the same. Information about layer precision is also stored in the performance counters.

layerName                                               | execStatus | layerType    | execType          | realTime (ms) | cpuTime (ms)
--------------------------------------------------------|------------|--------------|-------------------|---------------|-------------
resnet_model/batch_normalization_15/FusedBatchNorm/Add  | EXECUTED   | Convolution  | jit_avx512_1x1_I8 | 0.377         | 0.377
resnet_model/conv2d_16/Conv2D/fq_input_0                | NOT_RUN    | FakeQuantize | undef             | 0             | 0
resnet_model/batch_normalization_16/FusedBatchNorm/Add  | EXECUTED   | Convolution  | jit_avx512_I8     | 0.499         | 0.499
resnet_model/conv2d_17/Conv2D/fq_input_0                | NOT_RUN    | FakeQuantize | undef             | 0             | 0
resnet_model/batch_normalization_17/FusedBatchNorm/Add  | EXECUTED   | Convolution  | jit_avx512_1x1_I8 | 0.399         | 0.399
resnet_model/add_4/fq_input_0                           | NOT_RUN    | FakeQuantize | undef             | 0             | 0
resnet_model/add_4                                      | NOT_RUN    | Eltwise      | undef             | 0             | 0
resnet_model/add_5/fq_input_1                           | NOT_RUN    | FakeQuantize | undef             | 0             | 0

The execStatus column of the table includes the following possible values:
- EXECUTED - the layer was executed by a standalone primitive.
- NOT_RUN - the layer was not executed by a standalone primitive or was fused with another operation and executed in another layer's primitive.

The execType column of the table includes inference primitives with specific suffixes. The layers can have the following marks:
- The I8 suffix is for layers that had 8-bit data type input and were computed in 8-bit precision.
- The FP32 suffix is for layers computed in 32-bit precision.

In this example, all Convolution layers are executed in int8 precision, while the rest of the layers are fused into Convolutions using post-operation optimization, as described in CPU Device. The table contains the layer names (as seen in OpenVINO IR), the layer type, and execution statistics.
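
The same per-layer breakdown shown in the table above can also be collected programmatically. The sketch below assumes the Python API: profiling is enabled at compile time and the counters are read from the inference request after a run (field names follow the ProfilingInfo structure; the exact types of the time fields may differ between releases).

import numpy as np
import openvino as ov

core = ov.Core()
model = core.read_model("model.xml")
# "PERF_COUNT" enables per-layer profiling, the equivalent of benchmark_app -pc
compiled = core.compile_model(model, "CPU", {"PERF_COUNT": "YES"})

request = compiled.create_infer_request()
request.infer([np.zeros(list(compiled.input(0).shape), dtype=np.float32)])

for info in request.get_profiling_info():
    print(info.node_name, info.status, info.node_type, info.exec_type,
          info.real_time, info.cpu_time)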

Both benchmark_app versions also support the exec_graph_path command-line option, which instructs OpenVINO to output the same per-layer execution statistics, but in the form of a plugin-specific, Netron-viewable graph written to the specified file.
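
In the API, the same execution graph can be obtained from the compiled model and saved for inspection in Netron. A minimal sketch (the output file name is arbitrary; in older releases serialize may need to be imported from openvino.runtime instead of the top-level namespace):

import openvino as ov

core = ov.Core()
compiled = core.compile_model(core.read_model("model.xml"), "CPU")

# The runtime (execution) graph carries per-layer runtime info such as execType.
exec_graph = compiled.get_runtime_model()
ov.serialize(exec_graph, "exec_graph.xml")  # open the resulting file in Netron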

Especially when performance-debugging latency, note that the counters do not reflect time spent in the plugin/device/driver/etc. queues. If the sum of the counters differs too much from the latency of an inference request, consider testing with fewer inference requests. For example, running a single OpenVINO stream with multiple requests would produce nearly the same counters as running a single inference request, while the actual latency can be quite different.

Lastly, the statistics in both performance counters and execution graphs are averaged, so such data for inputs of dynamic shapes should be measured carefully, preferably by isolating a specific shape and executing it multiple times in a loop to gather reliable data.

Use ITT to Get Performance Insights

In general, OpenVINO and its individual plugins are heavily instrumented with Intel® Instrumentation and Tracing Technology (ITT). Therefore, you can also compile OpenVINO from the source code with ITT enabled and use tools like Intel® VTune™ Profiler to get a detailed inference performance breakdown and additional insights into application-level performance on the timeline view.

Results may vary. For more information, see F.A.Q. and Platforms, Configurations, Methodology. See Legal Information.