Getting Performance Numbers
This guide explains how to use benchmark_app to get performance numbers, how those numbers are reflected through internal inference performance counters and execution graphs, and how to use ITT and Intel® VTune™ Profiler to get additional performance insights.
Test performance with the benchmark_app
You can run OpenVINO benchmarks with both the C++ and Python APIs, yet the experience differs in each case: the Python version is part of the OpenVINO Runtime installation, while the C++ version is available as a code sample. For a detailed description, see: benchmark_app.
Make sure to install the latest release package with support for frameworks of the models you want to test. For the most reliable performance benchmarks, prepare the model for use with OpenVINO.
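As a quick illustration, the model preparation step can look like the following minimal Python sketch, assuming a recent OpenVINO release and a placeholder ONNX file named model.onnx; the resulting IR can then be passed to benchmark_app directly:

```python
import openvino as ov

# Convert the original framework model to an OpenVINO model in memory
# (convert_model also accepts TensorFlow and PyTorch models).
ov_model = ov.convert_model("model.onnx")

# Save it as OpenVINO IR (model.xml + model.bin); weights are compressed
# to FP16 by default, and benchmark_app can consume the IR as-is.
ov.save_model(ov_model, "model.xml")
```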
Running the benchmark application
The benchmark_app includes a lot of device-specific options, but the primary usage is as simple as:
benchmark_app -m <model> -d <device> -i <input>
Each OpenVINO-supported device offers performance settings that have command-line equivalents in the benchmark app.
While these settings provide low-level control over optimal model performance on a specific device, it is recommended to always start performance evaluation with the OpenVINO High-Level Performance Hints first, like so:
# for throughput prioritization
benchmark_app -hint tput -m <model> -d <device>
# for latency prioritization
benchmark_app -hint latency -m <model> -d <device>
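The same hints are available through the API as well. Below is a minimal Python sketch, assuming a recent OpenVINO release (property names taken from openvino.properties.hint) and a placeholder model.xml path:

```python
import openvino as ov
import openvino.properties.hint as hints

core = ov.Core()
model = core.read_model("model.xml")  # placeholder model path

# Rough equivalent of "benchmark_app -hint tput": let the device derive its own
# low-level settings for throughput (use hints.PerformanceMode.LATENCY for latency).
compiled = core.compile_model(
    model, "CPU", {hints.performance_mode: hints.PerformanceMode.THROUGHPUT}
)
```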
Additional benchmarking considerations
1 - Select a Proper Set of Operations to Measure
When evaluating the performance of a model with OpenVINO Runtime, it is necessary to measure a proper set of operations:
Avoid including one-time costs such as model loading.
Track operations that occur outside OpenVINO Runtime (such as video decoding) separately.
Note
Some image pre-processing can be baked into OpenVINO IR and accelerated accordingly. For more information, refer to Embedding Pre-processing and General Runtime Optimizations.
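For illustration, below is a hedged Python sketch of embedding pre-processing into the model with the openvino.preprocess API, assuming u8 NHWC input images and a placeholder model.xml path; the saved IR can then be benchmarked as-is:

```python
import openvino as ov
from openvino.preprocess import PrePostProcessor

core = ov.Core()
model = core.read_model("model.xml")                 # placeholder model path

ppp = PrePostProcessor(model)
ppp.input().tensor().set_element_type(ov.Type.u8)    # the application will pass u8 images...
ppp.input().tensor().set_layout(ov.Layout("NHWC"))   # ...in NHWC layout
ppp.input().preprocess().convert_element_type(ov.Type.f32)  # conversion now runs inside the model
ppp.input().model().set_layout(ov.Layout("NCHW"))    # layout expected by the original model
model = ppp.build()

ov.save_model(model, "model_with_preproc.xml")       # benchmark this IR directly
```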
2 - Try to Get Credible Data
Performance conclusions should be built upon reproducible data. Performance measurements should be made over a large number of invocations of the same routine. Since the first iteration is almost always significantly slower than the subsequent ones, an aggregated value can be used for the execution time in final projections:
If the warm-up run does not help, or the execution time still varies, you can try running a large number of iterations and averaging the results.
If the time values vary too much, consider using the geometric mean instead.
Be aware of throttling and other power anomalies. A device can exist in one of several power states, so when optimizing your model, consider fixing the device frequency for better reproducibility of the performance data. However, end-to-end (application) benchmarking should also be performed under real operational conditions. A minimal warm-up and timing sketch is shown after this list.
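A minimal Python sketch of such a measurement, assuming a placeholder model.xml path and input shape, could look like this (the real benchmark_app does all of this for you):

```python
import statistics
import time

import numpy as np
import openvino as ov

core = ov.Core()
compiled = core.compile_model("model.xml", "CPU")          # model loading is excluded from timing
request = compiled.create_infer_request()
data = np.random.rand(1, 3, 224, 224).astype(np.float32)   # adjust to your model's input shape

request.infer({0: data})                                   # warm-up: the first run is usually the slowest

latencies_ms = []
for _ in range(200):                                       # a large number of invocations
    start = time.perf_counter()
    request.infer({0: data})
    latencies_ms.append((time.perf_counter() - start) * 1000)

print(f"median:  {statistics.median(latencies_ms):.2f} ms")
print(f"geomean: {statistics.geometric_mean(latencies_ms):.2f} ms")
```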
3 - Compare Performance with Native/Framework Code
When comparing the OpenVINO Runtime performance with the framework or another reference code, make sure that both versions are as similar as possible:
Wrap the exact inference execution (for examples, see the Benchmark app).
Do not include model loading time.
Ensure that the inputs are identical for OpenVINO Runtime and the framework. For example, watch out for random values that can be used to populate the inputs.
Track any user-side pre-processing, such as image pre-processing and conversion, separately from the inference time.
When applicable, leverage the Dynamic Shapes support.
If possible, demand the same accuracy. For example, TensorFlow allows FP16 execution, so when comparing to that, make sure to test OpenVINO Runtime with FP16 as well. A minimal comparison sketch follows this list.
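For example, a like-for-like comparison could be sketched as follows; the seeded input is shared between both measurements, only the inference call is wrapped, and run_framework_inference() is a hypothetical placeholder for the native/framework code being compared against:

```python
import time

import numpy as np
import openvino as ov

rng = np.random.default_rng(seed=0)                        # fixed seed -> identical inputs everywhere
data = rng.random((1, 3, 224, 224), dtype=np.float32)      # adjust to your model's input shape

compiled = ov.Core().compile_model("model.xml", "CPU")     # model loading is not measured
request = compiled.create_infer_request()
request.infer({0: data})                                   # warm-up

start = time.perf_counter()
request.infer({0: data})                                   # wrap only the inference execution
print(f"OpenVINO:  {(time.perf_counter() - start) * 1000:.2f} ms")

# The reference measurement should wrap only the equivalent call:
# start = time.perf_counter()
# run_framework_inference(data)   # hypothetical: your TensorFlow / other framework call
# print(f"Framework: {(time.perf_counter() - start) * 1000:.2f} ms")
```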
Internal Inference Performance Counters and Execution Graphs
More detailed insights into inference performance breakdown can be achieved with device-specific performance counters and/or execution graphs.
Both C++ and Python versions of benchmark_app support the -pc command-line parameter, which outputs an internal execution breakdown.

For example, the table below is a part of the performance counters for a quantized TensorFlow implementation of the ResNet-50 model, inferred with the CPU plugin. Keep in mind that since the device is CPU, the realTime wall-clock and the cpuTime values of the layers are the same. Information about layer precision is also stored in the performance counters.
| layerName | execStatus | layerType | execType | realTime (ms) | cpuTime (ms) |
|---|---|---|---|---|---|
| resnet_model/batch_normalization_15/FusedBatchNorm/Add | EXECUTED | Convolution | jit_avx512_1x1_I8 | 0.377 | 0.377 |
| resnet_model/conv2d_16/Conv2D/fq_input_0 | NOT_RUN | FakeQuantize | undef | 0 | 0 |
| resnet_model/batch_normalization_16/FusedBatchNorm/Add | EXECUTED | Convolution | jit_avx512_I8 | 0.499 | 0.499 |
| resnet_model/conv2d_17/Conv2D/fq_input_0 | NOT_RUN | FakeQuantize | undef | 0 | 0 |
| resnet_model/batch_normalization_17/FusedBatchNorm/Add | EXECUTED | Convolution | jit_avx512_1x1_I8 | 0.399 | 0.399 |
| resnet_model/add_4/fq_input_0 | NOT_RUN | FakeQuantize | undef | 0 | 0 |
| resnet_model/add_4 | NOT_RUN | Eltwise | undef | 0 | 0 |
| resnet_model/add_5/fq_input_1 | NOT_RUN | FakeQuantize | undef | 0 | 0 |
The execStatus column of the table includes the following possible values:

- EXECUTED - the layer was executed by a standalone primitive.
- NOT_RUN - the layer was not executed by a standalone primitive, or was fused with another operation and executed in another layer's primitive.

The execType column of the table includes inference primitives with specific suffixes. The layers can have the following marks:

- The I8 suffix is for layers that had 8-bit data type input and were computed in 8-bit precision.
- The FP32 suffix is for layers computed in 32-bit precision.

All Convolution layers are executed in int8 precision. The rest of the layers are fused into Convolutions using post-operation optimization, as described in CPU Device. The performance counters output contains layer names (as seen in OpenVINO IR), the layer type, and execution statistics.

Both benchmark_app versions also support the exec_graph_path command-line option, which requires OpenVINO to output the same per-layer execution statistics, but in the form of a plugin-specific, Netron-viewable graph written to the specified file.
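The same per-layer counters are also available programmatically. Below is a minimal Python sketch, assuming a recent OpenVINO release (profiling is switched on via the enable_profiling property) and a placeholder model path and input shape:

```python
import numpy as np
import openvino as ov
import openvino.properties as props

core = ov.Core()
compiled = core.compile_model("model.xml", "CPU", {props.enable_profiling: True})
request = compiled.create_infer_request()

data = np.random.rand(1, 3, 224, 224).astype(np.float32)   # adjust to your model's input shape
request.infer({0: data})

# Each entry roughly mirrors one row of the -pc table above.
for info in request.profiling_info:
    print(info.node_name, info.node_type, info.exec_type, info.status, info.real_time)
```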
Especially when performance-debugging latency, note that the counters do not reflect the time spent in the plugin/device/driver/etc. queues. If the sum of the counters differs too much from the latency of an inference request, consider testing with fewer inference requests. For example, running a single OpenVINO stream with multiple requests would produce nearly identical counters to running a single inference request, while the actual latency can be quite different.
Lastly, the performance statistics in both the performance counters and the execution graphs are averaged, so such data for inputs of dynamic shapes should be measured carefully, preferably by isolating the specific shape and executing it multiple times in a loop to gather reliable data; a minimal sketch is shown below.
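A hedged sketch of isolating one shape, assuming a placeholder model path and shape: reshape the model to the static shape of interest and loop over inferences for that shape only:

```python
import statistics
import time

import numpy as np
import openvino as ov

core = ov.Core()
model = core.read_model("model.xml")          # a model with dynamic input shapes (placeholder)
model.reshape({0: [1, 3, 224, 224]})          # isolate one concrete shape of interest
compiled = core.compile_model(model, "CPU")
request = compiled.create_infer_request()

data = np.random.rand(1, 3, 224, 224).astype(np.float32)
request.infer({0: data})                      # warm-up

times_ms = []
for _ in range(100):
    start = time.perf_counter()
    request.infer({0: data})
    times_ms.append((time.perf_counter() - start) * 1000)

print(f"median for this shape: {statistics.median(times_ms):.2f} ms")
```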
Use ITT to Get Performance Insights
In general, OpenVINO and its individual plugins are heavily instrumented with Intel® Instrumentation and Tracing Technology (ITT). Therefore, you can also compile OpenVINO from the source code with ITT enabled and use tools like Intel® VTune™ Profiler to get a detailed inference performance breakdown and additional insights into the application-level performance on the timeline view.