Getting Performance Numbers#


Benchmarking methodology for OpenVINO#

OpenVINO benchmarking (general)#

The OpenVINO benchmark setup consists of a single system with OpenVINO™ and the benchmark application installed. It measures the time spent on actual inference (excluding any pre- or post-processing) and reports the result as inferences per second (or Frames Per Second, FPS).

OpenVINO Model Server benchmarking (general)#

OpenVINO™ Model Server (OVMS) employs the Intel® Distribution of OpenVINO™ toolkit runtime libraries and exposes a set of models via a convenient inference API over gRPC or HTTP/REST. Its benchmark results are measured in a multiple-clients, single-server configuration, using two hardware platforms connected by Ethernet. The network bandwidth depends on the platforms and models used; it is provisioned so that it does not become a bottleneck for the workload. The connection is dedicated only to measuring performance. A minimal client sketch is shown after the setup description below.


The benchmark setup for OVMS consists of four main parts:

Figure: OVMS Benchmark Setup Diagram
  • OpenVINO™ Model Server is launched as a docker container on the server platform, where it listens to (and answers) requests from clients. OpenVINO™ Model Server runs on the same system that is used for the OpenVINO™ toolkit benchmark application in the corresponding benchmarks. Models served by OpenVINO™ Model Server are located on a local file system mounted into the docker container. The OpenVINO™ Model Server instance communicates with other components via ports over a dedicated docker network.

  • Clients run on a separate physical machine, referred to as the client platform. They are implemented in Python 3 on top of the TensorFlow* API and work as parallel processes. Each client waits for a response from OpenVINO™ Model Server before it sends the next request. The clients also verify the responses they receive.

  • The load balancer runs on the client platform in a docker container; HAProxy is used for this purpose. Its main role is to count the requests forwarded from clients to OpenVINO™ Model Server, estimate their latency, and expose this information through the Prometheus service. The load balancer is located on the client side to simulate a real-life scenario in which the physical network affects the reported metrics.

  • The Execution Controller is launched on the client platform. It is responsible for synchronizing the whole measurement process, downloading metrics from the load balancer, and presenting the final report of the run.
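
The sketch below is a minimal illustration of such a client, using the TensorFlow Serving-compatible REST API that OVMS exposes. The host, port, model name (resnet), and input shape are assumptions and should be adapted to the deployed model:

import numpy as np
import requests

# Random input tensor; the real clients send dataset images and verify responses.
data = np.random.rand(1, 3, 224, 224).astype(np.float32)   # placeholder shape

# TensorFlow Serving-compatible REST endpoint exposed by OVMS
# (host, port, and model name "resnet" are placeholders).
url = "http://localhost:8000/v1/models/resnet:predict"
response = requests.post(url, json={"instances": data.tolist()})
response.raise_for_status()
print(len(response.json()["predictions"]), "prediction(s) received")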

OpenVINO Model Server benchmarking (LLM)#

In the benchmarking results presented here, the load from clients is simulated using the benchmark_serving.py script from vLLM together with the ShareGPT dataset, which represents real-life usage scenarios. Both OpenVINO Model Server and vLLM expose OpenAI-compatible REST endpoints, so the methodology is identical for both.

In the experiments, we vary the average request rate to identify the trade-off between total throughput and TPOT (time per output token) latency.

Note that the prefix_caching feature is not used in this benchmarking.
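
For illustration only, an invocation of this kind may look as follows; the model identifier, host, port, and dataset path are placeholders, and the flag names should be checked against the benchmark_serving.py version in use:

python benchmark_serving.py --backend openai \
    --host localhost --port 8000 \
    --model meta-llama/Llama-2-7b-chat-hf \
    --dataset-name sharegpt --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --request-rate 2.0 --num-prompts 500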

How to obtain benchmark results#

General considerations#

Select a proper set of operations to measure

When evaluating the performance of a model with OpenVINO Runtime, make sure you measure the proper set of operations; a minimal timing sketch follows the note below.

  • Avoid including one-time costs such as model loading.

  • Track operations that occur outside OpenVINO Runtime, such as video decoding, separately.

Note

Some image pre-processing can be baked into OpenVINO IR and accelerated accordingly. For more information, refer to Embedding Pre-processing and General Runtime Optimizations.
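
The following sketch shows this scoping with the OpenVINO Python API; the model path and input shape are placeholders:

import time
import numpy as np
import openvino as ov

core = ov.Core()
# One-time costs (model reading and compilation) stay outside the timed section.
model = core.read_model("model.xml")          # placeholder path
compiled = core.compile_model(model, "CPU")
infer_request = compiled.create_infer_request()

# Input preparation (decoding, resizing, etc.) is also excluded from the measurement.
data = np.random.rand(1, 3, 224, 224).astype(np.float32)   # placeholder shape

start = time.perf_counter()
infer_request.infer({0: data})                # only the actual inference is timed
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"Inference time: {elapsed_ms:.2f} ms")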

Maximize the chance to obtain credible data

Performance conclusions should be built on reproducible data. Performance measurements should be done with a large number of invocations of the same routine. Since the first iteration is almost always significantly slower than the subsequent ones, an aggregated value can be used for the execution time in final projections (see the sketch after this list):

  • If the warm-up run does not help or execution times still vary, you can try running a large number of iterations and then use the mean value of the results.

  • If time values differ too much, consider using a geomean.

  • Be aware of potential power-related irregularities, such as throttling. A device may assume one of several different power states, so it is advisable to fix its frequency when optimizing, for better performance data reproducibility.

  • Note that end-to-end application benchmarking should also be performed under real operational conditions.
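
A sketch of warm-up and aggregation, building on the previous example, is shown below; the iteration counts are arbitrary, and whether the mean, median, or geometric mean is the right aggregate depends on how much the values vary:

import statistics
import time

def measure_latency_ms(infer_request, inputs, iterations=100, warmup=10):
    # Warm-up runs are discarded: the first iterations are typically slower.
    for _ in range(warmup):
        infer_request.infer(inputs)
    latencies = []
    for _ in range(iterations):
        start = time.perf_counter()
        infer_request.infer(inputs)
        latencies.append((time.perf_counter() - start) * 1000)
    return {
        "mean": statistics.mean(latencies),
        "median": statistics.median(latencies),
        "geomean": statistics.geometric_mean(latencies),
    }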

Compare performance with native/framework code

When comparing OpenVINO Runtime performance with the framework or reference code, make sure that both versions are as similar as possible (a minimal sketch of matching the inputs follows this list):

  • Wrap the exact inference execution (for examples, see Benchmark app).

  • Do not include model loading time.

  • Ensure that the inputs are identical for OpenVINO Runtime and the framework. For example, watch out for random values that can be used to populate the inputs.

  • Track any user-side pre-processing separately; image pre-processing and data conversion are typical examples.

  • When applicable, leverage the Dynamic Shapes support.

  • If possible, demand the same accuracy. For example, TensorFlow allows FP16 execution, so when comparing to it, make sure to test OpenVINO Runtime with FP16 as well.
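
For example, one way to guarantee identical inputs is to generate them once with a fixed seed and reuse the same array for both runtimes; the shape and file name below are placeholders:

import numpy as np

rng = np.random.default_rng(seed=42)                 # fixed seed for reproducibility
sample = rng.random((1, 3, 224, 224), dtype=np.float32)
np.save("benchmark_input.npy", sample)               # reuse the same file in both runs

# Later, in both the framework script and the OpenVINO script:
sample = np.load("benchmark_input.npy")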

Make sure the benchmarking setup is proper for the selected scenario
  • Install the latest release package supporting the frameworks of the tested models (example installation commands are shown after this list).

  • For the most reliable performance benchmarks, prepare the model for use with OpenVINO.

  • For testing generative AI models, make sure you select the method that best suits your case: Optimum-Intel or the OpenVINO GenAI package.
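
As an example, a Python setup may look like the following; which packages you need depends on the chosen workflow, and the package names should be verified against the current installation guide:

pip install --upgrade openvino              # OpenVINO Runtime and the Python benchmark_app
pip install --upgrade openvino-genai        # OpenVINO GenAI package (for generative models)
pip install --upgrade "optimum[openvino]"   # Optimum-Intel (Hugging Face model export path)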

OpenVINO benchmarking (general)#

The default way of measuring OpenVINO performance is running a piece of code, referred to as the benchmark tool. For Python, it is part of the OpenVINO Runtime installation, while for C++, it is available as a code sample.

Running the benchmark application#

The benchmark_app includes a lot of device-specific options, but the primary usage is as simple as:

benchmark_app -m <model> -d <device> -i <input>

Each OpenVINO-supported device offers performance settings that have command-line equivalents in the Benchmark app.

While these settings provide low-level control for optimal model performance on a specific device, it is recommended to always start performance evaluation with the OpenVINO High-Level Performance Hints, like so:

# for throughput prioritization
benchmark_app -hint tput -m <model> -d <device>
# for latency prioritization
benchmark_app -hint latency -m <model> -d <device>

Internal Inference Performance Counters and Execution Graphs#

More detailed insights into inference performance breakdown can be achieved with device-specific performance counters and/or execution graphs. Both C++ and Python versions of the benchmark_app support a -pc command-line parameter that outputs an internal execution breakdown.
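
For instance, a run like the following (the model path is a placeholder) prints the per-layer counters once the measurement finishes:

benchmark_app -m model.xml -d CPU -pc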

The table below shows part of the performance counters for CPU inference of a TensorFlow implementation of ResNet-50. Keep in mind that since the device is CPU, the realTime (wall-clock) and cpuTime values of the layers are the same. Information about layer precision is also stored in the performance counters.

layerName                                                execStatus  layerType     execType           realTime (ms)  cpuTime (ms)
resnet_model/batch_normalization_15/FusedBatchNorm/Add  EXECUTED    Convolution   jit_avx512_1x1_I8  0.377          0.377
resnet_model/conv2d_16/Conv2D/fq_input_0                 NOT_RUN     FakeQuantize  undef              0              0
resnet_model/batch_normalization_16/FusedBatchNorm/Add  EXECUTED    Convolution   jit_avx512_I8      0.499          0.499
resnet_model/conv2d_17/Conv2D/fq_input_0                 NOT_RUN     FakeQuantize  undef              0              0
resnet_model/batch_normalization_17/FusedBatchNorm/Add  EXECUTED    Convolution   jit_avx512_1x1_I8  0.399          0.399
resnet_model/add_4/fq_input_0                            NOT_RUN     FakeQuantize  undef              0              0
resnet_model/add_4                                       NOT_RUN     Eltwise       undef              0              0
resnet_model/add_5/fq_input_1                            NOT_RUN     FakeQuantize  undef              0              0

The execStatus column of the table includes the following possible values:
- EXECUTED - the layer was executed by a standalone primitive.
- NOT_RUN - the layer was not executed by a standalone primitive, or it was fused with another operation and executed within another layer's primitive.

The execType column of the table includes inference primitives with specific suffixes. The layers could have the following marks:
- The I8 suffix is for layers that had 8-bit data type input and were computed in 8-bit precision.
- The FP32 suffix is for layers computed in 32-bit precision.

In this example, all Convolution layers are executed in int8 precision, and the rest of the layers are fused into the Convolutions using post-operation optimization, as described in CPU Device. The table contains the layer names (as seen in OpenVINO IR), the layer types, and the execution statistics.

Both benchmark_app versions also support the exec_graph_path command-line option, which makes OpenVINO output the same per-layer execution statistics, but in the form of a plugin-specific, Netron-viewable graph written to the specified file.

Especially when performance-debugging latency, note that the counters do not reflect the time spent in the plugin/device/driver/etc. queues. If the sum of the counters is too different from the latency of an inference request, consider testing with fewer inference requests. For example, running a single OpenVINO stream with multiple requests would produce nearly identical counters to running a single inference request, while the actual latency can be quite different.

Lastly, the performance statistics in both the performance counters and the execution graphs are averaged, so such data for inputs of dynamic shapes should be measured carefully, preferably by isolating the specific shape and executing it multiple times in a loop to gather reliable data.
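
The same per-layer counters can also be collected programmatically. Below is a minimal sketch assuming the openvino Python API with profiling enabled on the compiled model; the attribute names of the profiling entries may differ slightly between releases, so verify them against the API reference:

import numpy as np
import openvino as ov
import openvino.properties as props

core = ov.Core()
model = core.read_model("model.xml")                         # placeholder path
compiled = core.compile_model(model, "CPU", {props.enable_profiling(): True})
request = compiled.create_infer_request()
request.infer({0: np.random.rand(1, 3, 224, 224).astype(np.float32)})  # placeholder input

for info in request.profiling_info:                          # per-layer counters
    print(info.node_name, info.exec_type, info.status, info.real_time)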

Use ITT to Get Performance Insights#

In general, OpenVINO and its individual plugins are heavily instrumented with Intel® Instrumentation and Tracing Technology (ITT). Therefore, you can also compile OpenVINO from the source code with ITT enabled and use tools like Intel® VTune™ Profiler to get a detailed inference performance breakdown and additional insights into application-level performance on the timeline view.

OpenVINO benchmarking (LLM)#

Large Language Models require a different benchmarking approach than static models. A detailed description will be added soon.
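
Until then, one possible starting point for measuring LLM-specific metrics such as TTFT and TPOT is the OpenVINO GenAI API. The sketch below assumes an LLM already exported to OpenVINO IR in the model_dir directory; the exact names of the metric accessors should be verified against the GenAI documentation:

import openvino_genai

pipe = openvino_genai.LLMPipeline("model_dir", "CPU")   # placeholder: exported OpenVINO IR directory
result = pipe.generate("What is OpenVINO?", max_new_tokens=128)

metrics = result.perf_metrics                           # aggregated generation metrics
print("TTFT, ms:", metrics.get_ttft().mean)             # time to first token
print("TPOT, ms:", metrics.get_tpot().mean)             # time per output token
print("Throughput, tokens/s:", metrics.get_throughput().mean)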

Results may vary. For more information, see F.A.Q. and Platforms, Configurations, Methodology. See Legal Information.