Optimize Inference

Runtime, or deployment, optimization focuses on tuning inference and execution parameters. Unlike model-level optimization, it is highly specific to the hardware you use and the goal you want to achieve. You need to decide whether to prioritize accuracy or performance, throughput or latency, or strike a balance between the two. You should also anticipate how scalable your application needs to be and how exactly it will interact with the inference component. This way, you will be able to achieve the best results for your product.

Note

For more information on this topic, see the following articles:

Performance-Portable Inference

To make configuration easier and performance optimization more portable, OpenVINO offers the Performance Hints feature. It comprises two high-level “presets” focused on latency (default) or throughput.
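Below is a minimal sketch of how a hint is applied, assuming the OpenVINO Python API and a placeholder model path ("model.xml", not taken from this article). The hint is passed as a single compile-time property, and the compiled model can then be queried for a sensible number of parallel infer requests:

```python
import openvino as ov

core = ov.Core()
model = core.read_model("model.xml")  # placeholder path

# LATENCY is the default preset; request THROUGHPUT explicitly
# when the goal is to process many requests in parallel.
compiled = core.compile_model(model, "CPU", {"PERFORMANCE_HINT": "THROUGHPUT"})

# The hint also determines how many infer requests to create for best results.
n_requests = compiled.get_property("OPTIMAL_NUMBER_OF_INFER_REQUESTS")
print(f"Create {n_requests} infer requests to saturate the device")
```

The same two lines of configuration work unchanged on other devices (for example, "GPU"), which is the portability the hints are designed to provide.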

Although inference with OpenVINO Runtime can be configured with a multitude of low-level performance settings, doing so is not recommended (a sketch contrasting manual tuning with the hints follows this list), because:

  • It requires a deep understanding of the device architecture and the inference engine.

  • It may not translate well to other device-model combinations. For example:

    • CPU and GPU deduce their optimal number of streams differently.

    • Different devices of the same type favor different execution configurations.

    • Different models favor different parameter configurations (e.g., compute-bound vs memory-bandwidth-bound behavior, inference precision, and possible model quantization).

    • Execution “scheduling” impacts performance strongly and is highly device-specific. GPU-oriented optimizations do not always map well to the CPU.
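To illustrate the contrast, here is a sketch (again assuming the OpenVINO Python API and a placeholder model path) that pins low-level values by hand and then shows the portable, hint-based alternative. The specific numbers are hypothetical:

```python
import openvino as ov

core = ov.Core()
model = core.read_model("model.xml")  # placeholder path

# Manual low-level tuning: these values might suit one particular CPU and model,
# but the same settings can hurt another CPU, and GPUs count streams differently.
manual = core.compile_model(model, "CPU", {
    "NUM_STREAMS": "8",
    "INFERENCE_NUM_THREADS": "16",
})

# Portable alternative: the THROUGHPUT hint lets each device derive its own
# stream and thread configuration for the given model.
hinted = core.compile_model(model, "CPU", {"PERFORMANCE_HINT": "THROUGHPUT"})
```

In practice, the hinted variant is the safer default; explicit low-level settings are worth the effort only after profiling a fixed device-model combination.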