Runtime Inference Optimizations
Runtime (or deployment) optimizations focus on tuning inference parameters (for example, the optimal number of requests executed simultaneously) and other aspects of how a model is executed.
As noted in the parent performance introduction topic, a dedicated document covers model-level optimizations such as quantization, which unlocks 8-bit inference. Model-level optimizations are the most general: they help in any scenario and on any device (for example, one that accelerates quantized models). The relevant runtime configuration is ov::hint::inference_precision, which lets a device trade accuracy for performance (for example, by allowing fp16/bf16 execution for the layers that remain in fp32 after quantization of the original fp32 model).
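The precision hint can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not a reference implementation: it assumes the `openvino` Python package (2023.1 or newer) is installed, that the string key `INFERENCE_PRECISION_HINT` is the counterpart of ov::hint::inference_precision, and that the target device supports the requested precision. The helper names `precision_config` and `compile_with_precision` are hypothetical.

```python
# Precision names accepted by this sketch (an illustrative subset).
ALLOWED_PRECISIONS = ("f32", "f16", "bf16")


def precision_config(precision: str) -> dict:
    """Build a config dict that sets the inference precision hint."""
    if precision not in ALLOWED_PRECISIONS:
        raise ValueError(f"unsupported precision: {precision!r}")
    # Assumption: this key is the string form of ov::hint::inference_precision.
    return {"INFERENCE_PRECISION_HINT": precision}


def compile_with_precision(model_path: str, device: str, precision: str):
    """Compile a model with the requested precision hint (needs openvino)."""
    import openvino as ov  # imported lazily so precision_config works without it

    core = ov.Core()
    return core.compile_model(model_path, device, precision_config(precision))
```

For instance, `compile_with_precision("model.xml", "CPU", "f32")` would request full-precision execution, where `model.xml` is a placeholder path.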
Any further optimization should start with defining the use case. For example, does the target scenario emphasize throughput over latency, as when overnight jobs in a data center process millions of samples? Real-time usages, in contrast, would likely trade throughput for results delivered at minimal latency. Often the scenario is a combined one that targets the highest possible throughput while staying within a specific latency threshold. Below you can find a summary of the associated tips.
How the full-stack application uses the inference component end-to-end is also important. For example, which stages need to be orchestrated? In some cases a significant part of the workload time is spent on fetching and preparing the input data. Below you can find multiple tips on connecting the data input pipeline and the model inference efficiently. These are also common performance tricks that help both latency and throughput scenarios.
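One common way to connect the input pipeline and inference is to prepare the next input while the current one is being inferred. The sketch below uses only the Python standard library; `run_pipelined`, `prepare`, and `infer` are hypothetical names, with `prepare` and `infer` standing in for the application's real data-loading and inference calls.

```python
from concurrent.futures import ThreadPoolExecutor


def run_pipelined(sources, prepare, infer):
    """Overlap preparation of the next input with inference on the current one."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = None  # future holding the most recently submitted input
        for src in sources:
            next_input = pool.submit(prepare, src)
            if pending is not None:
                # Inference on the previous item runs in this thread while
                # the worker thread prepares the input submitted just above.
                results.append(infer(pending.result()))
            pending = next_input
        if pending is not None:
            # Drain the last prepared input.
            results.append(infer(pending.result()))
    return results
```

The overlap pays off only when `prepare` spends its time in I/O or in native code that releases the interpreter lock (for example, image decoding); results are returned in source order.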
Further documents cover the associated runtime performance optimization topics. Please also check the matrix of feature support by the individual devices.
General, application-level optimizations, and specifically:
For variably-sized inputs, consider dynamic shapes
Writing Performance-Portable Inference Applications
Each of OpenVINO’s supported devices offers a range of low-level performance settings. Tuning this detailed configuration requires a deep understanding of the device architecture.
Also, while the resulting performance may be optimal for the specific combination of device and model, it is neither portable across devices and models nor future-proof:
Even within a family of devices (like various CPUs), a different instruction set or number of CPU cores changes which execution configuration is optimal.
Similarly, the optimal batch size is highly specific to the particular GPU instance.
The model's compute versus memory-bandwidth requirements, the inference precision, and any quantization of the model also affect the selection of optimal parameters.
Finally, the optimal execution parameters of one device do not transparently map to another device type, for example:
Both the CPU and GPU devices support the notion of streams, yet the optimal number of streams is derived very differently for each.
To reduce this configuration complexity, the Performance Hints offer high-level “presets” for latency and throughput, as detailed in the Performance Hints usage document.
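As a rough illustration of how a hint replaces device-specific tuning, the following sketch selects a preset by name and then asks the compiled model how many requests it would like to run in parallel. It assumes the `openvino` Python package (2023.1 or newer) and the string keys `PERFORMANCE_HINT` and `OPTIMAL_NUMBER_OF_INFER_REQUESTS`; the helper names `hint_config` and `compile_for` are hypothetical.

```python
def hint_config(goal: str) -> dict:
    """Map a use-case goal to a performance hint config dict."""
    goal = goal.upper()
    if goal not in ("LATENCY", "THROUGHPUT"):
        raise ValueError(f"unknown goal: {goal!r}")
    return {"PERFORMANCE_HINT": goal}


def compile_for(model_path: str, device: str, goal: str):
    """Compile with a hint and report the suggested number of requests."""
    import openvino as ov  # imported lazily so hint_config works without it

    core = ov.Core()
    compiled = core.compile_model(model_path, device, hint_config(goal))
    # The device derives its own low-level settings from the hint; this
    # read-only property reports how many requests to keep in flight.
    n_requests = compiled.get_property("OPTIMAL_NUMBER_OF_INFER_REQUESTS")
    return compiled, n_requests
```

The same application code can then be pointed at a different device name without changing any stream or batch settings by hand.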
Beyond execution parameters, device-specific scheduling also greatly affects performance. Specifically, GPU-oriented optimizations like batching, which combines many (potentially tens of) inputs to achieve optimal throughput, do not always map well to the CPU, as detailed in the internals sections.
The hints hide the execution specifics required to saturate the device. The internals sections describe the implementation details for specific devices (particularly how OpenVINO implements the ‘throughput’ approach). Keep in mind that the hints make this transparent to the application. For example, the hints obviate the need for explicit (application-side) batching or streams.
With the hints, it is enough to keep a separate infer request per camera or other input source and to process the requests in parallel using the Async API, as explained in the application design considerations section. The main requirement for an application to benefit from the throughput hint is that it runs multiple inference requests in parallel.
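Running several requests in parallel might look like the sketch below, which uses the `AsyncInferQueue` helper from the Python API together with the throughput hint. It assumes the `openvino` package (2023.1 or newer), a model with a single input and at least one output, and inputs that already match the model's input shape; `ResultCollector` and `infer_all` are hypothetical names.

```python
class ResultCollector:
    """Stores the first output of each finished request, keyed by job id."""

    def __init__(self):
        self.results = {}

    def on_done(self, request, userdata):
        # Copy the data: the request's output buffer is reused by later jobs.
        self.results[userdata] = request.get_output_tensor(0).data.copy()


def infer_all(model_path: str, device: str, frames):
    """Run all frames through a pool of parallel infer requests."""
    import openvino as ov  # imported lazily; required only for real inference

    core = ov.Core()
    compiled = core.compile_model(
        model_path, device, {"PERFORMANCE_HINT": "THROUGHPUT"}
    )
    # With the default job count the queue picks the number of requests itself.
    queue = ov.AsyncInferQueue(compiled)
    collector = ResultCollector()
    queue.set_callback(collector.on_done)
    for job_id, frame in enumerate(frames):
        # Returns immediately unless every request in the pool is busy.
        queue.start_async({0: frame}, userdata=job_id)
    queue.wait_all()
    # Completion order is arbitrary, so restore submission order by job id.
    return [collector.results[k] for k in sorted(collector.results)]
```

The application never sets a batch size or stream count here; it only keeps the queue fed, and the job id passed as `userdata` lets it match results back to their sources.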
In summary, when performance portability is a concern, consider the Performance Hints as a solution. You may find further details and API examples here.