Inference is a single execution of the model: input data is fed to the model and the results are returned. Inference performance is a key characteristic of model quality. OpenVINO provides several techniques to evaluate and accelerate model performance.
The DL Workbench allows users to assess the model performance in one of two inference modes:
Latency is the time required to complete a unit of work, for example, the time required to infer a single image. The lower the value, the better. The latency mode is typical for lightweight models and real-time services.
Throughput is the number of inputs processed in a given amount of time. The higher the value, the better. The throughput mode is recommended for models designed for high-performance applications. For example, several surveillance cameras work simultaneously and pass their video frames to the accelerator at once. Using asynchronous inference can significantly improve performance and ensure that the model processes as many frames as possible.
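The two metrics can be illustrated with a minimal timing sketch. It does not use OpenVINO or the DL Workbench: `fake_infer` is a hypothetical stand-in for a real model call, and `measure` is an illustrative helper rather than part of any library.

```python
import time


def fake_infer(frame):
    """Hypothetical stand-in for a model call; sleeps about 2 ms."""
    time.sleep(0.002)
    return frame


def measure(infer, frames):
    """Return (average latency in ms, throughput in frames per second)."""
    latencies = []
    start = time.perf_counter()
    for frame in frames:
        t0 = time.perf_counter()
        infer(frame)  # one unit of work
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    avg_latency_ms = 1000 * sum(latencies) / len(latencies)
    throughput_fps = len(frames) / elapsed
    return avg_latency_ms, throughput_fps


avg_latency_ms, throughput_fps = measure(fake_infer, list(range(50)))
print(f"latency: {avg_latency_ms:.2f} ms, throughput: {throughput_fps:.1f} FPS")
```

Because the frames are processed one after another here, throughput is roughly the inverse of latency. The two values diverge only when several inputs are in flight at the same time.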
OpenVINO allows users to run a neural network in parallel and to propagate several input data instances at once. Both speed up the model and are controlled by the following parameters:
Streams: the number of streams is the number of instances of your model running simultaneously. Running the same model in several streams at once typically leads to higher throughput.
Batches: the batch size is the number of input data instances propagated to the model at a time.
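How the two parameters combine can be sketched as simple arithmetic. The helper below is hypothetical and assumes an idealized case in which all streams run fully in parallel and each stream finishes one batch per measured latency. Real latency usually grows with both batch size and stream count, so the numbers are illustrative only.

```python
def throughput_fps(batch, streams, latency_ms):
    """Idealized throughput: inputs completed per second.

    Assumes every stream finishes one batch of `batch` inputs
    every `latency_ms` milliseconds, with no contention between streams.
    """
    return batch * streams * 1000 / latency_ms


# One stream, one input at a time, 20 ms per inference.
baseline = throughput_fps(batch=1, streams=1, latency_ms=20)

# Two streams with batches of four; each batch is assumed to take 50 ms.
parallel = throughput_fps(batch=4, streams=2, latency_ms=50)

print(baseline, parallel)
```

In this made-up configuration the latency of a single request rises from 20 ms to 50 ms, yet more inputs are completed per second, which is the trade-off between the latency and throughput modes.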
Using one of these methods or a combination of them can give a noticeable performance boost (especially for lightweight topologies) without any accuracy loss. Another optimization technique is INT8 calibration, which results in a controllable accuracy drop.
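Why 8-bit precision costs some accuracy can be seen in a small sketch of symmetric quantization. This is only the underlying idea of mapping floating-point values to 8-bit integers, not the calibration procedure that OpenVINO performs. The function names and sample values are hypothetical.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the 8-bit integers."""
    return [q * scale for q in quantized]


weights = [0.02, -1.27, 0.635, 0.9, -0.3]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# The rounding error of each value is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, max_error)
```

The error introduced per value is at most half of `scale`, which is why the resulting accuracy drop is bounded and can be kept under control.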
The DL Workbench allows you to evaluate the performance of the model and provides a set of analytical capabilities that includes:
detailed performance assessment, including evaluation by layer, precision, and kernel
visualization of the computational graph of the model with the detection of bottlenecks
comparison across different configurations of the model (batches and streams) and different versions of the model (optimized and parent) by a variety of criteria