OpenVINO Inference

Inference is a single execution of the model: data is fed to the model and the results are obtained. Inference performance is a key characteristic of model quality. OpenVINO provides several techniques to evaluate and accelerate model performance.

The DL Workbench allows users to assess model performance in one of two inference modes:

  • Latency

    Latency is the time required to complete a unit of work, for example, the time required to infer a single image. The lower the value, the better. The latency mode is typical for lightweight models and real-time services.

  • Throughput

    Throughput is the amount of input data processed in a given period of time, for example, the number of images inferred per second. The higher the value, the better. The throughput mode is recommended for models designed for high-performance applications. For example, several surveillance cameras may work simultaneously, passing video frames to the accelerator at once. Using asynchronous inference can significantly improve performance and ensure that the model processes as many frames as possible (see the sketch after the figure below).

Figure: Latency vs. Throughput
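The contrast between the two modes can be sketched with the OpenVINO Python API. The sketch below is illustrative rather than DL Workbench code: the model path, device, and request count are placeholder assumptions, while PERFORMANCE_HINT is a standard OpenVINO property.

    import numpy as np
    import openvino as ov

    core = ov.Core()
    model = core.read_model("model.xml")  # placeholder path

    # Latency mode: one synchronous request at a time.
    compiled = core.compile_model(model, "CPU", {"PERFORMANCE_HINT": "LATENCY"})
    request = compiled.create_infer_request()
    data = np.random.rand(*compiled.input(0).shape).astype(np.float32)
    request.infer({0: data})  # time this single call to measure latency

    # Throughput mode: several asynchronous requests kept in flight at once.
    compiled = core.compile_model(model, "CPU", {"PERFORMANCE_HINT": "THROUGHPUT"})
    queue = ov.AsyncInferQueue(compiled)  # queue size chosen by the plugin
    for _ in range(100):                  # count completions per second instead
        queue.start_async({0: data})
    queue.wait_all()

In latency mode, the duration of a single infer() call is the metric of interest; in throughput mode, the metric is the number of requests completed per second across the whole queue.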

OpenVINO allows users to run several instances of the model in parallel and to propagate multiple input data instances at once, speeding up inference through the following parameters (a configuration sketch follows the list):

  • Streams: the number of instances of your model running simultaneously. Inferring the same model in several streams in parallel leads to higher model performance.

  • Batches: the number of input data instances propagated to the model at a time (the batch size).
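Both parameters can be set explicitly when compiling a model. The sketch below assumes a single-input model with an [N, C, H, W] layout; the path, the batch size of 4, the stream count of 2, and the 3x224x224 input dimensions are illustrative assumptions, while NUM_STREAMS is a standard OpenVINO property key.

    import openvino as ov

    core = ov.Core()
    model = core.read_model("model.xml")  # placeholder path

    # Batch: reshape the model so each infer call accepts 4 inputs at once.
    model.reshape([4, 3, 224, 224])

    # Streams: run 2 instances of the model side by side on the CPU.
    compiled = core.compile_model(model, "CPU", {"NUM_STREAMS": "2"})

In practice, the PERFORMANCE_HINT property shown earlier lets the device plugin pick a suitable stream count automatically, which is usually a better starting point than hand tuning.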

Using one of these methods, or a combination of both, can give a noticeable performance boost (especially for lightweight topologies) without any accuracy loss. Another optimization technique is INT8 Calibration, which trades a controlled accuracy drop for additional performance; a standalone sketch follows.
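In the DL Workbench, INT8 Calibration is configured through the UI. As a standalone illustration only, the sketch below performs post-training INT8 quantization with NNCF; the model path is a placeholder, and the random arrays stand in for a real representative calibration dataset.

    import numpy as np
    import nncf
    import openvino as ov

    core = ov.Core()
    model = core.read_model("model.xml")  # placeholder path

    # Random stand-in for a real representative calibration set.
    calibration_images = [
        np.random.rand(1, 3, 224, 224).astype(np.float32) for _ in range(10)
    ]
    calibration_dataset = nncf.Dataset(calibration_images, lambda item: item)

    # Post-training quantization: weights and activations move to INT8,
    # trading a small, controlled accuracy drop for extra speed.
    quantized = nncf.quantize(model, calibration_dataset)
    ov.save_model(quantized, "model_int8.xml")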

The DL Workbench allows you to evaluate the performance of the model and provides a set of analytical capabilities which includes: