Introduction to the Performance Topics

This section is a shorter version of the Optimization Guide for the Intel Deep Learning Deployment Toolkit.

Precision

Inference precision directly affects performance.

The Model Optimizer can produce IRs with different precisions. For example, a float16 (FP16) IR primarily targets VPU and GPU devices, while the CPU, for example, can also execute the regular float32 (FP32). Further device-specific inference precision settings are available as well, such as 8-bit integer inference on the CPU. Notice that for the MULTI device, which supports automatic inference on multiple devices in parallel, you can use the FP16 IR. More information, such as the preferred data types for specific devices, can be found in the Supported Devices section.
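
As an illustration, the following sketch (assuming the Python Inference Engine API, openvino.inference_engine, and a hypothetical FP16 IR named model.xml/model.bin) shows that the same FP16 IR can be loaded on different devices, including the MULTI device:

    from openvino.inference_engine import IECore

    ie = IECore()
    # model.xml / model.bin are a hypothetical FP16 IR produced by the Model Optimizer
    net = ie.read_network(model="model.xml", weights="model.bin")

    # The same FP16 IR can target a GPU or VPU directly ...
    exec_net_gpu = ie.load_network(network=net, device_name="GPU")

    # ... and also the MULTI device, which runs inference on several devices in parallel
    exec_net_multi = ie.load_network(network=net, device_name="MULTI:GPU,CPU")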

Latency vs. Throughput

One way to increase computational efficiency is batching, which combines many (potentially tens of) inputs into a single request to achieve optimal throughput. However, a high batch size also comes with a latency penalty, so for more real-time oriented usages, lower batch sizes (as low as a single input) are used. Refer to the Benchmark App sample, which allows measuring latency vs. throughput.
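
For example, the batch size of a network can be changed before it is loaded. A minimal sketch, assuming the Python Inference Engine API and a hypothetical model.xml/model.bin IR whose batch dimension can be resized:

    from openvino.inference_engine import IECore

    ie = IECore()
    net = ie.read_network(model="model.xml", weights="model.bin")

    # Larger batch -> better throughput, but every input in the batch
    # waits for the whole batch to finish, so latency grows.
    net.batch_size = 8
    exec_net_throughput = ie.load_network(network=net, device_name="CPU")

    # For real-time (latency-oriented) usage, keep the batch at a single input.
    net.batch_size = 1
    exec_net_latency = ie.load_network(network=net, device_name="CPU")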

Using Async API

To gain better performance on accelerators, such as VPU or FPGA, the Inference Engine uses the asynchronous approach (see Integrating Inference Engine in Your Application (current API)). The point is to amortize the cost of data transfers by pipelining (see Async API explained). Since the pipelining relies on the availability of parallel slack, running multiple inference requests in parallel is essential. Refer to the Benchmark App sample, which enables running a number of inference requests in parallel. Specifying different numbers of requests produces different throughput measurements.
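
A minimal sketch of the asynchronous flow, assuming the Python Inference Engine API (2021.x), a hypothetical single-input FP16 IR (model.xml/model.bin), and MYRIAD (a VPU) as the example accelerator:

    import numpy as np
    from openvino.inference_engine import IECore

    ie = IECore()
    net = ie.read_network(model="model.xml", weights="model.bin")
    input_name = next(iter(net.input_info))               # single-input model assumed
    n, c, h, w = net.input_info[input_name].input_data.shape

    # Several infer requests allow data transfers and compute to be pipelined.
    exec_net = ie.load_network(network=net, device_name="MYRIAD", num_requests=4)

    frames = [np.random.rand(n, c, h, w).astype(np.float32) for _ in range(4)]

    # Kick off all requests without blocking ...
    for i, frame in enumerate(frames):
        exec_net.start_async(request_id=i, inputs={input_name: frame})

    # ... then collect the results once they are ready.
    for request in exec_net.requests:
        request.wait(-1)                                   # block until the result is ready
        results = request.outputs                          # dict: output name -> ndarray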

Throughput Mode for CPU

Unlike most accelerators, the CPU is perceived as an inherently latency-oriented device. Starting with the 2018 R5 release, the Inference Engine introduced a "throughput" mode, which allows it to efficiently run multiple inference requests on the CPU simultaneously, greatly improving the throughput.

Internally, the execution resources are split/pinned into execution "streams". Using this feature gains much better performance for networks that do not scale well with the number of threads (for example, lightweight topologies). This is especially pronounced on many-core server machines.
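
A minimal sketch of enabling throughput streams on the CPU, assuming the Python Inference Engine API and a hypothetical model.xml/model.bin IR; CPU_THROUGHPUT_STREAMS is the configuration key exposed by the CPU plugin:

    from openvino.inference_engine import IECore

    ie = IECore()
    net = ie.read_network(model="model.xml", weights="model.bin")

    # Split the CPU resources into several execution streams;
    # CPU_THROUGHPUT_AUTO lets the plugin choose a reasonable number for this machine
    # (an explicit number, e.g. "4", can be passed instead).
    exec_net = ie.load_network(
        network=net,
        device_name="CPU",
        config={"CPU_THROUGHPUT_STREAMS": "CPU_THROUGHPUT_AUTO"},
        num_requests=4,   # keep at least as many requests in flight as there are streams
    )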

Try the Benchmark App sample and play with the number of infer requests running in parallel. The rule of thumb is to try up to the number of CPU cores in your machine. For example, on an 8-core CPU, compare "-nireq 1" (which is a legacy scenario) with 2, 4, and 8 requests.

In addition to the number of streams, it is also possible to play with the batch size to find the throughput sweet spot.

The throughput mode relaxes the requirement to saturate the CPU with a large batch: running multiple independent inference requests in parallel often gives much better performance than using a batch alone.

This allows you to simplify the application logic, as you don't need to combine multiple inputs into a batch to achieve good CPU performance. Instead, it is possible to keep a separate infer request per camera or other input source and process the requests in parallel using the Async API, as sketched below.
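
A minimal sketch of this per-source pattern, assuming the Python Inference Engine API, a hypothetical single-input IR (model.xml/model.bin), and a placeholder get_frame() standing in for real frame grabbing:

    import numpy as np
    from openvino.inference_engine import IECore

    NUM_CAMERAS = 4                                        # hypothetical number of input sources

    ie = IECore()
    net = ie.read_network(model="model.xml", weights="model.bin")
    input_name = next(iter(net.input_info))                # single-input model assumed
    n, c, h, w = net.input_info[input_name].input_data.shape

    exec_net = ie.load_network(
        network=net,
        device_name="CPU",
        config={"CPU_THROUGHPUT_STREAMS": str(NUM_CAMERAS)},
        num_requests=NUM_CAMERAS,                          # one infer request per camera
    )

    def get_frame(camera_id):
        # Placeholder for real frame grabbing; returns a correctly shaped tensor.
        return np.random.rand(n, c, h, w).astype(np.float32)

    # Each camera keeps its own request; no batching across cameras is needed.
    for camera_id in range(NUM_CAMERAS):
        exec_net.start_async(request_id=camera_id,
                             inputs={input_name: get_frame(camera_id)})

    for camera_id in range(NUM_CAMERAS):
        exec_net.requests[camera_id].wait(-1)
        results = exec_net.requests[camera_id].outputs     # per-camera results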