# Optimizing for Throughput

## General Throughput Considerations

As described in the section on latency-specific considerations, one possible use case is delivering every single request with minimal delay. Throughput, on the other hand, concerns inference scenarios in which a potentially large number of inference requests are served simultaneously to improve device utilization.

The associated increase in latency does not depend linearly on the number of requests executed in parallel. A trade-off between overall throughput and the serial performance of individual requests can therefore be achieved with the right OpenVINO performance configuration.
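To make the trade-off concrete, here is a minimal sketch of a toy model. It is not an OpenVINO measurement: the function name, the base latency `t1_ms`, and the `overhead` factor are all hypothetical and only illustrate how per-request latency can grow sub-linearly while aggregate throughput still rises.

```python
def toy_tradeoff(n_parallel: int, t1_ms: float = 10.0, overhead: float = 0.25):
    """Return (latency_ms, throughput_rps) for a toy device model.

    Assumption (hypothetical): each extra in-flight request adds only a
    fraction `overhead` of the single-request latency `t1_ms`.
    """
    if n_parallel < 1:
        raise ValueError("n_parallel must be at least 1")
    latency_ms = t1_ms * (1.0 + overhead * (n_parallel - 1))
    # n_parallel requests complete every latency_ms milliseconds.
    throughput_rps = n_parallel / latency_ms * 1000.0
    return latency_ms, throughput_rps


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        latency, throughput = toy_tradeoff(n)
        print(f"{n} parallel: {latency:.1f} ms/request, {throughput:.1f} requests/s")
```

In this model, four parallel requests cost less than four times the single-request latency, which is why throughput improves even though each individual request waits longer.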

## Basic and Advanced Ways of Leveraging Throughput

OpenVINO offers two ways of leveraging throughput on an individual device:

• Basic (high-level) flow with OpenVINO performance hints, which is inherently portable and future-proof.

• Advanced (low-level) approach of explicit batching and streams, explained in a separate document.

In both cases, the application should be designed to execute multiple inference requests in parallel, as detailed in the next section.

Finally, consider the automatic multi-device execution covered below.

## Throughput-Oriented Application Design

Most generally, throughput-oriented inference applications should:

• Expose a substantial amount of input parallelism (e.g. process multiple video or audio sources, text documents, etc.)

• Decompose the data flow into a collection of concurrent inference requests that are aggressively scheduled to be executed in parallel

• Set up the configuration for the device (e.g. as parameters of ov::Core::compile_model) via either the low-level explicit options mentioned in the previous section or the OpenVINO performance hints (preferable):

```cpp
// C++
auto compiled_model = core.compile_model(model, "GPU",
    ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT));
```

```python
# Python
compiled_model = core.compile_model(model, "GPU", {"PERFORMANCE_HINT": "THROUGHPUT"})
```

• Query ov::optimal_number_of_infer_requests from the ov::CompiledModel (the result of compiling the model for a device) and create that many requests, which is the number required to saturate the device

• Use the Async API with callbacks to avoid any dependency on the requests’ completion order and possible device starvation, as explained in the common-optimizations section
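The steps in the list above can be sketched in Python as follows. This is a minimal sketch, assuming the `openvino` Python package (2023.1 or later namespace) and a model with a single input and a single output. `ResultCollector`, `run_throughput`, and `model_path` are hypothetical names introduced for illustration, not part of the OpenVINO API.

```python
from typing import Any, Dict, Iterable


class ResultCollector:
    """Stores each result under the index passed as userdata (hypothetical helper)."""

    def __init__(self) -> None:
        self.results: Dict[int, Any] = {}

    def on_done(self, request: Any, userdata: int) -> None:
        # Copy the output, because the request and its tensors are reused by the queue.
        self.results[userdata] = request.get_output_tensor(0).data.copy()


def run_throughput(model_path: str, inputs: Iterable[Any], device: str = "CPU") -> Dict[int, Any]:
    # Imported lazily so this module loads even where OpenVINO is not installed.
    import openvino as ov

    core = ov.Core()
    compiled = core.compile_model(model_path, device, {"PERFORMANCE_HINT": "THROUGHPUT"})

    # Ask the compiled model how many requests are needed to saturate the device.
    n_requests = compiled.get_property("OPTIMAL_NUMBER_OF_INFER_REQUESTS")

    collector = ResultCollector()
    queue = ov.AsyncInferQueue(compiled, n_requests)
    queue.set_callback(collector.on_done)

    for index, data in enumerate(inputs):
        # start_async waits only when every request in the pool is busy.
        queue.start_async({0: data}, userdata=index)

    queue.wait_all()  # Block until all outstanding requests have completed.
    return collector.results
```

Keying the results by the submission index keeps the application independent of completion order, since callbacks may fire in any order.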

## Multi-Device Execution

OpenVINO offers automatic, scalable multi-device inference. This is a simple, application-transparent way to improve throughput. There is no need to re-architect existing applications for explicit multi-device support: no explicit network loading to each device, no separate per-device queues, no additional logic to balance the inference requests between devices, etc. From the application's point of view, it communicates with a single device that internally handles the actual machinery. Just like with other throughput-oriented scenarios, there are two major prerequisites for optimal multi-device performance:

• Using the Asynchronous API and callbacks in particular

• Providing the multi-device (and hence the underlying devices) with enough data to crunch. As the inference requests are naturally independent data pieces, the multi-device performs load-balancing at the “requests” (outermost) level to minimize the scheduling overhead.

Notice that the resulting performance is usually a fraction of the “ideal” (plain sum) value when the devices compete for certain resources, such as the memory bandwidth that is shared between the CPU and iGPU.

Note

While the legacy approach of optimizing the parameters of each device separately still works, the OpenVINO performance hints make it possible to configure all devices (that are part of the specific multi-device configuration) at once.
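As a hedged illustration of configuring all devices at once, the sketch below compiles a model for a multi-device target with a single throughput hint. It assumes the `openvino` Python package and the `"MULTI:<device list>"` device-name syntax, so check the naming against the documentation for your OpenVINO version. `multi_device_name` and `compile_for_multi` are hypothetical helper names.

```python
from typing import Any, Sequence


def multi_device_name(devices: Sequence[str]) -> str:
    """Build a multi-device target string, e.g. "MULTI:GPU,CPU" (hypothetical helper)."""
    if not devices:
        raise ValueError("at least one device is required")
    return "MULTI:" + ",".join(devices)


def compile_for_multi(model_path: str, devices: Sequence[str]) -> Any:
    # Imported lazily so this module loads even where OpenVINO is not installed.
    import openvino as ov

    core = ov.Core()
    # One hint configures every device in the multi-device target.
    return core.compile_model(
        model_path,
        multi_device_name(devices),
        {"PERFORMANCE_HINT": "THROUGHPUT"},
    )
```

The returned compiled model is then used exactly like a single-device one, with asynchronous requests and callbacks supplying enough work for all underlying devices.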