OpenVINO™ Runtime Python API Advanced Inference

Warning

All of the mentioned methods are highly dependent on a specific hardware and software setup. Consider conducting your own experiments with various models and different input/output sizes. The methods presented here are not universal; they may or may not apply to your specific pipeline. Please consider all tradeoffs and avoid premature optimizations.

Direct Inference with CompiledModel

The CompiledModel class provides the __call__ method that runs a single synchronous inference using the given model. In addition to more compact code, all subsequent calls to CompiledModel.__call__ result in less overhead, as the object reuses the already created InferRequest.

# Calling CompiledModel creates and saves InferRequest object
results_0 = compiled_model({"input_0": data_0, "input_1": data_1})
# Second call reuses previously created InferRequest object
results_1 = compiled_model({"input_0": data_2, "input_1": data_3})
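
The snippet above assumes an already compiled model and prepared input arrays. A minimal setup sketch is shown below; the model path, device name, and input tensor names ("input_0", "input_1") are assumptions for illustration only. The request object used in later examples of this section can be created the same way.

import numpy as np
from openvino.runtime import Core

core = Core()
model = core.read_model("model.xml")  # hypothetical model path
compiled_model = core.compile_model(model, "CPU")

# Prepare random inputs matching the model's (assumed static) input shapes
shape_0 = tuple(compiled_model.input("input_0").shape)
shape_1 = tuple(compiled_model.input("input_1").shape)
data_0 = np.random.rand(*shape_0).astype(np.float32)
data_1 = np.random.rand(*shape_1).astype(np.float32)

# Explicit InferRequest, used by the examples below
request = compiled_model.create_infer_request()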

Shared Memory on Inputs

While using CompiledModel, InferRequest and AsyncInferQueue, the OpenVINO™ Runtime Python API provides an additional mode - “Shared Memory”. Specify the shared_memory flag to enable or disable this feature. The “Shared Memory” mode may be beneficial when inputs are large and copying data is considered an expensive operation. This feature creates shared Tensor instances with a “zero-copy” approach, reducing the overhead of setting inputs to a minimum. Example usage:

# Data can be shared
_ = compiled_model({"input_0": data_0, "input_1": data_1}, shared_memory=True)
_ = request.infer({"input_0": data_0, "input_1": data_1}, shared_memory=True)

Note

“Shared Memory” is enabled by default in CompiledModel.__call__. For other methods, such as InferRequest.infer or InferRequest.start_async, the flag must be set to True manually.

Warning

When data is shared, any modification of it may affect the inputs of the inference! Use this feature with caution, especially in multi-threaded/parallel code, where data can be modified outside of the function’s control flow.
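
If the input array may be written to while an asynchronous inference is still running, handing the request its own copy avoids the hazard at the cost of a single copy. A minimal sketch, assuming data_0 is a NumPy array that is reused elsewhere in the pipeline:

import numpy as np

# Give the request a private copy so later writes to data_0 cannot
# reach the in-flight inference through the shared buffer.
request.start_async({"input_0": np.copy(data_0), "input_1": data_1}, shared_memory=True)
data_0[:] = 0.0  # safe: the running inference does not observe this write
request.wait()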

Hiding Latency with Asynchronous Calls

Asynchronous calls make it possible to hide latency and optimize the overall runtime of a codebase. For example, InferRequest.start_async releases the GIL and provides a non-blocking call. It is beneficial to process other work while waiting for a compute-intensive inference to finish. Example usage:

import time

# Long-running function simulating other work in the pipeline
def run(time_in_sec):
    time.sleep(time_in_sec)

time_in_sec = 2.0

# No latency hiding: inference and the extra work run back-to-back
results = request.infer({"input_0": data_0, "input_1": data_1})[0]
run(time_in_sec)

# Hiding latency: the extra work runs while inference is in flight
request.start_async({"input_0": data_0, "input_1": data_1})
run(time_in_sec)
request.wait()
results = request.get_output_tensor(0).data  # Gather data from InferRequest

Note

It is up to the user/developer to optimize the flow in a codebase to benefit from potential parallelization.
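
A common way to benefit from such parallelization is AsyncInferQueue, mentioned earlier, which keeps a pool of InferRequest objects and reports results through a callback. A minimal sketch, assuming batches is a list of prepared input dictionaries and a pool of 4 requests (both are assumptions):

from openvino.runtime import AsyncInferQueue

infer_queue = AsyncInferQueue(compiled_model, 4)  # pool of 4 infer requests
collected = {}

def on_done(request, userdata):
    # Copy only the needed output; userdata identifies the job
    collected[userdata] = request.get_output_tensor(0).data.copy()

infer_queue.set_callback(on_done)

for i, inputs in enumerate(batches):  # "batches" is an assumed list of input dicts
    infer_queue.start_async(inputs, userdata=i)

infer_queue.wait_all()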

“Postponed Return” with Asynchronous Calls

“Postponed Return” is a technique that avoids the overhead of OVDict, which is always returned by synchronous calls. “Postponed Return” can be applied when:

  • only a part of the output data is required. For example, only one specific output is significant in a given pipeline step and all outputs are large, and therefore expensive to copy.

  • data is not required “now”. For example, it can be extracted later in the pipeline as part of latency hiding.

  • data return is not required at all. For example, models are being chained with the pure Tensor interface, as shown in the chaining sketch after the example below.

# Standard approach
results = request.infer({"input_0": data_0, "input_1": data_1})[0]

# "Postponed Return" approach
request.start_async({"input_0": data_0, "input_1": data_1})
request.wait()
results = request.get_output_tensor(0).data  # Gather data "on demand" from InferRequest
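
For the chaining case, the output Tensor of one request can be passed directly as an input of the next request, keeping the data in the pure Tensor interface with no OVDict involved. A minimal sketch, assuming request_a and request_b were created from two separately compiled models with matching shapes (the names are assumptions):

# Run the first model
request_a.start_async({"input_0": data_0})
request_a.wait()

# Feed its output Tensor straight into the second model
request_b.start_async({"input_0": request_a.get_output_tensor(0)})
request_b.wait()

results = request_b.get_output_tensor(0).data  # Extract data only when actually needed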