OpenVINO™ Runtime Python API Advanced Inference#
Warning
All of the methods described here depend heavily on the specific hardware and software setup. Consider conducting your own experiments with various models and different input/output sizes. The methods presented here are not universal; they may or may not apply to your specific pipeline. Please consider all tradeoffs and avoid premature optimization.
Direct Inference with CompiledModel#
The CompiledModel class provides the __call__ method that runs a single synchronous inference using the given model. In addition to more compact code, all subsequent calls to CompiledModel.__call__ will incur less overhead, as the object reuses the already created InferRequest.
# The first call to CompiledModel creates and caches an InferRequest object
results_0 = compiled_model({"input_0": data_0, "input_1": data_1})
# The second call reuses the previously created InferRequest object
results_1 = compiled_model({"input_0": data_2, "input_1": data_3})
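The snippets on this page assume that a compiled model, input data, and an InferRequest already exist. A minimal setup sketch, assuming a hypothetical "model.xml" file, the CPU device, and illustrative input shapes (adjust all of these to the actual model):
import numpy as np
import openvino as ov

core = ov.Core()
# "model.xml" and "CPU" are placeholders for this sketch
compiled_model = core.compile_model("model.xml", "CPU")

# Illustrative inputs; real names, shapes, and dtypes must match the model
data_0 = np.random.rand(1, 3, 224, 224).astype(np.float32)
data_1 = np.random.rand(1, 3, 224, 224).astype(np.float32)

# An explicit InferRequest, used by the asynchronous examples below
request = compiled_model.create_infer_request()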
Hiding Latency with Asynchronous Calls#
Asynchronous calls make it possible to hide latency and optimize the overall runtime of a codebase. For example, InferRequest.start_async releases the GIL and provides a non-blocking call. It is beneficial to process other tasks while waiting for compute-intensive inference to finish.
Example usage:
import time

# Long-running function simulating other work in the pipeline
def run(time_in_sec):
    time.sleep(time_in_sec)

time_in_sec = 2.0  # Example duration

# No latency hiding: inference blocks until results are ready
results = request.infer({"input_0": data_0, "input_1": data_1})[0]
run(time_in_sec)

# Hiding latency: other work proceeds while inference runs
request.start_async({"input_0": data_0, "input_1": data_1})
run(time_in_sec)
request.wait()
results = request.get_output_tensor(0).data  # Gather data from InferRequest
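To verify the benefit on a given setup, the wall-clock time of both approaches can be compared. A minimal sketch reusing request, data_0, data_1, run, and time_in_sec from above; the measured numbers are entirely model- and hardware-dependent:
import time

start = time.perf_counter()
results = request.infer({"input_0": data_0, "input_1": data_1})[0]
run(time_in_sec)  # Other work starts only after inference finishes
print(f"Synchronous total: {time.perf_counter() - start:.2f} s")

start = time.perf_counter()
request.start_async({"input_0": data_0, "input_1": data_1})
run(time_in_sec)  # Other work overlaps with inference
request.wait()
results = request.get_output_tensor(0).data
print(f"Asynchronous total: {time.perf_counter() - start:.2f} s")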
Note
It is up to the user/developer to optimize the flow in a codebase to benefit from potential parallelization.
“Postponed Return” with Asynchronous Calls#
“Postponed Return” is a practice to avoid the overhead of OVDict, which is always returned from synchronous calls. “Postponed Return” could be applied when:
- only part of the output data is required. For example, only one specific output is significant in a given pipeline step, and all outputs are large and thus expensive to copy.
- data is not required “now”. For example, it can be extracted later inside the pipeline as part of latency hiding.
- data return is not required at all. For example, models are being chained with the pure Tensor interface (see the sketch after the code example below).
# Standard approach
results = request.infer({"input_0": data_0, "input_1": data_1})[0]
# "Postponed Return" approach
request.start_async({"input_0": data_0, "input_1": data_1})
request.wait()
results = request.get_output_tensor(0).data # Gather data "on demand" from InferRequest
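To illustrate the last case from the list above, a sketch of chaining two models through the pure Tensor interface; request_a and request_b are hypothetical InferRequest objects for two models whose output and input shapes match:
# Run the first model; take its output as an ov.Tensor instead of an OVDict
request_a.start_async({"input_0": data_0})
request_a.wait()
intermediate = request_a.get_output_tensor(0)

# Feed the Tensor directly into the second model, skipping the OVDict copy
request_b.start_async({"input_0": intermediate})
request_b.wait()
final_results = request_b.get_output_tensor(0).data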