OpenVINO™ Inference Request#
To set up and run inference, use the ov::InferRequest class. It enables you to run
inference on different devices, either synchronously or asynchronously, and includes
methods to retrieve data from and set data for model inputs and outputs.
An ov::InferRequest is created from an ov::CompiledModel.
infer_request = compiled_model.create_infer_request()
auto infer_request = compiled_model.create_infer_request();
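For context, here is a minimal end-to-end sketch of how an infer request is obtained; the model path and device name below are placeholders, not part of the API:
#include <openvino/openvino.hpp>

ov::Core core;                                               // entry point for reading and compiling models
auto model = core.read_model("model.xml");                   // placeholder path to a model file
auto compiled_model = core.compile_model(model, "CPU");      // any available device name works here
auto infer_request = compiled_model.create_infer_request();  // the request used throughout this article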
Synchronous / asynchronous inference#
The synchronous mode is the basic mode of inference: each inference call blocks
application execution until the results are available. Use ov::InferRequest::infer
to execute in this mode.
infer_request.infer()
infer_request.infer();
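Because an infer request owns its input/output buffers, it can be reused across iterations. Below is a rough sketch of a synchronous processing loop; num_frames, fill_input, and process are hypothetical placeholders, not OpenVINO API:
for (size_t i = 0; i < num_frames; ++i) {
    fill_input(infer_request.get_input_tensor(), i);  // hypothetical helper: write frame i into the input
    infer_request.infer();                            // blocks until results for frame i are ready
    process(infer_request.get_output_tensor());       // hypothetical helper: consume the result
}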
The asynchronous mode may improve application performance, as it enables the application
to continue working on the host while the accelerator runs inference. Use
ov::InferRequest::start_async to execute in this mode.
infer_request.start_async()
infer_request.start_async();
The asynchronous mode supports two ways for the application to wait for inference results. Both are thread-safe.
ov::InferRequest::wait_for - blocks until the specified time has passed or the result becomes available, whichever comes first.
infer_request.wait_for(10)
infer_request.wait_for(std::chrono::milliseconds(10));
ov::InferRequest::wait - waits until inference results become available.
infer_request.wait()
infer_request.wait();
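To show how these pieces combine, here is a sketch of overlapping host work with inference; do_other_work is a hypothetical placeholder:
infer_request.start_async();                      // returns immediately; the device keeps working
do_other_work();                                  // hypothetical host-side work overlapping inference
infer_request.wait();                             // block until the result is available
auto output = infer_request.get_output_tensor();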
Keep in mind that the completion order cannot be guaranteed when processing inference requests simultaneously, which may complicate the application logic. Therefore, for multi-request scenarios, consider also the ov::InferRequest::set_callback method, which triggers a callback when the request completes. Note that to avoid cyclic references in the callback, a weak reference to the infer request should be used (ov::InferRequest*, ov::InferRequest&, std::weak_ptr<ov::InferRequest>, etc.).
def callback(request, _):
    request.start_async()

callbacks_info = {}
callbacks_info["finished"] = 0
infer_request.set_callback([&](std::exception_ptr ex_ptr) {
    if (!ex_ptr) {
        // All done. Output data can be processed.
        // You can fill the input data and run inference one more time:
        infer_request.start_async();
    } else {
        // Something went wrong; you can analyze the exception_ptr.
    }
});
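As an illustration of the multi-request scenario described above, here is a sketch that runs several requests concurrently and counts completions; the request count and the atomic counter are assumptions for the example:
#include <atomic>
#include <vector>

std::atomic<int> finished{0};
std::vector<ov::InferRequest> requests;
for (int i = 0; i < 4; ++i)
    requests.push_back(compiled_model.create_infer_request());
for (auto& request : requests) {
    request.set_callback([&finished](std::exception_ptr ex_ptr) {
        if (!ex_ptr)
            ++finished;  // requests may complete in any order
    });
    request.start_async();
}
for (auto& request : requests)
    request.wait();  // all callbacks have fired once every wait() returns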
If you want to abort a running inference request, use the ov::InferRequest::cancel method.
infer_request.cancel()
infer_request.cancel();
For more information, see the Classification Async Sample, as well as the articles on synchronous and asynchronous inference requests.
Working with Input and Output tensors#
ov::InferRequest enables you to get input/output tensors by tensor name, index, or port.
Note that similar logic applies to retrieving data with the ov::Model methods.
get_input_tensor, set_input_tensor, get_output_tensor, set_output_tensor
for a model with only one input/output, no arguments are required:
input_tensor = infer_request.get_input_tensor()
output_tensor = infer_request.get_output_tensor()
auto input_tensor = infer_request.get_input_tensor();
auto output_tensor = infer_request.get_output_tensor();
to select a specific input/output tensor, provide its index number as a parameter:
input_tensor = infer_request.get_input_tensor(0)
output_tensor = infer_request.get_output_tensor(0)
auto input_tensor = infer_request.get_input_tensor(0);
auto output_tensor = infer_request.get_output_tensor(1);
ov::InferRequest::get_tensor, ov::InferRequest::set_tensor
to select an input/output tensor by tensor name, provide the name as a parameter:
tensor1 = infer_request.get_tensor("result")
tensor2 = ov.Tensor(ov.Type.f32, [1, 3, 32, 32])
infer_request.set_tensor(input_tensor_name, tensor2)
auto tensor1 = infer_request.get_tensor("tensor_name1");
ov::Tensor tensor2;
infer_request.set_tensor("tensor_name2", tensor2);
to select an input/output tensor by port:
input_port = model.input(0)
output_port = model.output("tensor_name")
input_tensor = ov.Tensor(ov.Type.f32, [1, 3, 32, 32])
infer_request.set_tensor(input_port, input_tensor)
output_tensor = infer_request.get_tensor(output_port)
auto input_port = model->input(0);
auto output_port = model->output("tensor_name");
ov::Tensor input_tensor;
infer_request.set_tensor(input_port, input_tensor);
auto output_tensor = infer_request.get_tensor(output_port);
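Beyond wiring tensors, you typically need to read and write their contents. Here is a sketch of direct data access through ov::Tensor::data, assuming a model with float32 input and output:
auto input_tensor = infer_request.get_input_tensor();
float* input_data = input_tensor.data<float>();          // raw, writable view of the input buffer
for (size_t i = 0; i < input_tensor.get_size(); ++i)
    input_data[i] = 0.0f;                                // placeholder values; write real input here
infer_request.infer();
auto output_tensor = infer_request.get_output_tensor();
const float* output_data = output_tensor.data<float>();  // read the results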
Infer Request Use Scenarios#
Cascade of Models#
ov::InferRequest can be used to organize a cascade of models, with a separate infer
request for each model. You can get the output tensor from the first request, using
ov::InferRequest::get_tensor, and set it as input for the second request, using
ov::InferRequest::set_tensor. Keep in mind that a tensor shared across compiled models
can be overwritten by the first model if the first infer request is run again before the
second has started; one way to decouple the requests is sketched after the code below.
output = infer_request1.get_output_tensor(0)
infer_request2.set_input_tensor(0, output)
auto output = infer_request1.get_output_tensor(0);
infer_request2.set_input_tensor(0, output);
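To avoid the overwrite hazard mentioned above, one option is to give the second request its own copy of the data instead of sharing the tensor. Below is a sketch using ov::Tensor::copy_to; the deep copy decouples the two requests at the cost of extra memory:
auto output = infer_request1.get_output_tensor(0);
ov::Tensor detached(output.get_element_type(), output.get_shape());
output.copy_to(detached);                        // deep copy; infer_request1 can safely run again
infer_request2.set_input_tensor(0, detached);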
Using Remote Tensors#
By using ov::RemoteContext, you can create a remote tensor to work with remote device memory.
# NOT SUPPORTED
ov::RemoteContext context = core.get_default_context("GPU");
auto input_port = compiled_model.input("tensor_name");
// Allocate a tensor in device memory matching the port's type and shape,
// then bind it to the request (completion of the truncated snippet above):
ov::RemoteTensor remote_tensor = context.create_tensor(input_port.get_element_type(), input_port.get_shape());
infer_request.set_tensor(input_port, remote_tensor);