Asynchronous Inference Request

Asynchronous Inference Request runs an inference pipeline asynchronously in one or several task executors, depending on the structure of the device pipeline. OpenVINO Runtime Plugin API provides the base InferenceEngine::AsyncInferRequestThreadSafeDefault class:

  • The class has the _pipeline field of type std::vector<std::pair<ITaskExecutor::Ptr, Task>>, which contains pairs of an executor and the task it executes.

  • All executors are passed as arguments to the class constructor. They are already in the running state and ready to run tasks.

  • The class has the InferenceEngine::AsyncInferRequestThreadSafeDefault::StopAndWait method, which is called in the class destructor to wait for _pipeline to finish. The method does not stop the task executors. They remain in the running state, because they belong to the executable network instance and are not destroyed.
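
The idea of a pipeline made of (executor, task) pairs can be sketched without any OpenVINO headers. The snippet below is a minimal illustration only: Executor, InlineExecutor, Pipeline, runStage and demoPipeline are hypothetical stand-ins and not the OpenVINO types, and the inline executor runs tasks on the calling thread instead of a worker thread. It shows each stage being handed to its own executor, with the next stage scheduled once the current one finishes.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for Task and ITaskExecutor; not the OpenVINO types.
using Task = std::function<void()>;

struct Executor {
    virtual ~Executor() = default;
    virtual void run(Task task) = 0;
};

// Runs the task immediately on the calling thread and records which executor was used.
struct InlineExecutor : Executor {
    InlineExecutor(std::string name, std::vector<std::string>* log) : _name(std::move(name)), _log(log) {}
    void run(Task task) override {
        _log->push_back("[" + _name + "]");
        task();
    }
    std::string _name;
    std::vector<std::string>* _log;
};

using Pipeline = std::vector<std::pair<std::shared_ptr<Executor>, Task>>;

// Submits stage `index` to its executor; when the stage finishes, it schedules the next one.
// Capturing the pipeline by reference is safe here only because execution is inline.
void runStage(const Pipeline& pipeline, std::size_t index) {
    assert(index <= pipeline.size());
    if (index == pipeline.size()) {
        return;  // all stages are done
    }
    pipeline[index].first->run([&pipeline, index] {
        pipeline[index].second();
        runStage(pipeline, index + 1);
    });
}

std::vector<std::string> demoPipeline() {
    std::vector<std::string> log;
    auto cpu = std::make_shared<InlineExecutor>("cpu", &log);
    auto wait = std::make_shared<InlineExecutor>("wait", &log);
    Pipeline pipeline = {
        {cpu, [&log] { log.push_back("preprocess+start"); }},
        {wait, [&log] { log.push_back("wait"); }},
        {cpu, [&log] { log.push_back("postprocess"); }},
    };
    runStage(pipeline, 0);
    return log;
}
```

Calling demoPipeline() returns the executor names interleaved with the stage names, in pipeline order.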


A plugin derives its custom asynchronous inference request implementation from this base class:

class TemplateAsyncInferRequest : public InferenceEngine::AsyncInferRequestThreadSafeDefault {
public:
    TemplateAsyncInferRequest(const TemplateInferRequest::Ptr& inferRequest,
                              const InferenceEngine::ITaskExecutor::Ptr& taskExecutor,
                              const InferenceEngine::ITaskExecutor::Ptr& waitExecutor,
                              const InferenceEngine::ITaskExecutor::Ptr& callbackExecutor);

    ~TemplateAsyncInferRequest();

private:
    TemplateInferRequest::Ptr _inferRequest;
    InferenceEngine::ITaskExecutor::Ptr _waitExecutor;
};

Class Fields

  • _inferRequest - a reference to the synchronous inference request implementation. Its methods are reused in the AsyncInferRequest constructor to define a device pipeline.

  • _waitExecutor - a task executor that waits for the device to report that its tasks have completed.


If a plugin can work with several instances of a device, _waitExecutor must be device-specific. Otherwise, having a single task executor for several devices does not allow them to work in parallel.
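
One way to keep wait executors device-specific is to look them up by device ID. The following is only a sketch under stated assumptions: WaitExecutor and WaitExecutorRegistry are hypothetical names, the executor is reduced to a tag struct, and the device IDs used with it are made up. It shows the bookkeeping only, not how a real plugin creates its executors.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <mutex>
#include <string>

// Hypothetical stand-in for a task executor bound to one device instance.
struct WaitExecutor {
    std::string deviceId;
};

// Hands out one wait executor per device ID, creating it on first use.
class WaitExecutorRegistry {
public:
    std::shared_ptr<WaitExecutor> get(const std::string& deviceId) {
        assert(!deviceId.empty());
        std::lock_guard<std::mutex> lock(_mutex);
        auto& slot = _executors[deviceId];
        if (!slot) {
            slot = std::make_shared<WaitExecutor>(WaitExecutor{deviceId});
        }
        return slot;
    }

private:
    std::mutex _mutex;
    std::map<std::string, std::shared_ptr<WaitExecutor>> _executors;
};
```

Requests targeting the same device share one wait executor, while requests for different devices get separate ones, so waiting on one device does not hold up another.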

The main goal of the AsyncInferRequest constructor is to define the device pipeline, _pipeline. The example below demonstrates _pipeline creation with the following stages:

  • inferPreprocess is a CPU compute task.

  • startPipeline is a lightweight CPU task that submits tasks to a remote device.

  • waitPipeline is a CPU non-compute task that waits for a response from a remote device.

  • inferPostprocess is a CPU compute task.

TemplateAsyncInferRequest::TemplateAsyncInferRequest(const TemplateInferRequest::Ptr& inferRequest,
                                                     const InferenceEngine::ITaskExecutor::Ptr& cpuTaskExecutor,
                                                     const InferenceEngine::ITaskExecutor::Ptr& waitExecutor,
                                                     const InferenceEngine::ITaskExecutor::Ptr& callbackExecutor)
    : AsyncInferRequestThreadSafeDefault(inferRequest, cpuTaskExecutor, callbackExecutor),
      _inferRequest(inferRequest),
      _waitExecutor(waitExecutor) {
    // The current implementation has CPU-only tasks and does not need two executors,
    // so a single-stage pipeline is created by default.
    // This stage executes InferRequest::Infer() using cpuTaskExecutor.
    // If a remote asynchronous device is used, the pipeline can be split into tasks executed by cpuTaskExecutor
    // and waiting tasks. Waiting tasks can block the execution thread, so they use separate threads from another executor.
    constexpr const auto remoteDevice = false;

    if (remoteDevice) {
        _pipeline = {{cpuTaskExecutor,
                      [this] {
                          OV_ITT_SCOPED_TASK(itt::domains::TemplatePlugin,
                                             "TemplateAsyncInferRequest::PreprocessingAndStartPipeline");
                          _inferRequest->inferPreprocess();
                          _inferRequest->startPipeline();
                      }},
                     {_waitExecutor,
                      [this] {
                          OV_ITT_SCOPED_TASK(itt::domains::TemplatePlugin, "TemplateAsyncInferRequest::WaitPipeline");
                          _inferRequest->waitPipeline();
                      }},
                     {cpuTaskExecutor, [this] {
                          OV_ITT_SCOPED_TASK(itt::domains::TemplatePlugin, "TemplateAsyncInferRequest::Postprocessing");
                          _inferRequest->inferPostprocess();
                      }}};
    }
}

The stages are distributed among two task executors in the following way:

  • inferPreprocess and startPipeline are combined into a single task and run on _requestExecutor (the cpuTaskExecutor passed to the base class), which computes CPU tasks.

  • waitPipeline is sent to _waitExecutor, which works with the device.

  • inferPostprocess runs on cpuTaskExecutor again.

  • You need at least two executors to overlap the compute tasks of the CPU with those of the remote device the plugin works with. Otherwise, CPU and device tasks are executed serially, one after another.
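
The effect of a second executor can be illustrated with a small standalone sketch. Everything here is hypothetical and independent of OpenVINO: SingleThreadExecutor is a simplified one-worker queue, and demoOverlap models request 1 blocking in its wait stage until request 2 has been preprocessed. With two executors the preprocessing proceeds while the wait stage is blocked. If both tasks were queued on one single-threaded executor in this sketch, the blocked wait task would occupy the only worker and the preprocessing would never start.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical one-worker executor: tasks run in submission order on a dedicated thread.
class SingleThreadExecutor {
public:
    SingleThreadExecutor() : _worker([this] { loop(); }) {}
    ~SingleThreadExecutor() {
        {
            std::lock_guard<std::mutex> lock(_mutex);
            _stop = true;
        }
        _cv.notify_one();
        _worker.join();  // remaining queued tasks are drained before the thread exits
    }
    void run(std::function<void()> task) {
        assert(task);
        {
            std::lock_guard<std::mutex> lock(_mutex);
            _tasks.push(std::move(task));
        }
        _cv.notify_one();
    }

private:
    void loop() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(_mutex);
                _cv.wait(lock, [this] { return _stop || !_tasks.empty(); });
                if (_tasks.empty()) {
                    return;  // stop was requested and the queue is empty
                }
                task = std::move(_tasks.front());
                _tasks.pop();
            }
            task();
        }
    }
    std::mutex _mutex;
    std::condition_variable _cv;
    std::queue<std::function<void()>> _tasks;
    bool _stop = false;
    std::thread _worker;  // declared last so the other members exist before the thread starts
};

// Returns true if request 2 was preprocessed while request 1 was still in its wait stage.
bool demoOverlap() {
    std::promise<void> secondPreprocessDone;
    std::promise<void> firstWaitDone;
    auto secondPreprocessFuture = secondPreprocessDone.get_future();
    auto firstWaitFuture = firstWaitDone.get_future();
    SingleThreadExecutor cpuExecutor;   // executors are destroyed (joined) before the promises
    SingleThreadExecutor waitExecutor;

    // Request 1: the wait stage blocks, modelling a pending response from the device.
    waitExecutor.run([&] {
        secondPreprocessFuture.wait();
        firstWaitDone.set_value();
    });
    // Request 2: preprocessing runs on the CPU executor in the meantime.
    cpuExecutor.run([&] { secondPreprocessDone.set_value(); });

    return firstWaitFuture.wait_for(std::chrono::seconds(5)) == std::future_status::ready;
}
```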


callbackExecutor is also passed to the constructor and it is used in the base InferenceEngine::AsyncInferRequestThreadSafeDefault class, which adds a pair of callbackExecutor and a callback function set by the user to the end of the pipeline.
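
As a rough illustration of the callback stage, the sketch below appends a user callback to an existing list of stages. The names Stage, appendCallbackStage and demoCallbackOrder are hypothetical, executors are reduced to plain names, and skipping the stage when no callback is set is a choice made for this sketch rather than a statement about the base class.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

using Task = std::function<void()>;
// Hypothetical stage: an executor name paired with the task it runs.
using Stage = std::pair<std::string, Task>;

// Returns the device pipeline with the user callback added as the final stage.
std::vector<Stage> appendCallbackStage(std::vector<Stage> pipeline,
                                       const std::string& callbackExecutor,
                                       Task userCallback) {
    assert(!callbackExecutor.empty());
    if (userCallback) {
        pipeline.emplace_back(callbackExecutor, std::move(userCallback));
    }
    return pipeline;
}

std::vector<std::string> demoCallbackOrder() {
    std::vector<std::string> order;
    std::vector<Stage> devicePipeline = {
        {"cpu", [&order] { order.push_back("preprocess+start"); }},
        {"wait", [&order] { order.push_back("wait"); }},
        {"cpu", [&order] { order.push_back("postprocess"); }},
    };
    auto full = appendCallbackStage(devicePipeline, "callback",
                                    [&order] { order.push_back("user callback"); });
    for (auto& stage : full) {
        stage.second();  // stages run in order here; a real pipeline would dispatch to executors
    }
    return order;
}
```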

Inference request stages are also profiled using OV_ITT_SCOPED_TASK, as in the code above, which shows in the Intel® VTune™ Profiler tool how pipelines of multiple asynchronous inference requests run in parallel.

In the asynchronous request destructor, it is necessary to wait for a pipeline to finish. It can be done using the InferenceEngine::AsyncInferRequestThreadSafeDefault::StopAndWait method of the base class.

TemplateAsyncInferRequest::~TemplateAsyncInferRequest() {
    InferenceEngine::AsyncInferRequestThreadSafeDefault::StopAndWait();
}
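
The reason for waiting in the destructor is that the pipeline tasks capture this, so they must not outlive the request object. A minimal standalone sketch of the same pattern follows. It assumes a hypothetical AsyncRequest class that runs its three stages on a plain std::thread instead of task executors, and that startAsync is called at most once per object.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical request whose stages run on a background thread and touch members via `this`.
class AsyncRequest {
public:
    explicit AsyncRequest(std::atomic<int>& completedStages) : _completedStages(completedStages) {}

    // The destructor blocks until the background work is done, so `this` stays valid for it.
    ~AsyncRequest() {
        stopAndWait();
    }

    void startAsync() {
        assert(!_worker.joinable());  // this sketch supports a single start per object
        _worker = std::thread([this] {
            for (int stage = 0; stage < 3; ++stage) {
                std::this_thread::sleep_for(std::chrono::milliseconds(10));
                ++_completedStages;
            }
        });
    }

private:
    void stopAndWait() {
        if (_worker.joinable()) {
            _worker.join();
        }
    }

    std::atomic<int>& _completedStages;
    std::thread _worker;
};

int demoDestructorWaits() {
    std::atomic<int> completedStages{0};
    {
        AsyncRequest request(completedStages);
        request.startAsync();
    }  // ~AsyncRequest() returns only after all three stages have run
    return completedStages.load();
}
```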