Multi-device execution

To run inference on multiple devices, you can choose either of the following ways:

  • Use the CUMULATIVE_THROUGHPUT option of the Automatic Device Selection mode. This way, you can use all available devices in the system without the need to specify them.

  • Use the Multi-Device execution mode. It behaves the same way as the CUMULATIVE_THROUGHPUT option of the Automatic Device Selection mode. The difference is that the devices must be set explicitly, either as <device list> or through ov::device::priorities.

How MULTI Works

The Multi-Device execution mode, or MULTI for short, acts as a “virtual” or a “proxy” device, which does not bind to a specific type of hardware. Instead, it assigns available computing devices to particular inference requests, which are then executed in parallel.

The potential gains from using Multi-Device execution are:

  • improved throughput from using multiple devices at once,

  • increase in performance stability due to multiple devices sharing inference workload.

Importantly, the Multi-Device mode does not change the application logic, so it does not require you to explicitly compile the model on every device or create and balance inference requests. To the application, MULTI looks like a typical device, while internally it handles the actual hardware.

Note that the performance increase in this mode comes from utilizing multiple devices at once. This means that you need to provide the devices with enough inference requests to keep them busy, otherwise you will not benefit much from using MULTI.
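To illustrate why the number of in-flight requests matters, here is a toy model in plain Python. It is not MULTI's actual scheduler: the function name, the per-device slot counts, and the "first free slot in priority order" rule are all assumptions made for illustration.

```python
from collections import Counter


def assign_requests(num_requests, slots_per_device):
    """Toy model: each in-flight request takes a free slot on the
    highest-priority device that still has one.

    slots_per_device maps device name -> number of parallel slots
    (hypothetical values); dict order is the priority order.
    """
    free = dict(slots_per_device)
    usage = Counter()
    for _ in range(num_requests):
        for device, remaining in free.items():
            if remaining > 0:
                free[device] -= 1
                usage[device] += 1
                break
        # If no slot is free, the request simply waits in this toy model.
    return dict(usage)


# With only two requests in flight, the second device stays idle.
print(assign_requests(2, {"GPU": 4, "CPU": 2}))  # {'GPU': 2}
# With six requests in flight, both devices are kept busy.
print(assign_requests(6, {"GPU": 4, "CPU": 2}))  # {'GPU': 4, 'CPU': 2}
```

In this simplified picture, adding a second device only pays off once the application submits more concurrent requests than the first device can absorb.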

Using the Multi-Device Mode

Following the OpenVINO™ naming convention, the Multi-Device mode is assigned the label of “MULTI.” The only configuration option available for it is a prioritized list of devices to use:

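+------------------------+----------------------------+------------------------------------------------+
| Property               | Property values            | Description                                    |
+------------------------+----------------------------+------------------------------------------------+
| <device list>          | MULTI: <device names>      | Specifies the devices available for selection. |
|                        | comma-separated, no spaces | The device sequence will be taken as priority  |
+------------------------+----------------------------+ from high to low.                              |
| ov::device::priorities | device names               | Priorities can be set directly as a string.    |
|                        | comma-separated, no spaces |                                                |
+------------------------+----------------------------+------------------------------------------------+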
Specifying the device list explicitly is required by MULTI, as it defines the devices available for inference and sets their priorities.

Note that OpenVINO™ Runtime enables you to use “GPU” as an alias for “GPU.0” in function calls. More details on enumerating devices can be found in Working with devices.
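As a small illustration of the "comma-separated, no spaces" rule, the sketch below builds the MULTI device string from a Python list ordered from highest to lowest priority. The helper name multi_device_name is hypothetical and is not part of the OpenVINO API.

```python
def multi_device_name(devices):
    """Return 'MULTI:<dev1>,<dev2>,...' for devices listed from
    highest to lowest priority (hypothetical helper, not an OpenVINO call)."""
    names = [d.strip() for d in devices]
    if not names or any(not n for n in names):
        raise ValueError("MULTI needs at least one non-empty device name")
    if any("," in n or " " in n for n in names):
        raise ValueError("device names must not contain commas or spaces")
    return "MULTI:" + ",".join(names)


print(multi_device_name(["GPU", "CPU"]))      # MULTI:GPU,CPU
print(multi_device_name(["GPU.1", "GPU.0"]))  # MULTI:GPU.1,GPU.0
```

The resulting string can be passed as the device name when compiling a model, and the part after "MULTI:" can serve as the priorities value.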

The following commands are accepted by the API:

    from openvino.runtime import Core

    core = Core()
    model_path = "sample.xml"  # placeholder path to your model

    # Read a model in IR, PaddlePaddle, or ONNX format:
    model = core.read_model(model_path)
    
    # Option 1
    # Pre-configure MULTI globally with explicitly defined devices,
    # and compile the model on MULTI using the newly specified default device list.
    core.set_property(device_name="MULTI", properties={"MULTI_DEVICE_PRIORITIES":"GPU,CPU"})
    compiled_model = core.compile_model(model=model, device_name="MULTI")

    # Option 2
    # Specify the devices to be used by MULTI explicitly at compilation.
    # The following lines are equivalent:
    compiled_model = core.compile_model(model=model, device_name="MULTI:GPU,CPU")
    compiled_model = core.compile_model(model=model, device_name="MULTI", config={"MULTI_DEVICE_PRIORITIES": "GPU,CPU"}) 

ov::Core core;

// Read a model in IR, PaddlePaddle, or ONNX format:
std::shared_ptr<ov::Model> model = core.read_model("sample.xml");

// Option 1
// Pre-configure MULTI globally with explicitly defined devices,
// and compile the model on MULTI using the newly specified default device list.
core.set_property("MULTI", ov::device::priorities("GPU.1,GPU.0")); 
ov::CompiledModel compileModel0 = core.compile_model(model, "MULTI");

// Option 2
// Specify the devices to be used by MULTI explicitly at compilation.
// The following lines are equivalent:
ov::CompiledModel compileModel1 = core.compile_model(model, "MULTI:GPU.1,GPU.0");
ov::CompiledModel compileModel2 = core.compile_model(model, "MULTI", ov::device::priorities("GPU.1,GPU.0"));



To check what devices are present in the system, you can use the Device API. For information on how to do it, check Query device properties and configuration.
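A minimal sketch of using the device list in practice: keep only the preferred devices that the system actually reports, then build the MULTI string from them. It assumes the openvino.runtime Python package and its Core.available_devices property; the helper names select_present and multi_target are hypothetical.

```python
def select_present(preferred, available):
    """Keep the preferred devices (highest priority first) that appear in
    the list of available devices. 'GPU' and 'GPU.0' are treated as aliases."""
    bases = {name.split(".")[0] for name in available}
    present = []
    for device in preferred:
        is_listed = device in available
        alias_of_indexed = device in bases  # e.g. "GPU" when "GPU.0" is listed
        indexed_alias = device.endswith(".0") and device[:-2] in available
        if is_listed or alias_of_indexed or indexed_alias:
            present.append(device)
    return present


def multi_target(preferred):
    """Return a MULTI device string restricted to devices found on this machine."""
    # Imported here so select_present() can be used without OpenVINO installed.
    from openvino.runtime import Core

    present = select_present(preferred, Core().available_devices)
    if not present:
        raise RuntimeError("none of the preferred devices is available")
    return "MULTI:" + ",".join(present)


# Example with a hard-coded device list instead of a real query:
print(select_present(["GPU", "NPU", "CPU"], ["CPU", "GPU.0", "GPU.1"]))  # ['GPU', 'CPU']
```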

Configuring Individual Devices and Creating the Multi-Device On Top

As mentioned previously, inference with MULTI can be set up by configuring the individual devices first and then creating the “MULTI” device on top of them. This approach may be worth considering for performance reasons.

    from openvino.runtime import Core

    core = Core()
    model_path = "sample.xml"  # placeholder path to your model

    # Read a model in IR, PaddlePaddle, or ONNX format:
    model = core.read_model(model_path)

    # When compiling the model on MULTI, configure CPU and GPU
    # (devices, priorities, and per-device configurations are all passed to compile_model()):
    compiled_model = core.compile_model(model=model, device_name="MULTI:GPU,CPU", config={"CPU":"NUM_STREAMS 4", "GPU":"NUM_STREAMS 8"})

    # Optionally, query the optimal number of requests:
    nireq = compiled_model.get_property("OPTIMAL_NUMBER_OF_INFER_REQUESTS")

ov::Core core;

// Read a network in IR, PaddlePaddle, or ONNX format:
std::shared_ptr<ov::Model> model = core.read_model("sample.xml");

// Per-device settings; left empty here, fill in the properties you need.
ov::AnyMap gpu_config = {};
ov::AnyMap cpu_config = {};

// When compiling the model on MULTI, configure GPU and CPU
// (devices, priorities, and device configurations):
ov::CompiledModel compileModel = core.compile_model(model, "MULTI",
    ov::device::priorities("GPU", "CPU"),
    ov::device::properties("GPU", gpu_config),
    ov::device::properties("CPU", cpu_config));

// Optionally, query the optimal number of requests:
uint32_t nireq = compileModel.get_property(ov::optimal_number_of_infer_requests);

Alternatively, you can combine all the individual device settings into a single config file and load it for MULTI to parse. See the code example in the next section.

Querying the Optimal Number of Inference Requests

When using MULTI, you do not need to sum the optimal number of requests over the included devices yourself. You can query it directly from the compiled model, using the ov::optimal_number_of_infer_requests property:

ov::Core core;

// Read a model and compile it on MULTI
ov::CompiledModel compileModel = core.compile_model("sample.xml", "MULTI:GPU,CPU");

// query the optimal number of requests
uint32_t nireq = compileModel.get_property(ov::optimal_number_of_infer_requests);
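The queried value is typically used to size a pool of asynchronous requests. The following Python sketch assumes the openvino.runtime package with its AsyncInferQueue class and a model with a single input and a single output; the helper names infer_all and run_on_multi are hypothetical.

```python
def infer_all(queue, inputs):
    """Submit every input to an AsyncInferQueue-like object and
    return the outputs in the same order as the inputs."""
    results = [None] * len(inputs)

    def on_done(request, index):
        # Copy the data: the request and its output tensor are reused for later jobs.
        results[index] = request.get_output_tensor(0).data.copy()

    queue.set_callback(on_done)
    for index, data in enumerate(inputs):
        # start_async waits only while every request in the pool is busy.
        queue.start_async({0: data}, userdata=index)
    queue.wait_all()
    return results


def run_on_multi(model_path, inputs, device="MULTI:GPU,CPU"):
    """Compile on MULTI, size the request pool from the queried property, and run."""
    # Imported here so infer_all() can be exercised without OpenVINO installed.
    from openvino.runtime import AsyncInferQueue, Core

    compiled_model = Core().compile_model(model_path, device)
    nireq = compiled_model.get_property("OPTIMAL_NUMBER_OF_INFER_REQUESTS")
    return infer_all(AsyncInferQueue(compiled_model, nireq), inputs)
```

Calling run_on_multi("sample.xml", list_of_numpy_arrays) would then keep up to nireq requests in flight across the listed devices.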

Using the Multi-Device with OpenVINO Samples and Benchmarking Performance

To see how the Multi-Device execution is used in practice and to test its performance, take a look at OpenVINO’s Benchmark Application, which presents the optimal performance of the plugin without the need for additional settings, like the number of requests or CPU threads. Here is an example command to evaluate the performance of CPU + GPU:

./benchmark_app -d MULTI:CPU,GPU -m <model> -i <input> -niter 1000

For more information, refer to the Benchmark Application instructions for the C++ or Python version.

Note

You can keep using the FP16 IR without converting it to FP32, even if some of the listed devices do not support it. The conversion will be done automatically for you.

No demos are fully optimized for MULTI yet, in terms of supporting the ov::optimal_number_of_infer_requests property, using the GPU streams/throttling, and so on.

Performance Considerations for the Multi-Device Execution

For best performance when using the MULTI execution mode, consider the following recommendations:

  • MULTI usually performs best when the fastest device is specified first in the device candidate list. This is particularly important when the request-level parallelism is not sufficient (e.g. the number of requests is not enough to saturate all devices).

  • Just like with any throughput-oriented execution mode, it is highly recommended to query the optimal number of inference requests directly from the instance of ov::CompiledModel. Refer to the code of the previously mentioned benchmark_app for more details.

  • Execution on certain device combinations, for example CPU+GPU, performs better with certain knobs. Refer to the benchmark_app code for details. One specific example is disabling GPU driver polling, which in turn requires multiple GPU streams to balance out slower communication of inference completion from the device to the host.

  • The MULTI logic always attempts to save on copying data between the device-agnostic, user-facing inference requests and the device-specific ‘worker’ requests that are actually scheduled behind the scenes. To facilitate the copy savings, it is recommended to run the requests in the order in which they were created.

  • While the performance of accelerators combines well with MULTI, the CPU+GPU execution may introduce certain performance issues. This is due to the devices sharing some resources, like power or bandwidth. Enabling the GPU throttling hint, which saves a CPU thread for CPU inference, is an example of a recommended solution addressing this issue.
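The first recommendation above (fastest device first) can be sketched as a small helper that orders devices by measured throughput before building the MULTI string. The helper name multi_by_throughput is hypothetical, and the numbers are made up for illustration; real values would come from your own measurements, for example with benchmark_app on each device separately.

```python
def multi_by_throughput(fps_by_device):
    """Build a MULTI device string with the fastest device first.

    fps_by_device maps device name -> measured throughput (e.g. frames per second).
    """
    if not fps_by_device:
        raise ValueError("at least one measured device is required")
    ranked = sorted(fps_by_device, key=fps_by_device.get, reverse=True)
    return "MULTI:" + ",".join(ranked)


# Hypothetical measurements, for illustration only.
print(multi_by_throughput({"CPU": 120.0, "GPU": 450.0}))  # MULTI:GPU,CPU
```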