Heterogeneous execution

The Heterogeneous Execution mode, or HETERO for short, acts as a “virtual” or a “proxy” device, which does not bind to a specific type of hardware. Instead, it executes inference of one model on several devices. Its purpose is to utilize all available hardware more efficiently during one inference. This means that accelerators are used to process the heaviest parts of the model, while fallback devices, like the CPU, execute operations not supported by accelerators.

Compiling a model in the Heterogeneous mode involves splitting it into subgraphs. Each subgraph is compiled for a dedicated device, which produces multiple ov::CompiledModel objects. The objects are connected via automatically allocated intermediate tensors.

Importantly, the model division is performed according to pre-defined affinities between hardware and operations. Every set of connected operations with the same affinity becomes a dedicated subgraph. Setting these affinities needs to be done as a separate step (ov::Core::query_model is used internally by HETERO), as described below.
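The grouping rule above can be illustrated with a small, self-contained sketch. It is not the HETERO implementation: the function name split_by_affinity, the toy operation names, and the edge list are all hypothetical, and the sketch ignores details a real plugin has to handle (for example, avoiding cyclic dependencies between subgraphs). It only shows that connected operations sharing an affinity end up in one group.

```python
from collections import defaultdict


def split_by_affinity(affinity, edges):
    """Group connected operations that share an affinity (union-find)."""
    parent = {op: op for op in affinity}

    def find(op):
        # Follow parent links to the group representative, compressing the path.
        while parent[op] != op:
            parent[op] = parent[parent[op]]
            op = parent[op]
        return op

    for src, dst in edges:
        # Only merge neighbours that are assigned to the same device.
        if affinity[src] == affinity[dst]:
            parent[find(src)] = find(dst)

    groups = defaultdict(list)
    for op in affinity:  # dict order keeps the output deterministic
        groups[find(op)].append(op)
    return [(affinity[ops[0]], ops) for ops in groups.values()]


# Hypothetical toy model: one operation in the middle is assigned to the CPU.
affinity = {"conv": "GPU", "relu": "GPU", "custom": "CPU", "fc": "GPU"}
edges = [("conv", "relu"), ("relu", "custom"), ("custom", "fc")]

subgraphs = split_by_affinity(affinity, edges)
for device, ops in subgraphs:
    print(device, ops)
```

In this toy case a single CPU-only operation in the middle of the model yields three subgraphs, which is why affinity choices affect the number of intermediate tensors.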

Using the Hetero Mode

Following the OpenVINO™ naming convention, the Hetero execution mode is assigned the label of "HETERO". It may be defined with no additional parameters, resulting in defaults being used, or configured further with the following setup options:

Property                  Property values
<device list>             HETERO:<device names> (comma-separated, no spaces)
ov::device::priorities    device names (comma-separated, no spaces)

Description (both properties): specifies the devices available for selection. The device sequence is taken as priority from high to low.

Assigning Affinities

Affinities can be set in one of two ways, used separately or in combination: with the manual or the automatic option.

The Manual Option

The manual option sets affinities explicitly for all operations in the model using ov::Node::get_rt_info with the "affinity" key.

If you assign a specific operation to a specific device, make sure that the device actually supports the operation. Setting affinities for arbitrarily chosen operations may lead to a decrease in model accuracy. To avoid that, give related operations or subgraphs the same affinity as this operation, for example, the constant operation that will be folded into it.

for (auto && op : model->get_ops()) {
    op->get_rt_info()["affinity"] = "CPU";
}

# Python equivalent
for op in model.get_ops():
    rt_info = op.get_rt_info()
    rt_info["affinity"] = "CPU"

The Automatic Option

The automatic option decides which operation is assigned to which device based on what each device (GPU, CPU, MYRIAD, etc.) supports. The query model step is called implicitly by the Hetero device during model compilation.

The automatic option causes “greedy” behavior and assigns all operations that can be executed on a given device to it, according to the priorities you specify (for example, ov::device::priorities("GPU,CPU")). It does not take into account device peculiarities such as the inability to infer certain operations without other special operations placed before or after that layer. If the device plugin does not support the subgraph topology constructed by the HETERO device, then you should set affinity manually.
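The greedy behavior can be sketched in a few lines. This is a simplified model under stated assumptions, not the plugin's code: the capability table, the operation names, and the function assign_greedy are hypothetical, and real support queries are answered by each device plugin for the actual model rather than by a lookup of operation types.

```python
def assign_greedy(op_types, priorities, supported):
    """Give each operation to the first device in `priorities` that supports its type."""
    result = {}
    for name, op_type in op_types.items():
        for device in priorities:
            if op_type in supported[device]:
                result[name] = device
                break
        else:
            # No device in the priority list can run this operation.
            raise ValueError(f"no device in {priorities} supports {op_type!r}")
    return result


# Hypothetical capability table: the GPU lacks one operation type.
supported = {
    "GPU": {"Convolution", "Relu", "MatMul"},
    "CPU": {"Convolution", "Relu", "MatMul", "CustomOp"},
}
op_types = {"conv": "Convolution", "relu": "Relu", "custom": "CustomOp", "fc": "MatMul"}

assigned = assign_greedy(op_types, ["GPU", "CPU"], supported)
print(assigned)
```

Each operation is considered in isolation, so the sketch, like the greedy option it imitates, does not notice when the resulting split is awkward for a device.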

auto compiled_model = core.compile_model(model, "HETERO:GPU,CPU");
// or with ov::device::priorities with multiple args
compiled_model = core.compile_model(model, "HETERO", ov::device::priorities("GPU", "CPU"));
// or with ov::device::priorities with a single argument
compiled_model = core.compile_model(model, "HETERO", ov::device::priorities("GPU,CPU"));

# Python equivalent
compiled_model = core.compile_model(model, device_name="HETERO:GPU,CPU")
# device priorities via configuration property
compiled_model = core.compile_model(model, device_name="HETERO", config={"MULTI_DEVICE_PRIORITIES": "GPU,CPU"})

Using Manual and Automatic Options in Combination

In some cases, you may need to manually adjust affinities that were set automatically. This is usually done to minimize the total number of subgraphs and so reduce memory transfers. To do it, you need to “fix” the automatically assigned affinities like so:

// This example demonstrates how to perform default affinity initialization and then
// correct affinity manually for some layers
const std::string device = "HETERO:GPU,CPU";

// query_model result contains mapping of supported operations to devices
auto supported_ops = core.query_model(model, device);

// update default affinities manually for specific operations
supported_ops["operation_name"] = "CPU";

// set affinities to a model
for (auto&& node : model->get_ops()) {
    auto& affinity = supported_ops[node->get_friendly_name()];
    // Store affinity mapping using op runtime information
    node->get_rt_info()["affinity"] = affinity;
}

// load model with manually set affinities
auto compiled_model = core.compile_model(model, device);

# Python equivalent
# This example demonstrates how to perform default affinity initialization and then
# correct affinity manually for some layers
device = "HETERO:GPU,CPU"

# query_model result contains mapping of supported operations to devices
supported_ops = core.query_model(model, device)

# update default affinities manually for specific operations
supported_ops["operation_name"] = "CPU"

# set affinities to a model
for node in model.get_ops():
    affinity = supported_ops[node.get_friendly_name()]
    node.get_rt_info()["affinity"] = affinity

# load model with manually set affinities
compiled_model = core.compile_model(model, device)

Importantly, the automatic option will not work if any operation in a model has its "affinity" already initialized.

Note

ov::Core::query_model does not depend on affinities set by a user. Instead, it queries for operation support based on device capabilities.

Configuring Fallback Devices

If you want different devices in Hetero execution to have different device-specific configuration options, you can use the special helper property ov::device::properties:

auto compiled_model = core.compile_model(model, "HETERO",
    // GPU with fallback to CPU
    ov::device::priorities("GPU", "CPU"),
    // profiling is enabled only for GPU
    ov::device::properties("GPU", ov::enable_profiling(true)),
    // FP32 inference precision only for CPU
    ov::device::properties("CPU", ov::hint::inference_precision(ov::element::f32))
);

# Python equivalent
core.set_property("HETERO", {"MULTI_DEVICE_PRIORITIES": "GPU,CPU"})
core.set_property("GPU", {"PERF_COUNT": "YES"})
core.set_property("CPU", {"INFERENCE_PRECISION_HINT": "f32"})
compiled_model = core.compile_model(model=model, device_name="HETERO")

In the example above, the GPU device is configured to collect profiling data and uses the default execution precision, while the CPU is configured to perform inference in FP32.

Handling of Difficult Topologies

Some topologies are not well suited to heterogeneous execution on some devices, even to the point of being unable to execute. For example, models with activation operations that are not supported on the primary device are split by Hetero into many subgraphs, which leads to suboptimal execution. If transferring data from one subgraph to another part of the model in the heterogeneous mode takes more time than under normal execution, heterogeneous execution may not be worthwhile. In such cases, you can define the heaviest part manually and set its affinity to avoid sending data back and forth many times during one inference.
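One rough way to compare candidate affinity assignments is to count the edges whose two ends sit on different devices, since each such edge implies a transfer between subgraphs. The sketch below is illustrative only: count_transfers, the toy operation names, and both assignments are hypothetical, and the edge count is a crude proxy because the actual cost also depends on tensor sizes and on the devices involved.

```python
def count_transfers(affinity, edges):
    """Count edges that cross from one device to another."""
    return sum(1 for src, dst in edges if affinity[src] != affinity[dst])


# Hypothetical chain in which the activations are not supported on the GPU.
edges = [("conv1", "act1"), ("act1", "conv2"), ("conv2", "act2"), ("act2", "fc")]

# Greedy result: every activation falls back to the CPU.
automatic = {"conv1": "GPU", "act1": "CPU", "conv2": "GPU", "act2": "CPU", "fc": "GPU"}

# Manual alternative: keep everything after the first activation on the CPU.
manual = {"conv1": "GPU", "act1": "CPU", "conv2": "CPU", "act2": "CPU", "fc": "CPU"}

print("automatic:", count_transfers(automatic, edges))
print("manual:", count_transfers(manual, edges))
```

Fewer crossings do not guarantee faster inference, so a manual assignment chosen this way should still be checked against measured performance.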

Analyzing Performance of Heterogeneous Execution

After enabling the OPENVINO_HETERO_VISUALIZE environment variable, you can dump GraphViz .dot files with annotations of operations per device.

The Heterogeneous execution mode can generate two files:

  • hetero_affinity_<model name>.dot - annotation of affinities per operation.

  • hetero_subgraphs_<model name>.dot - annotation of affinities per subgraph.

You can use the GraphViz utility or a file converter to view the images. On the Ubuntu operating system, you can use xdot:

  • sudo apt-get install xdot

  • xdot hetero_subgraphs_<model name>.dot

You can use performance counters (in sample applications, enabled with the -pc option) to get performance data for each subgraph.

Here is an example of the output for GoogLeNet v1 running on HDDL with fallback to CPU:

subgraph1: 1. input preprocessing (mean data/HDDL):EXECUTED layerType:          realTime: 129   cpu: 129  execType:
subgraph1: 2. input transfer to DDR:EXECUTED                layerType:          realTime: 201   cpu: 0    execType:
subgraph1: 3. HDDL execute time:EXECUTED                    layerType:          realTime: 3808  cpu: 0    execType:
subgraph1: 4. output transfer from DDR:EXECUTED             layerType:          realTime: 55    cpu: 0    execType:
subgraph1: 5. HDDL output postprocessing:EXECUTED           layerType:          realTime: 7     cpu: 7    execType:
subgraph1: 6. copy to IE blob:EXECUTED                      layerType:          realTime: 2     cpu: 2    execType:
subgraph2: out_prob:          NOT_RUN                       layerType: Output   realTime: 0     cpu: 0    execType: unknown
subgraph2: prob:              EXECUTED                      layerType: SoftMax  realTime: 10    cpu: 10   execType: ref
Total time: 4212 microseconds
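To see how the total is split between subgraphs, the realTime values can be summed per subgraph. The following is a minimal sketch that assumes the report has been captured as text in the layout shown above. The helper real_time_per_subgraph and its regular expression are hypothetical, not part of any OpenVINO tool, and the sample lines are copied from the output above with whitespace condensed.

```python
import re
from collections import defaultdict

# Matches "<subgraph name>: ... realTime: <microseconds>".
LINE = re.compile(r"^(subgraph\d+):.*\brealTime:\s*(\d+)")


def real_time_per_subgraph(report):
    """Sum the realTime column (microseconds) for each subgraph."""
    totals = defaultdict(int)
    for line in report.splitlines():
        match = LINE.match(line.strip())
        if match:
            totals[match.group(1)] += int(match.group(2))
    return dict(totals)


report = """\
subgraph1: 1. input preprocessing (mean data/HDDL):EXECUTED layerType: realTime: 129 cpu: 129 execType:
subgraph1: 2. input transfer to DDR:EXECUTED layerType: realTime: 201 cpu: 0 execType:
subgraph1: 3. HDDL execute time:EXECUTED layerType: realTime: 3808 cpu: 0 execType:
subgraph1: 4. output transfer from DDR:EXECUTED layerType: realTime: 55 cpu: 0 execType:
subgraph1: 5. HDDL output postprocessing:EXECUTED layerType: realTime: 7 cpu: 7 execType:
subgraph1: 6. copy to IE blob:EXECUTED layerType: realTime: 2 cpu: 2 execType:
subgraph2: out_prob: NOT_RUN layerType: Output realTime: 0 cpu: 0 execType: unknown
subgraph2: prob: EXECUTED layerType: SoftMax realTime: 10 cpu: 10 execType: ref
Total time: 4212 microseconds
"""

totals = real_time_per_subgraph(report)
print(totals)
print("sum:", sum(totals.values()))
```

For this report the per-subgraph sums add up to the reported total, and the split makes it clear that almost all of the time is spent in the first subgraph.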

Sample Usage

OpenVINO™ sample programs can use Heterogeneous execution when a HETERO device string is passed as the device name (the -d option in samples that take one):

./hello_classification <path_to_model>/squeezenet1.1.xml <path_to_pictures>/picture.jpg HETERO:GPU,CPU

where:

  • HETERO stands for the Heterogeneous execution

  • GPU,CPU points to a fallback policy with the priority on GPU and fallback to CPU

You can also point to more than two devices: -d HETERO:MYRIAD,GPU,CPU
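The structure of such a device string can be shown with a small parser. This is purely illustrative: parse_hetero_device is a hypothetical helper, not an OpenVINO function, since OpenVINO parses the string itself. The check for spaces mirrors the "comma-separated, no spaces" rule stated earlier.

```python
def parse_hetero_device(device):
    """Split 'HETERO:<dev1>,<dev2>,...' into a priority-ordered list of names."""
    mode, sep, names = device.partition(":")
    if mode != "HETERO":
        raise ValueError(f"not a HETERO device string: {device!r}")
    if not sep:
        return []  # plain "HETERO": priorities come from configuration instead
    priorities = names.split(",")
    if any(not name or name != name.strip() for name in priorities):
        raise ValueError(f"device names must be comma-separated without spaces: {device!r}")
    return priorities


# The first entry has the highest priority; the last one is the final fallback.
print(parse_hetero_device("HETERO:MYRIAD,GPU,CPU"))
```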