GPU device

The GPU plugin is an OpenCL-based plugin for inference of deep neural networks on Intel GPUs, both integrated and discrete. For an in-depth description of the GPU plugin, see:

It is part of the Intel® Distribution of OpenVINO™ toolkit. For more information on how to configure a system to use it, see the GPU configuration page.

Device Naming Convention

  • Devices are enumerated as "GPU.X" where X={0, 1, 2,...}. Only Intel® GPU devices are considered.

  • If the system has an integrated GPU, its ‘id’ is always ‘0’ ("GPU.0").

  • The order of the other GPUs is not predefined and depends on the GPU driver.

  • "GPU" is an alias for "GPU.0".

  • If the system doesn’t have an integrated GPU, devices are enumerated starting from 0.

  • For GPUs with a multi-tile architecture (multiple sub-devices in OpenCL terms), a specific tile may be addressed as "GPU.X.Y", where X,Y={0, 1, 2,...}, X is the id of the GPU device, and Y is the id of the tile within device X.
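The naming rules above can be illustrated with a small standalone parser. This is only a sketch of the convention as described in this section, not code from the plugin; the names `GpuDeviceName` and `parse_gpu_device_name` are hypothetical helpers introduced for illustration.

```python
import re
from typing import NamedTuple, Optional


class GpuDeviceName(NamedTuple):
    device_id: int          # X in "GPU.X"
    tile_id: Optional[int]  # Y in "GPU.X.Y", or None when no tile is addressed


# Accepts "GPU", "GPU.X" or "GPU.X.Y" with non-negative decimal indices.
_GPU_NAME = re.compile(r"GPU(?:\.(\d+)(?:\.(\d+))?)?")


def parse_gpu_device_name(name: str) -> GpuDeviceName:
    """Split a GPU device name into its device and tile indices."""
    match = _GPU_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"not a GPU device name: {name!r}")
    device, tile = match.groups()
    # A bare "GPU" is an alias for "GPU.0".
    return GpuDeviceName(
        device_id=int(device) if device is not None else 0,
        tile_id=int(tile) if tile is not None else None,
    )
```

With this helper, "GPU" and "GPU.0" resolve to the same device index, while "GPU.1.0" resolves to device 1, tile 0. Whether a given index actually exists on a system still has to be checked against the list of available devices.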

For demonstration purposes, see the Hello Query Device C++ Sample that can print out the list of available devices with associated indices. Below is an example output (truncated to the device names only):

./hello_query_device
Available devices:
    Device: CPU
...
    Device: GPU.0
...
    Device: GPU.1
...
    Device: HDDL

The device name can then be passed to the ov::Core::compile_model() method:

// C++: run on the default GPU ("GPU" is an alias for "GPU.0")
ov::Core core;
auto model = core.read_model("model.xml");
auto compiled_model = core.compile_model(model, "GPU");

# Python: run on the default GPU
from openvino.runtime import Core

core = Core()
model = core.read_model("model.xml")
compiled_model = core.compile_model(model, "GPU")

// C++: run on a specific GPU
ov::Core core;
auto model = core.read_model("model.xml");
auto compiled_model = core.compile_model(model, "GPU.1");

# Python: run on a specific GPU
core = Core()
model = core.read_model("model.xml")
compiled_model = core.compile_model(model, "GPU.1")

// C++: run on a specific tile
ov::Core core;
auto model = core.read_model("model.xml");
auto compiled_model = core.compile_model(model, "GPU.1.0");

# Python: run on a specific tile
core = Core()
model = core.read_model("model.xml")
compiled_model = core.compile_model(model, "GPU.1.0")

Supported inference data types

The GPU plugin supports the following data types as inference precision of internal primitives:

  • Floating-point data types:

    • f32

    • f16

  • Quantized data types:

    • u8

    • i8

    • u1

The selected precision of each primitive depends on the operation precision in the IR, the quantization primitives, and the available hardware capabilities. The u1/u8/i8 data types are used for quantized operations only, i.e. they are not selected automatically for non-quantized operations. For more details on how to get a quantized model, refer to the Model Optimization document.

The floating-point precision of a GPU primitive is selected based on the operation precision in the IR, except for the compressed f16 IR form, which is executed in f16 precision.

Note

Hardware acceleration for i8/u8 precision may be unavailable on some platforms. In that case, the model is executed in the floating-point precision taken from the IR. Hardware support for u8/i8 acceleration can be queried via the ov::device::capabilities property.

The Hello Query Device C++ Sample can be used to print out the supported data types for all detected devices.
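As a rough illustration of the capability query mentioned in the note, the sketch below checks a capability list for INT8 support. It assumes that ov::device::capabilities is exposed to Python under the string key "OPTIMIZATION_CAPABILITIES" and that INT8 acceleration is reported as the string "INT8"; both should be verified against the installed OpenVINO version. The helper names are hypothetical.

```python
from typing import Iterable


def supports_int8(capabilities: Iterable[str]) -> bool:
    """Return True if the reported capability list advertises INT8."""
    return "INT8" in set(capabilities)


def query_gpu_capabilities(device: str = "GPU"):
    """Read the capability list from a device (needs OpenVINO and a GPU)."""
    # Imported lazily so that supports_int8() stays usable without OpenVINO.
    from openvino.runtime import Core

    # Assumption: "OPTIMIZATION_CAPABILITIES" is the string key behind
    # ov::device::capabilities.
    return Core().get_property(device, "OPTIMIZATION_CAPABILITIES")
```

On a machine with OpenVINO and an Intel GPU, `supports_int8(query_gpu_capabilities("GPU"))` would indicate whether a quantized model can be expected to run with hardware-accelerated i8/u8 primitives.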

Supported features

Multi-device execution

If a system has multiple GPUs (for example, an integrated and a discrete Intel GPU), any supported model can be executed on all of them simultaneously. This is done by specifying "MULTI:GPU.1,GPU.0" as the target device.

// C++
ov::Core core;
auto model = core.read_model("model.xml");
auto compiled_model = core.compile_model(model, "MULTI:GPU.1,GPU.0");

# Python
core = Core()
model = core.read_model("model.xml")
compiled_model = core.compile_model(model, "MULTI:GPU.1,GPU.0")

See the Multi-device execution page for more details.

Automatic batching

The GPU plugin can report the ov::max_batch_size and ov::optimal_batch_size metrics for the current hardware platform and model. Automatic batching is therefore enabled by default when ov::optimal_batch_size is greater than 1 and ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT) is set. Alternatively, it can be enabled explicitly via the device notation, e.g. "BATCH:GPU".

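// C++: enable automatic batching explicitly via the BATCH device
ov::Core core;
auto model = core.read_model("model.xml");
auto compiled_model = core.compile_model(model, "BATCH:GPU");

# Python: enable automatic batching explicitly via the BATCH device
core = Core()
model = core.read_model("model.xml")
compiled_model = core.compile_model(model, "BATCH:GPU")

// C++: enable automatic batching through the throughput performance hint
ov::Core core;
auto model = core.read_model("model.xml");
auto compiled_model = core.compile_model(model, "GPU", ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT));

# Python: enable automatic batching through the throughput performance hint
core = Core()
model = core.read_model("model.xml")
compiled_model = core.compile_model(model, "GPU", {"PERFORMANCE_HINT": "THROUGHPUT"})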

See the Automatic batching page for more details.

Multi-stream execution

If either the ov::num_streams(n_streams) property with n_streams > 1 or the ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT) property is set for the GPU plugin, multiple streams are created for the model. In the GPU plugin, each stream has its own host thread and an associated OpenCL queue, which means that incoming infer requests can be processed simultaneously.

Note

Simultaneous scheduling of kernels to different queues doesn’t mean that the kernels are actually executed in parallel on the GPU device. The actual behavior depends on the hardware architecture and in some cases the execution may be serialized inside the GPU driver.

When multiple inferences of the same model need to be executed in parallel, the multi-stream feature is preferred over multiple instances of the model or application. The implementation of streams in the GPU plugin shares weight memory across streams, so memory consumption may be lower than with the other approaches.

See the optimization guide for more details.

Dynamic shapes

The GPU plugin supports dynamic shapes for the batch dimension only (specified as ‘N’ in layout terms), and only with a fixed upper bound. Any other dynamic dimensions are unsupported. Internally, the GPU plugin creates log2(N) low-level execution graphs (where N is the upper bound of the batch dimension), one for each batch size that is a power of 2, to emulate dynamic behavior. An incoming infer request with a specific batch size is then executed via a minimal combination of these internal networks. For example, batch size 33 may be executed via 2 internal networks with batch sizes 32 and 1.

Note

This approach requires much more memory, and the overall model compilation time is significantly longer, compared to the static batch scenario.
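The power-of-two decomposition described above can be sketched in plain Python. This is an illustration of the idea only: the plugin's actual selection logic is not shown in this document, and the function names `internal_batch_sizes` and `split_batch` are hypothetical.

```python
from typing import List


def internal_batch_sizes(upper_bound: int) -> List[int]:
    """Powers of two not exceeding the batch upper bound, largest first."""
    if upper_bound < 1:
        raise ValueError("upper bound must be positive")
    sizes = []
    size = 1
    while size <= upper_bound:
        sizes.append(size)
        size *= 2
    return sizes[::-1]


def split_batch(batch: int, upper_bound: int) -> List[int]:
    """Cover `batch` with the fewest power-of-two sub-batches."""
    if not 1 <= batch <= upper_bound:
        raise ValueError("batch must be within [1, upper_bound]")
    parts = []
    remaining = batch
    for size in internal_batch_sizes(upper_bound):
        # Each power of two is needed at most once, exactly as in the
        # binary representation of the batch size.
        if size <= remaining:
            parts.append(size)
            remaining -= size
    return parts
```

For the example from the text, `split_batch(33, 64)` yields the sub-batches 32 and 1. The sketch also makes the memory cost visible: an upper bound of 64 already implies seven internal graphs (64, 32, 16, 8, 4, 2, 1) in this simplified model.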

The code snippet below demonstrates how to use dynamic batching in simple scenarios:

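// C++: read the model
ov::Core core;
auto model = core.read_model("model.xml");

// Example static dimensions; replace them with the values your model expects
constexpr int64_t C = 3, H = 224, W = 224;

// Make the batch dimension dynamic with bounds 1..10
model->reshape({{ov::Dimension(1, 10), ov::Dimension(C), ov::Dimension(H), ov::Dimension(W)}});  // {1..10, C, H, W}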

// compile model and create infer request
auto compiled_model = core.compile_model(model, "GPU");
auto infer_request = compiled_model.create_infer_request();
auto input = model->get_parameters().at(0);

// ...

// create input tensor with specific batch size
ov::Tensor input_tensor(input->get_element_type(), {2, C, H, W});

// ...

infer_request.set_tensor(input, input_tensor);
infer_request.infer();

# Python
import openvino.runtime as ov

core = ov.Core()

C = 3
H = 224
W = 224

model = core.read_model("model.xml")
model.reshape([(1, 10), C, H, W])

# compile model and create infer request
compiled_model = core.compile_model(model, "GPU")
infer_request = compiled_model.create_infer_request()

# create input tensor with specific batch size
input_tensor = ov.Tensor(model.input().element_type, [2, C, H, W])

# ...

infer_request.infer([input_tensor])

See the dynamic shapes guide for more details.

Preprocessing acceleration

The GPU plugin has the following additional preprocessing options:

using namespace ov::preprocess;
auto p = PrePostProcessor(model);
p.input().tensor().set_element_type(ov::element::u8)
                  .set_color_format(ov::preprocess::ColorFormat::NV12_TWO_PLANES, {"y", "uv"})
                  .set_memory_type(ov::intel_gpu::memory_type::surface);
p.input().preprocess().convert_color(ov::preprocess::ColorFormat::BGR);
p.input().model().set_layout("NCHW");
auto model_with_preproc = p.build();

# Python
from openvino.runtime import Core, Type, Layout
from openvino.preprocess import PrePostProcessor, ColorFormat

core = Core()
model = core.read_model("model.xml")

p = PrePostProcessor(model)
p.input().tensor().set_element_type(Type.u8) \
                  .set_color_format(ColorFormat.NV12_TWO_PLANES, ["y", "uv"]) \
                  .set_memory_type("GPU_SURFACE")
p.input().preprocess().convert_color(ColorFormat.BGR)
p.input().model().set_layout(Layout("NCHW"))
model_with_preproc = p.build()

With this preprocessing, the GPU plugin expects an ov::intel_gpu::ocl::ClImage2DTensor (or a derived type) to be passed for each NV12 plane via the ov::InferRequest::set_tensor() or ov::InferRequest::set_tensors() methods.

Refer to the RemoteTensor API for usage examples.

See the preprocessing API guide for more details.

Model caching

The cache for the GPU plugin may be enabled via the common OpenVINO ov::cache_dir property. The GPU plugin implementation supports caching of compiled kernels only, so all plugin-specific model transformations are executed on each ov::Core::compile_model() call regardless of the cache_dir option. Still, since kernel compilation is a bottleneck in the model loading process, a significant load time reduction can be achieved with the ov::cache_dir property enabled.

See the Model caching overview page for more details.
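A minimal sketch of enabling the kernel cache from Python is shown below. It assumes that ov::cache_dir is addressed by the string key "CACHE_DIR" in a configuration dictionary, in the same style as the "PERFORMANCE_HINT" dictionary used earlier on this page. The helper names and the "model_cache" directory are hypothetical.

```python
from pathlib import Path
from typing import Dict, Union


def gpu_cache_config(cache_dir: Union[str, Path]) -> Dict[str, str]:
    """Build a compile_model() config that points the plugin at a cache directory."""
    # Assumption: "CACHE_DIR" is the string key behind ov::cache_dir.
    return {"CACHE_DIR": str(cache_dir)}


def compile_with_cache(model_path: str, cache_dir: Union[str, Path], device: str = "GPU"):
    """Compile a model with kernel caching enabled (needs OpenVINO and a GPU)."""
    # Imported lazily so that gpu_cache_config() stays usable without OpenVINO.
    from openvino.runtime import Core

    core = Core()
    model = core.read_model(model_path)
    return core.compile_model(model, device, gpu_cache_config(cache_dir))
```

A call such as `compile_with_cache("model.xml", "model_cache")` would be expected to populate the directory on the first run and reuse the compiled kernels on later runs, while the model transformations are still repeated on every call, as described above.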

Extensibility

See GPU Extensibility page.

GPU context and memory sharing via RemoteTensor API

See RemoteTensor API of GPU Plugin.

Limitations

In some cases, the GPU plugin may implicitly execute several primitives on the CPU using internal implementations, which may increase CPU utilization. Below is a list of such operations:

  • Proposal

  • NonMaxSuppression

  • DetectionOutput

The behavior depends on specific parameters of the operations and hardware configuration.

GPU Performance Checklist: Summary

Since OpenVINO relies on the OpenCL kernels for the GPU implementation, many general OpenCL tips apply:

  • Prefer FP16 inference precision over FP32. Model Optimizer can generate both variants, and FP32 is the default. Also, consider using the Post-training Optimization Tool.

  • Try to group individual infer jobs by using automatic batching.

  • Consider caching to minimize model load time.

  • If your application performs inference on the CPU alongside the GPU, or otherwise loads the host heavily, make sure that the OpenCL driver threads do not starve. You can use CPU configuration options to limit the number of inference threads for the CPU plugin.

  • Even in the GPU-only scenario, a GPU driver might occupy a CPU core with spin-looped polling for completion. If CPU load is a concern, consider the dedicated queue_throttle property. Note that this option may increase inference latency, so consider combining it with multiple GPU streams or throughput performance hints.

  • When working with media inputs, consider the remote tensors API of the GPU plugin.