GPU Device

The GPU plugin is an OpenCL-based plugin for inference of deep neural networks on Intel GPUs, both integrated and discrete.

The GPU plugin is a part of the Intel® Distribution of OpenVINO™ toolkit. For more information on how to configure a system to use it, see the GPU configuration.

Device Naming Convention

Devices are enumerated as GPU.X, where X={0, 1, 2,...} (only Intel® GPU devices are considered).
If the system has an integrated GPU, its id is always 0 (GPU.0).
The order of other GPUs is not predefined and depends on the GPU driver.
GPU is an alias for GPU.0.
If the system does not have an integrated GPU, devices are enumerated starting from 0.
For GPUs with a multi-tile architecture (multiple sub-devices in OpenCL terms), a specific tile may be addressed as GPU.X.Y, where X,Y={0, 1, 2,...}, X is the id of the GPU device, and Y is the id of the tile within device X.
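
The naming scheme above can be captured in a few lines of Python. The following is a minimal sketch and not part of OpenVINO: `parse_gpu_device_name` and `GpuDeviceName` are hypothetical names introduced here for illustration. It splits a device string into the GPU id and the optional tile id, treating the bare `GPU` alias as device 0.

```python
import re
from typing import NamedTuple, Optional


class GpuDeviceName(NamedTuple):
    """Hypothetical container for a parsed GPU device string."""
    device_id: int
    tile_id: Optional[int]


# Matches "GPU", "GPU.X" and "GPU.X.Y", where X and Y are non-negative integers.
_GPU_NAME = re.compile(r"GPU(?:\.(\d+)(?:\.(\d+))?)?")


def parse_gpu_device_name(name: str) -> GpuDeviceName:
    """Split a GPU device string into its device id and optional tile id."""
    match = _GPU_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"not a GPU device name: {name!r}")
    device, tile = match.groups()
    return GpuDeviceName(
        # The bare "GPU" alias refers to GPU.0.
        device_id=int(device) if device is not None else 0,
        tile_id=int(tile) if tile is not None else None,
    )
```

For example, `parse_gpu_device_name("GPU.1.0")` yields device id 1 and tile id 0, while a non-GPU name such as `"CPU"` raises `ValueError`.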

The Hello Query Device C++ Sample prints the list of available devices with their associated indices. Below is an example of its output (truncated to the device names only):

./hello_query_device
Available devices:
    Device: CPU
...
    Device: GPU.0
...
    Device: GPU.1

The device name can then be passed to the ov::Core::compile_model() method to run on the default device, a specific GPU, or a specific tile:

default device

Python

    import openvino as ov
    core = ov.Core()
    model = core.read_model("model.xml")
    compiled_model = core.compile_model(model, "GPU")

C++

    ov::Core core;
    auto model = core.read_model("model.xml");
    auto compiled_model = core.compile_model(model, "GPU");

specific GPU

Python

    import openvino as ov
    core = ov.Core()
    model = core.read_model("model.xml")
    compiled_model = core.compile_model(model, "GPU.1")

C++

    ov::Core core;
    auto model = core.read_model("model.xml");
    auto compiled_model = core.compile_model(model, "GPU.1");

specific tile

Python

    import openvino as ov
    core = ov.Core()
    model = core.read_model("model.xml")
    compiled_model = core.compile_model(model, "GPU.1.0")

C++

    ov::Core core;
    auto model = core.read_model("model.xml");
    auto compiled_model = core.compile_model(model, "GPU.1.0");

Supported Inference Data Types

The GPU plugin supports the following data types as inference precision of internal primitives:

Floating-point data types:
- f32
- f16
Quantized data types:
- u8
- i8
- u1

The precision selected for each primitive depends on the operation precision in the IR, quantization primitives, and available hardware capabilities. The u1/u8/i8 data types are used for quantized operations only, which means that they are not selected automatically for non-quantized operations. For more details on how to get a quantized model, refer to the Model Optimization guide.

Floating-point precision of a GPU primitive is selected based on operation precision in the OpenVINO IR, except for the compressed f16 OpenVINO IR form, which is executed in the f16 precision.

Note

The newer generation Intel Iris Xe and Xe MAX GPUs provide accelerated performance for i8/u8 models. Hardware acceleration for i8/u8 precision may be unavailable on older generation platforms. In such cases, a model is executed in the floating-point precision taken from IR. Hardware support of u8/i8 acceleration can be queried via the ov::device::capabilities property.

The Hello Query Device C++ Sample can be used to print out the supported data types for all detected devices.
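
The capability query mentioned in the note can also be done from Python. The sketch below rests on two assumptions: that `ov::device::capabilities` is exposed under the string key `"OPTIMIZATION_CAPABILITIES"`, and that quantized acceleration is reported as an `"INT8"` entry in the returned list. The helper names `supports_int8` and `query_capabilities` are hypothetical. OpenVINO is imported lazily, so the pure check can be used without a GPU present.

```python
from typing import Iterable, List


def supports_int8(capabilities: Iterable[str]) -> bool:
    """Return True if the capability list advertises INT8 support.

    Assumes quantized acceleration is reported as the "INT8" entry.
    """
    return "INT8" in set(capabilities)


def query_capabilities(device: str = "GPU") -> List[str]:
    """Read the capability list of a device (requires OpenVINO and the device)."""
    import openvino as ov  # imported lazily so supports_int8() has no dependency

    core = ov.Core()
    # Assumed string key for ov::device::capabilities.
    return list(core.get_property(device, "OPTIMIZATION_CAPABILITIES"))
```

On a machine with an Intel GPU, `supports_int8(query_capabilities("GPU"))` would then indicate whether quantized models are likely to be hardware accelerated.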

Supported Properties

The plugin supports the properties listed below.

Read-write Properties

To take effect, all parameters must either be set before calling ov::Core::compile_model() or be passed to ov::Core::compile_model() as an additional argument.

ov::cache_dir
ov::enable_profiling
ov::hint::model_priority
ov::hint::performance_mode
ov::hint::execution_mode
ov::hint::num_requests
ov::hint::inference_precision
ov::num_streams
ov::compilation_num_threads
ov::device::id
ov::intel_gpu::hint::host_task_priority
ov::intel_gpu::hint::queue_priority
ov::intel_gpu::hint::queue_throttle
ov::intel_gpu::enable_loop_unrolling
ov::intel_gpu::disable_winograd_convolution
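
The two ways of applying read-write properties can be sketched as follows. This is an illustrative example, not the plugin's prescribed usage: `build_gpu_config` and `compile_on_gpu` are hypothetical helpers, and the string keys `"PERFORMANCE_HINT"` and `"CACHE_DIR"` are assumed to correspond to `ov::hint::performance_mode` and `ov::cache_dir`.

```python
from typing import Any, Dict


def build_gpu_config(cache_dir: str) -> Dict[str, Any]:
    """Collect a couple of read-write properties into a config dictionary."""
    return {
        "PERFORMANCE_HINT": "THROUGHPUT",  # assumed key for ov::hint::performance_mode
        "CACHE_DIR": cache_dir,            # assumed key for ov::cache_dir
    }


def compile_on_gpu(model_path: str, config: Dict[str, Any], via_core: bool = False):
    """Compile a model on GPU, applying the config in one of the two ways."""
    import openvino as ov  # imported lazily; requires OpenVINO and a GPU device

    core = ov.Core()
    if via_core:
        # Option 1: set the properties on the device before compile_model().
        core.set_property("GPU", config)
        return core.compile_model(model_path, "GPU")
    # Option 2: pass the properties as an additional argument to compile_model().
    return core.compile_model(model_path, "GPU", config)
```

Passing the dictionary to `compile_model()` keeps the settings local to one compiled model, whereas `set_property()` affects later compilations on the same `Core` object.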

Read-only Properties

ov::supported_properties
ov::available_devices
ov::range_for_async_infer_requests
ov::range_for_streams
ov::optimal_batch_size
ov::max_batch_size
ov::device::full_name
ov::device::type
ov::device::gops
ov::device::capabilities
ov::intel_gpu::device_total_mem_size
ov::intel_gpu::uarch_version
ov::intel_gpu::execution_units_count
ov::intel_gpu::memory_statistics

Limitations

In some cases, the GPU plugin may implicitly execute several primitives on CPU using internal implementations, which may lead to an increase in CPU utilization. Below is a list of such operations:

Proposal
NonMaxSuppression
DetectionOutput

The behavior depends on specific parameters of the operations and hardware configuration.

Important

When working with a fine-tuned model, inference on GPU may show an accuracy and performance drop if Winograd convolutions are selected. This issue can be fixed by disabling Winograd convolutions:

    compiled_model = core.compile_model(model, device_name="GPU", config={"GPU_DISABLE_WINOGRAD_CONVOLUTION": True})

GPU Performance Checklist: Summary

Since OpenVINO relies on the OpenCL kernels for the GPU implementation, many general OpenCL tips apply:

Prefer FP16 inference precision over FP32: the Model Conversion API can generate both variants, and FP32 is the default. To learn about optimization options, see the Optimization Guide.
Try to group individual infer jobs by using automatic batching.
Consider caching to minimize model load time.
If your application performs inference on the CPU alongside the GPU, or otherwise loads the host heavily, make sure that the OpenCL driver threads do not starve. CPU configuration options can be used to limit the number of inference threads for the CPU plugin.
Even in the GPU-only scenario, a GPU driver might occupy a CPU core with spin-loop polling for completion. If CPU load is a concern, consider the dedicated queue_throttle property (ov::intel_gpu::hint::queue_throttle, listed among the read-write properties above). Note that this option may increase inference latency, so consider combining it with multiple GPU streams or throughput performance hints.
When working with media inputs, consider the remote tensors API of the GPU plugin.
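
The throttling advice in the checklist can be illustrated with a small configuration sketch. It assumes that `ov::intel_gpu::hint::queue_throttle` is exposed under the string key `"GPU_QUEUE_THROTTLE"` and accepts `"LOW"` as a value; `low_cpu_load_config` and `compile_with_low_cpu_load` are hypothetical helper names. The throughput hint is included to offset the latency cost of throttling, as suggested above.

```python
from typing import Any, Dict


def low_cpu_load_config() -> Dict[str, Any]:
    """Config that trades some latency for lower CPU usage by the GPU driver."""
    return {
        # Assumed key and value for ov::intel_gpu::hint::queue_throttle.
        "GPU_QUEUE_THROTTLE": "LOW",
        # Throughput hint to compensate for the added latency.
        "PERFORMANCE_HINT": "THROUGHPUT",
    }


def compile_with_low_cpu_load(model_path: str, device: str = "GPU"):
    """Compile a model with the throttled configuration (requires OpenVINO)."""
    import openvino as ov  # imported lazily; requires OpenVINO and a GPU device

    core = ov.Core()
    return core.compile_model(model_path, device, low_cpu_load_config())
```

Whether this actually lowers CPU load depends on the driver and workload, so it is worth measuring both CPU utilization and latency before and after applying it.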
