CPU device

The CPU plugin is a part of the Intel® Distribution of OpenVINO™ toolkit and is developed to achieve high-performance inference of neural networks on Intel® x86-64 CPUs. For an in-depth description of the plugin, see the CPU plugin developer documentation.

Device name

The CPU device plugin uses the "CPU" device name. Even if multiple sockets are present on the platform, only one device of this kind is exposed. On multi-socket platforms, load balancing and memory usage distribution between NUMA nodes are handled automatically.

In order to use the CPU for inference, the device name should be passed to the ov::Core::compile_model() method:

ov::Core core;
auto model = core.read_model("model.xml");
auto compiled_model = core.compile_model(model, "CPU");

from openvino.runtime import Core

core = Core()
model = core.read_model("model.xml")
compiled_model = core.compile_model(model, "CPU")

Supported inference data types

The CPU device plugin supports the following data types as inference precision of internal primitives:

  • Floating-point data types:

    • f32

    • bf16

  • Integer data types:

    • i32

  • Quantized data types:

    • u8

    • i8

    • u1

The Hello Query Device C++ Sample can be used to print out the supported data types for all detected devices.
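
For a quick programmatic check, a minimal Python sketch along the same lines might look like this (the exact capability strings reported depend on the platform):

from openvino.runtime import Core

core = Core()
# List every detected device and the data types / optimizations it reports.
for device in core.available_devices:
    capabilities = core.get_property(device, "OPTIMIZATION_CAPABILITIES")
    print(device, capabilities)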

Quantized data type specifics

The selected precision of each primitive depends on the operation precision in the IR, quantization primitives, and available hardware capabilities. The u1/u8/i8 data types are used for quantized operations only, i.e. they are not selected automatically for non-quantized operations.

See the low-precision optimization guide for more details on how to get a quantized model.
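
As an illustration only, a post-training quantization flow with the NNCF package might look like the sketch below; the nncf dependency, the random calibration data, and the 1x3x224x224 input shape are assumptions made for this example, not part of the CPU plugin itself:

import numpy as np
import nncf
from openvino.runtime import Core

core = Core()
model = core.read_model("model.xml")

# Placeholder calibration data; a real flow would use a representative dataset.
data_items = [np.random.rand(1, 3, 224, 224).astype(np.float32) for _ in range(10)]
calibration_dataset = nncf.Dataset(data_items)

# Produces a model with quantized (u8/i8) operations that the CPU plugin can execute.
quantized_model = nncf.quantize(model, calibration_dataset)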

Note

Platforms that do not support Intel® AVX512-VNNI have a known “saturation issue” which in some cases leads to reduced computational accuracy for u8/i8 precision calculations. See the saturation (overflow) issue section to get more information on how to detect such issues and find possible workarounds.

Floating point data type specifics

The default floating-point precision of a CPU primitive is f32. To support f16 IRs, the plugin internally converts all the f16 values to f32 and all the calculations are performed using the native f32 precision. On platforms that natively support bfloat16 calculations (have AVX512_BF16 extension), the bf16 type is automatically used instead of f32 to achieve better performance, thus no special steps are required to run a model with bf16 precision. See the BFLOAT16 – Hardware Numerics Definition white paper for more details about bfloat16.

Using bf16 provides the following performance benefits:

  • Faster multiplication of two bfloat16 numbers because of the shorter mantissa of the bfloat16 data.

  • Reduced memory consumption since bfloat16 data is half the size of 32-bit float.

To check whether the CPU device supports the bfloat16 data type, use the query device properties interface to read the ov::device::capabilities property, which should contain BF16 in the list of CPU capabilities:

ov::Core core;
auto cpuOptimizationCapabilities = core.get_property("CPU", ov::device::capabilities);

core = Core()
cpu_optimization_capabilities = core.get_property("CPU", "OPTIMIZATION_CAPABILITIES")

If the model has been converted to bf16, ov::hint::inference_precision is set to ov::element::bf16 and can be checked via the ov::CompiledModel::get_property call. The code below demonstrates how to get the element type:

ov::Core core;
auto network = core.read_model("sample.xml");
auto exec_network = core.compile_model(network, "CPU");
auto inference_precision = exec_network.get_property(ov::hint::inference_precision);
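
The same check in Python, assuming the same sample model file, might look like this:

from openvino.runtime import Core

core = Core()
model = core.read_model("sample.xml")
compiled_model = core.compile_model(model, "CPU")
# Returns the effective inference precision, e.g. bf16 on AVX512_BF16 platforms.
inference_precision = compiled_model.get_property("INFERENCE_PRECISION_HINT")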

To infer the model in f32 instead of bf16 on targets with native bf16 support, set the ov::hint::inference_precision to ov::element::f32.

core = Core()
core.set_property("CPU", {"INFERENCE_PRECISION_HINT": "f32"})

Bfloat16 software simulation mode is available on CPUs with the Intel® AVX-512 instruction set that do not support the native avx512_bf16 instruction. This mode is intended for development purposes and does not guarantee good performance. To enable the simulation, you have to explicitly set ov::hint::inference_precision to ov::element::bf16.
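
For example, in Python the hint can be set explicitly as follows (mirroring the f32 example above):

from openvino.runtime import Core

core = Core()
# Force bf16 inference precision, including the simulation mode on AVX-512 CPUs
# without native avx512_bf16 support.
core.set_property("CPU", {"INFERENCE_PRECISION_HINT": "bf16"})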

Note

An exception is thrown if ov::hint::inference_precision is set to ov::element::bf16 on a CPU without native bfloat16 support or bfloat16 simulation mode.

Note

Due to the reduced mantissa size of the bfloat16 data type, the resulting bf16 inference accuracy may differ from the f32 inference, especially for models that were not trained using the bfloat16 data type. If the bf16 inference accuracy is not acceptable, it is recommended to switch to the f32 precision.

Supported features

Multi-device execution

If a machine has OpenVINO-supported devices other than the CPU (for example, an integrated GPU), then any supported model can be executed on the CPU and all the other devices simultaneously. To use the CPU and the GPU together, specify "MULTI:CPU,GPU.0" as the target device:

ov::Core core;
auto model = core.read_model("model.xml");
auto compiled_model = core.compile_model(model, "MULTI:CPU,GPU.0");

core = Core()
model = core.read_model("model.xml")
compiled_model = core.compile_model(model, "MULTI:CPU,GPU.0")

See the Multi-device execution page for more details.

Multi-stream execution

If either ov::num_streams(n_streams) with n_streams > 1 or the ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT) property is set for the CPU plugin, multiple streams are created for the model. In the case of the CPU plugin, each stream has its own host thread, which means that incoming infer requests can be processed simultaneously. Each stream is pinned to its own group of physical cores with respect to NUMA nodes physical memory usage to minimize overhead on data transfer between NUMA nodes.
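
For illustration, both ways of enabling multiple streams might look like this in Python (the stream count of 4 is an arbitrary example value):

from openvino.runtime import Core

core = Core()
model = core.read_model("model.xml")

# Option 1: let the plugin choose the number of streams via the throughput hint.
compiled_model = core.compile_model(model, "CPU", {"PERFORMANCE_HINT": "THROUGHPUT"})

# Option 2: request an explicit number of streams.
compiled_model = core.compile_model(model, "CPU", {"NUM_STREAMS": "4"})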

See the optimization guide for more details.

Note

When it comes to latency, keep in mind that running only one stream on a multi-socket platform may introduce additional overhead on data transfer between NUMA nodes. In that case it is better to use the ov::hint::PerformanceMode::LATENCY performance hint (see the performance hints overview for details).
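
For reference, a minimal sketch of setting the latency hint follows the same pattern:

from openvino.runtime import Core

core = Core()
model = core.read_model("model.xml")
# Configure the plugin for latency-oriented execution.
compiled_model = core.compile_model(model, "CPU", {"PERFORMANCE_HINT": "LATENCY"})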

Dynamic shapes

The CPU device plugin provides full functional support for models with dynamic shapes in terms of the opset coverage.

Note

CPU does not support tensors with a dynamically changing rank. If you try to infer a model with such tensors, an exception will be thrown.

Dynamic shapes support introduces additional overhead on memory management and may limit internal runtime optimizations. The more degrees of freedom are used, the more difficult it is to achieve the best performance. The most flexible configuration, and the most convenient approach, is the fully undefined shape, where no constraints on the shape dimensions are applied. However, reducing the level of uncertainty brings performance gains: if you explicitly set dynamic shapes with defined upper bounds, memory consumption is reduced through memory reuse and cache locality improves, leading to better inference performance.

ov::Core core;
auto model = core.read_model("model.xml");

model->reshape({{ov::Dimension(1, 10), ov::Dimension(1, 20), ov::Dimension(1, 30), ov::Dimension(1, 40)}});

core = Core()
model = core.read_model("model.xml")
model.reshape([(1, 10), (1, 20), (1, 30), (1, 40)])

Note

Using fully undefined shapes may result in significantly higher memory consumption compared to inferring the same model with static shapes. If the level of memory consumption is unacceptable but dynamic shapes are still required, you can reshape the model using shapes with defined upper bounds to reduce memory footprint.

Some runtime optimizations work better if the model shapes are known in advance. Therefore, if the input data shape is not changed between inference calls, it is recommended to use a model with static shapes or reshape the existing model with the static input shape to get the best performance.

ov::Core core;
auto model = core.read_model("model.xml");
ov::Shape static_shape = {10, 20, 30, 40};

model->reshape(static_shape);

core = Core()
model = core.read_model("model.xml")
model.reshape([10, 20, 30, 40])

See the dynamic shapes guide for more details.

Preprocessing acceleration

The CPU plugin supports the full set of preprocessing operations and provides high-performance implementations for them.

See the preprocessing API guide for more details.
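
As a hedged illustration, a typical preprocessing setup built with the preprocessing API might look like this; the u8/NHWC input format and the NCHW model layout are assumptions made for the example:

from openvino.preprocess import PrePostProcessor
from openvino.runtime import Core, Layout, Type

core = Core()
model = core.read_model("model.xml")

ppp = PrePostProcessor(model)
# Describe the real input data: unsigned 8-bit tensors in NHWC layout.
ppp.input().tensor().set_element_type(Type.u8).set_layout(Layout("NHWC"))
# Describe the layout the model actually expects.
ppp.input().model().set_layout(Layout("NCHW"))
# The required conversion steps are inserted into the graph and executed on the CPU.
ppp.input().preprocess().convert_element_type(Type.f32).convert_layout()
model = ppp.build()

compiled_model = core.compile_model(model, "CPU")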

Model caching

The CPU device plugin supports the Import/Export network capability. If model caching is enabled via the common OpenVINO™ ov::cache_dir property, the plugin automatically creates a cached blob inside the specified directory during model compilation. This cached blob contains a partial representation of the network with common runtime optimizations and low-precision transformations already applied. On the next attempt to compile the model, the cached representation is loaded into the plugin instead of the initial IR, so the aforementioned steps are skipped. These steps take a significant amount of time during model compilation, so caching their results makes subsequent compilations of the model much faster, thus reducing first inference latency (FIL).
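
A minimal Python sketch of enabling the cache (the "model_cache" directory name is an arbitrary example):

from openvino.runtime import Core

core = Core()
# Any writable directory can be used as the cache location.
core.set_property({"CACHE_DIR": "model_cache"})
model = core.read_model("model.xml")
# The first compilation creates the cached blob; subsequent compilations reuse it.
compiled_model = core.compile_model(model, "CPU")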

See the model caching overview for more details.

Extensibility

The CPU device plugin supports fallback on the ov::Op reference implementation if the plugin lacks its own implementation of an operation. This means that the OpenVINO™ Extensibility Mechanism can be used for plugin extension as well. To enable fallback on a custom operation implementation, override the ov::Op::evaluate method in the derived operation class (see custom OpenVINO™ operations for details).

Note

At the moment, custom operations with internal dynamism (when the output tensor shape can only be determined as a result of performing the operation) are not supported by the plugin.

Stateful models

The CPU device plugin supports stateful models without any limitations.

See the stateful models guide for details.

Supported properties

The plugin supports the following properties:

Read-write properties

All parameters must be set before calling ov::Core::compile_model() in order to take effect, or passed as an additional argument to ov::Core::compile_model().
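
For example, in Python the same property can either be set on the core before compilation or passed directly to the compilation call (PERFORMANCE_HINT is used here only as an illustration):

from openvino.runtime import Core

core = Core()
model = core.read_model("model.xml")

# Set the property on the core before compilation...
core.set_property("CPU", {"PERFORMANCE_HINT": "LATENCY"})
compiled_model = core.compile_model(model, "CPU")

# ...or pass it as an additional argument to compile_model().
compiled_model = core.compile_model(model, "CPU", {"PERFORMANCE_HINT": "LATENCY"})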

External dependencies

For some performance-critical DL operations, the CPU plugin uses optimized implementations from the oneAPI Deep Neural Network Library (oneDNN).