CPU device

The CPU plugin is developed to achieve high-performance inference of neural networks on Intel® x86-64 CPUs. For an in-depth description of the CPU plugin, see the CPU plugin developer documentation.

The CPU plugin is a part of the Intel® Distribution of OpenVINO™ toolkit.

Device name

The CPU plugin uses the "CPU" device name, and even though a platform can have more than one socket, from the plugin's point of view there is only one "CPU" device. On multi-socket platforms, load balancing and memory usage distribution between NUMA nodes are handled automatically.

To use the CPU for inference, pass the device name to the ov::Core::compile_model() method:

// C++
ov::Core core;
auto model = core.read_model("model.xml");
auto compiled_model = core.compile_model(model, "CPU");

# Python
from openvino.runtime import Core

core = Core()
model = core.read_model("model.xml")
compiled_model = core.compile_model(model, "CPU")

Supported inference data types

CPU plugin supports the following data types as inference precision of internal primitives:

  • Floating-point data types:

    • f32

    • bf16

  • Integer data types:

    • i32

  • Quantized data types:

    • u8

    • i8

    • u1

Hello Query Device C++ Sample can be used to print out supported data types for all detected devices.

Quantized data types specifics

The selected precision of each primitive depends on the operation precision in IR, quantization primitives, and available hardware capabilities. The u1/u8/i8 data types are used for quantized operations only, i.e. they are not selected automatically for non-quantized operations.

See low-precision optimization guide for more details on how to get a quantized model.

Note

Platforms that do not support Intel® AVX512-VNNI have a known “saturation issue” which in some cases leads to reduced computational accuracy for u8/i8 precision calculations. See saturation (overflow) issue section to get more information on how to detect such issues and possible workarounds.

Floating point data types specifics

The default floating-point precision of a CPU primitive is f32. To support f16 IRs, the plugin internally converts all f16 values to f32, and all calculations are performed using native f32 precision. On platforms that natively support bfloat16 calculations (i.e. have the AVX512_BF16 extension), the bf16 type is automatically used instead of f32 to achieve better performance, so no special steps are required to run a model with bf16 precision. See the BFLOAT16 – Hardware Numerics Definition white paper for more details about the bfloat16 format.

Using bf16 precision provides the following performance benefits:

  • Faster multiplication of two bfloat16 numbers because of the shorter mantissa of bfloat16 data.

  • Reduced memory consumption, since the bfloat16 data size is half that of 32-bit float.

To check whether the CPU device supports the bfloat16 data type, use the query device properties interface to read the ov::device::capabilities property, which should contain BF16 in the list of CPU capabilities:

// C++
ov::Core core;
auto cpuOptimizationCapabilities = core.get_property("CPU", ov::device::capabilities);

# Python
from openvino.runtime import Core

core = Core()
cpu_optimization_capabilities = core.get_property("CPU", "OPTIMIZATION_CAPABILITIES")

If the model has been converted to bf16, ov::hint::inference_precision is set to ov::element::bf16 and can be checked via the ov::CompiledModel::get_property call. The code below demonstrates how to get the element type:

ov::Core core;
auto network = core.read_model("sample.xml");
auto exec_network = core.compile_model(network, "CPU");
auto inference_precision = exec_network.get_property(ov::hint::inference_precision);

To infer the model in f32 precision instead of bf16 on targets with native bf16 support, set the ov::hint::inference_precision to ov::element::f32.
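In C++, this can be done, for example, by setting the inference precision hint on the device before compiling the model. The snippet below is a minimal sketch based on the standard ov::hint::inference_precision property; the Python snippet that follows sets the same hint via its string name.

// C++ (a minimal sketch): force f32 inference precision on the CPU device
ov::Core core;
core.set_property("CPU", ov::hint::inference_precision(ov::element::f32));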

# Python
from openvino.runtime import Core

core = Core()
core.set_property("CPU", {"INFERENCE_PRECISION_HINT": "f32"})

Bfloat16 software simulation mode is available on CPUs with the Intel® AVX-512 instruction set that do not support the native avx512_bf16 instruction. This mode is used for development purposes and does not guarantee good performance. To enable the simulation, one has to explicitly set ov::hint::inference_precision to ov::element::bf16.
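For example (a minimal sketch; the call mirrors the f32 case above):

// C++ (a minimal sketch): explicitly request bf16, which enables the simulation mode
// on AVX-512 CPUs without the native avx512_bf16 instruction
ov::Core core;
core.set_property("CPU", ov::hint::inference_precision(ov::element::bf16));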

Note

An exception is thrown if ov::hint::inference_precision is set to ov::element::bf16 on a CPU without native bfloat16 support or the bfloat16 simulation mode.

Note

Due to the reduced mantissa size of the bfloat16 data type, the resulting bf16 inference accuracy may differ from the f32 inference, especially for models that were not trained using the bfloat16 data type. If the bf16 inference accuracy is not acceptable, it is recommended to switch to the f32 precision.

Supported features

Multi-device execution

If a machine has OpenVINO-supported devices other than the CPU (for example, an integrated GPU), then any supported model can be executed on the CPU and all the other devices simultaneously. For simultaneous usage of CPU and GPU, this is achieved by specifying "MULTI:CPU,GPU.0" as the target device.

// C++
ov::Core core;
auto model = core.read_model("model.xml");
auto compiled_model = core.compile_model(model, "MULTI:CPU,GPU.0");

# Python
from openvino.runtime import Core

core = Core()
model = core.read_model("model.xml")
compiled_model = core.compile_model(model, "MULTI:CPU,GPU.0")

See Multi-device execution page for more details.

Multi-stream execution

If either the ov::num_streams(n_streams) property with n_streams > 1 or the ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT) property is set for the CPU plugin, multiple streams are created for the model. In the case of the CPU plugin, each stream has its own host thread, which means that incoming infer requests can be processed simultaneously. Each stream is pinned to its own group of physical cores with respect to NUMA nodes' physical memory usage, to minimize the overhead of data transfer between NUMA nodes.
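For example, multiple streams can be enabled either by requesting an explicit stream count or via the throughput performance hint. The snippet below is a minimal sketch based on the standard ov::num_streams and ov::hint::performance_mode properties; the stream count of 4 is only illustrative:

// C++ (a minimal sketch): two ways to enable multi-stream execution on the CPU device
ov::Core core;
auto model = core.read_model("model.xml");

// Request a specific number of streams explicitly (4 is an arbitrary example value) ...
auto compiled_explicit = core.compile_model(model, "CPU", ov::num_streams(4));

// ... or let the plugin pick the number of streams via the THROUGHPUT hint
auto compiled_hint = core.compile_model(
    model, "CPU", ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT));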

See optimization guide for more details.

Note

When it comes to latency, keep in mind that running only one stream on a multi-socket platform may introduce additional overhead on data transfer between NUMA nodes. In that case it is better to use the ov::hint::PerformanceMode::LATENCY performance hint (see the performance hints overview for details).

Dynamic shapes

CPU plugin provides full functional support for models with dynamic shapes in terms of the opset coverage.

Note

The CPU plugin does not support tensors with dynamically changing rank. An attempt to infer a model with such tensors will result in an exception being thrown.

Dynamic shapes support introduces additional overhead on memory management and may limit internal runtime optimizations. The more degrees of freedom there are, the more difficult it is to achieve the best performance. The most flexible configuration, and the most convenient one, is the fully undefined shape, where no constraints are applied to the shape dimensions. However, reducing the level of uncertainty brings performance gains. Explicitly setting dynamic shapes with defined upper bounds reduces memory consumption through memory reuse, which results in better cache locality and, in turn, better inference performance.

// C++
ov::Core core;
auto model = core.read_model("model.xml");

model->reshape({{ov::Dimension(1, 10), ov::Dimension(1, 20), ov::Dimension(1, 30), ov::Dimension(1, 40)}});

# Python
from openvino.runtime import Core

core = Core()
model = core.read_model("model.xml")
model.reshape([(1, 10), (1, 20), (1, 30), (1, 40)])

Note

Using fully undefined shapes may result in significantly higher memory consumption compared to inferring the same model with static shapes. If the memory consumption is unacceptable but dynamic shapes are still required, one can reshape the model using shapes with defined upper bound to reduce memory footprint.

Some runtime optimizations work better if the model shapes are known in advance. Therefore, if the input data shape does not change between inference calls, it is recommended to use a model with static shapes or to reshape the existing model with a static input shape to get the best performance.

// C++
ov::Core core;
auto model = core.read_model("model.xml");
ov::Shape static_shape = {10, 20, 30, 40};

model->reshape(static_shape);

# Python
from openvino.runtime import Core

core = Core()
model = core.read_model("model.xml")
model.reshape([10, 20, 30, 40])

See dynamic shapes guide for more details.

Preprocessing acceleration

The CPU plugin supports the full set of preprocessing operations, providing high-performance implementations for them.
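As an illustration, the snippet below is a minimal sketch that uses ov::preprocess::PrePostProcessor to embed typical conversion steps into the model, so the CPU plugin can execute them with its optimized implementations; the u8/NHWC input format is only an assumed example:

// C++ (a minimal sketch): move element type and layout conversion into the compiled model
#include <openvino/core/preprocess/pre_post_process.hpp>

ov::Core core;
auto model = core.read_model("model.xml");

ov::preprocess::PrePostProcessor ppp(model);
// Assume the application supplies u8 NHWC data ...
ppp.input().tensor().set_element_type(ov::element::u8).set_layout("NHWC");
// ... and convert it to f32 NCHW inside the model, where the CPU plugin executes it
ppp.input().preprocess().convert_element_type(ov::element::f32).convert_layout("NCHW");
model = ppp.build();

auto compiled_model = core.compile_model(model, "CPU");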

See preprocessing API guide for more details.

Models caching

The CPU plugin supports the Import/Export network capability. If model caching is enabled via the common OpenVINO™ ov::cache_dir property, the plugin automatically creates a cached blob inside the specified directory during model compilation. This cached blob contains the intermediate representation of the network obtained after common runtime optimizations and low-precision transformations. The next time the model is compiled, the cached representation is loaded into the plugin instead of the initial IR, so the aforementioned transformation steps are skipped. These transformations take a significant amount of time during model compilation, so caching the representation reduces the time spent on subsequent compilations of the model, thereby reducing first inference latency (FIL).
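For example, enabling the cache only requires setting the directory before compilation (a minimal sketch; the directory name is only an example):

// C++ (a minimal sketch): enable model caching for subsequent compilations
ov::Core core;
core.set_property(ov::cache_dir("model_cache"));  // any writable directory

// The first compilation creates a cached blob in "model_cache";
// later compilations of the same model load it and skip the heavy transformations.
auto model = core.read_model("model.xml");
auto compiled_model = core.compile_model(model, "CPU");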

See model caching overview for more details.

Extensibility

The CPU plugin supports fallback to the ov::Op reference implementation if the plugin does not have its own implementation for an operation. That means that the OpenVINO™ Extensibility Mechanism can be used for plugin extension as well. To enable fallback to a custom operation implementation, override the ov::Op::evaluate method in the derived operation class (see custom OpenVINO™ operations for details).
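As an illustration, a custom operation providing such a fallback might look like the minimal sketch below; the operation (an element-wise absolute value) and its name are made up for this example, and error handling as well as support for other element types are omitted:

// C++ (a minimal sketch): a custom operation whose evaluate() serves as the CPU fallback
#include <openvino/op/op.hpp>
#include <cmath>

class CustomAbs : public ov::op::Op {
public:
    OPENVINO_OP("CustomAbs");

    CustomAbs() = default;
    explicit CustomAbs(const ov::Output<ov::Node>& arg) : Op({arg}) {
        constructor_validate_and_infer_types();
    }

    void validate_and_infer_types() override {
        // Output has the same element type and shape as the input
        set_output_type(0, get_input_element_type(0), get_input_partial_shape(0));
    }

    std::shared_ptr<ov::Node> clone_with_new_inputs(const ov::OutputVector& new_args) const override {
        return std::make_shared<CustomAbs>(new_args.at(0));
    }

    // Reference implementation used by the plugin when it has no native kernel
    bool evaluate(ov::TensorVector& outputs, const ov::TensorVector& inputs) const override {
        const auto& in = inputs[0];
        auto& out = outputs[0];
        out.set_shape(in.get_shape());
        const float* src = in.data<float>();
        float* dst = out.data<float>();
        for (size_t i = 0; i < in.get_size(); ++i)
            dst[i] = std::abs(src[i]);
        return true;
    }

    bool has_evaluate() const override {
        return true;
    }
};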

Note

At the moment, custom operations with internal dynamism (when the output tensor shape can only be determined as a result of performing the operation) are not supported by the plugin.

Stateful models

CPU plugin supports stateful models without any limitations.

See stateful models guide for details.

Supported properties

The plugin supports the properties listed below.

Read-write properties

All parameters must be set before calling ov::Core::compile_model() in order to take effect, or passed as an additional argument to ov::Core::compile_model().
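For example (a minimal sketch; the chosen properties and values are only illustrative):

// C++ (a minimal sketch): set a property before compilation, or pass it to compile_model()
ov::Core core;
auto model = core.read_model("model.xml");

// Set on the device before compile_model() ...
core.set_property("CPU", ov::enable_profiling(true));

// ... or pass as an additional argument to compile_model()
auto compiled_model = core.compile_model(
    model, "CPU", ov::hint::performance_mode(ov::hint::PerformanceMode::LATENCY));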

External dependencies

For some performance-critical DL operations, the CPU plugin uses optimized implementations from the oneAPI Deep Neural Network Library (oneDNN).