CPU Device

The CPU plugin is a part of the Intel® Distribution of OpenVINO™ toolkit and is developed to achieve high-performance inference of neural networks on Intel® x86-64 CPUs.

Device Name

The CPU device name is used for the CPU plugin. Even though there can be more than one physical socket on a platform, only one device of this kind is listed by OpenVINO. On multi-socket platforms, load balancing and memory usage distribution between NUMA nodes are handled automatically.

In order to use CPU for inference, the device name should be passed to the ov::Core::compile_model() method:

ov::Core core;
auto model = core.read_model("model.xml");
auto compiled_model = core.compile_model(model, "CPU");

from openvino.runtime import Core

core = Core()
model = core.read_model("model.xml")
compiled_model = core.compile_model(model, "CPU")

Supported Inference Data Types

CPU plugin supports the following data types as inference precision of internal primitives:

  • Floating-point data types:

    • f32

    • bf16

  • Integer data types:

    • i32

  • Quantized data types:

    • u8

    • i8

    • u1

Hello Query Device C++ Sample can be used to print out supported data types for all detected devices.
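
If a programmatic check is preferred, the same information can be queried through the property API. The following minimal sketch (using only the public ov::Core::get_available_devices() and ov::device::capabilities calls) prints the reported capabilities, including the supported data types, for every detected device:

ov::Core core;
// Print the capability list reported by every detected device;
// it includes the supported inference data types (e.g. FP32, BF16, INT8).
for (const auto& device : core.get_available_devices()) {
    std::cout << device << std::endl;
    for (const auto& capability : core.get_property(device, ov::device::capabilities)) {
        std::cout << "    " << capability << std::endl;
    }
}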

Quantized Data Types Specifics

The selected precision of each primitive depends on the operation precision in IR, quantization primitives, and available hardware capabilities. The u1/u8/i8 data types are used for quantized operations only, i.e., they are not selected automatically for non-quantized operations.

See the low-precision optimization guide for more details on how to get a quantized model.

Note

Platforms that do not support Intel® AVX512-VNNI have a known “saturation issue” that may lead to reduced computational accuracy for u8/i8 precision calculations. See the saturation (overflow) issue section to get more information on how to detect such issues and possible workarounds.

Floating Point Data Types Specifics

The default floating-point precision of a CPU primitive is f32. To support the f16 OpenVINO IR, the plugin internally converts all f16 values to f32, and all calculations are performed using the native f32 precision. On platforms that natively support bfloat16 calculations (i.e. have the AVX512_BF16 extension), the bf16 type is automatically used instead of f32 to achieve better performance. Thus, no special steps are required to run a bf16 model. For more details about the bfloat16 format, see the BFLOAT16 – Hardware Numerics Definition white paper.

Using the bf16 precision provides the following performance benefits:

  • Faster multiplication of two bfloat16 numbers because of the shorter mantissa of the bfloat16 data.

  • Reduced memory consumption, since bfloat16 data is half the size of 32-bit float.

To check if the CPU device supports the bfloat16 data type, use the query device properties interface to query the ov::device::capabilities property, which should contain BF16 in the list of CPU capabilities:

ov::Core core;
auto cpuOptimizationCapabilities = core.get_property("CPU", ov::device::capabilities);

core = Core()
cpu_optimization_capabilities = core.get_property("CPU", "OPTIMIZATION_CAPABILITIES")

If the model has been converted to bf16, the ov::hint::inference_precision is set to ov::element::bf16 and can be checked via the ov::CompiledModel::get_property call. The code below demonstrates how to get the element type:

ov::Core core;
auto network = core.read_model("sample.xml");
auto exec_network = core.compile_model(network, "CPU");
auto inference_precision = exec_network.get_property(ov::hint::inference_precision);

To infer the model in f32 precision instead of bf16 on targets with native bf16 support, set the ov::hint::inference_precision to ov::element::f32.

core = Core()
core.set_property("CPU", {"INFERENCE_PRECISION_HINT": "f32"})
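
The C++ counterpart is a short sketch based on the same ov::hint::inference_precision property discussed above:

ov::Core core;
// Keep inference in f32 even on platforms with native bf16 support.
core.set_property("CPU", ov::hint::inference_precision(ov::element::f32));
auto compiled_model = core.compile_model(core.read_model("model.xml"), "CPU");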

The bfloat16 software simulation mode is available on CPUs with the Intel® AVX-512 instruction set that do not support the native avx512_bf16 instruction. This mode is used for development purposes and does not guarantee good performance. To enable the simulation, the ov::hint::inference_precision property has to be explicitly set to ov::element::bf16.
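
For illustration, enabling the simulation is the same explicit property setting, only with bf16 as the requested precision (a sketch for the CPU device):

ov::Core core;
// On AVX-512 platforms without avx512_bf16, explicitly requesting bf16 enables the simulation mode.
core.set_property("CPU", ov::hint::inference_precision(ov::element::bf16));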

Note

If ov::hint::inference_precision is set to ov::element::bf16 on a CPU without native bfloat16 support or bfloat16 simulation mode, an exception is thrown.

Note

Due to the reduced mantissa size of the bfloat16 data type, the resulting bf16 inference accuracy may differ from the f32 inference, especially for models that were not trained using the bfloat16 data type. If the bf16 inference accuracy is not acceptable, it is recommended to switch to the f32 precision.

Supported Features

Multi-device Execution

If a system includes OpenVINO-supported devices other than the CPU (e.g. an integrated GPU), then any supported model can be executed on all the devices simultaneously. This can be achieved by specifying MULTI:CPU,GPU.0 as the target device when the CPU and GPU are used together.

ov::Core core;
auto model = core.read_model("model.xml");
auto compiled_model = core.compile_model(model, "MULTI:CPU,GPU.0");

core = Core()
model = core.read_model("model.xml")
compiled_model = core.compile_model(model, "MULTI:CPU,GPU.0")

For more details, see the Multi-device execution article.

Multi-stream Execution

If either the ov::num_streams(n_streams) property with n_streams > 1 or the ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT) property is set for the CPU plugin, multiple streams are created for the model. In the case of the CPU plugin, each stream has its own host thread, which means that incoming infer requests can be processed simultaneously. Each stream is pinned to its own group of physical cores, taking NUMA node memory locality into account, to minimize the overhead of data transfer between NUMA nodes.
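
For illustration, either of the following compile-time configurations creates multiple streams on the CPU device. This is a sketch based on the ov::num_streams and ov::hint::performance_mode properties mentioned above; the stream count of 4 is an arbitrary example value:

ov::Core core;
auto model = core.read_model("model.xml");

// Option 1: request a specific number of streams explicitly.
auto compiled_model_streams = core.compile_model(model, "CPU", ov::num_streams(4));

// Option 2: let the plugin choose the number of streams via the THROUGHPUT hint.
auto compiled_model_tput = core.compile_model(model, "CPU",
    ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT));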

For more details, see the optimization guide.

Note

When it comes to latency, be aware that running only one stream on a multi-socket platform may introduce additional overhead on data transfer between NUMA nodes. In that case, it is better to use the ov::hint::PerformanceMode::LATENCY performance hint. For more details, see the performance hints overview.
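
A minimal sketch of selecting the latency hint for this case:

ov::Core core;
auto model = core.read_model("model.xml");
// Prefer a single-stream, latency-oriented configuration.
auto compiled_model = core.compile_model(model, "CPU",
    ov::hint::performance_mode(ov::hint::PerformanceMode::LATENCY));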

Dynamic Shapes

CPU provides full functional support for models with dynamic shapes in terms of the opset coverage.

Note

The CPU plugin does not support tensors with dynamically changing rank. In case of an attempt to infer a model with such tensors, an exception will be thrown.

Dynamic shapes support introduces additional overhead on memory management and may limit internal runtime optimizations. The more degrees of freedom are used, the more difficult it is to achieve the best performance. The most flexible configuration, and the most convenient approach, is the fully undefined shape, which means that no constraints to the shape dimensions are applied. However, reducing the level of uncertainty results in performance gains. You can reduce memory consumption through memory reuse, achieving better cache locality and increasing inference performance. To do so, set dynamic shapes explicitly, with defined upper bounds.

ov::Core core;
auto model = core.read_model("model.xml");

model->reshape({{ov::Dimension(1, 10), ov::Dimension(1, 20), ov::Dimension(1, 30), ov::Dimension(1, 40)}});

core = Core()
model = core.read_model("model.xml")
model.reshape([(1, 10), (1, 20), (1, 30), (1, 40)])

Note

Using fully undefined shapes may result in significantly higher memory consumption compared to inferring the same model with static shapes. If memory consumption is unacceptable but dynamic shapes are still required, the model can be reshaped using shapes with defined upper bounds to reduce memory footprint.

Some runtime optimizations work better if the model shapes are known in advance. Therefore, if the input data shape is not changed between inference calls, it is recommended to use a model with static shapes or reshape the existing model with the static input shape to get the best performance.

ov::Core core;
auto model = core.read_model("model.xml");
ov::Shape static_shape = {10, 20, 30, 40};

model->reshape(static_shape);

core = Core()
model = core.read_model("model.xml")
model.reshape([10, 20, 30, 40])

For more details, see the dynamic shapes guide.

Preprocessing Acceleration

The CPU plugin supports the full set of preprocessing operations, providing high-performance implementations for them.
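
As an illustration, the sketch below builds a few typical preprocessing steps with the ov::preprocess::PrePostProcessor API; the u8 input type and the NHWC/NCHW layouts are arbitrary example choices, not requirements of the plugin:

ov::Core core;
auto model = core.read_model("model.xml");

// Declare that runtime tensors arrive as u8 NHWC data and let the conversion
// and layout change be compiled into the model together with inference.
ov::preprocess::PrePostProcessor ppp(model);
ppp.input().tensor().set_element_type(ov::element::u8).set_layout("NHWC");
ppp.input().model().set_layout("NCHW");
ppp.input().preprocess().convert_element_type(ov::element::f32);
model = ppp.build();

auto compiled_model = core.compile_model(model, "CPU");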

For more details, see preprocessing API guide.

Models Caching

The CPU plugin supports the Import/Export network capability. If model caching is enabled via the common OpenVINO™ ov::cache_dir property, the plugin automatically creates a cached blob inside the specified directory during model compilation. This cached blob contains a partial representation of the network with common runtime optimizations and low-precision transformations already applied. The next time the model is compiled, the cached representation is loaded into the plugin instead of the initial OpenVINO IR, so the aforementioned transformation steps are skipped. These transformations take a significant amount of time during model compilation, so caching their result reduces the time spent on subsequent compilations of the model, thereby reducing first inference latency (FIL).
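
For example, enabling the cache requires a single property call before model compilation (a sketch; the directory name is arbitrary):

ov::Core core;
// Any writable directory can be used to store the cached blobs.
core.set_property(ov::cache_dir("model_cache"));
auto model = core.read_model("model.xml");
// The first compilation creates the cached blob; subsequent compilations reuse it.
auto compiled_model = core.compile_model(model, "CPU");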

For more details, see the model caching overview.

Extensibility

The CPU plugin supports fallback on the ov::Op reference implementation if the plugin does not have its own implementation for an operation. This means that the OpenVINO™ Extensibility Mechanism can be used to extend the plugin as well. Enabling fallback on a custom operation implementation is possible by overriding the ov::Op::evaluate method in the derived operation class (see custom OpenVINO™ operations for details).
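
The sketch below shows the general shape of such a derived operation. CustomAbs is a hypothetical operation used only for illustration, and the evaluate body assumes f32 data for brevity:

#include <memory>
#include <openvino/op/op.hpp>

// Hypothetical custom operation: element-wise absolute value.
class CustomAbs : public ov::op::Op {
public:
    OPENVINO_OP("CustomAbs");

    CustomAbs() = default;
    CustomAbs(const ov::Output<ov::Node>& arg) : ov::op::Op({arg}) {
        constructor_validate_and_infer_types();
    }

    void validate_and_infer_types() override {
        // The output has the same element type and shape as the input.
        set_output_type(0, get_input_element_type(0), get_input_partial_shape(0));
    }

    std::shared_ptr<ov::Node> clone_with_new_inputs(const ov::OutputVector& new_args) const override {
        return std::make_shared<CustomAbs>(new_args.at(0));
    }

    // The CPU plugin falls back to this reference implementation when it has
    // no native implementation for the operation.
    bool evaluate(ov::TensorVector& outputs, const ov::TensorVector& inputs) const override {
        outputs[0].set_shape(inputs[0].get_shape());
        const float* in = inputs[0].data<float>();
        float* out = outputs[0].data<float>();
        for (size_t i = 0; i < inputs[0].get_size(); ++i)
            out[i] = in[i] < 0.0f ? -in[i] : in[i];
        return true;
    }

    bool has_evaluate() const override {
        return true;
    }
};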

Note

At the moment, custom operations with internal dynamism (when the output tensor shape can only be determined as a result of performing the operation) are not supported by the plugin.

Stateful Models

The CPU plugin supports stateful models without any limitations.

For details, see stateful models guide.

External Dependencies

For some performance-critical DL operations, the CPU plugin uses optimized implementations from the oneAPI Deep Neural Network Library (oneDNN).

Optimization guide

Denormals Optimization

A denormal number is a non-zero, finite float number very close to zero, i.e. a number in (0, 1.17549e-38) or (-1.17549e-38, 0). In such cases, the normalized-number encoding format cannot represent the number, and underflow happens. Computations involving such numbers are extremely slow on many hardware platforms.

Since a denormal number is extremely close to zero, treating it directly as zero is a straightforward and simple way to optimize computations on denormals. Because this optimization does not comply with the IEEE 754 standard and may introduce unacceptable accuracy degradation, the ov::intel_cpu::denormals_optimization property is provided to control this behavior. If denormal numbers occur in the use case and no or negligible accuracy drop is observed, the property can be set to true to improve performance; otherwise, set it to false. If the property is not set explicitly and the application does not perform any denormals optimization of its own, the optimization is disabled by default. Once the property is enabled, OpenVINO provides a safe, cross-operating-system, cross-compiler optimization on all platforms where applicable.

In some cases, the application in which OpenVINO is used already performs this low-level denormals optimization itself. If it does so by setting the FTZ (Flush-To-Zero) and DAZ (Denormals-As-Zero) flags in the MXCSR register at the beginning of the thread where OpenVINO is called, OpenVINO inherits these settings in the same thread and its sub-threads, and the property does not need to be set. In this case, the application is responsible for the effectiveness and safety of these settings.
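
For reference, such an application-side setup typically looks like the following sketch, which relies on the standard SSE intrinsics rather than on any OpenVINO API:

#include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE
#include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE

// Called at the beginning of the thread that will invoke OpenVINO:
// denormal results are flushed to zero and denormal inputs are treated as zero.
_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
_MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);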

Note that this property must be set before calling compile_model().

To enable denormals optimization, the application must set the ov::intel_cpu::denormals_optimization property to true:

        ov::Core core;                                                    // Step 1: create ov::Core object
        core.set_property(ov::intel_cpu::denormals_optimization(true));   // Step 1b: Enable denormals optimization
        auto model = core.read_model(modelPath);                          // Step 2: Read Model
        //...                                                             // Step 3: Prepare inputs/outputs
        //...                                                             // Step 4: Set device configuration
        auto compiled = core.compile_model(model, device, config);        // Step 5: Compile the model

import openvino.runtime as ov

core = ov.Core()
core.set_property("CPU", ov.properties.intel_cpu.denormals_optimization(True))
model = core.read_model(model=xml_path)
compiled_model = core.compile_model(model=model, device_name=device_name)