Bfloat16 Inference

Bfloat16 Inference Usage (C++)

Disclaimer

The bfloat16 inference implementation in the Inference Engine CPU plugin requires a CPU with native support for the avx512_bf16 instruction and, therefore, the bfloat16 data format. Bfloat16 inference can also be run in simulation mode on platforms with Intel® Advanced Vector Extensions 512 (Intel® AVX-512) but without avx512_bf16, although this leads to significant performance degradation in comparison with FP32 or native avx512_bf16 execution.

Introduction

Bfloat16 (referred to as BF16) is the 16-bit Brain Floating-Point format, a truncated version of the 32-bit IEEE 754 single-precision floating-point format (FP32). BF16 preserves the same 8 exponent bits as FP32 but reduces the mantissa precision from 24 bits to 8 bits.

[Figure: bfloat16 format compared with FP32 — 1 sign bit, 8 exponent bits, 7 mantissa bits]

Preserving the exponent bits keeps BF16 in the same numeric range as FP32 (~1e-38 to ~3e38). This simplifies conversion between the two data types: to go from FP32 to BF16, you only need to drop (or flush to zero) the 16 low-order bits. The truncated mantissa occasionally costs precision, but investigations show that neural networks are more sensitive to the size of the exponent than to the size of the mantissa. Also, in many models precision matters close to zero rather than at the maximum of the range. Another useful property of BF16 is that INT8 can be encoded in BF16 without loss of accuracy, because the INT8 range fits completely into the BF16 mantissa field. This reduces data flow: INT8 input image data can be converted to BF16 directly without an intermediate FP32 representation, and INT8 inference can be combined with BF16 layers.
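
As an illustration of this truncation, the sketch below converts a single FP32 value to BF16 and back by dropping or restoring the 16 low-order bits. It is a minimal example for clarity only, not the Inference Engine implementation, which may use rounding rather than plain truncation.

#include <cstdint>
#include <cstring>

// Convert FP32 to BF16 by keeping only the 16 high-order bits (plain truncation).
uint16_t fp32_to_bf16(float value) {
    uint32_t bits;
    std::memcpy(&bits, &value, sizeof(bits));   // reinterpret the float as raw bits
    return static_cast<uint16_t>(bits >> 16);   // drop the 16 low-order mantissa bits
}

// Convert BF16 back to FP32 by placing the 16 bits into the high half of an FP32 word.
float bf16_to_fp32(uint16_t bf16) {
    uint32_t bits = static_cast<uint32_t>(bf16) << 16;
    float value;
    std::memcpy(&value, &bits, sizeof(value));
    return value;
}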

See the BFLOAT16 – Hardware Numerics Definition white paper for more bfloat16 format details.

There are two ways to check whether the CPU device supports bfloat16 computations for models:

  1. Query the instruction set using one of these system commands:

    • lscpu | grep avx512_bf16

    • cat /proc/cpuinfo | grep avx512_bf16

  2. Use the Query API with METRIC_KEY(OPTIMIZATION_CAPABILITIES), which should return BF16 in the list of CPU optimization options:

InferenceEngine::Core core;
auto cpuOptimizationCapabilities = core.GetMetric("CPU", METRIC_KEY(OPTIMIZATION_CAPABILITIES)).as<std::vector<std::string>>();
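
Assuming the query above succeeded, the presence of the "BF16" string in the returned vector indicates native bfloat16 support. A minimal check might look like this:

#include <algorithm>

// true if the CPU plugin reports native bfloat16 support
bool hasBF16 = std::find(cpuOptimizationCapabilities.begin(),
                         cpuOptimizationCapabilities.end(),
                         "BF16") != cpuOptimizationCapabilities.end();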

The current Inference Engine solution for bfloat16 inference uses the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) and supports inference of a significant number of layers in BF16 computation mode.

Lowering Inference Precision

Lowering precision to increase performance is widely used to optimize inference. Using the bfloat16 data type on CPU opens, for the first time, the possibility of a default optimization approach: use the optimization capabilities of the current platform to achieve maximum performance while keeping the accuracy of calculations within an acceptable range.

Using Bfloat16 precision provides the following performance benefits:

  1. Faster multiplication of two BF16 numbers because of the shorter mantissa of the bfloat16 data.

  2. No need to support denormals or handle floating-point exceptions, which is itself a performance optimization.

  3. Fast conversion of float32 to bfloat16 and vice versa.

  4. Reduced size of data in memory, so larger models fit in the same memory bounds.

  5. Reduced amount of data that must be transferred and, as a result, reduced data transfer time.

For default optimization on CPU, the source model is converted from FP32 or FP16 to BF16 and executed internally on platforms with native BF16 support. In this case, the KEY_ENFORCE_BF16 key from PluginConfigParams is set to YES, which GetConfig() reports. The code below demonstrates how to check whether the key is set:

InferenceEngine::Core core;
auto network = core.ReadNetwork("sample.xml");
auto exeNetwork = core.LoadNetwork(network, "CPU");
auto enforceBF16 = exeNetwork.GetConfig(PluginConfigParams::KEY_ENFORCE_BF16).as<std::string>();
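
As a small follow-up, assuming the snippet above ran on a platform with native BF16 support, the returned value can be compared against the YES constant from PluginConfigParams:

// "YES" means the plugin will apply BF16 internal transformations to this network
if (enforceBF16 == InferenceEngine::PluginConfigParams::YES) {
    // BF16 execution is enforced on this platform
}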

To disable BF16 internal transformations in the C++ API, set KEY_ENFORCE_BF16 to NO. In this case, the model is inferred as is, without modifications, using the precisions that were set on each layer edge.

InferenceEngine::Core core;
core.SetConfig({ { CONFIG_KEY(ENFORCE_BF16), CONFIG_VALUE(NO) } }, "CPU");

To disable BF16 via the C API:

ie_config_t config = { "ENFORCE_BF16", "NO", NULL};
ie_core_load_network(core, network, device_name, &config, &exe_network);

An exception with the message Platform doesn't support BF16 format is thrown if KEY_ENFORCE_BF16 is set to YES on a CPU that has neither native BF16 support nor BF16 simulation mode.

Low-precision 8-bit integer models cannot be converted to BF16, even if bfloat16 optimization is enabled by default.

Bfloat16 Simulation Mode

Bfloat16 simulation mode is available on CPU platforms that do not support the native avx512_bf16 instruction but do support the Intel® AVX-512 extensions. The simulator does not guarantee adequate performance.

To enable the simulation of Bfloat16:

  • In the Benchmark App, add the -enforcebf16=true option

  • In C++ API, set KEY_ENFORCE_BF16 to YES

  • In C API:

    ie_config_t config = { "ENFORCE_BF16", "YES", NULL};
    ie_core_load_network(core, network, device_name, &config, &exe_network);

Performance Counters

Information about layer precision is stored in the performance counters that are available from the Inference Engine API. The layers have the following marks:

  • Suffix BF16 for layers that had bfloat16 data type input and were computed in BF16 precision

  • Suffix FP32 for layers computed in 32-bit precision

For example, the performance counters table for the Inception model can look as follows:

pool5     EXECUTED       layerType: Pooling            realTime: 143       cpu: 143        execType: jit_avx512_BF16
fc6       EXECUTED       layerType: FullyConnected     realTime: 47723     cpu: 47723      execType: jit_gemm_BF16
relu6     NOT_RUN        layerType: ReLU               realTime: 0         cpu: 0          execType: undef
fc7       EXECUTED       layerType: FullyConnected     realTime: 7558      cpu: 7558       execType: jit_gemm_BF16
relu7     NOT_RUN        layerType: ReLU               realTime: 0         cpu: 0          execType: undef
fc8       EXECUTED       layerType: FullyConnected     realTime: 2193      cpu: 2193       execType: jit_gemm_BF16
prob      EXECUTED       layerType: SoftMax            realTime: 68        cpu: 68         execType: jit_avx512_FP32

The execType column of the table includes inference primitives with specific suffixes.
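
The counters themselves can be retrieved through the InferRequest API. The sketch below, which assumes performance counting is enabled when the network is loaded and that the inputs are filled elsewhere, prints each layer name together with its execType:

#include <inference_engine.hpp>
#include <iostream>

InferenceEngine::Core core;
auto network = core.ReadNetwork("sample.xml");
// Enable collection of performance counters when loading the network
auto exeNetwork = core.LoadNetwork(network, "CPU",
    { { CONFIG_KEY(PERF_COUNT), CONFIG_VALUE(YES) } });
auto request = exeNetwork.CreateInferRequest();
request.Infer();  // input blobs are assumed to be filled elsewhere
// exec_type carries the BF16/FP32 suffix described above
for (const auto &counter : request.GetPerformanceCounts()) {
    std::cout << counter.first << "\t" << counter.second.exec_type << std::endl;
}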

Bfloat16 Inference Usage (Python)

Disclaimer

The bfloat16 inference implementation in the Inference Engine CPU plugin requires a CPU with native support for the avx512_bf16 instruction and, therefore, the bfloat16 data format. Bfloat16 inference can also be run in simulation mode on platforms with Intel® Advanced Vector Extensions 512 (Intel® AVX-512) but without avx512_bf16, although this leads to significant performance degradation in comparison with FP32 or native avx512_bf16 execution.

Introduction

Bfloat16 (referred to as BF16) is the 16-bit Brain Floating-Point format, a truncated version of the 32-bit IEEE 754 single-precision floating-point format (FP32). BF16 preserves the same 8 exponent bits as FP32 but reduces the mantissa precision from 24 bits to 8 bits.

[Figure: bfloat16 format compared with FP32 — 1 sign bit, 8 exponent bits, 7 mantissa bits]

Preserving the exponent bits keeps BF16 in the same numeric range as FP32 (~1e-38 to ~3e38). This simplifies conversion between the two data types: to go from FP32 to BF16, you only need to drop (or flush to zero) the 16 low-order bits. The truncated mantissa occasionally costs precision, but investigations show that neural networks are more sensitive to the size of the exponent than to the size of the mantissa. Also, in many models precision matters close to zero rather than at the maximum of the range. Another useful property of BF16 is that INT8 can be encoded in BF16 without loss of accuracy, because the INT8 range fits completely into the BF16 mantissa field. This reduces data flow: INT8 input image data can be converted to BF16 directly without an intermediate FP32 representation, and INT8 inference can be combined with BF16 layers.

See the BFLOAT16 – Hardware Numerics Definition white paper for more bfloat16 format details.

There are two ways to check whether the CPU device supports bfloat16 computations for models:

  1. Query the instruction set using one of these system commands:

    • lscpu | grep avx512_bf16

    • cat /proc/cpuinfo | grep avx512_bf16

  2. Use the Query API with METRIC_KEY(OPTIMIZATION_CAPABILITIES), which should return BF16 in the list of CPU optimization options:

from openvino.inference_engine import IECore

ie = IECore()
net = ie.read_network(path_to_xml_file)
cpu_caps = ie.get_metric(metric_name="OPTIMIZATION_CAPABILITIES", device_name="CPU")
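
Assuming the query above succeeded, the presence of the "BF16" string in the returned capabilities indicates native bfloat16 support:

# True if the CPU plugin reports native bfloat16 support
has_bf16 = "BF16" in cpu_caps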

The current Inference Engine solution for bfloat16 inference uses the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) and supports inference of a significant number of layers in BF16 computation mode.

Lowering Inference Precision

Lowering precision to increase performance is widely used to optimize inference. Using the bfloat16 data type on CPU opens, for the first time, the possibility of a default optimization approach: use the optimization capabilities of the current platform to achieve maximum performance while keeping the accuracy of calculations within an acceptable range.

Using Bfloat16 precision provides the following performance benefits:

  1. Faster multiplication of two BF16 numbers because of the shorter mantissa of the bfloat16 data.

  2. No need to support denormals or handle floating-point exceptions, which is itself a performance optimization.

  3. Fast conversion of float32 to bfloat16 and vice versa.

  4. Reduced size of data in memory, so larger models fit in the same memory bounds.

  5. Reduced amount of data that must be transferred and, as a result, reduced data transfer time.

For default optimization on CPU, the source model is converted from FP32 or FP16 to BF16 and executed internally on platforms with native BF16 support. In this case, ENFORCE_BF16 is set to YES. The code below demonstrates how to check whether the key is set:

from openvino.inference_engine import IECore

ie = IECore()
net = ie.read_network(path_to_xml_file)
exec_net = ie.load_network(network=net, device_name="CPU")
exec_net.get_config("ENFORCE_BF16")
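
On a platform with native BF16 support, the returned value is expected to be the string "YES"; a minimal check could be:

# "YES" means the plugin will apply BF16 internal transformations to this network
bf16_enforced = exec_net.get_config("ENFORCE_BF16") == "YES"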

To enable BF16 internal transformations, set the key "ENFORCE_BF16" to "YES" in the ExecutableNetwork configuration.

bf16_config = {"ENFORCE_BF16" : "YES"}
exec_net = ie.load_network(network=net, device_name="CPU", config = bf16_config)

To disable BF16 internal transformations, set the key "ENFORCE_BF16" to "NO". In this case, the model is inferred as is, without modifications, using the precisions that were set on each layer edge.

An exception with the message Platform doesn't support BF16 format is thrown if "ENFORCE_BF16" is set to "YES" on a CPU that has neither native BF16 support nor BF16 simulation mode.

Low-precision 8-bit integer models cannot be converted to BF16, even if bfloat16 optimization is enabled by default.

Bfloat16 Simulation Mode

Bfloat16 simulation mode is available on CPU platforms that do not support the native avx512_bf16 instruction but do support the Intel® AVX-512 extensions. The simulator does not guarantee adequate performance.

To enable the simulation of Bfloat16:

  • In the Benchmark App, add the -enforcebf16=true option

  • In Python, use the following code as an example:

from openvino.inference_engine import IECore

ie = IECore()
net = ie.read_network(path_to_xml_file)
bf16_config = {"ENFORCE_BF16" : "YES"}
exec_net = ie.load_network(network=net, device_name="CPU", config=bf16_config)

Performance Counters

Information about layer precision is stored in the performance counters that are available from the Inference Engine API. The layers have the following marks:

  • Suffix BF16 for layers that had bfloat16 data type input and were computed in BF16 precision

  • Suffix FP32 for layers computed in 32-bit precision

For example, the performance counters table for the Inception model can look as follows:

pool5     EXECUTED       layerType: Pooling            realTime: 143       cpu: 143        execType: jit_avx512_BF16
fc6       EXECUTED       layerType: FullyConnected     realTime: 47723     cpu: 47723      execType: jit_gemm_BF16
relu6     NOT_RUN        layerType: ReLU               realTime: 0         cpu: 0          execType: undef
fc7       EXECUTED       layerType: FullyConnected     realTime: 7558      cpu: 7558       execType: jit_gemm_BF16
relu7     NOT_RUN        layerType: ReLU               realTime: 0         cpu: 0          execType: undef
fc8       EXECUTED       layerType: FullyConnected     realTime: 2193      cpu: 2193       execType: jit_gemm_BF16
prob      EXECUTED       layerType: SoftMax            realTime: 68        cpu: 68         execType: jit_avx512_FP32

The execType column of the table includes inference primitives with specific suffixes.
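
These counters can also be read programmatically. The sketch below, which assumes path_to_xml_file and input_data are defined elsewhere, enables performance counting, runs one inference, and prints each layer name together with its exec_type:

from openvino.inference_engine import IECore

ie = IECore()
net = ie.read_network(path_to_xml_file)
# Enable collection of performance counters together with BF16 enforcement
bf16_config = {"ENFORCE_BF16": "YES", "PERF_COUNT": "YES"}
exec_net = ie.load_network(network=net, device_name="CPU", config=bf16_config)
exec_net.infer(inputs=input_data)
# exec_type carries the BF16/FP32 suffix described above
for layer_name, stats in exec_net.requests[0].get_perf_counts().items():
    print(layer_name, stats["exec_type"])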