Microscaling (MX) Quantization#

The Microscaling (MX) quantization method enables quantizing LLMs at a high compression rate with minimal loss of accuracy, helping maintain model performance comparable to that of conventional FP32. It improves compute and storage efficiency by using low bit-width floating-point and integer-based data formats:

Data format   Data type                 Description
MXFP8         FP8 (E5M2), FP8 (E4M3)    Floating point, 8-bit
MXFP6         FP6 (E3M2), FP6 (E2M3)    Floating point, 6-bit
MXFP4         FP4 (E2M1)                Floating point, 4-bit
MXINT8        INT8                      Integer, 8-bit
Currently, only the MXFP4 (E2M1) data format is supported in NNCF, and only for quantization on CPU. E2M1 may be considered to improve accuracy; however, quantized models will not be faster than those compressed to INT8_ASYM.

Quantization to the E2M1 data type compresses weights to 4-bit without a zero point and with 8-bit E8M0 scales. To quantize a model to E2M1, set mode=CompressWeightsMode.E2M1 in nncf.compress_weights(). A group size of 32 is recommended. See the example below:

from nncf import compress_weights, CompressWeightsMode

# Compress all weight layers to 4-bit E2M1 with 32-element groups and E8M0 scales
compressed_model = compress_weights(
    model,
    mode=CompressWeightsMode.E2M1,
    group_size=32,
    all_layers=True,
)
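To make the group-scaling arithmetic concrete, below is a minimal NumPy sketch of the quantize-dequantize step described above. It is not the NNCF implementation; the function name fake_quantize_e2m1 and the choice of a power-of-two scale that fits each group into the E2M1 range are illustrative assumptions.

import numpy as np

GROUP_SIZE = 32                                                  # recommended group size
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])   # representable FP4 (E2M1) magnitudes
E2M1_MAX = 6.0

def fake_quantize_e2m1(weights):
    """Quantize-dequantize a 1-D weight row with a shared E8M0 scale per group (illustrative sketch)."""
    out = np.empty(weights.shape, dtype=np.float64)
    for start in range(0, weights.size, GROUP_SIZE):
        group = weights[start:start + GROUP_SIZE].astype(np.float64)
        amax = np.abs(group).max()
        # E8M0 scale: a pure power of two (8-bit exponent, no mantissa), no zero point
        scale = 2.0 ** np.ceil(np.log2(amax / E2M1_MAX)) if amax > 0 else 1.0
        # Snap each magnitude to the nearest representable E2M1 value, keeping the sign
        mags = np.abs(group / scale)
        idx = np.abs(mags[:, None] - E2M1_GRID[None, :]).argmin(axis=1)
        out[start:start + GROUP_SIZE] = np.sign(group) * E2M1_GRID[idx] * scale
    return out

row = np.random.randn(128).astype(np.float32)
print(np.abs(row - fake_quantize_e2m1(row)).max())   # worst-case quantization error for the row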

Note

Different values for group_size and ratio are also supported.
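For instance, a configuration that combines a larger group size with a ratio below 1.0 (so that the remaining share of weights stays in the backup INT8_ASYM precision) could look like the sketch below; the specific values are illustrative only.

from nncf import compress_weights, CompressWeightsMode

# Illustrative settings: 64-element groups, with roughly 80% of the weights
# compressed to E2M1 and the rest kept in the 8-bit backup precision
compressed_model = compress_weights(
    model,
    mode=CompressWeightsMode.E2M1,
    group_size=64,
    ratio=0.8,
)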

Additional Resources#