Microscaling (MX) Quantization
The Microscaling (MX) Quantization method enables quantizing LLMs at a high compression rate with minimal loss of accuracy, keeping model quality comparable to that of conventional FP32. It increases compute and storage efficiency by using low bit-width floating-point and integer-based data formats:
| Data format | Data type | Description |
|---|---|---|
| MXFP8 | FP8 (E5M2), FP8 (E4M3) | Floating point, 8-bit |
| MXFP6 | FP6 (E3M2), FP6 (E2M3) | Floating point, 6-bit |
| MXFP4 | FP4 (E2M1) | Floating point, 4-bit |
| MXINT8 | INT8 | Integer, 8-bit |
Currently, NNCF supports only the MXFP4 (E2M1) data format, and only for quantization on CPU. E2M1 may be considered for improving accuracy; however, quantized models will not be faster than ones compressed to `INT8_ASYM`.
Quantization to the E2M1 data type compresses weights to 4-bit without a zero point and with 8-bit E8M0 scales. To quantize a model to E2M1, set `mode=CompressWeightsMode.E2M1` in `nncf.compress_weights()`. It is recommended to use `group_size=32`. See the example below:
```python
from nncf import compress_weights, CompressWeightsMode

compressed_model = compress_weights(model, mode=CompressWeightsMode.E2M1, group_size=32, all_layers=True)
```
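For a fuller workflow, the sketch below assumes an OpenVINO IR model on disk; the file names `model.xml` and `model_e2m1.xml` are placeholders:

```python
import openvino as ov
from nncf import compress_weights, CompressWeightsMode

core = ov.Core()
model = core.read_model("model.xml")  # placeholder path to an OpenVINO IR model

# Compress weights to E2M1 with per-group E8M0 scales (group size of 32 is recommended)
compressed_model = compress_weights(
    model,
    mode=CompressWeightsMode.E2M1,
    group_size=32,
    all_layers=True,
)

ov.save_model(compressed_model, "model_e2m1.xml")  # placeholder output path
```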
Note: Different values for `group_size` and `ratio` are also supported.
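For intuition about the data format itself, the sketch below (illustrative only, not NNCF's internal implementation) shows how a single group of 32 weights could be mapped to the E2M1 grid using one shared power-of-two scale per group, which is what an E8M0 scale encodes:

```python
# Illustrative sketch: quantize one weight group to the E2M1 grid
# with a shared power-of-two (E8M0-style) scale and no zero point.
import numpy as np

# All non-negative magnitudes representable by E2M1 (1 sign, 2 exponent, 1 mantissa bits)
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_group_e2m1(group: np.ndarray):
    """Quantize a 1-D group (e.g. 32 weights) to E2M1 values with a power-of-two scale."""
    max_abs = np.max(np.abs(group))
    # Choose a power-of-two scale so the largest magnitude fits within the E2M1 range (<= 6.0)
    scale = 2.0 ** np.ceil(np.log2(max_abs / E2M1_GRID[-1])) if max_abs > 0 else 1.0
    scaled = group / scale
    # Round each magnitude to the nearest representable E2M1 value, keeping the sign
    idx = np.argmin(np.abs(np.abs(scaled)[:, None] - E2M1_GRID[None, :]), axis=1)
    quantized = np.sign(scaled) * E2M1_GRID[idx]
    return quantized, scale

weights = np.random.randn(32).astype(np.float32)
q, s = quantize_group_e2m1(weights)
dequantized = q * s  # approximate reconstruction of the original group
```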