Microscaling (MX) Quantization
The Microscaling (MX) Quantization method enables quantizing LLMs at a high compression rate with minimal loss of accuracy, keeping model quality comparable to that of conventional FP32. It increases compute and storage efficiency by using low bit-width floating-point and integer-based data formats:
| Data format | Data type | Description |
|---|---|---|
| MXFP8 | FP8 (E5M2), FP8 (E4M3) | Floating point, 8-bit |
| MXFP6 | FP6 (E3M2), FP6 (E2M3) | Floating point, 6-bit |
| MXFP4 | FP4 (E2M1) | Floating point, 4-bit |
| MXINT8 | INT8 | Integer, 8-bit |
Currently, NNCF supports only the MXFP4 (E2M1) data format, and only for quantization on CPU. E2M1 may be considered for improving accuracy; however, quantized models will not be faster than ones compressed to `INT8_ASYM`.
Quantization to the E2M1 data type compresses weights to 4-bit without a zero point and with 8-bit E8M0 scales. To quantize a model to E2M1, set `mode=CompressWeightsMode.E2M1` in `nncf.compress_weights()`. It is recommended to use `group_size=32`. See the example below:
```python
from nncf import compress_weights, CompressWeightsMode

compressed_model = compress_weights(model, mode=CompressWeightsMode.E2M1, group_size=32, all_layers=True)
```
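For a fuller workflow, the sketch below assumes an OpenVINO IR model on disk; the file names `model.xml` and `model_e2m1.xml` are placeholders:

```python
import openvino as ov
from nncf import compress_weights, CompressWeightsMode

core = ov.Core()
model = core.read_model("model.xml")  # placeholder path to an OpenVINO IR model

# Compress weights to E2M1 with per-group E8M0 scales (group size of 32 is recommended)
compressed_model = compress_weights(
    model,
    mode=CompressWeightsMode.E2M1,
    group_size=32,
    all_layers=True,
)

ov.save_model(compressed_model, "model_e2m1.xml")  # placeholder output path
```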
Note: Different values for `group_size` and `ratio` are also supported.
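For intuition about the data format itself, the sketch below (illustrative only, not NNCF's internal implementation) shows how a single group of 32 weights could be mapped to the E2M1 grid using one shared power-of-two scale per group, which is what an E8M0 scale encodes:

```python
# Illustrative sketch: quantize one weight group to the E2M1 grid
# with a shared power-of-two (E8M0-style) scale and no zero point.
import numpy as np

# All non-negative magnitudes representable by E2M1 (1 sign, 2 exponent, 1 mantissa bits)
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_group_e2m1(group: np.ndarray):
    """Quantize a 1-D group (e.g. 32 weights) to E2M1 values with a power-of-two scale."""
    max_abs = np.max(np.abs(group))
    # Choose a power-of-two scale so the largest magnitude fits within the E2M1 range (<= 6.0)
    scale = 2.0 ** np.ceil(np.log2(max_abs / E2M1_GRID[-1])) if max_abs > 0 else 1.0
    scaled = group / scale
    # Round each magnitude to the nearest representable E2M1 value, keeping the sign
    idx = np.argmin(np.abs(np.abs(scaled)[:, None] - E2M1_GRID[None, :]), axis=1)
    quantized = np.sign(scaled) * E2M1_GRID[idx]
    return quantized, scale

weights = np.random.randn(32).astype(np.float32)
q, s = quantize_group_e2m1(weights)
dequantized = q * s  # approximate reconstruction of the original group
```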