Compressing a Model to FP16

Optionally, all relevant floating-point weights can be compressed to the FP16 data type during model conversion. The result is a “compressed FP16 model”, which occupies about half of the original model's space in the file system. The compression may introduce a minor drop in accuracy, but it is negligible for most models.

To compress the model, use the compress_to_fp16=True option:

Python API:

from openvino.tools.mo import convert_model
ov_model = convert_model(INPUT_MODEL, compress_to_fp16=True)

Command line:

mo --input_model INPUT_MODEL --compress_to_fp16=True
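
As a quick check of the size reduction, the converted model can be serialized to OpenVINO IR and the resulting weight files compared. The following is a minimal sketch, assuming an ONNX input model at a placeholder path "model.onnx"; the output file names are also illustrative:

import os

from openvino.runtime import serialize
from openvino.tools.mo import convert_model

# Convert the same model twice: with and without FP16 compression.
# "model.onnx" is a placeholder path, not part of the original guide.
compressed = convert_model("model.onnx", compress_to_fp16=True)
uncompressed = convert_model("model.onnx", compress_to_fp16=False)

# Serialize both to IR (.xml topology + .bin weights).
serialize(compressed, "model_fp16.xml")
serialize(uncompressed, "model_fp32.xml")

# The weights are stored in the .bin files; the FP16 one should be
# roughly half the size of the FP32 one.
print("FP16 weights:", os.path.getsize("model_fp16.bin"), "bytes")
print("FP32 weights:", os.path.getsize("model_fp32.bin"), "bytes")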

For details on how plugins handle compressed FP16 models, see Working with devices.
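
As a hedged illustration of that interaction, the precision a plugin uses at runtime can typically be controlled with the INFERENCE_PRECISION_HINT property at compile time, independently of the storage precision of the weights. The file name, device choice, and string-key property usage below are assumptions for this sketch, not taken from this section:

from openvino.runtime import Core

core = Core()

# Read a previously converted, FP16-compressed IR model
# ("model_fp16.xml" is an illustrative file name).
model = core.read_model("model_fp16.xml")

# Request FP32 inference precision even though the weights are stored
# in FP16; the exact default behavior depends on the device plugin.
compiled = core.compile_model(model, "CPU", {"INFERENCE_PRECISION_HINT": "f32"})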

Note

FP16 compression is sometimes used as the initial step for INT8 quantization. Refer to the Post-training optimization guide for more information.

Note

When compressed to FP16, some large models (larger than a few gigabytes) may consume an excessive amount of RAM during the model loading phase of inference. If that is the case for your model, try converting it without compression: convert_model(INPUT_MODEL, compress_to_fp16=False) or convert_model(INPUT_MODEL).