Quantization-aware Training (QAT) with PyTorch
Below are the steps required to integrate QAT from NNCF into a training script written with PyTorch:
1. Apply Post-Training Quantization to the Model
Quantize the model using the Post-Training Quantization method.
import nncf

model = TorchModel()  # instance of torch.nn.Module
model = nncf.quantize(model, ...)
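The second argument of nncf.quantize() is a calibration dataset used to estimate quantization parameters such as activation ranges. A minimal sketch of providing it is shown below; val_loader and transform_fn are placeholder names for your own validation DataLoader and for a function that extracts the model input from a data item.

def transform_fn(data_item):
    # extract the model input from a (images, labels) batch
    images, _ = data_item
    return images

# wrap an existing DataLoader into an NNCF calibration dataset
calibration_dataset = nncf.Dataset(val_loader, transform_fn)
model = nncf.quantize(model, calibration_dataset)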
2. Fine-tune the Model
This step assumes fine-tuning the model the same way it is done for the baseline model. For QAT, the model needs to be trained for a few epochs with a small learning rate, for example, 1e-5. During fine-tuning, the quantized model performs all computations in floating-point precision while modeling quantization errors in both the forward and backward passes.
...  # fine-tuning preparations, e.g. dataset, loss, optimizer setup, etc.

# tune the quantized model for 5 epochs the same way as the baseline
for epoch in range(5):
    for i, data in enumerate(train_loader):
        ...  # training loop body
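For reference, below is a minimal sketch of such a fine-tuning loop, assuming an image classification setup; train_loader, the cross-entropy loss, and the SGD optimizer with the small learning rate mentioned above are examples, not requirements.

import torch

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-5)

model.train()
for epoch in range(5):
    for images, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(images)            # forward pass models quantization errors
        loss = criterion(outputs, labels)
        loss.backward()                    # gradients flow through the inserted fake-quantize operations
        optimizer.step()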
Note
The weights are stored in INT8 precision only after the model is converted to the OpenVINO Intermediate Representation, so you can expect a reduction in the model footprint only for that format.
These steps outline the basics of applying the QAT method from NNCF. However, in some cases, you may need to save and load model checkpoints during training. Since NNCF wraps the original model with its own object, it provides an API for these needs.
3. (Optional) Save Checkpoint
To save a model checkpoint, use the following API:
checkpoint = {
    'state_dict': model.state_dict(),
    'nncf_config': model.nncf.get_config(),
    ...  # the rest of the user-defined objects to save
}
torch.save(checkpoint, path_to_checkpoint)
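For example, such a checkpoint can be saved at the end of every epoch of the fine-tuning loop from step 2. In this sketch, checkpoint_dir and the epoch entry are user-defined additions, not part of the NNCF API.

import os
import torch

checkpoint_dir = "checkpoints"  # placeholder output directory
os.makedirs(checkpoint_dir, exist_ok=True)

for epoch in range(5):
    ...  # training loop body from step 2
    checkpoint = {
        'state_dict': model.state_dict(),
        'nncf_config': model.nncf.get_config(),
        'epoch': epoch,  # user-defined entry
    }
    torch.save(checkpoint, os.path.join(checkpoint_dir, f"qat_epoch_{epoch}.pth"))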
4. (Optional) Restore from Checkpoint
To restore the model from checkpoint, use the following API:
resuming_checkpoint = torch.load(path_to_checkpoint)
nncf_config = resuming_checkpoint['nncf_config']
# 'model' is the original FP32 model instance, 'example_input' is a sample model input
quantized_model = nncf.torch.load_from_config(model, nncf_config, example_input)
state_dict = resuming_checkpoint['state_dict']
quantized_model.load_state_dict(state_dict)
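After restoring, fine-tuning can continue on the returned quantized_model object. A short sketch, reusing the optimizer setup from step 2; start_epoch assumes the epoch counter was stored in the checkpoint as a user-defined entry.

optimizer = torch.optim.SGD(quantized_model.parameters(), lr=1e-5)
start_epoch = resuming_checkpoint.get('epoch', 0)  # assumes the epoch was saved in the checkpoint

quantized_model.train()
for epoch in range(start_epoch, 5):
    for i, data in enumerate(train_loader):
        ...  # training loop body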
Deploying the Quantized Model
If needed, the model can be converted to the OpenVINO Intermediate Representation (IR), compiled, and run with OpenVINO without any additional steps.
import openvino as ov
input_fp32 = ... # FP32 model input
# convert PyTorch model to OpenVINO model
ov_quantized_model = ov.convert_model(quantized_model, example_input=input_fp32)
# compile the model to transform quantized operations to int8
model_int8 = ov.compile_model(ov_quantized_model)
res = model_int8(input_fp32)
# save the model
ov.save_model(ov_quantized_model, "quantized_model.xml")
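The saved IR can later be loaded back and used for inference in a separate deployment script. A minimal sketch is shown below; the "CPU" device name and the input shape are examples only.

import numpy as np
import openvino as ov

core = ov.Core()
ov_model = core.read_model("quantized_model.xml")
compiled_model = core.compile_model(ov_model, "CPU")

# run inference on a sample input; replace the shape with your model's input shape
sample_input = np.random.rand(1, 3, 224, 224).astype(np.float32)
result = compiled_model(sample_input)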
For more details, see the corresponding documentation.