Quantization-aware Training (QAT) with TensorFlow#

Below are the steps required to integrate QAT from NNCF into a training script written with TensorFlow:

Note

Currently, NNCF for TensorFlow supports optimization of models created using the Keras Sequential API or Functional API.

1. Import NNCF API#

Add NNCF-related imports at the beginning of the training script:

import tensorflow as tf

from nncf import NNCFConfig
from nncf.tensorflow import create_compressed_model, register_default_init_args

2. Create NNCF Configuration#

Define an NNCF configuration, which consists of model-related parameters (the "input_info" section) and parameters of optimization methods (the "compression" section). For faster convergence, it is also recommended to register a dataset object specific to the DL framework. This dataset object is used at the model creation step to initialize quantization parameters.

nncf_config_dict = {
    "input_info": {"sample_size": [1, 3, 224, 224]}, # input shape required for model tracing
    "compression": {
        "algorithm": "quantization",  # 8-bit quantization with default settings
    },
}
nncf_config = NNCFConfig.from_dict(nncf_config_dict)
# train_dataset is an instance of tf.data.Dataset
nncf_config = register_default_init_args(nncf_config, train_dataset, batch_size=1)
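The role of the registered dataset can be illustrated with a framework-free sketch: a few sample batches are scanned to estimate a value range, and a quantization scale is derived from it. This is only an illustration under simplifying assumptions (symmetric signed 8-bit quantization, maximum-absolute-value statistics). The helper `init_scale` and the sample values are hypothetical and are not part of the NNCF API, which may collect different statistics.

```python
def init_scale(batches, num_bits=8):
    """Estimate a symmetric quantization scale from sample batches (illustrative only)."""
    # Largest absolute value seen across all sample batches.
    max_abs = max(abs(v) for batch in batches for v in batch)
    # Symmetric signed quantization uses integer levels in [-q_max, q_max].
    q_max = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    return max_abs / q_max


# Hypothetical activation values collected from two small batches.
sample_batches = [[0.1, -0.4, 0.25], [1.27, -0.9, 0.0]]
scale = init_scale(sample_batches)
print(scale)
```

A range estimated from representative data gives the quantizers a reasonable starting point, which is why registering the dataset tends to speed up convergence.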

3. Apply Optimization Methods#

Wrap the original model object with the create_compressed_model() API using the configuration defined in the previous step. This method returns a compression controller and a wrapped model that can be used the same way as the original model. Optimization methods are applied at this step: the model undergoes the corresponding transformations and gains the additional operations required for optimization. In the case of QAT, the compression controller object is used for model export and, optionally, in distributed training as demonstrated below.

model = KerasModel()  # an instance of tf.keras.Model
compression_ctrl, model = create_compressed_model(model, nncf_config)
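For intuition about the additional operations inserted at this step, the following is a simplified, framework-free sketch of a fake-quantization (quantize-dequantize) operation. It assumes symmetric signed 8-bit quantization with a fixed scale. `fake_quantize` is a hypothetical helper written for this illustration and is not an NNCF function.

```python
def fake_quantize(x, scale, num_bits=8):
    """Snap x to the nearest representable level and map it back to float."""
    q_max = 2 ** (num_bits - 1) - 1      # 127 for 8 bits
    q = round(x / scale)                 # nearest integer level
    q = max(-q_max, min(q_max, q))       # clamp to the representable range
    return q * scale                     # back to floating point


scale = 0.5  # hypothetical scale, chosen for readability
values = [0.2, 0.3, 1.0, 100.0]
quantized = [fake_quantize(v, scale) for v in values]
print(quantized)  # [0.0, 0.5, 1.0, 63.5]
```

The values stay in floating point, but they are restricted to the levels an 8-bit integer could represent, so training sees the rounding and clamping error it will face after conversion.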

4. Fine-tune the Model#

Fine-tune the model the same way as the baseline model. For QAT, the model needs to be trained for a few epochs with a small learning rate, for example, 10e-5. In principle, you can skip this step, which means that only post-training optimization is applied to the model.

... # fine-tuning preparations, e.g. dataset, loss, optimization setup, etc.

from nncf.tensorflow.helpers.callback_creation import create_compression_callbacks

# create compression callbacks to control optimization parameters and dump compression statistics
compression_callbacks = create_compression_callbacks(compression_ctrl, log_dir="./compression_log")
# tune quantized model for 5 epochs the same way as the baseline
model.fit(train_dataset, epochs=5, callbacks=compression_callbacks)
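Why fine-tuning helps can be sketched without any framework. The toy example below trains a single weight whose forward pass goes through a fake-quantization step, while the gradient is computed as if that step were the identity (a straight-through estimator, a common technique in quantization-aware training). It is a minimal sketch under stated assumptions, not a description of NNCF internals: the model, data, scale, and learning rate are all hypothetical.

```python
def fake_quantize(w, scale=0.1, q_max=127):
    """Quantize-dequantize a weight with a fixed, hypothetical scale."""
    q = max(-q_max, min(q_max, round(w / scale)))
    return q * scale


def loss_and_grad(w, data):
    """Mean squared error of y = fq(w) * x and its straight-through gradient."""
    w_q = fake_quantize(w)
    n = len(data)
    loss = sum((w_q * x - y) ** 2 for x, y in data) / n
    # Straight-through estimator: treat d(w_q)/d(w) as 1.
    grad = sum(2 * (w_q * x - y) * x for x, y in data) / n
    return loss, grad


data = [(1.0, 0.7), (2.0, 1.4), (3.0, 2.1)]  # toy targets following y = 0.7 * x
w = 0.2     # starting weight, far from the target
lr = 0.01   # learning rate for this toy problem only

initial_loss, _ = loss_and_grad(w, data)
for _ in range(200):
    _, grad = loss_and_grad(w, data)
    w -= lr * grad  # update the latent full-precision weight
final_loss, _ = loss_and_grad(w, data)

print(initial_loss, final_loss, fake_quantize(w))
```

The full-precision weight keeps accumulating small updates until its quantized value moves to the next level, so the loss measured on the quantized weight decreases even though rounding itself has no useful gradient.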

5. Multi-GPU Distributed Training#

In the case of distributed multi-GPU training (not DataParallel), call compression_ctrl.distributed() before fine-tuning. This tells the optimization methods to adjust their behavior for distributed mode.

compression_ctrl.distributed() # call it before the training

Note

The precision of weights transitions to INT8 only after converting the model to OpenVINO Intermediate Representation. You can expect a reduction in model footprint only for that format.
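As a rough back-of-the-envelope estimate of the footprint reduction, the arithmetic below compares the storage needed for the weights alone at 32 and 8 bits per value. The parameter count is hypothetical, and the result is an upper bound: it ignores quantization parameters, any layers kept in floating point, and file-format overhead.

```python
def weights_size_mb(num_params, bits_per_weight):
    """Approximate storage for the weights alone, in megabytes (10**6 bytes)."""
    return num_params * bits_per_weight / 8 / 1e6


num_params = 25_000_000  # hypothetical parameter count

fp32_mb = weights_size_mb(num_params, 32)
int8_mb = weights_size_mb(num_params, 8)
ratio = fp32_mb / int8_mb

print(fp32_mb, int8_mb, ratio)  # 100.0 25.0 4.0
```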

These steps outline the basics of applying the QAT method from NNCF. However, in some cases, you need to save and load model checkpoints during training. Since NNCF wraps the original model with its own object, it provides an API for these needs.

6. (Optional) Save Checkpoint#

To save a model checkpoint, use the following API:

from nncf.tensorflow.utils.state import TFCompressionState
from nncf.tensorflow.callbacks.checkpoint_callback import CheckpointManagerCallback

checkpoint = tf.train.Checkpoint(model=model,
                                 compression_state=TFCompressionState(compression_ctrl),
                                 ... # the rest of the user-defined objects to save
                                 )
callbacks = []
callbacks.append(CheckpointManagerCallback(checkpoint, path_to_checkpoint))
...
model.fit(..., callbacks=callbacks)

7. (Optional) Restore from Checkpoint#

To restore the model from a checkpoint, use the following API:

from nncf.tensorflow.utils.state import TFCompressionStateLoader

checkpoint = tf.train.Checkpoint(compression_state=TFCompressionStateLoader())
checkpoint.restore(path_to_checkpoint)
compression_state = checkpoint.compression_state.state

compression_ctrl, model = create_compressed_model(model, nncf_config, compression_state)
checkpoint = tf.train.Checkpoint(model=model,
                                 ...)
checkpoint.restore(path_to_checkpoint)

For more details on saving and loading checkpoints in NNCF, see the corresponding NNCF documentation.

Deploying quantized model#

The model can be converted into the OpenVINO Intermediate Representation (IR) if needed, compiled and run with OpenVINO. No extra steps or options are required.

import openvino as ov

# convert the TensorFlow model to an OpenVINO model
# (quantized_model stands for the fine-tuned model from the steps above)
ov_quantized_model = ov.convert_model(quantized_model)

# compile the model to transform quantized operations to int8
model_int8 = ov.compile_model(ov_quantized_model)

input_fp32 = ... # FP32 model input
res = model_int8(input_fp32)

# save the model
ov.save_model(ov_quantized_model, "quantized_model.xml")

For more details, see the corresponding documentation.

Examples#

Quantizing TensorFlow model with NNCF