Post-Training Quantization with TensorFlow Classification Model

This tutorial is also available as a Jupyter notebook that can be cloned directly from GitHub. See the installation guide for instructions to run this tutorial locally on Windows, Linux or macOS.


This example demonstrates how to quantize the OpenVINO model that was created in 301-tensorflow-training-openvino.ipynb, in order to improve inference speed. Quantization is performed with the Post-Training Optimization Tool (POT). A custom dataloader and metric will be defined, and accuracy and performance will be computed for the original IR model and the quantized model.
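
To give an intuition for what quantization does to a model's numbers, here is a minimal sketch of symmetric 8-bit quantization of a handful of values. It is a generic illustration and not the algorithm POT applies; the function names and the sample weights are made up for this example.

```python
def quantize_symmetric_int8(values):
    """Map floats to int8 codes using one shared scale (symmetric scheme)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the symmetric int8 range
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from the int8 codes."""
    return [c * scale for c in codes]


weights = [0.50, -1.27, 0.03, 0.90]  # made-up example values
codes, scale = quantize_symmetric_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)
print(max_error)
```

Each value is stored in 8 bits instead of 32, at the cost of a rounding error of at most half the scale per value. This rounding is why the accuracy of the quantized model is checked later in the tutorial.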

Preparation

The notebook requires that the training notebook has been run and that the Intermediate Representation (IR) models are created. If the IR models do not exist, running the next cell will run the training notebook. This will take a while.

from pathlib import Path

import tensorflow as tf

model_xml = Path("model/flower/flower_ir.xml")
dataset_url = (
    "https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz"
)
data_dir = Path(tf.keras.utils.get_file("flower_photos", origin=dataset_url, untar=True))

if not model_xml.exists():
    print("Executing training notebook. This will take a while...")
    %run 301-tensorflow-training-openvino.ipynb

Imports

The Post Training Optimization API is implemented in the compression library.

import copy
import os
import sys
import urllib.request

import cv2
import matplotlib.pyplot as plt
import numpy as np
from addict import Dict
from compression.api import DataLoader, Metric
from compression.engines.ie_engine import IEEngine
from compression.graph import load_model, save_model
from compression.graph.model_utils import compress_model_weights
from compression.pipeline.initializer import create_pipeline
from openvino.inference_engine import IECore
from PIL import Image

sys.path.append("../utils")
from notebook_utils import benchmark_model
15:56:29 accuracy_checker WARNING: /opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/defusedxml/__init__.py:30: DeprecationWarning: defusedxml.cElementTree is deprecated, import from defusedxml.ElementTree instead.
  from . import cElementTree

15:56:30 accuracy_checker WARNING: /opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/compression/algorithms/quantization/optimization/algorithm.py:29: UserWarning: Nevergrad package could not be imported. If you are planning to useany hyperparameter optimization algo, consider installing itusing pip. This implies advanced usage of the tool.Note that nevergrad is compatible only with Python 3.6+
  warnings.warn(

Settings

In the next cell, the settings for running quantization are defined. The cell uses the DefaultQuantization algorithm with the performance preset. DefaultQuantization enables reasonably fast quantization, with a possible drop in accuracy, and the performance preset can result in faster inference on the quantized model. Alternatively, the AccuracyAwareQuantization algorithm quantizes the model while keeping the accuracy drop within a defined maximum, which may not achieve the greatest performance boost but limits the loss in accuracy.

See the Post-Training Optimization Best Practices page for more information about the configurable parameters and best practices for post-training quantization.

The POT methods expect configuration dictionaries as arguments. They are defined in the cell below.

model_config = Dict(
    {
        "model_name": "flower",
        "model": "model/flower/flower_ir.xml",
        "weights": "model/flower/flower_ir.bin",
    }
)

engine_config = Dict({"device": "CPU", "stat_requests_number": 2, "eval_requests_number": 2})

algorithms = [
    {
        "name": "DefaultQuantization",
        "params": {
            "target_device": "CPU",
            "preset": "performance",
            "stat_subset_size": 1000,
        },
    }
]
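
For comparison, a configuration for AccuracyAwareQuantization could look roughly like the sketch below. The variable name and the maximal_drop value of 0.01 are assumptions chosen for illustration, and the parameter names should be checked against the POT documentation for your version. This algorithm needs the metric defined later in this tutorial.

```python
# Illustrative alternative to the DefaultQuantization settings used in this tutorial.
# The values are assumptions for this sketch, not tuned recommendations.
accuracy_aware_algorithms = [
    {
        "name": "AccuracyAwareQuantization",
        "params": {
            "target_device": "CPU",
            "preset": "performance",
            "stat_subset_size": 1000,
            # Largest accuracy drop to accept, in absolute metric units (assumed value)
            "maximal_drop": 0.01,
        },
    }
]

print(accuracy_aware_algorithms[0]["name"])
```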

Create DataLoader Class

OpenVINO’s compression library contains a DataLoader class. The DataLoader defines how to load data and annotations. For the TensorFlow flowers dataset, images are stored in a directory per category. The DataLoader loads images from a given data_source directory and assigns a label based on the position of the directory in class_names (where class_names is a list of directory names in alphabetical order).

class ClassificationDataLoader(DataLoader):
    """
    DataLoader for image data that is stored in a directory per category. For example, for
    categories _rose_ and _daisy_, rose images are expected in data_source/rose, daisy images
    in data_source/daisy.
    """

    def __init__(self, data_source):
        """
        :param data_source: path to data directory
        """
        self.data_source = Path(data_source)
        self.dataset = [p for p in self.data_source.glob("**/*") if p.suffix in (".png", ".jpg")]
        self.class_names = sorted([item.name for item in self.data_source.iterdir() if item.is_dir()])

    def __len__(self):
        """
        Returns the number of elements in the dataset
        """
        return len(self.dataset)

    def __getitem__(self, index):
        """
        Get item from self.dataset at the specified index.
        Returns (annotation, image), where annotation is a tuple (index, class_index)
        and image a preprocessed image in network shape
        """
        if index >= len(self):
            raise IndexError
        filepath = self.dataset[index]
        annotation = (index, self.class_names.index(filepath.parent.name))
        image = self._read_image(filepath)
        return annotation, image

    def _read_image(self, filepath):
        """
        Read the image at filepath to memory, convert from BGR to RGB, resize and convert
        to network shape (CHW)

        :param filepath: path of the image file to read
        :return: ndarray representation of the image in CHW layout
        """
        # cv2.imread returns BGR; reversing the channel axis gives RGB
        image = cv2.imread(str(filepath))[:, :, (2, 1, 0)]
        image = cv2.resize(image, (180, 180)).astype(np.float32)
        return image.transpose(2, 0, 1)
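
The labeling rule used by the DataLoader (class index = position of the parent directory in the alphabetically sorted list of directories) can be tried out without any images. The sketch below builds a throwaway directory tree with empty files; the category and file names are made up for this example.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Create a tiny fake dataset: one directory per category (empty placeholder files)
    for category, filename in [("roses", "a.jpg"), ("daisy", "b.jpg"), ("daisy", "notes.txt")]:
        (root / category).mkdir(exist_ok=True)
        (root / category / filename).touch()

    # Same logic as in the DataLoader: sorted directory names define the class indices
    class_names = sorted(item.name for item in root.iterdir() if item.is_dir())
    images = sorted(p for p in root.glob("**/*") if p.suffix in (".png", ".jpg"))
    labels = {p.name: class_names.index(p.parent.name) for p in images}

print(class_names)
print(labels)
```

The text file is skipped because of the suffix filter, and the image in the daisy directory gets index 0 because "daisy" sorts before "roses".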

Create Accuracy Metric Class

The accuracy metric is defined as the number of correct predictions divided by the total number of predictions. It is used to validate the accuracy of the quantized model.

The Accuracy class in this tutorial implements the Metric interface of the compression library.

class Accuracy(Metric):
    def __init__(self):
        super().__init__()
        self._name = "accuracy"
        self._matches = []

    @property
    def value(self):
        """Returns accuracy metric value for the last model output."""
        return {self._name: self._matches[-1]}

    @property
    def avg_value(self):
        """
        Returns accuracy metric value for all model outputs. Results per image are stored in
        self._matches, where True means a correct prediction and False a wrong prediction.
        Accuracy is computed as the number of correct predictions divided by the total
        number of predictions.
        """
        num_correct = np.count_nonzero(self._matches)
        return {self._name: num_correct / len(self._matches)}

    def update(self, output, target):
        """Updates prediction matches.

        :param output: model output
        :param target: annotations
        """
        predict = np.argmax(output[0], axis=1)
        match = predict == target
        self._matches.append(match)

    def reset(self):
        """
        Resets the Accuracy metric. This is a required method that should initialize all
        attributes to their initial value.
        """
        self._matches = []

    def get_attributes(self):
        """
        Returns a dictionary of metric attributes {metric_name: {attribute_name: value}}.
        Required attributes: 'direction': 'higher-better' or 'higher-worse'
                             'type': metric type
        """
        return {self._name: {"direction": "higher-better", "type": "accuracy"}}
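
As a quick check of the accuracy definition, the following sketch computes it by hand for four made-up model outputs. It uses plain Python lists instead of NumPy arrays, and the helper name argmax is introduced only for this example.

```python
def argmax(scores):
    """Return the index of the largest score."""
    return max(range(len(scores)), key=scores.__getitem__)


# Made-up scores for four images and three classes
outputs = [
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
    [0.3, 0.3, 0.4],
    [0.2, 0.5, 0.3],
]
targets = [1, 0, 1, 1]

# True where the predicted class equals the annotated class
matches = [argmax(output) == target for output, target in zip(outputs, targets)]
accuracy = sum(matches) / len(matches)

print(matches)
print(accuracy)
```

Three of the four predictions match their targets, so the accuracy is 0.75.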

POT Optimization

After creating the DataLoader and Metric classes, and defining the configuration settings for POT, we can start the quantization process.

# Step 1: Load the model
model = load_model(model_config=model_config)
original_model = copy.deepcopy(model)

# Step 2: Initialize the data loader
data_loader = ClassificationDataLoader(data_source=data_dir)

# Step 3 (Optional. Required for AccuracyAwareQuantization): Initialize the metric
#        Compute metric results on original model
metric = Accuracy()

# Step 4: Initialize the engine for metric calculation and statistics collection
engine = IEEngine(config=engine_config, data_loader=data_loader, metric=metric)

# Step 5: Create a pipeline of compression algorithms
pipeline = create_pipeline(algo_config=algorithms, engine=engine)

# Step 6: Execute the pipeline
compressed_model = pipeline.run(model=model)

# Step 7 (Optional): Compress model weights to quantized precision
#                    in order to reduce the size of the final .bin file
compress_model_weights(model=compressed_model)

# Step 8: Save the compressed model and get the path to the model
compressed_model_paths = save_model(
    model=compressed_model, save_path=os.path.join(os.path.curdir, "model/optimized")
)
compressed_model_xml = Path(compressed_model_paths[0]["model"])
print(f"The quantized model is stored in {compressed_model_xml}")
The quantized model is stored in model/optimized/flower_ir.xml
# Step 9 (Optional): Evaluate the original and compressed model. Print the results
original_metric_results = pipeline.evaluate(original_model)
if original_metric_results:
    print(f"Accuracy of the original model:  {next(iter(original_metric_results.values())):.5f}")

quantized_metric_results = pipeline.evaluate(compressed_model)
if quantized_metric_results:
    print(f"Accuracy of the quantized model: {next(iter(quantized_metric_results.values())):.5f}")
Accuracy of the original model:  0.79155
Accuracy of the quantized model: 0.79019
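
To put the two accuracy values printed above into perspective, the drop can be computed directly. The variable names below are chosen for this sketch.

```python
original_accuracy = 0.79155
quantized_accuracy = 0.79019

# Absolute drop in accuracy and the drop relative to the original model
absolute_drop = original_accuracy - quantized_accuracy
relative_drop = absolute_drop / original_accuracy

print(f"Absolute drop: {absolute_drop:.5f}")
print(f"Relative drop: {relative_drop:.2%}")
```

For this run, the quantized model loses about 0.00136 in accuracy, which is less than 0.2 percent of the original value.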

Run Inference on Quantized Model

Copy the preprocess function from the training notebook and run inference on the quantized model with Inference Engine. See the OpenVINO API tutorial for more information about running inference with Inference Engine Python API.

def pre_process_image(imagePath, img_height=180):
    # Model input format
    n, c, h, w = [1, 3, img_height, img_height]
    image = Image.open(imagePath)
    # PIL expects the target size as (width, height)
    image = image.resize((w, h), resample=Image.BILINEAR)

    # Convert to array and change data layout from HWC to CHW
    image = np.array(image)
    image = image.transpose((2, 0, 1))
    input_image = image.reshape((n, c, h, w))

    return input_image
# Load the optimized model and get the names of the input and output layer
ie = IECore()
net_pot = ie.read_network(model="model/optimized/flower_ir.xml")
exec_net_pot = ie.load_network(net_pot, "CPU")
input_layer = next(iter(exec_net_pot.input_info))
output_layer = next(iter(exec_net_pot.outputs))

# Get the class names: a list of directory names in alphabetical order
class_names = sorted([item.name for item in Path(data_dir).iterdir() if item.is_dir()])

# Run Inference on an input image...
inp_img_url = (
    "https://upload.wikimedia.org/wikipedia/commons/4/48/A_Close_Up_Photo_of_a_Dandelion.jpg"
)
inp_file_name = "output/A_Close_Up_Photo_of_a_Dandelion.jpg"

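# Download the image from the storage; create the output directory if it is missing
Path(inp_file_name).parent.mkdir(parents=True, exist_ok=True)
urllib.request.urlretrieve(inp_img_url, inp_file_name)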

# Pre-process the image and get it ready for inference.
input_image = pre_process_image(imagePath=inp_file_name)

res = exec_net_pot.infer(inputs={input_layer: input_image})
res = res[output_layer]

score = tf.nn.softmax(res[0])

# Show the results
image = Image.open(inp_file_name)
plt.imshow(image)
print(
    "This image most likely belongs to {} with a {:.2f} percent confidence.".format(
        class_names[np.argmax(score)], 100 * np.max(score)
    )
)
This image most likely belongs to dandelion with a 98.98 percent confidence.
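
The confidence value above comes from applying softmax to the raw model output. The tutorial uses tf.nn.softmax for this; the sketch below shows the same computation in plain Python, with made-up logits, to make clear how the percentages arise.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing; the result is unchanged
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [1.0, 3.0, 0.5]  # made-up raw scores for three classes
scores = softmax(logits)
best = max(range(len(scores)), key=scores.__getitem__)

print(best, f"{100 * scores[best]:.2f} percent")
```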

Compare Inference Speed

Measure inference speed with the OpenVINO Benchmark App.

Benchmark App is a command line tool that measures raw inference performance for a specified OpenVINO IR model. Run benchmark_app --help to see a list of available parameters. By default, Benchmark App tests the performance of the model specified with the -m parameter with asynchronous inference on CPU, for one minute. Use the -d parameter to test performance on a different device, for example an Intel integrated GPU (iGPU), and the -t parameter to set the number of seconds to run inference. See the documentation for more information.

In this tutorial, we use a wrapper function from Notebook Utils. It prints the benchmark_app command with the chosen parameters.

In the next cells, inference speed will be measured for the original and quantized model on CPU. If an iGPU is available, inference speed will be measured for CPU+GPU as well. The number of seconds is set to 15.

NOTE: For the most accurate performance estimation, we recommend running benchmark_app in a terminal/command prompt after closing other applications.
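
To run the benchmark in a terminal, you need the command as a string. The sketch below assembles one from the same parameters the wrapper takes. The function build_benchmark_command is hypothetical and is not part of Notebook Utils; the flags mirror those in the command that the wrapper prints in this tutorial.

```python
import shlex


def build_benchmark_command(
    model_path, device="CPU", seconds=60, api="async", batch=1, cache_dir="model_cache"
):
    """Hypothetical helper: assemble a benchmark_app command line as a string."""
    args = [
        "benchmark_app",
        "-m", str(model_path),
        "-d", device,
        "-t", str(seconds),
        "-api", api,
        "-b", str(batch),
        "-cdir", cache_dir,
    ]
    # Quote each argument so that paths with spaces survive copy and paste
    return " ".join(shlex.quote(arg) for arg in args)


command = build_benchmark_command("model/flower/flower_ir.xml", device="CPU", seconds=15)
print(command)
```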

# print the available devices on this system
ie = IECore()
print("Device information:")
print(ie.get_metric("CPU", "FULL_DEVICE_NAME"))
if "GPU" in ie.available_devices:
    print(ie.get_metric("GPU", "FULL_DEVICE_NAME"))
Device information:
Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
# Original model - CPU
benchmark_model(model_path=model_xml, device="CPU", seconds=15)

Benchmark flower_ir.xml with CPU for 15 seconds with async inference

Benchmark command: benchmark_app -m model/flower/flower_ir.xml -d CPU -t 15 -api async -b 1 -cdir model_cache

Count:      6197 iterations
Duration:   15002.35 ms
Latency:    2.26 ms
Throughput: 413.07 FPS

Device: Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
# Quantized model - CPU
benchmark_model(model_path=compressed_model_xml, device="CPU", seconds=15)

Benchmark flower_ir.xml with CPU for 15 seconds with async inference

Benchmark command: benchmark_app -m model/optimized/flower_ir.xml -d CPU -t 15 -api async -b 1 -cdir model_cache

Count:      6548 iterations
Duration:   15003.34 ms
Latency:    2.14 ms
Throughput: 436.44 FPS

Device: Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
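
The two CPU runs above can be compared with a few lines of arithmetic. The numbers are copied from the benchmark output shown in this tutorial and will differ on other hardware; the variable names are chosen for this sketch.

```python
original_fps, quantized_fps = 413.07, 436.44
original_latency_ms, quantized_latency_ms = 2.26, 2.14

# Ratio of throughputs and relative reduction in latency
speedup = quantized_fps / original_fps
latency_reduction = (original_latency_ms - quantized_latency_ms) / original_latency_ms

print(f"Throughput speedup: {speedup:.3f}x")
print(f"Latency reduction:  {latency_reduction:.1%}")
```

On this system, the quantized model reaches roughly 1.06 times the throughput of the original model, with a latency reduction of about 5 percent.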

Benchmark on MULTI:CPU,GPU

With a recent Intel CPU, the best performance can often be achieved by doing inference on both the CPU and the iGPU, with OpenVINO’s Multi Device Plugin. It takes a bit longer to load a model on GPU than on CPU, so this benchmark will take a bit longer to complete than the CPU benchmark when run for the first time. Benchmark App supports caching through the -cdir parameter. In the cells below, the model will be cached to the model_cache directory.

# Original model - MULTI:CPU,GPU
if "GPU" in ie.available_devices:
    benchmark_model(model_path=model_xml, device="MULTI:CPU,GPU", seconds=15)
else:
    print("A supported integrated GPU is not available on this system.")
A supported integrated GPU is not available on this system.
# Quantized model - MULTI:CPU,GPU
if "GPU" in ie.available_devices:
    benchmark_model(model_path=compressed_model_xml, device="MULTI:CPU,GPU", seconds=15)
else:
    print("A supported integrated GPU is not available on this system.")
A supported integrated GPU is not available on this system.
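
The device check in the cells above can be factored into a small helper that returns every device string worth benchmarking. This is a sketch only; the function name is hypothetical, and the list passed in stands in for ie.available_devices.

```python
def choose_benchmark_devices(available_devices):
    """Hypothetical helper: return the device strings to benchmark on this system."""
    devices = ["CPU"]
    # MULTI:CPU,GPU only makes sense when a GPU device is reported
    if "GPU" in available_devices:
        devices.append("MULTI:CPU,GPU")
    return devices


print(choose_benchmark_devices(["CPU"]))
print(choose_benchmark_devices(["CPU", "GPU"]))
```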

Original IR model - CPU

benchmark_output = %sx benchmark_app -m $model_xml -t 15
# Remove logging info from benchmark_app output and show only the results
benchmark_result = [
    line
    for line in benchmark_output
    if not (line.startswith("[") or line.startswith("  ") or line == "")
]
print("\n".join(benchmark_result))
Count:      6189 iterations
Duration:   15004.08 ms
Latency:    2.26 ms
Throughput: 412.49 FPS
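
The filtering step in the cell above drops lines that start with a bracket, lines that start with two spaces, and empty lines. The sketch below applies the same rule to a hand-written list. The first four entries are made-up stand-ins for log lines and are not actual benchmark_app output.

```python
sample_output = [
    "[Step 1] made-up log line",
    "[ INFO ] another made-up log line",
    "  made-up indented detail",
    "",
    "Count:      6189 iterations",
    "Duration:   15004.08 ms",
    "Latency:    2.26 ms",
    "Throughput: 412.49 FPS",
]

# str.startswith accepts a tuple, so both prefixes are tested in one call
results = [line for line in sample_output if line and not line.startswith(("[", "  "))]
print("\n".join(results))
```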

Quantized IR model - CPU

benchmark_output = %sx benchmark_app -m $compressed_model_xml -t 15
# Remove logging info from benchmark_app output and show only the results
benchmark_result = [
    line
    for line in benchmark_output
    if not (line.startswith("[") or line.startswith("  ") or line == "")
]
print("\n".join(benchmark_result))
Count:      6540 iterations
Duration:   15002.16 ms
Latency:    2.15 ms
Throughput: 435.94 FPS

Original IR model - MULTI:CPU,GPU

With a recent Intel CPU, the best performance can often be achieved by doing inference on both the CPU and the iGPU, with OpenVINO’s Multi Device Plugin. It takes a bit longer to load a model on GPU than on CPU, so this benchmark will take a bit longer to complete than the CPU benchmark.

ie = IECore()
if "GPU" in ie.available_devices:
    benchmark_output = %sx benchmark_app -m $model_xml -d MULTI:CPU,GPU -t 15
    # Remove logging info from benchmark_app output and show only the results
    benchmark_result = [
        line
        for line in benchmark_output
        if not (line.startswith("[") or line.startswith("  ") or line == "")
    ]
    print("\n".join(benchmark_result))
else:
    print("An integrated GPU is not available on this system.")
An integrated GPU is not available on this system.

Quantized IR model - MULTI:CPU,GPU

ie = IECore()
if "GPU" in ie.available_devices:
    benchmark_output = %sx benchmark_app -m $compressed_model_xml -d MULTI:CPU,GPU -t 15
    # Remove logging info from benchmark_app output and show only the results
    benchmark_result = [
        line
        for line in benchmark_output
        if not (line.startswith("[") or line.startswith("  ") or line == "")
    ]
    print("\n".join(benchmark_result))
else:
    print("An integrated GPU is not available on this system.")
An integrated GPU is not available on this system.