This notebook requires that the training notebook has been run and that the Intermediate Representation (IR) models have been created. If the IR models do not exist, the next cell runs the training notebook first, which takes a while.

from pathlib import Path

import tensorflow as tf

model_xml = Path("model/flower/flower_ir.xml")
dataset_url = (
data_dir = Path(tf.keras.utils.get_file("flower_photos", origin=dataset_url, untar=True))

if not model_xml.exists():
    print("Executing training notebook. This will take a while...")
    %run 301-tensorflow-training-openvino.ipynb
Found 3670 files belonging to 5 classes.
Using 734 files for validation.
['daisy', 'dandelion', 'roses', 'sunflowers', 'tulips']
(32, 180, 180, 3)
0.02886732 1.0
Model: "sequential_2"
 Layer (type)                Output Shape              Param #
 sequential_1 (Sequential)   (None, 180, 180, 3)       0

 rescaling_2 (Rescaling)     (None, 180, 180, 3)       0

 conv2d_3 (Conv2D)           (None, 180, 180, 16)      448

 max_pooling2d_3 (MaxPooling  (None, 90, 90, 16)       0

 conv2d_4 (Conv2D)           (None, 90, 90, 32)        4640

 max_pooling2d_4 (MaxPooling  (None, 45, 45, 32)       0

 conv2d_5 (Conv2D)           (None, 45, 45, 64)        18496

 max_pooling2d_5 (MaxPooling  (None, 22, 22, 64)       0

 dropout (Dropout)           (None, 22, 22, 64)        0

 flatten_1 (Flatten)         (None, 30976)             0

 dense_2 (Dense)             (None, 128)               3965056

 outputs (Dense)             (None, 5)                 645

Total params: 3,989,285
Trainable params: 3,989,285
Non-trainable params: 0
output/A_Close_Up_Photo_of_a_Dandelion.jpg:   0%|          | 0.00/21.7k [00:00<?, ?B/s]
(1, 180, 180, 3)
This image most likely belongs to dandelion with a 98.99 percent confidence.


The Post Training Quantization API is implemented in the nncf library.

import sys

import matplotlib.pyplot as plt
import numpy as np
import nncf
from openvino.runtime import Core
from openvino.runtime import serialize
from PIL import Image
from sklearn.metrics import accuracy_score

from notebook_utils import download_file
INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, tensorflow, onnx, openvino

Post-training Quantization with NNCF

NNCF provides a suite of advanced algorithms for Neural Networks inference optimization in OpenVINO with minimal accuracy drop.

Create a quantized model from the pre-trained FP32 model and the calibration dataset. The optimization process contains the following steps:

  1. Create a Dataset for quantization.

  2. Run nncf.quantize to obtain an optimized model.

The validation dataset is already defined in the training notebook.

img_height = 180
img_width = 180
val_dataset = tf.keras.preprocessing.image_dataset_from_directory(
  image_size=(img_height, img_width),

for a, b in val_dataset:
    print(type(a), type(b))
    break
Found 3670 files belonging to 5 classes.
Using 734 files for validation.
<class 'tensorflow.python.framework.ops.EagerTensor'> <class 'tensorflow.python.framework.ops.EagerTensor'>
The validation dataset can be reused in the quantization process. However, it yields tuples of (images, labels), whereas the calibration dataset should yield only images. A transformation function converts items of the validation dataset into calibration inputs.

def transform_fn(data_item):
    """
    The transformation function transforms a data item into model input data.
    This function should be passed when the data item cannot be used as model's input.
    """
    images, _ = data_item
    return images.numpy()

calibration_dataset = nncf.Dataset(val_dataset, transform_fn)
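
To make the role of the transformation function concrete, here is a minimal, framework-free sketch. It assumes a toy stand-in for the validation dataset (plain Python lists instead of tensors), and the names drop_labels and iter_calibration_items are hypothetical and not part of NNCF. It only illustrates how each (images, labels) item is reduced to images before calibration.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

# Hypothetical stand-in for one batch of the validation dataset: (images, labels).
Batch = Tuple[List[List[float]], List[int]]


def drop_labels(data_item: Batch) -> List[List[float]]:
    """Keep only the images of an (images, labels) item."""
    images, _ = data_item
    return images


def iter_calibration_items(
    source: Iterable[Batch], transform: Callable[[Batch], List[List[float]]]
) -> Iterator[List[List[float]]]:
    """Apply the transform lazily to every item of the source dataset."""
    for item in source:
        yield transform(item)


# Two toy batches: two "images" with labels 0 and 1, then one "image" with label 2.
toy_batches: List[Batch] = [
    ([[0.1, 0.2], [0.3, 0.4]], [0, 1]),
    ([[0.5, 0.6]], [2]),
]

calibration_items = list(iter_calibration_items(toy_batches, drop_labels))
print(calibration_items)
```

The labels never reach the calibration step; only the inputs that the model would actually receive are passed on.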

Load the Intermediate Representation (IR) model.

core = Core()
ir_model = core.read_model(model_xml)

Use the basic quantization flow. For the more advanced flow, which applies 8-bit quantization to the model with accuracy control, see Quantizing with accuracy control.

quantized_model = nncf.quantize(
Statistics collection:  73%|███████▎  | 734/1000 [00:04<00:01, 168.06it/s]
Applying Fast Bias correction: 100%|██████████| 5/5 [00:01<00:00,  3.98it/s]
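
As an intuition for what 8-bit quantization does to individual values, the following sketch maps a handful of floats to int8 codes with a single scale and zero point, then maps them back. This is a simplified textbook scheme written for illustration, not the algorithm NNCF applies; the helper names quantize_int8 and dequantize and the example values are assumptions.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float, int]:
    """Map floats to int8 codes using one scale and one zero point."""
    # Extend the range so that 0.0 is always representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 if every value is zero
    zero_point = round(-128 - lo / scale)
    quantized = [
        max(-128, min(127, round(v / scale) + zero_point)) for v in values
    ]
    return quantized, scale, zero_point


def dequantize(quantized: List[int], scale: float, zero_point: int) -> List[float]:
    """Recover approximate float values from the int8 codes."""
    return [(q - zero_point) * scale for q in quantized]


weights = [-1.0, -0.25, 0.0, 0.5, 2.0]  # illustrative values only
codes, scale, zero_point = quantize_int8(weights)
restored = dequantize(codes, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, zero_point, max_error)
```

Each value is stored as one 8-bit code plus a shared scale and zero point, and the round-trip error in this example stays within one quantization step.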

Save the quantized model so that it can be benchmarked.

compressed_model_dir = Path("model/optimized")
compressed_model_dir.mkdir(parents=True, exist_ok=True)
compressed_model_xml = compressed_model_dir / "flower_ir.xml"
serialize(quantized_model, str(compressed_model_xml))

Select inference device

Select a device from the dropdown list for running inference with OpenVINO.

import ipywidgets as widgets

device = widgets.Dropdown(
    options=core.available_devices + ["AUTO"], value="AUTO", description="Device:"
)
device

Dropdown(description='Device:', index=1, options=('CPU', 'AUTO'), value='AUTO')

Compare Metrics

Define a metric to determine the performance of the model.

For this demo, we define a validate function that computes the accuracy metric.

def validate(model, validation_loader):
    """
    Evaluate model and compute accuracy metrics.

    :param model: Model to validate
    :param validation_loader: Validation dataset
    :returns: Accuracy scores
    """
    predictions = []
    references = []

    output = model.outputs[0]

    for images, target in validation_loader:
        pred = model(images.numpy())[output]

        predictions.append(np.argmax(pred, axis=1))
        # Collect the ground-truth labels of the same batch
        references.append(target.numpy())

    predictions = np.concatenate(predictions, axis=0)
    references = np.concatenate(references, axis=0)

    scores = accuracy_score(references, predictions)

    return scores
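
The metric itself is simple: take the index of the largest output per sample and count how often it matches the label. The following plain-Python sketch shows that computation on toy data. The helper names argmax and top1_accuracy and the toy values are assumptions made for illustration; the notebook itself relies on np.argmax and accuracy_score.

```python
from typing import List, Sequence


def argmax(row: Sequence[float]) -> int:
    """Index of the largest value in a row of model outputs."""
    return max(range(len(row)), key=lambda i: row[i])


def top1_accuracy(outputs: List[Sequence[float]], labels: List[int]) -> float:
    """Fraction of samples whose highest-scoring class equals the label."""
    predictions = [argmax(row) for row in outputs]
    correct = sum(p == t for p, t in zip(predictions, labels))
    return correct / len(labels)


# Toy outputs for three samples and two classes (illustrative values only).
toy_outputs = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]
toy_labels = [1, 0, 0]

accuracy = top1_accuracy(toy_outputs, toy_labels)
print(f"{accuracy:.3f}")
```

Two of the three toy predictions match their labels, so the accuracy is two thirds.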

Calculate accuracy for the original model and the quantized model.

original_compiled_model = core.compile_model(model=ir_model, device_name=device.value)
quantized_compiled_model = core.compile_model(model=quantized_model, device_name=device.value)

original_accuracy = validate(original_compiled_model, val_dataset)
quantized_accuracy = validate(quantized_compiled_model, val_dataset)

print(f"Accuracy of the original model: {original_accuracy:.3f}")
print(f"Accuracy of the quantized model: {quantized_accuracy:.3f}")
Accuracy of the original model: 0.723
Accuracy of the quantized model: 0.729

Compare file size of the models.

original_model_size = model_xml.with_suffix(".bin").stat().st_size / 1024
quantized_model_size = compressed_model_xml.with_suffix(".bin").stat().st_size / 1024

print(f"Original model size: {original_model_size:.2f} KB")
print(f"Quantized model size: {quantized_model_size:.2f} KB")
Original model size: 7791.65 KB
Quantized model size: 3897.08 KB
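
A quick back-of-the-envelope check relates these sizes to the 3,989,285 parameters reported in the model summary. The sketch below only does arithmetic on the numbers printed in this notebook. Reading the result as "about 2 bytes per parameter before and about 1 byte after" is an interpretation, since the .bin files may also contain a small amount of data other than weights.

```python
# Values copied from the outputs printed in this notebook.
original_kb = 7791.65
quantized_kb = 3897.08
total_params = 3_989_285

# How much smaller the quantized weights file is.
compression_ratio = original_kb / quantized_kb
reduction_percent = 100 * (1 - quantized_kb / original_kb)

# Average storage per parameter, in bytes.
bytes_per_param_original = original_kb * 1024 / total_params
bytes_per_param_quantized = quantized_kb * 1024 / total_params

print(f"Compression ratio: {compression_ratio:.2f}x")
print(f"Size reduction: {reduction_percent:.2f}%")
print(f"Bytes per parameter (original): {bytes_per_param_original:.2f}")
print(f"Bytes per parameter (quantized): {bytes_per_param_quantized:.2f}")
```

The numbers are consistent with the original IR storing its weights in a 16-bit format and the quantized IR storing them in 8 bits, which would explain a reduction of about 2x rather than 4x.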

The original and quantized models have similar accuracy, while the quantized model is about half the size.

Run Inference on Quantized Model

Copy the preprocess function from the training notebook and run inference on the quantized model with Inference Engine. See the OpenVINO API tutorial for more information about running inference with Inference Engine Python API.

def pre_process_image(imagePath, img_height=180):
    # Model input format
    n, c, h, w = [1, 3, img_height, img_height]
    image = Image.open(imagePath)
    image = image.resize((h, w), resample=Image.BILINEAR)

    # Convert to a NumPy array; the reshape below adds a batch dimension and keeps the HWC layout (NHWC)
    image = np.array(image)

    input_image = image.reshape((n, h, w, c))

    return input_image
# Get the names of the input and output layer
# model_pot = ie.read_model(model="model/optimized/flower_ir.xml")
input_layer = quantized_compiled_model.input(0)
output_layer = quantized_compiled_model.output(0)

# Get the class names: a list of directory names in alphabetical order
class_names = sorted([item.name for item in Path(data_dir).iterdir() if item.is_dir()])

# Run inference on an input image...
inp_img_url = (
directory = "output"
inp_file_name = "A_Close_Up_Photo_of_a_Dandelion.jpg"
file_path = Path(directory)/Path(inp_file_name)
# Download the image if it does not exist yet
if not file_path.exists():
    download_file(inp_img_url, inp_file_name, directory=directory)

# Pre-process the image and get it ready for inference.
input_image = pre_process_image(imagePath=file_path)
print(f'input image shape: {input_image.shape}')
print(f'input layer shape: {input_layer.shape}')

res = quantized_compiled_model([input_image])[output_layer]

score = tf.nn.softmax(res[0])

# Show the results
image = Image.open(file_path)
plt.imshow(image)
print(
    "This image most likely belongs to {} with a {:.2f} percent confidence.".format(
        class_names[np.argmax(score)], 100 * np.max(score)
    )
)
'output/A_Close_Up_Photo_of_a_Dandelion.jpg' already exists.
input image shape: (1, 180, 180, 3)
input layer shape: [1,180,180,3]
This image most likely belongs to dandelion with a 98.94 percent confidence.
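
The confidence value comes from applying softmax to the raw model outputs and taking the largest probability. The following plain-Python sketch shows that step in isolation. The softmax helper is a hand-written stand-in for tf.nn.softmax, and the toy_logits values are hypothetical, not outputs of the model in this notebook.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Turn raw scores into probabilities that sum to 1."""
    # Subtract the maximum first for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class_names = ["daisy", "dandelion", "roses", "sunflowers", "tulips"]
toy_logits = [0.5, 6.0, 0.1, 1.2, -0.3]  # hypothetical raw outputs

probs = softmax(toy_logits)
best = max(range(len(probs)), key=lambda i: probs[i])
print(
    "This image most likely belongs to {} with a {:.2f} percent confidence.".format(
        class_names[best], 100 * probs[best]
    )
)
```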

Compare Inference Speed

Measure inference speed with the OpenVINO Benchmark App.

Benchmark App is a command line tool that measures raw inference performance for a specified OpenVINO IR model. Run benchmark_app --help to see a list of available parameters. By default, Benchmark App tests the performance of the model specified with the -m parameter with asynchronous inference on CPU, for one minute. Use the -d parameter to test performance on a different device, for example an Intel integrated Graphics (iGPU), and -t to set the number of seconds to run inference. See the documentation for more information.

This tutorial uses a wrapper function from Notebook Utils. It prints the benchmark_app command with the chosen parameters.

In the next cells, inference speed will be measured for the original and quantized model on CPU. If an iGPU is available, inference speed will be measured for CPU+GPU as well. The number of seconds is set to 15.

NOTE: For the most accurate performance estimation, it is recommended to run benchmark_app in a terminal/command prompt after closing other applications.
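
For running the benchmark from a script rather than a notebook cell, the command line can be assembled explicitly. The sketch below uses only the parameters that appear in this tutorial (-m, -d, -t and -api). The helper build_benchmark_command is hypothetical and is not the Notebook Utils wrapper mentioned above.

```python
from typing import List


def build_benchmark_command(
    model_path: str, device: str = "CPU", seconds: int = 15, api: str = "async"
) -> List[str]:
    """Assemble the benchmark_app argument list without running it."""
    return [
        "benchmark_app",
        "-m", model_path,
        "-d", device,
        "-t", str(seconds),
        "-api", api,
    ]


command = build_benchmark_command("model/flower/flower_ir.xml")
print(" ".join(command))
```

Assuming benchmark_app is installed and on the PATH, the resulting list could be passed to subprocess.run(command, check=True).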

# print the available devices on this system
print("Device information:")
print(core.get_property("CPU", "FULL_DEVICE_NAME"))
if "GPU" in core.available_devices:
    print(core.get_property("GPU", "FULL_DEVICE_NAME"))
Device information:
Intel(R) Core(TM) i9-10920X CPU @ 3.50GHz
# Original model - CPU
! benchmark_app -m $model_xml -d CPU -t 15 -api async
Benchmark on MULTI:CPU,GPU

With a recent Intel CPU, the best performance can often be achieved by running inference on both the CPU and the iGPU with OpenVINO’s Multi Device Plugin. Loading a model takes a bit longer on GPU than on CPU, so the first run of this benchmark takes a bit longer to complete than the CPU benchmark. Benchmark App supports caching, by specifying the --cdir parameter. In the cells below, the model will be cached to the model_cache directory.
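
The device choice made in the following cells can be expressed as a small function. This is a sketch with a hypothetical helper name, choose_benchmark_device. It mirrors the notebook's check for a device named "GPU" in the list of available devices and falls back to CPU otherwise.

```python
from typing import List


def choose_benchmark_device(available_devices: List[str]) -> str:
    """Return the MULTI device string when a GPU is listed, else CPU."""
    if "GPU" in available_devices:
        return "MULTI:CPU,GPU"
    return "CPU"


print(choose_benchmark_device(["CPU"]))
print(choose_benchmark_device(["CPU", "GPU"]))
```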

# Original model - MULTI:CPU,GPU
if "GPU" in core.available_devices:
    ! benchmark_app -m $model_xml -d MULTI:CPU,GPU -t 15 -api async
else:
    print("A supported integrated GPU is not available on this system.")
A supported integrated GPU is not available on this system.
# Quantized model - MULTI:CPU,GPU
if "GPU" in core.available_devices:
    ! benchmark_app -m $compressed_model_xml -d MULTI:CPU,GPU -t 15 -api async
else:
    print("A supported integrated GPU is not available on this system.")
A supported integrated GPU is not available on this system.

