Quantize a Segmentation Model and Show Live Inference

This tutorial is also available as a Jupyter notebook that can be cloned directly from GitHub. See the installation guide for instructions to run this tutorial locally on Windows, Linux or macOS.


Kidney Segmentation with PyTorch Lightning and OpenVINO™ - Part 3

This tutorial is part of a series on how to train, optimize, quantize and show live inference on a medical segmentation model. The goal is to accelerate inference on a kidney segmentation model. The UNet model is trained from scratch; the data is from Kits19.

This third tutorial in the series shows how to:

  • Convert an ONNX model to OpenVINO IR with Model Optimizer.

  • Quantize a model with OpenVINO’s Post-Training Optimization Tool API.

  • Evaluate the F1 score metric of the original model and the quantized model.

  • Benchmark performance of the original model and the quantized model.

  • Show live inference with OpenVINO’s async API and MULTI plugin.


This notebook needs a trained UNet model that is converted to ONNX format. We provide a pretrained model, trained for 20 epochs with the full Kits19 frames dataset, which has an F1 score of 0.9 on the validation set. The training code will be made available soon. Running this notebook with the full dataset will take a long time. For demonstration purposes, this tutorial downloads one converted CT scan and uses that scan for quantization and inference. For production use, please use a larger dataset for more generalizable results.

To install the requirements for running this notebook, please follow the instructions in the README.


The Post-Training Optimization API is implemented in the compression library.

import glob
import os
import random
import sys
import time
import warnings
import zipfile
from pathlib import Path
from typing import List

import cv2
import matplotlib.pyplot as plt
import numpy as np
from addict import Dict
from async_inference import CTAsyncPipeline, SegModel
from compression.api import Metric
from compression.engines.ie_engine import IEEngine
from compression.graph import load_model, save_model
from compression.graph.model_utils import compress_model_weights
from compression.pipeline.initializer import create_pipeline
from IPython.display import Image, display
from omz_python.models import model as omz_model
from openvino.inference_engine import IECore
from yaspin import yaspin

from notebook_utils import benchmark_model, download_file
15:07:00 accuracy_checker WARNING: /opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/defusedxml/__init__.py:30: DeprecationWarning: defusedxml.cElementTree is deprecated, import from defusedxml.ElementTree instead.
  from . import cElementTree


To use the pretrained models, set ONNX_PATH to "pretrained_model/unet44.onnx". To use a model that you trained or optimized yourself, adjust ONNX_PATH. MODEL_DIR is the directory where the IR model will be saved. By default, this notebook will quantize one CT scan from the KITS19 dataset. To use the full dataset, set BASEDIR to the path of the dataset.

BASEDIR = Path("kits19_frames_1")
ONNX_PATH = Path("pretrained_model/unet44.onnx")
MODEL_DIR = Path("model")

ir_path = (MODEL_DIR / ONNX_PATH.stem).with_suffix(".xml")
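As a quick sanity check of the path construction above (pure pathlib, no model files required): `.stem` drops the `.onnx` extension and `.with_suffix(".xml")` appends the IR extension.

```python
from pathlib import Path

onnx_path = Path("pretrained_model/unet44.onnx")
# onnx_path.stem is "unet44"; joining with the model directory and changing
# the suffix gives the path where Model Optimizer will write the IR .xml file
ir_path = (Path("model") / onnx_path.stem).with_suffix(".xml")
print(ir_path.as_posix())  # model/unet44.xml
```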

Download CT-scan Data

# The CT scan case number. For example: 16 for data from the case_00016 directory
# Currently only 16 is supported
case = 16
if not (BASEDIR / f"case_{case:05d}").exists():
    filename = download_file(...)  # download URL omitted here
    with zipfile.ZipFile(filename, "r") as zip_ref:
        zip_ref.extractall(path=BASEDIR)
    os.remove(filename)  # remove zipfile
    print(f"Downloaded and extracted data for case_{case:05d}")
else:
    print(f"Data for case_{case:05d} exists")
Data for case_00016 exists

Convert Model to OpenVINO IR

Call the Model Optimizer tool to convert the ONNX model to OpenVINO IR, with FP16 precision. The model files are saved to the MODEL_DIR directory. See the Model Optimizer Developer Guide for more information.

Model Optimization was successful if the last lines of the output include [ SUCCESS ] Generated IR version 10 model.

!mo --input_model $ONNX_PATH --output_dir $MODEL_DIR --data_type FP16
Model Optimizer arguments:
Common parameters:
    - Path to the Input Model:  /home/runner/work/openvino_notebooks/openvino_notebooks/notebooks/110-ct-segmentation-quantize/pretrained_model/unet44.onnx
    - Path for generated IR:    /home/runner/work/openvino_notebooks/openvino_notebooks/notebooks/110-ct-segmentation-quantize/model
    - IR output name:   unet44
    - Log level:    ERROR
    - Batch:    Not specified, inherited from the model
    - Input layers:     Not specified, inherited from the model
    - Output layers:    Not specified, inherited from the model
    - Input shapes:     Not specified, inherited from the model
    - Mean values:  Not specified
    - Scale values:     Not specified
    - Scale factor:     Not specified
    - Precision of IR:  FP16
    - Enable fusing:    True
    - Enable grouped convolutions fusing:   True
    - Move mean values to preprocess section:   None
    - Reverse input channels:   False
ONNX specific parameters:
    - Inference Engine found in:    /opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/openvino
Inference Engine version:   2021.4.2-3976-0943ed67223-refs/pull/539/head
Model Optimizer version:    2021.4.2-3976-0943ed67223-refs/pull/539/head
[ SUCCESS ] Generated IR version 10 model.
[ SUCCESS ] XML file: /home/runner/work/openvino_notebooks/openvino_notebooks/notebooks/110-ct-segmentation-quantize/model/unet44.xml
[ SUCCESS ] BIN file: /home/runner/work/openvino_notebooks/openvino_notebooks/notebooks/110-ct-segmentation-quantize/model/unet44.bin
[ SUCCESS ] Total execution time: 24.96 seconds.
[ SUCCESS ] Memory consumed: 93 MB.

Post-Training Optimization Tool (POT) Quantization

The Post-Training Optimization Tool (POT) compression API defines base classes for Metric and DataLoader. In this notebook, we use a custom Metric and DataLoader that show all the required methods.



Define a metric to determine the performance of the model. For the Default Quantization algorithm that is used in this tutorial, defining a metric is optional. The metric is used to compare the quantized INT8 model with the original FP16 IR model.

A metric for POT inherits from compression.api.Metric and should implement all the methods in this example.

For this demo, the F1 score, or Dice coefficient, is used.

# The sigmoid function is used to transform the result of the network
# to binary segmentation masks
def sigmoid(x):
    return np.exp(-np.logaddexp(0, -x))
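The np.logaddexp form above is a numerically stable rewrite of the standard sigmoid 1 / (1 + exp(-x)): it avoids overflowing exp() for large-magnitude inputs. A quick check:

```python
import numpy as np

def sigmoid(x):
    # exp(-log(1 + exp(-x))) == 1 / (1 + exp(-x)), without overflowing exp()
    return np.exp(-np.logaddexp(0, -x))

x = np.array([-1000.0, 0.0, 1000.0])
print(sigmoid(x))  # approximately [0., 0.5, 1.], with no overflow
```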

class BinaryF1(Metric):
    """
    Metric to compute F1/Dice score for binary segmentation. F1 is computed as
    (2 * precision * recall) / (precision + recall), where precision is the ratio
    of pixels that were correctly predicted as true to all pixels predicted as true,
    and recall is the ratio of pixels that were correctly predicted as true to all
    actual true pixels.

    See https://en.wikipedia.org/wiki/F-score
    """

    # Required methods
    def __init__(self):
        super().__init__()
        self._name = "F1"
        self.y_true = 0
        self.y_pred = 0
        self.correct_true = 0

    def value(self):
        """Returns metric value for the last model output.
        Possible format: {metric_name: [metric_values_per_image]}
        """
        return {self._name: [0, 0]}

    def avg_value(self):
        """Returns average metric value for all model outputs.
        Possible format: {metric_name: metric_value}
        """
        precision = self.correct_true / self.y_pred
        recall = self.correct_true / self.y_true

        f1 = (2 * precision * recall) / (precision + recall)
        return {self._name: f1}

    def update(self, output, target):
        """Update the statistics for metric computation.

        :param output: model output
        :param target: annotations for model output
        """
        label = target[0].astype(np.byte)
        prediction = sigmoid(output[0]).round().astype(np.byte)

        self.y_true += np.sum(label)
        self.y_pred += np.sum(prediction)

        correct_true = np.sum(
            (label == prediction).astype(np.byte) * (label == 1).astype(np.byte)
        )

        self.correct_true += correct_true

    def reset(self):
        """Resets metric"""
        self.y_true = 0
        self.y_pred = 0
        self.correct_true = 0

    def get_attributes(self):
        """
        Returns a dictionary of metric attributes {metric_name: {attribute_name: value}}.
        Required attributes: 'direction': 'higher-better' or 'higher-worse'
                             'type': metric type
        """
        return {self._name: {"direction": "higher-better", "type": "F1"}}
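The statistics that the metric accumulates can be checked by hand on a toy example (plain NumPy, no POT required; the masks below are made up for illustration):

```python
import numpy as np

# Toy 4x4 ground truth and prediction masks (1 = kidney pixel)
label = np.array([[0, 1, 1, 0],
                  [0, 1, 1, 0],
                  [0, 0, 0, 0],
                  [0, 0, 0, 0]], dtype=np.uint8)
prediction = np.array([[0, 1, 0, 0],
                       [0, 1, 1, 0],
                       [0, 0, 1, 0],
                       [0, 0, 0, 0]], dtype=np.uint8)

correct_true = np.sum((label == 1) & (prediction == 1))  # true positives: 3
precision = correct_true / np.sum(prediction)            # 3 / 4
recall = correct_true / np.sum(label)                    # 3 / 4
f1 = 2 * precision * recall / (precision + recall)
print(f"F1: {f1:.2f}")  # F1: 0.75
```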



The dataset in the next cell is copied from the training notebook. It expects images and masks in the basedir directory, in a folder per patient. For more information about the dataset, see the data preparation notebook. This dataset follows POT’s compression.api.DataLoader interface, which should implement __init__, __getitem__ and __len__. It can therefore be used directly for POT.
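The interface itself is tiny; a minimal synthetic loader that returns items with the same (annotation, input_image, metadata) shape can be sketched as follows (a standalone toy class for illustration, not inheriting from compression.api.DataLoader, so it runs without POT installed):

```python
import numpy as np

class ToyDataLoader:
    """Minimal loader following the DataLoader protocol: __init__, __len__, __getitem__."""

    def __init__(self, num_items: int = 4):
        # Synthetic stand-ins for (1, 512, 512) CT slices and (512, 512) masks
        self.images = [np.zeros((1, 512, 512), dtype=np.float32) for _ in range(num_items)]
        self.masks = [np.zeros((512, 512), dtype=np.uint8) for _ in range(num_items)]

    def __len__(self):
        return len(self.images)

    def __getitem__(self, index):
        annotation = (index, self.masks[index])
        return annotation, self.images[index], {"case": "toy", "slice": str(index)}

loader = ToyDataLoader()
print(len(loader), loader[0][1].shape)  # 4 (1, 512, 512)
```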

class KitsDataset(object):
    def __init__(self, basedir: str, dataset_type: str, transforms=None):
        """
        Dataset class for prepared Kits19 data, for binary segmentation (background/kidney)

        :param basedir: Directory that contains the prepared CT scans, in subdirectories
                        case_00000 until case_00210
        :param dataset_type: either "train" or "val"
        :param transforms: Compose object with augmentations
        """
        allmasks = sorted(glob.glob(f"{basedir}/case_*/segmentation_frames/*png"))

        if len(allmasks) == 0:
            raise ValueError(
                f"basedir: '{basedir}' does not contain data for type '{dataset_type}'"
            )
        self.valpatients = [11, 15, 16, 49, 50, 79, 81, 89, 106, 108, 112, 126, 129, 133,
                            141, 166, 169, 170, 192, 202, 204]  # fmt: skip
        valcases = [f"case_{i:05d}" for i in self.valpatients]
        if dataset_type == "train":
            masks = [mask for mask in allmasks if Path(mask).parents[1].name not in valcases]
        elif dataset_type == "val":
            masks = [mask for mask in allmasks if Path(mask).parents[1].name in valcases]
        else:
            raise ValueError("Please choose train or val dataset split")

        if dataset_type == "train":
            random.shuffle(masks)
        self.basedir = basedir
        self.dataset_type = dataset_type
        self.dataset = masks
        self.transforms = transforms
        print(
            f"Created {dataset_type} dataset with {len(self.dataset)} items. Base directory for data: {basedir}"
        )

    def __getitem__(self, index):
        """
        Get an item from the dataset at the specified index.

        :return: (annotation, input_image, metadata) where annotation is (index, segmentation_mask)
                 and metadata a dictionary with case and slice number
        """
        mask_path = self.dataset[index]
        # Open the image with OpenCV with `cv2.IMREAD_UNCHANGED` to prevent automatic
        # conversion of 1-channel black and white images to 3-channel BGR images.
        mask = cv2.imread(mask_path, cv2.IMREAD_UNCHANGED)

        image_path = str(Path(mask_path.replace("segmentation", "imaging")).with_suffix(".jpg"))
        img = cv2.imread(image_path, cv2.IMREAD_UNCHANGED)

        if img.shape[:2] != (512, 512):
            img = cv2.resize(img, (512, 512))
            mask = cv2.resize(mask, (512, 512))

        annotation = (index, mask.astype(np.uint8))
        input_image = np.expand_dims(img, axis=0).astype(np.float32)
        return (
            annotation,
            input_image,
            {"case": Path(mask_path).parents[1].name, "slice": Path(mask_path).stem},
        )

    def __len__(self):
        return len(self.dataset)

To test that the data loader returns the expected output, we create a DataLoader instance and show an image and a mask. The image and mask are shown as returned by the dataloader, after resizing and preprocessing. Since this dataset contains a lot of slices without kidneys, we select a slice that contains at least 100 kidney pixels to verify that the annotations look correct.

# Create data loader
data_loader = KitsDataset(BASEDIR, "val")

# Find a slice that contains kidney annotations
# item[0] is the annotation: (id, annotation_data)
annotation, image_data, _ = next(item for item in data_loader if np.count_nonzero(item[0][1]) > 100)

# The data loader returns images as floating point data with (C,H,W) layout. Convert to 8-bit
# integer data and transpose to (H,W,C) for visualization
image = image_data.astype(np.uint8).transpose(1, 2, 0)

# The data loader returns annotations as (index, mask) and mask in shape (1,H,W)
# grab only the mask, and remove the channel dimension for visualization
mask = annotation[1].squeeze()

fig, ax = plt.subplots(1, 2, figsize=(12, 6))
ax[0].imshow(image, cmap="gray")
ax[1].imshow(mask, cmap="gray");
Created val dataset with 178 items. Base directory for data: kits19_frames_1

Quantization Config

POT methods expect configuration dictionaries as arguments, which are defined in the cell below. The variable ir_path is defined in the Settings cell at the top of the notebook. The other variables are defined in the cell above.

See Post-Training Optimization Best Practices for more information on the settings.

# Model config specifies the model name and paths to model .xml and .bin file
model_config = Dict(
    {
        "model_name": f"quantized_{ir_path.stem}",
        "model": ir_path,
        "weights": ir_path.with_suffix(".bin"),
    }
)

# Engine config
engine_config = Dict({"device": "CPU"})

algorithms = [
    {
        "name": "DefaultQuantization",
        "stat_subset_size": 300,
        "params": {
            "target_device": "ANY",
            "preset": "mixed",  # choose between "mixed" and "performance"
        },
    }
]

print(f"model_config: {model_config}")
model_config: {'model_name': 'quantized_unet44', 'model': PosixPath('model/unet44.xml'), 'weights': PosixPath('model/unet44.bin')}

Prepare Quantization Pipeline: DataLoader, Model, Metric, Inference Engine

The POT pipeline uses the functions load_model(), IEEngine and create_pipeline(). load_model() loads an IR model specified in model_config; IEEngine is a POT implementation of Inference Engine that will be passed to the POT pipeline created by create_pipeline(). The POT classes and functions expect a config argument. These configs are created in the Quantization Config section above. The BinaryF1 metric and KitsDataset data loader are defined earlier in this notebook.

Running the POT quantization pipeline takes just two lines of code. We create the pipeline with the create_pipeline function, and then run that pipeline with pipeline.run(). To reuse the quantized model later, we compress the model weights and save the compressed model to disk.

# Step 1: create data loader
data_loader = KitsDataset(BASEDIR, "val")

# Step 2: load model
ir_model = load_model(model_config=model_config)

# Step 3: initialize the metric
metric = BinaryF1()

# Step 4: Initialize the engine for metric calculation and statistics collection.
engine = IEEngine(config=engine_config, data_loader=data_loader, metric=metric)

# Step 5: Create a pipeline of compression algorithms.
# `algorithms` is defined in the Quantization Config cell above
pipeline = create_pipeline(algorithms, engine)

# Step 6: Execute the pipeline to quantize the model
algorithm_name = pipeline.algo_seq[0].name
with yaspin(text=f"Executing POT pipeline on {model_config['model']} with {algorithm_name}") as sp:
    start_time = time.perf_counter()
    compressed_model = pipeline.run(ir_model)
    end_time = time.perf_counter()
    sp.text = f"Quantization finished in {end_time - start_time:.2f} seconds"

# Step 7 (Optional): Compress model weights to quantized precision
#                    in order to reduce the size of the final .bin file.
compress_model_weights(compressed_model)

# Step 8: Save the compressed model to the desired path.
# Set save_path to the directory where the model should be saved.
compressed_model_paths = save_model(
    model=compressed_model, save_path="optimized_model", model_name=ir_model.name
)
compressed_model_path = compressed_model_paths[0]["model"]
print("The quantized model is stored at", compressed_model_path)
Created val dataset with 178 items. Base directory for data: kits19_frames_1
✔ Quantization finished in 179.58 seconds
The quantized model is stored at optimized_model/quantized_unet44.xml

Compare Metric of FP16 and INT8 Model

# Compute the F1 score on the quantized model and compare with the F1 score on the FP16 IR model.
ir_model = load_model(model_config=model_config)
evaluation_pipeline = create_pipeline(algo_config=algorithms, engine=engine)

with yaspin(text="Evaluating original IR model") as sp:
    original_metric = evaluation_pipeline.evaluate(ir_model)

with yaspin(text="Evaluating quantized IR model") as sp:
    quantized_metric = pipeline.evaluate(compressed_model)

if quantized_metric:
    for key, value in quantized_metric.items():
        print(f"The {key} score of the quantized INT8 model is {value:.3f}")

if original_metric:
    for key, value in original_metric.items():
        print(f"The {key} score of the original FP16 model is {value:.3f}")
The F1 score of the quantized INT8 model is 0.946
The F1 score of the original FP16 model is 0.968

Compare Performance of the Original and Quantized Models

To measure the inference performance of the FP16 and INT8 models, we use Benchmark Tool, OpenVINO’s inference performance measurement tool. Benchmark Tool is a command-line application that can be run in the notebook with ! benchmark_app or %sx benchmark_app.

In this tutorial, we use a wrapper function from Notebook Utils. It prints the benchmark_app command with the chosen parameters.

NOTE: For the most accurate performance estimation, we recommend running benchmark_app in a terminal/command prompt after closing other applications. Run benchmark_app --help to see all command line options.
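Under the hood, the wrapper assembles an ordinary benchmark_app command line. The equivalent invocation can also be built and run directly (a sketch using the same flags this notebook's wrapper prints; the subprocess call is commented out because it requires the OpenVINO developer tools on PATH):

```python
import subprocess  # only needed if you actually run the command

model_path = "model/unet44.xml"  # FP16 IR created earlier in this notebook
command = [
    "benchmark_app", "-m", model_path, "-d", "CPU",
    "-t", "15", "-api", "async", "-b", "1", "-cdir", "model_cache",
]
print(" ".join(command))
# benchmark_app -m model/unet44.xml -d CPU -t 15 -api async -b 1 -cdir model_cache
# subprocess.run(command, check=True)  # uncomment to run
```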

# By default, benchmark on MULTI:CPU,GPU if a GPU is available, otherwise on CPU.
ie = IECore()
device = "MULTI:CPU,GPU" if "GPU" in ie.available_devices else "CPU"
# Uncomment one of the options below to benchmark on other devices
# device = "GPU"
# device = "CPU"
# device = "AUTO"
# Benchmark FP16 model
benchmark_model(model_path=ir_path, device=device, seconds=15)

Benchmark unet44.xml with CPU for 15 seconds with async inference

Benchmark command: benchmark_app -m model/unet44.xml -d CPU -t 15 -api async -b 1 -cdir model_cache

Count:      61 iterations
Duration:   15302.75 ms
Latency:    248.17 ms
Throughput: 3.99 FPS

Device: Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
# Benchmark INT8 model
benchmark_model(model_path=compressed_model_path, device=device, seconds=15)

Benchmark quantized_unet44.xml with CPU for 15 seconds with async inference

Benchmark command: benchmark_app -m optimized_model/quantized_unet44.xml -d CPU -t 15 -api async -b 1 -cdir model_cache

Count:      42 iterations
Duration:   15559.16 ms
Latency:    369.81 ms
Throughput: 2.70 FPS

Device: Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz

Visually Compare Inference Results

Visualize the results of the model on four slices of the validation set. Compare the results of the FP16 IR model with the results of the quantized INT8 model and the reference segmentation annotation.

Medical imaging datasets tend to be very imbalanced: most of the slices in a CT scan do not contain kidney data. The segmentation model should be good at finding kidneys where they exist (in medical terms: have good sensitivity) but also not find spurious kidneys that do not exist (have good specificity). In the next cell, we show four slices: two slices that have no kidney data, and two slices that contain kidney data. For this example, a slice has kidney data if at least 50 pixels in the slice are annotated as kidney.

Run this cell again to show results on a different subset. The random seed is displayed to allow reproducing specific runs of this cell.

Note: the images are shown after optional augmenting and resizing. In the Kits19 dataset, all but one of the cases have input shape (512, 512).

num_images = 4
colormap = "gray"

ie = IECore()
net_ir = ie.read_network(ir_path)
net_pot = ie.read_network(compressed_model_path)

exec_net_ir = ie.load_network(network=net_ir, device_name="CPU")
exec_net_pot = ie.load_network(network=net_pot, device_name="CPU")
input_layer = next(iter(net_ir.input_info))
output_layer_ir = next(iter(net_ir.outputs))
output_layer_pot = next(iter(net_pot.outputs))
# Create a dataset, and make a subset of the dataset for visualization
# The dataset items are (annotation, image) where annotation is (index, mask)
background_slices = (item for item in data_loader if np.count_nonzero(item[0][1]) == 0)
kidney_slices = (item for item in data_loader if np.count_nonzero(item[0][1]) > 50)
# Set seed to current time. To reproduce specific results, copy the printed seed
# and manually set `seed` to that value.
seed = int(time.time())
random.seed(seed)
print(f"Visualizing results with seed {seed}")
data_subset = random.sample(list(background_slices), 2) + random.sample(list(kidney_slices), 2)

fig, ax = plt.subplots(nrows=num_images, ncols=4, figsize=(24, num_images * 4))
for i, (annotation, image, meta) in enumerate(data_subset):
    mask = annotation[1]
    res_ir = exec_net_ir.infer(inputs={input_layer: image})
    res_pot = exec_net_pot.infer(inputs={input_layer: image})
    target_mask = mask.astype(np.uint8)

    result_mask_ir = sigmoid(res_ir[output_layer_ir]).round().astype(np.uint8)[0, 0, ::]
    result_mask_pot = sigmoid(res_pot[output_layer_pot]).round().astype(np.uint8)[0, 0, ::]

    ax[i, 0].imshow(image[0, ::], cmap=colormap)
    ax[i, 1].imshow(target_mask, cmap=colormap)
    ax[i, 2].imshow(result_mask_ir, cmap=colormap)
    ax[i, 3].imshow(result_mask_pot, cmap=colormap)
    ax[i, 0].set_title(f"{meta['slice']}")
    ax[i, 1].set_title("Annotation")
    ax[i, 2].set_title("Prediction on FP16 model")
    ax[i, 3].set_title("Prediction on INT8 model")
Visualizing results with seed 1638285258

Show Live Inference

To show live inference on the model in the notebook, we use the asynchronous processing feature of OpenVINO Inference Engine.

If you use a GPU device, for example device="GPU" or device="MULTI:CPU,GPU" to do inference on an integrated graphics card, model loading will be slow the first time you run this code. The model will be cached, so model loading will be faster on subsequent runs. See the OpenVINO API tutorial for more information on Inference Engine, including Model Caching.
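Caching is controlled through an Inference Engine configuration option. A configuration sketch for enabling it on GPU before loading a network (assuming the 2021.4 IECore API used in this notebook; "model_cache" is an arbitrary directory name):

```python
from openvino.inference_engine import IECore

ie = IECore()
if "GPU" in ie.available_devices:
    # Store compiled model blobs in ./model_cache so subsequent loads are fast
    ie.set_config(config={"CACHE_DIR": "model_cache"}, device_name="GPU")
```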

Visualization Functions

We define a helper function show_array to efficiently show images in the notebook. The do_inference function uses Open Model Zoo’s AsyncPipeline to perform asynchronous inference. After inference on the specified CT scan has completed, the total time and throughput (fps), including preprocessing and displaying, will be printed.

def showarray(frame: np.ndarray, display_handle=None):
    """
    Display array `frame`. Replace information at `display_handle` with `frame`
    encoded as jpeg image.

    Create a display_handle with: `display_handle = display(display_id=True)`
    """
    _, frame = cv2.imencode(ext=".jpeg", img=frame)
    if display_handle is None:
        display_handle = display(Image(data=frame.tobytes()), display_id=True)
    else:
        display_handle.update(Image(data=frame.tobytes()))
    return display_handle

def do_inference(imagelist: List, model: omz_model.Model, device: str):
    """
    Do inference of images in `imagelist` on `model` on the given `device` and show
    the results in real time in a Jupyter Notebook

    :param imagelist: list of images/frames to do inference on
    :param model: Model instance for inference
    :param device: Name of device to perform inference on. For example: "CPU"
    """
    display_handle = None
    next_frame_id = 0
    next_frame_id_to_show = 0

    input_layer = next(iter(model.net.input_info))

    # Create asynchronous pipeline and print time it takes to load the model
    load_start_time = time.perf_counter()
    pipeline = CTAsyncPipeline(
        ie=ie, model=model, plugin_config={}, device=device, max_num_requests=0
    )
    load_end_time = time.perf_counter()

    # Perform asynchronous inference
    start_time = time.perf_counter()

    while next_frame_id < len(imagelist) - 1:
        results = pipeline.get_result(next_frame_id_to_show)

        if results:
            # Show next result from async pipeline
            result, meta = results
            display_handle = showarray(result, display_handle)

            next_frame_id_to_show += 1

        if pipeline.is_ready():
            # Submit new image to async pipeline
            image = imagelist[next_frame_id]
            pipeline.submit_data(
                inputs={input_layer: image}, id=next_frame_id, meta={"frame": image}
            )
            next_frame_id += 1
        else:
            # If the pipeline is not ready yet and there are no results: wait
            pipeline.await_any()


    # Show all frames that are in the pipeline after all images have been submitted
    while pipeline.has_completed_request():
        results = pipeline.get_result(next_frame_id_to_show)
        if results:
            result, meta = results
            display_handle = showarray(result, display_handle)
            next_frame_id_to_show += 1

    end_time = time.perf_counter()
    duration = end_time - start_time
    fps = len(imagelist) / duration
    print(f"Loaded model to {device} in {load_end_time-load_start_time:.2f} seconds.")
    print(f"Total time for {next_frame_id+1} frames: {duration:.2f} seconds, fps:{fps:.2f}")

Load Model and Images

Load the segmentation model to Inference Engine with SegModel, based on the Open Model Zoo Model API. Load a CT scan from the BASEDIR directory to a list.

case = 16

ie = IECore()
segmentation_model = SegModel(ie=ie, model_path=Path(compressed_model_path))
demopattern = f"{BASEDIR}/case_{case:05d}/imaging_frames/*jpg"
imlist = sorted(glob.glob(demopattern))
images = [cv2.imread(im, cv2.IMREAD_UNCHANGED) for im in imlist]

print(f"Loaded images from case {case} from directory: {BASEDIR}")
Loaded images from case 16 from directory: kits19_frames_1

Show Inference

In the next cell, we run the do_inference function, which loads the model to the specified device (using caching for faster model loading on GPU devices), performs inference, and displays the results in real time.

# Possible options for device include "CPU", "GPU", "AUTO", "MULTI"
device = "MULTI:CPU,GPU" if "GPU" in ie.available_devices else "CPU"
do_inference(imagelist=images, model=segmentation_model, device=device)
Loaded model to CPU in 0.20 seconds.
Total time for 178 frames: 70.55 seconds, fps:2.52