Quantize a Segmentation Model and Show Live Inference¶
This tutorial is also available as a Jupyter notebook that can be cloned directly from GitHub. See the installation guide for instructions to run this tutorial locally on Windows, Linux or macOS.
Kidney Segmentation with PyTorch Lightning and OpenVINO™ - Part 3¶
This tutorial is part of a series on how to train, optimize, quantize and show live inference on a medical segmentation model. The goal is to accelerate inference on a kidney segmentation model. The UNet model is trained from scratch; the data is from Kits19.
This third tutorial in the series shows how to:
Convert an ONNX model to OpenVINO IR with Model Optimizer
Quantize a PyTorch model with NNCF
Evaluate the F1 score metric of the original model and the quantized model
Benchmark performance of the original model and the quantized model
Show live inference with OpenVINO’s async API
All notebooks in this series:
Train a 2D-UNet Medical Imaging Model with PyTorch Lightning
Convert and Quantize a Segmentation Model and Show Live Inference (this notebook)
Instructions¶
This notebook needs a trained UNet model. We provide a pretrained model, trained for 20 epochs on the full Kits19 frames dataset, which has an F1 score of 0.9 on the validation set. The training code is available in this notebook.
NNCF for PyTorch models requires a C++ compiler. On Windows, please install Microsoft Visual Studio 2019. During installation, choose Desktop development with C++ in the Workloads tab. On macOS, run xcode-select --install from a Terminal. On Linux, please install gcc.
Running this notebook with the full dataset will take a long time. For demonstration purposes, this tutorial will download one converted CT scan and use that scan for quantization and inference. For production use, please use a representative dataset for quantizing the model.
Imports¶
# On Windows, try to find the directory that contains x64 cl.exe and add it to the PATH to enable PyTorch
# to find the required C++ tools. This code assumes that Visual Studio is installed in the default
# directory. If you have a different C++ compiler, please add the correct path to os.environ["PATH"]
# directly. Note that the C++ Redistributable is not enough to run this notebook.
# Adding the path to os.environ["LIB"] is not always required - it depends on the system's configuration
import sys

if sys.platform == "win32":
    import distutils.command.build_ext
    import distutils.core  # needed for distutils.core.Distribution below
    import os
    from pathlib import Path

    if sys.getwindowsversion().build >= 20000:  # Windows 11
        search_path = "**/Hostx64/x64/cl.exe"
    else:
        search_path = "**/Hostx86/x64/cl.exe"

    VS_INSTALL_DIR_2019 = r"C:/Program Files (x86)/Microsoft Visual Studio"
    VS_INSTALL_DIR_2022 = r"C:/Program Files/Microsoft Visual Studio"

    cl_paths_2019 = sorted(list(Path(VS_INSTALL_DIR_2019).glob(search_path)))
    cl_paths_2022 = sorted(list(Path(VS_INSTALL_DIR_2022).glob(search_path)))
    cl_paths = cl_paths_2019 + cl_paths_2022

    if len(cl_paths) == 0:
        raise ValueError(
            "Cannot find Visual Studio. This notebook requires an x64 C++ compiler. If you installed "
            "a C++ compiler, please add the directory that contains cl.exe to `os.environ['PATH']`."
        )
    else:
        # If multiple versions of MSVC are installed, get the most recent version
        cl_path = cl_paths[-1]
        vs_dir = str(cl_path.parent)
        os.environ["PATH"] += f"{os.pathsep}{vs_dir}"
        # Code for finding the library dirs from
        # https://stackoverflow.com/questions/47423246/get-pythons-lib-path
        d = distutils.core.Distribution()
        b = distutils.command.build_ext.build_ext(d)
        b.finalize_options()
        os.environ["LIB"] = os.pathsep.join(b.library_dirs)
        print(f"Added {vs_dir} to PATH")
import logging
import os
import random
import sys
import time
import warnings
import zipfile
from pathlib import Path

warnings.filterwarnings("ignore", category=UserWarning)

import cv2
import matplotlib.pyplot as plt
import monai
import numpy as np
import torch
from monai.transforms import LoadImage
from nncf.common.utils.logger import set_log_level
from openvino.inference_engine import IECore
from openvino.runtime import Core
from torch.jit import TracerWarning
from torchmetrics import F1

set_log_level(logging.ERROR)  # Disables all NNCF info and warning messages

sys.path.append("../utils")
from models.custom_segmentation import SegmentationModel
from notebook_utils import NotebookAlert, benchmark_model, download_file, show_live_inference

try:
    import subprocess

    from nncf import NNCFConfig
    from nncf.torch import create_compressed_model, register_default_init_args
except subprocess.CalledProcessError:
    message = "WARNING: Running this notebook requires an x64 C++ compiler."
    NotebookAlert(message=message, alert_class="warning")
    raise
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
Settings¶
By default, this notebook will download one CT scan from the KITS19 dataset and use that for quantization. To use the full dataset, set BASEDIR to the path of the dataset, as prepared according to the Data Preparation notebook.
BASEDIR = Path("kits19_frames_1")
# Uncomment the line below to use the full dataset, as prepared in the data preparation notebook
# BASEDIR = Path("~/kits19/kits19_frames").expanduser()
MODEL_DIR = Path("model")
MODEL_DIR.mkdir(exist_ok=True)
Load PyTorch Model¶
Download the pretrained model weights, load the PyTorch model, and load the state_dict that was saved after training. The model used in this notebook is a BasicUNet model from MONAI. We provide a pretrained checkpoint. To see how to train this model yourself, check out the training notebook.
state_dict_url = "https://github.com/helena-intel/openvino_notebooks/raw/110-nncf/notebooks/110-ct-segmentation-quantize/pretrained_model/unet_kits19_state_dict.pth"
state_dict_file = download_file(state_dict_url, directory="pretrained_model")
state_dict = torch.load(state_dict_file, map_location=torch.device("cpu"))

new_state_dict = {}
for k, v in state_dict.items():
    new_key = k.replace("_model.", "")
    new_state_dict[new_key] = v
new_state_dict.pop("loss_function.pos_weight")

model = monai.networks.nets.BasicUNet(spatial_dims=2, in_channels=1, out_channels=1).eval()
model.load_state_dict(new_state_dict)
pretrained_model/unet_kits19_state_dict.pth: 0%| | 0.00/7.58M [00:00<?, ?B/s]
BasicUNet features: (32, 32, 64, 128, 256, 32).
<All keys matched successfully>
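As a quick sanity check, the loaded model can be run on a random input tensor. This is a minimal sketch, not part of the original workflow; the expected output shape follows from the model configuration (one input channel, one output channel, a 512x512 input).

# Sanity check (a sketch): one forward pass on random data to confirm the model
# loads correctly and returns a single-channel mask with the input's spatial size
with torch.no_grad():
    dummy = torch.randn(1, 1, 512, 512)
    print(model(dummy).shape)  # expected: torch.Size([1, 1, 512, 512])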
We export the PyTorch model to ONNX and convert the ONNX model to OpenVINO IR, so that we can compare the performance of the FP32 and INT8 models later in this notebook.
dummy_input = torch.randn(1, 1, 512, 512)
fp32_onnx_path = MODEL_DIR / "unet_kits19_fp32.onnx"
torch.onnx.export(model, dummy_input, fp32_onnx_path)
!mo --input_model "$fp32_onnx_path" --output_dir $MODEL_DIR
Model Optimizer arguments:
Common parameters:
- Path to the Input Model: /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-275/.workspace/scm/ov-notebook/notebooks/110-ct-segmentation-quantize/model/unet_kits19_fp32.onnx
- Path for generated IR: /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-275/.workspace/scm/ov-notebook/notebooks/110-ct-segmentation-quantize/model
- IR output name: unet_kits19_fp32
- Log level: ERROR
- Batch: Not specified, inherited from the model
- Input layers: Not specified, inherited from the model
- Output layers: Not specified, inherited from the model
- Input shapes: Not specified, inherited from the model
- Source layout: Not specified
- Target layout: Not specified
- Layout: Not specified
- Mean values: Not specified
- Scale values: Not specified
- Scale factor: Not specified
- Precision of IR: FP32
- Enable fusing: True
- User transformations: Not specified
- Reverse input channels: False
- Enable IR generation for fixed input shape: False
- Use the transformations config file: None
Advanced parameters:
- Force the usage of legacy Frontend of Model Optimizer for model conversion into IR: False
- Force the usage of new Frontend of Model Optimizer for model conversion into IR: False
OpenVINO runtime found in: /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-275/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/openvino
OpenVINO runtime version: 2022.2.0-7713-af16ea1d79a-releases/2022/2
Model Optimizer version: 2022.2.0-7713-af16ea1d79a-releases/2022/2
[ SUCCESS ] Generated IR version 11 model.
[ SUCCESS ] XML file: /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-275/.workspace/scm/ov-notebook/notebooks/110-ct-segmentation-quantize/model/unet_kits19_fp32.xml
[ SUCCESS ] BIN file: /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-275/.workspace/scm/ov-notebook/notebooks/110-ct-segmentation-quantize/model/unet_kits19_fp32.bin
[ SUCCESS ] Total execution time: 0.39 seconds.
[ SUCCESS ] Memory consumed: 87 MB.
[ INFO ] The model was converted to IR v11, the latest model format that corresponds to the source DL framework input/output format. While IR v11 is backwards compatible with OpenVINO Inference Engine API v1.0, please use API v2.0 (as of 2022.1) to take advantage of the latest improvements in IR v11.
Find more information about API v2.0 and IR v11 at https://docs.openvino.ai
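To confirm that the conversion succeeded, the generated IR can be read back with OpenVINO Runtime and its input shape inspected. This is a minimal sketch; the Core instance is created locally just for the check.

# Sanity check (a sketch): read the generated IR and inspect the model input
core = Core()
ir_model = core.read_model(MODEL_DIR / "unet_kits19_fp32.xml")
print(ir_model.input(0).partial_shape)  # expected: [1,1,512,512]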
Download CT-scan Data¶
# The CT scan case number. For example: 2 for data from the case_00002 directory
# Currently only 117 is supported
CASE = 117

if not (BASEDIR / f"case_{CASE:05d}").exists():
    BASEDIR.mkdir(exist_ok=True)
    filename = download_file(
        f"https://storage.openvinotoolkit.org/data/test_data/openvino_notebooks/kits19/case_{CASE:05d}.zip"
    )
    with zipfile.ZipFile(filename, "r") as zip_ref:
        zip_ref.extractall(path=BASEDIR)
    os.remove(filename)  # remove zipfile
    print(f"Downloaded and extracted data for case_{CASE:05d}")
else:
    print(f"Data for case_{CASE:05d} exists")
case_00117.zip: 0%| | 0.00/5.48M [00:00<?, ?B/s]
Downloaded and extracted data for case_00117
Configuration¶
Dataset¶
The KitsDataset class in the next cell expects images and masks in the basedir directory, in a folder per patient. It is a simplified version of the DataSet class in the training notebook.
Images are loaded with MONAI's LoadImage (https://docs.monai.io/en/stable/transforms.html#loadimage), to align with the image loading method in the training notebook. This method rotates and flips the images. We define a rotate_and_flip method to display the images in the expected orientation.
def rotate_and_flip(image):
    """Rotate `image` by 90 degrees and flip horizontally"""
    return cv2.flip(cv2.rotate(image, rotateCode=cv2.ROTATE_90_CLOCKWISE), flipCode=1)


class KitsDataset:
    def __init__(self, basedir: str):
        """
        Dataset class for prepared Kits19 data, for binary segmentation (background/kidney)
        Source data should exist in basedir, in subdirectories case_00000 until case_00210,
        with each subdirectory containing directories imaging_frames, with jpg images, and
        segmentation_frames with segmentation masks as png files.
        See https://github.com/openvinotoolkit/openvino_notebooks/blob/main/notebooks/110-ct-segmentation-quantize/data-preparation-ct-scan.ipynb

        :param basedir: Directory that contains the prepared CT scans
        """
        masks = sorted(Path(basedir).glob("case_*/segmentation_frames/*png"))
        self.basedir = basedir
        self.dataset = masks
        print(
            f"Created dataset with {len(self.dataset)} items. "
            f"Base directory for data: {basedir}"
        )

    def __getitem__(self, index):
        """
        Get an item from the dataset at the specified index.

        :return: (image, segmentation_mask)
        """
        mask_path = self.dataset[index]
        image_path = str(mask_path.with_suffix(".jpg")).replace(
            "segmentation_frames", "imaging_frames"
        )

        # Load images with MONAI's LoadImage to match data loading in training notebook
        mask = LoadImage(image_only=True, dtype=np.uint8)(str(mask_path)).numpy()
        img = LoadImage(image_only=True, dtype=np.float32)(str(image_path)).numpy()

        if img.shape[:2] != (512, 512):
            img = cv2.resize(img.astype(np.uint8), (512, 512)).astype(np.float32)
            mask = cv2.resize(mask, (512, 512))

        input_image = np.expand_dims(img, axis=0)
        return input_image, mask

    def __len__(self):
        return len(self.dataset)
To test that the data loader returns the expected output, we show an image and a mask. The image and mask are shown as returned by the dataloader, after resizing and preprocessing. Since this dataset contains a lot of slices without kidneys, we select a slice that contains at least 5000 kidney pixels to verify that the annotations look correct.
dataset = KitsDataset(BASEDIR)

# Find a slice that contains kidney annotations
# Dataset items are (image, mask) tuples; item[1] is the segmentation mask
image_data, mask = next(item for item in dataset if np.count_nonzero(item[1]) > 5000)

# Remove the extra image dimension and rotate and flip the image for visualization
image = rotate_and_flip(image_data.squeeze())

# The mask is returned in shape (H,W); rotate and flip it the same way
mask = rotate_and_flip(mask)

fig, ax = plt.subplots(1, 2, figsize=(12, 6))
ax[0].imshow(image, cmap="gray")
ax[1].imshow(mask, cmap="gray");
Created dataset with 69 items. Base directory for data: kits19_frames_1

Metric¶
Define a metric to determine the performance of the model.
For this demo, we use the F1 score, or Dice coefficient, from the TorchMetrics library.
def compute_f1(model: torch.nn.Module, dataset: KitsDataset):
    """
    Compute binary F1 score of `model` on `dataset`
    F1 score metric is provided by the torchmetrics library
    `model` is expected to be a binary segmentation model, images in the
    dataset are expected in (N,C,H,W) format where N==C==1
    """
    metric = F1(ignore_index=0)
    with torch.no_grad():
        for image, target in dataset:
            input_image = torch.as_tensor(image).unsqueeze(0)
            output = model(input_image)
            label = torch.as_tensor(target.squeeze()).long()
            prediction = torch.sigmoid(output.squeeze()).round().long()
            # torchmetrics expects (preds, target)
            metric.update(prediction.flatten(), label.flatten())
    return metric.compute()
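For binary masks, the F1 score coincides with the Dice coefficient Dice = 2|A∩B| / (|A| + |B|), since precision and recall enter F1 symmetrically. A toy check on a hand-made prediction and target (illustrative values only, not notebook data):

# Toy example (a sketch): binary F1 equals Dice for the same prediction/target pair
pred = np.array([1, 1, 0, 0, 1])
target = np.array([1, 0, 0, 1, 1])
intersection = (pred * target).sum()
dice = 2 * intersection / (pred.sum() + target.sum())
precision, recall = intersection / pred.sum(), intersection / target.sum()
f1 = 2 * precision * recall / (precision + recall)
print(f"Dice: {dice:.3f}, F1: {f1:.3f}")  # both 0.667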
Quantization¶
Before quantizing the model, we compute the F1 score on the FP32 model for comparison.
fp32_f1 = compute_f1(model, dataset)
print(f"FP32 F1: {fp32_f1:.3f}")
FP32 F1: 0.974
The NNCF configuration can be defined in a JSON file or a dictionary. See the NNCF quantization documentation for more information on the possible values.
# NNCF uses the model loaded at the beginning of this notebook. If, after quantizing the model, you
# want to quantize again with a different config, reload the model by uncommenting the next two lines
#
# model = monai.networks.nets.BasicUNet(spatial_dims=2, in_channels=1, out_channels=1).eval()
# model.load_state_dict(new_state_dict)

nncf_config_dict = {
    "input_info": {"sample_size": [1, 1, 512, 512]},
    "target_device": "CPU",
    "compression": {
        "algorithm": "quantization",
        # The performance preset uses symmetric weights and activations
        "preset": "performance",
        # Do not quantize LeakyReLU activations to allow the INT8 model to run on Intel GPU
        "ignored_scopes": ["{re}.*LeakyReLU*"],
    },
}

nncf_config = NNCFConfig.from_dict(nncf_config_dict)
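The same configuration could equally be kept in a JSON file and loaded with NNCFConfig.from_json. A minimal sketch, where the file name nncf_quantization_config.json is an arbitrary choice:

# Alternative (a sketch): store the NNCF configuration as JSON and load it back
import json

with open("nncf_quantization_config.json", "w") as f:
    json.dump(nncf_config_dict, f, indent=4)
nncf_config = NNCFConfig.from_json("nncf_quantization_config.json")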
Create a quantized model from the pretrained FP32 model, the configuration object created from the dictionary defined in the previous cell, and a DataLoader. See the NNCF documentation for more information.
data_loader = torch.utils.data.DataLoader(dataset, batch_size=4)
nncf_config = register_default_init_args(nncf_config, data_loader)
compression_ctrl, compressed_model = create_compressed_model(model, nncf_config)
The compressed_model that was created in the previous cell is a PyTorch nn.Module that is wrapped by NNCF.
compressed_model
NNCFNetwork(
(nncf_module): BasicUNet(
(conv_0): TwoConv(
(conv_0): Convolution(
(conv): NNCFConv2d(
1, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)
(pre_ops): ModuleDict(
(0): UpdateWeight(
(op): SymmetricQuantizer(bit=8, ch=True)
)
)
(post_ops): ModuleDict()
)
(adn): ADN(
(N): InstanceNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
(D): Dropout(p=0.0, inplace=False)
(A): LeakyReLU(negative_slope=0.1, inplace=True)
)
)
(conv_1): Convolution(
(conv): NNCFConv2d(
32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)
(pre_ops): ModuleDict(
(0): UpdateWeight(
(op): SymmetricQuantizer(bit=8, ch=True)
)
)
(post_ops): ModuleDict()
)
(adn): ADN(
(N): InstanceNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
(D): Dropout(p=0.0, inplace=False)
(A): LeakyReLU(negative_slope=0.1, inplace=True)
)
)
)
(down_1): Down(
(max_pooling): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(convs): TwoConv(
(conv_0): Convolution(
(conv): NNCFConv2d(
32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)
(pre_ops): ModuleDict(
(0): UpdateWeight(
(op): SymmetricQuantizer(bit=8, ch=True)
)
)
(post_ops): ModuleDict()
)
(adn): ADN(
(N): InstanceNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
(D): Dropout(p=0.0, inplace=False)
(A): LeakyReLU(negative_slope=0.1, inplace=True)
)
)
(conv_1): Convolution(
(conv): NNCFConv2d(
32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)
(pre_ops): ModuleDict(
(0): UpdateWeight(
(op): SymmetricQuantizer(bit=8, ch=True)
)
)
(post_ops): ModuleDict()
)
(adn): ADN(
(N): InstanceNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
(D): Dropout(p=0.0, inplace=False)
(A): LeakyReLU(negative_slope=0.1, inplace=True)
)
)
)
)
(down_2): Down(
(max_pooling): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(convs): TwoConv(
(conv_0): Convolution(
(conv): NNCFConv2d(
32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)
(pre_ops): ModuleDict(
(0): UpdateWeight(
(op): SymmetricQuantizer(bit=8, ch=True)
)
)
(post_ops): ModuleDict()
)
(adn): ADN(
(N): InstanceNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
(D): Dropout(p=0.0, inplace=False)
(A): LeakyReLU(negative_slope=0.1, inplace=True)
)
)
(conv_1): Convolution(
(conv): NNCFConv2d(
64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)
(pre_ops): ModuleDict(
(0): UpdateWeight(
(op): SymmetricQuantizer(bit=8, ch=True)
)
)
(post_ops): ModuleDict()
)
(adn): ADN(
(N): InstanceNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
(D): Dropout(p=0.0, inplace=False)
(A): LeakyReLU(negative_slope=0.1, inplace=True)
)
)
)
)
(down_3): Down(
(max_pooling): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(convs): TwoConv(
(conv_0): Convolution(
(conv): NNCFConv2d(
64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)
(pre_ops): ModuleDict(
(0): UpdateWeight(
(op): SymmetricQuantizer(bit=8, ch=True)
)
)
(post_ops): ModuleDict()
)
(adn): ADN(
(N): InstanceNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
(D): Dropout(p=0.0, inplace=False)
(A): LeakyReLU(negative_slope=0.1, inplace=True)
)
)
(conv_1): Convolution(
(conv): NNCFConv2d(
128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)
(pre_ops): ModuleDict(
(0): UpdateWeight(
(op): SymmetricQuantizer(bit=8, ch=True)
)
)
(post_ops): ModuleDict()
)
(adn): ADN(
(N): InstanceNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
(D): Dropout(p=0.0, inplace=False)
(A): LeakyReLU(negative_slope=0.1, inplace=True)
)
)
)
)
(down_4): Down(
(max_pooling): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(convs): TwoConv(
(conv_0): Convolution(
(conv): NNCFConv2d(
128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)
(pre_ops): ModuleDict(
(0): UpdateWeight(
(op): SymmetricQuantizer(bit=8, ch=True)
)
)
(post_ops): ModuleDict()
)
(adn): ADN(
(N): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
(D): Dropout(p=0.0, inplace=False)
(A): LeakyReLU(negative_slope=0.1, inplace=True)
)
)
(conv_1): Convolution(
(conv): NNCFConv2d(
256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)
(pre_ops): ModuleDict(
(0): UpdateWeight(
(op): SymmetricQuantizer(bit=8, ch=True)
)
)
(post_ops): ModuleDict()
)
(adn): ADN(
(N): InstanceNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
(D): Dropout(p=0.0, inplace=False)
(A): LeakyReLU(negative_slope=0.1, inplace=True)
)
)
)
)
(upcat_4): UpCat(
(upsample): UpSample(
(deconv): NNCFConvTranspose2d(
256, 128, kernel_size=(2, 2), stride=(2, 2)
(pre_ops): ModuleDict(
(0): UpdateWeight(
(op): SymmetricQuantizer(bit=8, ch=True)
)
)
(post_ops): ModuleDict()
)
)
(convs): TwoConv(
(conv_0): Convolution(
(conv): NNCFConv2d(
256, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)
(pre_ops): ModuleDict(
(0): UpdateWeight(
(op): SymmetricQuantizer(bit=8, ch=True)
)
)
(post_ops): ModuleDict()
)
(adn): ADN(
(N): InstanceNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
(D): Dropout(p=0.0, inplace=False)
(A): LeakyReLU(negative_slope=0.1, inplace=True)
)
)
(conv_1): Convolution(
(conv): NNCFConv2d(
128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)
(pre_ops): ModuleDict(
(0): UpdateWeight(
(op): SymmetricQuantizer(bit=8, ch=True)
)
)
(post_ops): ModuleDict()
)
(adn): ADN(
(N): InstanceNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
(D): Dropout(p=0.0, inplace=False)
(A): LeakyReLU(negative_slope=0.1, inplace=True)
)
)
)
)
(upcat_3): UpCat(
(upsample): UpSample(
(deconv): NNCFConvTranspose2d(
128, 64, kernel_size=(2, 2), stride=(2, 2)
(pre_ops): ModuleDict(
(0): UpdateWeight(
(op): SymmetricQuantizer(bit=8, ch=True)
)
)
(post_ops): ModuleDict()
)
)
(convs): TwoConv(
(conv_0): Convolution(
(conv): NNCFConv2d(
128, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)
(pre_ops): ModuleDict(
(0): UpdateWeight(
(op): SymmetricQuantizer(bit=8, ch=True)
)
)
(post_ops): ModuleDict()
)
(adn): ADN(
(N): InstanceNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
(D): Dropout(p=0.0, inplace=False)
(A): LeakyReLU(negative_slope=0.1, inplace=True)
)
)
(conv_1): Convolution(
(conv): NNCFConv2d(
64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)
(pre_ops): ModuleDict(
(0): UpdateWeight(
(op): SymmetricQuantizer(bit=8, ch=True)
)
)
(post_ops): ModuleDict()
)
(adn): ADN(
(N): InstanceNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
(D): Dropout(p=0.0, inplace=False)
(A): LeakyReLU(negative_slope=0.1, inplace=True)
)
)
)
)
(upcat_2): UpCat(
(upsample): UpSample(
(deconv): NNCFConvTranspose2d(
64, 32, kernel_size=(2, 2), stride=(2, 2)
(pre_ops): ModuleDict(
(0): UpdateWeight(
(op): SymmetricQuantizer(bit=8, ch=True)
)
)
(post_ops): ModuleDict()
)
)
(convs): TwoConv(
(conv_0): Convolution(
(conv): NNCFConv2d(
64, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)
(pre_ops): ModuleDict(
(0): UpdateWeight(
(op): SymmetricQuantizer(bit=8, ch=True)
)
)
(post_ops): ModuleDict()
)
(adn): ADN(
(N): InstanceNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
(D): Dropout(p=0.0, inplace=False)
(A): LeakyReLU(negative_slope=0.1, inplace=True)
)
)
(conv_1): Convolution(
(conv): NNCFConv2d(
32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)
(pre_ops): ModuleDict(
(0): UpdateWeight(
(op): SymmetricQuantizer(bit=8, ch=True)
)
)
(post_ops): ModuleDict()
)
(adn): ADN(
(N): InstanceNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
(D): Dropout(p=0.0, inplace=False)
(A): LeakyReLU(negative_slope=0.1, inplace=True)
)
)
)
)
(upcat_1): UpCat(
(upsample): UpSample(
(deconv): NNCFConvTranspose2d(
32, 32, kernel_size=(2, 2), stride=(2, 2)
(pre_ops): ModuleDict(
(0): UpdateWeight(
(op): SymmetricQuantizer(bit=8, ch=True)
)
)
(post_ops): ModuleDict()
)
)
(convs): TwoConv(
(conv_0): Convolution(
(conv): NNCFConv2d(
64, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)
(pre_ops): ModuleDict(
(0): UpdateWeight(
(op): SymmetricQuantizer(bit=8, ch=True)
)
)
(post_ops): ModuleDict()
)
(adn): ADN(
(N): InstanceNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
(D): Dropout(p=0.0, inplace=False)
(A): LeakyReLU(negative_slope=0.1, inplace=True)
)
)
(conv_1): Convolution(
(conv): NNCFConv2d(
32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)
(pre_ops): ModuleDict(
(0): UpdateWeight(
(op): SymmetricQuantizer(bit=8, ch=True)
)
)
(post_ops): ModuleDict()
)
(adn): ADN(
(N): InstanceNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=False)
(D): Dropout(p=0.0, inplace=False)
(A): LeakyReLU(negative_slope=0.1, inplace=True)
)
)
)
)
(final_conv): NNCFConv2d(
32, 1, kernel_size=(1, 1), stride=(1, 1)
(pre_ops): ModuleDict(
(0): UpdateWeight(
(op): SymmetricQuantizer(bit=8, ch=True)
)
)
(post_ops): ModuleDict()
)
)
(external_quantizers): ModuleDict(
(/nncf_model_input_0|OUTPUT): SymmetricQuantizer(bit=8, ch=False)
(BasicUNet/Down[down_1]/TwoConv[convs]/Convolution[conv_0]/ADN[adn]/LeakyReLU[A]/leaky_relu_0|OUTPUT): SymmetricQuantizer(bit=8, ch=False)
(BasicUNet/Down[down_1]/TwoConv[convs]/Convolution[conv_1]/ADN[adn]/LeakyReLU[A]/leaky_relu_0|OUTPUT;BasicUNet/UpCat[upcat_2]/UpSample[upsample]/NNCFConvTranspose2d[deconv]/conv_transpose2d_0|OUTPUT): SymmetricQuantizer(bit=8, ch=False)
(BasicUNet/Down[down_2]/TwoConv[convs]/Convolution[conv_0]/ADN[adn]/LeakyReLU[A]/leaky_relu_0|OUTPUT): SymmetricQuantizer(bit=8, ch=False)
(BasicUNet/Down[down_2]/TwoConv[convs]/Convolution[conv_1]/ADN[adn]/LeakyReLU[A]/leaky_relu_0|OUTPUT;BasicUNet/UpCat[upcat_3]/UpSample[upsample]/NNCFConvTranspose2d[deconv]/conv_transpose2d_0|OUTPUT): SymmetricQuantizer(bit=8, ch=False)
(BasicUNet/Down[down_3]/TwoConv[convs]/Convolution[conv_0]/ADN[adn]/LeakyReLU[A]/leaky_relu_0|OUTPUT): SymmetricQuantizer(bit=8, ch=False)
(BasicUNet/Down[down_3]/TwoConv[convs]/Convolution[conv_1]/ADN[adn]/LeakyReLU[A]/leaky_relu_0|OUTPUT;BasicUNet/UpCat[upcat_4]/UpSample[upsample]/NNCFConvTranspose2d[deconv]/conv_transpose2d_0|OUTPUT): SymmetricQuantizer(bit=8, ch=False)
(BasicUNet/Down[down_4]/TwoConv[convs]/Convolution[conv_0]/ADN[adn]/LeakyReLU[A]/leaky_relu_0|OUTPUT): SymmetricQuantizer(bit=8, ch=False)
(BasicUNet/Down[down_4]/TwoConv[convs]/Convolution[conv_1]/ADN[adn]/LeakyReLU[A]/leaky_relu_0|OUTPUT): SymmetricQuantizer(bit=8, ch=False)
(BasicUNet/TwoConv[conv_0]/Convolution[conv_0]/ADN[adn]/LeakyReLU[A]/leaky_relu_0|OUTPUT): SymmetricQuantizer(bit=8, ch=False)
(BasicUNet/TwoConv[conv_0]/Convolution[conv_1]/ADN[adn]/LeakyReLU[A]/leaky_relu_0|OUTPUT;BasicUNet/UpCat[upcat_1]/UpSample[upsample]/NNCFConvTranspose2d[deconv]/conv_transpose2d_0|OUTPUT): SymmetricQuantizer(bit=8, ch=False)
(BasicUNet/UpCat[upcat_1]/TwoConv[convs]/Convolution[conv_0]/ADN[adn]/LeakyReLU[A]/leaky_relu_0|OUTPUT): SymmetricQuantizer(bit=8, ch=False)
(BasicUNet/UpCat[upcat_1]/TwoConv[convs]/Convolution[conv_1]/ADN[adn]/LeakyReLU[A]/leaky_relu_0|OUTPUT): SymmetricQuantizer(bit=8, ch=False)
(BasicUNet/UpCat[upcat_2]/TwoConv[convs]/Convolution[conv_0]/ADN[adn]/LeakyReLU[A]/leaky_relu_0|OUTPUT): SymmetricQuantizer(bit=8, ch=False)
(BasicUNet/UpCat[upcat_2]/TwoConv[convs]/Convolution[conv_1]/ADN[adn]/LeakyReLU[A]/leaky_relu_0|OUTPUT): SymmetricQuantizer(bit=8, ch=False)
(BasicUNet/UpCat[upcat_3]/TwoConv[convs]/Convolution[conv_0]/ADN[adn]/LeakyReLU[A]/leaky_relu_0|OUTPUT): SymmetricQuantizer(bit=8, ch=False)
(BasicUNet/UpCat[upcat_3]/TwoConv[convs]/Convolution[conv_1]/ADN[adn]/LeakyReLU[A]/leaky_relu_0|OUTPUT): SymmetricQuantizer(bit=8, ch=False)
(BasicUNet/UpCat[upcat_4]/TwoConv[convs]/Convolution[conv_0]/ADN[adn]/LeakyReLU[A]/leaky_relu_0|OUTPUT): SymmetricQuantizer(bit=8, ch=False)
(BasicUNet/UpCat[upcat_4]/TwoConv[convs]/Convolution[conv_1]/ADN[adn]/LeakyReLU[A]/leaky_relu_0|OUTPUT): SymmetricQuantizer(bit=8, ch=False)
)
)
In this notebook we demonstrate post-training quantization with NNCF. To use the quantized model for faster inference with OpenVINO, we export the model to ONNX, and convert the ONNX model to OpenVINO’s IR format.
NNCF also supports quantization-aware training, as well as algorithms other than quantization. See the NNCF documentation in the NNCF repository for more information.
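As an illustration of the quantization-aware training workflow, a single training epoch on the compressed model might look roughly like the sketch below. This function is deliberately not called in this notebook (calling it would change the quantized weights); the optimizer, learning rate, and loss function are arbitrary choices for illustration, not the settings used to train the original model.

def qat_epoch_sketch(compressed_model, compression_ctrl, data_loader):
    """One quantization-aware training epoch, in outline (a sketch; not executed here)."""
    optimizer = torch.optim.Adam(compressed_model.parameters(), lr=1e-4)
    criterion = torch.nn.BCEWithLogitsLoss()
    compression_ctrl.scheduler.epoch_step()  # advance the NNCF compression scheduler
    for images, targets in data_loader:
        optimizer.zero_grad()
        outputs = compressed_model(images)
        # Task loss plus the compression loss term provided by the NNCF controller
        loss = criterion(outputs.squeeze(1), targets.float()) + compression_ctrl.loss()
        loss.backward()
        optimizer.step()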
int8_onnx_path = MODEL_DIR / "unet_kits19_int8.onnx"
warnings.filterwarnings("ignore", category=TracerWarning) # Ignore export warnings
warnings.filterwarnings("ignore", category=UserWarning)
compression_ctrl.export_model(str(int8_onnx_path))
print(f"INT8 ONNX model exported to {int8_onnx_path}.")
!mo --input_model "$int8_onnx_path" --input_shape "[1,1,512,512]" --output_dir "$MODEL_DIR"
INT8 ONNX model exported to model/unet_kits19_int8.onnx.
Model Optimizer arguments:
Common parameters:
- Path to the Input Model: /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-275/.workspace/scm/ov-notebook/notebooks/110-ct-segmentation-quantize/model/unet_kits19_int8.onnx
- Path for generated IR: /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-275/.workspace/scm/ov-notebook/notebooks/110-ct-segmentation-quantize/model
- IR output name: unet_kits19_int8
- Log level: ERROR
- Batch: Not specified, inherited from the model
- Input layers: Not specified, inherited from the model
- Output layers: Not specified, inherited from the model
- Input shapes: [1,1,512,512]
- Source layout: Not specified
- Target layout: Not specified
- Layout: Not specified
- Mean values: Not specified
- Scale values: Not specified
- Scale factor: Not specified
- Precision of IR: FP32
- Enable fusing: True
- User transformations: Not specified
- Reverse input channels: False
- Enable IR generation for fixed input shape: False
- Use the transformations config file: None
Advanced parameters:
- Force the usage of legacy Frontend of Model Optimizer for model conversion into IR: False
- Force the usage of new Frontend of Model Optimizer for model conversion into IR: False
OpenVINO runtime found in: /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-275/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/openvino
OpenVINO runtime version: 2022.2.0-7713-af16ea1d79a-releases/2022/2
Model Optimizer version: 2022.2.0-7713-af16ea1d79a-releases/2022/2
[ SUCCESS ] Generated IR version 11 model.
[ SUCCESS ] XML file: /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-275/.workspace/scm/ov-notebook/notebooks/110-ct-segmentation-quantize/model/unet_kits19_int8.xml
[ SUCCESS ] BIN file: /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-275/.workspace/scm/ov-notebook/notebooks/110-ct-segmentation-quantize/model/unet_kits19_int8.bin
[ SUCCESS ] Total execution time: 0.49 seconds.
[ SUCCESS ] Memory consumed: 92 MB.
[ INFO ] The model was converted to IR v11, the latest model format that corresponds to the source DL framework input/output format. While IR v11 is backwards compatible with OpenVINO Inference Engine API v1.0, please use API v2.0 (as of 2022.1) to take advantage of the latest improvements in IR v11.
Find more information about API v2.0 and IR v11 at https://docs.openvino.ai
Compare FP32 and INT8 Model¶
Compare File Size¶
fp32_ir_path = Path(fp32_onnx_path).with_suffix(".xml")
int8_ir_path = Path(int8_onnx_path).with_suffix(".xml")
original_model_size = fp32_ir_path.with_suffix(".bin").stat().st_size / 1024
quantized_model_size = int8_ir_path.with_suffix(".bin").stat().st_size / 1024
print(f"FP32 model size: {original_model_size:.2f} KB")
print(f"INT8 model size: {quantized_model_size:.2f} KB")
FP32 model size: 7728.27 KB
INT8 model size: 1953.49 KB
Compare Metrics¶
int8_f1 = compute_f1(compressed_model, dataset)
print(f"FP32 F1: {fp32_f1:.3f}")
print(f"INT8 F1: {int8_f1:.3f}")
FP32 F1: 0.974
INT8 F1: 0.967
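A quick programmatic check makes the accuracy impact explicit. This is a sketch; the 2% tolerance below is an arbitrary illustrative threshold, not a recommendation.

# Sanity check (a sketch): quantify the relative F1 drop caused by quantization
relative_drop = (fp32_f1 - int8_f1) / fp32_f1
print(f"Relative F1 drop after quantization: {relative_drop:.2%}")
assert relative_drop < 0.02, "INT8 F1 dropped more than 2% relative to FP32"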
Compare Performance of the Original and Quantized Models¶
To measure the inference performance of the FP32 and INT8 models, we use Benchmark Tool, OpenVINO's inference performance measurement tool. Benchmark Tool is a command-line application, part of the OpenVINO Development Tools, that can be run in the notebook with ! benchmark_app or %sx benchmark_app.
In this tutorial, we use a wrapper function from Notebook Utils. It prints the benchmark_app command with the chosen parameters.
NOTE: For the most accurate performance estimation, we recommend running benchmark_app in a terminal/command prompt after closing other applications. Run benchmark_app --help to see all command line options.
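For example, the command below (assuming the model paths used in this notebook) benchmarks the INT8 model from a terminal in latency-oriented synchronous mode:

benchmark_app -m model/unet_kits19_int8.xml -d CPU -t 15 -api sync -b 1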
# Show the parameters and docstring for `benchmark_model`
benchmark_model?
device = "CPU"
# Benchmark FP32 model
benchmark_model(model_path=fp32_ir_path, device=device, seconds=15)
Benchmark unet_kits19_fp32.xml with CPU for 15 seconds with async inference
Benchmark command:
benchmark_app -m model/unet_kits19_fp32.xml -d CPU -t 15 -api async -b 1 -cdir model_cache
Count: 462 iterations
Duration: 15287.59 ms
Latency:
Median: 197.02 ms
AVG: 197.82 ms
MIN: 146.06 ms
MAX: 288.85 ms
Throughput: 30.22 FPS
Device: Intel(R) Core(TM) i9-10920X CPU @ 3.50GHz
# Benchmark INT8 model
benchmark_model(model_path=int8_ir_path, device=device, seconds=15)
Benchmark unet_kits19_int8.xml with CPU for 15 seconds with async inference
Benchmark command:
benchmark_app -m model/unet_kits19_int8.xml -d CPU -t 15 -api async -b 1 -cdir model_cache
Count: 780 iterations
Duration: 15172.01 ms
Latency:
Median: 117.11 ms
AVG: 116.31 ms
MIN: 98.11 ms
MAX: 138.50 ms
Throughput: 51.41 FPS
Device: Intel(R) Core(TM) i9-10920X CPU @ 3.50GHz
Visually Compare Inference Results¶
Visualize the results of the model on four slices of the validation set. Compare the results of the FP32 IR model with the results of the quantized INT8 model and the reference segmentation annotation.
Medical imaging datasets tend to be very imbalanced: most of the slices in a CT scan do not contain kidney data. The segmentation model should be good at finding kidneys where they exist (in medical terms: have good sensitivity), but also not find spurious kidneys that do not exist (have good specificity). In the next cell, we show four slices: two slices that have no kidney data, and two slices that contain kidney data. For this example, a slice has kidney data if at least 50 pixels in the slice are annotated as kidney.
Run this cell again to show results on a different subset. The random seed is displayed to allow reproducing specific runs of this cell.
Note: the images are shown after optional augmenting and resizing. In the Kits19 dataset, all but one of the cases has an input shape of (512, 512).
# The sigmoid function is used to transform the result of the network
# to binary segmentation masks
def sigmoid(x):
    return np.exp(-np.logaddexp(0, -x))


num_images = 4
colormap = "gray"

# Load FP32 and INT8 models
core = Core()
fp_model = core.read_model(fp32_ir_path)
int8_model = core.read_model(int8_ir_path)
compiled_model_fp = core.compile_model(fp_model, device_name="CPU")
compiled_model_int8 = core.compile_model(int8_model, device_name="CPU")
output_layer_fp = compiled_model_fp.output(0)
output_layer_int8 = compiled_model_int8.output(0)

# Set seed to current time. To reproduce specific results, copy the printed seed
# and manually set `seed` to that value. The seed is set before sampling so that
# the chosen subset is reproducible.
seed = int(time.time())
random.seed(seed)
print(f"Visualizing results with seed {seed}")

# Create subset of dataset
background_slices = (item for item in dataset if np.count_nonzero(item[1]) == 0)
kidney_slices = (item for item in dataset if np.count_nonzero(item[1]) > 50)
data_subset = random.sample(list(background_slices), 2) + random.sample(list(kidney_slices), 2)

fig, ax = plt.subplots(nrows=num_images, ncols=4, figsize=(24, num_images * 4))
for i, (image, mask) in enumerate(data_subset):
    display_image = rotate_and_flip(image.squeeze())
    target_mask = rotate_and_flip(mask).astype(np.uint8)
    # Add batch dimension to image and do inference on FP and INT8 models
    input_image = np.expand_dims(image, 0)
    res_fp = compiled_model_fp([input_image])
    res_int8 = compiled_model_int8([input_image])

    # Process inference outputs and convert to binary segmentation masks
    result_mask_fp = sigmoid(res_fp[output_layer_fp]).squeeze().round().astype(np.uint8)
    result_mask_int8 = sigmoid(res_int8[output_layer_int8]).squeeze().round().astype(np.uint8)
    result_mask_fp = rotate_and_flip(result_mask_fp)
    result_mask_int8 = rotate_and_flip(result_mask_int8)

    # Display images, annotations, FP32 result and INT8 result
    ax[i, 0].imshow(display_image, cmap=colormap)
    ax[i, 1].imshow(target_mask, cmap=colormap)
    ax[i, 2].imshow(result_mask_fp, cmap=colormap)
    ax[i, 3].imshow(result_mask_int8, cmap=colormap)
    ax[i, 2].set_title("Prediction on FP32 model")
    ax[i, 3].set_title("Prediction on INT8 model")
Visualizing results with seed 1668547890

Show Live Inference¶
To show live inference on the model in the notebook, we use the asynchronous processing feature of OpenVINO.

We use the show_live_inference function from Notebook Utils to show live inference. This function uses Open Model Zoo's AsyncPipeline and Model API to perform asynchronous inference. After inference on the specified CT scan has completed, the total time and throughput (fps), including preprocessing and displaying, will be printed.
NOTE: you may experience flickering on Firefox. Please consider using Chrome or Edge to run this notebook.
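For reference, the core idea behind asynchronous inference can also be expressed directly with OpenVINO Runtime's AsyncInferQueue. The sketch below is illustrative only and is not the mechanism show_live_inference uses; the number of jobs and the four-slice loop are arbitrary choices.

# Minimal async inference sketch with AsyncInferQueue (not used by show_live_inference)
from openvino.runtime import AsyncInferQueue

async_core = Core()
compiled = async_core.compile_model(async_core.read_model(int8_ir_path), device_name="CPU")
infer_queue = AsyncInferQueue(compiled, jobs=2)  # two infer requests run in parallel
results = {}

def completion_callback(request, frame_id):
    # Store the finished request's output, keyed by the frame index passed as userdata
    results[frame_id] = request.get_output_tensor(0).data.copy()

infer_queue.set_callback(completion_callback)
for i in range(4):
    image, _ = dataset[i]
    infer_queue.start_async({0: np.expand_dims(image, 0)}, userdata=i)
infer_queue.wait_all()
print(f"Completed {len(results)} asynchronous inference requests")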
Load Model and List of Image Files¶
We load the segmentation model to OpenVINO Runtime with SegmentationModel, based on the Open Model Zoo Model API. This model implementation includes pre- and post-processing for the model. For SegmentationModel, this includes the code to create an overlay of the segmentation mask on the original image/frame.
CASE = 117

# The live inference function uses the Inference Engine API, which is compatible with
# OpenVINO LTS release 2021.4
ie = IECore()
segmentation_model = SegmentationModel(
    ie=ie, model_path=Path(int8_ir_path), sigmoid=True, rotate_and_flip=True
)
case_path = BASEDIR / f"case_{CASE:05d}"
image_paths = sorted(case_path.glob("imaging_frames/*jpg"))
print(f"{case_path.name}, {len(image_paths)} images")
case_00117, 69 images
Show Inference¶
In the next cell, we run the show_live_inference function, which loads segmentation_model to the specified device (using caching for faster model loading on GPU devices), loads the images, performs inference, and displays the results on the frames loaded in images in real time.
# Possible options for device include "CPU", "GPU", "AUTO", "MULTI:CPU,GPU"
device = "CPU"
reader = LoadImage(image_only=True, dtype=np.uint8)
show_live_inference(
    ie=ie, image_paths=image_paths, model=segmentation_model, device=device, reader=reader
)

Loaded model to CPU in 0.20 seconds.
Total time for 68 frames: 2.19 seconds, fps:31.58
References¶
OpenVINO
- NNCF Repository
- Neural Network Compression Framework for fast model inference
- OpenVINO API Tutorial
- OpenVINO PyPI (pip install openvino-dev)

Kits19 Data
- Kits19 Challenge Homepage
- Kits19 GitHub Repository
- The KiTS19 Challenge Data: 300 Kidney Tumor Cases with Clinical Context, CT Semantic Segmentations, and Surgical Outcomes
- The state of the art in kidney and kidney tumor segmentation in contrast-enhanced CT imaging: Results of the KiTS19 challenge