Quantization of Image Classification Models#
This tutorial demonstrates how to apply INT8 quantization to an image classification model using NNCF. It uses the MobileNet V2 model, trained on the CIFAR10 dataset. The code is designed to be extendable to custom models and datasets. The tutorial uses the OpenVINO backend to perform model quantization in NNCF. If you are interested in how to apply quantization to a PyTorch model, please check this tutorial.
This tutorial consists of the following steps:
Prepare the model for quantization.
Define a data loading functionality.
Perform quantization.
Compare accuracy of the original and quantized models.
Compare performance of the original and quantized models.
Compare results on four pictures.
Installation Instructions#
This is a self-contained example that relies solely on its own code.
We recommend running the notebook in a virtual environment. You only need a Jupyter server to start. For details, please refer to Installation Guide.
import platform

# Install required packages
%pip install -q "openvino>=2023.1.0" "nncf>=2.6.0" torch torchvision tqdm --extra-index-url https://download.pytorch.org/whl/cpu

if platform.system() != "Windows":
    %pip install -q "matplotlib>=3.4"
else:
    %pip install -q "matplotlib>=3.4,<3.7"
Note: you may need to restart the kernel to use updated packages.
from pathlib import Path
# Set the data and model directories
DATA_DIR = Path("data")
MODEL_DIR = Path("model")
model_repo = "pytorch-cifar-models"
DATA_DIR.mkdir(exist_ok=True)
MODEL_DIR.mkdir(exist_ok=True)
Prepare the Model#
The model preparation stage has the following steps:
Download a PyTorch model.
Convert the model to OpenVINO Intermediate Representation (IR) format using the model conversion Python API.
Serialize the converted model to disk.
import sys

if not Path(model_repo).exists():
    !git clone https://github.com/chenyaofo/pytorch-cifar-models.git

sys.path.append(model_repo)
Cloning into 'pytorch-cifar-models'...
remote: Enumerating objects: 282, done.
remote: Counting objects: 100% (281/281), done.
remote: Compressing objects: 100% (96/96), done.
remote: Total 282 (delta 135), reused 269 (delta 128), pack-reused 1 (from 1)
Receiving objects: 100% (282/282), 9.22 MiB | 22.15 MiB/s, done.
Resolving deltas: 100% (135/135), done.
from pytorch_cifar_models import cifar10_mobilenetv2_x1_0
model = cifar10_mobilenetv2_x1_0(pretrained=True)
OpenVINO supports PyTorch models via conversion to OpenVINO Intermediate Representation format using the model conversion Python API. ov.convert_model accepts a PyTorch model instance and converts it into an openvino.runtime.Model, the representation of the model in OpenVINO. Optionally, you may specify example_input, which serves as a helper for model tracing, and input, which sets the input shape so that the model is converted with a static shape. The converted model is ready to be loaded on a device for inference and can be saved to disk for later use via the save_model function. More details about the model conversion Python API can be found on this page.
import openvino as ov
model.eval()
ov_model = ov.convert_model(model, input=[1, 3, 32, 32])
ov.save_model(ov_model, MODEL_DIR / "mobilenet_v2.xml")
Prepare Dataset#
We will use the CIFAR10 dataset from torchvision. The preprocessing parameters for the model are taken from its training config.
import torch
from torchvision import transforms
from torchvision.datasets import CIFAR10
transform = transforms.Compose(
    [
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.247, 0.243, 0.261)),
    ]
)
dataset = CIFAR10(root=DATA_DIR, train=False, transform=transform, download=True)
val_loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=1,
    shuffle=False,
    num_workers=0,
    pin_memory=True,
)
Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to data/cifar-10-python.tar.gz
100%|██████████| 170498071/170498071 [00:07<00:00, 23597971.56it/s]
Extracting data/cifar-10-python.tar.gz to data
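To illustrate what the preprocessing above does to a single pixel, here is a minimal sketch in plain Python. It assumes the usual torchvision behavior: ToTensor scales 8-bit values to the [0, 1] range and Normalize subtracts the per-channel mean and divides by the per-channel standard deviation. The helper names normalize_pixel and denormalize_pixel are hypothetical and are not part of torchvision.

```python
# Per-channel statistics copied from the transforms.Normalize call above.
MEAN = (0.4914, 0.4822, 0.4465)
STD = (0.247, 0.243, 0.261)


def normalize_pixel(rgb_uint8):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return tuple((value / 255.0 - m) / s for value, m, s in zip(rgb_uint8, MEAN, STD))


def denormalize_pixel(rgb_norm):
    """Invert normalize_pixel, returning channel values in the [0, 1] range."""
    return tuple(value * s + m for value, m, s in zip(rgb_norm, MEAN, STD))


pixel = (255, 128, 0)  # hypothetical example pixel
normalized = normalize_pixel(pixel)
restored = denormalize_pixel(normalized)
print("normalized:", normalized)
print("restored:  ", restored)
```

Normalized values can fall outside the [0, 1] range, so applying the inverse transform is one way to bring pictures back to a displayable range before plotting them.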
Perform Quantization#
NNCF provides a suite of advanced algorithms for Neural Networks inference optimization in OpenVINO with minimal accuracy drop. We will use 8-bit quantization in post-training mode (without the fine-tuning pipeline) to optimize MobileNetV2. The optimization process contains the following steps:
Create a Dataset for quantization.
Run nncf.quantize to get an optimized model.
Serialize the OpenVINO IR model using the openvino.save_model function.
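To make the idea of 8-bit quantization more concrete, the following sketch shows generic uniform affine quantization of a few floating-point values to signed 8-bit integers and back. It is an illustration of the arithmetic only and not NNCF's implementation: NNCF derives the ranges from calibration statistics and inserts quantization operations into the model graph. All function names here are hypothetical.

```python
def quantization_params(min_val, max_val, num_bits=8):
    """Compute the scale and zero point that map [min_val, max_val] onto signed integers."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    # Extend the range to include zero so that 0.0 is exactly representable.
    min_val, max_val = min(min_val, 0.0), max(max_val, 0.0)
    scale = (max_val - min_val) / (qmax - qmin)  # assumes max_val > min_val
    zero_point = round(qmin - min_val / scale)
    return scale, zero_point, qmin, qmax


def quantize(x, scale, zero_point, qmin, qmax):
    """Map a float to the nearest integer level and clamp it to the valid range."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Map an integer level back to an approximate float value."""
    return (q - zero_point) * scale


values = [-1.0, -0.25, 0.0, 0.4, 2.0]  # toy data standing in for activations
scale, zero_point, qmin, qmax = quantization_params(min(values), max(values))
quantized = [quantize(v, scale, zero_point, qmin, qmax) for v in values]
restored = [dequantize(q, scale, zero_point) for q in quantized]

for v, q, r in zip(values, quantized, restored):
    print(f"{v:+.4f} -> {q:4d} -> {r:+.4f}")
```

In this toy case, the rounding error of each value is bounded by half of the scale, which is why a well-chosen range usually costs little accuracy.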
Create Dataset for Validation#
NNCF is compatible with the torch.utils.data.DataLoader interface. To perform quantization, the data loader should be passed into an nncf.Dataset object together with a transformation function that prepares the input data to fit into the model during quantization. In our case, the function picks the input tensor from the pair (input tensor and label) and converts the PyTorch tensor to a NumPy array.
import nncf


def transform_fn(data_item):
    image_tensor = data_item[0]
    return image_tensor.numpy()


quantization_dataset = nncf.Dataset(val_loader, transform_fn)
INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, tensorflow, onnx, openvino
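The role of the transformation function can be sketched without NNCF. The class below is a hypothetical stand-in, not the real nncf.Dataset: it wraps an iterable of (input, label) pairs and yields only what the model should receive, optionally limited to a number of items. The names CalibrationDataset, iter_inputs, and take_image are assumptions made for this illustration.

```python
from itertools import islice


class CalibrationDataset:
    """Hypothetical stand-in that mimics the role of nncf.Dataset in this tutorial."""

    def __init__(self, data_source, transform_fn=None):
        self._data_source = data_source
        # Without a transform, items are passed to the model unchanged.
        self._transform_fn = transform_fn or (lambda item: item)

    def iter_inputs(self, subset_size=None):
        """Yield transformed model inputs, limited to subset_size items if given."""
        for item in islice(self._data_source, subset_size):
            yield self._transform_fn(item)


# Toy (image, label) pairs standing in for the DataLoader batches.
toy_loader = [([0.1, 0.2], 3), ([0.3, 0.4], 8), ([0.5, 0.6], 1)]


def take_image(data_item):
    # Same idea as transform_fn above: keep the input, drop the label.
    return data_item[0]


dataset = CalibrationDataset(toy_loader, take_image)
model_inputs = list(dataset.iter_inputs(subset_size=2))
print(model_inputs)
```

Keeping the transformation separate from the data source means the same loader can serve both accuracy validation, which needs labels, and calibration, which does not.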
Run nncf.quantize for Getting an Optimized Model#
The nncf.quantize function accepts a model and a prepared quantization dataset to perform basic quantization. Optionally, additional parameters such as subset_size, preset, and ignored_scope can be provided to improve the quantization result, if applicable. More details about supported parameters can be found on this page.
quant_ov_model = nncf.quantize(ov_model, quantization_dataset)
Serialize an OpenVINO IR model#
As with the result of ov.convert_model, the quantized model is an ov.Model object that is ready to be loaded onto a device and can be serialized to disk using ov.save_model.
ov.save_model(quant_ov_model, MODEL_DIR / "quantized_mobilenet_v2.xml")
Compare Accuracy of the Original and Quantized Models#
from tqdm.notebook import tqdm
import numpy as np


def test_accuracy(ov_model, data_loader):
    correct = 0
    total = 0
    for batch_imgs, batch_labels in tqdm(data_loader):
        result = ov_model(batch_imgs)[0]
        top_label = np.argmax(result)
        # With batch_size=1 this comparison yields a one-element array,
        # so the returned accuracy is a one-element array as well.
        correct += top_label == batch_labels.numpy()
        total += 1
    return correct / total
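The metric computed above is top-1 accuracy: a prediction counts as correct when the class with the highest score equals the label. As a minimal sketch of the same logic on toy data, without OpenVINO or NumPy, consider the following. The helper names and the toy scores are hypothetical.

```python
def argmax(scores):
    """Return the index of the largest score."""
    return max(range(len(scores)), key=scores.__getitem__)


def top1_accuracy(all_scores, all_labels):
    """Fraction of samples whose highest-scoring class equals the label."""
    correct = sum(argmax(scores) == label for scores, label in zip(all_scores, all_labels))
    return correct / len(all_labels)


# Toy model outputs for four samples and three classes.
toy_scores = [[0.1, 2.3, -1.0], [1.5, 0.2, 0.1], [0.0, 0.1, 0.9], [0.3, 0.2, 0.1]]
toy_labels = [1, 0, 2, 1]

accuracy = top1_accuracy(toy_scores, toy_labels)
print(f"Top-1 accuracy: {accuracy * 100:.2f}%")
```

Unlike test_accuracy, this sketch returns a plain float rather than a one-element array, so no indexing is needed when printing the result.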
Select inference device#
Select a device from the dropdown list to run inference using OpenVINO.
import ipywidgets as widgets

core = ov.Core()
device = widgets.Dropdown(
    options=core.available_devices + ["AUTO"],
    value="AUTO",
    description="Device:",
    disabled=False,
)
device
Dropdown(description='Device:', index=1, options=('CPU', 'AUTO'), value='AUTO')
core = ov.Core()
compiled_model = core.compile_model(ov_model, device.value)
optimized_compiled_model = core.compile_model(quant_ov_model, device.value)
orig_accuracy = test_accuracy(compiled_model, val_loader)
optimized_accuracy = test_accuracy(optimized_compiled_model, val_loader)
print(f"Accuracy of the original model: {orig_accuracy[0] * 100 :.2f}%")
print(f"Accuracy of the optimized model: {optimized_accuracy[0] * 100 :.2f}%")
Accuracy of the original model: 93.61%
Accuracy of the optimized model: 93.57%
Compare Performance of the Original and Quantized Models#
Finally, measure the inference performance of the FP32 and INT8 models using Benchmark Tool, an inference performance measurement tool in OpenVINO.
NOTE: For more accurate performance, it is recommended to run benchmark_app in a terminal/command prompt after closing other applications. Run benchmark_app -m model.xml -d CPU to benchmark async inference on CPU for one minute. Change CPU to GPU to benchmark on GPU. Run benchmark_app --help to see an overview of all command-line options.
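When running the benchmark from a terminal, as the note recommends, it can help to assemble the command in one place. The sketch below only builds and prints the command line. It uses the -m, -d, -api, and -t options that appear in this tutorial, and the function name build_benchmark_command is a hypothetical helper.

```python
import shlex


def build_benchmark_command(model_path, device="CPU", seconds=15):
    """Assemble a benchmark_app command line as an argument list."""
    return [
        "benchmark_app",
        "-m", str(model_path),  # path to the OpenVINO IR model
        "-d", device,           # target device, for example CPU or GPU
        "-api", "async",        # asynchronous inference mode
        "-t", str(seconds),     # benchmark duration in seconds
    ]


command = build_benchmark_command("model/mobilenet_v2.xml", device="CPU", seconds=15)
print(shlex.join(command))
```

The printed line can be pasted into a terminal, or the argument list can be passed to subprocess.run(command, check=True) on a machine where benchmark_app is installed.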
# Inference FP16 model (OpenVINO IR)
!benchmark_app -m "model/mobilenet_v2.xml" -d $device.value -api async -t 15
[Step 1/11] Parsing and validating input arguments
[ INFO ] Parsing input parameters
[Step 2/11] Loading OpenVINO Runtime
[ INFO ] OpenVINO:
[ INFO ] Build ................................. 2024.4.0-16508-1d6e97cabaa
[ INFO ]
[ INFO ] Device info:
[ INFO ] AUTO
[ INFO ] Build ................................. 2024.4.0-16508-1d6e97cabaa
[ INFO ]
[ INFO ]
[Step 3/11] Setting device configuration
[ WARNING ] Performance hint was not explicitly specified in command line. Device(AUTO) performance hint will be set to PerformanceMode.THROUGHPUT.
[Step 4/11] Reading model files
[ INFO ] Loading model files
[ INFO ] Read model took 9.60 ms
[ INFO ] Original model I/O parameters:
[ INFO ] Model inputs:
[ INFO ] x (node: x) : f32 / [...] / [1,3,32,32]
[ INFO ] Model outputs:
[ INFO ] x.17 (node: aten::linear/Add) : f32 / [...] / [1,10]
[Step 5/11] Resizing model to match image sizes and given batch
[ INFO ] Model batch size: 1
[Step 6/11] Configuring input of the model
[ INFO ] Model inputs:
[ INFO ] x (node: x) : u8 / [N,C,H,W] / [1,3,32,32]
[ INFO ] Model outputs:
[ INFO ] x.17 (node: aten::linear/Add) : f32 / [...] / [1,10]
[Step 7/11] Loading the model to the device
[ INFO ] Compile model took 175.13 ms
[Step 8/11] Querying optimal runtime parameters
[ INFO ] Model:
[ INFO ] NETWORK_NAME: Model2
[ INFO ] EXECUTION_DEVICES: ['CPU']
[ INFO ] PERFORMANCE_HINT: PerformanceMode.THROUGHPUT
[ INFO ] OPTIMAL_NUMBER_OF_INFER_REQUESTS: 12
[ INFO ] MULTI_DEVICE_PRIORITIES: CPU
[ INFO ] CPU:
[ INFO ] AFFINITY: Affinity.CORE
[ INFO ] CPU_DENORMALS_OPTIMIZATION: False
[ INFO ] CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE: 1.0
[ INFO ] DYNAMIC_QUANTIZATION_GROUP_SIZE: 32
[ INFO ] ENABLE_CPU_PINNING: True
[ INFO ] ENABLE_HYPER_THREADING: True
[ INFO ] EXECUTION_DEVICES: ['CPU']
[ INFO ] EXECUTION_MODE_HINT: ExecutionMode.PERFORMANCE
[ INFO ] INFERENCE_NUM_THREADS: 24
[ INFO ] INFERENCE_PRECISION_HINT: <Type: 'float32'>
[ INFO ] KV_CACHE_PRECISION: <Type: 'float16'>
[ INFO ] LOG_LEVEL: Level.NO
[ INFO ] MODEL_DISTRIBUTION_POLICY: set()
[ INFO ] NETWORK_NAME: Model2
[ INFO ] NUM_STREAMS: 12
[ INFO ] OPTIMAL_NUMBER_OF_INFER_REQUESTS: 12
[ INFO ] PERFORMANCE_HINT: THROUGHPUT
[ INFO ] PERFORMANCE_HINT_NUM_REQUESTS: 0
[ INFO ] PERF_COUNT: NO
[ INFO ] SCHEDULING_CORE_TYPE: SchedulingCoreType.ANY_CORE
[ INFO ] MODEL_PRIORITY: Priority.MEDIUM
[ INFO ] LOADED_FROM_CACHE: False
[ INFO ] PERF_COUNT: False
[Step 9/11] Creating infer requests and preparing input tensors
[ WARNING ] No input files were given for input 'x'!. This input will be filled with random values!
[ INFO ] Fill input 'x' with random values
[Step 10/11] Measuring performance (Start inference asynchronously, 12 inference requests, limits: 15000 ms duration)
[ INFO ] Benchmarking in inference only mode (inputs filling are not included in measurement loop).
[ INFO ] First inference took 3.12 ms
[Step 11/11] Dumping statistics report
[ INFO ] Execution Devices:['CPU']
[ INFO ] Count: 88164 iterations
[ INFO ] Duration: 15002.32 ms
[ INFO ] Latency:
[ INFO ] Median: 1.85 ms
[ INFO ] Average: 1.86 ms
[ INFO ] Min: 1.40 ms
[ INFO ] Max: 9.58 ms
[ INFO ] Throughput: 5876.69 FPS
# Inference INT8 model (OpenVINO IR)
!benchmark_app -m "model/quantized_mobilenet_v2.xml" -d $device.value -api async -t 15
[Step 1/11] Parsing and validating input arguments
[ INFO ] Parsing input parameters
[Step 2/11] Loading OpenVINO Runtime
[ INFO ] OpenVINO:
[ INFO ] Build ................................. 2024.4.0-16508-1d6e97cabaa
[ INFO ]
[ INFO ] Device info:
[ INFO ] AUTO
[ INFO ] Build ................................. 2024.4.0-16508-1d6e97cabaa
[ INFO ]
[ INFO ]
[Step 3/11] Setting device configuration
[ WARNING ] Performance hint was not explicitly specified in command line. Device(AUTO) performance hint will be set to PerformanceMode.THROUGHPUT.
[Step 4/11] Reading model files
[ INFO ] Loading model files
[ INFO ] Read model took 15.06 ms
[ INFO ] Original model I/O parameters:
[ INFO ] Model inputs:
[ INFO ] x (node: x) : f32 / [...] / [1,3,32,32]
[ INFO ] Model outputs:
[ INFO ] x.17 (node: aten::linear/Add) : f32 / [...] / [1,10]
[Step 5/11] Resizing model to match image sizes and given batch
[ INFO ] Model batch size: 1
[Step 6/11] Configuring input of the model
[ INFO ] Model inputs:
[ INFO ] x (node: x) : u8 / [N,C,H,W] / [1,3,32,32]
[ INFO ] Model outputs:
[ INFO ] x.17 (node: aten::linear/Add) : f32 / [...] / [1,10]
[Step 7/11] Loading the model to the device
[ INFO ] Compile model took 252.04 ms
[Step 8/11] Querying optimal runtime parameters
[ INFO ] Model:
[ INFO ] NETWORK_NAME: Model2
[ INFO ] EXECUTION_DEVICES: ['CPU']
[ INFO ] PERFORMANCE_HINT: PerformanceMode.THROUGHPUT
[ INFO ] OPTIMAL_NUMBER_OF_INFER_REQUESTS: 12
[ INFO ] MULTI_DEVICE_PRIORITIES: CPU
[ INFO ] CPU:
[ INFO ] AFFINITY: Affinity.CORE
[ INFO ] CPU_DENORMALS_OPTIMIZATION: False
[ INFO ] CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE: 1.0
[ INFO ] DYNAMIC_QUANTIZATION_GROUP_SIZE: 32
[ INFO ] ENABLE_CPU_PINNING: True
[ INFO ] ENABLE_HYPER_THREADING: True
[ INFO ] EXECUTION_DEVICES: ['CPU']
[ INFO ] EXECUTION_MODE_HINT: ExecutionMode.PERFORMANCE
[ INFO ] INFERENCE_NUM_THREADS: 24
[ INFO ] INFERENCE_PRECISION_HINT: <Type: 'float32'>
[ INFO ] KV_CACHE_PRECISION: <Type: 'float16'>
[ INFO ] LOG_LEVEL: Level.NO
[ INFO ] MODEL_DISTRIBUTION_POLICY: set()
[ INFO ] NETWORK_NAME: Model2
[ INFO ] NUM_STREAMS: 12
[ INFO ] OPTIMAL_NUMBER_OF_INFER_REQUESTS: 12
[ INFO ] PERFORMANCE_HINT: THROUGHPUT
[ INFO ] PERFORMANCE_HINT_NUM_REQUESTS: 0
[ INFO ] PERF_COUNT: NO
[ INFO ] SCHEDULING_CORE_TYPE: SchedulingCoreType.ANY_CORE
[ INFO ] MODEL_PRIORITY: Priority.MEDIUM
[ INFO ] LOADED_FROM_CACHE: False
[ INFO ] PERF_COUNT: False
[Step 9/11] Creating infer requests and preparing input tensors
[ WARNING ] No input files were given for input 'x'!. This input will be filled with random values!
[ INFO ] Fill input 'x' with random values
[Step 10/11] Measuring performance (Start inference asynchronously, 12 inference requests, limits: 15000 ms duration)
[ INFO ] Benchmarking in inference only mode (inputs filling are not included in measurement loop).
[ INFO ] First inference took 1.66 ms
[Step 11/11] Dumping statistics report
[ INFO ] Execution Devices:['CPU']
[ INFO ] Count: 167052 iterations
[ INFO ] Duration: 15001.55 ms
[ INFO ] Latency:
[ INFO ] Median: 1.00 ms
[ INFO ] Average: 1.03 ms
[ INFO ] Min: 0.68 ms
[ INFO ] Max: 9.14 ms
[ INFO ] Throughput: 11135.65 FPS
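The two reports can be summarized with a little arithmetic. The sketch below copies the throughput, median latency, and accuracy numbers printed in this run and derives the speedup and the accuracy difference. The numbers are specific to the hardware and software versions used here, and the variable names are chosen only for this illustration.

```python
# Numbers copied from the two benchmark_app reports and the accuracy printout above.
original_throughput_fps = 5876.69
quantized_throughput_fps = 11135.65
original_median_latency_ms = 1.85
quantized_median_latency_ms = 1.00
original_accuracy_pct = 93.61
quantized_accuracy_pct = 93.57

# Ratio of throughputs: how many times more frames per second the INT8 model processes.
throughput_speedup = quantized_throughput_fps / original_throughput_fps
# Ratio of median latencies: how many times faster a single request completes.
latency_ratio = original_median_latency_ms / quantized_median_latency_ms
# Accuracy difference in percentage points.
accuracy_drop_pp = original_accuracy_pct - quantized_accuracy_pct

print(f"Throughput speedup: {throughput_speedup:.2f}x")
print(f"Median latency ratio: {latency_ratio:.2f}x")
print(f"Accuracy drop: {accuracy_drop_pp:.2f} percentage points")
```

On this machine, the quantized model roughly doubles throughput while the measured accuracy changes by a few hundredths of a percentage point.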
Compare results on four pictures#
# Define all possible labels from the CIFAR10 dataset
labels_names = [
    "airplane",
    "automobile",
    "bird",
    "cat",
    "deer",
    "dog",
    "frog",
    "horse",
    "ship",
    "truck",
]
all_pictures = []
all_labels = []

# Get all pictures and their labels.
for i, batch in enumerate(val_loader):
    all_pictures.append(batch[0].numpy())
    all_labels.append(batch[1].item())
import matplotlib.pyplot as plt


def plot_pictures(indexes: list, all_pictures=all_pictures, all_labels=all_labels):
    """Plot 4 pictures.
    :param indexes: a list of indexes of pictures to be displayed.
    :param all_pictures: a list of pictures, each of shape [1, C, H, W].
    :param all_labels: a list of integer class labels for the pictures.
    """
    images, labels = [], []
    num_pics = len(indexes)
    assert num_pics == 4, f"Expected exactly 4 indexes of pictures to be displayed, got {num_pics}"
    for idx in indexes:
        assert idx < 10000, "Cannot get such index, there are only 10000"
        # Move the channel axis last (CHW -> HWC), as expected by imshow.
        pic = np.rollaxis(all_pictures[idx].squeeze(), 0, 3)
        images.append(pic)
        labels.append(labels_names[all_labels[idx]])

    f, axarr = plt.subplots(1, 4)
    for ax, image, label in zip(axarr, images, labels):
        ax.imshow(image)
        ax.set_title(label)
def infer_on_pictures(model, indexes: list, all_pictures=all_pictures):
    """Run model inference on a few pictures.
    :param model: compiled model used for inference
    :param indexes: list of indexes of pictures to infer on
    :param all_pictures: a list of pictures, each of shape [1, C, H, W]
    :return: list of predicted label names
    """
    output_key = model.output(0)
    predicted_labels = []
    for idx in indexes:
        assert idx < 10000, "Cannot get such index, there are only 10000"
        result = model(all_pictures[idx])[output_key]
        result = labels_names[np.argmax(result[0])]
        predicted_labels.append(result)
    return predicted_labels
indexes_to_infer = [7, 12, 15, 20]  # To plot, specify 4 indexes.

plot_pictures(indexes_to_infer)

results_float = infer_on_pictures(compiled_model, indexes_to_infer)
results_quantized = infer_on_pictures(optimized_compiled_model, indexes_to_infer)

print(f"Labels for picture from float model : {results_float}.")
print(f"Labels for picture from quantized model : {results_quantized}.")
Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).
Labels for picture from float model : ['frog', 'dog', 'ship', 'horse'].
Labels for picture from quantized model : ['frog', 'dog', 'ship', 'horse'].