Accelerate Inference of Sparse Transformer Models with OpenVINO™ and 4th Gen Intel® Xeon® Scalable Processors¶
This tutorial is also available as a Jupyter notebook that can be cloned directly from GitHub. See the installation guide for instructions to run this tutorial locally on Windows, Linux or macOS.
This tutorial demonstrates how to improve performance of sparse Transformer models with OpenVINO on 4th Gen Intel® Xeon® Scalable processors. It uses a pre-trained model from the Hugging Face Transformers library, shows how to convert it to the OpenVINO™ IR format, and runs inference on a CPU using a dedicated runtime option that enables sparsity optimizations. It also demonstrates how to get more performance by stacking sparsity with 8-bit quantization. To simplify the user experience, the Hugging Face Optimum library is used to convert the model to OpenVINO™ IR format and quantize it with the Neural Network Compression Framework (NNCF). The tutorial consists of the following steps:
Install prerequisites
Download and quantize a sparse BERT model from a public source using the OpenVINO integration with Hugging Face Optimum.
Compare sparse 8-bit vs. dense 8-bit inference performance.
Prerequisites¶
!pip install optimum[openvino] datasets
Collecting optimum[openvino]
Using cached optimum-1.7.1-py3-none-any.whl (279 kB)
Requirement already satisfied: datasets in /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-358/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (2.10.1)
Requirement already satisfied: huggingface-hub>=0.8.0 in /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-358/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from optimum[openvino]) (0.13.1)
Requirement already satisfied: numpy in /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-358/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from optimum[openvino]) (1.23.4)
Requirement already satisfied: transformers[sentencepiece]>=4.26.0 in /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-358/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from optimum[openvino]) (4.26.1)
Requirement already satisfied: packaging in /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-358/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from optimum[openvino]) (23.0)
Requirement already satisfied: torch>=1.9 in /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-358/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from optimum[openvino]) (1.13.1+cpu)
Collecting coloredlogs
Using cached coloredlogs-15.0.1-py2.py3-none-any.whl (46 kB)
Collecting sympy
Using cached sympy-1.11.1-py3-none-any.whl (6.5 MB)
Collecting optimum-intel[openvino]
Using cached optimum_intel-1.7.0-py3-none-any.whl
Requirement already satisfied: pyarrow>=6.0.0 in /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-358/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from datasets) (11.0.0)
Requirement already satisfied: dill<0.3.7,>=0.3.0 in /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-358/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from datasets) (0.3.4)
Requirement already satisfied: tqdm>=4.62.1 in /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-358/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from datasets) (4.65.0)
Requirement already satisfied: multiprocess in /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-358/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from datasets) (0.70.12.2)
Requirement already satisfied: aiohttp in /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-358/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from datasets) (3.8.4)
Requirement already satisfied: pyyaml>=5.1 in /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-358/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from datasets) (6.0)
Requirement already satisfied: pandas in /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-358/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from datasets) (1.3.5)
Requirement already satisfied: responses<0.19 in /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-358/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from datasets) (0.18.0)
Requirement already satisfied: fsspec[http]>=2021.11.1 in /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-358/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from datasets) (2023.3.0)
Requirement already satisfied: requests>=2.19.0 in /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-358/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from datasets) (2.28.1)
Requirement already satisfied: xxhash in /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-358/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from datasets) (3.2.0)
Requirement already satisfied: attrs>=17.3.0 in /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-358/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from aiohttp->datasets) (22.2.0)
Requirement already satisfied: async-timeout<5.0,>=4.0.0a3 in /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-358/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from aiohttp->datasets) (4.0.2)
Requirement already satisfied: yarl<2.0,>=1.0 in /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-358/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from aiohttp->datasets) (1.8.2)
Requirement already satisfied: multidict<7.0,>=4.5 in /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-358/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from aiohttp->datasets) (6.0.4)
Requirement already satisfied: frozenlist>=1.1.1 in /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-358/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from aiohttp->datasets) (1.3.3)
Requirement already satisfied: aiosignal>=1.1.2 in /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-358/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from aiohttp->datasets) (1.3.1)
Requirement already satisfied: charset-normalizer<4.0,>=2.0 in /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-358/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from aiohttp->datasets) (2.1.1)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-358/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from huggingface-hub>=0.8.0->optimum[openvino]) (4.5.0)
Requirement already satisfied: filelock in /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-358/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from huggingface-hub>=0.8.0->optimum[openvino]) (3.9.0)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-358/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from requests>=2.19.0->datasets) (1.26.14)
Requirement already satisfied: idna<4,>=2.5 in /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-358/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from requests>=2.19.0->datasets) (3.4)
Requirement already satisfied: certifi>=2017.4.17 in /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-358/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from requests>=2.19.0->datasets) (2022.12.7)
Requirement already satisfied: regex!=2019.12.17 in /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-358/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from transformers[sentencepiece]>=4.26.0->optimum[openvino]) (2022.10.31)
Requirement already satisfied: tokenizers!=0.11.3,<0.14,>=0.11.1 in /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-358/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from transformers[sentencepiece]>=4.26.0->optimum[openvino]) (0.13.2)
Requirement already satisfied: protobuf<=3.20.2 in /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-358/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from transformers[sentencepiece]>=4.26.0->optimum[openvino]) (3.19.6)
Requirement already satisfied: sentencepiece!=0.1.92,>=0.1.91 in /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-358/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from transformers[sentencepiece]>=4.26.0->optimum[openvino]) (0.1.97)
Collecting humanfriendly>=9.1
Using cached humanfriendly-10.0-py2.py3-none-any.whl (86 kB)
Requirement already satisfied: scipy in /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-358/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from optimum-intel[openvino]->optimum[openvino]) (1.9.1)
Requirement already satisfied: onnx in /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-358/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from optimum-intel[openvino]->optimum[openvino]) (1.12.0)
Collecting onnxruntime
Using cached onnxruntime-1.14.1-cp38-cp38-manylinux_2_27_x86_64.whl (5.0 MB)
Collecting openvino>=2023.0.0.dev20230217
Using cached openvino-2023.0.0.dev20230217-9771-cp38-cp38-manylinux2014_x86_64.whl (38.9 MB)
Requirement already satisfied: pytz>=2017.3 in /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-358/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from pandas->datasets) (2022.7.1)
Requirement already satisfied: python-dateutil>=2.7.3 in /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-358/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from pandas->datasets) (2.8.2)
Collecting mpmath>=0.19
Using cached mpmath-1.3.0-py3-none-any.whl (536 kB)
Requirement already satisfied: six>=1.5 in /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-358/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from python-dateutil>=2.7.3->pandas->datasets) (1.16.0)
Requirement already satisfied: flatbuffers in /opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-358/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages (from onnxruntime->optimum-intel[openvino]->optimum[openvino]) (1.12)
Installing collected packages: mpmath, sympy, openvino, humanfriendly, coloredlogs, onnxruntime, optimum, optimum-intel
Attempting uninstall: openvino
Found existing installation: openvino 2022.3.0
Uninstalling openvino-2022.3.0:
Successfully uninstalled openvino-2022.3.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
openvino-dev 2022.3.0 requires openvino==2022.3.0, but you have openvino 2023.0.0.dev20230217 which is incompatible.
Successfully installed coloredlogs-15.0.1 humanfriendly-10.0 mpmath-1.3.0 onnxruntime-1.14.1 openvino-2023.0.0.dev20230217 optimum-1.7.1 optimum-intel-1.7.0 sympy-1.11.1
Imports¶
from functools import partial
from pathlib import Path
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from optimum.intel.openvino import OVQuantizer
from optimum.intel.openvino import OVConfig
2023-03-09 23:02:22.718915: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, tensorflow, onnx, openvino
/opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-358/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/openvino/offline_transformations/__init__.py:10: FutureWarning: The module is private and following namespace offline_transformations will be removed in the future.
  warnings.warn(
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
Quantize the model using Hugging Face Optimum API¶
Sparsity acceleration of MatMul operations is only available when these operations are quantized to 8-bit precision. If the model is not in 8-bit precision, it can be quantized using any of the methods available for OpenVINO models. For more details, refer to the Model Optimization Guide. In this tutorial, the Hugging Face Optimum API is used to quantize the model. The Hugging Face Optimum API is a high-level API that allows converting and quantizing models from the Hugging Face Transformers library to the OpenVINO™ IR format. For more details, refer to the Hugging Face Optimum documentation.
model_id = "neuralmagic/oBERT-12-downstream-pruned-unstructured-90-mnli"
quantized_sparse_dir = Path("bert_90_sparse_quantized")
# Instantiate model and tokenizer in PyTorch and load them from the HF Hub
torch_model = AutoModelForSequenceClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
def preprocess_function(examples, tokenizer):
    """
    Tokenize the data and return it in the format expected by the model.

    :param examples: a dictionary containing the input data, i.e. items from the calibration dataset.
    :param tokenizer: the tokenizer object used to tokenize the text data.
    :returns: the tokenized data, ready to be fed directly to the model.
    """
    return tokenizer(
        examples["premise"], examples["hypothesis"], padding="max_length", max_length=128, truncation=True
    )
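Applied to a single record, the function returns the encodings the model consumes. The call below is a quick illustrative check only (the sentence pair is made up, not taken from the dataset):

# Illustrative only: tokenize one hypothetical premise/hypothesis pair.
encoded = preprocess_function(
    {"premise": "A man is playing a guitar.", "hypothesis": "A man is making music."},
    tokenizer,
)
print(list(encoded.keys()))  # ['input_ids', 'token_type_ids', 'attention_mask']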
# Create a quantization config (default) and the OVQuantizer.
# OVConfig is a wrapper class on top of the NNCF config.
# Use the "compression" field to control quantization parameters.
# For more information about the parameters, refer to the NNCF GitHub documentation.
quantization_config = OVConfig()
quantizer = OVQuantizer.from_pretrained(torch_model, feature="sequence-classification")

# Instantiate a dataset and convert it to a calibration dataset using the HF API.
# The calibration dataset produces inputs in the format the model expects.
dataset = load_dataset("glue", "mnli")
calibration_dataset = quantizer.get_calibration_dataset(
    "glue",
    dataset_config_name="mnli",
    preprocess_function=partial(preprocess_function, tokenizer=tokenizer),
    num_samples=100,
    dataset_split="train",
)

# Apply static quantization and export the resulting quantized model to OpenVINO IR format
quantizer.quantize(
    quantization_config=quantization_config, calibration_dataset=calibration_dataset, save_directory=quantized_sparse_dir
)
feature is deprecated and will be removed in a future version. Use task instead.
[ WARNING ] Found cached dataset glue (/opt/home/k8sworker/.cache/huggingface/datasets/glue/mnli/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)
[ WARNING ] Loading cached shuffled indices for dataset at /opt/home/k8sworker/.cache/huggingface/datasets/glue/mnli/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-2bba8406484faf80.arrow
[ WARNING ] Loading cached processed dataset at /opt/home/k8sworker/.cache/huggingface/datasets/glue/mnli/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-92d8560dcfa916bd.arrow
INFO:nncf:Ignored adding weight quantizer for: BertForSequenceClassification/BertModel[bert]/BertEmbeddings[embeddings]/NNCFEmbedding[word_embeddings]/embedding_0
INFO:nncf:Ignored adding weight quantizer for: BertForSequenceClassification/BertModel[bert]/BertEmbeddings[embeddings]/NNCFEmbedding[token_type_embeddings]/embedding_0
INFO:nncf:Ignored adding weight quantizer for: BertForSequenceClassification/BertModel[bert]/BertEmbeddings[embeddings]/NNCFEmbedding[position_embeddings]/embedding_0
INFO:nncf:Not adding activation input quantizer for operation: 4 BertForSequenceClassification/BertModel[bert]/BertEmbeddings[embeddings]/NNCFEmbedding[word_embeddings]/embedding_0
INFO:nncf:Not adding activation input quantizer for operation: 5 BertForSequenceClassification/BertModel[bert]/BertEmbeddings[embeddings]/NNCFEmbedding[token_type_embeddings]/embedding_0
INFO:nncf:Not adding activation input quantizer for operation: 6 BertForSequenceClassification/BertModel[bert]/BertEmbeddings[embeddings]/__add___0
INFO:nncf:Not adding activation input quantizer for operation: 8 BertForSequenceClassification/BertModel[bert]/BertEmbeddings[embeddings]/__iadd___0
INFO:nncf:Not adding activation input quantizer for operation: 9 BertForSequenceClassification/BertModel[bert]/BertEmbeddings[embeddings]/NNCFLayerNorm[LayerNorm]/layer_norm_0
INFO:nncf:Not adding activation input quantizer for operation: 10 BertForSequenceClassification/BertModel[bert]/BertEmbeddings[embeddings]/Dropout[dropout]/dropout_0
INFO:nncf:Not adding activation input quantizer for operation: 23 BertForSequenceClassification/BertModel[bert]/BertEncoder[encoder]/ModuleList[layer]/BertLayer[0]/BertAttention[attention]/BertSelfAttention[self]/__add___0
INFO:nncf:Not adding activation input quantizer for operation: 26 BertForSequenceClassification/BertModel[bert]/BertEncoder[encoder]/ModuleList[layer]/BertLayer[0]/BertAttention[attention]/BertSelfAttention[self]/matmul_1
INFO:nncf:Not adding activation input quantizer for operation: 32 BertForSequenceClassification/BertModel[bert]/BertEncoder[encoder]/ModuleList[layer]/BertLayer[0]/BertAttention[attention]/BertSelfOutput[output]/__add___0
INFO:nncf:Not adding activation input quantizer for operation: 33 BertForSequenceClassification/BertModel[bert]/BertEncoder[encoder]/ModuleList[layer]/BertLayer[0]/BertAttention[attention]/BertSelfOutput[output]/NNCFLayerNorm[LayerNorm]/layer_norm_0
INFO:nncf:Not adding activation input quantizer for operation: 38 BertForSequenceClassification/BertModel[bert]/BertEncoder[encoder]/ModuleList[layer]/BertLayer[0]/BertOutput[output]/__add___0
INFO:nncf:Not adding activation input quantizer for operation: 39 BertForSequenceClassification/BertModel[bert]/BertEncoder[encoder]/ModuleList[layer]/BertLayer[0]/BertOutput[output]/NNCFLayerNorm[LayerNorm]/layer_norm_0
[... similar "Not adding activation input quantizer" messages for BertLayer[1] through BertLayer[11] omitted ...]
INFO:nncf:Collecting tensor statistics |█ | 4 / 38
INFO:nncf:Collecting tensor statistics |███ | 8 / 38
INFO:nncf:Collecting tensor statistics |█████ | 12 / 38
INFO:nncf:Compiling and loading torch extension: quantized_functions_cpu...
INFO:nncf:Finished loading torch extension: quantized_functions_cpu
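As a quick sanity check, the quantized model saved to quantized_sparse_dir can be loaded back through the same Optimum API and used like a regular Transformers model. The snippet below is a minimal sketch rather than part of the original flow; it reuses the tokenizer defined above, and the sample sentence pair is made up:

# A minimal sketch: load the exported 8-bit IR with Optimum and classify one pair.
from optimum.intel.openvino import OVModelForSequenceClassification

ov_model = OVModelForSequenceClassification.from_pretrained(quantized_sparse_dir)
sample = tokenizer(
    "The cat sat on the mat.", "A cat is sitting on a mat.", return_tensors="pt"
)
logits = ov_model(**sample).logits
print(logits.argmax(-1))  # predicted MNLI class id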
Benchmark quantized dense inference performance¶
Benchmark dense inference performance using parallel execution on four CPU cores to simulate a small instance in cloud infrastructure. The sequence length is set to 16, which is common for multiple use cases, for example, conversational AI.
# Dump benchmarking config for dense inference
with open("perf_config.json", "w") as outfile:
    outfile.write(
        """
        {
            "CPU": {"NUM_STREAMS": 4, "INFERENCE_NUM_THREADS": 4}
        }
        """
    )
!benchmark_app -m bert_90_sparse_quantized/openvino_model.xml -shape "input_ids[1,16],attention_mask[1,16],token_type_ids[1,16]" -load_config perf_config.json
[Step 1/11] Parsing and validating input arguments
[ INFO ] Parsing input parameters
[Step 2/11] Loading OpenVINO Runtime
[ INFO ] OpenVINO:
[ INFO ] Build ................................. 2023.0.0-9771-d7f47aa1228
[ INFO ]
[ INFO ] Device info:
[ INFO ] CPU
[ INFO ] Build ................................. 2023.0.0-9771-d7f47aa1228
[ INFO ]
[ INFO ]
[Step 3/11] Setting device configuration
[ WARNING ] Performance hint was not explicitly specified in command line. Device(CPU) performance hint will be set to THROUGHPUT.
[Step 4/11] Reading model files
[ INFO ] Loading model files
[ INFO ] Read model took 176.10 ms
[ INFO ] Original model I/O parameters:
[ INFO ] Model inputs:
[ INFO ] input_ids (node: input_ids) : i64 / [...] / [?,?]
[ INFO ] attention_mask (node: attention_mask) : i64 / [...] / [?,?]
[ INFO ] token_type_ids (node: token_type_ids) : i64 / [...] / [?,?]
[ INFO ] Model outputs:
[ INFO ] logits (node: logits) : f32 / [...] / [?,3]
[Step 5/11] Resizing model to match image sizes and given batch
[ INFO ] Model batch size: 1
[ INFO ] Reshaping model: 'input_ids': [1,16], 'attention_mask': [1,16], 'token_type_ids': [1,16]
[ INFO ] Reshape model took 37.82 ms
[Step 6/11] Configuring input of the model
[ INFO ] Model inputs:
[ INFO ] input_ids (node: input_ids) : i64 / [...] / [1,16]
[ INFO ] attention_mask (node: attention_mask) : i64 / [...] / [1,16]
[ INFO ] token_type_ids (node: token_type_ids) : i64 / [...] / [1,16]
[ INFO ] Model outputs:
[ INFO ] logits (node: logits) : f32 / [...] / [1,3]
[Step 7/11] Loading the model to the device
[ INFO ] Compile model took 1205.36 ms
[Step 8/11] Querying optimal runtime parameters
[ INFO ] Model:
[ INFO ] NETWORK_NAME: torch_jit
[ INFO ] OPTIMAL_NUMBER_OF_INFER_REQUESTS: 4
[ INFO ] NUM_STREAMS: 4
[ INFO ] AFFINITY: Affinity.CORE
[ INFO ] INFERENCE_NUM_THREADS: 4
[ INFO ] PERF_COUNT: False
[ INFO ] INFERENCE_PRECISION_HINT: <Type: 'float32'>
[ INFO ] PERFORMANCE_HINT: PerformanceMode.THROUGHPUT
[ INFO ] PERFORMANCE_HINT_NUM_REQUESTS: 0
[ INFO ] EXECUTION_DEVICES: ['CPU']
[Step 9/11] Creating infer requests and preparing input tensors
[ WARNING ] No input files were given for input 'input_ids'!. This input will be filled with random values!
[ WARNING ] No input files were given for input 'attention_mask'!. This input will be filled with random values!
[ WARNING ] No input files were given for input 'token_type_ids'!. This input will be filled with random values!
[ INFO ] Fill input 'input_ids' with random values
[ INFO ] Fill input 'attention_mask' with random values
[ INFO ] Fill input 'token_type_ids' with random values
[Step 10/11] Measuring performance (Start inference asynchronously, 4 inference requests, limits: 60000 ms duration)
[ INFO ] Benchmarking in inference only mode (inputs filling are not included in measurement loop).
[ INFO ] First inference took 16.58 ms
[Step 11/11] Dumping statistics report
[ INFO ] Execution Devices:['CPU']
[ INFO ] Count: 19840 iterations
[ INFO ] Duration: 60016.66 ms
[ INFO ] Latency:
[ INFO ] Median: 11.92 ms
[ INFO ] Average: 11.94 ms
[ INFO ] Min: 10.90 ms
[ INFO ] Max: 16.64 ms
[ INFO ] Throughput: 330.57 FPS
Benchmark quantized sparse inference performance¶
# Dump benchmarking config for sparse inference.
# "CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE" controls the minimum sparsity rate
# that weights must have to be considered for sparse optimization at runtime.
with open("perf_config_sparse.json", "w") as outfile:
    outfile.write(
        """
        {
            "CPU": {"NUM_STREAMS": 4, "INFERENCE_NUM_THREADS": 4, "CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE": 0.8}
        }
        """
    )
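The same properties can also be passed directly when compiling the model through the OpenVINO Python API, which is roughly what benchmark_app does when given -load_config. A short sketch, assuming the IR produced above:

# Sketch: enable sparse weights decompression when compiling via the Python API.
from openvino.runtime import Core

core = Core()
compiled_model = core.compile_model(
    "bert_90_sparse_quantized/openvino_model.xml",
    "CPU",
    {"NUM_STREAMS": "4", "INFERENCE_NUM_THREADS": "4", "CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE": "0.8"},
)

Below, benchmark_app applies the equivalent settings from perf_config_sparse.json: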
!benchmark_app -m bert_90_sparse_quantized/openvino_model.xml -shape "input_ids[1,16],attention_mask[1,16],token_type_ids[1,16]" -load_config perf_config_sparse.json
[Step 1/11] Parsing and validating input arguments
[ INFO ] Parsing input parameters
[Step 2/11] Loading OpenVINO Runtime
[ INFO ] OpenVINO:
[ INFO ] Build ................................. 2023.0.0-9771-d7f47aa1228
[ INFO ]
[ INFO ] Device info:
[ INFO ] CPU
[ INFO ] Build ................................. 2023.0.0-9771-d7f47aa1228
[ INFO ]
[ INFO ]
[Step 3/11] Setting device configuration
[ WARNING ] Performance hint was not explicitly specified in command line. Device(CPU) performance hint will be set to THROUGHPUT.
[Step 4/11] Reading model files
[ INFO ] Loading model files
[ INFO ] Read model took 181.89 ms
[ INFO ] Original model I/O parameters:
[ INFO ] Model inputs:
[ INFO ] input_ids (node: input_ids) : i64 / [...] / [?,?]
[ INFO ] attention_mask (node: attention_mask) : i64 / [...] / [?,?]
[ INFO ] token_type_ids (node: token_type_ids) : i64 / [...] / [?,?]
[ INFO ] Model outputs:
[ INFO ] logits (node: logits) : f32 / [...] / [?,3]
[Step 5/11] Resizing model to match image sizes and given batch
[ INFO ] Model batch size: 1
[ INFO ] Reshaping model: 'input_ids': [1,16], 'attention_mask': [1,16], 'token_type_ids': [1,16]
[ INFO ] Reshape model took 38.20 ms
[Step 6/11] Configuring input of the model
[ INFO ] Model inputs:
[ INFO ] input_ids (node: input_ids) : i64 / [...] / [1,16]
[ INFO ] attention_mask (node: attention_mask) : i64 / [...] / [1,16]
[ INFO ] token_type_ids (node: token_type_ids) : i64 / [...] / [1,16]
[ INFO ] Model outputs:
[ INFO ] logits (node: logits) : f32 / [...] / [1,3]
[Step 7/11] Loading the model to the device
[ INFO ] Compile model took 1204.08 ms
[Step 8/11] Querying optimal runtime parameters
[ INFO ] Model:
[ INFO ] NETWORK_NAME: torch_jit
[ INFO ] OPTIMAL_NUMBER_OF_INFER_REQUESTS: 4
[ INFO ] NUM_STREAMS: 4
[ INFO ] AFFINITY: Affinity.CORE
[ INFO ] INFERENCE_NUM_THREADS: 4
[ INFO ] PERF_COUNT: False
[ INFO ] INFERENCE_PRECISION_HINT: <Type: 'float32'>
[ INFO ] PERFORMANCE_HINT: PerformanceMode.THROUGHPUT
[ INFO ] PERFORMANCE_HINT_NUM_REQUESTS: 0
[ INFO ] EXECUTION_DEVICES: ['CPU']
[Step 9/11] Creating infer requests and preparing input tensors
[ WARNING ] No input files were given for input 'input_ids'!. This input will be filled with random values!
[ WARNING ] No input files were given for input 'attention_mask'!. This input will be filled with random values!
[ WARNING ] No input files were given for input 'token_type_ids'!. This input will be filled with random values!
[ INFO ] Fill input 'input_ids' with random values
[ INFO ] Fill input 'attention_mask' with random values
[ INFO ] Fill input 'token_type_ids' with random values
[Step 10/11] Measuring performance (Start inference asynchronously, 4 inference requests, limits: 60000 ms duration)
[ INFO ] Benchmarking in inference only mode (inputs filling are not included in measurement loop).
[ INFO ] First inference took 16.16 ms
[Step 11/11] Dumping statistics report
[ INFO ] Execution Devices:['CPU']
[ INFO ] Count: 19836 iterations
[ INFO ] Duration: 60015.12 ms
[ INFO ] Latency:
[ INFO ] Median: 11.93 ms
[ INFO ] Average: 11.95 ms
[ INFO ] Min: 11.48 ms
[ INFO ] Max: 16.93 ms
[ INFO ] Throughput: 330.52 FPS
When this might be helpful¶
This feature can improve inference performance for models with sparse weights in scenarios where the model is deployed to handle multiple requests in parallel asynchronously. It is especially helpful with small sequence lengths, for example, 32 and lower.
For more details about asynchronous inference with OpenVINO, please refer to the following documentation:

- Deployment Optimization Guide
- Inference Request API
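The sketch below illustrates such a deployment pattern with OpenVINO's AsyncInferQueue; the request count, sequence length, vocabulary size, and random inputs are illustrative assumptions, not values prescribed by the tutorial:

# Sketch: serve several requests in parallel with AsyncInferQueue.
import numpy as np
from openvino.runtime import Core, AsyncInferQueue

core = Core()
compiled = core.compile_model(
    "bert_90_sparse_quantized/openvino_model.xml",
    "CPU",
    {"NUM_STREAMS": "4", "INFERENCE_NUM_THREADS": "4", "CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE": "0.8"},
)

queue = AsyncInferQueue(compiled, 4)  # four parallel infer requests
results = {}

def on_done(request, userdata):
    # Store the logits of each request as it completes.
    results[userdata] = request.get_output_tensor(0).data.copy()

queue.set_callback(on_done)

seq_len = 16  # short sequences benefit most from sparsity acceleration
for i in range(8):
    queue.start_async(
        {
            "input_ids": np.random.randint(0, 30522, (1, seq_len), dtype=np.int64),
            "attention_mask": np.ones((1, seq_len), dtype=np.int64),
            "token_type_ids": np.zeros((1, seq_len), dtype=np.int64),
        },
        userdata=i,
    )

queue.wait_all()  # block until all asynchronous requests complete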