Zero-shot Image Classification with OpenAI CLIP and OpenVINO™

This tutorial is also available as a Jupyter notebook that can be cloned directly from GitHub. See the installation guide for instructions to run this tutorial locally on Windows, Linux or macOS.

Github

Zero-shot image classification is a computer vision task to classify images into one of several classes without any prior training or knowledge of the classes.

zero-shot-pipeline

zero-shot-pipeline

**image source*

Zero-shot learning resolves several challenges in image retrieval systems. For example, with the rapid growth of categories on the web, it is challenging to index images based on unseen categories. We can associate unseen categories to images with zero-shot learning by exploiting attributes to model’s relationship between visual features and labels. In this tutorial, we will use the OpenAI CLIP model to perform zero-shot image classification. The notebook contains the following steps: 1. Download the model. 2. Instantiate the PyTorch model. 3. Export the ONNX model and convert it to OpenVINO IR, using the Model Optimizer tool. 4. Run CLIP with OpenVINO.

Instantiate model

CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on various (image, text) pairs. It can be instructed in natural language to predict the most relevant text snippet, given an image, without directly optimizing for the task. CLIP uses a ViT like transformer to get visual features and a causal language model to get the text features. The text and visual features are then projected into a latent space with identical dimensions. The dot product between the projected image and text features is then used as a similarity score.

clip

clip

**image_source*

You can find more information about this model in the research paper, OpenAI blog, model card and GitHub repository.

In this notebook, we will use openai/clip-vit-base-patch16, available via Hugging Face Transformers, but the same steps are applicable for other CLIP family models.

First, we need to create CLIPModel class object and initialize it with model configuration and weights, using from_pretrained method. The model will be automatically downloaded from Hugging Face Hub and cached for the next usage. CLIPProcessor class is a wrapper for input data preprocessing. It includes both encoding the text using tokenizer and preparing the images.

from transformers import CLIPProcessor, CLIPModel

# load pre-trained model
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
# load preprocessor for model input
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")
2023-05-17 23:40:03.249418: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2023-05-17 23:40:03.281197: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-05-17 23:40:03.757457: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
from typing import List
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image


def visualize_result(image:Image, labels:List[str], probs:np.ndarray, top:int = 5):
    """
    Utility function for visualization classification results
    params:
      image: input image
      labels: list of classification labels
      probs: model predicted softmaxed probabilities for each label
      top: number of the highest probability results for visualization
    returns:
      None
    """
    plt.figure(figsize=(64, 64))
    top_labels = np.argsort(-probs)[:min(top, probs.shape[0])]
    top_probs = probs[top_labels]
    plt.subplot(8, 8, 1)
    plt.imshow(image)
    plt.axis("off")

    plt.subplot(8, 8, 2)
    y = np.arange(top_probs.shape[-1])
    plt.grid()
    plt.barh(y, top_probs)
    plt.gca().invert_yaxis()
    plt.gca().set_axisbelow(True)
    plt.yticks(y, [labels[index] for index in top_labels])
    plt.xlabel("probability")

Run PyTorch model inference

To perform classification, define labels and load an image in RGB format. To give the model wider text context and improve guidance, we extend the labels description using the template “This is a photo of a”. Both the list of label descriptions and image should be passed through the processor to obtain a dictionary with input data in the model-specific format. The model predicts an image-text similarity score in raw logits format, which can be normalized to the [0, 1] range using the softmax function. Then, we select labels with the highest similarity score for the final result.

image = Image.open('../data/image/coco.jpg')
input_labels = ['cat', 'dog', 'wolf', 'tiger', 'man', 'horse', 'frog', 'tree', 'house', 'computer']
text_descriptions = [f"This is a photo of a {label}" for label in input_labels]

inputs = processor(text=text_descriptions, images=[image], return_tensors="pt", padding=True)

results = model(**inputs)
logits_per_image = results['logits_per_image']  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1).detach().numpy()  # we can take the softmax to get the label probabilities
visualize_result(image, input_labels, probs[0])
../_images/228-clip-zero-shot-image-classification-with-output_5_0.png

Convert model to OpenVINO Intermediate Representation (IR) format.

For best results with OpenVINO, it is recommended to convert the model to OpenVINO IR format. OpenVINO supports PyTorch via ONNX conversion. The torch.onnx.export function enables conversion of PyTorch models to ONNX format. It requires to provide initialized model object, example of inputs for tracing and path for saving result. The model contains operations which supported for ONNX tracing starting with opset 14, it is recommended to use it as opset_version parameter. Besides that, we need to have opportunity to provide descriptions various of length and images with different sizes, for preserving this capability after ONNX conversion, dynamic_axes parameter can be used. More information about PyTorch to ONNX exporting can be found in this tutorial and PyTorch documentation. We will use mo.convert_model functionality to convert the ONNX model. The mo.convert_model Python function returns an OpenVINO model ready to load on the device and start making predictions. We can save it on disk for the next usage with openvino.runtime.serialize.

import torch

torch.onnx.export(
    model,  # model being run
    # model input in one of acceptable format: torch.Tensor (for single input), tuple or list of tensors for multiple inputs or dictionary with string keys and tensors as values.
    dict(inputs),
    "clip-vit-base-patch16.onnx",  # where to save the model
    opset_version=14,  # the ONNX version to export the model to
    input_names=["input_ids", "pixel_values", "attention_mask"],  # the model's input names
    output_names=["logits_per_image", "logits_per_text", "text_embeds", "image_embeds"],  # the model's output names
    dynamic_axes={  # variable length axes
        "input_ids": {0: "batch", 1: "sequence"},
        "pixel_values": {0: "batch", 1: "num_channels", 2: "height", 3: "width"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "logits_per_image": {0: "batch"},
        "logits_per_text": {0: "batch"},
        "text_embeds": {0: "batch"},
        "image_embeds": {0: "batch"}
    }
)
/opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-408/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/transformers/models/clip/modeling_clip.py:284: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if attn_weights.size() != (bsz * self.num_heads, tgt_len, src_len):
/opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-408/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/transformers/models/clip/modeling_clip.py:324: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim):
/opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-408/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/transformers/models/clip/modeling_clip.py:292: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if causal_attention_mask.size() != (bsz, 1, tgt_len, src_len):
/opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-408/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/transformers/models/clip/modeling_clip.py:301: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if attention_mask.size() != (bsz, 1, tgt_len, src_len):
/opt/home/k8sworker/cibuilds/ov-notebook/OVNotebookOps-408/.workspace/scm/ov-notebook/.venv/lib/python3.8/site-packages/torch/onnx/symbolic_opset9.py:5408: UserWarning: Exporting aten::index operator of advanced indexing in opset 14 is achieved by combination of multiple ONNX operators, including Reshape, Transpose, Concat, and Gather. If indices include negative values, the exported graph will produce incorrect results.
  warnings.warn(
from openvino.runtime import serialize
from openvino.tools import mo

ov_model = mo.convert_model('clip-vit-base-patch16.onnx', compress_to_fp16=True)
serialize(ov_model, 'clip-vit-base-patch16.xml')
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
    - Avoid using tokenizers before the fork if possible
    - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Warning: One or more of the values of the Constant can't fit in the float16 data type. Those values were casted to the nearest limit value, the model can produce incorrect results.
Warning: One or more of the values of the Constant can't fit in the float16 data type. Those values were casted to the nearest limit value, the model can produce incorrect results.

Run OpenVINO model

The steps for making predictions with the OpenVINO CLIP model are similar to the PyTorch model. Let us check the model result using the same input data from the example above with PyTorch.

import numpy as np
from scipy.special import softmax
from openvino.runtime import Core

# create OpenVINO core object instance
core = Core()
# compile model for loading on device
compiled_model = core.compile_model(ov_model)
# obtain output tensor for getting predictions
logits_per_image_out = compiled_model.output(0)
# run inference on preprocessed data and get image-text similarity score
ov_logits_per_image = compiled_model(dict(inputs))[logits_per_image_out]
# perform softmax on score
probs = softmax(ov_logits_per_image, axis=1)
# visualize prediction
visualize_result(image, input_labels, probs[0])
../_images/228-clip-zero-shot-image-classification-with-output_10_0.png

Great! Looks like we got the same result.

Now, it is your turn! You can provide your own image and comma-separated list of labels for zero-shot classification.

Feel free to upload an image, using the file upload window and type label names into the text field, using comma as the separator (for example, cat,dog,bird)

import ipywidgets as widgets
style = {'description_width': 'initial'}

image_widget = widgets.FileUpload(
    accept='',
    multiple=False,
    description='Upload image',
    style=style
)

labels_widget = widgets.Textarea(
    value='cat,dog,bird',
    placeholder='Type something',
    description='Enter your classes separated by ,:',
    disabled=False,
    style=style
)
widgets.VBox(children=[image_widget, labels_widget])
VBox(children=(FileUpload(value=(), description='Upload image'), Textarea(value='cat,dog,bird', description='E…

Run the next cell to get the result for your submitted data:

import io
# read uploaded image
image = Image.open(io.BytesIO(image_widget.value[-1]['content'])) if image_widget.value else image
# obtain list of labels
labels = labels_widget.value.split(',')
# convert labels to text description
text_descriptions = [f"This is a photo of a {label}" for label in labels]

# preprocess input
inputs = processor(text=text_descriptions, images=[image], return_tensors="np", padding=True)
# run inference
ov_logits_per_image = compiled_model(dict(inputs))[logits_per_image_out]
# perform softmax on score
probs = softmax(ov_logits_per_image, axis=1)
# visualize prediction
visualize_result(image, labels, probs[0])
../_images/228-clip-zero-shot-image-classification-with-output_15_0.png